97 Commits

Author SHA1 Message Date
Alex Vandiver
f91bbee395 prometheus: Move weblate metrics to pushgateway.
We leave the weblate metrics server and add a cron job to fetch its
contents and post them to the pushgateway.
2025-10-22 11:57:02 -07:00
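
The cron job described in f91bbee395 could look roughly like the sketch below. This is an illustration under stated assumptions, not the script from the commit: the exporter address, the pushgateway address, and the job name are hypothetical. Only the `/metrics/job/<job_name>` path layout comes from the Pushgateway's documented HTTP API.

```python
"""Relay one scrape of a metrics endpoint to a Prometheus pushgateway."""
import urllib.request

# Hypothetical addresses; neither appears in the commit message.
EXPORTER_URL = "http://localhost:9189/metrics"
PUSHGATEWAY_URL = "http://localhost:9091"


def push_url(base: str, job: str) -> str:
    # The pushgateway groups pushed metrics under /metrics/job/<job_name>.
    return f"{base.rstrip('/')}/metrics/job/{job}"


def relay(job: str = "weblate") -> None:
    # Fetch the exporter's text-format output; allow for a slow exporter.
    with urllib.request.urlopen(EXPORTER_URL, timeout=120) as response:
        body = response.read()
    # PUT replaces every metric previously pushed for this job's group.
    request = urllib.request.Request(
        push_url(PUSHGATEWAY_URL, job), data=body, method="PUT"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        response.read()
```

A cron entry would call `relay()` on whatever schedule suits the slow job. Prometheus then scrapes the pushgateway at its normal interval, so the cached metrics do not go stale between runs.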
Alex Vandiver
25d156cc9c prometheus: Add a pushgateway.
Prometheus declares metrics to be "stale" if they go 5min without an
update[^1], and removes them -- which makes longer scrape intervals
impossible.

Add a pushgateway, which caches output from such jobs.

[^1]: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
2025-10-22 11:57:02 -07:00
Alex Vandiver
d87d307e53 kandra: Skip Grafana JWT role sync; we set roles in the UI. 2025-10-22 13:24:12 +00:00
Alex Vandiver
b61b495112 puppet: Update dependencies. 2025-10-21 08:28:18 -07:00
Alex Vandiver
456354575e kandra: Add a mirror of #signups to CZO. 2025-10-03 10:47:16 -07:00
Anders Kaseorg
40a022dcc3 zephyr: Remove Zephyr mirroring support.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-09-16 11:18:18 -07:00
Alex Vandiver
82b40ac6e2 puppet: Add an exporter to fetch Weblate stats, for graphing.
This is done synchronously, despite taking ~60s.  We can move it to a
background thread later if that's an issue, but generally Prometheus
is tolerant of exporters taking a while to come back with results.
2025-08-14 13:18:56 -07:00
Alex Vandiver
2b1e8c28b4 nagios: Make queue warnings message CZO. 2025-06-18 12:29:57 -07:00
Alex Vandiver
dfadc9fa39 puppet: Remove zephyr configuration and hosts. 2025-06-13 16:54:26 -07:00
Alex Vandiver
1d72e38e62 process_exporter: Monitor the standalone email_server. 2025-05-20 07:03:23 -07:00
Alex Vandiver
a6a5fc246a kandra: Swap in zulip::local_mailserver for postfix. 2025-05-19 16:39:44 -07:00
Alex Vandiver
7a8a8f5f23 puppet: Use Service for PostgreSQL restarts.
Using pg_ctlcluster leaves systemctl thinking the process aborted; and
not all instances (e.g. Docker) have systemctl.
2025-05-16 11:33:20 -07:00
Alex Vandiver
47238ed74c kandra: Remove PostgreSQL upgrade key. 2025-05-14 09:16:39 -07:00
Alex Vandiver
1b9a8dbc52 kandra: Allow PostgreSQL primary ssh access for PostgreSQL upgrade.
This allows scripting of the whole upgrade process; these grants are
temporary.
2025-05-09 10:05:19 -07:00
Alex Vandiver
a959f71208 kandra: Support installing multiple ssh keys from multiple secrets. 2025-05-09 10:05:19 -07:00
Alex Vandiver
614fdee5ca kandra: Add a pattern grouping for /api/v1/tus/.
We already had one for `/api/v1/tus/...`, but not the bare
`/api/v1/tus/` which is used to initiate the upload.
2025-05-05 12:13:16 -07:00
Alex Vandiver
309876796e kandra: Add a couple useful command-line tools. 2025-04-11 11:25:58 -07:00
Alex Vandiver
b11cbbab01 smokescreen: Move metrics port from the default 9810, to 4760.
This prevents errors if Smokescreen is running on a host with more
than 10 Tornado shards.
2025-04-11 10:29:26 -07:00
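
The port conflict behind b11cbbab01 comes down to simple arithmetic. The sketch below assumes that Tornado shards listen on consecutive ports starting at 9800; that base port is an assumption made for illustration and is not stated in the commit message.

```python
TORNADO_BASE_PORT = 9800  # assumed first Tornado shard port
OLD_SMOKESCREEN_METRICS_PORT = 9810
NEW_SMOKESCREEN_METRICS_PORT = 4760


def tornado_ports(shards: int) -> range:
    # One port per shard, counting up from the base port.
    return range(TORNADO_BASE_PORT, TORNADO_BASE_PORT + shards)


def conflicts(metrics_port: int, shards: int) -> bool:
    # True if the metrics port falls inside the range used by the shards.
    return metrics_port in tornado_ports(shards)
```

Under that assumption, 10 shards occupy ports 9800 through 9809 and leave 9810 free, while an 11th shard would claim 9810 itself. Port 4760 sits well outside the range.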
Alex Vandiver
9073ad3cc5 kandra: Add missing variable used in the template. 2025-04-02 09:55:21 -07:00
Alex Vandiver
258003a1c8 kandra: Extract the Teleport host CA to disk. 2025-04-02 09:55:21 -07:00
Alex Vandiver
5bda1ac521 kandra: Include the new grok monitoring exporter for all frontend hosts. 2025-03-28 10:50:03 -04:00
Alex Vandiver
840fa74854 kandra: Add a grok exporter to parse nginx logfiles.
This provides access logging metrics to Prometheus.  For cardinality
reasons, we cannot (nor would we want to) put every request path into
its own label value -- but we do separate out the most-frequent access
paths (as well as some low-frequency but high-interest ones) into
their own label values.

In order to differentiate accesses to https://zulip.com/ from
https://example.zulipchat.com/ (both of which appear at path `/`), we
use a `grok_exporter.realm_names_regex` value in `zulip.conf`, which
is expected to be set to match the hostname of all possible realms.
2025-03-27 21:56:54 -07:00
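
The label-bucketing idea in 840fa74854 can be illustrated as follows. This is a rough Python sketch, not the grok exporter configuration from the commit: the regex, the list of tracked paths, and the label names are all hypothetical stand-ins.

```python
import re

# Hypothetical stand-in for the grok_exporter.realm_names_regex setting.
REALM_NAMES_REGEX = re.compile(r"[a-z0-9-]+\.zulipchat\.com")

# Hypothetical shortlist of paths that get their own label value.
TRACKED_PATHS = {"/", "/json/messages", "/api/v1/events"}


def path_label(host: str, path: str) -> str:
    # Everything outside the shortlist shares one label value, which keeps
    # the number of distinct label values (the cardinality) bounded.
    if path not in TRACKED_PATHS:
        return "other"
    # "/" means different things on a realm host and on the main site,
    # so the hostname decides which label value it gets.
    if path == "/":
        return "/ (realm)" if REALM_NAMES_REGEX.fullmatch(host) else "/ (portico)"
    return path
```

With these assumed values, `path_label("zulip.com", "/")` and `path_label("example.zulipchat.com", "/")` land in different buckets even though the request path is identical.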
Alex Vandiver
bd54f0363e kandra: Update prometheus configuration.
This pulls in the more complete production Prometheus configuration.
2025-03-27 21:56:54 -07:00
Alex Vandiver
b0292ffe37 puppet: Exclude pgroonga from unattended-upgrades. 2025-03-24 11:42:50 -07:00
Alex Vandiver
eca77631b4 kandra: Rename onboarding-video to navigation-tour-video. 2025-03-13 09:54:44 -07:00
Alex Vandiver
fa3983ac46 kandra: Add /static/onboarding-video/ static content. 2025-03-05 22:37:19 -08:00
Alex Vandiver
412826e08b kandra: Add zulip-notify hooks for CZO. 2025-03-05 13:57:10 -08:00
Alex Vandiver
c5200e8b05 deliver_scheduled_emails: Use a queue, instead of infinite retries.
`deliver_scheduled_emails` tries to deliver the email synchronously,
and if it fails, it retries after 10 seconds.  Since it does not track
retries, and always tries the earliest-scheduled-but-due message
first, the worker will not make forward progress if there is a
persistent failure with that message, and will retry indefinitely.
This can result in excessive network or email delivery charges from
the remote SMTP server.

Switch to delivering emails via a new queue worker.  The
`deliver_scheduled_emails` job now serves only to pull deferred jobs
out of the table once they are due, insert them into RabbitMQ, and
then delete them.  This limits the potential for head-of-queue
failures to failures inserting into RabbitMQ, which is more reasonable
than failures speaking to a complex external system we do not control.
Retries and any connections to the SMTP server are left to the
RabbitMQ consumer.

We build a new RabbitMQ queue, rather than use the existing
`email_senders` queue, because that queue is expected to be reasonably
low-latency, for things like missed message notifications.  The
`send_future_email` codepath which inserts into ScheduledEmails is
also (ab)used to send digest emails, which are extremely bursty in their
frequency -- and a large burst could significantly delay emails behind
it in the queue.

The new queue is explicitly only for messages which were not initiated
by user actions (e.g., invitation reminders, digests, new account
follow-ups) which are thus not latency-sensitive.

Fixes: #32463.
2025-03-04 16:09:25 -08:00
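
The hand-off described in c5200e8b05 can be sketched with in-memory stand-ins. This is an illustration only: the `ScheduledEmail` dataclass, the list standing in for the database table, and the `queue.Queue` standing in for RabbitMQ are hypothetical and are not the actual models or queue client.

```python
from dataclasses import dataclass
from datetime import datetime
from queue import Queue


@dataclass
class ScheduledEmail:
    id: int
    scheduled_timestamp: datetime
    payload: dict


def enqueue_due_emails(table: list[ScheduledEmail], queue: Queue, now: datetime) -> int:
    """Move every due row from the table into the queue; return the count."""
    due = sorted(
        (email for email in table if email.scheduled_timestamp <= now),
        key=lambda email: email.scheduled_timestamp,
    )
    for email in due:
        # Hand the job off; SMTP delivery and retries belong to the consumer.
        queue.put(email.payload)
        # Delete the row only after the insert into the queue has succeeded.
        table.remove(email)
    return len(due)
```

The only failure this loop can hit is the insert into the queue, so a message that the SMTP server keeps rejecting no longer blocks the rows scheduled after it.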
Alex Vandiver
c84c76f3a1 puppet: Upgrade external dependencies. 2025-02-21 14:38:31 -08:00
Alex Vandiver
3e421d71ba kandra: Update Teleport CA. 2025-02-21 10:16:33 -08:00
Alex Vandiver
ce34ebdfed kandra: Add a hook to notify of local schema drift from merge-base. 2025-02-19 12:43:00 -08:00
Alex Vandiver
af4fa75b66 puppet: Upgrade version of aws tool. 2025-01-28 17:43:08 -08:00
Anders Kaseorg
19b8cde27f ruff: Fix PLC0206 Extracting value from dictionary without calling .items().
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-12-21 21:06:53 -08:00
Alex Vandiver
f325e15439 nagios: Switch staging hosts to not page, but send a zulip. 2024-10-08 16:55:07 -07:00
Alex Vandiver
87ee167726 puppet: Allow tusd to be exposed on non-localhost.
This allows its /metrics endpoint to be monitored.
2024-09-27 15:06:47 -07:00
Alex Vandiver
0c7d83f7da kandra: Use vector to plumb SES logs into S3. 2024-09-26 11:19:45 -07:00
Alex Vandiver
60759ab5fb kandra: Use generic "vector" process, not dedicated "akamai" process.
This makes the Vector configuration extensible, to allow it to be used
not just for ingesting Akamai logs.
2024-09-26 11:19:45 -07:00
Alex Vandiver
77a121082b kandra: Add localhost access to internal APIs on port 80.
This parallels 02d3fb7666.
2024-09-25 10:08:27 -07:00
Alex Vandiver
64a16dd9b4 kandra: We do not serve staging from staging.zulip.com:80.
It is not configured in the ALB's rules, nor does the ALB ever forward
to port 80.
2024-09-09 15:17:19 -07:00
Alex Vandiver
6cc480e2b5 kandra: Ensure that we do not error out creating standby.signal.
Force us to have restored the `/var/lib/postgresql/14/main` directory
before we try to create `standby.signal` in it.
2024-08-20 14:03:51 -07:00
Alex Vandiver
49e5722a9f kandra: Match the logic in zulip::profile::postgresql. 2024-08-20 14:03:51 -07:00
Alex Vandiver
9e58780d46 puppet: Fix indentation. 2024-08-20 14:03:51 -07:00
Alex Vandiver
c0237e3d58 kandra: Correct group ownership and permissions to match package's.
This reverts b000328ba5 and re-applies
a759d26a32.  It is likely that the
upstream packages are simply inconsistent.
2024-08-20 10:04:40 -07:00
Alex Vandiver
fe0a7b3b77 kandra: Drop support for 20.04 hosts. 2024-08-20 10:03:59 -07:00
Mateusz Mandera
aaca394813 presence: Remove the queue worker. 2024-07-31 16:46:42 -07:00
Alex Vandiver
bf807eabd7 nagios: Make zephyr mirroring not page.
It is often flaky for reasons we cannot affect, and this is no longer
an important part of the Zulip Cloud service.
2024-07-26 14:24:50 -07:00
Anders Kaseorg
f3c67103cf memcached_exporter: Explicitly close memcached connection.
Fixes warnings like “ResourceWarning: unclosed <socket.socket fd=5,
family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0,
laddr=('127.0.0.1', 39702), raddr=('127.0.0.1', 11211)>” with warnings
enabled.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-16 14:50:09 -07:00
Alex Vandiver
2e38f426f4 upload: Generate thumbnails when images are uploaded.
A new table is created to track which path_id attachments are images,
and for those their metadata, and which thumbnails have been created.
Using path_id as the effective primary key lets us ignore whether the
attachment is archived, saving some foreign key messes.

A new worker is added to observe events when rows are added to this
table, and to generate and store thumbnails for those images in
differing sizes and formats.
2024-07-16 13:22:15 -07:00
Anders Kaseorg
0fa5e7f629 ruff: Fix UP035 Import from collections.abc, typing instead.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Anders Kaseorg
531b34cb4c ruff: Fix UP007 Use X | Y for type annotations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00