1838 Commits

Author SHA1 Message Date
Alex Vandiver
662c946635 puppet: Stop using an unnecessary concat.
This was added in 6975417acf, to support `zmirror` deployments,
which are no longer necessary.
2025-08-29 09:45:28 -07:00
Alex Vandiver
72ea340f80 puppet: Skip client reloads during weekly restarts.
Weekly restarts do not deploy any new client code, so there is no
reason to push clients to reload.
2025-08-19 23:39:38 -07:00
Alex Vandiver
82b40ac6e2 puppet: Add an exporter to fetch Weblate stats, for graphing.
This is done synchronously, despite taking ~60s.  We can move it to a
background thread later if that's an issue, but generally Prometheus
is tolerant of exporters taking a while to come back with results.
2025-08-14 13:18:56 -07:00
Alex Vandiver
2909d4917c send_zulip_update_announcements: Log updates in the post-deploy hook.
2025-08-06 09:52:45 -07:00
Alex Vandiver
a5a5791794 management: Skip hourly crons during deploys.
This is most important for `send_zulip_update_announcements`, which
can race with the version run as a post-deploy hook.  However, all of
these crons can tolerate being slightly delayed, and there's little
benefit to them taking CPU or possibly hitting odd borderline race
conditions when the deploy is in progress.

For safety, we only trust the deploy lockfile if it was created
within the last hour -- deploys should not take more than an hour, and
failing to ever run hourly crons is much worse than perhaps running
them during a real very-long deploy.
2025-07-15 09:07:36 -07:00
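The lockfile freshness rule described in a5a5791794 can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `deploy_in_progress`, the constant `MAX_LOCK_AGE_SECONDS`, and the use of the file's modification time as its creation time are all assumptions.

```python
import os
import time
from typing import Optional

# Assumption: "created within the last hour" is approximated by the mtime.
MAX_LOCK_AGE_SECONDS = 60 * 60


def deploy_in_progress(lock_path: str, now: Optional[float] = None) -> bool:
    """Return True only if the lock exists and is less than an hour old."""
    if now is None:
        now = time.time()
    try:
        created = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        # No lock at all: no deploy is running, so crons may proceed.
        return False
    # A stale lock is ignored, so hourly crons are never blocked forever.
    return now - created < MAX_LOCK_AGE_SECONDS
```

An hourly cron would call this with the (hypothetical) lock path and exit early when it returns True.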
Alex Vandiver
43a7035745 docker: Start local mailserver on port 25.
2025-07-02 10:21:08 -07:00
Anders Kaseorg
d64ab7abf7 process_fts_update: Fix S608 Possible SQL injection vector.
Although this code was not actually vulnerable as written, we never
want to be disabling this Ruff rule, in order to discourage later
introduction of vulnerabilities.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-06-29 23:14:11 -07:00
Alex Vandiver
4f67c1dbf1 wal-g-exporter: Handle empty backup list, for new buckets.
2025-06-18 16:47:22 -07:00
Alex Vandiver
2b1e8c28b4 nagios: Make queue warnings message CZO.
2025-06-18 12:29:57 -07:00
Alex Vandiver
483ced0b15 puppet: Upgrade dependencies.
2025-06-18 11:44:32 -07:00
Alex Vandiver
dfadc9fa39 puppet: Remove zephyr configuration and hosts.
2025-06-13 16:54:26 -07:00
Tim Abbott
0ec07fe4c8 queue: Allow sharding user_activity worker.
This follows the existing patterns for the sharded mobile
notifications worker.
2025-06-06 10:33:20 -07:00
Alex Vandiver
1d72e38e62 process_exporter: Monitor the standalone email_server.
2025-05-20 07:03:23 -07:00
Alex Vandiver
c6e0f0b436 email-mirror: Remove HTTP interface.
2025-05-19 16:39:44 -07:00
Alex Vandiver
a6a5fc246a kandra: Swap in zulip::local_mailserver for postfix.
2025-05-19 16:39:44 -07:00
Alex Vandiver
1f0cfd4662 email-mirror: Add a standalone server that processes incoming email.
Using postfix to handle the incoming email gateway complicates things
a great deal:

- It cannot verify that incoming email addresses exist in Zulip before
  accepting them; it thus accepts mail at the `RCPT TO` stage which it
  cannot handle, and thus must reject after the `DATA`.

- It is built to handle both incoming and outgoing email, which
  results in subtle errors (1c17583ad5, 79931051bd, a53092687e,
  #18600).

- Rate-limiting happens much too late to avoid denial of
  service (#12501).

- Mis-configurations of the HTTP endpoint can break incoming
  mail (#18105).

Provide a replacement SMTP server which accepts incoming email on port
25, verifies that Zulip can accept the address, and that no
rate-limits are being broken, and then adds it directly to the
relevant queue.

Removes an incorrect comment which implied that missed-message
addresses were only usable once.  We leave rate-limiting applied only
to channel email addresses, since missed-message addresses are unlikely
to be placed into automated systems the way channel email addresses are.

Also simplifies #7814 somewhat.
2025-05-19 16:39:44 -07:00
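The benefit of checking addresses at the `RCPT TO` stage, as 1f0cfd4662 describes, can be illustrated with a small decision function. This is a sketch under stated assumptions, not the server added by the commit: `rcpt_to_response` and its two callback parameters are hypothetical, and the reply texts are illustrative (only the basic SMTP reply codes 250, 450, and 550 are standard).

```python
from typing import Callable


def rcpt_to_response(
    address: str,
    address_exists: Callable[[str], bool],  # hypothetical lookup into Zulip
    rate_limited: Callable[[str], bool],  # hypothetical rate-limit check
) -> str:
    """Decide the SMTP reply at RCPT TO, before any DATA is accepted."""
    if not address_exists(address):
        # Permanent failure: the sender learns immediately, not after DATA.
        return "550 No such address"
    if rate_limited(address):
        # Temporary failure: a well-behaved sender retries later.
        return "450 Rate limit exceeded; try again later"
    return "250 OK"
```

Rejecting here means the message body is never transferred for mail that cannot be handled.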
Alex Vandiver
f06253a62b puppet: Fix comment to point to the updated docs.
This was moved in 44c4b93f6e.
2025-05-19 12:58:51 -07:00
Alex Vandiver
cc665cb9b7 process_fts_updates: Update comments for TODO being actually done.
2025-05-19 12:58:51 -07:00
Alex Vandiver
09f1195fb5 process_fts_updates: Use 3 queries, instead of BATCH_SIZE + 2.
This provides at least a 10x speedup.

We could use a CTE to fetch from, and lock, `fts_update_log` at the
same time as performing the `zerver_message` update -- however, this
would mean that the number of returned rows would be less than
BATCH_SIZE if some rows in `zerver_message` no longer existed, causing
premature termination of the outer loop.

We lock the `zerver_message` rows in ascending ID order, to prevent
deadlocks.  We use `ctid` values[^1] to transfer the rows from the
inner query to the outer update; this is safe, as the rows are locked
from updates.

[^1]: https://www.postgresql.org/docs/current/ddl-system-columns.html#DDL-SYSTEM-COLUMNS-CTID
2025-05-19 12:58:51 -07:00
Alex Vandiver
f2f94d30fe process_fts_updates: Only DELETE if rows were processed.
2025-05-19 12:58:51 -07:00
Alex Vandiver
eeb485f894 process_fts_updates: After getting notifies, process them all.
Every insert into `fts_update_log` triggers a `NOTIFY`; processing a
batch of 1000 for every NOTIFY that we get thus results in a lot of
extra processing of no rows.

Instead, we clear all notifies, and repeat the pattern of processing
rows until the batch comes up short.  This may still result in extra
wake-ups, in the event that a new NOTIFY occurs after we clear them,
and that new row is processed by us in `update_all_rows`.  However,
these wake-ups are necessary for correctness, and this change will
result in drastically fewer extra checks.
2025-05-19 12:58:51 -07:00
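The wake-up pattern from eeb485f894 can be sketched without a database. The names `drain_and_process` and `process_batch` are hypothetical, and the list of pending notifies stands in for whatever the database driver queues. Only the control flow is meant to match the description: clear every notify, then process batches until one comes up short.

```python
from typing import Callable, List

BATCH_SIZE = 1000


def drain_and_process(
    pending_notifies: List[object],
    process_batch: Callable[[int], int],  # returns the number of rows handled
) -> int:
    """Clear all queued notifies, then loop until a batch comes up short."""
    # One wake-up consumes every queued NOTIFY, rather than one per NOTIFY.
    pending_notifies.clear()
    batches = 0
    while True:
        processed = process_batch(BATCH_SIZE)
        batches += 1
        if processed < BATCH_SIZE:
            # A short batch means the log is (momentarily) empty.
            return batches
```

With 2500 queued rows and 2500 notifies, this runs three batches in total, where one batch per notify would have run 2500.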
Alex Vandiver
fd898dafae process_fts_updates: Use row locking and transactions.
2025-05-19 12:58:51 -07:00
Alex Vandiver
7a8a8f5f23 puppet: Use Service for PostgreSQL restarts.
Using pg_ctlcluster leaves systemctl thinking the process aborted; and
not all instances (e.g. Docker) have systemctl.
2025-05-16 11:33:20 -07:00
Alex Vandiver
9def655564 puppet: Do not bother manually symlinking hunspell dictionaries.
This code dates back to 57b52310639a; however, this has been handled
by `postgresql-common` adding a post-install trigger to call
`pg_updatedicts` for each new PostgreSQL version, since
`postgresql-common` version 153 (February 2014).
2025-05-16 11:33:20 -07:00
Alex Vandiver
e13f82f048 upgrade-postgresql: Use tags to partially-apply configuration.
This uses the same technique used in 840884ec89, to only apply select
parts of the Puppet configuration.  This is more correct, and simpler,
than attempting to chop out some base puppet roles, and hack around
the `purge => true` supervisor.d configuration.
2025-05-16 11:33:20 -07:00
Alex Vandiver
47238ed74c kandra: Remove PostgreSQL upgrade key.
2025-05-14 09:16:39 -07:00
Alex Vandiver
1c8cfff0bd postgres: Add knobs for min_wal_size / max_wal_size.
2025-05-12 11:12:45 -07:00
Alex Vandiver
b4c8d2e6f5 postgres: Add knobs for parallelism control.
2025-05-12 11:12:45 -07:00
Alex Vandiver
933871e825 postgres: Double maintenance_work_mem to 1/8th, but cap at 2GiB.
This is based on annotated.conf[^1].

[^1]: c4ebbf1420/postgresql.10.simple.conf (L122-L133)
2025-05-12 11:12:45 -07:00
Alex Vandiver
8850c3f85c postgres: Tune effective_cache_size upwards, to 75%.
The previous 10/16ths is an odd value; various suggestions are around
75% of RAM, and in practice this is closer to what is observed.
2025-05-12 11:12:45 -07:00
Alex Vandiver
49c307e136 postgres: Remove explicit wal_buffers setting.
The value, if unset, defaults to[^1]:

> a size equal to 1/32nd (about 3%) of shared_buffers, but not less
> than 64kB nor more than the size of one WAL segment, typically 16MB

Letting it choose this, by default, but allowing an override via
`zulip.conf`, seems preferable to hard-coding an arbitrary value.

[^1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-BUFFERS
2025-05-12 11:12:45 -07:00
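The default quoted in 49c307e136 amounts to a clamp, which can be written out as a short calculation. The function name `default_wal_buffers` is hypothetical, and the sketch assumes the typical 16MB WAL segment size mentioned in the quote. It works in bytes and ignores PostgreSQL's internal block rounding.

```python
KB = 1024
MB = 1024 * KB
WAL_SEGMENT_SIZE = 16 * MB  # assumption: the "typically 16MB" segment size


def default_wal_buffers(shared_buffers_bytes: int) -> int:
    """1/32nd of shared_buffers, clamped to [64kB, one WAL segment]."""
    return max(64 * KB, min(shared_buffers_bytes // 32, WAL_SEGMENT_SIZE))
```

Under these assumptions, any `shared_buffers` of 512MB or more already reaches the 16MB ceiling.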
Alex Vandiver
45b6c5931a puppet: Remove unnecessary use of scope[...].
2025-05-12 11:12:45 -07:00
Alex Vandiver
a4ddc0a2d5 postgres: Remove lower-than-suggested checkpoint_completion_target.
This defaults to 0.9, and lowering it is not generally suggested[^1].

[^1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET
2025-05-12 11:12:45 -07:00
Alex Vandiver
f47d150fb5 puppet: Allow overrides of computed PostgreSQL config settings.
The four precomputed memory values gain overrides in `zulip.conf` with
the same names, for more precise manual tuning.
2025-05-12 11:12:45 -07:00
Alex Vandiver
5bc75a94be puppet: Base PostgreSQL tuning on if the frontend is on the same host.
The previous thresholds were set based on the assumption that 50% of
the memory on the host was taken up by Django and friends, leaving
only 50% for the database host.

On separate database hosts, this leads to a too-small default value
for shared_buffers, which is generally suggested to be 25% of your
database-dedicated memory.

Check if there is a `zulip::app_frontend_base` as part of the deploy,
and use that to decide whether to halve
`$zulip::common::total_memory_mb` for purposes of calculating
PostgreSQL memory settings.  We also provide a `postgresql.memory`
override, if necessary.
2025-05-12 11:12:45 -07:00
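The memory basis described in 5bc75a94be can be sketched as a small calculation. Both function names are hypothetical, the real logic lives in Puppet rather than Python, and the 25% figure is the general suggestion quoted in the message, not necessarily the exact formula Zulip applies.

```python
from typing import Optional


def postgresql_memory_mb(
    total_memory_mb: int,
    frontend_on_same_host: bool,
    override_mb: Optional[int] = None,  # stands in for `postgresql.memory`
) -> int:
    """Memory assumed available to PostgreSQL for tuning purposes."""
    if override_mb is not None:
        return override_mb
    # Halve only when Django and friends share the host with the database.
    return total_memory_mb // 2 if frontend_on_same_host else total_memory_mb


def suggested_shared_buffers_mb(database_memory_mb: int) -> int:
    """The commonly suggested 25% of database-dedicated memory."""
    return database_memory_mb // 4
```

On a 16GiB host, this yields 2GiB of `shared_buffers` when the frontend is co-located and 4GiB on a dedicated database host.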
Alex Vandiver
1b9a8dbc52 kandra: Allow PostgreSQL primary ssh access for PostgreSQL upgrade.
This allows scripting of the whole upgrade process; these grants are
temporary.
2025-05-09 10:05:19 -07:00
Alex Vandiver
a959f71208 kandra: Support installing multiple ssh keys from multiple secrets.
2025-05-09 10:05:19 -07:00
Alex Vandiver
614fdee5ca kandra: Add a pattern grouping for /api/v1/tus/.
We already had one for `/api/v1/tus/...`, but not the bare
`/api/v1/tus/` which is used to initiate the upload.
2025-05-05 12:13:16 -07:00
Alex Vandiver
21eff33875 puppet: Upgrade tusd to 2.8.0.
2025-04-11 12:13:07 -07:00
Alex Vandiver
309876796e kandra: Add a couple useful command-line tools.
2025-04-11 11:25:58 -07:00
Alex Vandiver
b11cbbab01 smokescreen: Move metrics port from the default 9810, to 4760.
This prevents errors if Smokescreen is running on a host with more
than 10 Tornado shards.
2025-04-11 10:29:26 -07:00
Alex Vandiver
5f783ed5ad nginx: Relay the same Host: header that nginx saw.
Unilaterally adding the port can cause CSRF failures when the port is
a default port, and thus optional.  Switch to providing the exact
`Host` header that the original request contained.
2025-04-09 10:44:12 -07:00
Anders Kaseorg
80b607c8cb install: Remove PostgreSQL 13 support.
PostgreSQL 13 reaches end of life on November 13, 2025, and Django 5.2
does not support it.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-04-07 17:41:55 -07:00
Anders Kaseorg
818742c62b install: Support PostgreSQL 17.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-04-07 16:42:19 -07:00
Alex Vandiver
e2e0c72a80 nginx: Serve full app from localhost.
Some deployments choose to wrap Zulip's nginx in an outer proxy -- for
example, to do custom TLS termination.  In such deployments, the outer
proxy is routing to `127.0.0.1:80`; b4fb22ba1b breaks these
configurations, as it switches the `127.0.0.1:80` listener to only
serving `/api/internal/` paths.

Switch to serving the whole application over `127.0.0.1:80`.
2025-04-04 10:58:09 -07:00
Alex Vandiver
d2772a0a38 postgresql: Remove inaccurate comment.
PostgreSQL 10 and later can adjust their TLS configuration without a
restart, with `SELECT pg_reload_conf()` or `pg_ctlcluster 16 main
reload`.
2025-04-02 13:07:31 -07:00
Alex Vandiver
9073ad3cc5 kandra: Add missing variable used in the template.
2025-04-02 09:55:21 -07:00
Alex Vandiver
258003a1c8 kandra: Extract the Teleport host CA to disk.
2025-04-02 09:55:21 -07:00
Alex Vandiver
23e8eb5c7c nginx: Use cache slicing to prevent thundering herds for video thumbs.
This prevents a thundering herd for videos -- if a very large video is
posted to a channel with many active clients, all of them
simultaneously request it, to provide the in-feed preview image.
While these requests come with a `Range` header which is intended to
limit the request to just the first couple MB, nginx ignores this
header when making its request to the upstream -- so it can obtain and
cache the whole file locally.  This results in multiple competing
requests for the whole content from S3, all racing to store the
content in the cache.

Use cache slicing to split the content cache into chunks of 5MB; the
cache is filled one slice at a time, as needed based on the byte
ranges that clients request.  Clients making requests without a
`Range` header are provided with the content transparently stitched
together from the individual slices.

The slice size of 5MB is chosen to encompass more than 95% of file
uploads (saving an extra trip to the origin) while also being large
enough to be able to provide video thumbnails in a single slice, as
well as not take too much time to obtain from the upstream.
2025-04-01 13:35:28 -07:00
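The slicing arithmetic behind 23e8eb5c7c can be illustrated as follows. This is a sketch of how a requested byte range maps onto fixed-size, boundary-aligned slices, not nginx's implementation. The function name `slices_for_range` is hypothetical, and it assumes inclusive byte offsets as in an HTTP `Range` header.

```python
from typing import List, Tuple

SLICE_SIZE = 5 * 1024 * 1024  # the 5MB slice size from the commit message


def slices_for_range(
    first_byte: int, last_byte: int, slice_size: int = SLICE_SIZE
) -> List[Tuple[int, int]]:
    """Return the inclusive (start, end) ranges of slices covering a request."""
    first_slice = first_byte // slice_size
    last_slice = last_byte // slice_size
    # Each slice is fetched and cached on its own, aligned to slice_size.
    return [
        (i * slice_size, (i + 1) * slice_size - 1)
        for i in range(first_slice, last_slice + 1)
    ]
```

A request for only the first couple of MB of a large video therefore touches a single slice, so the upstream is asked for at most 5MB instead of the whole file.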
Alex Vandiver
4e26705fbc nginx: Tell the backend service what port we listen on.
The `$host` nginx variable is _not_ the unadulterated `Host`
header (which would be `$http_host`) -- it is that header, *without
the port*, with a fallback to the `server_name` which processed the
request.

This means that backend services are not aware of the port that the
request came in on, unless they derive that from reading
`nginx_listen_port` in `/etc/zulip/zulip.conf`, or similar.
Specifically, this caused `tusd`, on deploys with non-standard
`nginx_listen_port`, to generate a `Location` header which left off
the port, and as such attempted a CORS check when retrieving metadata
about the just-uploaded file, which failed.

Add the port to the `Host` header we pass to `tusd` and other backend
services.
2025-04-01 09:28:18 -07:00
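The difference between `$host` and `$http_host` described in 4e26705fbc can be approximated in a few lines. This is a rough model rather than nginx's code: the function name `nginx_host_variable` is hypothetical, and it ignores the case where nginx takes the host name from the request line.

```python
from typing import Optional


def nginx_host_variable(http_host: Optional[str], server_name: str) -> str:
    """Approximate $host: the Host header minus its port, else server_name."""
    if not http_host:
        # No Host header: fall back to the server_name that matched.
        return server_name
    end = http_host.find("]")
    if http_host.startswith("[") and end != -1:
        # Bracketed IPv6 literal, e.g. "[::1]:8080".
        return http_host[: end + 1].lower()
    # Everything after the first colon is the port, which $host drops.
    return http_host.split(":", 1)[0].lower()
```

For a request sent to `zulip.example.com:8443` (an illustrative host name), the backend sees only `zulip.example.com` if it is handed `$host`, which is how the port goes missing from a generated `Location` header.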