Commit Graph

1855 Commits

Author SHA1 Message Date
Alex Vandiver
25d156cc9c prometheus: Add a pushgateway.
Prometheus declares metrics to be "stale" if they go 5min without an
update[^1], and removes them -- which makes longer scrape intervals
impossible.

Add a pushgateway, which caches output from such jobs.

[^1]: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
2025-10-22 11:57:02 -07:00
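As a rough illustration of the staleness rule described in 25d156cc9c, the Python sketch below checks whether a job's update interval would outlast the 5-minute window. The constant and function names are hypothetical, and the code models only the rule as stated in the commit message, not Prometheus internals.

```python
# Hypothetical sketch: these names are illustrative, not Prometheus or Zulip APIs.
STALENESS_WINDOW_SECONDS = 5 * 60  # the 5min window cited in the commit message


def is_stale(last_update_at: float, now: float) -> bool:
    """Return True if a series has gone longer than the window without an update."""
    return now - last_update_at > STALENESS_WINDOW_SECONDS


def needs_cached_output(update_interval_seconds: float) -> bool:
    """Return True if a job updates too rarely to stay fresh when scraped directly."""
    return update_interval_seconds > STALENESS_WINDOW_SECONDS
```

Under these assumptions, a job that reports hourly would need its output cached (for example, by a pushgateway), while one that reports every minute would not.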
Alex Vandiver
d87d307e53 kandra: Skip Grafana JWT role sync; we set roles in the UI. 2025-10-22 13:24:12 +00:00
Alex Vandiver
b61b495112 puppet: Update dependencies. 2025-10-21 08:28:18 -07:00
Alex Vandiver
6249c2f970 sharding: Make nginx shard map owned by zulip:zulip.
This will make it easier to manipulate in restart-server.
2025-10-07 13:19:05 -07:00
Shubham Padia
9fa09f8f87 help: Make canonical URL for the root to be /help/.
We also make sure that /help/ is accessible when using the dev server.

We add vite as a dependency. v7 of vite is the latest version but using
that causes type problems for PluginOptions since astro uses vite v6.
Therefore we pin vite to v6 for now.

This commit also runs `pnpm dedupe` for `tinyglobby`, pinning it to
0.2.15 in the lockfile instead of the previous 0.2.14.
2025-10-06 22:21:39 -07:00
Alex Vandiver
456354575e kandra: Add a mirror of #signups to CZO. 2025-10-03 10:47:16 -07:00
Alex Vandiver
85d50d09d4 postgresql: Default random_page_cost to 1.1.
The upstream PostgreSQL default is 4, which is more appropriate for
spinning disks.  In general, production deploys almost always use
SSDs; adjust the Zulip default value to a better value for those.
2025-09-25 13:40:02 -07:00
Nikita Tarasov
c75535a154 puppet: Use memory limits from cgroup, if set.
This fixes behaviour where a Docker image with a memory limit would
immediately OOM, as it would not detect the container's memory limit,
only the host's.

Fixes: #29794
2025-09-18 16:38:37 -04:00
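The idea behind c75535a154 can be sketched in Python as follows. This is not the Puppet implementation, which is not shown here; the function names are hypothetical. It assumes the cgroup v2 convention that the `memory.max` file contains either a byte count or the literal `max` when no limit is set.

```python
from typing import Optional


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Parse the contents of a cgroup memory limit file.

    Returns None when no limit is set ("max" in cgroup v2, or an empty file).
    """
    value = raw.strip()
    if value in ("", "max"):
        return None
    return int(value)


def effective_memory_bytes(host_total: int, cgroup_raw: Optional[str]) -> int:
    """Pick the memory size to tune against: the cgroup limit, if set and smaller."""
    if cgroup_raw is None:  # hypothetical: the caller could not read a cgroup file
        return host_total
    limit = parse_cgroup_limit(cgroup_raw)
    if limit is None:
        return host_total
    return min(host_total, limit)
```

With a 16 GiB host and a 2 GiB container limit, this sketch returns 2 GiB, so memory-derived settings would be sized for the container instead of the host.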
Anders Kaseorg
40a022dcc3 zephyr: Remove Zephyr mirroring support.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-09-16 11:18:18 -07:00
Shubham Padia
57b7ada2e4 help: Pre-compress astro build files before serving.
This way nginx doesn't have to compress them every time.
2025-09-11 10:50:10 -04:00
Shubham Padia
1a34016e4a help: Cache pagefind hashed files aggressively.
Pagefind files with the extensions `pf_fragment`, `pf_mindex` and
`pf_meta` are hashed in pagefind and thus we can cache them aggressively.

Other files in pagefind are not hashed, although they only change
between versions; we can think about a caching strategy for them
later.
2025-09-11 10:50:10 -04:00
Shubham Padia
e0811dc2a4 help: No URL should have a trailing slash.
Pages with trailing slash will be redirected to ones without a trailing
slash.
2025-09-11 10:50:10 -04:00
Shubham Padia
55b19b6383 help: Move /help/_astro/ cache headers config inside /help block.
Having a separate block was redundant since moving /help/_astro/ inside
location /help means that it can inherit its alias and we don't have to
re-declare it.
2025-09-11 10:50:10 -04:00
Alex Vandiver
4739c4b057 puppet: Add options for allow/deny rules in Smokescreen.
Fixes: #20490
2025-09-10 17:37:49 -07:00
Alex Vandiver
94a0f02acc puppet: Add a daily cron job to re-sync subscriber counts.
Default installs will re-examine every stream, since the process is
relatively fast.  Large installs have the option to only examine
streams which have the potential to have changed subscriber counts.
2025-09-10 12:07:42 -07:00
Shubham Padia
7db29f29de help: Serve help center build without relative links for zulip.com.
We have a copy of help center with relative links disabled which is
reserved for root domains without an organisation on the root domain.
Ideally, we should have some logic to determine whether we are on such
a root domain or not. For practical short term purposes, since this
type of documentation is mainly useful for zulip.com, we add an
exception for zulip.com.
2025-09-03 09:28:15 -07:00
Shubham Padia
246fbf6d29 help: Add production config for the new help center.
Fixes #35131.

Most of the nginx config is copied from
https://docs.astro.build/en/recipes/docker/#nginx.

Ideally, we should generate two builds for Zulip Cloud and serve a
different one depending on whether the subdomain is the same as the root
domain. We can look into this as an immediate followup, while this
commit helps things get ready for a test deploy on CZO.
2025-09-03 09:28:15 -07:00
Alex Vandiver
662c946635 puppet: Stop using an unnecessary concat.
This was added in 6975417acf, to support `zmirror` deployments,
which are no longer necessary.
2025-08-29 09:45:28 -07:00
Alex Vandiver
72ea340f80 puppet: Skip client reloads during weekly restarts.
Weekly restarts do not deploy any new client code, so there is no
reason to push clients to reload.
2025-08-19 23:39:38 -07:00
Alex Vandiver
82b40ac6e2 puppet: Add an exporter to fetch Weblate stats, for graphing.
This is done synchronously, despite taking ~60s.  We can move it to a
background thread later if that's an issue, but generally Prometheus
is tolerant of exporters taking a while to come back with results.
2025-08-14 13:18:56 -07:00
Alex Vandiver
2909d4917c send_zulip_update_announcements: Log updates in the post-deploy hook. 2025-08-06 09:52:45 -07:00
Alex Vandiver
a5a5791794 management: Skip hourly crons during deploys.
This is most important for `send_zulip_update_announcements`, which
can race with the version run as a post-deploy hook.  However, all of
these crons can tolerate being slightly delayed, and there's little
benefit to them taking CPU or possibly hitting odd borderline race
conditions when the deploy is in progress.

For safety, we only trust the deploy lockfile if it was created
within the last hour -- deploys should not take more than an hour, and
failing to ever run hourly crons is much worse than perhaps running
them during a real very-long deploy.
2025-07-15 09:07:36 -07:00
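The lockfile freshness rule from a5a5791794 might look roughly like this in Python. The function names and the use of the file's modification time as a stand-in for its creation time are assumptions for illustration; the actual lock path and check are not given in the commit message.

```python
import os
import time
from typing import Optional

MAX_LOCK_AGE_SECONDS = 60 * 60  # only trust locks created within the last hour


def lock_is_trusted(lock_created_at: Optional[float], now: float) -> bool:
    """Return True if a deploy lock exists and is recent enough to believe."""
    if lock_created_at is None:
        return False
    return 0 <= now - lock_created_at < MAX_LOCK_AGE_SECONDS


def deploy_in_progress(lock_path: str, now: Optional[float] = None) -> bool:
    """Check a (hypothetical) lock path; a missing or old lock means 'run the cron'."""
    if now is None:
        now = time.time()
    try:
        # Assumption: mtime approximates when the lock was created.
        created_at: Optional[float] = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        created_at = None
    return lock_is_trusted(created_at, now)
```

An hourly cron would skip its run only when `deploy_in_progress(...)` returns True, so a stale lock left behind by a crashed deploy cannot block the crons for more than an hour.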
Alex Vandiver
43a7035745 docker: Start local mailserver on port 25. 2025-07-02 10:21:08 -07:00
Anders Kaseorg
d64ab7abf7 process_fts_update: Fix S608 Possible SQL injection vector.
Although this code was not actually vulnerable as written, we never
want to be disabling this Ruff rule, in order to discourage later
introduction of vulnerabilities.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-06-29 23:14:11 -07:00
Alex Vandiver
4f67c1dbf1 wal-g-exporter: Handle empty backup list, for new buckets. 2025-06-18 16:47:22 -07:00
Alex Vandiver
2b1e8c28b4 nagios: Make queue warnings message CZO. 2025-06-18 12:29:57 -07:00
Alex Vandiver
483ced0b15 puppet: Upgrade dependencies. 2025-06-18 11:44:32 -07:00
Alex Vandiver
dfadc9fa39 puppet: Remove zephyr configuration and hosts. 2025-06-13 16:54:26 -07:00
Tim Abbott
0ec07fe4c8 queue: Allow sharding user_activity worker.
This follows the existing patterns for the sharded mobile
notifications worker.
2025-06-06 10:33:20 -07:00
Alex Vandiver
1d72e38e62 process_exporter: Monitor the standalone email_server. 2025-05-20 07:03:23 -07:00
Alex Vandiver
c6e0f0b436 email-mirror: Remove HTTP interface. 2025-05-19 16:39:44 -07:00
Alex Vandiver
a6a5fc246a kandra: Swap in zulip::local_mailserver for postfix. 2025-05-19 16:39:44 -07:00
Alex Vandiver
1f0cfd4662 email-mirror: Add a standalone server that processes incoming email.
Using postfix to handle the incoming email gateway complicates things
a great deal:

- It cannot verify that incoming email addresses exist in Zulip before
  accepting them; it thus accepts mail at the `RCPT TO` stage which it
  cannot handle, and thus must reject after the `DATA`.

- It is built to handle both incoming and outgoing email, which
  results in subtle errors (1c17583ad5, 79931051bd, a53092687e,
  #18600).

- Rate-limiting happens much too late to avoid denial of
  service (#12501).

- Mis-configurations of the HTTP endpoint can break incoming
  mail (#18105).

Provide a replacement SMTP server which accepts incoming email on port
25, verifies that Zulip can accept the address, and that no
rate-limits are being broken, and then adds it directly to the
relevant queue.

Removes an incorrect comment which implied that missed-message
addresses were only usable once.  We leave rate-limiting to only
channel email addresses, since missed-message addresses are unlikely
to be placed into automated systems, as channel email addresses are.

Also simplifies #7814 somewhat.
2025-05-19 16:39:44 -07:00
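The recipient check that 1f0cfd4662 describes (validate the address, then the rate limit, before accepting the message) can be sketched as a pure Python function. Everything here is hypothetical: the function name, the way known addresses are supplied, and the specific reply codes, which are standard SMTP codes chosen for illustration and not necessarily the ones the real server returns.

```python
from typing import Callable, Set, Tuple

# Illustrative SMTP replies; the real server's codes are not given in the source.
ACCEPT: Tuple[int, str] = (250, "OK")
UNKNOWN_ADDRESS: Tuple[int, str] = (550, "Address not recognized")
RATE_LIMITED: Tuple[int, str] = (451, "Rate limit exceeded, try again later")


def check_recipient(
    address: str,
    known_addresses: Set[str],
    is_rate_limited: Callable[[str], bool],
) -> Tuple[int, str]:
    """Decide the reply at the RCPT TO stage, before any message data is sent."""
    if address not in known_addresses:
        return UNKNOWN_ADDRESS
    if is_rate_limited(address):
        return RATE_LIMITED
    return ACCEPT
```

Rejecting at this stage means the sender learns about a bad address or an exceeded limit before transmitting the message body, which is the behavior the commit message says postfix could not provide.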
Alex Vandiver
f06253a62b puppet: Fix comment to point to the updated docs.
This was moved in 44c4b93f6e.
2025-05-19 12:58:51 -07:00
Alex Vandiver
cc665cb9b7 process_fts_updates: Update comments for TODO being actually done. 2025-05-19 12:58:51 -07:00
Alex Vandiver
09f1195fb5 process_fts_updates: Use 3 queries, instead of BATCH_SIZE + 2.
This provides at least a 10x speedup.

We could use a CTE to fetch from, and lock, `fts_update_log` at the
same time as performing the `zerver_message` update -- however, this
would mean that the number of returned rows would be less than
BATCH_SIZE if some rows in `zerver_message` no longer existed, causing
premature termination of the outer loop.

We lock the `zerver_message` rows in ascending ID order, to prevent
deadlocks.  We use `ctid` values[^1] to transfer the rows from the
inner query to the outer update; this is safe, as the rows are locked
from updates.

[^1]: https://www.postgresql.org/docs/current/ddl-system-columns.html#DDL-SYSTEM-COLUMNS-CTID
2025-05-19 12:58:51 -07:00
Alex Vandiver
f2f94d30fe process_fts_updates: Only DELETE if rows were processed. 2025-05-19 12:58:51 -07:00
Alex Vandiver
eeb485f894 process_fts_updates: After getting notifies, process them all.
Every insert into `fts_update_log` triggers a `NOTIFY`; processing a
batch of 1000 for every NOTIFY that we get thus results in a lot of
extra processing of no rows.

Instead, we clear all notifies, and repeat the pattern of processing
rows until the batch comes up short.  This may still result in extra
wake-ups, in the event that a new NOTIFY occurs after we clear them,
and that new row is processed by us in `update_all_rows`.  However,
these wake-ups are necessary for correctness, and this change will
result in drastically fewer extra checks.
2025-05-19 12:58:51 -07:00
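The "process until the batch comes up short" pattern from eeb485f894 can be sketched as a small Python loop. The `process_batch` callable is a hypothetical stand-in for the real database work; only the batch size of 1000 comes from the commit message.

```python
from typing import Callable

BATCH_SIZE = 1000  # batch size mentioned in the commit message


def process_until_short(
    process_batch: Callable[[int], int], batch_size: int = BATCH_SIZE
) -> int:
    """Call process_batch repeatedly until it handles fewer rows than a full batch.

    process_batch(batch_size) is assumed to return the number of rows it processed.
    Returns the total number of rows processed across all batches.
    """
    total = 0
    while True:
        processed = process_batch(batch_size)
        total += processed
        if processed < batch_size:
            # A short batch means the log is drained for now; wait for the next wake-up.
            return total
```

After a wake-up, one call to `process_until_short` drains everything pending, so 2500 queued rows take three batches (1000, 1000, 500) regardless of how many notifications arrived in the meantime.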
Alex Vandiver
fd898dafae process_fts_updates: Use row locking and transactions. 2025-05-19 12:58:51 -07:00
Alex Vandiver
7a8a8f5f23 puppet: Use Service for PostgreSQL restarts.
Using pg_ctlcluster leaves systemctl thinking the process aborted; and
not all instances (e.g. Docker) have systemctl.
2025-05-16 11:33:20 -07:00
Alex Vandiver
9def655564 puppet: Do not bother manually symlinking hunspell dictionaries.
This code dates back to 57b52310639a; however, this has been handled
by `postgresql-common` adding a post-install trigger to call
`pg_updatedicts` for each new PostgreSQL version, since
`postgresql-common` version 153 (February 2014).
2025-05-16 11:33:20 -07:00
Alex Vandiver
e13f82f048 upgrade-postgresql: Use tags to partially-apply configuration.
This uses the same technique used in 840884ec89, to only apply select
parts of the Puppet configuration.  This is more correct, and simpler,
than attempting to chop out some base puppet roles, and hack around
the `purge => true` supervisor.d configuration.
2025-05-16 11:33:20 -07:00
Alex Vandiver
47238ed74c kandra: Remove PostgreSQL upgrade key. 2025-05-14 09:16:39 -07:00
Alex Vandiver
1c8cfff0bd postgres: Add knobs for min_wal_size / max_wal_size. 2025-05-12 11:12:45 -07:00
Alex Vandiver
b4c8d2e6f5 postgres: Add knobs for parallelism control. 2025-05-12 11:12:45 -07:00
Alex Vandiver
933871e825 postgres: Double maintenance_work_mem to 1/8th, but cap at 2GiB.
This is based on annotated.conf[^1].

[^1]: c4ebbf1420/postgresql.10.simple.conf (L122-L133)
2025-05-12 11:12:45 -07:00
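The sizing rule in 933871e825 reduces to a one-line calculation. The Python sketch below assumes that "1/8th" refers to total system memory, which the commit subject does not state explicitly; the function name is hypothetical.

```python
GIB = 1024 ** 3


def maintenance_work_mem_bytes(total_ram_bytes: int) -> int:
    """One eighth of RAM (an assumption about the base), capped at 2GiB."""
    return min(total_ram_bytes // 8, 2 * GIB)
```

Under that assumption, an 8 GiB host gets 1 GiB, and any host with 16 GiB or more hits the 2 GiB cap.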
Alex Vandiver
8850c3f85c postgres: Tune effective_cache_size upwards, to 75%.
The previous 10/16ths is an odd value; various suggestions are around
75% of RAM, and in practice this is closer to what is observed.
2025-05-12 11:12:45 -07:00
Alex Vandiver
49c307e136 postgres: Remove explicit wal_buffers setting.
The value, if unset, defaults to[^1]:

> a size equal to 1/32nd (about 3%) of shared_buffers, but not less
> than 64kB nor more than the size of one WAL segment, typically 16MB

Letting it choose this, by default, but allowing an override via
`zulip.conf`, seems preferable to hard-coding an arbitrary value.

[^1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-BUFFERS
2025-05-12 11:12:45 -07:00
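The default quoted in 49c307e136 can be worked through with a short Python sketch. It follows only the quoted sentence (1/32nd of `shared_buffers`, clamped between 64kB and one WAL segment) and ignores any rounding PostgreSQL applies internally; the function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def default_wal_buffers_bytes(
    shared_buffers_bytes: int, wal_segment_bytes: int = 16 * MB
) -> int:
    """Approximate the automatic wal_buffers value described in the quoted docs."""
    candidate = shared_buffers_bytes // 32
    return max(64 * KB, min(candidate, wal_segment_bytes))
```

For example, 128MB of `shared_buffers` yields 4MB, while anything at or above 512MB reaches the 16MB ceiling of a typical WAL segment.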
Alex Vandiver
45b6c5931a puppet: Remove unnecessary use of scope[...]. 2025-05-12 11:12:45 -07:00
Alex Vandiver
a4ddc0a2d5 postgres: Remove lower-than-suggested checkpoint_completion_target.
This defaults to 0.9, and lowering it is not generally suggested[^1].

[^1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET
2025-05-12 11:12:45 -07:00