c5200e8b05 switched `digest_emails` from sending emails by inserting
into the ScheduledEmail table (to be processed later by
`deliver_scheduled_emails`) to inserting into the
`deferred_email_senders` RabbitMQ queue. This moved the backlog from
an unmonitored table to a monitored queue.
This slightly improved throughput -- but it also began paging, since
the backlog was now in a monitored form. Increase the paging
thresholds so that we do not page for expected behaviour.
The cron jobs are potentially wrapped by Sentry, which logs "cron
failures" and sends emails. We would like those failures to only be
when the cron job itself failed to run successfully -- not when the
underlying metric is outside of its normal range. We would like to
differentiate a failure of the monitoring infrastructure from a
failure of what it is monitoring.
Swap the checks to return 0 for every result except "unknown".
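As a minimal sketch of the intended exit-code policy (the check
function and its result states are hypothetical stand-ins for the
real Nagios-style checks):
```python
import sys

def check_queue_backlog() -> str:
    # Hypothetical stand-in for a real check; returns one of
    # "ok", "warning", "critical", or "unknown".
    return "ok"

def main() -> int:
    state = check_queue_backlog()
    print(f"queue backlog: {state}")
    # Only "unknown" -- meaning the monitoring itself failed -- is
    # reported to Sentry as a cron failure; a metric that is merely
    # out of range exits 0, and pages via the normal alerting path.
    return 1 if state == "unknown" else 0

if __name__ == "__main__":
    sys.exit(main())
```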
`deliver_scheduled_emails` tries to deliver the email synchronously,
and if it fails, it retries after 10 seconds. Since it does not track
retries, and always tries the earliest-scheduled due message first,
the worker will not make forward progress if there is a persistent
failure with that message, and will retry it indefinitely.
This can result in excessive network or email delivery charges from
the remote SMTP server.
Switch to delivering emails via a new queue worker. The
`deliver_scheduled_emails` job now serves only to pull deferred jobs
out of the table once they are due, insert them into RabbitMQ, and
then delete them. This limits the potential for head-of-queue
failures to failures inserting into RabbitMQ, which is more reasonable
than failures speaking to a complex external system we do not control.
Retries and any connections to the SMTP server are left to the
RabbitMQ consumer.
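A simplified sketch of the new division of labor (the SQL table name
and the pika wiring here are illustrative, not the exact Zulip code):
```python
import json
from datetime import datetime, timezone

def forward_due_emails(connection, channel) -> None:
    # Pull due rows out of the ScheduledEmail table, publish them to
    # RabbitMQ, and delete them; SMTP is never contacted here.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT id, data FROM zerver_scheduledemail"
            " WHERE scheduled_timestamp <= %s",
            [datetime.now(tz=timezone.utc)],
        )
        for row_id, data in cursor.fetchall():
            # A failed RabbitMQ publish is now the only possible
            # head-of-queue failure; retries against the SMTP server
            # are left entirely to the queue consumer.
            channel.basic_publish(
                exchange="",
                routing_key="deferred_email_senders",
                body=json.dumps(data),
            )
            cursor.execute(
                "DELETE FROM zerver_scheduledemail WHERE id = %s",
                [row_id],
            )
```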
We build a new RabbitMQ queue, rather than use the existing
`email_senders` queue, because that queue is expected to be reasonably
low-latency, for things like missed message notifications. The
`send_future_email` codepath which inserts into ScheduledEmail is
also (ab)used to send digest emails, which are extremely bursty in their
frequency -- and a large burst could significantly delay emails behind
it in the queue.
The new queue is explicitly only for messages which were not initiated
by user actions (e.g., invitation reminders, digests, new account
follow-ups) which are thus not latency-sensitive.
Fixes: #32463.
The refactoring in 4e28e1d3ff incorrectly switched a check for
`if args.from_git` into `if NEW_ZULIP_MERGE_BASE` -- the merge-base
is always defined; it may merely match the version. This led to
errors when installing from a tarball, without a git repo.
Since the run_hooks command was already set up to take a `--from-git`
argument, but was ignoring it, pass down that flag from
upgrade-zulip-stage-3 when necessary, and swap run_hooks back to
basing its version-resolution logic on that flag.
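A sketch of the corrected flow (the environment-variable names are
illustrative; the point is that version resolution keys off the flag,
not off `NEW_ZULIP_MERGE_BASE` being non-empty):
```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--from-git", action="store_true", help="upgrading from a git checkout"
)
args = parser.parse_args()

# NEW_ZULIP_MERGE_BASE is always defined (it may simply equal the
# version), so its truthiness cannot distinguish a git upgrade from a
# tarball install; the explicit flag can.
if args.from_git:
    new_version = os.environ.get("NEW_ZULIP_MERGE_BASE", "")
else:
    new_version = os.environ.get("NEW_ZULIP_VERSION", "")
```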
When flushing caches, we want to ensure that even processes which may
have a wrong cache-key-prefix know to fetch the latest data from the
database. This is complicated by the cache-key-prefixes being stored
on disk, so reading them from disk on every cache delete is not
sufficiently performant.
We store the list of cache-key-prefixes in the cache itself, with no
prefix. This entry is updated when a new cache-key-prefix is written,
and is also allowed to lapse after 24 hours. Updating this global cache
entry on new prefix creation ensures that even a
not-yet-restarted-into deployment will have its caches appropriately
purged if changes are made to the underlying data.
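A sketch of this scheme with a raw python-memcached client (the
helper names and the exact unprefixed key are assumptions):
```python
import memcache

UNPREFIXED_KEY = "cache_key_prefixes"  # deliberately stored with no prefix
DAY_SECONDS = 24 * 60 * 60

def register_prefix(client: memcache.Client, prefix: str) -> None:
    # Called when a deployment creates a new cache-key-prefix, so that
    # processes still running with an older prefix can be flushed too.
    prefixes = set(client.get(UNPREFIXED_KEY) or [])
    prefixes.add(prefix)
    client.set(UNPREFIXED_KEY, list(prefixes), time=DAY_SECONDS)

def flush_key(client: memcache.Client, key: str) -> None:
    # Every delete fans out across all known prefixes; this is the
    # multiplicative cost noted below.
    for prefix in client.get(UNPREFIXED_KEY) or []:
        client.delete(prefix + key)
```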
However, this both adds a cache-get, as well as multiplies the size of
all cache clears; for large bulk clears (e.g. for stream renames,
which clear the cache for all message-ids in them) this may prove
untenable.
The old /srv/zulip-npm-cache system has been unused for two
years (Zulip Server ≥ 7.0). We can just delete this directory.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
This consolidates the list of stale migrations into
`lib/migration_status.py` as `STALE_MIGRATIONS`.
This is prep work to let the migration status tool at
`migration_status.py` clean these migrations out of its output as
well.
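A sketch of how a consumer could use the consolidated constant (the
`(app_label, migration_name)` tuple shape is an assumption):
```python
from lib.migration_status import STALE_MIGRATIONS

def without_stale(applied: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Drop known-stale (app_label, migration_name) pairs before
    # rendering the migration-status output.
    stale = set(STALE_MIGRATIONS)
    return [migration for migration in applied if migration not in stale]
```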
5308fbdeac split out `zulip::postgresql_client`, and 80ef38757a
made it no longer depend on `zulip::postgresql_common`, but directly
on `zulipconf('postgresql', 'version', undef)`. However, the
installer depended on recognizing `zulip::postgresql_common` in the
list of pulled-in classes to know that we needed to keep the
`postgresql.version` setting in `/etc/zulip.conf`.
Update the installer to also recognize `zulip::postgresql_client` as a
class which tells us to keep `postgresql.version` in our settings.
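In Python pseudocode, the installer-side check now amounts to the
following (the class names are the real ones above; the helper itself
is illustrative):
```python
POSTGRESQL_CLASSES = {"zulip::postgresql_common", "zulip::postgresql_client"}

def keep_postgresql_version(puppet_classes: set[str]) -> bool:
    # Either class pulls in the PostgreSQL client tooling, and thus
    # requires postgresql.version to stay in /etc/zulip.conf.
    return bool(POSTGRESQL_CLASSES & puppet_classes)
```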
This provides significant size savings:
| Emoji set   | png size (bytes) | webp size (bytes) | webp/png percent |
| ----------- | ---------------- | ----------------- | ---------------- |
| google-blob | 1968954          | 1373350           | 69.75%           |
| twitter     | 2972820          | 2149672           | 72.31%           |
| google      | 3455270          | 2327834           | 67.37%           |
Since these are the largest assets that we ship to clients, it is
worth shaving off every byte we can.
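For reference, a conversion of this sort can be sketched with Pillow
(the actual build tooling and quality settings may differ):
```python
from PIL import Image

# Re-encode a PNG spritesheet as lossy WebP; "quality" trades size
# against fidelity, and RGBA transparency is preserved by WebP.
with Image.open("google.png") as image:
    image.save("google.webp", format="WEBP", quality=90)
```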
`setup_path()` previously only checked that some `zulip-py3-venv` was
the `sys.prefix`, not that it was the one associated with this
deployment. When `uwsgi` is started, it is started from `bin/uwsgi`
within a `zulip-py3-venv` virtualenv, and as such sets
`sys.executable` to that, resulting in uwsgi workers picking up the
library path of that virtualenv. On first start, `sys.path` thus
already matches the expected virtualenv, and the `setup_path` in
`zproject.wsgi` does nothing.
If a rolling restart was later done into a deployment with a different
virtualenv, the `zproject.wsgi` call to `setup_path()` did not change
`sys.path` to the new virtualenv, since it was already running within
_a_ virtualenv. This led to dependency version mismatches, and
potentially even more disastrous consequences if the old (but still
erroneously in use) virtualenv was later garbage-collected.
PR #26771 was a previous attempt to resolve this, but failed because
it did not consider that the uwsgi binary itself may provide a
virtualenv path. We leave the `chdir` hooks from that PR in place, since
it cannot hurt for the "master" uwsgi process to be chdir'd to `/`,
and the `hook-post-fork` `chdir` is reasonable as well.
Resolve the virtualenv in `setup_path()`, and activate it if it
differs from the one that is currently active. To be sure that no
other old virtualenvs are used, we also filter out any paths which
appear to be from other Zulip virtualenvs.
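A simplified sketch of the corrected `setup_path()` (assuming a
virtualenv-style `activate_this.py` and a fixed deployment layout;
the filtering of other Zulip virtualenvs is the key addition):
```python
import os
import sys

def setup_path() -> None:
    # Resolve *this* deployment's virtualenv, rather than trusting
    # that we are already inside the right one.
    venv = os.path.realpath(
        os.path.join(os.path.dirname(__file__), "..", "zulip-py3-venv")
    )
    if os.path.realpath(sys.prefix) != venv:
        # Activate it even though we are already inside _a_ virtualenv
        # (e.g. the one the uwsgi binary itself was started from).
        activate = os.path.join(venv, "bin", "activate_this.py")
        with open(activate) as f:
            exec(f.read(), {"__file__": activate})
    # Filter out paths that appear to come from any other Zulip
    # virtualenv, so a stale deployment's libraries cannot be used.
    sys.path = [
        path
        for path in sys.path
        if "zulip-py3-venv" not in path or path.startswith(venv)
    ]
```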
We need this check when switching between branches where one of them
lacks the `help-beta` package. `node_modules` will be removed when
working on a non-`help-beta` branch, but if that branch has not
updated `node_modules/.pnpm/lock.yaml`, we can end up without
`node_modules` even after running the provision command.
We might not need this check once the initial `help-beta` folder has
been merged for a week or two, by which point almost all active PRs
will have been rebased onto main, making switching branches painless.
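A sketch of the guard described above (the helper name and layout are
hypothetical):
```python
import os

def need_pnpm_install(lockfile_unchanged: bool) -> bool:
    # Even if the pnpm lockfile hash is unchanged, reinstall when
    # node_modules is missing -- e.g. removed while working on a
    # branch without the help-beta package.
    return not lockfile_unchanged or not os.path.isdir("node_modules")
```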
There's no need for sharding, but this allows one to spend a bit of
extra memory to reduce image-processing latency when bursts of images
are uploaded at once.
A new table is created to track which path_id attachments are images
and, for those, their metadata and which thumbnails have been
created. Using path_id as the effective primary key lets us ignore
whether the attachment is archived or not, saving some foreign key
messes.
A new worker is added to observe events when rows are added to this
table, and to generate and store thumbnails for those images in
differing sizes and formats.
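A sketch of the table's shape as a Django model (the field names here
are assumptions, not the exact schema):
```python
from django.db import models

class ImageAttachment(models.Model):
    # Keyed on path_id rather than a ForeignKey to Attachment, so the
    # row is unaffected by the attachment being archived.
    path_id = models.TextField(unique=True)
    original_width_px = models.IntegerField()
    original_height_px = models.IntegerField()
    frames = models.IntegerField()  # >1 for animated images
    # One entry per generated thumbnail, e.g.
    # [{"width": 150, "height": 100, "format": "webp"}, ...]
    thumbnail_metadata = models.JSONField(default=list)
```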