The claim in the comment from c8ec3dfcf6, that we can and should use
the current deploy's venv, misses one key case -- when upgrading the
operating system, the current deploy's venv is unworkable, since it
was configured for a previous version of Python. As such, any attempt
to load Django to verify the version of PostgreSQL it is talking to
must happen after the venv is configured.
Move the database version check into
`scripts/lib/check-database-compatibility`, which also moves it after
the new venv is configured.
Because we no longer reliably know, at `apt-get upgrade` time, what
version of PostgreSQL is installed, we hold all versions of the
pgroonga packages.
(cherry picked from commit 1accc6929e)
Updating the pgroonga package is not sufficient to upgrade the
extension in PostgreSQL -- an `ALTER EXTENSION pgroonga UPDATE` must
explicitly be run[^1]. Failure to do so can lead to unexpected behavior,
including crashes of PostgreSQL.
Expand on the existing `pgroonga_setup.sql.applied` file, to track
which version of the PostgreSQL extension has been configured. If the
file exists but is empty, we run `ALTER EXTENSION pgroonga UPDATE`
regardless -- if it is a no-op, it still succeeds with a `NOTICE`:
```
zulip=# ALTER EXTENSION pgroonga UPDATE;
NOTICE: version "3.0.8" of extension "pgroonga" is already installed
ALTER EXTENSION
```
The simple `ALTER EXTENSION` is sufficient for the
backwards-compatible case[^1] -- which, for our usage, is every
upgrade since 0.9 -> 1.0. Since version 1.0 was released in 2015,
before pgroonga support was added to Zulip in 2016, we can assume for
the moment that all pgroonga upgrades are backwards-compatible, and
not bother regenerating indexes.
Fixes: #25989.
[^1]: https://pgroonga.github.io/upgrade/
(cherry picked from commit c8ec3dfcf6)
These hooks are run immediately around the critical section of the
upgrade. If the upgrade fails for preparatory reasons, the pre-deploy
hook may not be run; if it fails during the upgrade, the post-deploy
hook will not be run. Hooks are called from the CWD of the new
deploy, with arguments of the old version and the new version. If
they exit with non-0 exit code, the deploy aborts.
Corepack manages multiple per-project version of Yarn and PNPM, which
means we have to maintain less installation code, and could help us
switch away from Yarn 1 without making the system unusable for
development of other Yarn 1 projects.
https://nodejs.org/api/corepack.html
The Unicode spaces in the timerender test resulted from an ICU
upgrade: https://github.com/nodejs/node/pull/45068.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
If there is a syntax error in `settings.py`, `restart-server` should
provide a reasonable message about this. It did so prior to
af08bcdb3f, becausde any invocation `./manage.py` without
`--skip-checks` will verify `settings.py`, among several other checks.
After af08bcdb3f, there are no `./manage.py` calls in most restarts,
which fa77be6e6c took further.
Add an explicit `./manage.py check` in the default case.
upgrade-zulip-stage-2 overrides this by passing `--skip-checks`, for
performance. This also means that `upgrade-zulip-from-git` itself
picks up the same `--skip-checks` flag, since it inherits the same
flag parsing, though that is perhaps of dubious utility.
`postgresql-14.4` is a notable upgrade in the PostgreSQL series, as it
fixes potential database corruption from `CREATE INDEX CONCURRENTLY`
statements which are run while rows are modified[1]. However, it also
requires an upgrade from `libllvm9` to `libllvm10`, which means it is
not installed by a mere `apt-get upgrade`.
Add the `--with-new-pkgs` flag to all of the potentially relevant
`apt-get upgrade` calls, so that this (and similar) packages are
upgraded successfully.
[1]: https://www.postgresql.org/docs/release/14.4/
This adds a --skip-restart which makes `deployments/next` in a state
where it can be restarted into, but holds off on conducting that
restart.
This requires many of the same guarantees as `--skip-tornado`, in
terms of there being no Puppet or database schema changes between the
versions. Enforce those with `--skip-restart`, and also broaden both
flags to prevent other, less common changes which nonetheless
potentially might affect the other deploy.
Because Tornado and Django use memcached as a shared cache for
checking session information, they must agree on the prefix used to
store those values.
Subsequent commits will work to ensure that it is always _safe_ to
share that cache.
These are expensive, and moving them to one explicit call early has
considerable time savings in the critical period:
```
$ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks'
Benchmark #1: ./manage.py fill_memcached_caches
Time (mean ± σ): 5.264 s ± 0.146 s [User: 4.885 s, System: 0.344 s]
Range (min … max): 5.119 s … 5.569 s 10 runs
Benchmark #2: ./manage.py fill_memcached_caches --skip-checks
Time (mean ± σ): 3.090 s ± 0.089 s [User: 2.853 s, System: 0.214 s]
Range (min … max): 2.950 s … 3.204 s 10 runs
Summary
'./manage.py fill_memcached_caches --skip-checks' ran
1.70 ± 0.07 times faster than './manage.py fill_memcached_caches'
```
As a consequence:
• Bump minimum supported Python version to 3.8.
• Move Vagrant environment to Ubuntu 20.04, which has Python 3.8.
• Move CI frontend tests to Ubuntu 20.04.
• Move production build test to Ubuntu 20.04.
• Move 3.4 upgrade test to Ubuntu 20.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We previously used restart-server if puppet was run, as a nod to the
fact that `supervisor reread && supervisor update` will _start_
service groups that were modified, even if they were previously
stopped; this is because they are marked as `autostart=true`, which is
honored on service change.
However, upgrades want to run while there are no services running. If
puppet is run, explicitly set the server as potentially being "up", so
that a `shutdown_server()` before migrations, if they exist, will stop
services.
This was added in c770bdaa3a, and we
have not added any realm-internal bots since
c770bdaa3a.
Speed up the critical period during upgrades by skipping this step.
This was only used for upgrading from Zulip < 1.9.0, which is no
longer possible because Zulip < 2.1.0 had no common supported
platforms with current main.
If we ever want this optimization for a future migration, it would be
better implemented using Django merge migrations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Attempting to "upgrade" from `main` to 4.x should abort; Django does
not prevent running old code against the new database (though it
likely errors at runtime), and `./manage.py migrate` from the old
version during the "upgrade" does not downgrade the database, since
the migrations are entirely missing in that directory, so don't get
reversed.
Compare the list of applied migrations to the list of on-disk
migrations, and abort if there are applied migrations which are not
found on disk.
Fixes: #19284.
The RabbitMQ docs state ([1]):
RabbitMQ nodes and CLI tools (e.g. rabbitmqctl) use a cookie to
determine whether they are allowed to communicate with each
other. [...] The cookie is just a string of alphanumeric
characters up to 255 characters in size. It is usually stored in a
local file.
...and goes on to state (emphasis ours):
If the file does not exist, Erlang VM will try to create one with
a randomly generated value when the RabbitMQ server starts
up. Using such generated cookie files are **appropriate in
development environments only.**
The auto-generated cookie does not use cryptographic sources of
randomness, and generates 20 characters of `[A-Z]`. Because of a
semi-predictable seed, the entropy of this password is thus less than
the idealized 26^20 = 94 bits of entropy; in actuality, it is 36 bits
of entropy, or potentially as low as 20 if the performance of the
server is known.
These sizes are well within the scope of remote brute-force attacks.
On provision, install, and upgrade, replace the default insecure
20-character Erlang cookie with a cryptographically secure
255-character string (the max length allowed).
[1] https://www.rabbitmq.com/clustering.html#erlang-cookie
5c450afd2d, in ancient history, switched from `check_call` to
`check_output` and throwing away its result.
Use check_call, so that we show the steps to (re)starting the server.
As a consequence:
• Bump minimum supported Python version to 3.7.
• Move Vagrant environment to Debian 10, which has Python 3.7.
• Move CI frontend tests to Debian 10.
• Move production build test to Debian 10.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We've had a number of unhappy reports of upgrades failing due to
webpack requiring too much memory. While the previous commit will
likely fix this issue for everyone, it's worth improving the error
message for failures here.
We avoid doing the stop+retry ourselves, because that could cause an
outage in a production system if webpack fails for another reason.
Fixes#20105.
Since the upgrade to Webpack 5, we've been seeing occasional reports
that servers with roughly 4GiB of RAM were getting OOM kills while
running webpack.
Since we can't readily optimize the memory requirements for webpack
itself, we should raise the RAM requirements for doing the
lower-downtime upgrade strategy.
Fixes#20231.