Commit Graph

1228 Commits

Author SHA1 Message Date
Alex Vandiver
93d2c77225 restore-backup: Provide flags to leave settings.py and zulip.conf as-is. 2022-07-21 11:45:41 -07:00
Alex Vandiver
5cd22c2c80 restore-backup: Only extract /etc/zulip once.
This is already handled in the earlier block; there is no need to
extract it twice.
2022-07-21 11:45:41 -07:00
Alex Vandiver
808838597a restore-backup: Switch to run() to check exit codes. 2022-07-21 11:45:41 -07:00
Alex Vandiver
639d42c59f restore-backup: Ensure it is run as root. 2022-07-21 11:45:41 -07:00
Alex Vandiver
9d3fb85897 install/upgrade: Allow new packages during apt-get upgrade.
`postgresql-14.4` is a notable upgrade in the PostgreSQL series, as it
fixes potential database corruption from `CREATE INDEX CONCURRENTLY`
statements which are run while rows are modified[1].  However, it also
requires an upgrade from `libllvm9` to `libllvm10`, which means it is
not installed by a mere `apt-get upgrade`.

Add the `--with-new-pkgs` flag to all of the potentially relevant
`apt-get upgrade` calls, so that this (and similar) packages are
upgraded successfully.

[1]: https://www.postgresql.org/docs/release/14.4/

(cherry picked from commit a35af3f38b)
2022-06-21 11:22:39 -07:00
Alex Vandiver
70aed5e26c upgrade-zulip-from-git: init, then add remote.
30457ecd02 removed the `--mirror` from
initial clones, but did not add back `--bare`, which `--mirror`
implies.  This leads to `/srv/zulip.git` having a working tree in it,
with a `/srv/zulip.git/.git` directory.

This is mostly harmless, and since the bug was recent, not worth
introducing additional complexity into the upgrade process to handle.

Calling `git clone --bare`, however, would clone the refs into
`refs/heads/`, not the `refs/remotes/origin/` we want.  Instead, use
`git init --bare`, followed by `git remote add origin`.  The remote
will be fetched by the usual `git fetch --all --prune` which is below.

(cherry picked from commit 5bdc4b3562)
2022-06-20 11:01:27 -07:00
Alex Vandiver
30ef55ca6c upgrade-zulip-from-git: Check fetch refspecs, not mirror flag.
While the `remote.origin.mirror` boolean being set is a very good
proxy for having been cloned with `--mirror`, is technically only used
when pushing into the remote[1].  What we care about is if fetches
from this remote will overwrite `refs/heads/`, or all of `refs/` --
the latter of which is most likely, from having run `git clone
--bare`.

Detect either of these fetch refspecs, and not the mirror flag.  We
let the upgrade process error out if `remote.origin.fetch` is unset,
as that represents an unexpected state.  We ignore failures to unset
the `remote.origin.mirror` flag, in case it is not set already.

[1]: https://git-scm.com/docs/git-config#Documentation/git-config.txt-remoteltnamegtmirror

(cherry picked from commit 1639792e9e)
2022-06-20 11:01:23 -07:00
Alex Vandiver
09bd546210 upgrade-zulip-from-git: Stop mirroring the remote.
The local `/srv/zulip.git` directory has been cloned with `--mirror`
since it was first created as a local cache in dc4b89fb08.  This
made some sense at the time, since it was purely a cache of the
remote, and not a home to local branches of its own.

That changed in 3f83b843c2, when we began using `git worktree`,
which caused the `deployment-...` branches to begin being stored in
`/src/zulip.git`.  This caused intermixing of local and remote
branches.

When 02582c6956 landed, the addition of `--prune` caused all but the
most recent deployment branch to be deleted upon every fetch --
leaving previous deployments with non-existent branches checked out:

```
zulip@example-prod-host:~/deployments/last$ git status
On branch deployment-2022-04-15-23-07-55

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .browserslistrc
	new file:   .codecov.yml
	new file:   .codespellignore
	new file:   .editorconfig
[...snip list of every file in repo...]
```

Switch `/srv/zulip.git` to no longer be a `--mirror` cache of the
origin.  We reconfigure the remote to drop `remote.origin.mirror`, and
delete all refs under `refs/pulls/` and `refs/heads/`, while
preserving any checked-out branches.  `refs/pulls/`, if the remote is
the canonical upstream, contains _tens of thousands_ of refs, so
pruning those refs trims off 20% of the repository size.

Those savings require a `git gc --prune=now`, otherwise the dangling
objects are ejected from the packfiles, which would balloon the
repository up to more than three times its previous size.  Repacking
the repository is reasonable, in general, after removing such a large
number of refs -- and the `--prune=now` is safe and will not lose
data, as the `--mirror` was good at ensuring that the repository could
not be used for any local state.

The refname in the upgrade process was previously resolved from the
union of local and remote refs, since they were in the same namespace.
We instead now only resolve arguments as tags, then origin branches;
this means that stale local branches will be skipped.  Users who want
to deploy from local branches can use `--remote-url=.`.

Because the `scripts/lib/upgrade-zulip-from-git` file is "stage 1" and
run from the old version's code, this will take two invocations of
`upgrade-zulip-from-git` to take effect.

Fixes #21901.

(cherry picked from commit 30457ecd02)
2022-06-20 11:01:08 -07:00
Alex Vandiver
8619f858f6 upgrade: Add --skip-restart which preps but does not restart.
This adds a --skip-restart which makes `deployments/next` in a state
where it can be restarted into, but holds off on conducting that
restart.

This requires many of the same guarantees as `--skip-tornado`, in
terms of there being no Puppet or database schema changes between the
versions.  Enforce those with `--skip-restart`, and also broaden both
flags to prevent other, less common changes which nonetheless
potentially might affect the other deploy.

(cherry picked from commit 6337f17923)
2022-06-20 11:00:14 -07:00
Alex Vandiver
97f49cc555 upgrade: Enforce that --skip-tornado does not have Puppet or DB changes.
(cherry picked from commit 86a4e64726)
2022-06-20 11:00:14 -07:00
Alex Vandiver
096e7af06d upgrade: Copy cache prefix with --skip-tornado.
Because Tornado and Django use memcached as a shared cache for
checking session information, they must agree on the prefix used to
store those values.

Subsequent commits will work to ensure that it is always _safe_ to
share that cache.

(cherry picked from commit ef7c2ea0ea)
2022-06-20 11:00:14 -07:00
Alex Vandiver
e6f52eb2a0 upgrade: Only run Django system checks once, explicitly.
These are expensive, and moving them to one explicit call early has
considerable time savings in the critical period:

```
$ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks'
Benchmark #1: ./manage.py fill_memcached_caches
  Time (mean ± σ):      5.264 s ±  0.146 s    [User: 4.885 s, System: 0.344 s]
  Range (min … max):    5.119 s …  5.569 s    10 runs

Benchmark #2: ./manage.py fill_memcached_caches --skip-checks
  Time (mean ± σ):      3.090 s ±  0.089 s    [User: 2.853 s, System: 0.214 s]
  Range (min … max):    2.950 s …  3.204 s    10 runs

Summary
  './manage.py fill_memcached_caches --skip-checks' ran
    1.70 ± 0.07 times faster than './manage.py fill_memcached_caches'
```

(cherry picked from commit fa77be6e6c)
2022-06-20 11:00:14 -07:00
Alex Vandiver
51ff34083e restart-server: Treat as a start if nothing is running.
Treating the restart as a start is important in reducing the critical
period during upgrades -- we call restart even when we suspect the
services are stopped, because puppet has a small possibility of
placing them in indeterminate state.  However, restart orders the
workers first, then tornado/django, which prolongs the outage.

Recognize when no services are currently started, and switch to acting
like a start, not a restart, which places tornado/django first.

(cherry picked from commit 3928606886)
2022-06-20 11:00:14 -07:00
Alex Vandiver
41038c3510 stop-server: Only stop services if they exist and are running.
This hides ugly output if the services were already stopped:

```
2022-03-25 23:26:04,165 upgrade-zulip-stage-2: Stopping Zulip...
process-fts-updates: ERROR (not running)
zulip-django: ERROR (not running)
zulip_deliver_scheduled_emails: ERROR (not running)
zulip_deliver_scheduled_messages: ERROR (not running)

Zulip stopped successfully!
```

Being able to skip having to shell out to `supervisorctl`, if all
services are already stopped is also a significant performance
improvement.

(cherry picked from commit 3717c329b8)
2022-06-20 11:00:14 -07:00
Alex Vandiver
25c87d9823 upgrade: Check with zulip-puppet-apply to see if we can skip it.
(cherry picked from commit 2e5a079ef4)
2022-06-20 11:00:14 -07:00
Alex Vandiver
14e60fd203 zulip-puppet-apply: Make --force --noop have an exit code.
(cherry picked from commit ecfc23bd0b)
2022-06-20 11:00:14 -07:00
Alex Vandiver
236508f61e zulip-puppet-apply: Factor out the --noop returncode logic.
(cherry picked from commit c91725bfb5)
2022-06-20 11:00:14 -07:00
Alex Vandiver
4bbcfd0499 upgrade: Skip the pre-work if the server is already stopped.
This optimization makes sense if the server is already running, but if
it is already stopped, it is just prolonging the downtime.

(cherry picked from commit b15d8e0118)
2022-06-20 11:00:14 -07:00
Alex Vandiver
80bf880d6f upgrade: Fill caches before the critical period, if possible.
(cherry picked from commit 05af4b0a11)
2022-06-20 11:00:14 -07:00
Alex Vandiver
7039f1d182 upgrade: Move puppet class renames earlier.
These do not need to happen during the critical period when the server
is stopped.

(cherry picked from commit 2f7068ffbb)
2022-06-20 11:00:14 -07:00
Anders Kaseorg
cf86e7b3d8 apt-repos: Remove now-unneeded Ubuntu 21.10 repository on 22.04.
Followup to commit f8957863a2 (#22055).

Signed-off-by: Anders Kaseorg <anders@zulip.com>
(cherry picked from commit 98ed6248e3)
2022-05-26 10:35:49 -07:00
Anders Kaseorg
472e216cec Revert "apt-repos: Downgrade PostgreSQL to dodge PGroonga regression."
This reverts commit 9c8d2b7be3 (#21115).

The PostgreSQL fix was released 2022-05-12.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
(cherry picked from commit f8957863a2)
2022-05-18 17:43:03 -07:00
Anders Kaseorg
1593ab6082 install: Resupport Ubuntu 22.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
(cherry picked from commit e952641013)
2022-05-16 12:05:23 -07:00
Anders Kaseorg
3bc1ad05f7 zulip-puppet-apply: Work around broken Puppet on Ubuntu 22.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
(cherry picked from commit 25c87cc7da)
2022-05-16 12:05:23 -07:00
Alex Vandiver
fb9cdf0f56 compare-settings-to-template: Handle prod_settings_template renaming.
(cherry picked from commit 3476f63dca)
2022-04-28 15:46:22 -07:00
Alex Vandiver
df80303a64 compare-settings-to-template: Simplify and dedent logic.
(cherry picked from commit b6b6faa404)
2022-04-28 15:46:22 -07:00
Alex Vandiver
d7fb2292eb compare-settings-to-template: Fetch 100 per pagination.
(cherry picked from commit d205050ab0)
2022-04-28 15:46:22 -07:00
Alex Vandiver
827d1d9d3b compare-settings-to-template: Paginate through all tags.
The default page size is 30, which means this only goes back to 4.6 at
present, due to starting with `shared-...` and old `enterprise-...`
tags.

(cherry picked from commit d79776f80d)
2022-04-28 15:46:22 -07:00
Anders Kaseorg
4d0ddf483d actions: Delete zerver.lib.actions.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
(cherry picked from commit cc30ed8ec7)
2022-04-15 10:08:19 -07:00
Alex Vandiver
c7df68eb48 check-database-compatibility: Sort and prettify output.
(cherry picked from commit 09860dc284)
2022-04-06 14:16:55 -07:00
Alex Vandiver
58e29a9ca0 supervisor: 'foo:*' also matches 'foo'.
7c4293a7d3 switched to checking if the
service was already running, and use `supervisorctl start` if it was
not.

Unfortunately, `list_supervisor_processes("zulip-tornado:*")` did not
include `zulip-tornado`, and as such a non-sharded process was always
considered to _not_ be running, and was thus started, not restarted.
Starting an already-started service is a no-op, and thus non-sharded
tornado processes were never restarted.

The observed behaviour is that requests to the tornado process attempt
to load the user from the cache, with a different prefix from Django,
and immediately invalidate the session and eject the user back to the
login page.

Fix the `list_supervisor_processes` logic to match without the
trailing `:*`.

(cherry picked from commit 65e19c4fbd)
2022-04-01 17:25:29 -07:00
Alex Vandiver
266dbad737 check-database-compatibility: Ignore squashed and renamed migrations.
Fixes: #21596.
(cherry picked from commit eb31681934)
2022-04-01 17:03:38 -07:00
Alex Vandiver
fc0d5fcfd5 upgrade: Mark puppet as having started the server.
We previously used restart-server if puppet was run, as a nod to the
fact that `supervisor reread && supervisor update` will _start_
service groups that were modified, even if they were previously
stopped; this is because they are marked as `autostart=true`, which is
honored on service change.

However, upgrades want to run while there are no services running.  If
puppet is run, explicitly set the server as potentially being "up", so
that a `shutdown_server()` before migrations, if they exist, will stop
services.

(cherry picked from commit 0af00a3233)
2022-03-31 17:36:57 -07:00
Alex Vandiver
00382078ad upgrade: Move the shutdown_server calls to where they are relevant.
shutdown_server is a noop if the server is already stopped; placing
these in each block makes the logic more apparent.

(cherry picked from commit e9596637e7)
2022-03-31 17:36:55 -07:00
Anders Kaseorg
55882fb343 python: Use modern set comprehension syntax.
Generated by pyupgrade.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-25 10:45:12 -07:00
Anders Kaseorg
1f68c73e66 supervisor: Update superseded super(C, self) syntax to superior super().
Generated by pyupgrade.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-25 10:45:12 -07:00
Anders Kaseorg
2762121162 python: Convert last type comments to annotations.
We had skipped these in #14693 so we could keep generating a friendly
error on Python 3.5, but we gave that up in #19801.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-24 20:32:39 -07:00
Alex Vandiver
d7b59c86ce puppet: Build wal-g from source for aarch64.
Since wal-g does not provide binaries for aarch64, build them from
source.  While building them from source for arm64 would better ensure
that build process is tested, the build process takes 7min and 700M of
temp files, which is an unacceptable cost; we thus only build on
aarch64.

Since the wal-g build process uses submodules, which are not in the
Github export, we clone the full wal-g repository.  Because the
repository is relatively small, we clone it anew on each new version,
rather than attempt to manage the remotes.

Fixes #21070.
2022-03-22 15:02:35 -07:00
Alex Vandiver
a4d0f03319 scripts: Switch to stop-server/restart-server.
stop-server and restart-server address all services which talk to the
database, and are thus more correct than restarting or stopping
everything in supervisor.

This is possible now that the previous commit ensures that the zulip
user can read the zulip installation directory during
`create-database`; previously, that directory was still owned by root
when `create-database` was run, whereas now it is in
`~zulip/deployments/`.
2022-03-21 16:33:28 -07:00
Alex Vandiver
c0cc98c6a8 install: Re-order final steps.
Move database creation to immediately before database initialization;
this means it happens in a directory readable by the `zulip` user, as
well as placing it alongside similar operations.  It removes the check
for the `zulip::postgresql_common` Puppet class; instead it keeps the
check for `--no-init-db`, and switches to require
`zulip::app_frontend_base`.  This is a behavior change for any install
of `zulip::postgresql_common`-only classes, but that is not a common
form -- and such installs likely already pass `--no-init-db` because
they are warm spare replicas.

As a result, all non-`zulip::app_frontend_base` installs now skip
database initialization, even without `--no-init-db`.  This is clearly
correct for, e.g. Redis-only hosts, and makes clearer that the
frontend, not the database host, is responsible for database
initialization.
2022-03-21 16:33:28 -07:00
Alex Vandiver
394f1eadde setup: Rename postgresql-init-db to create-database.
The old name was confusingly similar to initialize-database.
2022-03-21 16:33:28 -07:00
Anders Kaseorg
7d4b02738d install-node: Upgrade Node.js from 16.14.0 to 16.14.1.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-17 15:24:46 -07:00
Anders Kaseorg
84e91a6e33 configure-rabbitmq: Use rabbitmqctl await_online_nodes.
rabbitmqctl ping only checks that the Erlang process is registered
with epmd.  There’s a window after that where the rabbit app is still
starting inside it.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-14 16:26:05 -07:00
Alex Vandiver
52d363cada upgrade: Skip re-checking of new bots on upgrade.
This was added in c770bdaa3a, and we
have not added any realm-internal bots since
c770bdaa3a.

Speed up the critical period during upgrades by skipping this step.
2022-03-14 14:14:53 -07:00
Alex Vandiver
d26a15b14d setup-apt-repo: Make hashes file not contain full path.
Using an absolute `ZULIP_SCRIPTS` path when computing sha245sums
results in a set of hashes which varies based on the path that the
script is called as.  This means that each deploy _always_ has
`setup-apt-repo --verify` fail, since it is a different base path.

Make all paths passed to sha256sum be relative to the repository root,
ensuring they can be compared across runs.
2022-03-12 17:24:19 -08:00
Alex Vandiver
7c4293a7d3 restart-server: Check if service is running before restart, vs start.
In some instances (e.g. during upgrades) we run `restart-server` and
not `start-server`, even though we expect the server to most likely
already be stopped.  `supervisorctl restart servicename` if the
service is stopped produces the perhaps-alarming message:

```
restart-server: Restarting servicename
servicename: ERROR (not running)
servicename: started
```

This may cause operators to worry that something is broken, when it is
not.

Check if the service is already running, and switch from "restart" to
"start" in cases where it is not.

The race condition here is safe -- if the service transitions from
stopped to started between the check and the `start` call, it will
merely output:
```
servicename: ERROR (already started)
```
...and continue, as that has exit status 0.

If the service transitions from started to stopped between the check
and the `restart` call, we are merely back in the current case, where
it outputs:
```
servicename: ERROR (not running)
servicename: started
```

In none of these cases does a call to "restart" fail to result in the
service being stopped and then started.
2022-03-09 14:42:15 -08:00
Anders Kaseorg
646e466341 install: Desupport Ubuntu 22.04 for now.
Ubuntu 22.04 pushed a post-feature-freeze update to Python 3.10,
breaking virtual environments in a Debian patch
(https://bugs.launchpad.net/ubuntu/+source/python3.10/+bug/1962791).
Also, our antique version of Tornado doesn’t work in 3.10, and we’ll
need to do some work to upgrade that.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-07 11:46:07 -08:00
Anders Kaseorg
60e943b92e install-node: Upgrade Node.js from 16.13.2 to 16.14.0.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-01 23:09:46 -08:00
Anders Kaseorg
de1fb2b8d0 check-database-compatibility: Ignore guardian, django.contrib.sites.
We can safely ignore the presence of the extra tables that could be
left behind in the database from when we had these installed (before
Zulip 1.7.0 and 2.0.0, respectively).

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-03-01 10:30:23 -08:00
Tim Abbott
98a05257ea scripts: Print names of missing migrations in compatibility check.
This will make it much easier to debug any situations where this
happens.
2022-02-28 11:09:52 -08:00