Django's system checks are expensive, and moving them to one explicit
call early yields considerable time savings during the critical period:
```
$ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks'
Benchmark #1: ./manage.py fill_memcached_caches
  Time (mean ± σ):      5.264 s ±  0.146 s    [User: 4.885 s, System: 0.344 s]
  Range (min … max):    5.119 s …  5.569 s    10 runs

Benchmark #2: ./manage.py fill_memcached_caches --skip-checks
  Time (mean ± σ):      3.090 s ±  0.089 s    [User: 2.853 s, System: 0.214 s]
  Range (min … max):    2.950 s …  3.204 s    10 runs

Summary
  './manage.py fill_memcached_caches --skip-checks' ran
    1.70 ± 0.07 times faster than './manage.py fill_memcached_caches'
```
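A rough sketch of the shape of this change, assuming a small Python wrapper around `./manage.py` (the wrapper and its exact calls are illustrative, not the real restart logic):
```
#!/usr/bin/env python3
# Illustrative sketch: run Django's system checks once, up front, and
# skip them in the manage.py call made during the critical window.
import subprocess

MANAGE_PY = "./manage.py"  # assumed path, relative to the deploy root


def manage(*args: str) -> None:
    subprocess.check_call([MANAGE_PY, *args])


# Pay the cost of the system checks once, before the outage begins ...
manage("check")

# ... so the call made while services are restarting can skip them.
manage("fill_memcached_caches", "--skip-checks")
```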
Treating the restart as a start is important in reducing the critical
period during upgrades: we call restart even when we suspect the
services are stopped, because puppet has a small chance of leaving
them in an indeterminate state. However, restart orders the workers
first, then tornado/django, which prolongs the outage. Recognize when
no services are currently started, and switch to acting like a start,
not a restart, which brings tornado/django up first.
This also hides ugly output when the services were already stopped:
```
2022-03-25 23:26:04,165 upgrade-zulip-stage-2: Stopping Zulip...
process-fts-updates: ERROR (not running)
zulip-django: ERROR (not running)
zulip_deliver_scheduled_emails: ERROR (not running)
zulip_deliver_scheduled_messages: ERROR (not running)
Zulip stopped successfully!
```
Being able to skip shelling out to `supervisorctl` entirely when all
services are already stopped is also a significant performance
improvement.
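A sketch of that decision (the helper, the service lists, and their exact ordering here are illustrative; the real logic lives in the restart script):
```
import subprocess
from typing import List, Set


def supervisorctl(action: str, services: List[str]) -> None:
    subprocess.check_call(["supervisorctl", action, *services])


def restart_or_start(
    frontends: List[str], workers: List[str], running: Set[str]
) -> None:
    # `running` would come from something like list_supervisor_processes().
    if not running:
        # Nothing is up: act like a start.  Bring tornado/django up first
        # so the user-facing outage ends as soon as possible, and never
        # shell out to `supervisorctl stop` at all.
        supervisorctl("start", frontends)
        supervisorctl("start", workers)
    else:
        # A true restart keeps the existing order: workers first, then
        # tornado/django, which is what prolongs the outage.
        supervisorctl("restart", workers)
        supervisorctl("restart", frontends)
```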
These have more accurate timestamps and include user information, but
they are harder to parse and will not show requests when Django or
Tornado is stopped.
This is a script to search nginx log files by server hostname or
client IP address, and output matching lines, all while skipping
common and less-interesting request lines.
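A stripped-down sketch of that kind of filter (the log path, the line matching, and the skip patterns are illustrative assumptions, not the script's actual behaviour):
```
#!/usr/bin/env python3
# Illustrative sketch: filter nginx access-log lines by a hostname or
# client IP given on the command line, dropping common noise requests.
import re
import sys

# Assumed examples of "less interesting" requests to skip.
SKIP = re.compile(r'"GET /(static/|api/v1/events|json/events)')


def search(log_path: str, needle: str) -> None:
    with open(log_path, errors="replace") as log:
        for line in log:
            if needle in line and not SKIP.search(line):
                sys.stdout.write(line)


if __name__ == "__main__":
    search("/var/log/nginx/access.log", sys.argv[1])
```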
As a consequence:
• Bump minimum supported Python version to 3.8.
• Move Vagrant environment to Ubuntu 20.04, which has Python 3.8.
• Move CI frontend tests to Ubuntu 20.04.
• Move production build test to Ubuntu 20.04.
• Move 3.4 upgrade test to Ubuntu 20.04.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
We previously used restart-server if puppet was run, as a nod to the
fact that `supervisorctl reread && supervisorctl update` will _start_
service groups that were modified, even if they were previously
stopped; this is because they are marked as `autostart=true`, which is
honored when the service definition changes.
However, upgrades want to run while no services are running. If
puppet is run, explicitly mark the server as potentially being "up",
so that a `shutdown_server()` before migrations, if there are any to
run, will stop services.
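In sketch form (the helpers and the flag below are hypothetical stand-ins for the real upgrade-zulip-stage-2 steps):
```
def apply_puppet() -> None:
    print("zulip-puppet-apply ...")  # placeholder


def shutdown_server() -> None:
    print("supervisorctl stop all")  # placeholder


def upgrade(run_puppet: bool, has_migrations: bool, server_is_up: bool) -> None:
    if run_puppet:
        apply_puppet()
        # `supervisorctl reread && supervisorctl update` may have
        # autostarted modified service groups, so treat the server as
        # potentially "up" again even if it was stopped earlier.
        server_is_up = True

    if has_migrations and server_is_up:
        # Upgrades want no services running during migrations.
        shutdown_server()
```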
7c4293a7d3 switched to checking if the
service was already running, and to using `supervisorctl start` if it
was not.
Unfortunately, `list_supervisor_processes("zulip-tornado:*")` did not
include `zulip-tornado`, and as such a non-sharded process was always
considered to _not_ be running, and was thus started, not restarted.
Starting an already-started service is a no-op, and thus non-sharded
tornado processes were never restarted.
The observed behaviour is that requests to the tornado process attempt
to load the user from the cache using a different key prefix than
Django's, and immediately invalidate the session, ejecting the user
back to the login page.
Fix the `list_supervisor_processes` logic to match without the
trailing `:*`.
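One way to express the corrected matching (a sketch of the idea, not the function's actual body):
```
import fnmatch
from typing import List


def name_matches(process_name: str, patterns: List[str]) -> bool:
    # A sharded pattern like "zulip-tornado:*" must also match the
    # plain, non-sharded "zulip-tornado" process.
    for pattern in patterns:
        if fnmatch.fnmatch(process_name, pattern):
            return True
        if pattern.endswith(":*") and process_name == pattern[: -len(":*")]:
            return True
    return False


assert name_matches("zulip-tornado", ["zulip-tornado:*"])
assert name_matches("zulip-tornado:zulip-tornado-port-9800", ["zulip-tornado:*"])
```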
We had skipped these in #14693 so we could keep generating a friendly
error on Python 3.5, but we gave that up in #19801.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Since wal-g does not provide binaries for aarch64, build them from
source. While building from source on amd64 as well would better
ensure that the build process is tested, the build takes about 7
minutes and 700MB of temporary files, which is an unacceptable cost;
we thus only build from source on aarch64.
Since the wal-g build process uses submodules, which are not included
in GitHub's generated tarball export, we clone the full wal-g
repository. Because the repository is relatively small, we clone it
anew for each new version, rather than attempting to manage remotes.
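A sketch of the per-architecture decision (the version, URLs, asset name, and build commands are assumptions for illustration; the real implementation is puppet, not Python):
```
# Illustrative sketch of the per-architecture choice; the release asset
# name, build commands, and paths are assumptions.
import platform
import subprocess

WAL_G_VERSION = "v1.1.1"  # example version


def install_wal_g() -> None:
    if platform.machine() == "x86_64":
        # Upstream publishes amd64 binaries; fetch the release asset.
        subprocess.check_call([
            "curl", "-fLO",
            "https://github.com/wal-g/wal-g/releases/download/"
            f"{WAL_G_VERSION}/wal-g-pg-ubuntu-20.04-amd64.tar.gz",
        ])
    else:
        # aarch64: no upstream binary, so build from source.  Clone the
        # full repository (GitHub's generated tarball omits submodules),
        # anew for each version, rather than managing remotes.
        checkout = f"wal-g-{WAL_G_VERSION}"
        subprocess.check_call([
            "git", "clone", "--recurse-submodules",
            "--branch", WAL_G_VERSION,
            "https://github.com/wal-g/wal-g.git", checkout,
        ])
        # Build commands here are an assumption; see wal-g's own docs.
        subprocess.check_call(["make", "deps", "pg_build"], cwd=checkout)
```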
Fixes #21070.