Commit Graph

74 Commits

Author SHA1 Message Date
Alex Vandiver
0115fa9c60 start-server/restart-server: Drop privileges if necessary.
Rather than tell the user to re-run the command as `zulip` instead of
`root`, do the privilege-dropping ourselves.
2024-02-07 12:33:00 -08:00
Anders Kaseorg
d257002ad8 scripts: Use setup_path in restart-server, stop-server.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-10-12 12:28:41 -07:00
Anders Kaseorg
c43629a222 ruff: Fix PLW1510 subprocess.run without explicit check argument.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-08-17 17:05:34 -07:00
Alex Vandiver
f4683de742 puppet: Switch the rolling_restart setting to use the bool values.
2c5fc1827c standardized which values are "true"; use them.
2023-05-11 15:54:15 -07:00
Alex Vandiver
dd5dbcabcb start-server: More gracefully handle only starting part of the server.
While the previous commit handles the common case of all of the server
being started already, it still produces ERROR output lines from
supervisorctl when most of the server is already running.  Take the
case where one worker is stopped:

```
$ supervisorctl stop zulip-workers:zulip_events_deferred_work
zulip-workers:zulip_events_deferred_work: stopped
$ ./scripts/start-server
2023-04-04 15:50:28,505 start-server: Running syntax and database checks
System check identified no issues (15 silenced).
2023-04-04 15:50:31,977 start-server: Starting Tornado process on port 9800
zulip-tornado:zulip-tornado-port-9800: ERROR (already started)
2023-04-04 15:50:32,283 start-server: Starting Tornado process on port 9801
zulip-tornado:zulip-tornado-port-9801: ERROR (already started)
2023-04-04 15:50:32,592 start-server: Starting django server
zulip-django: ERROR (already started)
2023-04-04 15:50:33,340 start-server: Starting workers
zulip-workers:zulip_events_deferred_work: started
zulip_deliver_scheduled_emails: ERROR (already started)
zulip_deliver_scheduled_messages: ERROR (already started)
process-fts-updates: ERROR (already started)
2023-04-04 15:50:34,659 start-server: Done!
Zulip started successfully!
```

More gracefully handle these cases:
```
$ ./scripts/start-server
2023-04-04 15:52:39,815 start-server: Running syntax and database checks
System check identified no issues (15 silenced).
2023-04-04 15:52:43,270 start-server: Starting Tornado process on port 9800
2023-04-04 15:52:43,287 start-server: zulip-tornado:zulip-tornado-port-9800 already started!
2023-04-04 15:52:43,287 start-server: Starting Tornado process on port 9801
2023-04-04 15:52:43,300 start-server: zulip-tornado:zulip-tornado-port-9801 already started!
2023-04-04 15:52:43,300 start-server: Starting django server
2023-04-04 15:52:43,316 start-server: zulip-django already started!
2023-04-04 15:52:43,793 start-server: Starting workers
zulip-workers:zulip_events_deferred_work: started
2023-04-04 15:52:45,111 start-server: Done!
Zulip started successfully!
```
2023-04-04 10:58:56 -07:00
Alex Vandiver
cb097760b9 start-server: Make start-server a clean explicit no-op if already running.
Currently, the output from `start-server` if the server is already
running is potentially confusing, since it says ERROR several times:

```
$ ./scripts/start-server
2023-04-04 15:35:12,737 start-server: Running syntax and database checks
System check identified no issues (15 silenced).
2023-04-04 15:35:16,211 start-server: Starting Tornado process on port 9800
zulip-tornado:zulip-tornado-port-9800: ERROR (already started)
2023-04-04 15:35:16,528 start-server: Starting Tornado process on port 9801
zulip-tornado:zulip-tornado-port-9801: ERROR (already started)
2023-04-04 15:35:16,844 start-server: Starting django server
zulip-django: ERROR (already started)
2023-04-04 15:35:17,605 start-server: Starting workers
zulip_deliver_scheduled_emails: ERROR (already started)
zulip_deliver_scheduled_messages: ERROR (already started)
process-fts-updates: ERROR (already started)
2023-04-04 15:35:18,923 start-server: Done!
```

Catch the simple common case where all of the services are already
running, and output a clearer success message:

```
$ ./scripts/start-server
2023-04-04 15:39:52,367 start-server: Running syntax and database checks
System check identified no issues (15 silenced).
2023-04-04 15:39:55,857 start-server: Zulip is already started; nothing to do!
```
2023-04-04 10:58:56 -07:00
Anders Kaseorg
4eda29bd86 ruff: Fix RUF005 Consider spread instead of concatenation.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2023-01-26 10:16:30 -08:00
Alex Vandiver
f5f6a3789b restart-server: Default to running config and database checks.
If there is a syntax error in `settings.py`, `restart-server` should
provide a reasonable message about this.  It did so prior to
af08bcdb3f, becausde any invocation `./manage.py` without
`--skip-checks` will verify `settings.py`, among several other checks.
After af08bcdb3f, there are no `./manage.py` calls in most restarts,
which fa77be6e6c took further.

Add an explicit `./manage.py check` in the default case.
upgrade-zulip-stage-2 overrides this by passing `--skip-checks`, for
performance.  This also means that `upgrade-zulip-from-git` itself
picks up the same `--skip-checks` flag, since it inherits the same
flag parsing, though that is perhaps of dubious utility.
2022-10-14 13:10:46 -07:00
Alex Vandiver
fa77be6e6c upgrade: Only run Django system checks once, explicitly.
These are expensive, and moving them to one explicit call early has
considerable time savings in the critical period:

```
$ hyperfine './manage.py fill_memcached_caches' './manage.py fill_memcached_caches --skip-checks'
Benchmark #1: ./manage.py fill_memcached_caches
  Time (mean ± σ):      5.264 s ±  0.146 s    [User: 4.885 s, System: 0.344 s]
  Range (min … max):    5.119 s …  5.569 s    10 runs

Benchmark #2: ./manage.py fill_memcached_caches --skip-checks
  Time (mean ± σ):      3.090 s ±  0.089 s    [User: 2.853 s, System: 0.214 s]
  Range (min … max):    2.950 s …  3.204 s    10 runs

Summary
  './manage.py fill_memcached_caches --skip-checks' ran
    1.70 ± 0.07 times faster than './manage.py fill_memcached_caches'
```
2022-05-22 14:52:38 -07:00
Alex Vandiver
3928606886 restart-server: Treat as a start if nothing is running.
Treating the restart as a start is important in reducing the critical
period during upgrades -- we call restart even when we suspect the
services are stopped, because puppet has a small possibility of
placing them in indeterminate state.  However, restart orders the
workers first, then tornado/django, which prolongs the outage.

Recognize when no services are currently started, and switch to acting
like a start, not a restart, which places tornado/django first.
2022-05-22 14:52:38 -07:00
Alex Vandiver
7c4293a7d3 restart-server: Check if service is running before restart, vs start.
In some instances (e.g. during upgrades) we run `restart-server` and
not `start-server`, even though we expect the server to most likely
already be stopped.  `supervisorctl restart servicename` if the
service is stopped produces the perhaps-alarming message:

```
restart-server: Restarting servicename
servicename: ERROR (not running)
servicename: started
```

This may cause operators to worry that something is broken, when it is
not.

Check if the service is already running, and switch from "restart" to
"start" in cases where it is not.

The race condition here is safe -- if the service transitions from
stopped to started between the check and the `start` call, it will
merely output:
```
servicename: ERROR (already started)
```
...and continue, as that has exit status 0.

If the service transitions from started to stopped between the check
and the `restart` call, we are merely back in the current case, where
it outputs:
```
servicename: ERROR (not running)
servicename: started
```

In none of these cases does a call to "restart" fail to result in the
service being stopped and then started.
2022-03-09 14:42:15 -08:00
Alex Vandiver
2066860ab6 start-server: Start auxiliary services, if they exist.
Services like go-camo and smokescreen are not stopped in stop-server,
since they are upgraded and restarted by puppet application.  As such,
they also do not appear in start-server, despite the server relying on
them to be running to function properly.

Ensure those services are started, by starting them in start-server,
if they are configured in supervisor on the host.
2022-01-26 12:39:54 -08:00
Alex Vandiver
88c3f560ae supervisor: Add a filter for only(-not)-running. 2022-01-26 12:39:54 -08:00
Alex Vandiver
7243c3c73d scripts: Re-implement list_supervisor_processes using API. 2022-01-26 12:39:54 -08:00
Alex Vandiver
6218ed91c2 puppet: Use lazy-apps and uwsgi control sockets for rolling reloads.
Restarting the uwsgi processes by way of supervisor opens a window
during which nginx 502's all responses.  uwsgi has a configuration
called "chain reloading" which allows for rolling restart of the uwsgi
processes, such that only one process at once in unavailable; see
uwsgi documentation ([1]).

The tradeoff is that this requires that the uwsgi processes load the
libraries after forking, rather than before ("lazy apps"); in theory
this can lead to larger memory footprints, since they are not shared.
In practice, as Django defers much of the loading, this is not as much
of an issue.  In a very basic test of memory consumption (measured by
total memory - free - caches - buffers; 6 uwsgi workers), both
immediately after restarting Django, and after requesting `/` 60 times
with 6 concurrent requests:

                      |  Non-lazy  |  Lazy app  | Difference
    ------------------+------------+------------+-------------
    Fresh             |  2,827,216 |  2,870,480 |   +43,264
    After 60 requests |  3,332,284 |  3,409,608 |   +77,324
    ..................|............|............|.............
    Difference        |   +505,068 |   +539,128 |   +34,060

That is, "lazy app" loading increased the footprint pre-requests by
43MB, and after 60 requests grew the memory footprint by 539MB, as
opposed to non-lazy loading, which grew it by 505MB.  Using wsgi "lazy
app" loading does increase the memory footprint, but not by a large
percentage.

The other effect is that processes may be served by either old or new
code during the restart window.  This may cause transient failures
when new frontend code talks to old backend code.

Enable chain-reloading during graceful, puppetless restarts, but only
if enabled via a zulip.conf configuration flag.

Fixes #2559.

[1]: https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html#chain-reloading-lazy-apps
2022-01-05 14:48:52 -08:00
Alex Vandiver
fb3368b482 restart-server: Factor out argparser, to allow reuse. 2021-12-31 11:17:14 -08:00
Alex Vandiver
939d2e2705 scripts: Only stop/start existing tornado processes.
Stopping both `zulip-tornado` and `zulip-tornado:*` causes errors on
deploys with tornado sharding, as the plain `zulip-tornado` service
does not exist.

Pass `zulip-tornado:*`, which matches both plain `zulip-tornado`, as
well as the sharded `zulip-tornado:zulip-tornado-port-9800` cases.
2021-12-08 14:06:06 -08:00
Alex Vandiver
c9bb2c16cc restart-server: Add a --skip-tornado.
Tornado restarts are the most user-visible; provide a means to restart
everything but them, for changes which are known to not affect
Tornado.
2021-08-04 10:57:53 -07:00
Alex Vandiver
16691110a6 scripts: Only stop/restart zulip_deliver_scheduled_* processes if known.
Running `supervisorctl stop` or `supervisorctl restart` on a process
name which is not known is an error:

```
$ supervisorctl stop nonexistent-process
nonexistent-process: ERROR (no such process)
$ echo $?
1
```

ef6d0ec5ca moved
zulip_deliver_scheduled_* out of the `workers:` group.  Since upgrades
run `stop-server` before applying puppet, the list of processes at
that time is from the previous version of Zulip, so may not have the
new `zulip_deliver_scheduled_*` names -- and the `stop-server` will
hence fail.

If the upgrade is not applying puppet, it will `restart-server`. At
that point, the old names will still be in the configuration, so
relying on the current `superisorctl status` is the best gauge of what
exists to restart.

In short, only ever stop/start/restart the `zulip_deliver_scheduled_*`
processes if `supervisorctl status` knows about them already.
2021-07-09 10:04:53 -07:00
Alex Vandiver
85a9c0982a zulip_tools: Extract out list_supervisor_processes. 2021-07-09 10:04:53 -07:00
Gaurav Pandey
af08bcdb3f management: Delete send_stats command.
This command is part of a statsd infrastructure that we stopped
supporting years ago. Its only purpose for some time has been to
provide sample code for how the restart script might trigger a
notification to a graphing system, which doesn't justify maintaining
it.

Fixes part of #18898.
2021-06-25 09:13:48 -07:00
Alex Vandiver
d51272cc3d puppet: Remove zulip_deliver_scheduled_* from zulip-workers:*.
Staging and other hosts that are `zulip::app_frontend_base` but not
`zulip::app_frontend_once` do not have a
/etc/supervisor/conf.d/zulip/zulip-once.conf and as such do not have
`zulip_deliver_scheduled_emails` or `zulip_deliver_scheduled_messages`
and thus supervisor will fail to reload.

Making the contents of `zulip-workers` contingent on if the server is
_also_ a `-once` server is complicated, and would involve using Concat
fragments, which severely limit readability.

Instead, expel those two from `zulip-workers`; this is somewhat
reasonable, since they are use an entirely different codepath from
zulip_events_*, using the database rather than RabbitMQ for their
queuing.
2021-06-14 17:12:59 -07:00
Tim Abbott
de47feab43 scripts: Fix check for services running when upgrading.
When upgrading from a pre-4.0 release, scripts/stop-server logic would
check whether supervisord configuration files were present to
determine what it needed to restart, but only considered paths to
those files that are introduced in Zulip 4.0.

Fixed #18493.
2021-05-13 18:57:19 -07:00
Robert Imschweiler
534d78232c scripts: Add {start,stop,restart}-server support for postgresql role.
During the upgrade process of a postgresql-only Zulip installation,
(`puppet_classes = zulip::profile::postgresql` in
`/etc/zulip/zulip.conf`) either `scripts/start-server` or
`scripts/stop-server` fail because they try to handle supervisor
services that are not available (e.g. Tornado) since only
`/etc/supervisor/conf.d/zulip/zulip_db.conf` is present and not
`/etc/supervisor/conf.d/zulip/zulip.conf`.

While this wasn't previously supported, it's a pretty reasonable thing
to do, and can be readily supported by just adding a few conditionals.
2021-05-07 09:41:05 -07:00
Anders Kaseorg
9d57fa9759 puppet: Use pgrep -x to avoid accidental matches.
Matching the full process name (-x without -f) or full command
line (-xf) is less prone to mistakes like matching a random substring
of some other command line or pgrep matching itself.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-05-07 08:54:41 -07:00
Anders Kaseorg
405bc8dabf requirements: Remove Thumbor.
Thumbor and tc-aws have been dragging their feet on Python 3 support
for years, and even the alphas and unofficial forks we’ve been running
don’t seem to be maintained anymore.  Depending on these projects is
no longer viable for us.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-05-06 20:07:32 -07:00
Alex Vandiver
daabc52a78 restart-server: Reorder supervisorctl calls for less downtime.
Instead of taking the "onion" approach, where all services are
stopped, and then started back up again, default to a rolling restart
across all processes.  This draws out how long the overall "restart"
takes, but minimizes the time that any of the services are down.  This
minimizes user-visible impact and queue buildup.

In cases where speed is more important than minimal impact (for
example, there is already a current outage), a --less-graceful flag is
provided, which brings the services down more suddenly, and back up in
a still-correct order.
2021-04-30 16:47:15 -07:00
Alex Vandiver
ec12a6128a scripts: Add a start-server as well.
In general, `./scripts/restart-server` will already work in any
circumstance where the server is already stopped and needs to be
started.  However, it will output a couple minor warnings, and it is
not readily obvious that it *will* work correctly.

Add an alias for `restart-server` named `start-server`, for
parallelism with `stop-server`, which omits the steps of
`restart-server` which would stop the server first.
2021-04-21 10:24:08 -07:00
Alex Vandiver
31169526ec scripts: Say "Zulip" rather than "Application". 2021-04-21 10:24:08 -07:00
Alex Vandiver
0de8357820 scripts: Fix path to additional Zulip supervisor files.
The path which contains all of the Zulip supervisor files changed in
3ab9b31d2f to make it easier to purge
now-unwanted supervisor configuration files.  However, the paths that
the zulip upgrade process, and restart-server, look at were not
adjusted.

Fix the supervisor configuration file paths.
2021-04-21 10:24:08 -07:00
Anders Kaseorg
6e4c3e41dc python: Normalize quotes with Black.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Anders Kaseorg
11741543da python: Reformat with Black, except quotes.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2021-02-12 13:11:19 -08:00
Alex Vandiver
2a12fedcf1 tornado: Remove explicit tornado_processes setting; compute it.
We can compute the intended number of processes from the sharding
configuration.  In doing so, also validate that all of the ports are
contiguous.

This removes a discrepancy between `scripts/lib/sharding.py` and other
parts of the codebase about if merely having a `[tornado_sharding]`
section is sufficient to enable sharding.  Having behaviour which
changes merely based on if an empty section exists is surprising.

This does require that a (presumably empty) `9800` configuration line
exist, but making that default explicit is useful.

After this commit, configuring sharding can be done by adding to
`zulip.conf`:

```
[tornado_sharding]
9800 =              # default
9801 = other_realm
```

Followed by running `./scripts/refresh-sharding-and-restart`.
2020-09-18 15:13:40 -07:00
Alex Vandiver
efdaa58c24 supervisor: Use more specific process_name than "port-9800".
Making this include "zulip-tornado" makes it clearer in supervisor
logs.  Without this, one only sees:
```
2020-09-14 03:43:13,788 INFO waiting for port-9807 to stop
2020-09-14 03:43:14,466 INFO stopped: port-9807 (exit status 1)
2020-09-14 03:43:14,469 INFO spawned: 'port-9807' with pid 24289
2020-09-14 03:43:15,470 INFO success: port-9807 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
```
2020-09-14 22:17:51 -07:00
Alex Vandiver
dc58dec231 restart-server: Start services in opposite order from stop.
`supervisorctl` starts and stops its arguments sequentially, in the
order they are passed[1].  Start them in the opposite order from the
order in which they were stopped -- this puts the dependencies first,
and the most core services (`zulip-django`) last.

While the only "dependency" here is currently thumbor, this sets us up
in case others are added later.

[1] https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisorctl.py#L782
2020-09-14 16:27:15 -07:00
Anders Kaseorg
b4597a8ca8 python: Elide default for store_{true,false} argparse arguments.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-03 16:17:14 -07:00
Anders Kaseorg
1ded51aa9d python: Replace list literal concatenation with * unpacking.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-02 11:15:41 -07:00
Anders Kaseorg
a5dbab8fb0 python: Remove redundant dest for argparse arguments.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-09-02 11:04:10 -07:00
Anders Kaseorg
5dc9b55c43 python: Manually convert more percent-formatting to f-strings.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-14 23:27:22 -07:00
Anders Kaseorg
365fe0b3d5 python: Sort imports with isort.
Fixes #2665.

Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.

Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start.  I expect this change will increase pressure for us to split
those files, which isn't a bad thing.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-11 16:45:32 -07:00
Anders Kaseorg
67e7a3631d python: Convert percent formatting to Python 3.6 f-strings.
Generated by pyupgrade --py36-plus.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-06-10 15:02:09 -07:00
Anders Kaseorg
333f7d16c9 logging: Pass more format arguments to logging.
Commit bdc365d0fe (#14852) missed this
because of https://github.com/returntocorp/semgrep/issues/831.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2020-05-26 11:42:23 -07:00
Tim Abbott
0f1bdcc46f restart-server: Restart Tornado processes individually.
After some testing, I've confirmed that this seems to behave
significantly better in terms of the number of failed requests due to
Tornado being the process of restarting compared with the previous
version, as each individual process is only down for a short time,
rather than all of them being down at once.
2020-03-27 06:23:34 -07:00
Anders Kaseorg
ea6934c26d dependencies: Remove WebSockets system for sending messages.
Zulip has had a small use of WebSockets (specifically, for the code
path of sending messages, via the webapp only) since ~2013.  We
originally added this use of WebSockets in the hope that the latency
benefits of doing so would allow us to avoid implementing a markdown
local echo; they were not.  Further, HTTP/2 may have eliminated the
latency difference we hoped to exploit by using WebSockets in any
case.

While we’d originally imagined using WebSockets for other endpoints,
there was never a good justification for moving more components to the
WebSockets system.

This WebSockets code path had a lot of downsides/complexity,
including:

* The messy hack involving constructing an emulated request object to
  hook into doing Django requests.
* The `message_senders` queue processor system, which increases RAM
  needs and must be provisioned independently from the rest of the
  server).
* A duplicate check_send_receive_time Nagios test specific to
  WebSockets.
* The requirement for users to have their firewalls/NATs allow
  WebSocket connections, and a setting to disable them for networks
  where WebSockets don’t work.
* Dependencies on the SockJS family of libraries, which has at times
  been poorly maintained, and periodically throws random JavaScript
  exceptions in our production environments without a deep enough
  traceback to effectively investigate.
* A total of about 1600 lines of our code related to the feature.
* Increased load on the Tornado system, especially around a Zulip
  server restart, and especially for large installations like
  zulipchat.com, resulting in extra delay before messages can be sent
  again.

As detailed in
https://github.com/zulip/zulip/pull/12862#issuecomment-536152397, it
appears that removing WebSockets moderately increases the time it
takes for the `send_message` API query to return from the server, but
does not significantly change the time between when a message is sent
and when it is received by clients.  We don’t understand the reason
for that change (suggesting the possibility of a measurement error),
and even if it is a real change, we consider that potential small
latency regression to be acceptable.

If we later want WebSockets, we’ll likely want to just use Django
Channels.

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2020-01-14 22:34:00 -08:00
Anders Kaseorg
8d91bebf95 restart-server: Warn if the shell’s PWD goes through an updated symlink.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
2019-09-21 12:02:15 -07:00
Harshit Bansal
50ef91bb08 scripts: Add argparse option to restart-zerver for --fill-cache.
Nowm unless you specify `--fill-cache`, memcached caches will not be
pre-filled after a server restart. This will be helpful when someone
is in a hurry (e.g. if the server is down right now, or if he/she
testing a configuration change in a newly setup server), it's best to
just restart without pre-filling the cache.

Fixes: #10900.
2019-01-14 15:20:01 -08:00
Anders Kaseorg
a694c3cafd scripts/restart-server: Avoid shelling out for ln.
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
2018-11-28 17:26:54 -08:00
Tim Abbott
5a56925495 restart-server: Fix restarting server with multiple tornado processes.
Previously, we unconditionally tried to restart the Tornado process
name corresponding to the historically always-true case of a single
Tornado process.  This resulted in Tornado not being automatically
restarted on a production deployment on servers with more than one
Tornado process configured.
2018-11-27 17:20:05 -08:00
Tim Abbott
1a0e9fe2f9 restart-server: Restart tornado early.
This dramatically reduces the Tornado downtime when restarting a Zulip
server, which is generally the most significant source of user-facing
bad experiences.
2018-10-16 15:04:07 -07:00
Abhilash Verma
0e2322a322 logging: Show timestamp in UTC in non-django production scripts.
Done in pair programming with @aero31aero.

Fixes #9678.
2018-08-20 12:52:40 -07:00