mirror of
https://github.com/zulip/zulip.git
synced 2025-11-15 19:31:58 +00:00
docs: Clean up troubleshooting guide.
This article is definitely still below our polish goals, but this is also definitely an improvement.
@@ -109,15 +109,6 @@ If one of these services is not installed or functioning correctly,
 Zulip will not work. Below we detail some common configuration
 problems and how to resolve them:
 
-* An AMQPConnectionError traceback or error running rabbitmqctl
-usually means that RabbitMQ is not running; to fix this, try:
-```
-service rabbitmq-server restart
-```
-If RabbitMQ fails to start, the problem is often that you are using
-a virtual machine with broken DNS configuration; you can often
-correct this by configuring `/etc/hosts` properly.
-
 * If your browser reports no webserver is running, that is likely
 because nginx is not configured properly and thus failed to start.
 nginx will fail to start if you configured SSL incorrectly or did
@@ -138,6 +129,15 @@ problems and how to resolve them:
 attempt. For more on this issue, see the [Django release notes on Host header
 poisoning](https://www.djangoproject.com/weblog/2013/feb/19/security/#s-issue-host-header-poisoning)
 
+* An AMQPConnectionError traceback or error running rabbitmqctl
+usually means that RabbitMQ is not running; to fix this, try:
+```
+service rabbitmq-server restart
+```
+If RabbitMQ fails to start, the problem is often that you are using
+a virtual machine with broken DNS configuration; you can often
+correct this by configuring `/etc/hosts` properly.
+
 ### Disabling unattended upgrades
 
 ```eval_rst
@@ -149,31 +149,34 @@ problems and how to resolve them:
 server will have its postgres server upgraded (and thus restarted).
 ```
 
-When one of the services Zulip depends on (postgres, memcached, redis,
-rabbitmq) is restarted, that services will disconnect everything using
-them (like the Zulip server), and every operation that Zulip does
-which uses that service will throw an exception (and send you an error
-report email).
+Restarting one of the system services that Zulip uses (`postgres`,
+`memcached`, `redis`, or `rabbitmq`) will drop the connections that
+Zulip processes have to the service, resulting in future operations on
+those connections throwing errors.
 
-Zulip is designed to recover from service outages like this by
-re-initializing its connection to the service in question. However,
-some of Zulip's queue processors can be idle for hours or days on a
-low-traffic server, and a given queue processor won't re-initialize
-its connection until that process gets an error. This means that
-after e.g. `postgres` is restarted by unattended-upgrades, you're
-likely to get a series of ~20 error emails spread over the next few
-hours about the issue as each Zulip process tries to access the
-database, fails, sends an error email, and then reconnects.
+Zulip is designed to recover from system service downtime by creating
+new connections once the system service is back up, so the Zulip
+outage will end once the system service finishes restarting. But
+you'll get a bunch of error emails during the system service outage
+whenever one of the Zulip server's ~20 workers attempts to access the
+system service.
 
-These apparently "random errors" can be confusing and might cause you
-to worry incorrectly about the stability of the Zulip software, which
-in fact the problem is that Ubuntu automatically upgraded and then
-restarted key Zulip dependencies, without anyone restarting Zulip's
-owns services.
+An unplanned outage will also result in an annoying (and potentially
+confusing) trickle of error emails over the following hours or days.
+These emails happen because a worker only learns its connection was
+dropped when it next tries to access the connection (at which point
+it'll send an error email and make a new connection), and several
+workers are commonly idle for periods of hours or days at a time.
 
-Instead, we recommend installing updates for these services manually,
-and then restarting the Zulip server with
-`/home/zulip/deployments/current/scripts/restart-server` afterwards.
+You can prevent this trickle when doing a planned upgrade by
+restarting the Zulip server with
+`/home/zulip/deployments/current/scripts/restart-server` after
+installing system package updates to `postgres`, `memcached`,
+`rabbitmq`, or `redis`.
 
+Few system administrators enjoy outages at random times (even if only
+brief) or the resulting distribution of error emails, which is why we
+recommend disabling `unattended-upgrades`.
+
 ## Monitoring
 
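Review note on the hunk above: the "trickle of error emails" it describes follows from workers noticing a dropped connection only on their next use of it. The toy model below is a minimal sketch of that behaviour, not Zulip's actual worker code; `FakeService`, `Worker`, and `simulate` are hypothetical names invented for illustration.

```python
class ServiceDown(Exception):
    """Raised when a worker uses a connection the service has dropped."""


class FakeService:
    """Hypothetical stand-in for postgres, rabbitmq, etc."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        # A restart invalidates every connection opened before it.
        self.generation += 1

    def connect(self):
        # A "connection" is just the generation it was opened in.
        return self.generation

    def query(self, conn):
        if conn != self.generation:
            raise ServiceDown("connection dropped by a service restart")
        return "ok"


class Worker:
    """Hypothetical queue worker holding one long-lived connection."""

    def __init__(self, name, service):
        self.name = name
        self.service = service
        self.conn = service.connect()

    def do_work(self, error_emails):
        try:
            return self.service.query(self.conn)
        except ServiceDown as exc:
            # Only now does the worker learn about the restart: it
            # reports the error, then reconnects so later calls succeed.
            error_emails.append(f"{self.name}: {exc}")
            self.conn = self.service.connect()
            return None


def simulate(num_workers=20, restart_workers=False):
    """Return (emails after first use, emails after second use)."""
    service = FakeService()
    workers = [Worker(f"worker-{i}", service) for i in range(num_workers)]
    service.restart()  # e.g. a package upgrade restarts the service
    if restart_workers:
        # Models restarting the Zulip server: fresh connections everywhere.
        workers = [Worker(w.name, service) for w in workers]
    emails = []
    for w in workers:  # idle workers wake up one by one
        w.do_work(emails)
    first = len(emails)
    for w in workers:  # by now every worker has a valid connection
        w.do_work(emails)
    return first, len(emails)
```

In this model `simulate(20)` yields one error per worker on first use and none afterwards, while `simulate(20, restart_workers=True)` yields none at all, which mirrors the advice to run `restart-server` after a planned upgrade.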
@@ -192,7 +195,8 @@ standard stuff:
 especially for the database and where uploads are stored.
 * Service uptime and standard monitoring for the [services Zulip
 depends on](#troubleshooting-services). Most monitoring software
-has standard plugins for `nginx`, `postgres`.
+has standard plugins for `nginx`, `postgres`, `redis`, `rabbitmq`,
+and `memcached`, and those will work well with Zulip.
 * `supervisorctl status` showing all services `RUNNING`.
 * Checking for processes being OOM killed.
 
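Review note on the `supervisorctl status` item in the hunk above: one way to turn that check into something scriptable is to scan the command's output for any process whose state column is not `RUNNING`. This is a minimal sketch that assumes the usual `name STATE description` column layout; the `not_running` helper and the process names in `SAMPLE` are illustrative, not taken from the commit.

```python
def not_running(status_output):
    """Return (name, state) pairs for processes whose state is not RUNNING."""
    problems = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        if state != "RUNNING":
            problems.append((name, state))
    return problems


# Illustrative output only; real process names depend on the installation.
SAMPLE = """\
example-django                RUNNING   pid 1201, uptime 3 days, 2:14:07
example-tornado               RUNNING   pid 1202, uptime 3 days, 2:14:07
example-workers:some_queue    FATAL     Exited too quickly (process log may have details)
"""

if __name__ == "__main__":
    for name, state in not_running(SAMPLE):
        print(f"{name} is {state}")
```

A monitoring script could feed the real command output into `not_running` and alert whenever the returned list is non-empty.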
@@ -212,46 +216,35 @@ tarballs).
 The Nagios plugins used by that configuration are installed
 automatically by the Zulip installation process in subdirectories
 under `/usr/lib/nagios/plugins/`. The following is a summary of the
-various Nagios plugins included with Zulip and what they check:
+useful Nagios plugins included with Zulip and what they check:
 
 Application server and queue worker monitoring:
 
-* `check_send_receive_time` (sends a test message through the system
-between two bot users to check that end-to-end message sending works)
-
-* `check_rabbitmq_consumers` and `check_rabbitmq_queues` (checks for
-rabbitmq being down or the queue workers being behind)
-
-* `check_queue_worker_errors` (checks for errors reported by the queue
-workers)
-
-* `check_worker_memory` (monitors for memory leaks in queue workers)
+* `check_send_receive_time`: Sends a test message through the system
+between two bot users to check that end-to-end message sending
+works. An effective end-to-end check for Zulip's Django and Tornado
+systems being healthy.
+* `check_rabbitmq_consumers` and `check_rabbitmq_queues`: Effective
+checks for Zulip's RabbitMQ-based queuing systems being healthy.
+* `check_worker_memory`: Monitors for memory leaks in queue workers.
+* `check_email_deliverer_backlog` and `check_email_deliverer_process`:
+Monitors for whether scheduled outgoing emails (e.g. invitation
+reminders) are being sent properly.
 
-* `check_email_deliverer_backlog` and `check_email_deliverer_process`
-(monitors for whether scheduled outgoing emails are being sent)
-
 Database monitoring:
 
-* `check_postgres_replication_lag` (checks streaming replication is up
-to date).
-
-* `check_postgres` (checks the health of the postgres database)
-
-* `check_postgres_backup` (checks backups are up to date; see above)
+* `check_fts_update_log`: Checks whether full-text search updates are
+being processed properly or getting backlogged.
+* `check_postgres`: General checks for database health.
+* `check_postgres_backup`: Checks status of postgres backups.
+* `check_postgres_replication_lag`: Checks whether postgres streaming
+replication is up to date.
 
-* `check_fts_update_log` (monitors for whether full-text search updates
-are being processed)
-
 Standard server monitoring:
 
-* `check_website_response.sh` (standard HTTP check)
-
-* `check_debian_packages` (checks apt repository is up to date)
+* `check_website_response.sh`: Basic HTTP check.
+* `check_debian_packages`: Checks whether the system is behind on `apt
+upgrade`.
 
-**Note**: While most commands require no special permissions,
-`check_email_deliverer_backlog`, requires the `nagios` user to be in
-the `zulip` group, in order to access `SECRET_KEY` and thus run
-Zulip management commands.
-
 If you're using these plugins, bug reports and pull requests to make
 it easier to monitor Zulip and maintain it in production are
@@ -259,7 +252,7 @@ encouraged!
 
 ## Memory leak mitigation
 
-As a measure to mitigate the impact of potential memory leaks in one
-of the Zulip daemons, the service automatically restarts itself
-every Sunday early morning. See `/etc/cron.d/restart-zulip` for the
-precise configuration.
+As a measure to mitigate the potential impact of any future memory
+leak bugs in one of the Zulip daemons, Zulip service automatically
+restarts itself every Sunday early morning. See
+`/etc/cron.d/restart-zulip` for the precise configuration.