Alex Vandiver
120de1dca9
zephyr: Write out unix timestamp in check, as check_cron_file expects.
...
A follow-up fix to 8bc26aab08.
2022-06-30 11:12:26 -07:00
Alex Vandiver
8577adcf2e
cron: Remove unused STATE_FILE environment variable.
2022-06-22 12:07:38 -07:00
Alex Vandiver
8bc26aab08
nagios: Switch check_user_zephyr_mirror_liveness to run via cron.
...
This check loads Django, and as such must be run as the zulip user.
Repeat the same pattern used elsewhere in nagios, of writing a state
file, which is read by `check_cron_file`.
2022-06-22 12:07:38 -07:00
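The state-file pattern described in 8bc26aab08 and 120de1dca9 can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the `timestamp|status|message` line format, the function names `write_state` and `read_state`, and the 180-second staleness limit are all assumptions made for the example. The only details taken from the commit messages are that the cron job writes a state file, that it includes a Unix timestamp, and that a separate check reads it.

```python
import os
import time

# Hypothetical staleness limit; the real check's threshold is not given in the log.
STALE_AFTER_SECONDS = 180


def write_state(path, status, message, now=None):
    """Cron side: record a Unix timestamp, a Nagios status code, and a message."""
    now = int(time.time()) if now is None else int(now)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        # Assumed format for this sketch: "<unix timestamp>|<status>|<message>"
        f.write(f"{now}|{status}|{message}\n")
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)


def read_state(path, now=None, max_age=STALE_AFTER_SECONDS):
    """Nagios side: return (status, message); 3 (UNKNOWN) if missing or stale."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            timestamp, status, message = f.read().rstrip("\n").split("|", 2)
        timestamp = int(timestamp)
        status = int(status)
    except (OSError, ValueError):
        return 3, "UNKNOWN: could not read state file"
    if now - timestamp > max_age:
        return 3, "UNKNOWN: state file is stale"
    return status, message
```

Splitting the work this way lets the part that needs Django run as the `zulip` user from cron, while the Nagios-side reader only needs read access to the state file.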
Alex Vandiver
41deef40cf
nagios: Switch to generic check_cron_file for queues and consumers.
...
These share a common root; 91da4bd59b duplicated the code, but
didn't move the existing uses to the new utility.
2022-06-22 12:07:38 -07:00
Alex Vandiver
8fbde9b8c5
nagios: Only run check_fts_update_log on one PostgreSQL host.
...
The data is the same in the table in all replicas -- there is no need
to alert on all of them.
2022-06-22 12:07:38 -07:00
Alex Vandiver
499284d2fd
nagios: Split postgresql into primary and replica.
...
Replication checks should only run on primary and replicas, not
standalone hosts; while `autovac_freeze` currently only runs on
primary hosts, it functions identically on replicas, and is fine to
run there.
Make `autovac_freeze` run on all `postgresql` hosts, and make
standalone hosts no longer `postgres_primary`, so they do not fail the
replication tests.
2022-06-22 12:07:38 -07:00
Alex Vandiver
38e435347b
nagios: Add missing queue consumer checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver
e01a4242aa
nagios: Sort queue consumer checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver
2c90c7a010
nagios: Switch check_remote_arg_string queue checks to consumer checks.
...
This style of check just looks for matching process names using
`check_remote_arg_string`, which dates to 8edbd64bb8. These were
added because the original two (`missedmessage_emails` and
`slow_queries`) did not create consumers, instead polling for events.
Switch these to checking the queue consumer counts that the
`check-rabbitmq-consumers` check is already writing out. Since the
`missedmessage_emails` queue was _already_ checked via the consumer check, a
duplicate is not added.
2022-06-22 12:07:38 -07:00
Alex Vandiver
f48d543d9b
nagios: Make and use a "rabbitmq-consumer-service" template service.
2022-06-22 12:07:38 -07:00
Alex Vandiver
775a084d0f
nagios: Add a catchall "other" set.
2022-06-22 12:07:38 -07:00
Alex Vandiver
83c82c8e15
nagios: Adjust load alerting by hostgroup.
...
Even the `pageable_servers` group did not page for high load -- in
part because what was "high" depends on the servers. Set slightly
better limits based on server role.
2022-06-22 12:07:38 -07:00
Alex Vandiver
2a14aa5180
nagios: Add a fullstack hostgroup.
...
This will be used to apply checks only to czo.
2022-06-22 12:07:38 -07:00
Alex Vandiver
b5ecfc327f
nagios: Remove unnecessary web hostgroup.
...
This had identical membership to `frontends`.
2022-06-22 12:07:38 -07:00
Alex Vandiver
4be9025212
nagios: Remove redundant postgresql hostgroup.
...
This is implied by `postgresql_primary`.
2022-06-22 12:07:38 -07:00
Alex Vandiver
d9d0014fb4
nagios: Rename zmirror_main into zmirror hostgroup.
...
`zmirror` itself was `zmirror_main` + `zmirrorp` but was unused; we
consistently just use the term `zmirror` for the non-personals server,
so use it as the hostgroup name.
2022-06-22 12:07:38 -07:00
Alex Vandiver
70c36985b4
nagios: Remove frontends from redis group.
...
The Redis nagios checks themselves are done against `redis` +
`frontends` groups, so there is no need to misleadingly place
`frontends` in the `redis` hostgroup.
2022-06-22 12:07:38 -07:00
Alex Vandiver
08127086bc
nagios: Remove misleading "staging_frontends" from standalone.
...
No services are tested for the `staging_frontends` hostgroup, so this
does not alter the checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver
d804de871d
nagios: Move staging and prod hostgroups adjacent.
2022-06-22 12:07:38 -07:00
Alex Vandiver
4c17f2bccc
nagios: The frontends hostgroup now includes prod and staging frontends.
...
This lets the config file remove some repetition.
2022-06-22 12:07:38 -07:00
Alex Vandiver
1e81775fa0
nagios: Drop unhelpful hostgroup comment.
2022-06-22 12:07:38 -07:00
Alex Vandiver
7b584401ac
nagios: Reformat hostgroups.
2022-06-22 12:07:38 -07:00
Alex Vandiver
93bcb86345
nagios: Reorder service checks.
2022-06-22 12:07:38 -07:00
Alex Vandiver
eaaa2fbff8
nagios: Use canonical "hostgroup_name" consistently.
2022-06-22 12:07:38 -07:00
Alex Vandiver
e8996b53a5
nagios: Remove unused has_swap hostgroup.
2022-06-22 12:07:38 -07:00
Alex Vandiver
33472ee9ff
nagios: Remove unused stats host set.
2022-06-22 12:07:38 -07:00
Alex Vandiver
bc4f4b4862
nagios: Make the pageable/not/flaky tri-state clearer.
2022-06-22 12:07:38 -07:00
Alex Vandiver
c74f195fba
nagios: Split AWS and non-AWS hosts, for ntp checks.
...
The non-AWS hosts cannot use the AWS ntp server for their check.
2022-06-22 12:07:38 -07:00
Alex Vandiver
872efdee58
nagios: Fold single- and multitornado_frontends back into frontends.
...
5abf4dee92 made this distinction, then multitornado_frontends was
never used; the singletornado_frontends alerting worked even for the
multiple-Tornado instances.
Remove the useless and misleading distinction.
2022-06-22 12:07:38 -07:00
Alex Vandiver
3741c1c034
puppet: Switch to checking time against the AWS timeserver.
...
Since this is what chrony is sync'ing to, it lessens the chance of
spurious firings of this alert.
See https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/
2022-05-31 22:57:32 -07:00
Alex Vandiver
7f6a77da31
puppet: Add a redis exporter.
2022-05-03 17:13:44 -07:00
Anders Kaseorg
e9ba9b0e0d
zulip-ec2-configure-interfaces: Remove.
...
Our current EC2 systems don’t have an interface named ‘eth0’, and if
they did, this script would do nothing but crash with ImportError
because we have never installed boto.utils for Python 3.
(The message of commit 2a4d851a7c made
an effort to document for future researchers why this script should
not have been blindly converted to Python 3. However, commit
2dc6d09c2a (#14278) was evidently
unresearched and untested.)
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-05-03 02:25:59 -07:00
Anders Kaseorg
646a4d19a3
puppet: Remove quotes for enumerable values.
...
https://puppet.com/docs/puppet/7/style_guide.html#style_guide_module_design-quoting
“If a string is a value from an enumerable set of options, such as
present and absent, it SHOULD NOT be enclosed in quotes at all.”
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-04-29 22:06:46 -07:00
Alex Vandiver
35db1ee435
puppet: Only include "app_service" section if there are apps.
...
This works around gravitational/teleport#12256, but also produces config
files that are slightly cleaner.
2022-04-26 16:36:13 -07:00
Alex Vandiver
f6d27562fa
puppet: Configure chrony to use AWS-local NTP sources.
...
This prevents hosts from spewing traffic to random hosts across the
Internet.
2022-03-25 17:07:53 -07:00
Alex Vandiver
1bd5723cd2
puppet: Add a prometheus monitor for tornado processes.
2022-03-20 16:12:11 -07:00
Alex Vandiver
6b91652d9a
puppet: Open the grok_exporter port.
...
The complete grok_exporter configuration is not ready to be committed,
but this at least prepares the way for it.
2022-03-20 16:12:11 -07:00
Alex Vandiver
6558655fc6
puppet: Add rabbitmq prometheus plugin, and open the firewall.
2022-03-20 16:12:11 -07:00
Alex Vandiver
bdd2f35d05
puppet: Switch czo to using zulip_ops::app_frontend_monitoring.
...
This was clearly intended in f61ac4a28d but never executed.
2022-03-20 16:12:11 -07:00
Alex Vandiver
17699bea44
puppet: postgresql_backups is auto-included if s3_backups_bucket is set.
...
Since 6496d43148.
2022-03-20 16:12:11 -07:00
Alex Vandiver
bedc7c2986
puppet: Smokescreen is now auto-included in standalone.
...
Since c33562f0a8.
2022-03-20 16:12:11 -07:00
Anders Kaseorg
b3260bd610
docs: Use Debian and Ubuntu version numbers over development codenames.
...
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-23 12:04:24 -08:00
Alex Vandiver
788daa953b
puppet: Factor out $::architecture case statement for golang.
2022-02-15 12:04:37 -08:00
Anders Kaseorg
f6a701090c
setup-apt-repos: Don’t install lsb_release.
...
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2022-02-14 16:38:53 -08:00
Alex Vandiver
e032b38661
puppet: Fix typo in uwsgi exporter dependency.
2022-02-08 15:17:17 -08:00
Alex Vandiver
3bbe5c1110
puppet: Put comments on iptables lines.
...
In addition to documenting the rules.v4 and rules.v6 files slightly,
these comments show up in `iptables -L`:
```
root@hostname:~# iptables -L INPUT
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere
LOGDROP all -- anywhere localhost/8
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh /* ssh */
ACCEPT tcp -- anywhere anywhere tcp dpt:3000 /* grafana */
ACCEPT tcp -- anywhere anywhere tcp dpt:9100 /* node_exporter */
LOGDROP all -- anywhere anywhere
```
2022-01-21 16:46:14 -08:00
Alex Vandiver
6bc5849ea8
puppet: Remove now-unused debathena apt repository.
2022-01-18 14:13:28 -08:00
Alex Vandiver
b3f07cc98d
puppet: Replace debathena zephyr package with equivalent puppet file.
2022-01-18 14:13:28 -08:00
Alex Vandiver
a6d7539571
puppet: Replace debathena krb5 package with equivalent puppet file.
2022-01-18 14:13:28 -08:00
Alex Vandiver
75224ea5de
puppet: python-dev is now purely virtual; install python2.7-dev.
2022-01-18 14:13:28 -08:00