The previous thresholds were set based on the assumption that 50% of
the memory on the host was taken up by Django and friends, leaving
only 50% for the database.
On separate database hosts, this leads to a too-small default value
for shared_buffers, which is generally suggested to be 25% of your
database-dedicated memory.
Check if `zulip::app_frontend_base` is part of the deploy, and use
that to decide whether or not to halve
`$zulip::common::total_memory_mb` when calculating PostgreSQL memory
settings. We also provide a `postgresql.memory` override, if
necessary.
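For reference, a sketch of such an override in `/etc/zulip/zulip.conf`
(assuming, as with `total_memory_mb`, that the value is in MB; the
number itself is purely illustrative):

    [postgresql]
    # Memory, in MB, to assume is dedicated to PostgreSQL when
    # computing settings such as shared_buffers, overriding the
    # autodetected (and possibly halved) value.
    memory = 24576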
Unilaterally adding the port to the `Host` header can cause CSRF
failures when the port is a default port, and thus optional. Switch
to providing the exact `Host` header that the original request
contained.
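In nginx terms, this amounts to passing the client's `Host` header
through verbatim rather than reconstructing it; roughly (a sketch, not
the exact generated configuration):

    # $http_host is the raw Host header from the client, including the
    # port only if the client actually sent one; $host strips the port.
    proxy_set_header Host $http_host;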
Some deployments choose to wrap Zulip's nginx in an outer proxy -- for
example, to do custom TLS termination. In such deployments, the outer
proxy routes to `127.0.0.1:80`; b4fb22ba1b broke these configurations
by switching the `127.0.0.1:80` listener to serve only
`/api/internal/` paths.
Switch to serving the whole application over `127.0.0.1:80`.
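Roughly, the resulting listener layout looks like this (an
illustrative sketch, not the generated configuration):

    server {
        listen 443 ssl http2;
        # Loopback listener, used both by outer proxies doing their own
        # TLS termination and for internal IPC; it serves the full
        # application, not just /api/internal/.
        listen 127.0.0.1:80;
        # ... application locations ...
    }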
This prevents a thundering herd for videos -- if a very large video is
posted to a channel with many active clients, all of them request it
simultaneously, in order to render the in-feed preview image.
While these requests come with a `Range` header which is intended to
limit the request to just the first couple MB, nginx ignores this
header when making its request to the upstream -- so it can obtain and
cache the whole file locally. This results in multiple competing
requests for the whole content from S3, all racing to store the
content in the cache.
Use cache slicing to split the content cache into chunks of 5MB; the
cache is filled one slice at a time, as needed based on the byte
ranges that clients request. Clients making requests without a
`Range` header are provided with the content transparently stitched
together from the individual slices.
The slice size of 5MB is chosen to encompass more than 95% of file
uploads (saving an extra trip to the origin), while also being large
enough to provide video thumbnails from a single slice, yet small
enough that a slice does not take too long to obtain from the
upstream.
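The mechanism is nginx's `slice` module; a minimal sketch of the
pattern (location, cache zone, and upstream here are illustrative, not
Zulip's actual configuration):

    location /internal/s3/ {
        slice              5m;
        # Ask the upstream only for the slice we need, and cache each
        # slice under its own key.
        proxy_set_header   Range $slice_range;
        proxy_http_version 1.1;
        proxy_cache        uploads_cache;
        proxy_cache_key    $uri$is_args$args$slice_range;
        # Slices come back as 206 Partial Content responses.
        proxy_cache_valid  200 206 1h;
        proxy_pass         https://s3.amazonaws.com;
    }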
The `$host` nginx variable is _not_ the unadulterated `Host`
header (which would be `$http_host`) -- it is that header, *without
the port*, with a fallback to the `server_name` which processed the
request.
This means that backend services are not aware of the port that the
request came in on, unless they derive that from reading
`nginx_listen_port` in `/etc/zulip/zulip.conf`, or similar.
Specifically, on deploys with a non-standard `nginx_listen_port`, this
caused `tusd` to generate a `Location` header which left off the port;
the follow-up request for metadata about the just-uploaded file was
thus treated as cross-origin, and its CORS check failed.
Add the port to the `Host` header we pass to `tusd` and other backend
services.
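One way to express this in nginx (a sketch; the actual generated
configuration may differ):

    # Pass the port along with the hostname, so backends such as tusd
    # can build absolute URLs (e.g. the Location header) with the
    # correct origin, even on a non-standard nginx_listen_port.
    proxy_set_header Host $host:$server_port;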
This makes the `localhost.d` directory less of a lie, and decreases
the chances that local reconfigurations will break the `127.0.0.1:80`
server which is used for IPC.
In cases where `nginx_http_only` is enabled, we respect
`nginx_listen_port`, so as not to attempt to bind to port 80 if the
administrator was explicitly trying to avoid that.
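For reference, a sketch of the relevant settings in
`/etc/zulip/zulip.conf` (assuming they correspond to keys under
`[application_server]`; the port is illustrative):

    [application_server]
    # Serve plain HTTP only, e.g. behind an external TLS-terminating proxy.
    http_only = true
    # Bind nginx to this port instead of 80.
    nginx_listen_port = 8080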
This provides access logging metrics to Prometheus. For cardinality
reasons, we cannot (nor would we want to) put every request path into
its own label value -- but we do separate out the most-frequent access
paths (as well as some low-frequency but high-interest ones) into
their own label values.
In order to differentiate accesses to https://zulip.com/ from
https://example.zulipchat.com/ (both of which appear at path `/`), we
use a `grok_exporter.realm_names_regex` value in `zulip.conf`, which
is expected to be set to match the hostnames of all possible realms.
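A sketch of what this might look like in `/etc/zulip/zulip.conf` (the
regular expression is purely illustrative):

    [grok_exporter]
    # Hostnames which should be broken out as realms when labeling
    # access-log metrics.
    realm_names_regex = [a-z0-9-]+\.zulipchat\.com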
This generalizes from `thumbnail_workers` to include any other queue.
However, we only additionally document `email_senders_workers`, since
other queues are not guaranteed to work correctly with multiple
consumers.
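For example, a sketch of configuring multiple consumers in
`/etc/zulip/zulip.conf` (assuming these settings live under
`[application_server]`; the counts are illustrative):

    [application_server]
    # Number of consumers for the thumbnailing and email-sending queues.
    thumbnail_workers = 2
    email_senders_workers = 2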
This is already installed on a lot of systems, and is used indirectly
when upgrading Zulip from Git.
We previously removed this in 263212decf, I believe due to the
incorrect understanding that only `makemessages` needed it.
`deliver_scheduled_emails` tries to deliver the email synchronously,
and if it fails, it retries after 10 seconds. Since it does not track
retries, and always tries the earliest-scheduled-but-due message
first, the worker will not make forward progress if there is a
persistent failure with that message, and will retry indefinitely.
This can result in excessive network or email delivery charges from
the remote SMTP server.
Switch to delivering emails via a new queue worker. The
`deliver_scheduled_emails` job now serves only to pull deferred jobs
out of the table once they are due, insert them into RabbitMQ, and
then delete them. This limits the potential for head-of-queue
failures to failures inserting into RabbitMQ, which is a more
reasonable failure mode than talking to a complex external system we
do not control.
Retries and any connections to the SMTP server are left to the
RabbitMQ consumer.
We build a new RabbitMQ queue, rather than use the existing
`email_senders` queue, because that queue is expected to be reasonably
low-latency, for things like missed message notifications. The
`send_future_email` codepath which inserts into ScheduledEmails is
also (ab)used for digest emails, which are extremely bursty in their
frequency -- and a large burst could significantly delay emails behind
it in the queue.
The new queue is explicitly only for messages which were not initiated
by user actions (e.g., invitation reminders, digests, new-account
follow-ups), and which are thus not latency-sensitive.
Fixes: #32463.
In some cases, it is not possible to configure the load-balancer to
add an `X-Forwarded-Proto` header. If Zulip is serving its traffic
over HTTP, it will rightly error out, since it cannot guarantee that
its response will be served over an encrypted connection.
Add a new `loadbalancer.rejects_http_requests` setting, which serves
as a way for the operator to swear that the load-balancer will *never*
serve responses from Zulip over an unencrypted connection. In most
cases, this is because the load-balancer is configured to have port 80
always serve an HTTP 301 redirect to the same URL over HTTPS.
Properly configuring the proxy to send `X-Forwarded-Proto` is always a
better solution than using this configuration parameter, so use of
this should be viewed as a last resort.
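A sketch of the new setting in `/etc/zulip/zulip.conf`, shown
alongside the existing `ips` setting for context (the address is
illustrative):

    [loadbalancer]
    # Reverse proxies which are trusted to set X-Forwarded-* headers.
    ips = 192.168.0.100
    # The operator asserts that the load-balancer never serves Zulip
    # responses over unencrypted HTTP (e.g. port 80 is always a 301
    # redirect to HTTPS), so a missing X-Forwarded-Proto is tolerated.
    rejects_http_requests = true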
Zulip instances without a database included, like the Docker image,
would fail to use TLS properly, since `TLS_REQCERT` was not set in
`/etc/ldap/ldap.conf`. While there are a few other ways we could fix
this, just installing libldap-common on app frontend instances seems
like a good solution: it has no impact on other Zulip systems, and the
package was already being installed indirectly, as a "Recommends"-tier
apt dependency of the PostgreSQL server package.
Fixes zulip/docker-zulip#454.
The nginx-to-uwsgi timeout defaults to 60s, which is exactly the same
as the current "harakiri" timeout configured in uwsgi (which limits
how long a request can run before the worker is terminated). This
causes a race: if nginx hits its 60s before uwsgi does, we return a
504; otherwise, we get a 502.
Make the nginx-to-uwsgi timeout explicit, and shorten the "harakiri"
timeout to be explicitly less than that. Document the 60s timeout,
which all outer reverse proxies must be set to _longer than_ in order
to have proper "onion" timeouts.
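Concretely, the relationship between the timeouts looks something like
this (values and placement are illustrative; the actual configuration
may differ):

    # nginx (uwsgi proxy): wait up to 60s for uwsgi before returning a
    # 504 to the client. Any outer reverse proxy should be configured
    # with a timeout longer than this.
    uwsgi_read_timeout 60s;

    # uwsgi.ini: terminate a stuck worker slightly *before* nginx gives
    # up, so the failure mode is deterministic rather than a race
    # between a 502 and a 504.
    harakiri = 55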