docs: Add a performance design document.
This document is mainly an answer to a set of questions other developers have been asking about Zulip's architecture and scalability design. It's intended for developers working on Zulip, to help with thinking about where to prioritize further efforts to optimize scalability.
@@ -216,6 +216,10 @@ Scalability is an area of active development, so if you're unsure
 whether Zulip is a fit for your organization or need further advice
 [contact Zulip support][contact-support].
 
+For readers interested in technical details around what features
+impact Zulip's scalability, this [performance and scalability design
+document](../subsystems/performance.md) may also be of interest.
+
 [s3-uploads]: ../production/upload-backends.html#s3-backend-configuration
 [streaming-replication]: ../production/export-and-import.html#postgresql-streaming-replication
 [contact-support]: https://zulip.com/help/contact-support
@@ -15,6 +15,7 @@ Subsystems documentation
 pointer
 markdown
 caching
+performance
 realms
 management-commands
 schema-migrations
351 docs/subsystems/performance.md Normal file

@@ -0,0 +1,351 @@
# Performance and scalability

This page aims to give some background to help prioritize work on the
Zulip server's performance and scalability. By scalability, we mean
the ability of the Zulip server on given hardware to handle a certain
workload of usage without performance materially degrading.

First, a few notes on philosophy.
* We consider it an important technical goal for Zulip to be fast,
  because that's an important part of the user experience for a
  real-time collaboration tool like Zulip. Many UI features in the
  Zulip webapp are designed to load instantly, because all the data
  required for them is present in the initial HTTP response, and both
  the Zulip API and webapp are architected around this strategy.
* The Zulip database model and server implementation are carefully
  designed to ensure that every common operation is efficient, with
  automated tests designed to prevent the accidental introduction of
  inefficient or excessive database queries. We much prefer doing
  design/implementation work to make requests fast over the
  operational work of running 2-5x as much hardware to handle the same
  load.

See also [scalability for production users](../production/requirements.html#scalability).
## Load profiles

When thinking about scalability and performance, it's important to
understand the load profiles for production uses.

Zulip servers typically involve a mixture of two very different types
of load profiles:
* Open communities like open source projects, online classes, etc.,
  with large numbers of users, many of whom are idle. (Many of the
  others likely stopped by to ask a question, got it answered, and
  then didn't need the community again for the next year.) Our own
  [chat.zulip.org](../contributing/chat-zulip-org.md) is a good
  example of this, with more than 15K total user accounts, of which
  only several hundred have logged in during the last few weeks.
  Zulip has many important optimizations, including [soft
  deactivation](../subsystems/sending-messages.html#soft-deactivation),
  to ensure idle users have minimal impact on both server-side
  scalability and request latency.
* Full-time teams, like your typical corporate Zulip installation,
  where most users are active for multiple hours a day and each sends
  a high volume of messages. This load profile is most important for
  self-hosted servers, since many of those are used exclusively by
  the employees of the organization running the server.

The zulip.com load profile is effectively the sum of thousands of
organizations from each of those two load profiles.
## Major Zulip endpoints

It's important to understand that Zulip has a handful of endpoints
that result in the vast majority of all server load, and essentially
every other endpoint is not important for scalability. We still put
effort into making sure those other endpoints are fast for latency
reasons, but were they to be 10x faster (a huge optimization!), it
wouldn't materially improve Zulip's scalability.

For that reason, we organize this discussion of Zulip's scalability
around the several specific endpoints whose combination of request
volume and per-request cost makes them important to Zulip's
scalability.
That said, it is important to distinguish the load associated with an
API endpoint from the load associated with a feature. Almost any
significant new feature is likely to result in its data being sent to
the client in `page_params` or `GET /messages`, i.e. one of the
endpoints important to scalability here. As a result, it is important
to thoughtfully implement the data fetch code path for every feature.

Further, a snappy user interface is one of Zulip's design goals, so
we care about the performance of any user-facing code path, even
though many of them are not material to the scalability of the
server. But only the requests detailed below are worth optimizing at
any cost in code readability just to save a few milliseconds that
would be invisible to the end user.
In Zulip's documentation, our general rule is to primarily write
facts that are likely to remain true for a long time. While the
numbers in this article will surely shift with time and hardware, we
expect the rough sense of them (as well as the list of important
endpoints) to remain constant for the foreseeable future.

As a spoiler, there are two categories of endpoints that are
important for scalability: those with extremely high request volumes,
and those with moderately high request volumes that are also
expensive. We first discuss the two endpoints in the first category,
and then proceed to discuss the rest.
### Tornado

Zulip's Tornado-based [real-time push
system](../subsystems/events-system.md), and in particular `GET
/events`, accounts for something like 50% of all HTTP requests to a
production Zulip server. Despite `GET /events` being extremely
high-volume, the typical request takes 1-3ms to process, and doesn't
use the database at all (though it will access `memcached` and
`redis`), so these requests aren't a huge contributor to the overall
CPU usage of the server, despite their high volume.
Because these requests are so efficient from a total CPU usage
perspective, Tornado is significantly less important than other
services like presence for the overall CPU usage of a Zulip
installation.
It's worth noting that most (~80%) Tornado requests end the
longpolling via a `heartbeat` event, which is issued to idle
connections after about a minute. These `heartbeat` events are
useless aside from avoiding problems with networks/proxies/NATs that
are configured poorly and might kill HTTP connections that have been
idle for a minute. It's likely that with some strategy for detecting
such situations, we could reduce their volume (and thus overall
Tornado load) dramatically.
Currently, Tornado is sharded by realm, which is sufficient for
arbitrary scaling of the number of organizations on a multi-tenant
system like zulip.com. With a somewhat straightforward set of work,
one could change this to sharding by `user_id` instead, which will
eventually be important for individual large organizations with many
thousands of concurrent users.
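To make the tradeoff concrete, here is a minimal sketch of the two
sharding strategies (the functions and modulo routing are
illustrative assumptions, not Zulip's actual implementation, which
lives in its server configuration):

```python
# Hypothetical sketch of routing a connection to a Tornado shard.

def shard_by_realm(realm_id: int, num_shards: int) -> int:
    # Every user in a realm lands on the same shard, so one very
    # large realm cannot be spread across multiple processes.
    return realm_id % num_shards

def shard_by_user(user_id: int, num_shards: int) -> int:
    # Users of a single large realm are spread evenly across shards,
    # at the cost of fanning each realm's events out to every shard
    # hosting one of its users.
    return user_id % num_shards
```

With realm-based sharding, adding organizations scales cleanly; the
user-based variant is what would help a single organization with many
thousands of concurrent users.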
### Presence

`POST /users/me/presence` requests, which submit the current user's
presence information and return the information for all other active
users in the organization, account for about 36% of all HTTP requests
on production Zulip servers. See
[presence](../subsystems/presence.md) for details on this system and
how it's optimized. For this article, it's important to know that
presence is one of the most important scalability concerns for any
chat system, because it cannot be cached long, and is structurally a
quadratic problem.
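The quadratic structure is easy to see with a back-of-the-envelope
model (the polling rate and counts here are illustrative, not
measured):

```python
def presence_entries_per_minute(active_users: int,
                                polls_per_minute: float = 1.0) -> float:
    # Each active user polls roughly once a minute, and each poll
    # returns one presence entry per active user in the organization,
    # so the total work grows with the square of the active user count.
    return active_users * polls_per_minute * active_users
```

Doubling the number of active users quadruples the presence work,
which is why presence dominates steady-state load as organizations
grow.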
Because typical presence requests consume 10-50ms of server-side
processing time (to fetch and send back live data on all other active
users in the organization), and come at such a high volume, presence
is the single most important source of steady-state load for a Zulip
server. This is true for most other chat server implementations as
well.
There is an ongoing [effort to rewrite the data model for
presence](https://github.com/zulip/zulip/pull/16381) that we expect
to result in a substantial improvement in the per-request and thus
total load resulting from presence requests.
### Fetching page_params

The request to generate the `page_params` portion of `GET /`
(equivalent to the response from [GET
/api/v1/register](https://zulip.com/api/register-queue) used by
mobile/terminal apps) is one of Zulip's most complex and expensive
requests.
Zulip is somewhat unusual among webapps in sending essentially all of
the data required for the entire Zulip webapp in this single request,
which is part of why the Zulip webapp loads very quickly -- one only
needs one round trip aside from cacheable assets (avatars, images,
JS, CSS). Data on other users in the organization, streams, supported
emoji, custom profile fields, etc., is all included. The nice thing
about this model is that essentially every UI element in the Zulip
client can be rendered immediately without paying latency to the
server; this is critical to Zulip feeling performant even for users
who have a lot of latency to the server.

There are only a few exceptions where we fetch data in a separate
AJAX request after page load:
* Message history is managed separately; this is why the Zulip webapp
  will first render the entire site except for the middle panel, and
  then a moment later render the middle panel (showing the message
  history).
* A few very rarely accessed data sets like [message edit
  history](https://zulip.com/help/view-a-messages-edit-history) are
  only fetched on demand.
* A few data sets that are only required for administrative settings
  pages are fetched only when loading those parts of the UI.
Requests to `GET /` and `/api/v1/register` that fetch `page_params`
are pretty rare -- something like 0.3% of total requests -- but are
important for scalability because (1) they are the most expensive
read requests the Zulip API supports and (2) they can come in a
thundering herd around server restarts (as discussed in [fetching
message history](#fetching-message-history)).
The cost of fetching `page_params` varies dramatically based
primarily on the organization's size, ranging from 90ms-300ms for a
typical organization to potentially multiple seconds for large open
organizations with 10,000s of users. There is also smaller
variability based on an individual user's personal data state,
primarily in that having 10,000s of unread messages results in a
somewhat expensive query to find which streams/topics those are in.

We consider any organization having normal `page_params` fetch times
greater than a second to be a bug, and there is ongoing work to fix
that.
It can help when thinking about this to imagine `page_params` as what
in another webapp would have been 25 or so HTTP GET requests, each
fetching data of a given type (users, streams, custom emoji, etc.);
in Zulip, we just do all of those in a single API request. In the
future, we will likely move to a design that does much of the
database fetching work for different features in parallel to improve
latency.
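A minimal sketch of that possible future design (the per-feature
fetch functions and keys here are hypothetical stand-ins, not Zulip's
actual code): independent per-feature fetches run concurrently and
are merged into a single response.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-feature fetchers; in a real server these would be
# database/memcached queries for users, streams, custom emoji, etc.
def fetch_users() -> list:
    return ["alice", "bob"]

def fetch_streams() -> list:
    return ["general"]

def fetch_custom_emoji() -> list:
    return [":octopus:"]

def build_page_params() -> dict:
    fetchers = {
        "realm_users": fetch_users,
        "streams": fetch_streams,
        "custom_emoji": fetch_custom_emoji,
    }
    # Run the independent fetches in parallel, so total latency
    # approaches that of the slowest single fetch rather than the
    # sum of all of them.
    with ThreadPoolExecutor() as pool:
        futures = {key: pool.submit(fn) for key, fn in fetchers.items()}
        return {key: future.result() for key, future in futures.items()}
```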
For organizations with 10K+ users and many default streams, the
majority of time spent constructing `page_params` is spent
marshalling data on which users are subscribed to which streams,
which is an area of active optimization work.
### Fetching message history

Bulk requests for message content and metadata ([`GET
/messages`](https://zulip.com/api/get-messages)) account for ~3% of
total HTTP requests. The Zulip webapp has a few major reasons it does
a large number of these requests:
* Most of these requests are from users clicking into different views
  -- to avoid certain subtle bugs, Zulip's webapp currently fetches
  content from the server even when it has the history for the
  relevant stream/topic cached locally.
* When a browser opens the Zulip webapp, it will eventually fetch and
  cache in the browser all messages newer than the oldest unread
  message in a non-muted context. This can be in total extremely
  expensive for users with 10,000s of unread messages, resulting in a
  single browser doing 100 of these requests.
* When a new version of the Zulip server is deployed, every browser
  will reload within 30 minutes to ensure they are running the latest
  code. For installations that deploy often, like chat.zulip.org and
  zulip.com, this can result in a thundering herd effect for both `/`
  and `GET /messages`. A great deal of care has been taken in
  designing this [auto-reload
  system](../subsystems/hashchange-system.html#server-initiated-reloads)
  to spread most of that herd over several minutes.
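The herd-spreading idea from the last bullet can be sketched as
client-side jitter (an illustrative model only; the webapp's actual
reload logic is in the auto-reload system linked above, and the
5-minute window is an assumed parameter):

```python
import random

def reload_delay_seconds(max_spread_minutes: int = 5) -> float:
    # Rather than reloading the instant a new server version is
    # detected, each browser waits a random delay, spreading the
    # resulting `/` and `GET /messages` requests over several
    # minutes instead of hitting the server simultaneously.
    return random.uniform(0, max_spread_minutes * 60)
```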
Typical requests consume 20-100ms to process, much of which is
waiting to fetch message IDs from the database and then their content
from memcached. While not large in an absolute sense, these requests
are relatively expensive compared to most other Zulip endpoints.
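That two-step pattern (message IDs from the database, message content
from the cache, with a database fallback that backfills the cache)
can be sketched as follows; the dicts here are stand-ins for
PostgreSQL and memcached, not Zulip's real data layer:

```python
# Stand-ins for PostgreSQL and memcached.
DATABASE = {1: "hello", 2: "world", 3: "again"}
CACHE = {1: "hello"}  # only message 1 starts out cached

def fetch_messages(message_ids: list) -> dict:
    # In Zulip, step 1 is a database query that computes the relevant
    # message IDs; here they are passed in directly.
    result = {}
    missing = []
    for message_id in message_ids:
        if message_id in CACHE:
            result[message_id] = CACHE[message_id]
        else:
            missing.append(message_id)
    # Step 2: fall back to the database for cache misses, and
    # backfill the cache so the next fetch is cheap.
    for message_id in missing:
        content = DATABASE[message_id]
        CACHE[message_id] = content
        result[message_id] = content
    return result
```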
Some requests, like full-text search for commonly used words, can be
more expensive, but they are sufficiently rare in an absolute sense
so as to be immaterial to the overall scalability of the system.
This server-side code path is already heavily optimized on a
per-request basis. However, we have technical designs for optimizing
the overall frequency with which clients need to make these requests
in two major ways:
* Improving [client-side
  caching](https://github.com/zulip/zulip/issues/15131) to allow
  caching of narrows that the user has viewed in the current session,
  avoiding repeat fetches of message content during a given session.
* Adjusting the behavior for clients with 10,000s of unread messages
  to not fetch as much old message history into the cache. See [this
  issue](https://github.com/zulip/zulip/issues/16697) for relevant
  design work.

Together, it is likely that these changes will reduce the total
scalability cost of fetching message history dramatically.
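The first of those improvements amounts to memoizing fetches by
narrow within a session. A minimal sketch (a hypothetical cache, not
the webapp's actual implementation):

```python
class NarrowCache:
    """Session-scoped cache of message fetches, keyed by narrow."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn  # issues the GET /messages request
        self._cache = {}
        self.fetch_count = 0

    def get(self, narrow: tuple) -> list:
        # Revisiting a narrow in the same session hits the cache
        # instead of issuing another GET /messages request.
        if narrow not in self._cache:
            self.fetch_count += 1
            self._cache[narrow] = self._fetch_fn(narrow)
        return self._cache[narrow]
```

Invalidation on newly arrived messages is the hard part in practice,
which is why the linked issue tracks the real design work.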
### User uploads

Requests to fetch uploaded files (including user avatars) account for
about 5% of total HTTP requests. Zulip consistently spends ~10-15ms
processing one of these requests (mostly authorization logic), before
handing off delivery of the file to `nginx` or S3 (depending on the
configured [upload backend](../production/upload-backends.md)).
We can significantly reduce the volume of these requests in large
production Zulip servers by fixing [Zulip's upload authorization
strategy preventing browser caching of
uploads](https://github.com/zulip/zulip/issues/13088), since many of
them result from a single client downloading the same file repeatedly
as it navigates around the app.
### Sending and editing messages

[Sending new messages](../subsystems/sending-messages.md) (including
incoming webhooks) represents less than 0.5% of total request volume.
That this number is small should not be surprising, even though
sending messages is intuitively the main feature of a chat service: a
message sent to 50 users triggers ~50 `GET /events` requests.
A typical message-send request takes 20-70ms, with more expensive
requests typically resulting from [markdown
rendering](../subsystems/markdown.md) of more complex syntax. As a
result, these requests are not material to Zulip's scalability.
Editing messages and adding emoji reactions are very similar to
sending them for the purposes of performance and scalability, since
the same clients need to be notified, and are lower in volume.

That said, we consider the performance of these endpoints to be some
of the most important for Zulip's user experience, since even with
local echo, these are some of the places where any request processing
latency is highly user-visible.
Typing notifications are slightly higher volume than sending
messages, but are also extremely cheap (~3ms).
### Other endpoints

Other API actions, like subscribing to a stream, editing settings,
registering an account, etc., are vanishingly rare compared to the
requests detailed above, fundamentally because almost nobody changes
these things more than a few dozen times over the lifetime of their
account, whereas the operations above are things that a given user
might do thousands of times.

As a result, performance work on those requests is generally only
important for latency reasons, not for optimizing the overall
scalability of a Zulip server.
## Queue processors and cron jobs

The above doesn't cover all of the work that a production Zulip
server does; various tasks, like sending outgoing emails or recording
the data that powers [/stats](https://zulip.com/help/analytics), are
run by [queue processors](../subsystems/queuing.md) and cron jobs,
not in response to incoming HTTP requests. In practice, all of these
have been written such that they are immaterial to total load and
thus to architectural scalability, though we do from time to time
need to do operational work to add additional queue processors for
particularly high-traffic queues. For all of our queue processors,
any serialization requirements are at most per-user, and thus it
would be straightforward to shard by `user_id` or `realm_id` if
required.
## Service scalability

In addition to the above, which just focuses on the total amount of
CPU work, it's also relevant to think about load on infrastructure
services (memcached, redis, rabbitmq, and most importantly postgres),
as well as queue processors (which might get backlogged).
In practice, efforts to make an individual endpoint faster will very
likely reduce the load on these services as well. But it is worth
considering that database time is a more precious resource than
Python/CPU time (being harder to scale horizontally).
Most optimizations to make an endpoint cheaper will start with
optimizing the database queries and/or employing
[caching](../subsystems/caching.md), and then continue as needed with
profiling of the Python code and any memcached queries.
For a handful of the critical code paths listed above, we further
optimize by skipping the Django ORM (which has substantial overhead)
for narrow sections; typically this is sufficient to result in the
database query time dominating that spent by the Python application
server process.
Zulip's [server logs](../subsystems/logging.md) are designed to
provide insight when a request consumes significant database or
memcached resources, which is useful both in development and in
production.