zulip

mirror of https://github.com/zulip/zulip.git synced 2025-11-06 15:03:34 +00:00

Author	SHA1	Message	Date
Alex Vandiver	b555d3f553	digest: Cache per-stream recent topics, rather than batching. The query plan for fetching recent messages from the arbitrary set of streams formed by the intersection of 30 random users can be quite bad, and can descend into a sequential scan on `zerver_recipient`. Worse, this work of pulling recent messages out is redone if the stream appears in the next batch of 30 users. Instead, pull the recent messages for a stream on a one-by-one basis, but cache them in an in-memory cache. Since digests are enqueued in 30-user batches but still one-realm-at-a-time, work will be saved both in terms of faster query plans whose results can also be reused across batches. This requires that we pull the stream-id to stream-name mapping for _all_ streams in the realm at once, but that is well-indexed and unlikely to cause performance issues -- in fact, it may be faster than pulling a random subset of the streams in the realm.	2023-09-13 13:25:59 -07:00
Alex Vandiver	f8a9779b54	digest: Rename get_slim_stream_map slightly and explain its name more.	2023-09-13 13:25:59 -07:00
Alex Vandiver	bca9821c89	digest: Rename get_recent_streams for clarity.	2023-09-13 13:25:59 -07:00
Alex Vandiver	524d4913b3	digest: Filter out users who have joined recently in SQL.	2023-09-13 13:25:59 -07:00
Alex Vandiver	d8668ab242	digest: Narrow the query by only fetching the sender full name.	2023-09-13 13:25:59 -07:00
Alex Vandiver	058a168bfe	digest: Rewrite target-user algorithm as one query. There is no reason to do this set manipulation in Python.	2023-09-13 13:25:59 -07:00
Alex Vandiver	584c202d36	digest: Remove unnecessary should_process_digest function.	2023-09-13 13:25:59 -07:00
Anders Kaseorg	2665a3ce2b	python: Elide unnecessary list wrappers. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-09-13 12:41:23 -07:00
Alex Vandiver	b94402152d	models: Always search Messages with a realm_id or id limit. Unless there is a limit on `id`, always provide a `realm_id` limit as well. We also notate which index is expected to be used in each query.	2023-09-11 15:00:37 -07:00
Anders Kaseorg	c2c96eb0cf	python: Annotate type aliases with TypeAlias. This is not strictly necessary but it’s clearer and improves mypy’s error messages. https://docs.python.org/3/library/typing.html#typing.TypeAlias https://mypy.readthedocs.io/en/stable/kinds_of_types.html#type-aliases Signed-off-by: Anders Kaseorg <anders@zulip.com>	2023-08-07 10:02:49 -07:00
Anders Kaseorg	77c15547e6	ruff: Fix C414 Unnecessary `list` call within `sorted()`. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-11-03 12:10:15 -07:00
Zixuan James Li	ab1bbdda65	typing: Broaden type annotations for QuerySet compatibility. To explain the rationale of this change, for example, there is `get_user_activity_summary` which accepts either a `Collection[UserActivity]`, where `QuerySet[T]` is not strictly `Sequence[T]` because its slicing behavior is different from the `Protocol`, making `Collection` necessary. Similarly, we should have `Iterable[T]` instead of `List[T]` so that `QuerySet[T]` will also be an acceptable subtype, or `Sequence[T]` when we also expect it to be indexed. Signed-off-by: Zixuan James Li <p359101898@gmail.com>	2022-07-07 11:27:42 -07:00
Mateusz Mandera	fcf82bf047	digest: Don't send emails to deactivated users, even if queued.	2022-04-15 14:32:55 -07:00
Anders Kaseorg	b0ce4f1bce	docs: Fix many spelling mistakes. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2022-02-07 18:51:06 -08:00
Anders Kaseorg	6e4c3e41dc	python: Normalize quotes with Black. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Anders Kaseorg	11741543da	python: Reformat with Black, except quotes. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2021-02-12 13:11:19 -08:00
Vishnu KS	3f4f16f4f1	digest: Remove comments from get_hot_topics. The code is self explanatory.	2021-02-09 10:35:47 -08:00
Vishnu KS	e9587900e6	digest: Use heapq.nlargest instead of sorted. nlargest is the natural fit for selecting n biggest items from an unsorted list. It's more readable as well as more efficient (even though we don't care much about the efficiency in this particular case).	2021-02-09 10:35:47 -08:00
Vishnu KS	738d759e6f	digest: Create MAX_HOT_TOPICS_TO_BE_INCLUDED_IN_DIGEST constant.	2021-02-09 10:35:47 -08:00
Vishnu KS	c0bd05b52d	digest: Check whether length of hot topics is 4. The length of hot topics would not exceed 4.	2021-02-09 10:35:47 -08:00
Vishnu KS	5c026d67e3	digest: Sort topics in descending order in get_hot_topics. We want topics with high diversity and large lengths. So they should be sorted with reverse=True. This bug seems to be introduced in `936171d258`	2021-02-09 10:35:47 -08:00
Alex Vandiver	d0f0c2f2ed	digest: Fix the structure that we enqueue across when digesting. This rename was missed in `bfa0bdf3d6`. Without this fix, digest messages fail to send.	2021-02-08 17:28:59 -08:00
Tim Abbott	5a02b33f2e	digest: Add a large block comment on correctness.	2021-01-17 11:37:59 -08:00
Steve Howell	1040fb7219	email digests: Remove handle_digest_email shim. The previous commit made it so we only call the shim in tests, so now we completely remove it.	2021-01-17 11:28:30 -08:00
Steve Howell	bfa0bdf3d6	email digests: Process users in chunks of 30. This should make the queue empty more quickly, because we do bulk queries to prevent database hops.	2021-01-17 11:28:30 -08:00
Steve Howell	e0b451730a	email digests: Extract get_new_streams. This makes us more efficient when handling multiple users. We don't have to keep sending the same two queries to the database. Note that as part of this we eliminated a failure mode for the obscure population of users for whom both `user.is_guest` and `user.can_access_public_streams()` return False. We know this would have only affected Zephyr users (by looking at the code), and we know we don't actually process Zephyr users for email digests (or else we would have raised exceptions in the old code).	2021-01-17 11:28:30 -08:00
Steve Howell	23de94504f	email digests: Query streams for messages up front. This should save us many hops to the database when we process users in bulk.	2021-01-17 11:28:30 -08:00
Steve Howell	3662bf2dcb	minor: Rename stream_map -> user_stream_map.	2021-01-17 11:28:30 -08:00
Steve Howell	f8bbb7fea9	email digests: Use select_related("realm"). We mostly need realm_id, but when we go to build message lists, we need realm.uri. We could probably be more aggressive about using `only` here, but for now I am just trying to reduce hops to the database.	2021-01-17 11:28:29 -08:00
Steve Howell	bb56f0ec0e	minor: Move get_stream_map to module level. This is a pure code move.	2021-01-17 11:28:29 -08:00
Steve Howell	52e2d5a733	email digests: Avoid long_term_idle check. We want to exclude users with recent subscription activity from emails, regardless of whether the long_term_idle flag is set.	2021-01-17 11:28:29 -08:00
Steve Howell	162b372b93	email digests: Do one query for recent streams. This is another way to limit hops to the database when we process users in bulk.	2021-01-17 11:28:29 -08:00
Alex Vandiver	438d2aa632	digests: Ensure that the teaser_data can be JSON-serialized. Leaving this as a set means that it fails in zerver.lib.send_email when serializing into a ScheduledEmail object.	2020-12-15 11:44:50 -08:00
Steve Howell	e2e0f06b2a	email digests: Call get_recent_topics once per batch. Once we start processing digests in batch, this will let us amortize the expense of the message query over multiple users.	2020-11-16 08:59:29 -08:00
Steve Howell	428f0564a0	minor: Move context code down in the function. This will make a subsequent diff a bit less noisy.	2020-11-16 08:59:29 -08:00
Steve Howell	1d1e45e9ec	digests: Use UserActivityInterval for user activity. Note that we are much more efficient about finding active users here: - we do one query per realm (instead of per-user) - we pass the cutoff date to the database - we get back just a list of distinct ids	2020-11-16 08:59:29 -08:00
Steve Howell	b52f56080e	performance: Just get user_ids to queue digest emails.	2020-11-16 08:59:29 -08:00
Steve Howell	e13e5d104d	refactor: Only require user_id for inactive_since(). This function is going away completely soon. It is querying everybody's entire UserActivity history instead of passing the cutoff date to the database!	2020-11-16 08:59:29 -08:00
Steve Howell	d0260392f7	digests: Get user objects from the database. The query counts increase here for somewhat contrived reasons. The tests before this commit reflected a successful trip to the UserProfile cache, but that's not actually realistic in practice.	2020-11-16 08:59:29 -08:00
Steve Howell	e49a482baf	email digests: Make transactions atomic.	2020-11-16 08:59:28 -08:00
Steve Howell	cf6bcfb84a	digest emails: Exclude users who had recent digests. This code protects us in case we ever need to re-run email digests twice in the same day.	2020-11-16 08:59:28 -08:00
Steve Howell	4271442fba	email digests: Write RealmAuditLog rows.	2020-11-16 08:59:28 -08:00
Steve Howell	5da4332620	minor: Add order-by-id to digest message query. The order-by-id is now explicit, and I add comments to explain the select_related tables.	2020-11-06 10:05:46 -08:00
Steve Howell	936171d258	refactor: Extract DigestTopic class. This gets us away from a lot of dictionary soup.	2020-11-06 10:05:46 -08:00
Steve Howell	e8b6c56322	refactor: Simplify get_hot_topics(). The code we deleted here was no longer doing anything. Maybe the code was always dead, or maybe it was written during a time when topics_by_diversity and topics_by_length actually had different keys. But now it's clearly cruft. If we have 4 or more topics, then the code above it would already have populated the list with 4 elements, and the `if num_convos < 4` condition would evaluate to False. And if we had 3 or fewer topics, then we would have already put all possible topics into our result, and the `topics_by_diversity[num_convos:4]` slice would be empty. It's possible that we should just have a simple heuristic for topic hotness like `10*num_senders + messages`, so we don't have to maintain this fiddly function, and we can just do something like `topics_by_score[:4]`.	2020-11-06 10:05:46 -08:00
Steve Howell	c5dc9d386f	refactor: Use sets of stream_ids for email digests. I now use sets for stream_ids in more of the digest code. As part of this I replaced exclude_subscription_modified_streams with streams_recently_modified_for_user. It's easier for the caller to just ask for ids to delete from its callee than it is to pass in a set/list to mutate. The simpler boundary between the functions makes the tests easier to write--you can see the `filtered_streams` logic goes away in this diff. I also make the tests a bit more thorough by using combinations of Cordelia/Othello and Verona/Denmark to try to find multiple possible flaws. And I make the time intervals longer than 1s to avoid false negatives from slow CI boxes.	2020-11-05 17:42:43 -08:00
Steve Howell	88a57ed4ac	bulk digest: Get stream subscriptions in bulk. If we have multiple users, this reduces the amount of queries we need to do, because we get all subscriptions for all users in a single query to Subscription. For the single-user case, we are introducing an extra query hop, but the database is doing roughly the same work, because we are just breaking up this complex query into two hops: messages = select ... from message where recipient__type_id in ( select stream_id from subscription where ... ) Now it's more like: stream_ids = select stream_id from subscription where ... messages = select ... from message where recipient__type_id in stream_ids	2020-11-05 09:36:59 -08:00
Steve Howell	c83db37161	email digests: Introduce bulk methods for digest. Note that we are not changing anything semantically or algorithmically yet. The only overhead here for the single-user case is boxing and unboxing data into single-item dicts and lists. The interfaces for callers in the view and the queue processor remain the same for now.	2020-11-05 09:36:59 -08:00
Steve Howell	7c89e46731	minor: Clean up some code formatting.	2020-11-05 09:36:59 -08:00
Steve Howell	4bd02eea19	minor: Use user, not user_profile, in some digest code.	2020-11-05 09:36:59 -08:00
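
Several entries above (e9587900e6, 5c026d67e3, 738d759e6f) concern selecting a small number of "hot" topics in descending order with `heapq.nlargest`. The following is a minimal sketch of that idea only, not the code from the repository: the `Topic` class, its fields, the `hottest` helper, the single combined sort key, and the constant name `MAX_TOPICS` are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


# Hypothetical stand-in for a digest topic; not Zulip's actual class.
@dataclass
class Topic:
    name: str
    num_senders: int   # sender "diversity"
    num_messages: int  # topic "length"


# Illustrative limit; the log mentions that at most 4 hot topics are kept.
MAX_TOPICS = 4


def hottest(topics: Iterable[Topic], n: int = MAX_TOPICS) -> List[Topic]:
    """Return up to n topics, largest first.

    Assumption: topics are ranked by sender count, then message count.
    heapq.nlargest returns results in descending key order, so no
    separate sorted(..., reverse=True) step is needed.
    """
    return heapq.nlargest(
        n, topics, key=lambda t: (t.num_senders, t.num_messages)
    )
```

Called as `hottest(topics)`, this behaves like `sorted(topics, key=..., reverse=True)[:4]` without sorting the whole input, and it simply returns fewer items when fewer than four topics exist.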


173 Commits