Commit Graph

385 Commits

Author SHA1 Message Date
Mateusz Mandera
a52bc4d71b slack: Handle integration bots with missing data.
We encountered the following two new cases with integration bots in
Slack imports:
1. Bots without the image_72 field in their data. Such bots should fall
   back to gravatar.
2. Bots whose bot_id is the sender of certain messages, but querying the
   bots.info endpoint returns bot_not_found error. We should create
   dummy accounts in place of such bots.
2025-05-16 13:06:28 -07:00
Mateusz Mandera
61b6e823b6 slack: Don't create duplicate reactions.
Just like in the prior commit we add similar code in `get_subscription`,
here we do this for reactions. If two Slack accounts, which will be
merged into one Zulip account, posted the same emoji reaction to a
message, we need to avoid duplication.
2025-05-16 13:06:28 -07:00
Mateusz Mandera
86f17026ac slack: Don't create duplicate subscriptions.
In 1517601e9d we added merging of accounts
with the same email address; but didn't prevent duplication of
subscriptions when two Slack accounts, merged into one Zulip account,
belong to a channel.
2025-05-16 13:06:28 -07:00
userAdityaa
354a16fb0a migration: Rename 'populate_db' Client to 'ZulipDataImport'.
This commit:

* Creates a migration to rename any existing Client with
name="populate_db" to "ZulipDataImport".
* Updates populate_db.py to use ZulipDataImport for new
message creation

These changes should make code to identify imported messages
considerably more readable.

Fixes #33909.
2025-05-08 12:18:34 -07:00
PieterCK
a228699108 slack_incoming: Use Slack text reformatters from data import.
This commit updates the Slack incoming webhook endpoint to use the same
Slack reformatting functions (`convert_to_zulip_markdown` and
`replace_links`) that are used for Slack data import and cleaning up any
duplicative functions.

This was previously not possible because the Slack reformatting regex in
`slack_message_conversion.py` could not handle these cases:

- Formatting applied to non-ASCII characters (e.g., emoji).

- Formatted strings separated by exactly one character.

- Formatted strings appearing immediately after a new line.

Fixes part of #31162.
2025-04-28 20:49:10 -07:00
PieterCK
c018911c5b slack_regex: Fix overlapping capture groups.
The Slack text regexes match specific characters before and after the
formatted string to ensure that they only match at word boundaries.

However, because the capture groups consume each matching character,
including the characters used to determine word boundaries, two
formatted strings separated by a single matching character result in one
string not being matched, as the trailing whitespace is already
consumed and cannot also match as the leading pre-match whitespace for
the next character.

Switch to a look-ahead regex for the trailing word boundary
characters. This is zero-width, and as such the next match can still
also consume the same characters.

This also fixes Slack webhook integrations'
`test_message_with_complex_formatted_mentions` which was previously
expecting false output.

Fixes part of #30827.
2025-04-28 20:49:10 -07:00
PieterCK
cfeb4ba731 slack_regex: Simplify Slack regex main capture group.
The inner capture group of Slack text regex is used to capture the
formatted text, so basically all characters but the formatting
characters like *, ~, and _.

It currently does this by specifying a range of characters to be
captured except the formatting characters. This unintentionally excluded
non-ASCII characters like emoji.

This commit simplifies the inner capture groups of the Slack text regex
to explicitly exclude formatting characters (e.g., *, ~, _) instead of
using an allowlist to not include them.

This change is part of the effort to make `convert_to_zulip_markdown`
compatible with output from `render_blocks` and `render_attachments`,
which may contain emoji.
2025-04-28 20:49:10 -07:00
PieterCK
ee34ccb8c2 slack_regex: Remove unnecessary inner capture group.
The inner capture groups of the Slack text regex tries to captures the
formatted string. For an unclear reason, we're currently using two
similar capture groups despite the fact that only using the second
capture groups would suffice.

This removes the first capture groups.
2025-04-28 20:49:10 -07:00
PieterCK
8992435caf slack_regex: Update Slack regex handle multiline strings.
This prep commit modifies the Slack text regex in
`slack_message_conversion.py` to use the `re.MULTILINE` flag capture
formatted strings that are adjacent to newline or end of line.

These kinds of strings are likely not uncommon to be generated by Slack
exporter itself and our Slack message block and attachment formatters
(`render_blocks` and `render_attachments`) also produces them.

This also fixes Slack webhook integration's
`test_message_with_complex_formatted_texts` which was previously
expecting false output.

Fixes part of #30827.
2025-04-28 20:49:10 -07:00
PieterCK
6c1818fa46 slack_regex: Avoid replacing all identical matches.
In `convert_markdown_syntax`, `re.finditer` is used to iterate over each
`match` found in the `text`. In each iteration, we create the Zulip
formatting equivalent of the current `match` and do a `text.replace()`
to replace all strings in `text` that matches the current `match`.

Since we're planning to add the `re.MULTILINE` flag to `re.finditer`, it
would be problematic if the first and last capture group of `match` are
zero-width characters (newlines). This is because
`convert_markdown_syntax` would reformat all identical strike-through
and bold formatted strings as many time as there are identical `match`es
of them.

Consider the following:
---
 original text: "*foo*\n*foo*"

 1. <re.Match object; span=(0, 5), match='*foo*'>
    1st replace -> "**foo**\n**foo**"

 2. <re.Match object; span=(6, 11), match='*foo*'>
    2nd replace -> "***foo***\n***foo***"
---

This commit uses `re.sub` instead of `re.finditer`, which does replace
operation only on the specific piece of the original text the `match` is
from.
2025-04-28 20:49:10 -07:00
PieterCK
946e4096fc slack_regex: Refactor how the capture groups are defined.
This updates the patterns to use a more unicode-aware pattern for their
first and last capture groups. The new patterns have the same
behaviours, but they're expressed in a more coherent way.

For example, the existing patterns lists what characters to look for,
skipping ceratin characters it don't want to match (e.g, closing
brackets & quote are skipped). The new pattern narrows down what it
looks for (whitespace, punctuation, symbols) and explicitly list what it
don't want to match (closing quote and bracket, etc).

This also refactors the `convert_markdown_syntax` to use the `regex`
module instead of the `re` module because the `regex` module has full
unicode support.
2025-04-28 20:49:10 -07:00
PieterCK
f33c5b9c1e slack_regex: Fix wrong comments. 2025-04-28 20:27:28 -07:00
PieterCK
d5e28bcd28 slack_import: Fix thread conversion condition.
Currently, threads in Slack direct messages will increment the
`thread_counter` variable inside the thread conversion logic. Since we
don't treat thread messages in Slack DMs differently than any other DM,
threads in DM will only falsely increment the thread topic names in
channels.

This adds a condition that checks if the Slack message is a DM or not
before executing the thread conversion logic.
2025-03-25 16:38:21 -07:00
Elsa Kihlberg Gawell
845f0d40e1 import_data: Make sure converted DMs don't have topic name.
Previously, `build_message` sets a message's topic name to the given
topic name, regardless of whether the message was a direct message (DM)
or a group direct message (GDM).

This change adds the `is_private` parameter to `build_message`. If
`is_private` is `True`, the `topic_name` will be overridden to an empty
string (""). Consequently, this also updates the third-party importers
to pass this parameter when calling `build_message`.

Co-authored-by: Pieter CK <pieterceka123@gmail.com>
2025-03-25 16:38:21 -07:00
Alex Vandiver
47d55c4b6f messages: Add an is_channel_message flag. 2025-03-18 09:34:11 -07:00
Anders Kaseorg
88a8087243 ruff: Fix PYI019 Use Self instead of custom TypeVar.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-03-14 17:31:50 -07:00
Tim Abbott
b249830715 rocketchat: Harden parsing of custom emoji exports.
Apparently some exports have a weird `ObjectId("idstring")` typed
value here, that stringify to the actual string.
2025-02-05 13:41:19 -08:00
Alex Vandiver
9b4b53ef29 slack: Mark content-type of imported attachments. 2025-01-31 14:29:57 -08:00
Alex Vandiver
33539568ae slack: Ensure a newline before attachment links.
The content `look![image](https://example.com)` does not render as a
link, nor an image upload (were it to `/user_uploads/...`).  The
`![...](...)` syntax is intended for inline images, but unsupported in
Zulip, and as such does not link or render as _anything_.

Ensure a newline between message content and any attachments.
2025-01-31 14:29:57 -08:00
Mateusz Mandera
f81e514d07 slack: Fetch workspace users from /users.list in the correct manner.
1. Fetching from the `/users.list` endpoint is supposed to use
   pagination. Slack will return at most 1000 results in a single
   request. This means that our Slack import system hasn't worked
   properly for workspaces with more than 1000 users. Users after the
   first 1000 would be considered by our tool as mirror dummies and thus
   created with is_active=False,is_mirror_dummy=True.
   Ref https://api.slack.com/methods/users.list

2. Workspaces with a lot of users, and therefore requiring the use of
   paginated requests to fetch them all, might also get us to run into
   Slack's rate limits, since we'll be doing repeating requests to the
   endpoint.
   Therefore, the API fetch needs to also handle rate limiting errors
   correctly.
   Per, https://api.slack.com/apis/rate-limits#headers, we can just read
   the retry-after header from the rsponse and wait the indicated number
   of seconds before repeating the requests. This is an easy approach to
   implement, so that's what we go with here.
2025-01-24 16:41:53 -08:00
Anders Kaseorg
653b0b0436 ruff: Partially reformat Python with Ruff 0.9 (2025 style).
These are the changes that are backwards compatible with the 2024
style.

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2025-01-14 09:42:16 -08:00
Alex Vandiver
9fa5ab951c mime_types: Move INLINE_MIME_TYPES to prevent future import loops. 2025-01-09 09:09:13 -08:00
Sahil Batra
b4b571d60c streams: Remove stream_post_policy field from Stream. 2025-01-07 12:17:37 -08:00
Anders Kaseorg
19b8cde27f ruff: Fix PLC0206 Extracting value from dictionary without calling .items().
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-12-21 21:06:53 -08:00
PieterCK
a746be807f slack_import: Make check_token_access more flexible.
Previously, the `check_token_access` function had a hardcoded
`required_parameters` variable because it was only used in the Slack
data importer. This commit refactors `required_parameters` into a
function parameter, enabling the function to check a Slack token’s scope
for other purposes, such as Slack webhook integration.

Additionally, this commit changes the Slack API call in
`check_token_access` from `teams.info` to `api.test`. The endpoint is
better suited for this purpose since we're only checking a token’s scope
using the response header here.
2024-12-18 16:11:31 -08:00
PieterCK
f988412394 slack_data_import: Support converting integration bot users.
Currently, we're unable to convert messages from Slack's integration
bots because this message subtype doesn't come from a Slack "user", that
is they don't have a Slack user profile.

This is a preparatory change to support converting Slack's integration
bot messages. This commit artificially creates Slack user data from the
integration bot's "profile" so that we can create a corresponding Zulip
user for them.

Part of #31311.
2024-12-16 13:09:57 -08:00
PieterCK
10946caa3d slack_data_import: Update how Slack user avatars are processed.
Previously, the Slack export converter can only process Slack's avatar
URL from Slack's "ca.slack-edge.com" server, which looks like this:

https://ca.slack-edge.com/T0CDRA6HM3P-U06NABE26M9-1173e04f818e-512

This commit adds support for converting any public downloadable image
URLs.

This is done to support importing Slack's integration bots and their
messages, which typically have PNG type file url:

https://avatars.slack-edge.com/2024-05-01/7057208497908_a4351f6deb91094eac4c_72.png
2024-12-16 13:09:57 -08:00
PieterCK
6543dadd0f slack_import: Refactor a convert_slack_formatting.
This prep commit extracts the logic for converting Slack formatted text
into Zulip friendly markdown from `convert_to_zulip_markdown` into a new
helper function, `convert_slack_formatting`.

This is done to make reformating logic be reusable else where such as in
the Slack webhook or Slack incoming webhook.
2024-12-06 13:11:44 -08:00
PieterCK
4b472611d0 slack_import: Refactor a convert_slack_workspace_mentions.
This prep commit extracts the logic for reformatting Slack mentions into
Zulip mentions from `convert_to_zulip_markdown` into a new helper
function, `convert_slack_workspace_mentions`.

This is done to make the reformatting logic be reusable else where such
as in the Slack webhook or Slack incoming webhook.
2024-12-06 13:11:44 -08:00
PieterCK
ff4d5daef4 slack_import: Reorder functions used in convert_to_zulip_markdown.
This prep commit moves `convert_to_zulip_markdown` and
`convert_markdown_syntax` so that they appear after the various
helper functions used in the main logic for converting Slack
messages to Zulip's formatting. This adjustment aligns with code
style conventions, improving readability.

The change also prepares the codebase for reusing Slack text
reformatting logic in the Slack incoming webhook.
2024-12-06 13:11:44 -08:00
PieterCK
0d7199b22e data_import: Add migration status file to converted exports.
This commit updates all third-party importer tools (Slack, Mattermost,
and Rocket Chat) in the `zerver/data_import` directory to also output a
migration_status.json file in their output tarball.

This is required because all importable tarball will be checked for
migration compatibility during import.

Fixes #28443.
2024-11-08 15:52:45 -08:00
Mateusz Mandera
420849ff6a slack: Call the correct resize_* function when importing realm icon.
For resizing the icon.png files, we use resize_avatar, not resize_logo.
This is pretty confusing - sure, for icons we use the same function as
for avatars, but we should have a proper name for the function called in
the icon context. So this commit also adds resize_realm_icon, and
changes the calls to resize_avatar in icon contexts to
resize_realm_icon.
2024-11-08 15:43:18 -08:00
Harsh
e468818d2b import: Remove skipping of too-long messages during import.
This commit eliminates the skipping of messages longer than 10K characters during the import process.
2024-11-07 16:04:14 -08:00
PieterCK
92437b4ab5 integrations: Refactor slack_incoming webhook.
This commit refactors `render_attachment` and `render_block`
out of slack_incoming.py to promote reusability.

The primary motivation for this refactor is to add support for
converting integration bots messages in Slack exports, which could use
the same functions.

Part of #31311.
2024-11-07 10:47:08 -08:00
Harsh
858fdeee39 import: Add function to normalize messages to import.
This adds `normalize_body_for_import` to normalize messages from
third-party importers by removing NUL bytes and also updates import
test files data to test this.

Fixes #31930.
2024-11-04 12:22:30 -08:00
Anders Kaseorg
42e1517255 email_notifications: Prevent html2text from mangling Unicode.
html2text mangles Unicode by default, with a --unicode-snob option to
disable it.  If I have to get called a “snob” for wanting to correctly
support non-English languages, then uh, I’ll take one for the team.

https://github.com/Alir3z4/html2text/blob/2024.2.26/html2text/config.py#L111-L150

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-10-30 09:49:23 -07:00
PieterCK
fc50736f4e slack_data_import: Fix incorrect hyperlink conversion.
Currently, Slack messages containing hyperlinks
(e.g.,<http://foo.com|Foo!>) are converted like
normal links. This commit reformats Slack
hyperlinks into Zulip-friendly markdown
(e.g., [Foo!](http://foo.com)).

Part of #32165.
2024-10-30 09:48:32 -07:00
Tim Abbott
010410c849 rocketchat: Validate custom emoji before larger data sets.
This is a data set that's relatively likely to have weird failures,
and also likely to be fairly small.
2024-10-17 12:25:18 -07:00
Tim Abbott
6e4da50577 rocketchat: Complete metadata verification before importing uploads.
This is not the best factored version of this, but it saves effort
changing the tests, and importantly should make failures involving
metadata only take a couple seconds rather than first doing a giant
BSON read before learning about them.
2024-10-17 12:25:18 -07:00
Tim Abbott
79b6f43d0e rocketchat: Move bson_code_options to a global variable.
This will make it a lot easier to only read files in when we actually
need them.
2024-10-17 12:25:18 -07:00
PieterCK
6289a551aa data_import: Add email validation to third-party data converters.
This commit makes the third-party data converters check for invalid user
emails. If it finds any, it’ll raise an Exception and show an error
message with all the bad emails listed out.

Fixes: #31783
2024-10-15 16:04:43 -07:00
Alex Vandiver
2c51824b7d slack_import: Strip port from "domain_name".
This lets slack conversions be done on development hosts, which have a
trailing :9991 on their EXTERNAL_HOST; otherwise, we generate fake
emails like `imported-slack-bot@host.name:9991` which fail to
validate.
2024-09-26 12:01:11 -07:00
Alex Vandiver
e68096c907 slack: Protect against zip bombs.
A file which unpacks to more than 10x its original size is suspect,
particularly if that results in an uncompressed size > 1GB.
2024-09-26 12:01:11 -07:00
Alex Vandiver
6f7c14c9ec slack: Check that the archive is shaped the way we expect.
This is some minor protection against malicious zipfiles (e.g. many
very deep directories to chew up inodes), in addition to validation.
2024-09-26 12:01:11 -07:00
Alex Vandiver
d9f868a163 slack: Clean up expanded zipfiles more consistently. 2024-09-26 12:01:11 -07:00
Alex Vandiver
c8c480baef mattermost: Handle duplicate dm-groups.
Observed in the wild, cause unknown.

Partially fixes: #24131.

Co-authored-by: Mateusz Mandera <mateusz.mandera@zulip.com>
2024-09-09 12:05:28 -07:00
roanster007
c6a06d4684 direct_message_group: Add new group_size field.
This commit adds a new `group_size` field to the `DirectMessageGroup`
model, and backfills its value to each of the existing direct message
groups.

Fixes part of #25713
2024-08-23 11:09:41 -07:00
Mateusz Mandera
9cad9644e7 import_util: Fix file extensions of emoji files downloaded from Slack.
The Slack API when returning the emoji records, returns the record for
its thumbsup_all emoji with the url ending with .png, even though the
file is a gif.

For that reason, we have to make that code correct file extensions based
on the response content-type. Emojis are the smallest set of images to
download, so for simplicity of implementation, we remove the
parallelization of the downloads in favor of just processing them
serially.
2024-08-21 16:30:19 -07:00
Prakhar Pratyush
19d56f77b5 rocketchat: Fix "OSError: [Errno 36] File name too long" error.
Earlier, we were replacing too long attachment name with random uuid
when the character count of the file name was greater than 255.

This results in "OSError: [Errno 36] File name too long" error in
few cases when the file name has less than 255 characters but more
than 255 bytes (file name with Non-ASCII characters).

This commit updates the code to check the file name's byte size
instead of characters count.
2024-08-14 18:18:31 -07:00
roanster007
7b3e163d55 refactor: Rename huddle to direct_message_group in non api files.
This commit completes rename of "huddle" to "direct_message_group"
in all the non API files.

Part of #28640
2024-07-31 23:25:56 -07:00