Commit Graph

102 Commits

Author SHA1 Message Date
Alex Vandiver
218bbe041e upload: Adjust minimum encoding confidence thresholds.
UTF-8 is far and wide the dominant encoding, so we assume that
encoding if we have more than 66% confidence -- this allows it to be
guessed even for short files.  ISO-8859-1 caps at 73% confidence, so
we assume that if it's the best option and it hits that max.
Otherwise, we still require 90% confidence.
2025-08-06 09:41:44 -07:00
Alex Vandiver
085d137871 upload: Rename attachment_vips_source, as it's not just for vips_source. 2025-07-29 10:01:40 -07:00
Alex Vandiver
25fbb05fea upload: Attempt to guess character set text/* which are served inline.
This is only `text/plain`, currently.  In such cases where the
client-provided content-type also does not specify a `charset`, we use
`chardet` to make a guess, and store that guess to provide later when
serving the content.  The detection is done in a streaming fashion,
and thus should not require re-downloading the full content.
2025-07-24 22:40:12 -07:00
Alex Vandiver
ae001dfa96 upload: Allow uploads to set charset.
Previously, the `user_file.content_type` only contained the MIME type
of the uploaded file; no other parameters were included, meaning that
a file a client specified as `content-type: text/plain; charset=big5`
would be stored with an `Attachment.content_type` of `text/plain`.

Re-construct the full content-type header from `content_type_extra`,
which includes those parameters.

We do not include a test because Django does not support specifying
such parameters in the upload path.
2025-07-24 22:40:12 -07:00
Alex Vandiver
183da665ac upload: Rename .source to .vips_source. 2025-07-24 22:40:12 -07:00
Alex Vandiver
aeed907c50 s3: Support non-AWS S3 providers which do not support request checksums. 2025-04-10 11:11:25 -07:00
Alex Vandiver
9fa5ab951c mime_types: Move INLINE_MIME_TYPES to prevent future import loops. 2025-01-09 09:09:13 -08:00
Anders Kaseorg
84d5ef1285 upload: Show text/plain uploads inline.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-12-10 09:51:57 -08:00
Mateusz Mandera
420849ff6a slack: Call the correct resize_* function when importing realm icon.
For resizing the icon.png files, we use resize_avatar, not resize_logo.
This is pretty confusing - sure, for icons we use the same function as
for avatars, but we should have a proper name for the function called in
the icon context. So this commit also adds resize_realm_icon, and
changes the calls to resize_avatar in icon contexts to
resize_realm_icon.
2024-11-08 15:43:18 -08:00
Prakhar Pratyush
9c9866461a transaction: Add durable=True to the outermost db transactions.
This commit adds `durable=True` to the outermost db transactions
created in the following:
* confirm_email_change
* handle_upload_pre_finish_hook
* deliver_scheduled_emails
* restore_data_from_archive
* do_change_realm_subdomain
* do_create_realm
* do_deactivate_realm
* do_reactivate_realm
* do_delete_user
* do_delete_user_preserving_messages
* create_stripe_customer
* process_initial_upgrade
* do_update_plan
* request_sponsorship
* upload_message_attachment
* register_remote_server
* do_soft_deactivate_users
* maybe_send_batched_emails

It helps to avoid creating unintended savepoints in the future.

This is as a part of our plan to explicitly mark all the
transaction.atomic calls with either 'savepoint=False' or
'durable=True' as required.

* 'savepoint=True' is used in special cases.
2024-11-05 17:58:47 -08:00
Mateusz Mandera
da4443f392 thumbnail: Make thumbnailing work with data import.
We didn't have thumbnailing for images coming from data import and this
commit adds the functionality.

There are a few fundamental issues that the implementation needs to
solve.

1. The images come from an untrusted source and therefore we don't want
   to just pass them through to thumbnailing without checking. For that
   reason, we cannot just import ImageAttachment rows from the export
   data, even for zulip=>zulip imports.
   The right way to process images is to pass them to maybe_thumbail(),
   which runs libvips_check_image() on them to verify we're okay with
   thumbnailing, creates ImageAttachment rows for them and sends them
   to the thumbnailing queue worker. This approach lets us handle both
   zulip=>zulip and 3rd party=>zulip imports in the same way,

2. There is a somewhat circular dependency between the Message,
   Attachment and ImageAttachment import process:

- ImageAttachments would ideally be created after importing
  Attachments, but they need to already exist at the time of Message
  import. Otherwise, the markdown processor doesn't know it has to add
  HTML for image previews to messages that reference images. This would
  mean that messages imported from 3rd party tools don't get image
  previews.
- Attachments only get created after Message import however, due to the
  many-to-many relationship between Message and Attachment.

This is solved by fixing up some data of Attachments pre-emptively, such
as the path_ids. This gives us the necessary information for creating
ImageAttachments before importing Messages.

While we generate ImageAttachment rows synchronously, the actual
thumbnailing job is sent to the queue worker. Theoretically, the worker
could be very backlogged and not process the thumbnails anytime soon.
This is fine - if the app is loaded and tries to display a message with
such a not-yet-generated thumbnail, the code in `serve_file` will
generate the thumbnails synchronously on the fly and the user will see
the image preview displayed normally. See:

1b47134d0d/zerver/views/upload.py (L333-L342)
2024-10-24 10:32:51 -07:00
Alex Vandiver
a20673a267 upload: Allow filtering to just a prefix (e.g. a realm id). 2024-09-26 12:01:11 -07:00
Alex Vandiver
579cf4ada7 upload: Make local-file save_attachment_contents chunk-at-a-time.
This means it does not attempt to hold large files entirely in memory
when writing them to a new location on disk.
2024-09-26 12:01:11 -07:00
Alex Vandiver
2dc737335e upload: Switch from BinaryIO to IO[bytes].
This is slightly more generally-compatible.
2024-09-26 12:01:11 -07:00
Alex Vandiver
638c579c56 tusd: Set metadata correctly in S3.
The Content-Type, Content-Disposition, StorageClass, and general
metadata are not set according to our patterns by tusd; copy the file
to itself to update those properties.
2024-09-26 12:00:43 -07:00
Alex Vandiver
287850d08d tusd: Remove non-ASCII characters from path-ids. 2024-09-26 12:00:43 -07:00
Alex Vandiver
84280ed7c2 upload: When serving s3 download URLs, send real filename.
Setting `ResponseContentDisposition=attachment` means that we override
the stored `ContentDisposition`, which includes a filename.  This
means that using the "Download" link on servers with S3 storage
produced a file named the sanitized version we stored.

Explicitly build a `ContentDisposition` to tell S3 to return, which
includes both `attachment` as well as the filename (if we have it
locally).
2024-09-26 12:00:43 -07:00
Anders Kaseorg
184c0203f3 upload: Lazily import boto3.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-09-24 16:38:37 -07:00
Alex Vandiver
e125ad823d exports: Add a separate bucket for realm exports.
This allows finer-grained access control and auditing.  The links
generated also expire after one week, and the suggested configuration
is that the underlying data does as well.

Co-authored-by: Prakhar Pratyush <prakhar@zulip.com>
2024-09-20 15:43:49 -07:00
Alex Vandiver
c1e8ecd08f uploads: Cache boto client in the module and be writable.
The `get_signed_upload_url` code is called for every S3 file serve
request, and is thus in the hot path.  The boto3 client caching
optimization is thus potentially useful as a performance optimization.
2024-09-20 15:43:49 -07:00
Alex Vandiver
1a7b3ef7ed upload: Use get_export_tarball_url in upload_export_tarball. 2024-09-20 15:43:49 -07:00
Alex Vandiver
4cf835d9dd upload: Remove common cache from get_export_tarball_url.
This is not called in the hot path like get_avatar_url is.
2024-09-20 15:43:49 -07:00
Alex Vandiver
a5bf452202 upload: Realm is not Optional in upload_export_tarball.
af4eb8c0d5 marked the base class and local backend as non-Optional,
but left the S3 backend as Optional for some reason.

Remove it.
2024-09-20 15:43:49 -07:00
Alex Vandiver
9a1f78db22 thumbnail: Support checking for images from streaming sources.
We may not always have trivial access to all of the bytes of the
uploaded file -- for instance, if the file was uploaded previously, or
by some other process.  Downloading the entire image in order to check
its headers is an inefficient use of time and bandwidth.

Adjust `maybe_thumbnail` and dependencies to potentially take a
`pyvips.Source` which supports streaming data from S3 or disk.  This
allows making the ImageAttachment row, if deemed appropriate, based on
only a few KB of data, and not the entire image.
2024-09-17 12:51:30 -07:00
Alex Vandiver
903bfb31e6 upload: Provide the frontend with the less-modified filename. 2024-09-09 12:40:17 -07:00
Alex Vandiver
b4764f49df upload: Download files with their original names.
Fixes: #29491.
2024-09-09 12:40:17 -07:00
Alex Vandiver
ca72e756eb upload: Rename "upload_image_to_s3"; it is not only for images. 2024-09-09 12:40:17 -07:00
Anders Kaseorg
91ade25ba3 python: Simplify with str.removeprefix, str.removesuffix.
These are available in Python ≥ 3.9.
https://docs.python.org/3/library/stdtypes.html#str.removeprefix

Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-09-03 12:30:16 -07:00
Alex Vandiver
4351cc5914 thumbnail: Move get_image_thumbnail_path and split_thumbnail_path. 2024-07-18 13:50:28 -07:00
Alex Vandiver
2e38f426f4 upload: Generate thumbnails when images are uploaded.
A new table is created to track which path_id attachments are images,
and for those their metadata, and which thumbnails have been created.
Using path_id as the effective primary key lets us ignore if the
attachment is archived or not, saving some foreign key messes.

A new worker is added to observe events when rows are added to this
table, and to generate and store thumbnails for those images in
differing sizes and formats.
2024-07-16 13:22:15 -07:00
Alex Vandiver
229dcd0218 upload: Clean up empty directories in local storage. 2024-07-16 13:22:15 -07:00
Anders Kaseorg
0fa5e7f629 ruff: Fix UP035 Import from collections.abc, typing instead.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Anders Kaseorg
531b34cb4c ruff: Fix UP007 Use X | Y for type annotations.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Anders Kaseorg
e08a24e47f ruff: Fix UP006 Use list instead of List for type annotation.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
2024-07-13 22:28:22 -07:00
Alex Vandiver
0385e5bab9 emoji: Store in S3 with a long public cache-control. 2024-07-12 13:26:47 -07:00
Alex Vandiver
262689da76 thumbnail: Fix MAX_EMOJI_GIF_FILE_SIZE_BYTES check to be post-resize.
This check was intended to check the post-resized image size, not the
pre-resized image.
2024-07-12 13:26:47 -07:00
Alex Vandiver
54f2fabac0 thumbnail: Still emoji are always pngs. 2024-07-12 13:26:47 -07:00
Alex Vandiver
f6b99171ce emoji: Derive the file extension from a limited set of content-types.
We thumbnail and serve emoji with the same format as they were
uploaded.  However, we preserved the original extension, which might
mismatch with the provided content-type.

Limit the content-type to a subset which is both (a) an image format
we can thumbnail, and (b) a media format which is widely-enough
supported that we are willing to provide it to all browsers.  This
prevents uploading a `.tiff` emoji, for instance.

Based on this limited content-type, we then reverse to find the
reasonable extension to use when storing it.  This is particularly
important because the local file storage uses the file extension to
choose what content-type to re-serve the emoji as.

This does nothing for existing emoji, which may have odd or missing
file extensions.
2024-07-12 13:26:47 -07:00
Alex Vandiver
62a0611ddb emoji: Pass down content-type, rather than guessing from extension. 2024-07-12 13:26:47 -07:00
Alex Vandiver
4bc563128e thumbnail: Use a consistent set of supported image types. 2024-07-11 07:31:39 -07:00
Alex Vandiver
ff90e5355f upload: Pass down content-type of realm icon/logo to backend.
This saves having to try to re-derive it from the file extension,
which may be ".original" in some cases.
2024-07-11 07:31:39 -07:00
Alex Vandiver
79f858b4b8 upload: Pass bytes to create_attachment.
This will be used to analyze the bytes for image metadata.
2024-07-07 14:40:07 -07:00
Alex Vandiver
f97a30f240 upload: Reorder arguments to parallel upload_message_attachment. 2024-07-07 14:40:07 -07:00
Alex Vandiver
f52a93bc14 upload: Stop requiring callers pass in the file size.
This can be calculated because we have the contents.
2024-07-07 14:40:07 -07:00
Alex Vandiver
58a9fe9af1 upload: Drop unused parameters to upload_message_attachment. 2024-07-07 14:40:07 -07:00
Alex Vandiver
0a296b2a6e upload: Start storing content-type for new uploads. 2024-07-07 14:40:07 -07:00
Alex Vandiver
e29a455b2d avatars: Encode version into the filename.
Hash the salt, user-id, and now avatar version into the filename.
This allows the URL contents to be immutable, and thus to be marked as
immutable and cacheable.  Since avatars are served unauthenticated,
hashing with a server-side salt makes the current and past avatars not
enumerable.

This requires plumbing the current (or future) avatar version through
various parts of the upload process.

Since this already requires a full migration of current avatars, also
take the opportunity to fix the missing `.png` on S3 uploads (#12852).

We switch from SHA-1 to SHA-256, but truncate it such that avatar URL
data does not substantially increase in size.

Fixes: #12852.
2024-07-07 14:40:07 -07:00
Alex Vandiver
feca9939bb s3: Support setting a cache-control on uploads. 2024-07-07 14:40:07 -07:00
Alex Vandiver
6258817bfd s3: Stop setting empty Content-Disposition header. 2024-07-07 14:40:07 -07:00
Sahil Batra
5ef14c3a8e users: Fix uploading user avatars.
Due to recent refactoring in 9fb03cb2c7, a user could not
upload avatar if the server uses local upload backend and there
was already an avatar file for that user.

This commit fixes it to just check if there exists a file only
when importing and not when the user is actually trying to
change the avatar.

Fixes #30676.
2024-07-02 13:26:21 -07:00