This is only `text/plain`, currently. In such cases where the
client-provided content-type also does not specify a `charset`, we use
`chardet` to make a guess, and store that guess to provide later when
serving the content. The detection is done in a streaming fashion,
and thus should not require re-downloading the full content.
For resizing the icon.png files, we use resize_avatar, not resize_logo.
This is pretty confusing - sure, for icons we use the same function as
for avatars, but we should have a proper name for the function called in
the icon context. So this commit also adds resize_realm_icon, and
changes the calls to resize_avatar in icon contexts to
resize_realm_icon.
The Content-Type, Content-Disposition, StorageClass, and general
metadata are not set according to our patterns by tusd; copy the file
to itself to update those properties.
Setting `ResponseContentDisposition=attachment` means that we override
the stored `ContentDisposition`, which includes a filename. This
means that using the "Download" link on servers with S3 storage
produced a file named the sanitized version we stored.
Explicitly build a `ContentDisposition` to tell S3 to return, which
includes both `attachment` as well as the filename (if we have it
locally).
This allows finer-grained access control and auditing. The links
generated also expire after one week, and the suggested configuration
is that the underlying data does as well.
Co-authored-by: Prakhar Pratyush <prakhar@zulip.com>
The `get_signed_upload_url` code is called for every S3 file serve
request, and is thus in the hot path. The boto3 client caching
optimization is thus potentially useful as a performance optimization.
We may not always have trivial access to all of the bytes of the
uploaded file -- for instance, if the file was uploaded previously, or
by some other process. Downloading the entire image in order to check
its headers is an inefficient use of time and bandwidth.
Adjust `maybe_thumbnail` and dependencies to potentially take a
`pyvips.Source` which supports streaming data from S3 or disk. This
allows making the ImageAttachment row, if deemed appropriate, based on
only a few KB of data, and not the entire image.
A new table is created to track which path_id attachments are images,
and for those their metadata, and which thumbnails have been created.
Using path_id as the effective primary key lets us ignore if the
attachment is archived or not, saving some foreign key messes.
A new worker is added to observe events when rows are added to this
table, and to generate and store thumbnails for those images in
differing sizes and formats.
Hash the salt, user-id, and now avatar version into the filename.
This allows the URL contents to be immutable, and thus to be marked as
immutable and cacheable. Since avatars are served unauthenticated,
hashing with a server-side salt makes the current and past avatars not
enumerable.
This requires plumbing the current (or future) avatar version through
various parts of the upload process.
Since this already requires a full migration of current avatars, also
take the opportunity to fix the missing `.png` on S3 uploads (#12852).
We switch from SHA-1 to SHA-256, but truncate it such that avatar URL
data does not substantially increase in size.
Fixes: #12852.
boto3 has two different modalities of making API calls -- through
resources, and through clients. Resources are a higher-level
abstraction, and thus more generally useful, but some APIs are only
accessible through clients. It is possible to get to a client object
from a resource, but not vice versa.
Use `get_bucket(...).meta.client` when we need direct access to the
client object for more complex API calls; this lets all of the
configuration for how to access S3 to sit within `get_bucket`. Client
objects are not bound to only one bucket, but we get to them based on
the bucket we will be interacting with, for clarity.
We removed the cached session object, as it serves no real purpose.
e883ab057f started caching the boto client, which we had identified
as slow call. e883ab057f went further, calling
`get_boto_client().generate_presigned_url()` once and caching that
result.
This makes the inner cache on the client useless. Remove it.
Uploads are well-positioned to use S3's "intelligent tiering" storage
class. Add a setting to let uploaded files to declare their desired
storage class at upload time, and document how to move existing files
to the same storage class.
Actions like deleting realms may leave unreferenced uploads in the
attachment storage backend.
Fix these by walking the complete contents of the attachment storage
backend, and removing files which are no longer present in the
database. This may take quite some time, as it is necessarily O(n) in
the number of files uploaded to the system.