mirror of
https://github.com/zulip/zulip.git
synced 2025-11-13 18:36:36 +00:00
Added user and realm export guidance in production maintenance docs, linked to conversion guide, and revamped the introduction and styled the text that Steve wrote.
# Exporting data from a large multi-realm Zulip server

## Draft status

This is a draft design document considering potential future
refinements and improvements to make large migrations easier going
forward, and is not yet a set of recommendations for Zulip systems
administrators to follow.

## Overview

Zulip offers an export tool, `management/export.py`, which works well
to export the data for a single Zulip realm, and which is your best
choice if you're migrating a Zulip realm to a new server.

This document supplements the explanation in `management/export.py`,
but here we focus more on the logistics of a big conversion of a
multi-realm Zulip installation. (For some historical perspective, this
document was originally begun as part of a big Zulip cut-over in
summer 2016.)

There are many major operational aspects to doing a conversion. I will
list them here, noting that several are not within the scope of this
document:

- Get new servers running.
- Export data from the old DB.
- Export files from Amazon S3.
- Import files into new storage.
- Import data into new DB.
- Restart new servers.
- Decommission old server.

This document focuses almost entirely on the **export** piece. Issues
with getting Zulip itself running are out of scope here; see [the
production installation instructions](index.html#prod-install-docs).
As for the import side of things, we only touch on it implicitly. (My
reasoning was that we *had* to get the export piece right in a timely
fashion, even if it meant we would have to sort out some straggling
issues on the import side later.)

## Exporting multiple realms' data when moving to a new server

The main exporting tools in place as of summer 2016 are below:

- We can export single realms (but not yet limit users within the
  realm).
- We can export single users (but then we get no realm-wide data in
  the process).
- We can run exports simultaneously (but have to navigate a bunch of
  /tmp directories).

Things that we still may need:

- We may want to export multiple realms simultaneously.
- We may want to export multiple single users simultaneously.
- We may want to limit users within realm exports.
- We may want more operational robustness/convenience while doing
  several exports simultaneously.
- We may want to merge multiple export files to remove duplicates.

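
The last item, merging export files to remove duplicates, could look
roughly like the sketch below. It assumes each export is a dict
mapping table names to lists of row dicts with an `id` key; the helper
name `merge_exports` and the sample rows are hypothetical and not part
of Zulip.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Export = Dict[str, List[Row]]


def merge_exports(exports: List[Export]) -> Export:
    """Merge several exports, keeping the first row seen for each (table, id)."""
    merged: Dict[str, Dict[Any, Row]] = {}
    for export in exports:
        for table, rows in export.items():
            seen = merged.setdefault(table, {})
            for row in rows:
                # Assumption: rows sharing a table and id are identical copies.
                seen.setdefault(row["id"], row)
    # Sort rows by id so the merged output is deterministic.
    return {table: [rows[k] for k in sorted(rows)] for table, rows in merged.items()}


# Hypothetical sample data from two overlapping exports.
export_a = {"zerver_client": [{"id": 1, "name": "website"}, {"id": 2, "name": "API"}]}
export_b = {"zerver_client": [{"id": 2, "name": "API"}, {"id": 3, "name": "mobile"}]}
merged = merge_exports([export_a, export_b])
print([row["id"] for row in merged["zerver_client"]])
```

Keying on `(table, id)` only works if IDs have not been remapped
between the exports being merged.
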
We have a few major classes of data. They are listed below in the order
that we process them in `do_export_realm()`:

#### Public Realm Data

`Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`.

#### Cross Realm Data

`Client/zerver_userprofile_crossrealm`

This includes `Client` and three bots.

`Client` is unique in being a fairly core table that is not tied to
`UserProfile` or `Realm` (unless you somewhat painfully tie it back to
users in a bottom-up fashion through other tables).

#### Disjoint User Data

`UserProfile/UserActivity/UserActivityInterval/UserPresence`.

#### Recipient Data

`Recipient/Stream/Subscription/Huddle`.

These tables are tied back to users, but they introduce complications
when you try to deal with multi-user subsets.

#### File-related Data

`Attachment`

This includes `Attachment`, and it references the `avatar_source` field
of `UserProfile`. Most importantly, of course, it requires us to grab
files from S3. Finally, `Attachment`'s `m2m` relationship ties to
`Message`.

#### Message Data

`Message/UserMessage`

### Summary

Here are the same classes of data, listed in roughly
decreasing order of riskiness:

- Message Data (sheer volume/lack of time/security)
- File-Related Data (S3/security/lots of moving parts)
- Recipient Data (complexity/security/cross-realm considerations)
- Cross Realm Data (duplicate ids)
- Disjoint User Data
- Public Realm Data

(Note the above list is essentially in reverse order of how we
process the data, which isn't surprising for a top-down approach.)

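
To make the "essentially in reverse order" remark concrete, the two
orderings can be written down and compared. This is only an
illustration built from the lists in this document; the constant names
are hypothetical.

```python
# Processing order, as listed for do_export_realm() in this document.
PROCESSING_ORDER = [
    ("Public Realm Data", ["Realm", "RealmAlias", "RealmEmoji", "RealmFilter", "DefaultStream"]),
    ("Cross Realm Data", ["Client"]),
    ("Disjoint User Data", ["UserProfile", "UserActivity", "UserActivityInterval", "UserPresence"]),
    ("Recipient Data", ["Recipient", "Stream", "Subscription", "Huddle"]),
    ("File-related Data", ["Attachment"]),
    ("Message Data", ["Message", "UserMessage"]),
]

# Risk order, riskiest first, as listed in the summary.
RISK_ORDER = [
    "Message Data",
    "File-related Data",
    "Recipient Data",
    "Cross Realm Data",
    "Disjoint User Data",
    "Public Realm Data",
]

reversed_processing = [name for name, _models in reversed(PROCESSING_ORDER)]
# Categories whose risk rank differs from their reversed processing rank.
out_of_place = [
    risk for risk, processed in zip(RISK_ORDER, reversed_processing) if risk != processed
]
print(out_of_place)
```

The only deviation from a strict reversal is that Cross Realm Data
ranks as riskier than Disjoint User Data even though it is processed
earlier.
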
The next section of the document talks about risk factors.

# Risk Mitigation

## Generic considerations

We have two major mechanisms for getting data:

##### Top Down

Get realm data, then all users in realm, then all recipients, then all
messages, etc.

The problem with the top down approach will be **filtering**. Also,
if errors arise during top-down passes, it may be time consuming to
re-run the processes.

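
A minimal sketch of a top-down pass, using in-memory tables rather
than the real Django models, shows where filtering enters. The table
layout, the `approved_user_ids` parameter, and the function name are
assumptions for illustration; recipients are omitted for brevity.

```python
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Any]
Tables = Dict[str, List[Row]]


def export_top_down(
    db: Tables, realm_id: int, approved_user_ids: Optional[Set[int]] = None
) -> Tables:
    """Walk realm -> users -> messages, optionally limiting the users."""
    realms = [r for r in db["realm"] if r["id"] == realm_id]
    users = [u for u in db["userprofile"] if u["realm_id"] == realm_id]
    if approved_user_ids is not None:
        # The filtering step: every table below the user level must honor it.
        users = [u for u in users if u["id"] in approved_user_ids]
    user_ids = {u["id"] for u in users}
    messages = [m for m in db["message"] if m["sender_id"] in user_ids]
    return {"realm": realms, "userprofile": users, "message": messages}


# Hypothetical toy database with two realms.
db = {
    "realm": [{"id": 1}, {"id": 2}],
    "userprofile": [
        {"id": 10, "realm_id": 1},
        {"id": 11, "realm_id": 1},
        {"id": 20, "realm_id": 2},
    ],
    "message": [
        {"id": 100, "sender_id": 10},
        {"id": 101, "sender_id": 11},
        {"id": 102, "sender_id": 20},
    ],
}
whole_realm = export_top_down(db, realm_id=1)
filtered = export_top_down(db, realm_id=1, approved_user_ids={10})
print([m["id"] for m in whole_realm["message"]], [m["id"] for m in filtered["message"]])
```
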
##### Bottom Up

Start with users, get their recipient data, etc.

The problems with the bottom up approach will be **merging**. Also,
if we run multiple bottom-up passes, there is the danger of
duplicating some work, particularly on the message side of things.

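
The duplicated work can be illustrated with a toy example. The rows
below are hypothetical `UserMessage`-style `(user_id, message_id)`
pairs, and `export_bottom_up` is an illustrative name, not an existing
function.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical UserMessage-style rows: (user_id, message_id).
USER_MESSAGES: List[Tuple[int, int]] = [(10, 100), (10, 101), (11, 101), (11, 102)]


def export_bottom_up(user_id: int) -> Set[int]:
    """Return the message ids reachable from a single user."""
    return {message_id for uid, message_id in USER_MESSAGES if uid == user_id}


# One bottom-up pass per user.
passes: Dict[int, Set[int]] = {uid: export_bottom_up(uid) for uid in (10, 11)}

# Merging step: the union drops messages that more than one pass fetched.
merged: Set[int] = set().union(*passes.values())
duplicated = sum(len(ids) for ids in passes.values()) - len(merged)
print(sorted(merged), duplicated)
```

Here message 101 is fetched by both passes, so one of the four
fetched rows is redundant and has to be dropped when merging.
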
### Approved Transfers

We have not yet integrated the approved-transfer model, which tells us
which users can be moved.

## Risk factors broken out by data categories

### Message Data

- models: `Message`/`UserMessage`.
- assets: `messages-*.json`, subprocesses, partial files

Rows in the `Message` model depend on `Recipient/UserProfile`.

Rows in the `UserMessage` model depend on `UserProfile/Message`.

The biggest concern here is the **sheer volume** of data, with
security being a close second. (They are interrelated, as without
security concerns, we could just bulk-export everything one time.)

We currently have these measures in place for top-down processing:

- chunking
- multi-processing
- messages are filtered by both sender and recipient

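
As a rough illustration of the chunking and filtering measures, the
sketch below splits messages into fixed-size chunks and filters each
chunk. It assumes that "filtered by both sender and recipient" means a
message is kept only when both IDs are in the exported sets; the
function names and row layout are hypothetical. Multi-processing is
left out, but each chunk is independent and could be handed to a
separate worker process.

```python
from typing import Any, Dict, Iterator, List, Set

Row = Dict[str, Any]


def chunks(rows: List[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def filter_chunk(chunk: List[Row], user_ids: Set[int], recipient_ids: Set[int]) -> List[Row]:
    """Keep a message only if both its sender and its recipient are exportable."""
    return [
        m for m in chunk
        if m["sender_id"] in user_ids and m["recipient_id"] in recipient_ids
    ]


# Hypothetical message rows.
messages = [
    {"id": 1, "sender_id": 10, "recipient_id": 500},
    {"id": 2, "sender_id": 10, "recipient_id": 999},  # recipient not exported
    {"id": 3, "sender_id": 77, "recipient_id": 500},  # sender not exported
    {"id": 4, "sender_id": 11, "recipient_id": 501},
    {"id": 5, "sender_id": 11, "recipient_id": 500},
]
exported = [
    filter_chunk(chunk, user_ids={10, 11}, recipient_ids={500, 501})
    for chunk in chunks(messages, chunk_size=2)
]
print([[m["id"] for m in chunk] for chunk in exported])
```
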
### File Related Data

- models: `Attachment`
- assets: S3, `attachment.json`, `uploads-temp/`, image files in
  `avatars/`, assorted files in `uploads/`, `avatars/records.json`,
  `uploads/records.json`, `zerver_attachment_messages`

When it comes to exporting attachment data, we have some minor volume
issues, but the main concern is just that there are **lots of moving
parts**:

- S3 needs to be up, and we get some metadata from it as well as
  files.
- We have security concerns about copying over only files that belong
  to users who approved the transfer.
- This piece is just different in how we store data from all the other
  DB-centric pieces.
- At import time we have to populate the `m2m` table (but fortunately,
  this is pretty low risk in terms of breaking anything).

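
Two of the points above, limiting files to approved users and
populating the `m2m` table, can be sketched on plain dicts. The field
names `owner_id` and `messages`, the `attachment_id`/`message_id`
column names, and both function names are assumptions for
illustration; no S3 access is shown.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def approved_attachments(attachments: List[Row], approved_user_ids: Set[int]) -> List[Row]:
    """Keep only attachments whose owner approved the transfer."""
    return [a for a in attachments if a["owner_id"] in approved_user_ids]


def m2m_rows(attachments: List[Row], exported_message_ids: Set[int]) -> List[Row]:
    """Build attachment-to-message link rows for messages that were exported."""
    return [
        {"attachment_id": a["id"], "message_id": message_id}
        for a in attachments
        for message_id in a["messages"]
        if message_id in exported_message_ids
    ]


# Hypothetical attachment rows.
attachments = [
    {"id": 1, "owner_id": 10, "path_id": "a.png", "messages": [100, 101]},
    {"id": 2, "owner_id": 20, "path_id": "b.pdf", "messages": [102]},
]
kept = approved_attachments(attachments, approved_user_ids={10})
links = m2m_rows(kept, exported_message_ids={100, 102})
print([a["id"] for a in kept], links)
```
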
### Recipient Data

- models: `Recipient/Stream/Subscription/Huddle`
- assets: `realm.json`, `(user,stream,huddle)_(recipient,subscription)`

This data is fortunately low to medium in volume. The risk here will
come from **model complexity** and **cross-realm concerns**.

From the top down, here are the dependencies:

- `Recipient` depends on `UserProfile`
- `Subscription` depends on `Recipient`
- `Stream` currently depends on `Realm` (but maybe it should be tied
  to `Subscription`)
- `Huddle` depends on `Subscription` and `UserProfile`

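
The dependencies listed above imply an order in which the tables can
be processed. As a small illustration (not how the export tool itself
orders its work), Python's standard `graphlib` module (3.9+) can
compute one valid order from them:

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it depends on, as listed above.
DEPENDENCIES = {
    "Recipient": {"UserProfile"},
    "Subscription": {"Recipient"},
    "Stream": {"Realm"},
    "Huddle": {"Subscription", "UserProfile"},
}

# static_order() yields every table after all of its dependencies.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

Several orders satisfy these constraints, so the exact output may
vary; the only guarantee is that each table appears after everything
it depends on.
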
The biggest risk factor here is probably just the possibility that we
could introduce some bug in our code as we try to segment `Recipient`
into user, stream, and huddle components, especially if we try to
handle multiple users or realms. I think this can be largely
mitigated by the new `Config` approach.

And then we also have some complicated `Huddle` logic that will be
customized regardless. The fiddliest part of the `Huddle` logic is
creating the set of `unsafe_huddle_recipient_ids`.

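
One plausible reading of `unsafe_huddle_recipient_ids` is "huddles
with at least one participant outside the set of exported users". The
sketch below implements that reading on plain subscription rows; the
row layout and function name are assumptions, and the real logic may
differ.

```python
from typing import Dict, List, Set


def find_unsafe_huddle_recipient_ids(
    huddle_subs: List[Dict[str, int]], exported_user_ids: Set[int]
) -> Set[int]:
    """Return recipient ids of huddles that include a user we are not exporting."""
    return {
        sub["recipient_id"]
        for sub in huddle_subs
        if sub["user_profile_id"] not in exported_user_ids
    }


# Hypothetical huddle subscriptions: huddle 900 is internal, 901 includes user 20.
huddle_subs = [
    {"recipient_id": 900, "user_profile_id": 10},
    {"recipient_id": 900, "user_profile_id": 11},
    {"recipient_id": 901, "user_profile_id": 10},
    {"recipient_id": 901, "user_profile_id": 20},
]
unsafe = find_unsafe_huddle_recipient_ids(huddle_subs, exported_user_ids={10, 11})
safe = {sub["recipient_id"] for sub in huddle_subs} - unsafe
print(unsafe, safe)
```
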
Last but not least, if we go with some hybrid of bottom-up and
top-down, these tables are neither close to the bottom nor close to
the top, so they may have the most fiddly edge cases when it comes to
filtering and merging.

Recommendation: We probably want to get a backup of all this data that
is very simply bulk-exported from the entire DB, and we should
obviously put it in a secure place.

### Cross Realm Data

- models: `Client`
- assets: `realm.json`, three bots (`notification`/`email`/`welcome`),
  `id_maps`

The good news here is that `Client` is a small table, and there are
only three special bots.

The bad news is that cross-realm data **complicates everything else**,
and we have to avoid **database ID conflicts**.

If we use bottom-up approaches to load small user populations at a
time, we may have **merging** issues here. We will need to
consolidate IDs either by merging exports in `/tmp` or by handling it
at import time.

The three bots live in `zerver_userprofile_crossrealm`, and we re-map
their IDs on the new server.

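
The re-mapping of bot IDs can be sketched as a lookup by email
address. This assumes each cross-realm bot row carries an `id` and an
`email`, and that the bots already exist on the new server; the
addresses and function names below are placeholders.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def build_bot_id_map(old_bots: List[Row], new_bots: List[Row]) -> Dict[int, int]:
    """Map each old bot id to the id of the bot with the same email on the new server."""
    new_id_by_email = {bot["email"]: bot["id"] for bot in new_bots}
    return {bot["id"]: new_id_by_email[bot["email"]] for bot in old_bots}


def remap_sender_ids(messages: List[Row], id_map: Dict[int, int]) -> List[Row]:
    """Return copies of the messages with sender ids translated where a mapping exists."""
    return [{**m, "sender_id": id_map.get(m["sender_id"], m["sender_id"])} for m in messages]


# Placeholder bot rows; the email addresses are made up for this example.
old_bots = [
    {"id": 1, "email": "notification-bot@example.com"},
    {"id": 2, "email": "email-bot@example.com"},
    {"id": 3, "email": "welcome-bot@example.com"},
]
new_bots = [
    {"id": 7, "email": "notification-bot@example.com"},
    {"id": 8, "email": "email-bot@example.com"},
    {"id": 9, "email": "welcome-bot@example.com"},
]
id_map = build_bot_id_map(old_bots, new_bots)
messages = [{"id": 100, "sender_id": 3}, {"id": 101, "sender_id": 42}]
print([m["sender_id"] for m in remap_sender_ids(messages, id_map)])
```
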
Recommendation: Do not sweat the exports too much. Deal with all the
messiness at import time, and rely on the tables being really small.
We already have logic to catch `Client.DoesNotExist` exceptions, for
example. As for possibly missing messages that the welcome bot and
friends have sent in the past, I am not sure what our risk profile is
there, but I imagine it is relatively low.

### Disjoint User Data

- models: `UserProfile/UserActivity/UserActivityInterval/UserPresence`
- assets: `realm.json`, `password`, `api_key`, `avatar salt`,
  `id_maps`

On the DB side this data should be fairly easy to deal with. All of
these tables are basically disjoint by user profile ID. Our biggest
risk is **remapped user ids** at import time, but this is mostly
covered in the section above.

We have code in place to exclude `password` and `api_key` from
`UserProfile` rows. The import process calls
`set_unusable_password()`.

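
Excluding the sensitive fields amounts to dropping two keys from each
exported row. The sketch below shows the idea on plain dicts; it is
not the actual export code, and the names `EXCLUDED_FIELDS` and
`scrub_user_row` are hypothetical.

```python
from typing import Any, Dict

# Fields that must never leave the old server, per the paragraph above.
EXCLUDED_FIELDS = {"password", "api_key"}


def scrub_user_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a UserProfile-style row without the excluded fields."""
    return {key: value for key, value in row.items() if key not in EXCLUDED_FIELDS}


# Hypothetical row; the values are placeholders.
row = {"id": 10, "email": "user@example.com", "password": "hash", "api_key": "secret"}
print(scrub_user_row(row))
```

Because the exported rows carry no password, the import side has to
mark each account as having no usable password, which is what the
`set_unusable_password()` call does.
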
### Public Realm Data

- models: `Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream`
- assets: `realm.json`

All of these tables are public (per-realm), and they are keyed by
realm ID. There is not a ton to worry about here, except possibly
**merging** if we run multiple bottom-up jobs for a single realm.