# Exporting data

## Overview

Occasionally Zulip administrators will need to move data from one
server to another.

There are many major operational aspects to doing a conversion.  I will
list them here, noting that several are not within the scope of this
document:

- Get new servers running.
- Export data from the old DB.
- Export files from S3.
- Import files into new storage.
- Import data into new DB.
- Restart new servers.
- Decommission old server.

This document focuses almost entirely on the **export** piece.  Issues
with getting Zulip itself running are totally out of scope here.  I only
touch on the import side of things implicitly.  (My reasoning is
that we *have* to get the export piece right in a timely fashion, even
if it means we have to sort out some straggling issues on the import side
later.)

## Export

We have tools that essentially export Zulip data to the file system.

A good overview of the process is here:
[management/export.py](https://github.com/zulip/zulip/blob/master/zerver/management/commands/export.py)

This document supplements that explanation, but here we focus more
on the logistics of a big conversion.  For some historical perspective,
this document was originally drafted as part of a big Zulip cut-over.

The main exporting tools in place as of summer 2016 are below:

- We can export single realms (but not yet limit users within the realm).
- We can export single users (but then we get no realm-wide data in the process).
- We can run exports simultaneously (but have to navigate a bunch of /tmp directories).
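
For orientation, a single-realm export boils down to something like the
sketch below.  This is hedged: the `get_realm()` lookup and the exact
`do_export_realm()` signature are illustrative and may not match the
current code, so treat the management command linked above as
authoritative.

```python
# Hedged sketch of what a single-realm export boils down to; the real
# driver is zerver/management/commands/export.py, and these signatures
# are illustrative, not guaranteed to match the current code.
import tempfile

from zerver.lib.export import do_export_realm
from zerver.models import get_realm

realm = get_realm("example.com")  # hypothetical realm identifier
output_dir = tempfile.mkdtemp(prefix="zulip-export-")

# Writes realm.json, messages-*.json, etc. under output_dir.
do_export_realm(realm, output_dir, threads=6)
```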

Things that we still may need:

- We may want to export multiple realms simultaneously.
- We may want to export multiple single users simultaneously.
- We may want to limit users within realm exports.
- We may want more operational robustness/convenience while doing several exports simultaneously.
- We may want to merge multiple export files to remove duplicates.

We have a few major classes of data.  They are listed below in the order
that we process them in `do_export_realm()`:

#### Public Realm Data

Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream.

#### Cross Realm Data

Client/zerver_userprofile_cross_realm

This includes Client and three bots.

Client is unique in being a fairly core table that is
not tied to UserProfile or Realm (unless you somewhat painfully tie
it back to users in a bottom-up fashion through other tables).

#### Disjoint User Data

UserProfile/UserActivity/UserActivityInterval/UserPresence.

#### Recipient Data

Recipient/Stream/Subscription/Huddle.

These tables are tied back to users, but they introduce complications
when you try to deal with multi-user subsets.

#### File-related Data

Attachment

This includes Attachment, and it references the avatar_source field of
UserProfile.  Most importantly, of course, it requires us to grab files
from S3.  Finally, Attachment's m2m relationship ties to Message (although
that's mostly managed at import time, not export time).

#### Message Data

Message/UserMessage

### Summary

Here are the same classes of data, listed in roughly
decreasing order of riskiness:

- Message Data (sheer volume/lack of time/security)
- File-Related Data (S3/security/lots of moving parts)
- Recipient Data (complexity/security/cross-realm considerations)
- Cross Realm Data (duplicate ids)
- Disjoint User Data
- Public Realm Data

(Note that the above list is essentially in reverse order of how we
process the data, which isn't surprising for a top-down approach.)

The next section of the document talks about risk factors.

# Risk Mitigation

## Generic considerations

We have two major mechanisms for getting data:

##### Top Down

Get realm data, then all users in the realm, then all recipients, then all messages, etc.

The problem with the top-down approach will be **filtering**.  Also, if
errors arise during top-down passes, it may be time-consuming to re-run
the processes.

##### Bottom Up

Start with users, get their recipient data, etc.

The problem with the bottom-up approach will be **merging**.  Also, if
we run multiple bottom-up passes, there is the danger of duplicating some
work, particularly on the message side of things.
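
To make the two directions concrete, here is an illustrative sketch in
Django ORM terms, using the models discussed in this document.  It is
shorthand for the idea, not the actual export code.

```python
# Illustrative shorthand for the two traversal directions; not the
# actual export code.
from zerver.models import Message, Subscription, UserProfile

def top_down(realm):
    # Start from the realm and fan out to everything inside it; the
    # hard part is then *filtering* out rows we may not move.
    users = UserProfile.objects.filter(realm=realm)
    subscriptions = Subscription.objects.filter(user_profile__in=users)
    messages = Message.objects.filter(sender__in=users)
    return users, subscriptions, messages

def bottom_up(user_ids):
    # Start from an approved set of users and work upward; the hard
    # part is then *merging* the results of multiple such passes.
    users = UserProfile.objects.filter(id__in=user_ids)
    subscriptions = Subscription.objects.filter(user_profile__in=users)
    return users, subscriptions
```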

### Approved Transfers

We have not yet integrated the approved-transfer model, which tells us
which users can be moved.

## Risk factors broken out by data categories

### Message Data

- models: Message/UserMessage
- assets: messages-*.json, subprocesses, partial files

Rows in the Message model depend on Recipient/UserProfile.

Rows in the UserMessage model depend on UserProfile/Message.

The biggest concern here is the **sheer volume** of data, with
security being a close second.  (They are interrelated, as without
security concerns, we could just bulk-export everything one time.)

We currently have these measures in place for top-down processing
(see the sketch after this list):

- chunking
- multi-processing
- messages are filtered by both sender and recipient
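
A hedged sketch of what those measures look like in combination (minus
the multi-processing, which the real code layers on top); the chunk
size and file naming here are illustrative:

```python
# Hedged sketch of chunked, filtered message export; chunk size and
# file naming are illustrative, and the real code also runs chunks in
# parallel worker processes.
import json
import os

from django.db.models import Q
from zerver.models import Message

MESSAGE_BATCH_CHUNK_SIZE = 1000  # illustrative

def export_messages(user_ids, recipient_ids, output_dir):
    # Filter by both sender and recipient, per the list above.
    query = Message.objects.filter(
        Q(sender_id__in=user_ids) | Q(recipient_id__in=recipient_ids)
    ).order_by("id")

    min_id = -1
    chunk_num = 0
    while True:
        chunk = list(query.filter(id__gt=min_id)[:MESSAGE_BATCH_CHUNK_SIZE])
        if not chunk:
            break
        path = os.path.join(output_dir, "messages-%06d.json" % (chunk_num,))
        with open(path, "w") as f:
            # The real code serializes full rows, not just ids.
            json.dump([message.id for message in chunk], f)
        min_id = chunk[-1].id
        chunk_num += 1
```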

### File Related Data

- models: Attachment
- assets: S3, image files in avatars/, assorted files in uploads/, avatars/records.json, uploads/records.json, zerver_attachment_messages

When it comes to exporting attachment data, we have some minor volume issues, but the
main concern is just that there are **lots of moving parts**:

- S3 needs to be up, and we get some metadata from it as well as files.
- We have security concerns about copying over only files that belong to users who approved the transfer.
- This piece just stores its data differently from all the other DB-centric pieces.
- At import time we have to populate the m2m table (but fortunately, this is pretty low
  risk in terms of breaking anything; see the sketch after this list).
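
Populating that m2m table is conceptually just re-linking each
attachment to its messages once message ids have been remapped; a
hedged sketch, where the `message_id_map` plumbing is hypothetical:

```python
# Hedged sketch of re-linking an Attachment to its (re-mapped)
# messages at import time; message_id_map is hypothetical plumbing.
from zerver.models import Attachment, Message

def relink_attachment(attachment_id, old_message_ids, message_id_map):
    attachment = Attachment.objects.get(id=attachment_id)
    new_ids = [message_id_map[old_id] for old_id in old_message_ids]
    # Populates rows in the zerver_attachment_messages m2m table.
    attachment.messages.add(*Message.objects.filter(id__in=new_ids))
```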

### Recipient Data

- models: Recipient/Stream/Subscription/Huddle
- assets: realm.json, (user,stream,huddle)_(recipient,subscription)

This data is fortunately low to medium in volume.  The risk here will come
from **model complexity** and **cross-realm concerns**.

From the top down, here are the dependencies:

- Recipient depends on UserProfile
- Subscription depends on Recipient
- Stream currently depends on Realm (but maybe it should be tied to Subscription)
- Huddle depends on Subscription and UserProfile

The biggest risk factor here is probably just the possibility that we could introduce
some bug in our code as we try to segment Recipient into user, stream, and huddle components,
especially if we try to handle multiple users or realms.
I think this can be largely mitigated by the new Config approach.

And then we also have some complicated Huddle logic that will be customized
regardless.  The fiddliest part of the Huddle logic is creating the set of
`unsafe_huddle_recipient_ids`.
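
The idea behind that set is roughly: a huddle recipient is "unsafe" if
any of its participants fall outside the set of users being exported.
An illustrative sketch (the function name echoes the set above;
everything else is hypothetical):

```python
# Illustrative: a huddle recipient is "unsafe" if any participant is
# outside the set of users we are exporting.
from zerver.models import Recipient, Subscription

def get_unsafe_huddle_recipient_ids(exported_user_ids):
    unsafe_ids = set()
    huddle_subs = Subscription.objects.filter(
        recipient__type=Recipient.HUDDLE
    ).values_list("recipient_id", "user_profile_id")
    for recipient_id, user_profile_id in huddle_subs:
        if user_profile_id not in exported_user_ids:
            unsafe_ids.add(recipient_id)
    return unsafe_ids
```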

Last but not least, if we go with some hybrid of bottom-up and top-down, these tables
are neither close to the bottom nor close to the top, so they may have the most
fiddly edge cases when it comes to filtering and merging.

Recommendation #1: We probably want to get a backup of all this data that is very simply
bulk-exported from the entire DB, and we should obviously put it in a secure place.

Recommendation #2: Change our Stream export logic in `do_export_realm()` to be
tied to subscriptions instead of hanging off the realm id.
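
Concretely, Recommendation #2 amounts to a query change along these
lines (illustrative only; `realm` and `users` are assumed to already be
in scope):

```python
# Illustrative query change for Recommendation #2; `realm` and `users`
# are assumed to already be in scope.
from zerver.models import Recipient, Stream, Subscription

# Current approach: everything hangs off the realm id.
streams = Stream.objects.filter(realm=realm)

# Recommended approach: only streams that exported users are
# actually subscribed to.
subscribed_stream_ids = Subscription.objects.filter(
    user_profile__in=users,
    recipient__type=Recipient.STREAM,
).values_list("recipient__type_id", flat=True)
streams = Stream.objects.filter(id__in=subscribed_stream_ids)
```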

### Cross Realm Data

- models: Client
- assets: realm.json, three bots (notification/email/welcome), id_maps

The good news here is that Client is a small table, and there are
only three special bots.

The bad news is that cross-realm data **complicates everything else**,
and we have to avoid **database id conflicts**.

If we use bottom-up approaches to load small user populations at a time, we may
have **merging** issues here.  We will need to consolidate ids either by merging
exports in /tmp or by handling it at import time.

As for the three bots, they live in zerver_userprofile_crossrealm, and we re-map
their ids on the new server.
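
Handling it at import time boils down to keeping a map from old ids to
newly allocated ones and rewriting foreign keys as rows load.  A hedged
sketch of the idea (the import code keeps similar id_maps structures,
but the helpers here are hypothetical):

```python
# Hedged sketch of id remapping at import time; the helpers here are
# hypothetical, though the import code keeps similar id_maps structures.
id_maps = {
    "client": {},        # old Client id -> new Client id
    "user_profile": {},  # old UserProfile id -> new UserProfile id
}

def record_new_id(table, old_id, new_id):
    id_maps[table][old_id] = new_id

def fix_foreign_key(row, field, table):
    # Rewrite row[field] from the old server's id space to the new one.
    row[field] = id_maps[table][row[field]]
```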

Recommendation: Do not sweat the exports too much.  Deal with all the messiness at
import time, and rely on the tables being really small.  We already have logic
to catch `Client.DoesNotExist` exceptions, for example.  As for possibly missing
messages that the welcome bot and friends have sent in the past, I am not sure
what our risk profile is there, but I imagine it is relatively low.

### Disjoint User Data

- models: UserProfile/UserActivity/UserActivityInterval/UserPresence
- assets: realm.json, password, api_key, avatar salt, id_maps

On the DB side this data should be fairly easy to deal with.  All of these
tables are basically disjoint by user profile id.  Our biggest
risk is **remapped user ids** at import time, but this is mostly covered
in the section above.

We have code in place to exclude password and api_key from UserProfile
rows.  The import process calls `set_unusable_password()`.
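
A minimal sketch of that scrubbing, assuming the export works on
dict-shaped rows (the field list is from the assets above;
`set_unusable_password()` is the standard Django method):

```python
# Minimal sketch of scrubbing secrets from exported UserProfile rows,
# assuming dict-shaped rows; field list per the assets above.
EXCLUDED_USER_PROFILE_FIELDS = ["password", "api_key"]

def scrub_user_profile_row(row):
    for field in EXCLUDED_USER_PROFILE_FIELDS:
        row.pop(field, None)
    return row

# At import time, after creating each user:
#     user_profile.set_unusable_password()
#     user_profile.save(update_fields=["password"])
```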

### Public Realm Data

- models: Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream
- assets: realm.json

All of these tables are public (per-realm), and they are keyed by
realm id.  There is not a ton to worry about here, except possibly
**merging** if we run multiple bottom-up jobs for a single realm.