export: Add docs/conversion.md.

This is not hooked into anything yet. After this weekend's conversion, it will probably be worthwhile to take a second pass at it!
2025-11-03 13:33:24 +00:00 · 2016-08-12 08:21:20 -07:00
parent 90494cfec2
commit fd77585a8c
1 changed files with 246 additions and 0 deletions
--- a/docs/conversion.md
+++ b/docs/conversion.md
@@ -0,0 +1,246 @@
+# Exporting data
+
+## Overview
+
+Occasionally Zulip administrators will need to move data from one
+server to another.
+
+There are many major operational aspects to doing a conversion. I will
+list them here, noting that several are not within the scope of this
+document:
+
+- Get new servers running.
+- Export data from the old DB.
+- Export files from S3.
+- Import files into new storage.
+- Import data into new DB.
+- Restart new servers.
+- Decommission old server.
+
+This document focuses almost entirely on the **export** piece.  Issues
+with getting Zulip itself running are totally out of scope here.  For the
+import side of things, I only touch on it implicity.  (My reasoning is
+that we *have* to get the export piece right in a timely fashion, even
+if it means we have to sort out some straggling issues on the import side
+later.)
+
+## Export
+
+We have tools that essentially export Zulip data to the file system.
+
+A good overview of the process is here:
+[management/export.py](https://github.com/zulip/zulip/blob/master/zerver/management/commands/export.py)
+
+This document supplements that explanation, but here we focus more
+on the logistics of a big conversion.  For some historical perspective,
+this document was originally drafted as part of a big Zulip cut-over.
+
+The main exporting tools in place as of summer 2016 are below:
+
+- We can export single realms (but not yet limit users within the realm).
+- We can export single users (but then we get no realm-wide data in the process).
+- We can run exports simultaneously (but have to navigate a bunch of /tmp directories).
+
+Things that we still may need:
+- We may want to export multiple realms simultaneously.
+- We may want to export multiple single users simultaneously.
+- We may want to limit users within realm exports.
+- We may want more operational robustness/convenience while doing several exports simultaenously.
+- We may want to merge multiple export files to remove duplicates.
+
+We have a few major classes of data.  They are listed below in the order
+that we process them in `do_export_realm()`:
+
+#### Public Realm Data
+
+Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream.
+
+#### Cross Realm Data
+
+Client/zerver_userprofile_cross_realm
+
+This includes Client and three bots.
+
+Client is unique in being a fairly core table that is
+not tied to UserProfile or Realm (unless you somewhat painfully tie
+it back to users in a bottom-up fashion though other tables).
+
+#### Disjoint User Data
+
+UserProfile/UserActivity/UserActivityInterval/UserPresence.
+
+#### Recipient Data
+
+Recipient/Stream/Subscription/Huddle.
+
+These tables are tied back to users, but they introduce complications
+when you try to deal with multi-user subsets.
+
+#### File-related Data
+
+Attachment
+
+This includes Attachment, and it referencs the avatar_source field of
+UserProfile.  Most importantly, of course, it requires us to grab files
+from S3.  Finally, Attachment's m2m relationship ties to Message (although
+that's mostly managed at import time, not export time).
+
+#### Message Data
+
+Message/UserMessage
+
+### Summary
+
+Here are the same classes of data, listed in roughly
+decreasing order of riskiness:
+
+- Message Data (sheer volume/lack of time/security)
+- File-Related Data (S3/security/lots of moving parts)
+- Recipient Data (complexity/security/cross-realm considerations)
+- Cross Realm Data (duplicate ids)
+- Disjoint User Data
+- Public Realm Data
+
+(Note the above list is essentially in reverse order of how we
+process the data, which isn't surprising for a top-down approach.)
+
+The next section of the document talks about risk factors.
+
+# Risk Mitigation
+
+## Generic considerations
+
+We have two major mechanisms for getting data:
+
+##### Top Down
+
+Get realm data, then all users in realm, then all recipients, then all messages, etc.
+
+The problem with the top down approach will be **filtering**.  Also, if
+errors arise during top-down passes, it may be time consuming to re-run
+the processes.
+
+##### Bottom Up
+
+Start with users, get their recipient data, etc.
+
+The problems with the bottom up approach will be **merging**.  Also, if
+we run multiple bottom-up passes, there is the danger of duplicating some
+work, particularly on the message side of things.
+
+### Approved Transfers
+
+We have not yet integrated the approved-transfer model, which tells us
+which users can be moved.
+
+## Risk factors broken out by data categories
+
+### Message Data
+
+- models: Message/UserMessage.
+- assets: messages-*.json, subprocesses, partial files
+
+Rows in the Message model depend on Recipient/UserProfile.
+
+Rows in the UserMessage model depend on UserProfile/Message.
+
+The biggest concern here is the **sheer volume** of data, with
+security being a close second.  (They are interrelated, as without
+security concerns, we could just bulk-export everything one time.)
+
+We currently have these measures in place for top-down processing:
+- chunking
+- multi-processing
+- messages are filtered by both sender and recipient
+
+
+### File Related Data
+
+- models: Attachment
+- assets: S3, image files in avatars/, assorted files in uploads/, avatars/records.json, uploads/records.json, zerver_attachment_messages
+
+When it comes to exporting attachment data, we have some minor volume issues, but the
+main concern is just that there are **lots of moving parts**:
+
+- S3 needs to be up, and we get some metadata from it as well as files.
+- We have security concerns about copying over only files that belong to users who approved the transfer.
+- This piece is just different in how we store data from all the other DB-centric pieces.
+- At import time we have to populate the m2m table (but fortunately, this is pretty low
+  risk in terms of breaking anything.)
+
+### Recipient Data
+- models: Recipient/Stream/Subscription/Huddle
+- assets: realm.json, (user,stream,huddle)_(recipient,subscription)
+
+This data is fortunately low to medium in volume.  The risk here will come
+from **model complexity** and **cross-realm concerns**.
+
+From the top down, here are the dependencies:
+
+- Recipient depends on UserProfile
+- Subscription depends on Recipient
+- Stream currently depends on Realm (but maybe it should be tied to Subscription)
+- Huddle depends on Subscription and UserProfile
+
+The biggest risk factor here is probably just the possibility that we could introduce
+some bug in our code as we try to segment Recipient into user, stream, and huddle components,
+especially if we try to handle multiple users or realms.
+I think this can be largely mitigated by the new Config approach.
+
+And then we also have some complicated Huddle logic that will be customized
+regardless.  The fiddliest part
+of the Huddle logic is creating the set of unsafe_huddle_recipient_ids.
+
+Last but not least, if we go with some hybrid of bottom-up and top-down, these tables
+are neither close to the bottom nor close to the top, so they may have the most
+fiddly edge cases when it comes to filtering and merging.
+
+Recommendation #1: We probably want to get a backup of all this data that is very simply
+bulk-exported from the entire DB, and we should obviously put it in a secure place.
+
+Recommendation #2: Change our Stream export logic in `do_export_realm()` to be
+tied to subscriptions instead of hanging off the realm id.
+
+### Cross Realm Data
+- models: Client
+- assets: realm.json, three bots (notification/email/welcome), id_maps
+
+The good news here is that Client is a small table, and there are
+only three special bots.
+
+The bad news is that cross-realm data **complicates everything else**,
+and we have to avoid **database id conflicts**.
+
+If we use bottom-up approaches to load small user populations at a time, we may
+have **merging** issues here.  We will need to consolidate ids either by merging
+exports in /tmp or handle it import time.
+
+For the three bots, they live in zerver_userprofile_crossrealm, and we re-map
+their ids on the new server.
+
+Recommendation: Do not sweat the exports too much.  Deal with all the messiness at
+import time, and rely on the tables being really small.  We already have logic
+to catch Client.DoesNotExist exceptions, for example.  As for possibly missing
+messages that the welcome bot and friends have sent in the past, I am not sure
+what our risk profile is there, but I imagine it is relatively low.
+
+### Disjoint User Data
+- models: UserProfile/UserActivity/UserActivityInterval/UserPresence
+- assets: realm.json, password, api_key, avatar salt, id_maps
+
+On the DB side this data should be fairly easy to deal with.  All of these
+tables are basically disjoint by user profile id.  Our biggest
+risk is **remapped user ids** at import time, but this is mostly covered
+in the section above.
+
+We have code in place to exclude password and api_key from UserProfile
+rows.  The import process calls set_unusable_password().
+
+### Public Realm Data
+
+- models: Realm/RealmAlias/RealmEmoji/RealmFilter/DefaultStream
+- asserts: realm.json
+
+All of these tables are public (per-realm), and they are keyed by
+realm id.  There is not a ton to worry about here, except possibly
+**merging** if we run multiple bottom-up jobs for a single realm.