Friday, September 13, 2013

IPython Notebooks in the Cloud with Realtime Synchronization and Support for Collaborators

I spent the last two weeks implementing hosted IPython notebooks with sync for https://cloud.sagemath.com. Initially I had just plan to simplify the port forwarding setup, since using multiple forward and reverse port forwards seemed complicated. But then I became concerned about multiple users (or users with multiple browsers) overwriting each other's notebooks; this is a real possibility, since projects are frequently shared between multiple people, and everything else does realtime sync. I had planned just to add some very minimal merge-on-save functionality to avoid major issues, but somehow got sucked into implementing full realtime sync (even with the other person's cursor showing).

Here's how to try it out


  • Go to https://cloud.sagemath.com and make an account; this is a free service hosted on computers at University of Washington.
  • Create a new project.
  • Click +New, then click "IPython"; alternatively, paste in a link to an IPython notebook (e.g., anything here http://nbviewer.ipython.org/ -- you might need to get the actual link to the ipynb file itself!), or upload a file. 
  • An IPython notebook server will start, the given .ipynb file should load in a same-domain iframe, and then some of the ipython notebook code is and iframe contents are monkey patched, in order to support sync and better integration with https://cloud.sagemath.com.
  • Open the ipynb file in multiple browsers, and see that changes in one appear in the other, including moving cells around, creating new cells, editing markdown (the rendered version appears elsewhere), etc.
Since this is all very new and the first (I guess) realtime sync implementation on top of IPython, there are probably a lot of issues. Note that if you click the "i" info button to the right, you'll get a link to the standard IPython notebook server dashboard.

IPython development

Regarding the monkey patching mentioned above, the right thing to do would be to explain exactly what hooks/changes in the IPython html client I need in order to do sync, etc., make sure these makes sense to the IPython devs, and send a pull request. As an example, in order to do sync efficiently, I have to be able to set a given cell from JSON -- it's critical to do this in place when possible, since the overhead of creating a new cell is huge (due probably to the overhead of creating CodeMirror editors); however, the fromJSON method in IPython assumes that the cell is brand new -- it would be nice to add an option to make a cell fromJSON without assuming it is empty. The ultimate outcome of this could be a clean well-defined way of doing sync for IPython notebooks using any third-party sync implementation. IPython might provide their own sync service and there are starting to be others available these days -- e.g., Google has one, and maybe Guido van Rosum helped write one for Dropbox recently?

How it works

Earlier this year, I implemented Neil Fraser's differential synchronization algorithm, since I needed it for file and Sage worksheet editing in https://cloud.sagemath.com. There are many approaches to realtime synchronization, and Fraser makes a good argument for his.  For example, Google Wave involved a different approach (Operational Transforms), whereas Google Drive/Docs uses Fraser's approach (and code -- he works at Google), and you can see which succeeded. The main idea of his approach is eventually stable iterative process that involves heuristically making and applying patches on a "best effort" basis; it allows for all live versions of the document to be modified simultaneously -- the only locking is during the moment when a patch is applied to the live document. He also explains how to handle packet loss gracefully. I did a complete implementation from scratch (except for using the beautiful Google diff/patch/match library). There might be a Python implementation of the algorithm as part of mobwrite.

The hardest part of this project was using Fraser's algorithm, which is designed for unstructured text documents, to deal with IPython's notebook format, which is a structured JSON document. I ended up defining another less structured format for IPython notebooks, which gets used purely for synchronization and nothing else. It's a plain text file whose first line is a JSON object giving metainformation; all other lines correspond, in order, to the JSON for individual cells. When patching, it is in theory possible in edge cases involving conflicts to destroy the JSON structure -- if this happens, the destruction is isolated to a single cell, and that part of the patch just gets rejected.

The IPython notebook is embedded as an iframe in the main https://cloud.sagemath.com page, but with exactly the same domain, so the main page has full access to the DOM and Javascript of the iframe. Here's what happens when a user makes changes to a synchronized IPython notebook (and at least 1 second has elapsed):
  • The outer page notices that the notebook's dirty flag is set for some reason, which could involve anything from typing a character, deleting a bunch of cells, output appearing, etc.
  • Computes the JSON representation of the notebook, and from that the document representation (with 1 line per cell) described above. This takes a couple of milliseconds, even for large documents, due to caching.
  • The document representation of the notebook gets synchronized with the version stored on the server that the client connected with. (This server is one of many node.js programs that handles many clients at once, and in turn synchronizes with another server that is running in the VM where the IPython notebook server is running.  The sync architecture itself is complicated and distributed, and I haven't described it publicly yet.)
  • In the previous step, we in fact get a patch that we apply -- in a single automatic operation (so the user is blocked for a few milliseconds) -- to our document representation of the notebook in the iframe. If there are any changes, the outer page modifies the iframe's notebook in place to match the document. My first implementation of this update used IPython's noteobook.fromJSON, which could easily take 5 seconds (!!) or more on some of the online IPython notebook samples. I spent about two days just optimizing this step. The main ideas are:
    1. Map each of the lines of the current document and the new document to a unicode character,
    2. Use diff-patch-match to find an efficient sequence of deletions, insertions, swaps to transforms one document to the other (i.e., swapping cells, moving cells, etc.) -- this is critical to do,
    3. Change cells in place when possible.
    With these tricks (and more can be done), modifying the notebook in place takes only a few milliseconds in most cases, so you don't notice this as you're typing.
  • Send a broadcast message about the position of your cursor, so the other clients can draw it.  (Symmetrically, render the cursor on receiving a broadcast message.)