Thursday, October 11, 2012

Post-RICON 2012 idea fragment: embed CRDTs in data

So RICON 2012 is over and it was pretty fun. I thought the length of the conference was just right (two days) and the number of people was pretty good, too, especially for a first conference.


There was a lot of talk about distributed data structures, specifically "conflict-free replicated data types" (CRDTs for short), which is neat. I missed some of the talks, but I was thinking about this and it seems to me that CRDTs are nice, but you often don't want to keep track of just one data structure.

For example, you might store user interactions in Riak as a simple JSON dictionary: clicks, adds to cart, etc.

Your key might be uid_category or uid_campaign, with a data set that looks like: {clicks: 8, adds: 9, ... }

This might make a lot of sense if you're doing simple analytics to figure out how popular a product or product category is. However, you might not want to break these statistics up into many keys, because whatever process is figuring out how to rank things may want all the data but might not want to make, say, 6 key fetches to get it (especially if you're doing batch processing over lots of users).

In addition, you may have a lot of data coming in, so a CRDT counter implementation would be useful for each statistic, and you would probably want siblings turned on. This makes me think you might want to embed the operations you're performing in the data itself (this might be metadata, but I'm not sure). That is, append to your data something like this:

{clicks: increment 1, adds: increment 1}

Then, if you have siblings, you assemble the corrected data by combining the operation logs stored in the siblings. Easy! This is a limited application, for sure, and it's just something I thought of off the top of my head, but I'm wondering if tooling around this would be interesting/useful.
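
To make the idea a bit more concrete, here is a minimal Python sketch of what merging siblings from an embedded operation log could look like. Everything in it is an assumption for illustration: the "ops" field, the per-operation "id", and the helper names new_op, append_ops, merge_siblings and totals are all made up, and nothing here talks to Riak. The unique id per operation is an addition of mine that the fragment above doesn't have; without something like it, an increment present in two siblings would get counted twice.

```python
import uuid
from collections import defaultdict

def new_op(field, delta=1):
    """Build one logged operation. The unique id is an assumption added
    here so that the same operation is never counted twice on merge."""
    return {"id": uuid.uuid4().hex, "field": field, "delta": delta}

def append_ops(value, ops):
    """Return a copy of a stored value with extra operations on its log."""
    return {"ops": value.get("ops", []) + list(ops)}

def merge_siblings(siblings):
    """Union the operation logs of all siblings, keeping each op id once."""
    seen = {}
    for sibling in siblings:
        for op in sibling.get("ops", []):
            seen.setdefault(op["id"], op)
    # Sort by id so the merged value does not depend on sibling order.
    return {"ops": sorted(seen.values(), key=lambda op: op["id"])}

def totals(value):
    """Fold the log into plain counters, e.g. {"clicks": 3, "adds": 2}."""
    counts = defaultdict(int)
    for op in value.get("ops", []):
        counts[op["field"]] += op["delta"]
    return dict(counts)

# Two writers start from the same stored value and diverge into siblings.
base = append_ops({}, [new_op("clicks"), new_op("adds")])
sibling_a = append_ops(base, [new_op("clicks")])
sibling_b = append_ops(base, [new_op("adds"), new_op("clicks")])

merged = merge_siblings([sibling_a, sibling_b])
summary = totals(merged)  # clicks == 3, adds == 2
```

The obvious catch with this sketch is that the log only ever grows, so real tooling would need some way to compact old operations into a base count, which I haven't thought through.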