MongoDB Data Locality

June 18th, 2013

This article is part one of two detailing my talk at MongoDB San Francisco 2013.
The video and slides for the entire talk can be found here.

Before selling a Groupon for two hours of trampoline jump time we must find and contact a high-quality purveyor of jumping. To do this, we acquire as much information about all the businesses in the world as possible. Then we clean and process this data, such that we can accurately identify those businesses that offer the best trampolining in the area. Finally, this data needs to be readily available to salespeople, so that when a sales manager demands “More Jumping!” his team can quickly contact businesses in this category.

With thousands of salespeople contacting local businesses every day, the depth and accuracy of our local data has a significant impact on our bottom line. Groupon’s Merchant Data team works to provide the best information possible, but the challenges are immense.

We get data directly from vendors, by webcrawling public data sources, from salespeople, and even through crowdsourcing. All of this data may be dirty, and it must be normalized and combined.

Consider these documents:

{
  "source":   1,
  "name":     "Joes Pizza",
  "address":  "1000 Market Street",
  "phone":    "415-555-2300"
}

{
  "source":   2,
  "name":     "Joe's",
  "address":  "1000 Market St.",
  "phone":    "(415) 555-2300"
}

We’ve gotten data about a place from two sources, but it’s difficult for a machine to determine that they are the same place given the variations in each field. Our job is to match these documents together and present a consistent view to other teams in Groupon, including information like business quality, category, website, hours of operation, etc.
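
Our real matching pipeline weighs many signals, but as a toy illustration of the normalization step, stripping punctuation from a single field already brings these two records together. A minimal sketch (the class and method names are purely illustrative):

public class PhoneNormalizer {
    // Naive normalization: keep only the digits of a phone number, so
    // "415-555-2300" and "(415) 555-2300" both reduce to "4155552300".
    static String normalizePhone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalizePhone("415-555-2300"));   // 4155552300
        System.out.println(normalizePhone("(415) 555-2300")); // 4155552300
    }
}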

We’ve found that much of this data is easiest to store as a document. For example, many businesses have multiple phone numbers, and a document-oriented model makes it trivial for us to replace the simple “phone” field shown above with an array of numbers.
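
For instance, the first document above might carry every number we know about in a single array (the second number here is purely hypothetical):

{
  "source":   1,
  "name":     "Joes Pizza",
  "address":  "1000 Market Street",
  "phones":   ["415-555-2300", "415-555-2301"]
}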

MongoDB is a performant document store, and we use it to store data at a few critical points in our processing pipeline. Several Merchant Data team members hail from Hyperpublic, a local data startup acquired by Groupon in 2012, where we were heavy users of MongoDB for most of our backend systems. When we began integrating our technology with Groupon’s, we continued to use MongoDB, but were forced to scale our systems to handle the challenges of data at Groupon’s size. For example, we store information on around 10 million businesses in the United States, and have about 800 million individual pieces of information about these businesses.

Manipulate Data Locality With The Primary Key

We shard our Mongo collection of place data on its primary key, which we call the Place ID. This makes direct lookups quick, but there are a couple of other reasons we chose this shard key rather than, say, a location-based shard key. First, local data is often messy, so it’s hard to predict what will be in our place fields before cleaning it up. Places with an unknown postal code, for example, can have values such as ‘00000’, ‘null’, ‘NA’, or ‘N/A’. Second, it’s hard to predict the cardinality of these fields. Some postal codes, or cities, or states, will have far more places than others. This makes such a field a poor shard key because it’s impossible to predict the size of each chunk, and a chunk that grows past Mongo’s chunk-size limit prevents Mongo from balancing chunks among your shards.


A standard Mongo cluster configuration.
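
Sharding a collection on its primary key is a one-time administrative step. Here is a rough sketch of what that looks like with the 2.x Java driver; the database and collection names are made up, and the commands are issued against a mongos router:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class ShardOnPrimaryKey {
    public static void main(String[] args) throws Exception {
        // Connect to a mongos, which routes commands to the cluster.
        MongoClient client = new MongoClient("localhost", 27017);
        DB admin = client.getDB("admin");

        // Enable sharding on the database, then shard the collection on its
        // primary key (_id holds the Place ID).
        admin.command(new BasicDBObject("enableSharding", "merchant_data"));
        admin.command(new BasicDBObject("shardCollection", "merchant_data.places")
                .append("key", new BasicDBObject("_id", 1)));
    }
}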

We use UUIDs heavily at Groupon, mostly because they can be generated on distributed machines without synchronization, but also because they can store other useful information such as a timestamp.

Here are a few sequentially generated version 1 UUIDs:

82d991c6-b098-11e2-8fc0-c82a14fffe86
82d9996e-b098-11e2-8fc0-c82a14fffe86
82d99f04-b098-11e2-8fc0-c82a14fffe86
82d9a40e-b098-11e2-8fc0-c82a14fffe86

Note that the first 8 characters of the ids are very similar, and they all begin the same way. This is because the first block is based on the time when the UUID was generated.
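
Because this is the standard version 1 layout, java.util.UUID can recover the timestamp directly. A small sketch using one of the ids above; the offset constant is the standard gap, in 100-nanosecond units, between the UUID epoch (1582-10-15) and the Unix epoch:

import java.util.Date;
import java.util.UUID;

public class Version1Timestamp {
    // Offset between the UUID epoch and the Unix epoch, in 100-nanosecond intervals.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    public static void main(String[] args) {
        UUID id = UUID.fromString("82d991c6-b098-11e2-8fc0-c82a14fffe86");
        // timestamp() is only defined for version 1 (time-based) UUIDs.
        long hundredNanos = id.timestamp();
        long epochMillis = (hundredNanos - UUID_EPOCH_OFFSET) / 10000;
        System.out.println(new Date(epochMillis)); // roughly when the id above was generated
    }
}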

The OIDs used by Mongo are similar:

51a6bdfcad894a0f768d106f
51a6bdfdad894a0f768d1070
51a6bdfdad894a0f768d1071
51a6bdfead894a0f768d1072

Again, the beginning of the ID is time-based, meaning it is identical for IDs generated at similar times even when generated on multiple machines. If data is being inserted into a collection sharded on this key, only one shard will be written to during a certain time period, a condition sometimes called a “hot shard”. Effectively, we’re now limited to the write throughput of the single machine on which this shard resides.

This problem is well understood, and in fact Mongo 2.4 recently introduced the capability to hash the shard key and ensure a random distribution.

Writes to a collection sharded on a time-based key can result in a “hot shard”.
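
Requesting a hashed shard key uses the same shardCollection command, with "hashed" in place of an ascending key. A rough sketch, again with a made-up namespace:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class ShardOnHashedKey {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB admin = client.getDB("admin");

        // MongoDB 2.4+: hash _id so inserts spread evenly across shards,
        // at the cost of the locality discussed below.
        admin.command(new BasicDBObject("shardCollection", "merchant_data.places")
                .append("key", new BasicDBObject("_id", "hashed")));
    }
}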

On the other hand, there are cases where this scenario is advantageous. For some read-heavy workloads, it makes sense to pair spatial locality of our data with the temporal locality we get when records are inserted at similar times. Ideally, this characteristic should be easily controllable.

As mentioned earlier, one advantage of using UUIDs is the ancillary information they contain, such as a timestamp. This information is easily accessible within the well-defined 128-bit structure, and can be useful. There are many other data systems, such as HBase, with similar sharding characteristics based on the primary key.

Locality UUID

Our solution to these problems is something we call the Locality UUID. It has the structure

wwwwwwww-xxxx-byyy-yyyy-zzzzzzzzzzzz

w: counter value
x: process id
b: literal hex 'b' representing the UUID version
y: fragment of machine MAC address
z: UTC timestamp (milliseconds since epoch)

The counter value is controllable to assist in manipulating locality, while the other fields contain useful information that helps ensure id uniqueness. The counter has two modes.

    • In the default mode, the counter is incremented by a large odd number, and its hex digits are reversed for UUID serialization, so the fastest-changing digits come first and the first few characters are highly variable. We use an odd number because 2^32 (the space of a 32-bit integer) is not divisible by any odd number greater than one, so as we increment the counter and it repeatedly overflows, it passes through the other 2^32 – 1 possible values before arriving back at the same value. A sketch of generating ids in this mode appears after this list.
      Here are a few sequentially generated ids in this mode:

      c8c9cef9-7a7f-bd53-7a50-013e4e2afbde
      14951cfa-7a7f-bd53-7a50-013e4e2afbde
      6f5169fb-7a7f-bd53-7a50-013e4e2afbde
      ba2da6fc-7a7f-bd53-7a50-013e4e2afbde
      06f8f3fc-7a7f-bd53-7a50-013e4e2afbde

Choosing our shard key carefully results in a more even distribution of writes.

  • The counter can be switched into sequential mode, which begins the counter at a hash of the current 10-minute window and increments by one.
    The result is that ids generated at the same time on different machines start with an identical, highly variable value, then increase linearly.
    Here are some sequentially generated ids in this mode:

    f5166777-7a7f-bd53-7a50-013e4e2afc26
    f5166778-7a7f-bd53-7a50-013e4e2afc26
    f5166779-7a7f-bd53-7a50-013e4e2afc26
    f516677a-7a7f-bd53-7a50-013e4e2afc26
    f516677b-7a7f-bd53-7a50-013e4e2afc26

    Uniqueness is supported by the millisecond-precision timestamp, the MAC address of the generating machine, the 2-byte process id, and a 4-byte counter. Thus, a UUID is guaranteed to be unique in an id space if each machine allows 65,536 processes or fewer, does not share the last 28 bits of its MAC address with another machine in the id space, and generates fewer than 4,294,967,295 ids per millisecond in each process.
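
The released implementations handle both counter modes and a number of edge cases; the following is only a minimal sketch of default-mode generation against the structure above (the increment constant and helper names are arbitrary choices, not the library’s):

import java.lang.management.ManagementFactory;
import java.net.NetworkInterface;
import java.util.Enumeration;
import java.util.concurrent.atomic.AtomicInteger;

public class LocalityUuidSketch {
    // Large odd increment: 2^32 has no odd factors, so the counter visits every
    // 32-bit value before repeating. The constant itself is an arbitrary choice.
    // (Sequential mode would instead seed the counter from a hash of the current
    // 10-minute window and increment by one.)
    private static final int INCREMENT = 198_491_317;
    private static final AtomicInteger COUNTER = new AtomicInteger();

    private static final String PID_HEX = pidHex();
    private static final String MAC_HEX = macFragmentHex();

    public static String generate() {
        int counter = COUNTER.addAndGet(INCREMENT);
        // Reverse the counter's hex digits so the fastest-changing digit comes
        // first, spreading consecutive ids across the key space.
        String w = new StringBuilder(String.format("%08x", counter)).reverse().toString();
        String z = String.format("%012x", System.currentTimeMillis());
        return String.format("%s-%s-b%s-%s-%s",
                w, PID_HEX, MAC_HEX.substring(0, 3), MAC_HEX.substring(3), z);
    }

    private static String pidHex() {
        // Typically "pid@hostname" on HotSpot JVMs; keep the low 16 bits of the pid.
        String name = ManagementFactory.getRuntimeMXBean().getName();
        int pid = Integer.parseInt(name.split("@")[0]);
        return String.format("%04x", pid & 0xFFFF);
    }

    private static String macFragmentHex() {
        try {
            Enumeration<NetworkInterface> ifaces = NetworkInterface.getNetworkInterfaces();
            while (ifaces.hasMoreElements()) {
                byte[] mac = ifaces.nextElement().getHardwareAddress();
                if (mac != null && mac.length == 6) {
                    long bits = 0;
                    for (byte b : mac) bits = (bits << 8) | (b & 0xFF);
                    // Keep the last 28 bits (7 hex digits) of the MAC address.
                    return String.format("%07x", bits & 0xFFFFFFFL);
                }
            }
        } catch (Exception e) { /* fall through to the placeholder below */ }
        return "0000000"; // no usable network interface found
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(generate());
        }
    }
}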

The embedded information can be easily extracted and read. This is useful when debugging systems that use these UUIDs, for example by determining precisely when a UUID was created.

20be0ffc-314a-bd53-7a50-013a65ca76d2                            

counter     : 3,488,672,514                                        
process id  : 12,618                                               
MAC address : __:__:_d:53:7a:50                                    
timestamp   : 1,350,327,498,450 (Mon, 15 Oct 2012 18:58:18.450 UTC)
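
Decoding is just a matter of slicing the hex string back apart. A minimal sketch against the example id above, following the field layout described earlier:

import java.util.Date;

public class LocalityUuidParser {
    public static void main(String[] args) {
        String id = "20be0ffc-314a-bd53-7a50-013a65ca76d2";
        String hex = id.replace("-", "");   // 32 hex characters

        // Counter: first 8 digits, stored with their hex digits reversed.
        long counter = Long.parseLong(new StringBuilder(hex.substring(0, 8)).reverse().toString(), 16);
        // Process id: next 4 digits.
        int pid = Integer.parseInt(hex.substring(8, 12), 16);
        // Skip the literal version character 'b', then take 7 digits of MAC fragment.
        String macFragment = hex.substring(13, 20);
        // Timestamp: last 12 digits, milliseconds since the Unix epoch.
        long millis = Long.parseLong(hex.substring(20), 16);

        System.out.println("counter     : " + counter);       // 3488672514
        System.out.println("process id  : " + pid);           // 12618
        System.out.println("MAC fragment: " + macFragment);   // d537a50
        System.out.println("timestamp   : " + millis + " (" + new Date(millis) + ")");
    }
}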

We’ve released the (fairly simple) generation code as Java and Ruby implementations, and we hope it will prove useful for your projects!

