MongoDB MapReduce with Hadoop

4 Responses to MongoDB MapReduce with Hadoop

  1. Patrick Salami
    July 18, 2013 @3:05 am

    Thank you for this post; it's very helpful. We also have a setup that requires data to be exchanged between Hadoop and MongoDB. However, we need to go the other way: we generate large amounts of data in Hadoop and need to write that data to MongoDB. Currently we write the data over the network to a couple of mongos servers using a custom OutputFormat, similar to what you have in the library you posted on GitHub. However, the process is very slow, and MongoDB becomes unresponsive to clients that are reading data while the load is in progress. In addition, it ties up our Hadoop cluster for a long time. So I figured perhaps we could write the MongoDB native files in HDFS (we assume that the database is initially empty). Writing the namespace file might be challenging in MapReduce, though.
    What do you think?


    • Peter Bakkum
      July 18, 2013 @12:23 pm

      Thanks for the question, Patrick.

      As you mentioned, there are several options which come to mind for writing Hadoop data back to Mongo:

      - Insert into Mongo in the OutputFormat. This is what we currently do, since we don't often need to write from Hadoop to Mongo. It's simple but, as you mentioned, not a quick process for large data sizes, because every document needs to be inserted. It can probably be optimized a bit further by buffering in the RecordWriter and inserting batches of documents (so that they get sent in a single message), but you're still limited by how fast Mongo can process the inserts, and you're still affecting other queries.

      - Write to a JSON or BSON file in the OutputFormat and import it into Mongo. This could be a little faster, but it means you would have to move the file from HDFS to a location accessible to the mongoimport/mongorestore utilities. Depending on your requirements, this may not be fast enough.

      - Write Mongo native files and open them in MongoDB. This would also require copying files out of HDFS, but afterwards they would be immediately accessible in Mongo. I haven't tried this, but it's definitely possible, and I'd love to have something like this in the library. As you mentioned, it would involve writing a namespace file and at least one data file, and laying your documents out in Mongo's record and extent scheme. The RecordWriter would have to be a bit more complex, because each record needs a pointer to the previous and next records. Perhaps more challenging, each extent also needs to know the locations of the previous and next extents. You would also need a method of grouping records into extents and data files. So my feeling is that this would be difficult, but very useful if it worked; you're definitely not the only one with this problem.
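
      To make the batching idea in the first option concrete, here's a minimal Python sketch: a writer that buffers documents and flushes them in batches, the way a buffered RecordWriter would. The `StubCollection` just stands in for a real pymongo collection (which exposes the same `insert_many` method); all the names here are mine, not part of any library.

```python
from typing import Any, Dict, List

class BufferedWriter:
    """Buffers documents and inserts them in batches rather than
    one at a time (sketch of a batching RecordWriter)."""

    def __init__(self, collection, batch_size: int = 1000):
        self.collection = collection
        self.batch_size = batch_size
        self.buffer: List[Dict[str, Any]] = []

    def write(self, doc: Dict[str, Any]) -> None:
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send the whole buffer in a single insert call.
        if self.buffer:
            self.collection.insert_many(self.buffer)
            self.buffer = []

    def close(self) -> None:
        # Called at the end of the task, like RecordWriter.close():
        # don't lose the partial final batch.
        self.flush()

class StubCollection:
    """Stand-in for a pymongo collection; records each batch."""
    def __init__(self):
        self.batches = []
    def insert_many(self, docs):
        self.batches.append(list(docs))

col = StubCollection()
w = BufferedWriter(col, batch_size=3)
for i in range(7):
    w.write({"_id": i})
w.close()
print([len(b) for b in col.batches])  # → [3, 3, 1]
```

      The same shape works against a real collection, where each `insert_many` becomes one wire message instead of one per document.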
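
      For the second option, a .bson file as consumed by mongorestore is just BSON documents concatenated back to back, and the format is simple enough to write by hand. A minimal sketch encoding a single int32 field (the function name is hypothetical, not an existing API):

```python
import struct

def bson_int32_doc(name: str, value: int) -> bytes:
    """Encode the document {name: value} as BSON by hand.
    Element layout: type byte 0x10 (int32), cstring key,
    little-endian int32 value."""
    element = b"\x10" + name.encode() + b"\x00" + struct.pack("<i", value)
    body = element + b"\x00"                      # trailing NUL closes the doc
    return struct.pack("<i", 4 + len(body)) + body  # length prefix includes itself

doc = bson_int32_doc("x", 1)
print(doc.hex())  # → 0c0000001078000100000000
```

      Appending documents like this to one file gives you something mongorestore can load, once the file is copied out of HDFS.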

      I’d be curious to hear about how you guys end up handling/optimizing this problem.

      • Patrick Salami
        July 18, 2013 @6:37 pm

        Thanks for your reply; it's very encouraging. I'm going through your code right now, and I'm also looking at the MongoDB source code to get a better understanding of the native file format. I'll take a shot at this and see how far I get. Can I message you privately in case I have any questions, or to share progress?

  2. Peter Bakkum
    July 22, 2013 @10:33 am

    Sure, email me at
