Event: MongoDB Atlanta 2012. Comprehensive Session Summary

I was in Atlanta this Friday attending 10gen's MongoDB conference. It was an exciting experience. The sessions were divided broadly into two tracks, one for beginners and one for advanced MongoDB users. Being already very familiar with MongoDB, I chose the advanced track. The first session was "Rapid and Scalable Development with MongoDB, PyMongo and Ming" by Rick Copeland.

Rick started the session by talking about how to port an application to MongoDB using Python. The only library required is the PyMongo driver, which is readily available for download. He also spoke about using map-reduce with MongoDB: it is single-threaded and blocks most other operations until it finishes. Sharding mitigates this to some extent for map-reduce operations and for the $where clause, but for heavy workloads he advised using a dedicated map-reduce framework such as Hadoop. MongoDB's new aggregation framework, due to be released soon, should also address this issue.
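
As a rough illustration of the kind of job that moves from map-reduce to the aggregation framework, here is a minimal PyMongo sketch; the "shop" database, "orders" collection and its fields are hypothetical, not something from the talk.

    # Minimal sketch: a per-customer report using the aggregation framework
    # instead of map-reduce. Assumes a local mongod and a hypothetical
    # "shop.orders" collection with customer_id, status and amount fields.
    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    pipeline = [
        {"$match": {"status": "complete"}},          # filter early
        {"$group": {"_id": "$customer_id",           # group key
                    "total": {"$sum": "$amount"}}},  # per-customer sum
        {"$sort": {"total": -1}},
    ]

    for row in orders.aggregate(pipeline):
        print(row["_id"], row["total"])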

Rick also spoke about how to use the Ming library on top of MongoDB. We can think of Ming as an ORM-style tool: its main aim is to let the developer specify the schema for the data in Python code. The MongoDB-Ming integration seems to be extremely simple; the main thing to consider is mapping a MongoDB database to a Ming datastore. Ming documents can be accessed using either dictionary-style or attribute-style lookups. He then went into the details of querying, schema design and operators. Overall, due to the similarities between MongoDB and Ming, they work together flawlessly. Next on the advanced schedule was "Journaling and the Storage Engine" by Mathias Stearn. Since I had attended his session on storage back at MongoSV, I decided to attend the Schema Design session by Jesse Davis instead.

Jesse started off by describing exactly what I wanted to know: how to do schema design in MongoDB. He mainly focused on five questions:

  • What is the nature of the data?
  • How am I going to use the data?
  • Which is more common, writes or reads? Specifically, what is the ratio of writes to reads?
  • What are the expected data access patterns?
  • For a given unit of data, does a separate document make sense, or should it be embedded inside some other document? If it is too large and embedding it in another document seems impractical, does linking to the document make more sense?

He then described basic operations like finds, inserts, simple queries, the explain() command and the use of query operators in MongoDB, all familiar ground. He also talked about the document structure: MongoDB has a rich document structure backed by a dynamic schema, which makes it ideal for rich queries and suitable for most operations that are possible in traditional relational databases.

The next topic was one that I definitely wanted a deeper insight on: Embedding vs. Linking. Embedding is a helpful concept that, if used cleverly, can result in pretty simple queries and low-latency operations for fetching most data. As mentioned before, it is an integral part of schema design. Linking, on the other hand, is a deceptive concept. From the name we get the feeling that it is an actual link to another document, but it is really just another piece of data, a simple "field:value" pair. We literally store the object ID of another document there, and MongoDB performs no referential check to verify it. That means it is not actually a link, just the value of a field; we could store any random value and it would be accepted. It is better envisioned as a virtual link rather than a physical one. Using the link, we perform two queries to fetch the desired document: the first query fetches the object ID from the main document, and the second uses that object ID to retrieve the referenced document.
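
To make the embedding-versus-linking distinction concrete, here is a small PyMongo sketch; the "blog" database, its collections and fields are my own hypothetical example, not something from the talk. The linked case is just a stored ObjectId plus a second query.

    # Sketch of embedding vs. linking with PyMongo. Collections and fields
    # are hypothetical; note that MongoDB performs no referential check on
    # the stored ObjectId.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["blog"]

    # Embedding: comments live inside the post document, so one query
    # fetches everything.
    db.posts.insert_one({
        "title": "MongoDB Atlanta 2012",
        "comments": [{"author": "alice", "text": "Nice writeup"}],
    })
    post = db.posts.find_one({"title": "MongoDB Atlanta 2012"})

    # Linking: store another document's _id as a plain field value...
    author_id = db.authors.insert_one({"name": "alice"}).inserted_id
    db.posts.insert_one({"title": "Schema design notes", "author_id": author_id})

    # ...and resolve it with two queries: fetch the post, then the author.
    post = db.posts.find_one({"title": "Schema design notes"})
    author = db.authors.find_one({"_id": post["author_id"]})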

After that interesting talk, the next session in the advanced track was "Mobilize Your MongoDB! Developing iPhone and Android Apps in the Cloud" by Grant Shipley. Again, having attended that talk at one of the MongoDB meetups in Raleigh, NC, I decided to attend "Indexing And Optimization" by Robert Stam instead.

"Indexing And Optimization" was a basic session mainly focused on Indexing. Robert started off by talking about how to create an index using the createIndex() and ensureIndex() commands. The only difference between the two being that ensureIndex() checks for the existence of the index before creating it and therefore is sometimes faster. He showed how to create indexes on different fields, as well as ascending, descending and compound indexes. Indexing is a demanding process for the CPU and should only be performed under conditions of low load. MongoDB provides an option to create the index in the background in order to reduce the load on the system. He described several other operations that can be performed on indexes like get all indexes, drop all indexes, drop specific indexes and re-index an existing index. Contrary to some other key-value stores, a rich index structure improves the performance of MongoDB to a great extent.

The bulk of the talk focused on showing how to use indexes and which indexes affect which queries. Almost all queries in MongoDB can run without indexes; only queries that sort more than 8MB of documents and geo-spatial queries absolutely require them. Indexes cannot be used effectively by queries that use certain operators like $ne, $mod and $not, or by partial range queries such as db.foo.find({x:{$gte:0, $lte:10}, y:5}) with a compound index on the x and y fields.

He addressed geo-spatial indexes, which are a prime feature of MongoDB. Geo-spatial indexes make location-based queries possible, solving problems such as "give me a list of all the restaurants near this particular location", while $within queries answer questions like "list all restaurants within 5 miles of my current location". Geo-spatial indexes can also be combined with other fields, resulting in more complex compound indexes. The talk also described covered queries, where a query can be evaluated completely from the index: every field referenced in the query or returned in the results must be part of the index. Since the _id field is returned in every result by default, it has to be explicitly excluded with a projection such as {_id: 0} for the query to remain covered. Having indexes is good, but having too many sometimes defeats the purpose; there is a hard limit of 64 indexes per collection.
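
A short PyMongo sketch of both ideas, using a legacy 2d geospatial index and a projection that drops _id; the "guide.places" collection and the coordinates are hypothetical, and newer servers spell the $within operator $geoWithin.

    # Sketch: a 2d geospatial index for proximity queries plus a projection
    # that excludes _id so a query on indexed fields can be covered.
    # Collection, fields and coordinates are hypothetical.
    from pymongo import MongoClient, ASCENDING, GEO2D

    places = MongoClient("mongodb://localhost:27017")["guide"]["places"]

    # Compound geo index: the 2d field first, a regular field second.
    places.create_index([("loc", GEO2D), ("category", ASCENDING)])
    places.create_index([("category", ASCENDING), ("name", ASCENDING)])

    # "Restaurants near this location" ([longitude, latitude] pairs).
    nearby = places.find({"category": "restaurant",
                          "loc": {"$near": [-84.39, 33.75]}}).limit(10)

    # "Restaurants within a radius": $geoWithin (called $within at the time).
    in_circle = places.find(
        {"loc": {"$geoWithin": {"$center": [[-84.39, 33.75], 0.05]}}})

    # Covered-style query: only indexed fields returned, _id excluded.
    names = places.find({"category": "restaurant"},
                        {"_id": 0, "category": 1, "name": 1})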

Robert also covered the MongoDB query optimizer, which is cost-based. Typically the optimizer tries several different indexes for a query and selects the best-performing one; it then reuses that index for subsequent queries on the same fields, and re-evaluates its choice if the schema design or architecture changes. We can also force MongoDB to use a specific index with hint(), although that is rarely a good idea since the optimizer normally selects the optimal index on its own.
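
If you do need to override the optimizer, the drivers expose hint() on cursors; here is a minimal PyMongo sketch with a hypothetical "app.events" collection.

    # Sketch: forcing a specific index with hint(). The optimizer usually
    # chooses well on its own, so treat this as a last resort.
    from pymongo import MongoClient, ASCENDING

    events = MongoClient("mongodb://localhost:27017")["app"]["events"]
    events.create_index([("user_id", ASCENDING), ("ts", ASCENDING)])

    cursor = events.find({"user_id": 42, "ts": {"$gte": 0}}).hint(
        [("user_id", ASCENDING), ("ts", ASCENDING)])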

Finally, Robert showed the advantages of using indexes. One can observe the output of the explain() function to see what index was used, how many objects were scanned, the range of the index that was used and several other factors. We can also use the built-in profiler to view the benefits of using indexes.
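
Both tools are also accessible from the drivers; here is a PyMongo sketch with a hypothetical "app.users" collection. Profiler level 2 records every operation, while level 1 records only slow ones.

    # Sketch: inspecting a query plan with explain() and reading the
    # built-in profiler. The collection and field are hypothetical.
    from pymongo import MongoClient, ASCENDING

    db = MongoClient("mongodb://localhost:27017")["app"]
    db.users.create_index([("email", ASCENDING)])

    # explain() reports which index (if any) was used and how many
    # index keys and documents were examined.
    plan = db.users.find({"email": "a@example.com"}).explain()
    print(plan)

    # Turn on the profiler, run a query, then inspect recent entries.
    db.command("profile", 2)
    db.users.find_one({"email": "a@example.com"})
    for op in db.system.profile.find().sort("ts", -1).limit(5):
        print(op.get("op"), op.get("millis"))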

After that great session on indexing and optimization, I again picked a session from the basic track, "Mission Critical MongoDB", a continuation of the Lulu session from September 2011 that I had attended in Raleigh.

After a brief recap of the original session, Kevin Calcagno from Lulu moved on to describe the entire data migration process that they carried out at Lulu. It seemed to be a slow, painful process prone to problems, but drawing from that experience he pointed out a few general rules for mass data migration:

  • You will have bad data: Accept it and be ready to deal with it. You might not be able to detect bad data right away, but a data migration helps in identifying and removing it.
  • It should be fast: The migration needs to happen quickly. Because of bad data and several other factors, there will be several re-dos and re-architectures of the new data model; if you don't have a way to quickly deploy and test the changes you make, you are bound to fall behind schedule. Automated scripts and small custom applications go a long way here.
  • Use production data: Unless you use the data you actually have in production, you will never detect the bad data. A smaller custom-made data set, such as a small stub repository, only tests the functioning of your application; the data is what is wrong here, and you will never detect that unless you use the real data.
  • Plan for the future: Make sure the new system you are migrating to supports enough functionality to let you avoid future migrations. It should be flexible enough to accommodate changes in the data, which are bound to happen sooner or later, and able to scale to support potential growth.

Kevin then described some of the tweaks that they had to make when migrating to MongoDB. The "write concern" option is not set by default; that is fine for data such as logs, but not for application data, so for valuable data you should set a write concern explicitly. NoSQL is schemaless, but your application usually is not, so do not be tempted to tweak the schema to the point where the application has to perform extraneous transformations on it. And since backups traditionally happen every 24 hours, if your data is valuable enough, write custom scripts to back it up more frequently.
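
As a small PyMongo sketch of the write-concern point (the collections and settings are hypothetical): keep fire-and-forget semantics for logs, but require acknowledgement, journaling or replication for application data.

    # Sketch: per-collection write concerns in PyMongo. The collections
    # and settings are hypothetical; choose values that match how much
    # the data is worth to you.
    from pymongo import MongoClient, WriteConcern

    db = MongoClient("mongodb://localhost:27017")["app"]

    # Log data: unacknowledged writes are fast but can be lost silently.
    logs = db.get_collection("logs", write_concern=WriteConcern(w=0))

    # Application data: wait for a replica-set majority and the journal.
    orders = db.get_collection(
        "orders", write_concern=WriteConcern(w="majority", j=True))

    logs.insert_one({"msg": "user logged in"})
    orders.insert_one({"user_id": 42, "total": 19.99})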

Having already attended several replication sessions, I went next to "Real Money, Real Time: MongoDB-Powered Recommendation Engine" by Chris Siefken.

Chris started off by describing his work at Bean Stalk Data. They monitor consumer transactions at major restaurant chains, make marketing offers and send notifications designed to change consumer behavior. Their original implementation proved very problematic: the workload was dominated by writes (around 93%), with OLAP used for reporting, and the scalability problems of their non-distributed infrastructure kept pushing them toward sharding solutions for their relational data.

They looked at several NoSQL solutions, including CouchDB, HBase, VoltDB, Neo4j and MongoDB, as well as building their own custom solution. CouchDB was great but did not have the specific document-based model they wanted. HBase had several small issues that conflicted with their requirements. VoltDB was not selected because of its lack of maturity and stability at the time. Neo4j, a graph database, was not a good option going forward because it does not support sharding. MongoDB was document-based and had good sharding built in, and its readily available documentation made it easy to train others on. So they went with MongoDB.

After implementing it, they never looked back. Apart from the perfect fit, they also made efficient use of map-reduce in their reporting service and replaced OLAP completely. Chris showed some metrics from the implementation. They achieved 87,000 writes/sec on a $500 node with MongoDB. That took care of their high writes problem and reduced the infrastructure cost dramatically.
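
Purely as an illustration of the kind of reporting job that takes over from OLAP queries, here is a hedged PyMongo map-reduce sketch; the "analytics.transactions" collection and its fields are my own invention, not Bean Stalk Data's schema.

    # Illustrative only: a map-reduce job summing transaction amounts per
    # store. Database, collection and field names are hypothetical, not
    # Bean Stalk Data's actual schema.
    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient("mongodb://localhost:27017")["analytics"]

    mapper = Code("function () { emit(this.store_id, this.amount); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    # Run the server-side mapReduce command, writing results to a collection.
    db.command("mapReduce", "transactions",
               map=mapper, reduce=reducer, out="revenue_by_store")

    for doc in db["revenue_by_store"].find().sort("value", -1).limit(10):
        print(doc["_id"], doc["value"])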

By this time it was only 3:00pm, but I had already gained more than I could have from weeks of self-study. The last session was "Saving Time with MongoDB" by Harris Reynolds.

Harris spoke about two applications that they had designed using MongoDB and Java. For the first, Visualizing Random Data, he described the process of creating visual graphs from structured data and how to use the simple Java APIs provided by the MongoDB Java driver. The second application gathers social media data. Apart from these, he gave several hypothetical examples of other use cases for MongoDB. Overall, it was a session designed to show different ways of using MongoDB intuitively. After that there was a "Meeting the Experts" session, which was a great opportunity to network; I learned about several ongoing MongoDB projects and new ways of using MongoDB with other systems.

It was a great experience attending MongoDB Atlanta 2012: a good location, excellent event management, many interesting topics and, to top it all, experienced presenters from both 10gen and other companies using MongoDB in production. Having attended MongoSV 2011 last year, I found that this event covered more basic concepts than MongoSV did; however, MongoDB Atlanta excelled with its sessions on real-world implementations. I look forward to attending such events in the future.
