Identifier les cas où un document store n'est pas adapté.
« MongoDB is a document-oriented database. Instead of storing your data in tables made out of individual rows, like a relational database does, it stores your data in collections made out of individual documents. In MongoDB, a document is a big JSON blob with no particular format or schema.
[...]
At the root, we have a set of TV shows. Each show has many seasons, each season has many episodes, and each episode has many reviews and many cast members. When users come into this site, typically they go directly to the page for a particular TV show. On that page they see all the seasons and all the episodes and all the reviews and all the cast members from that show, all on one page. So from the application perspective, when the user visits a page, we want to retrieve all of the information connected to that TV show.
There are a number of ways you could model this data. In a typical relational store, each of these boxes would be a table. You’d have a tv_shows table, a seasons table with a foreign key into tv_shows, an episodes table with a foreign key into seasons, and reviews and cast_members tables with foreign keys into episodes. So to get all the information for a TV show, you’re looking at a five-table join.
We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.
[...]
It’s got some title metadata, and then it’s got an array of seasons. Each season is itself a hash with metadata and an array of episodes. In turn, each episode has some metadata and arrays for both reviews and cast members.
It’s basically a huge fractal data structure.
[...]
All of the data we need for a TV show is under one document, so it’s very fast to retrieve all this information at once, even if the document is very large. There’s a TV show here in the US called “General Hospital” that has aired over 12,000 episodes over the course of 50+ seasons. On my laptop, PostgreSQL takes about a minute to get denormalized data for 12,000 episodes, while retrieval of the equivalent document by ID in MongoDB takes a fraction of a second.
So in many ways, this application presented the ideal use case for a document store.
[...]
There is a really important difference between Diaspora’s social data and the Mongo-ideal TV show data that no one noticed at first.
With TV shows, each box in the relationship diagram is a different type. TV shows are different from seasons are different from episodes are different from reviews are different from cast members. None of them is even a sub-type of another type.
But with social data, some of the boxes in the relationship diagram are the same type. In fact, all of these green boxes are the same type — they are all Diaspora users.
A user has friends, and each friend may themselves be a user. Or, they may not, because it’s a distributed system. (That’s a whole layer of complexity that I’m just skipping for today.) In the same way, commenters and likers may also be users.
This type duplication makes it way harder to denormalize an activity stream into a single document. That’s because in different places in your document, you may be referring to the same concept — in this case, the same user. The user who liked that post in your activity stream may also be the user who commented on a different post.
[...]
We can represent this in MongoDB in a couple of different ways. Duplication is any easy option. All the information for that friend is copied and saved to the like on the first post, and then a separate copy is saved to the comment on the second post. The advantage is that all the data is present everywhere you need it, and you can still pull the whole activity stream back as a single document.
[...]
You can also see why this is dangerous. Updating a user’s data means walking through all the activity streams that they appear in to change the data in all those different places. This is very error-prone, and often leads to inconsistent data and mysterious errors, particularly when dealing with deletions.
[...]
There is another approach you can take to this problem in MongoDB, which will more familiar if you have a relational background. Instead of duplicating user data, you can store references to users in the activity stream documents.
With this approach, instead of inlining this user data wherever you need it, you give each user an ID. Once users have IDs, we store the user’s ID every place that we were previously inlining data. New IDs are in green below.
[...]
This eliminates our duplication problem. When user data changes, there’s only one document that gets rewritten. However, we’ve created a new problem for ourselves. Because we’ve moved some data out of the activity streams, we can no longer construct an activity stream from a single document. This is less efficient and more complex. Constructing an activity stream now requires us to 1) retrieve the stream document, and then 2) retrieve all the user documents to fill in names and avatars.
What’s missing from MongoDB is a SQL-style join operation, which is the ability to write one query that mashes together the activity stream and all the users that the stream references. Because MongoDB doesn’t have this ability, you end up manually doing that mashup in your application code, instead.
[...]
Let’s return to TV shows for a second. The set of relationships for a TV show don’t have a lot of complexity. Because all the boxes in the relationship diagram are different entities, the entire query can be denormalized into one document with no duplication and no references. In this document database, there are no links between documents. It requires no joins.
On a social network, however, nothing is that self-contained. Any time you see something that looks like a name or a picture, you expect to be able to click on it and go see that user, their profile, and their posts. A TV show application doesn’t work that way. If you’re on season 1 episode 1 of Babylon 5, you don’t expect to be able to click through to season 1 episode 1 of General Hospital.
[...]
Once we started doing ugly MongoDB joins manually in the Diaspora code, we knew it was the first sign of trouble. It was a sign that our data was actually relational, that there was value to that structure, and that we were going against the basic concept of a document data store.
Whether you’re duplicating critical data (ugh), or using references and doing joins in your application code (double ugh), when you have links between documents, you’ve outgrown MongoDB. When the MongoDB folks say “documents,” in many ways, they mean things you can print out on a piece of paper and hold. A document may have internal structure — headings and subheadings and paragraphs and footers — but it doesn’t link to other documents. It’s a self-contained piece of semi-structured data.
If your data looks like that, you’ve got documents. Congratulations! It’s a good use case for Mongo. But if there’s value in the links between documents, then you don’t actually have documents. MongoDB is not the right solution for you. It’s certainly not the right solution for social data, where links between documents are actually the most critical data in the system.
[...]
When MongoDB is all you have, it’s a cache with no backing store behind it. It will become inconsistent. Not eventually consistent — just plain, flat-out inconsistent, for all time. At that point, you have no options. Not even a nuclear one. You have no way to regenerate the data in a consistent state.
When Diaspora decided to store social data in MongoDB, we were conflating a database with a cache. Databases and caches are very different things. They have very different ideas about permanence, transience, duplication, references, data integrity, and speed.
[...]
Remember that TV show application? It was the perfect use case for MongoDB. Each show was one document, perfectly self-contained. No references to anything, no duplication, and no way for the data to become inconsistent.
About three months into development, it was still humming along nicely on MongoDB. One Monday, at the weekly planning meeting, the client told us about a new feature that one of their investors wanted: when they were looking at the actors in an episode of a show, they wanted to be able to click on an actor’s name and see that person’s entire television career. They wanted a chronological listing of all of the episodes of all the different shows that actor had ever been in.
We stored each show as a document in MongoDB containing all of its nested information, including cast members. If the same actor appeared in two different episodes, even of the same show, their information was stored in both places. We had no way to tell, aside from comparing the names, whether they were the same person. So to implement this feature, we needed to search through every document to find and de-duplicate instances of the actor that the user clicked on. Ugh. At a minimum, we needed to de-dup them once, and then maintain an external index of actor information, which would have the same invalidation issues as any other cache. »
Via
http://www.gcu-squad.org/2014/11/mongo-or-not-mongo/