Using a Single Collection for the entire application in MongoDB


It strikes me that most sites don’t talk about this topic. Recently I’ve been doing extensive research into NoSQL schema design and evaluating the different NoSQL options (focusing on MongoDB and DynamoDB), and I found out, through this video:

That whenever you’re working with NoSQL you should aim to design a single database with a single collection for the entire application (unless it’s something very specific). In addition, just because the data is de-normalized in the database doesn’t mean it isn’t relational in reality. I remember reading in the Mongo docs that we should design the DB according to how we are going to access the data.

This makes me think a lot about how I should de-normalize the data (some of it will be repeated quite often) in a way that follows NoSQL best practice (if any exists).

Nonetheless, MongoDB has been incorporating more SQL-esque APIs such as $lookup, which adds LEFT OUTER JOIN capabilities and reduces the need to repeat data.
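For reference, a bare-bones $lookup with the Node.js driver looks roughly like this (the collections and field names below are invented purely for illustration):

    import { Db } from "mongodb";

    // Hypothetical schema: each "orders" document holds a customerId pointing at
    // a document in a separate "customers" collection.
    async function ordersWithCustomers(db: Db) {
      // $lookup behaves like a LEFT OUTER JOIN: every order is returned, with the
      // matching customer documents (possibly none) in the "customer" array.
      return db.collection("orders").aggregate([
        {
          $lookup: {
            from: "customers",        // collection to join against
            localField: "customerId", // field on the "orders" side
            foreignField: "_id",      // field on the "customers" side
            as: "customer",           // name of the output array field
          },
        },
      ]).toArray();
    }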

So the question is… should I really aim to design a MongoDB application with a single collection in mind, or does it not matter if I separate the application into different collections just for the sake of reducing redundant data?


Bags

Well, bags are very easy: you throw everything inside. You don’t have to make any upfront decisions. Updating is a breeze; just throw it in or take it out. Because everything is in the one spot, you won’t ever miss or forget anything. But it does mean you have to look through everything to find anything.

Now if the searches are ad-hoc and have no common structure or do not operate over similar sub-categorisations, then a bag is very efficient. You do not have to pay for what you do not use.

However, should some number of queries look for similar structures or sub-categories, then it would pay to pre-categorise that data, instead of paying for that structure/categorisation on each search.

Bookcases

Bookcases take work to set up: you have to categorise your books, then pick a shelf (or shelves), then order them. When you add or remove a book you essentially have to repeat the exercise, adjusting each shelf as needed. If you didn’t adjust the shelves appropriately there is a chance you might miss something, but on the other hand, if you know what you are looking for, it’s a breeze to just go to the shelf and easily find the book.

A bookcase performs well when the queries share structure and sub-categories with those used to set up the bookcase, essentially leveraging the work already done.

Conversely, when they are not shared, performance becomes much slower than with a bag, because the query has to fight the imposed structure.

Bags vs. Bookcases

A Bag is a NoSQL collection, or a Relational Table with a single varchar(max) field or some other equivalent, plus or minus an explicit unique record identifier. The only index is solely over the record identifier.

A Bookcase is a NoSQL collection where every entry is an instance of a particular textual structure and that structure has been leveraged to create indices, or a relational table (or tables) with multiple typed columns, foreign keys, indices, etc.
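To make the distinction concrete in MongoDB terms, here is a rough sketch (all names invented): the bag has nothing but the automatic _id index, while the bookcase commits to a document shape and builds indexes over it.

    import { Db } from "mongodb";

    // A "bag": heterogeneous documents, no schema commitment, and only the
    // automatic _id index. Anything other than a lookup by _id scans everything.
    async function bagExample(db: Db) {
      const bag = db.collection("stuff");
      await bag.insertOne({ kind: "note", text: "call the dentist" });
      await bag.insertOne({ kind: "invoice", total: 120, currency: "EUR" });
      return bag.findOne({ kind: "invoice" }); // full collection scan
    }

    // A "bookcase": every document follows one shape, and that shape is used to
    // build indexes, so queries that match the shape are cheap.
    interface Book {
      title: string;
      author: string;
      publishedYear: number;
    }

    async function bookcaseExample(db: Db) {
      const books = db.collection<Book>("books");
      await books.createIndex({ author: 1, publishedYear: -1 });
      await books.insertOne({ title: "Dune", author: "Frank Herbert", publishedYear: 1965 });
      return books.find({ author: "Frank Herbert" }).toArray(); // can use the index
    }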

Which is the best way to go?

  • Do you have a pile of random data? Go with a bag at first.
  • Do you know exactly what you are dealing with? Go with a bookcase at first.

Why?

Because your job as a developer is to engineer a solution that best supports the usage of the system.

De-normalisation and Data-Duplication

Data duplicates enough without designing a system to exacerbate the situation.

To my knowledge there are exactly two times you should permit data duplication.

  • as a caching mechanism, with explicit identity/version tracking and expiry.
  • as a logging mechanism to record operations and/or original/traceable reproductions.

The core data in your database should be normalised regardless of underlying technology. (As a rule of thumb, there are always exceptions but you do have to look really hard to justify them).

That being said, how it is normalised does not require it to be spread across multiple collections or even multiple documents. It is reasonable for it to be contained within a single document.
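As an illustrative sketch (the shape and field names here are mine, not part of the answer): an order can keep its line items normalised inside one document by referencing product ids instead of copying product details, with any cached duplicate carrying explicit version information.

    // Illustrative shape only; field names are invented. The order and its line
    // items live in a single document, but product data is referenced by id
    // rather than copied, so the core data stays normalised. The one duplicated
    // field is a cache with explicit version tracking so staleness is detectable.
    interface OrderDocument {
      _id: string;
      customerId: string;              // reference into the "customers" collection
      lines: Array<{
        productId: string;             // reference, not a copy of the product
        quantity: number;
        cachedProductName?: string;    // permitted duplication: a cache...
        cachedProductVersion?: number; // ...with identity/version tracking
      }>;
    }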

NoSQL engines are incomplete

The problem is that most modern NoSQL databases do not provide transactions beyond atomic document updates. This of course complicates multi-document updates, or lands you in a hunt for the truth.

Should your particular NoSQL engine provide multi-document transactions, great. You can skip this next section.
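MongoDB, for example, has supported multi-document transactions on replica sets since version 4.0. A minimal sketch with the Node.js driver (collection and field names are invented) might look like:

    import { MongoClient } from "mongodb";

    // Minimal sketch: move an amount between two account documents atomically.
    // Requires a replica set or sharded cluster deployment.
    async function transfer(client: MongoClient, from: string, to: string, amount: number) {
      const accounts = client.db("bank").collection("accounts");
      const session = client.startSession();
      try {
        await session.withTransaction(async () => {
          await accounts.updateOne({ _id: from }, { $inc: { balance: -amount } }, { session });
          await accounts.updateOne({ _id: to }, { $inc: { balance: amount } }, { session });
        });
      } finally {
        await session.endSession();
      }
    }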

Fundamentally I consider this lack of transactional support to be a design flaw of the current NoSQL engines. It is a design flaw because it forces all relevant data to be co-located within the same document. From here there are several approaches:

  • Treat that document as a flat database file. This essentially turns the NoSQL engine into a database locator, and it is now your task to implement a database. For trivial applications that would simply be a serialised set of objects/records. However, this does presuppose that you maintain backward compatibility with that serialisation and its object semantics.

  • Treat that document as an event in an event-sourcing stream. Now your job is to implement appropriate read operations that either assemble the current reality from the event stream, or from some combination of a look-aside cache and the event stream (see the sketch below). This is a non-trivial solution and is hard to get right.

  • Treat that document as a check-point in a workflow. As the workflow executes it updates the relevant documents in the other collections, and then updates itself. At any point in the process what has been updated is known, with the exception of the current document, which may or may not have been updated. Any read operation will require checks to determine if the data is stale, and the workflows will need to be idempotent, or at least carefully orchestrated, so as to be applied correctly.

Or there is a secret fourth option:

  • White noise. You cannot trust the state of your data at all, just dismantle the application.
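Returning to the second option above, a minimal read-side sketch of assembling current state from an event stream (all names invented) could look like:

    import { Db } from "mongodb";

    // Each document in "accountEvents" is an immutable event for one account;
    // the read side assembles the current state by replaying the stream in order.
    interface AccountEvent {
      accountId: string;
      seq: number;                                // position in the stream
      type: "Opened" | "Deposited" | "Withdrawn";
      amount?: number;
    }

    async function currentBalance(db: Db, accountId: string): Promise<number> {
      const events = await db
        .collection<AccountEvent>("accountEvents")
        .find({ accountId })
        .sort({ seq: 1 })
        .toArray();

      return events.reduce((balance, e) => {
        switch (e.type) {
          case "Deposited": return balance + (e.amount ?? 0);
          case "Withdrawn": return balance - (e.amount ?? 0);
          default:          return balance;
        }
      }, 0);
    }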

When to use a single collection

Almost never.

Applications have structure, and deal with data containing structure, even if that structure is some form of meta-structure. Otherwise every piece of code would be a one-off, never to be used again.

The only reasons to use a single collection are:

  • when the system is trivial and only has a single structure.
  • when there is an external limitation such that your application can only have a single collection.

The first is quite simple: adding a second collection would complicate the system more than it simplifies it.

The second is also quite simple: you do not have the choice. In that case, seriously consider pushing back and getting that limitation removed.

That whenever you’re working with NoSQL you should aim to design a single database with a single collection for the entire application

Why? Just because you can do it doesn’t mean you have to. You have to model the persistence that best suits your needs. Usually, the schema design revolves around how the data is accessed, updated, inserted, etc. The goal is finding the data organization that is most performant for these operations and, of course, the most natural for the application’s domain.

From MongoDB’s Official Page

Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.

[…]

You should consider embedding for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider “rolling-up” the small documents into larger documents that contain an array of embedded documents.

And not much later

However, if you often only need to retrieve a subset of the documents within the group, then “rolling-up” the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.

Emphasis mine.
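A sketch of the “rolling-up” the docs describe (all names here are invented): many small per-reading documents become one larger document holding an array of embedded readings that are usually fetched together.

    // Instead of one small document per reading:
    //   { sensorId: "s1", at: <timestamp>, value: 21.4 }
    // readings that are usually retrieved together are "rolled up" into one
    // larger document per sensor and hour, holding an array of embedded readings.
    interface SensorHourDocument {
      sensorId: string;
      hour: string;                               // e.g. "2024-01-01T10"
      readings: Array<{ at: Date; value: number }>;
    }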

In my experience, we start with documents of different natures and fairly different schemas, stored in different collections; a data organization that is easier to reason about. Relationships between these documents are designed as arrays of identifiers. Call these collections masters or roots. Additionally, other more document-specific relationships start out as small embedded documents or collections within those roots.

As the application evolves, we detect different growth factors between roots and embedded documents. At some point, we realise that the application is consuming only a subset of the document (most of the time), making it a waste of memory to bring these embedded documents into memory, or their weight has a computational impact on querying.

As these relationships gain or lose critical mass, we choose one data-relationship strategy or another.
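A sketch of that evolution (all names invented): the root starts with an embedded array, and once that array’s growth makes it too heavy, it is split into its own collection and referenced by identifiers.

    // Early on: the root holds the relationship as an embedded array,
    // which is fine while it stays small.
    interface AuthorRootV1 {
      _id: string;
      name: string;
      posts: Array<{ title: string; body: string }>;
    }

    // Later, once the posts grow much faster than the root and are rarely
    // needed all at once: they move to their own collection and the root
    // keeps only an array of identifiers.
    interface AuthorRootV2 {
      _id: string;
      name: string;
      postIds: string[];   // references into a separate "posts" collection
    }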
