Multiple Databases per Microservice


We have a scenario in which all the important, transactional fields of our business entities are highly structured and relational, and the data size of these fields is very small. However, there is a raw JSON blob associated with each entity that is very rarely updated (only in exceptional cases), and most of our read APIs require all the data, including the raw JSON.

Considering this, we chose MySQL as our datastore, with the data-bag (raw JSON) stored as a blob on each entity. This doesn’t cause any functional issues, but the data size is increasing rapidly, and the JSON blob accounts for around 70% of it.
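For concreteness, the current layout is roughly the following (a sketch only; table and column names are illustrative):

    # Current single-table design: small structured fields plus a large,
    # rarely-updated raw-JSON blob in the same row (illustrative names).
    CURRENT_SCHEMA = """
    CREATE TABLE entity (
        id       BIGINT PRIMARY KEY,
        name     VARCHAR(255) NOT NULL,
        -- ... other small, highly structured, transactional fields ...
        data_bag LONGBLOB              -- raw JSON, ~70% of total data size
    );
    """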

So we are thinking of moving the raw data into a NoSQL store and using the MySQL primary key as a referential key in NoSQL (enforced by code).

However, this appears to me to be an anti-pattern, because it introduces distributed transactions (we need to ensure consistency across both DBs). That could be avoided with the Saga pattern, where the write to NoSQL goes through a message queue, but we need strong consistency on reads, so we can’t rely on it. Moreover, it introduces further complexity that can cause maintenance and monitoring issues.

We could move completely to a NoSQL store, but our main domain entities don’t really need it, and we would lose the benefits of a relational data structure.
We could shard MySQL based on size, but this would force us into some cross-shard queries.

Is there a common pattern to address this, and is “multiple databases per service” a pattern or an anti-pattern?


Adding a second database isn’t going to solve your problem. It will give you a whole bunch of new problems that you’ve already identified, and likely more. What you need to do is structure your data better. Truly unstructured data is rare; most applications exist to present data in a structured way, and reflecting that structure correctly in a database can be difficult. There is likely more structure to your data than you are willing to admit, and moving as much of it as possible into your database will result in better performance. At a minimum, breaking a large JSON blob into multiple blobs will give you some benefit, and may be a good first step toward analyzing your data to find structure.

Another thing to consider is using the JSON datatype within MySQL. This helps the database better optimize storage and performance, and could allow you to do more filtering at the database level, which will ultimately be more performant than a coded approach.
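As a rough sketch of the JSON-datatype idea, assuming the mysql-connector-python driver (connection details and all names are hypothetical):

    import mysql.connector

    # A JSON column is validated on write, stored in an optimized binary
    # format, and can be filtered server-side instead of in application code.
    conn = mysql.connector.connect(
        host="localhost", database="app", user="app", password="..."  # placeholder
    )
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE entity (
            id       BIGINT PRIMARY KEY,
            name     VARCHAR(255) NOT NULL,
            data_bag JSON                 -- native JSON instead of a BLOB
        )
    """)

    # ->> (MySQL 5.7.13+) extracts and unquotes a JSON path in the database.
    cur.execute(
        "SELECT id, name FROM entity WHERE data_bag->>'$.status' = %s",
        ("active",),
    )
    rows = cur.fetchall()

If some JSON fields turn out to be queried often, MySQL can also index them through generated columns, which is a natural next step once real structure emerges.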

Multiple databases or distributed databases are a last-resort solution. They are a huge cost in both the extra hardware and the people required to keep everything synced and maintained. Once you go down this route everything gets more difficult, and it takes a lot to justify that difficulty.


You need to implement horizontal scaling for your whole database, not just the JSON parts. Extracting the JSON parts will only buy you 70% more space once, so you will have the same problem again soon enough with the relational data.

Since the JSON parts seem to be essentially a “black box” to your application, you don’t get much benefit from storing them in a relational database – but you won’t get any benefit from a different database system either, and the increased complexity and maintenance cost of running two database systems is vastly higher.

Depending on your database engine, you can probably combine vertical and horizontal partitioning (sharding), so you shard the column with the JSON blobs separately from the relational data.
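A minimal sketch of that split, assuming a single MySQL instance and hypothetical names: the hot relational columns stay in one table, while the blobs move to a side table on the same key, which can then be partitioned (or later sharded) on its own growth curve.

    # Vertical partition: same primary key, two tables. No foreign key is
    # declared because partitioned InnoDB tables don't support them; the
    # one-to-one link is enforced in code, as in the original design.
    SPLIT_SCHEMA = """
    CREATE TABLE entity (
        id   BIGINT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

    CREATE TABLE entity_blob (
        entity_id BIGINT PRIMARY KEY,   -- same value as entity.id
        data_bag  LONGBLOB NOT NULL
    )
    PARTITION BY HASH (entity_id) PARTITIONS 8;
    """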

As for “patterns” and “anti-patterns” – that is the wrong way to think about it. There are certainly scenarios where having both a relational database and a NoSQL system in the same service makes sense, but in your case it doesn’t bring you any benefit and doesn’t solve the problem you have.


I would tend to agree in general that having two DBs is more trouble than it’s worth, but in your situation it’s worth considering. One option you could also consider is gzipping the JSON. Depending on the size of the documents, this could save a significant amount of space. A nice thing about this is that you can set the appropriate response headers (Content-Type plus Content-Encoding: gzip) and return the raw gzipped data from the service without ever decompressing it.
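A minimal sketch of that idea, assuming the JSON really is opaque to the service (function names are hypothetical):

    import gzip
    import json

    def pack(document: dict) -> bytes:
        """Compress a JSON document before storing it in the blob column."""
        return gzip.compress(json.dumps(document).encode("utf-8"))

    def unpack(blob: bytes) -> dict:
        """Decompress a stored blob for the rare internal update."""
        return json.loads(gzip.decompress(blob).decode("utf-8"))

    # On the read path, the stored bytes can be returned as-is and the client
    # decompresses transparently:
    #   Content-Type: application/json
    #   Content-Encoding: gzip

JSON tends to compress well because keys repeat across documents, so the savings are often substantial.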

But assuming there’s no easy way to extend your MySQL capacity, you need to either move the whole thing to a horizontally scaling DB or move just the LOBs. Since you are using relational features for the core data, the former is complicated. There are probably ways to do it, but it’s going to create a lot of challenges and work.

The main stumbling block to using a document store seems to be that you are worried about consistency. This is a challenge if you truly need consistency between both DBs, but it could be workable if you can weaken that requirement slightly. The way you might be able to do this is to require that your JSON document be written and confirmed before you commit the relational data. Additionally, it would be important to never modify the JSON records. If you were able to write the JSON data but the relational part failed, you would simply have an orphaned record in the document store. There’s no clear reason why this is a (big) problem, as I would expect that your retrievals from the NoSQL store would be based on the data returned from the relational queries. You could implement some sort of cleanup process if needed.
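A sketch of that write ordering, with doc_store and mysql_conn as hypothetical client handles:

    def create_entity(mysql_conn, doc_store, entity_id: int, name: str,
                      raw_json: bytes) -> None:
        # 1. Write the immutable JSON document first and wait for an
        #    acknowledged, durable write. Documents are never modified later,
        #    so no cross-store update ever needs coordinating.
        doc_store.put(key=entity_id, value=raw_json)

        # 2. Only then commit the relational row. If this step fails, the
        #    worst case is an orphaned document, never a committed row that
        #    points at missing JSON.
        cur = mysql_conn.cursor()
        try:
            cur.execute("INSERT INTO entity (id, name) VALUES (%s, %s)",
                        (entity_id, name))
            mysql_conn.commit()
        except Exception:
            mysql_conn.rollback()
            raise  # the orphaned document can be reaped by a cleanup sweep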

JacquesB brings up a good point on the space usage: if everything holds as you have explained, you would simply be delaying the problem. To account for that, you need to work out your long-term storage needs. If you must keep everything in one DB forever, you need a horizontally scaling solution for everything. If you can divide your data up in some meaningful way (e.g. by date), you might have more options.
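For the date-based option, MySQL’s native range partitioning is one way to keep that door open (a sketch with hypothetical names), since whole partitions can later be archived or dropped cheaply:

    # Range-partitioned blob table. MySQL requires the partitioning column to
    # appear in every unique key, so the primary key becomes a composite.
    DATED_SCHEMA = """
    CREATE TABLE entity_blob (
        entity_id  BIGINT   NOT NULL,
        created_on DATE     NOT NULL,
        data_bag   LONGBLOB NOT NULL,
        PRIMARY KEY (entity_id, created_on)
    )
    PARTITION BY RANGE (YEAR(created_on)) (
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION p2024 VALUES LESS THAN (2025),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
    """
    # Aging out a whole year is then a cheap metadata operation:
    #   ALTER TABLE entity_blob DROP PARTITION p2023;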
