I have a system that handles various transactions from hundreds of clients a minute. Each transaction is limited to a subset of functionality, and thus for each type of transaction (circa 25 of those) I have a separate database table (this avoids lock waits for all the clients connecting). These clients are connecting to several C# web apps running in IIS on Windows.
However, we have a requirement for read-only access to a subset of the data that is a product of most of these transactions. Currently, this read-only process is a 14-table LEFT OUTER JOIN, which is painfully slow. (We are using MS SQL Server.)
The current thinking is to create a separate table to store that data and ship the data to it on each update, so that the read-only operation is as fast as possible.
What would be the most stable and scalable approach?
The way I see it there are three options, none of which I really like:
- Db triggers
- A background service (daemon) that updates data (periodically or via a trigger)
- A background task fired in my web apps that updates the data.
EDIT: Ideally, the POs want the data to be no more than 3 seconds stale. We are talking just tens of thousands of rows (less than 100K) with about 90 columns.
I will try to give you a different perspective of this problem by separating write and read operations and their respective stores. A big red warning before diving into this in more depth: while you can optimize for performance, you are also introducing complexity, which you should understand, consider, and discuss with your business/operations managers.
Write operation
As you mention, your “transactions with a subset of functionality” are your use cases/models for write operations. This way, you can optimise the write models as your business drives you. Your write models could still be stored in your current RDBMS of choice.
The first important distinction and added value here is that you allow yourself to:
- Separate the write concern (and its persistence store) from reads
- Easily add cross-cutting concerns like testing, logging, auditing, deadlock management, authorising, scaling, etc.
Your writes are basically commands that originate from your client’s requests. These commands contain all the necessary payload for storing your use-case optimised model to the write persistence store.
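As a rough illustration, a command and its handler for one of the ~25 transaction types could look like the sketch below; every name in it (RecordPaymentCommand, the ICommandHandler interface, the table and column names) is hypothetical and not taken from the question:

```csharp
using System;
using System.Data.SqlClient;

// Hypothetical command for one of the ~25 transaction types.
// It carries the full payload needed to persist the write model.
public sealed class RecordPaymentCommand
{
    public Guid TransactionId { get; set; }
    public Guid ClientId { get; set; }
    public decimal Amount { get; set; }
    public DateTime OccurredAtUtc { get; set; }
}

// Generic handler abstraction; one implementation per command type.
public interface ICommandHandler<TCommand>
{
    void Handle(TCommand command);
}

// Stores the write model in the existing, per-transaction-type table.
public sealed class RecordPaymentCommandHandler : ICommandHandler<RecordPaymentCommand>
{
    private readonly string _connectionString;

    public RecordPaymentCommandHandler(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void Handle(RecordPaymentCommand command)
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var cmd = connection.CreateCommand())
        {
            cmd.CommandText =
                "INSERT INTO dbo.PaymentTransactions (TransactionId, ClientId, Amount, OccurredAtUtc) " +
                "VALUES (@id, @client, @amount, @at)";
            cmd.Parameters.AddWithValue("@id", command.TransactionId);
            cmd.Parameters.AddWithValue("@client", command.ClientId);
            cmd.Parameters.AddWithValue("@amount", command.Amount);
            cmd.Parameters.AddWithValue("@at", command.OccurredAtUtc);

            connection.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```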
Read operation
Depending on your data structure and relations, read models are a way to store data in a denormalised form. As you wrote, the current read operation consists of 14 SQL joins, which can be a nightmare to deal with. This is where use-case-optimised read models can step in.
Your read models are the queries that serve the denormalised data (maybe from a separate data store). By using separate read models you optimise for performance, but you also add complexity. Again, this allows you to:
- Separate the read concern (and its persistence store) from writes
- Easily add cross-cutting concerns like testing, logging, auditing, caching, authorising, scaling, etc.
Read-side storage could range from an RDBMS to fast NoSQL solutions like document DBs and key-value stores.
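To make that concrete, the read model here can be as flat as the table it maps to; the class and column names below are invented for illustration:

```csharp
using System;
using System.Collections.Generic;

// Flat read model: one row per transaction in the single denormalised table
// (in the question's case, roughly 90 columns wide and under 100K rows).
public sealed class TransactionSummaryReadModel
{
    public Guid TransactionId { get; set; }
    public Guid ClientId { get; set; }
    public string ClientName { get; set; }
    public decimal Amount { get; set; }
    public string Status { get; set; }
    // ...the remaining denormalised columns
}

// The query side only ever reads that one table, so no 14-way join
// is executed at request time.
public interface ITransactionSummaryQuery
{
    IReadOnlyList<TransactionSummaryReadModel> ForClient(Guid clientId);
}
```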
Synchronisation between the data stores
The way I do it is: whenever a command executes, it publishes an event containing the payload of IDs necessary to start reconstituting the read store (to be in sync with the write side).
To answer the question you asked in a comment, syncing really depends on your actual architecture. I, for one, am really against pushing code out to external sources (e.g. triggers, batch jobs) if possible. I prefer the inverse: push all code as deep as possible into my own implementations (over which I have direct control). As stated, I’m using commands, events, and queries for data manipulation.
A concrete lightweight example:
1. A request is caught by an IIS-hosted WebApi or MVC controller.
2. The controller populates the respective command object with values from the request.
3. A CommandHandler gets fired up from the controller.
4. The command handler does its job, storing the given write model.
5. The command handler raises an event to notify subscribers about the change.
6. An EventHandler, responsible for handling the event raised by the previous command handler, fires up a new command that’s responsible for reconstituting the read store side.
To summarize the points above: you have the flow of syncing in your hand, and do not depend on third-party solutions like triggers, NT services, and so on.
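A stripped-down sketch of that flow, reusing the hypothetical ICommandHandler and RecordPaymentCommand types from the earlier sketch (the request DTO, the event, and the read-store updater are likewise placeholders for whatever infrastructure you already have):

```csharp
using System;
using System.Web.Http;

// Incoming request payload.
public sealed class RecordPaymentRequest
{
    public Guid ClientId { get; set; }
    public decimal Amount { get; set; }
}

// Steps 1-3: the controller catches the request, populates the command, and fires the handler.
public class PaymentsController : ApiController
{
    private readonly ICommandHandler<RecordPaymentCommand> _handler;

    public PaymentsController(ICommandHandler<RecordPaymentCommand> handler)
    {
        _handler = handler;
    }

    [HttpPost]
    public IHttpActionResult Post(RecordPaymentRequest request)
    {
        _handler.Handle(new RecordPaymentCommand
        {
            TransactionId = Guid.NewGuid(),
            ClientId = request.ClientId,
            Amount = request.Amount,
            OccurredAtUtc = DateTime.UtcNow
        });
        return Ok();
    }
}

// Step 5: raised by the command handler once the write model is stored.
public sealed class PaymentRecordedEvent
{
    public Guid TransactionId { get; set; }
}

// Step 6: reacts to the event by rebuilding the affected row(s) in the read table.
public interface IReadStoreUpdater
{
    void RebuildRowsFor(Guid transactionId);
}

public sealed class PaymentRecordedEventHandler
{
    private readonly IReadStoreUpdater _readStore;

    public PaymentRecordedEventHandler(IReadStoreUpdater readStore)
    {
        _readStore = readStore;
    }

    public void Handle(PaymentRecordedEvent @event)
    {
        // Re-query only what changed and upsert it into the denormalised table.
        _readStore.RebuildRowsFor(@event.TransactionId);
    }
}
```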
Some things to note:
- Since we adhere to Command Query Separation, we can easily decorate the command-handling step with authorisation, validation, logging, etc. without actually modifying the code in the Controller or the command handler.
- As you may have noticed, we strive to keep the parts of this subsystem as decoupled and testable as possible. Also, the concerns are clearly separated.
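That decoration is just the classic decorator pattern around the handler interface; for instance, a hypothetical logging decorator that wraps any command handler without the controller or the handler itself knowing about it:

```csharp
using System;

// Wraps any ICommandHandler<T>; the controller keeps depending on the interface only.
public sealed class LoggingCommandHandlerDecorator<TCommand> : ICommandHandler<TCommand>
{
    private readonly ICommandHandler<TCommand> _inner;

    public LoggingCommandHandlerDecorator(ICommandHandler<TCommand> inner)
    {
        _inner = inner;
    }

    public void Handle(TCommand command)
    {
        // Replace Console with your actual logging framework.
        Console.WriteLine($"Handling {typeof(TCommand).Name} at {DateTime.UtcNow:O}");
        _inner.Handle(command);
        Console.WriteLine($"Handled {typeof(TCommand).Name}");
    }
}

// Composition, normally done by your IoC container:
// var handler = new LoggingCommandHandlerDecorator<RecordPaymentCommand>(
//     new RecordPaymentCommandHandler(connectionString));
```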
Your question about “what if the event handler dies?” is going to make this a bit more complex, by introducing two new concepts: Eventual Consistency and message/event queuing. Quoting Greg Young:
On most systems you can just use a queue as the read system will want pretty much every event from the write system anyways.
This concept extends this answer with:
- a place to store all events that happened on the write side
- a queue that goes through these stored events and applies them to the read side
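A very rough sketch of that idea, reusing the hypothetical PaymentRecordedEvent from above and an in-process queue purely for illustration; in practice you would use a durable store (a SQL table, MSMQ, RabbitMQ, etc.) so that events survive a crashed handler:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// The write side appends events; a background consumer applies them to the read side.
public sealed class EventQueue
{
    private readonly BlockingCollection<PaymentRecordedEvent> _events =
        new BlockingCollection<PaymentRecordedEvent>();

    public void Append(PaymentRecordedEvent @event) => _events.Add(@event);

    public Task StartConsumer(Action<PaymentRecordedEvent> applyToReadStore, CancellationToken token)
    {
        return Task.Run(() =>
        {
            foreach (var @event in _events.GetConsumingEnumerable(token))
            {
                try
                {
                    applyToReadStore(@event);
                }
                catch (Exception)
                {
                    // Naive retry: put the event back instead of losing it.
                    // This is what makes the read side eventually consistent.
                    _events.Add(@event);
                }
            }
        }, token);
    }
}
```

Whether this stays within the roughly 3-second staleness budget from the question then depends only on how quickly the consumer drains the queue.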
Conclusion
All of the above is nothing new. I like to call it CQRS Lite, a term I have borrowed from Dino Esposito. He has a very comprehensive tutorial course on this at Pluralsight, and I wholeheartedly recommend watching it there, as there is a lot to learn from him. [1]
[1] Note that this course is behind a paywall, meaning you need to pay to watch it, and I would like to make it clear that I am not affiliated with Pluralsight in any way.
Another good source of information on this subject is CQRS Performance Engineering: Read vs Read/Write Models from Derick Bailey.
Another option is to alter the code which updates the database to populate both tables.
This is simple if the code base has a proper level of abstraction around the table population, i.e. the update actions go through some kind of wrapping class, library, or API. You simply change this wrapper to do two inserts/updates rather than one at the relevant places.
If your code does not have this abstraction, and lots of places insert or update the database directly, then this won’t be so easy. But adding such a wrapper layer might be a possible solution; you would have to do a good cost/benefit check to see whether it would be worth it.
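A sketch of what that wrapper change could look like; the repository, table, and column names are made up, and the important part is that both writes happen in one transaction so the two tables cannot drift apart:

```csharp
using System;
using System.Data.SqlClient;

public sealed class PaymentRepository
{
    private readonly string _connectionString;

    public PaymentRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void Save(Guid transactionId, Guid clientId, decimal amount)
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            using (var cmd = connection.CreateCommand())
            {
                cmd.Transaction = tx;

                // Existing write to the normalised transaction table.
                cmd.CommandText =
                    "INSERT INTO dbo.PaymentTransactions (TransactionId, ClientId, Amount) " +
                    "VALUES (@id, @client, @amount)";
                cmd.Parameters.AddWithValue("@id", transactionId);
                cmd.Parameters.AddWithValue("@client", clientId);
                cmd.Parameters.AddWithValue("@amount", amount);
                cmd.ExecuteNonQuery();

                // New, second write: upsert the corresponding row in the flat read table.
                cmd.CommandText =
                    "UPDATE dbo.TransactionSummary SET Amount = @amount WHERE TransactionId = @id; " +
                    "IF @@ROWCOUNT = 0 INSERT INTO dbo.TransactionSummary (TransactionId, ClientId, Amount) " +
                    "VALUES (@id, @client, @amount);";
                cmd.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}
```

The trade-off is that every write now pays for the extra upsert, but the read table is never stale, since both writes commit together.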
A materialized view may work for you here.
To support efficient querying, a common solution is to generate, in advance, a view that materializes the data in a format most suited to the required results set. The Materialized View pattern describes generating prepopulated views of data in environments where the source data is not in a format that is suitable for querying, where generating a suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
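On SQL Server, the closest built-in equivalent is an indexed view. Be aware of its restrictions: it requires SCHEMABINDING and does not allow OUTER joins, so the existing 14-table LEFT OUTER JOIN could not be indexed as-is. A minimal sketch over two hypothetical tables, issued from C# here only to keep the examples in one language (normally this DDL lives in a migration script):

```csharp
using System.Data.SqlClient;

public static class ReadViewSetup
{
    // One-off DDL; dbo.PaymentTransactions and dbo.Clients are invented example tables.
    public static void CreateIndexedView(string connectionString)
    {
        const string createView = @"
            CREATE VIEW dbo.vTransactionSummary
            WITH SCHEMABINDING
            AS
            SELECT t.TransactionId, t.ClientId, c.ClientName, t.Amount
            FROM dbo.PaymentTransactions AS t
            INNER JOIN dbo.Clients AS c ON c.ClientId = t.ClientId;"; -- INNER joins only

        // The unique clustered index is what actually materializes the view;
        // SQL Server then maintains it automatically as the base tables change.
        const string createIndex = @"
            CREATE UNIQUE CLUSTERED INDEX IX_vTransactionSummary
            ON dbo.vTransactionSummary (TransactionId);";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var cmd = new SqlCommand(createView, connection))
            {
                cmd.ExecuteNonQuery();
            }
            using (var cmd = new SqlCommand(createIndex, connection))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

SQL Server keeps the clustered index in sync with every write to the base tables, so the data is never stale; the cost is paid on the write path instead.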
The answer depends on two things:
- whether you have many updated rows (i.e. more than 100,000 rows per hour), and
- whether it is acceptable for the aggregated data to be a day old.
If so, you can do what data warehouses or “BI” systems do: use a second analysis database that gets updated once a day, when the transactional database (where the write operations take place) is under low load.
This has the advantage that the transactional database is not blocked when analysis takes place.
Another approach may be to circumvent the database for individual read access altogether. To do this, you could have a background process that queries the database and writes the results into a file.
The format of the file depends on what you need to serve out to the requesting party.
Also, if you need high availability, you should work with two files, one being read/served out, and one being written by the background process. You could then simply use a renaming strategy to get the updated file in place (rename the read file to a temp file, rename the updated file to the read file and delete the temp file).
This way you can achieve the highest possible read speed, with minimal impact on both the rest of the software and the write performance.
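A small sketch of that two-file swap (the paths and the serialized payload are placeholders); File.Replace swaps a prepared file into place so readers only ever see a complete old snapshot or a complete new one:

```csharp
using System.IO;

public static class ReadFilePublisher
{
    // Called periodically by the background process after it has re-queried the database.
    public static void Publish(string serializedResults, string liveFile)
    {
        var tempFile = liveFile + ".new";
        var backupFile = liveFile + ".bak";

        // Write the fresh snapshot next to the live file first...
        File.WriteAllText(tempFile, serializedResults);

        // ...then swap it in.
        if (File.Exists(liveFile))
        {
            File.Replace(tempFile, liveFile, backupFile);
            File.Delete(backupFile);
        }
        else
        {
            File.Move(tempFile, liveFile);
        }
    }
}
```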