What is the best way to model a transactional system with a need to read a holding/balance?

  softwareengineering

I have a data model that is similar to the bank transactions. Simplified version looks like this:

  • 01.01 +1000
  • 01.02 -300

So I will have all the transactions stored in the database. Users will, of course, want to know the balance rather than the history of transactions (700 in my example).

From what I can see, there are a few ways of solving this problem:

  1. Always calculate the balance from the transactions. This way it will always be correct and data is never duplicated, but reads become rather heavy.
  2. Each time a transaction happens, update a value that represents the balance. This way it’s always easy to read, but the data is duplicated and can diverge in case of errors. This would ideally require an atomic DB transaction, but even then there is no guarantee that some other code did not change the data.
  3. Some hybrid of the previous two approaches: calculate the balance once in a while and create a “balance record per date” that can be used as a baseline for calculations instead of recalculating everything.

I am wondering what the standard way of solving such problems is, and whether there is perhaps some better way that I did not think of.

This is a fairly standard schema for monetary transactions. In a relational database, it can be enforced using a multi-column foreign key and a check constraint.

For single entry bookkeeping, the transaction schema should look more or less like this:

Transaction
    id: primary key
    account_id: foreign key to account table
    previous_transaction_id: unique, same type as id. For transaction zero, set previous_transaction_id = id or null
    previous_balance: number
    mutation_value: number
    current_balance: number

    unique foreign key from Transaction(account_id, previous_transaction_id, previous_balance) references Transaction(account_id, id, current_balance)
    check constraint: previous_balance + mutation_value = current_balance
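Concretely, the schema above could be declared like this. This is a sketch using SQLite via Python; the table is named `txn` because `transaction` is an SQL keyword, and the column names follow the answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE txn (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        previous_transaction_id INTEGER UNIQUE,  -- NULL for transaction zero
        previous_balance INTEGER NOT NULL,
        mutation_value INTEGER NOT NULL,
        current_balance INTEGER NOT NULL,
        -- the parent side of the self-referencing foreign key must be unique
        UNIQUE (account_id, id, current_balance),
        FOREIGN KEY (account_id, previous_transaction_id, previous_balance)
            REFERENCES txn (account_id, id, current_balance),
        CHECK (previous_balance + mutation_value = current_balance)
    )
""")
# transaction zero: previous_transaction_id is NULL, so the FK is not checked
conn.execute("INSERT INTO txn VALUES (1, 42, NULL, 0, 1000, 1000)")
# a follow-up row is only accepted if its previous_balance matches the
# parent row's current_balance, so the chain cannot silently diverge
conn.execute("INSERT INTO txn VALUES (2, 42, 1, 1000, -300, 700)")
conn.commit()
try:
    # 700 - 100 != 999, so the CHECK constraint rejects this row
    conn.execute("INSERT INTO txn VALUES (3, 42, 2, 700, -100, 999)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The unique constraint on `previous_transaction_id` additionally prevents two transactions from claiming the same predecessor, so each account’s history stays a single chain.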

Yes, there’s some data duplication here, but it’s very controlled duplication, as the database will enforce that the duplicated values are always consistent via the foreign key references. The only chance for data inconsistency is a bug in your RDBMS itself; bugs in the application layer cannot violate the constraints.

There is also a somewhat more complicated schema if you want to do double entry bookkeeping. Double entry bookkeeping is useful if you want to enforce that every transaction must have a corresponding opposite transaction, or in other words: “you can move money around within the system, but not create or destroy it”. If this rule is important in your system, then a double entry system might be desirable.
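For illustration, one minimal way to get the “money moves around, but is never created or destroyed” property is to make every row a transfer that names both sides. This is a simplification of a full journal/legs double-entry schema; all table and column names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- each transfer names both the paying and the receiving account,
-- so money can only move between accounts, never appear from nowhere
CREATE TABLE transfer (
    id INTEGER PRIMARY KEY,
    debit_account_id  INTEGER NOT NULL REFERENCES account(id),
    credit_account_id INTEGER NOT NULL REFERENCES account(id),
    amount INTEGER NOT NULL CHECK (amount > 0),
    CHECK (debit_account_id <> credit_account_id)
);
INSERT INTO account VALUES (1, 'cash'), (2, 'savings');
INSERT INTO transfer VALUES (1, 1, 2, 300);
""")
# an account's balance: credits received minus debits paid
balance = conn.execute("""
    SELECT COALESCE(SUM(CASE WHEN credit_account_id = 2 THEN amount
                             WHEN debit_account_id  = 2 THEN -amount END), 0)
    FROM transfer
""").fetchone()[0]
print(balance)
```

A fuller double-entry design would instead split each transfer into separate debit and credit rows grouped under a journal id, with the invariant that every group sums to zero.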

10

#1 would only be an option if one knows an upper limit on the number of transactions per balance, and that limit is not too large. Otherwise, with each new transaction, every read access becomes slower and slower until the system becomes unusable.

So I would recommend #2, in combination with DB transactions. Of course, you have to make sure “no other code” will mess around with the data.
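A minimal sketch of #2, assuming a hypothetical two-table schema, with both writes wrapped in a single database transaction (SQLite via Python; `with conn:` commits on success and rolls back on any exception):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE txn (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount INTEGER NOT NULL
);
INSERT INTO account VALUES (1, 0);
""")

def apply_transaction(conn, account_id, amount):
    # both writes commit together or not at all; on any exception the
    # context manager rolls back, so the cached balance cannot drift
    # out of step with the log through this code path
    with conn:
        conn.execute("INSERT INTO txn (account_id, amount) VALUES (?, ?)",
                     (account_id, amount))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, account_id))

apply_transaction(conn, 1, 1000)
apply_transaction(conn, 1, -300)
print(conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0])
```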

If there is “other code” that can change values, written by someone who does not understand the system correctly, then that hypothetical “other code” could always change the balance in a wrong way, regardless of whether there is a redundant representation of the data or not.

The redundancy then is not necessarily a problem here; quite the opposite: it can be used for additional validation of the data at regular intervals, which may help to detect errors.
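Such a periodic validation could be as simple as recomputing each balance from the log and reporting every account that disagrees. A sketch with hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE txn (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount INTEGER NOT NULL
);
INSERT INTO account VALUES (1, 700);
INSERT INTO txn VALUES (1, 1, 1000), (2, 1, -300);
""")

def audit(conn):
    # list accounts whose stored balance disagrees with the transaction log
    return conn.execute("""
        SELECT a.id, a.balance, COALESCE(SUM(t.amount), 0) AS recomputed
        FROM account a LEFT JOIN txn t ON t.account_id = a.id
        GROUP BY a.id
        HAVING a.balance <> COALESCE(SUM(t.amount), 0)
    """).fetchall()

print(audit(conn))   # consistent: empty list
conn.execute("UPDATE account SET balance = 999 WHERE id = 1")
print(audit(conn))   # now flags account 1: stored 999, recomputed 700
```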

3

Performance vs correctness

This is an inevitable tradeoff.

Storing the calculated value obviously saves you from having to calculate it time and time again, but having this second source of information (transactions + calculated value) allows for the two to contradict one another.

Options 1 and 2 are both valid, but they have different priorities. Option 1 sacrifices performance/database bandwidth for the sake of guaranteed correctness. Option 2 favors performance/database bandwidth but opens the door to inconsistencies.

The consequences of bad values are very contextual. In some applications, they can make the difference between literal life or death of a person, or cause significant financial losses due to decisions made on faulty data. In other cases, it’s a minor annoyance for the user to have to recalculate the total of their monthly grocery expenses. Only invest effort in ensuring correctness to the degree that it actually matters to the end user. If this is banking software, that need is high. If this is a small personal budget application, that need is low.

Option 3 is sort of a duplicate of option 2. Presumably, your application is always going to want to store the transaction history regardless of whether you store a calculated “current” value or not. If that is the case, then options 2 and 3 are pretty much the same; in either case you store both the history and the current value.
If that is not the case, then storing only the calculated value is risky. You have no historical log to reconstruct the value if something goes wrong.
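For completeness, a read under option 3 takes the latest snapshot as a baseline and adds only the transactions recorded after it. A sketch with a hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL);
-- a "balance record per date": the balance after all transactions
-- up to and including as_of_txn_id
CREATE TABLE snapshot (as_of_txn_id INTEGER NOT NULL, balance INTEGER NOT NULL);
INSERT INTO txn VALUES (1, 1000), (2, -300), (3, 50);
INSERT INTO snapshot VALUES (2, 700);
""")
# current balance = latest snapshot + everything recorded since it
row = conn.execute("""
    SELECT s.balance + COALESCE(
        (SELECT SUM(amount) FROM txn WHERE id > s.as_of_txn_id), 0)
    FROM snapshot s ORDER BY s.as_of_txn_id DESC LIMIT 1
""").fetchone()
print(row[0])
```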


Eventual consistency

Systems in this problem domain often resort to eventual consistency as a data model. That is to say, a new transaction gets recorded, and at some point, eventually, the system updates the calculated value to account for that new transaction.

It’s not going to be immediate, but it will eventually happen.

For example, you could have a scheduled job that recalculates the current value every X minutes. Users can enter new transactions, but the calculated sum values will only be updated when that job fires.

This is just one of many ways to do it. Consistency could be restored on a schedule, on a manual trigger, or via automatic database triggers. That’s too broad to cover in a single answer.
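The scheduled-job variant can be sketched as a single recalculation statement that a cron job or scheduler would run every X minutes (hypothetical schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE txn (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount INTEGER NOT NULL
);
INSERT INTO account VALUES (1, 0);
INSERT INTO txn VALUES (1, 1, 1000), (2, 1, -300);
""")

def recalculate_balances(conn):
    # the periodic job: rebuild every cached balance from the transaction log
    with conn:
        conn.execute("""
            UPDATE account
            SET balance = COALESCE(
                (SELECT SUM(amount) FROM txn
                 WHERE txn.account_id = account.id), 0)
        """)

recalculate_balances(conn)  # in production, invoked by a scheduler
print(conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0])
```

Between runs of the job, reads of the cached value may lag behind the log; that lag is the price of eventual consistency.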


Inconsistency is always possible

This would ideally require an atomic db transaction, but even then there is no guarantee that some other code did not change the data.

From the moment you have more than one actor and they need to communicate, you lose the ability to 100% guarantee message delivery. This situation is commonly known as the two generals problem.

In short, two armies that must coordinate to attack together cannot do so when their communication (a messenger on horseback) is never guaranteed to reach their destination. Whether it’s a messenger on horseback dying while on the road, or a packet loss in a network, if the message doesn’t arrive, then you have a big problem on your hands.

If one general proposes a date to attack, how does he know that the second general received the message? If the second general returns an acknowledgement message, how does the second general know that the first general received that acknowledgement?

This ends in an infinite chain of acknowledging the previous message. At some point one party stops sending acknowledgements, effectively forcing the other party to assume that its last message must have been received.

Your runtime (e.g. a REST API) and your database are these two generals. They can never 100% guarantee that a message was received.

  • Maybe the database never stored the data
  • Maybe the database stored the data but failed to confirm that to the REST API

From the perspective of the REST API, it’s impossible to distinguish between the two.

The moral of the story is that you can minimize, but never truly exclude, the possibility of two sources going out of sync. However, you can weigh the odds of something going wrong, how frequently it would happen, and the impact when it does, and judge whether it’s worth spending the effort to minimize the occurrence of inconsistencies.

1
