I’m working on a project that generates a massive amount of data and stores it in a SQL
database. I’m talking about roughly 1,000 records per second. I then have to push that data to the database. If I do the data generation and the database push sequentially, it slows down the data generation, which is not acceptable. I considered keeping a buffer and two parallel processes: one writes data into the buffer, and the other reads from the buffer and pushes it to the database. This solution might work, but the problem is that, because of the database’s write latency, the buffer grows too large and eventually causes an out-of-memory exception. Writing to a file and then updating the database from the file doesn’t seem to be a very good solution either. From a software engineering perspective, what other solutions can I consider?
Edit in response to comments:
- Writing to a file doesn’t seem like a good solution either, as the file would also grow too large. I feel that’s not how a tech giant would solve this kind of problem.
- The data is generated from a calculation. The calculation is time-critical and we expect one calculation per second, so about 1K records per second.
Consider not inserting into the final table, but into an intermediate table with no indexes to maintain or constraints to validate. That way insertions will be a lot faster.
On the other hand, that table is just a buffer: you still have to move those records to the final table, and you may find the bottleneck again there.
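As a rough illustration of the staging-table idea (sqlite3 is only a stand-in for whatever SQL database you use, and the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("example.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS staging  (ts REAL, value REAL);   -- no indexes, no constraints
    CREATE TABLE IF NOT EXISTS readings (ts REAL, value REAL);   -- the real, indexed table
    CREATE INDEX IF NOT EXISTS idx_readings_ts ON readings (ts);
""")

def insert_fast(rows):
    # Cheap append-only insert into the unindexed staging table.
    conn.executemany("INSERT INTO staging (ts, value) VALUES (?, ?)", rows)
    conn.commit()

def move_to_final():
    # One set-based move: the index on readings is maintained once per batch,
    # not once per generated record.
    with conn:
        conn.execute("INSERT INTO readings SELECT ts, value FROM staging")
        conn.execute("DELETE FROM staging")
```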
It seems to me you will have to use a buffer that is written to the database when a certain threshold is reached. That buffer can be memory or a file. But in the end the producer may have to slow down, unless you spend good money on a very high-performance setup with heaps of RAM, solid-state disks, Fibre Channel and the like. That will set you back serious cash.
Depending on how tech-giant-y you want to go, it sounds like a case for an ESB (enterprise service bus).
Apache Kafka seems to be a good fit for you. (LinkedIn, its creator, boasts 2 million writes/sec on an instance.)
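For example, with the kafka-python client the producer side could look roughly like this (the broker address and topic name are assumptions; a separate consumer process would drain the topic and bulk-insert into SQL at its own pace):

```python
import json
import time

from kafka import KafkaProducer          # pip install kafka-python

# Broker address and topic name are placeholders for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(record):
    # send() is asynchronous; the client batches messages internally,
    # so the calculation is not blocked by database latency.
    producer.send("generated-records", value=record)

publish({"ts": time.time(), "value": 42})
producer.flush()                          # make sure buffered messages go out
```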
If you want to implement this yourself, you should probably look into buffering in memory (memcached/Redis?) and pushing it out in bulk (maybe a Lua script on Redis’s embedded Lua interpreter?), as others have pointed out.
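A minimal sketch of the Redis variant using redis-py (the key name and batch size are made up, and a MULTI/EXEC pipeline is used here instead of a Lua script to grab and trim a chunk atomically):

```python
import json

import redis                                   # pip install redis

r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance

def buffer_record(record):
    # Producer side: cheap in-memory append, independent of DB latency.
    r.rpush("record_buffer", json.dumps(record))

def drain(batch_size=1000):
    # Worker side: atomically read a chunk and trim it off the list,
    # then bulk-insert the returned rows into the SQL database.
    pipe = r.pipeline()                        # transactional by default
    pipe.lrange("record_buffer", 0, batch_size - 1)
    pipe.ltrim("record_buffer", batch_size, -1)
    raw, _ = pipe.execute()
    return [json.loads(item) for item in raw]
```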
To compensate for the additional buffering, you’ll trade away some raw performance. Again, you’ll have to consider peaks, maximum load, etc.
I really wanted to post this as a comment, but my reputation is too low for that.
Hope it helps you in any way. 🙂
Use a buffer not (only) as a method of flow control, but as a tool for making the storage operation go faster.
Storage – and databases in particular – is considerably more efficient when it deals with large quantities of data at once. That is, saving a thousand rows will take longer than saving one row, but not a thousand times longer – more like a hundred times, or ten times, or even less (depending on the DB, the data, the hardware, etc.). This means that saving those thousand rows in one batch will be dramatically faster than saving them one at a time.
A buffer can help you do this. You let the buffer grow enough – say, to a thousand records – and then write those thousand records at once, using the database’s facilities for inserting many rows in one go (with SQL databases this usually means working inside a transaction and using a prepared statement). You should also trigger a periodic push to the database even when there aren’t enough records, so that the database remains sufficiently up to date when the flow of data happens to be low.
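A minimal sketch of that buffered bulk insert, assuming an illustrative `readings(ts, value)` table (sqlite3 is just a stand-in for the real database; in production a timer thread would also call `flush()`, so the periodic push happens even when no new rows arrive):

```python
import sqlite3
import time

class BufferedWriter:
    """Collects rows and writes them to the database in batches."""

    def __init__(self, conn, batch_size=1000, max_age_seconds=2.0):
        self.conn = conn
        self.batch_size = batch_size
        self.max_age = max_age_seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        # Flush when the batch is full, or when rows have been sitting too
        # long, so the database stays up to date during quiet periods.
        too_old = time.monotonic() - self.last_flush >= self.max_age
        if len(self.rows) >= self.batch_size or too_old:
            self.flush()

    def flush(self):
        if self.rows:
            with self.conn:   # one transaction, one prepared statement
                self.conn.executemany(
                    "INSERT INTO readings (ts, value) VALUES (?, ?)", self.rows)
            self.rows.clear()
        self.last_flush = time.monotonic()

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value REAL)")
writer = BufferedWriter(conn)
for i in range(2500):
    writer.add((time.time(), i))
writer.flush()   # push whatever is left over
```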
If this is not fast enough, a buffer can also help you parallelize the database insertion. If speed is that critical and you hit a limit that is unacceptably slow, you may want to invest money in hardware. Getting faster hardware is not always possible, but you can always buy more machines, and put more cores and network interfaces in the machines you have.
This means that if you want to invest in hardware to make pushing to the database faster, you can run multiple processes, spread across multiple cores and network interfaces, that read from the buffer and write to the database in parallel. Synchronization overhead should be negligible if you are writing (and reading from the buffer) in large bulks.
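As a rough sketch of parallel writers (psycopg2, the DSN, and the table name are assumptions; each worker process holds its own connection and drains the shared queue in bulk):

```python
import multiprocessing as mp

import psycopg2                    # pip install psycopg2-binary

DSN = "host=dbhost dbname=metrics user=writer password=secret"   # placeholder

def _flush(conn, batch):
    if batch:
        with conn, conn.cursor() as cur:   # commits on success
            cur.executemany(
                "INSERT INTO readings (ts, value) VALUES (%s, %s)", batch)
        batch.clear()

def writer(q):
    # Each worker has its own connection, so batches are inserted in parallel.
    conn = psycopg2.connect(DSN)
    batch = []
    while True:
        row = q.get()
        if row is None:            # sentinel: flush what is left and exit
            break
        batch.append(row)
        if len(batch) >= 1000:
            _flush(conn, batch)
    _flush(conn, batch)
    conn.close()

if __name__ == "__main__":
    q = mp.Queue()
    workers = [mp.Process(target=writer, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()
    # ... the producer puts rows on q, then one None per worker to shut down ...
```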
If writing to a message queue server on the intranet is considerably faster than writing to the database – specifically, fast enough for your needs – you can have multiple servers reading from the same message queue server and writing to the database.
If the database itself is too slow, check whether it supports scaling out to multiple servers. Then each process that reads from the buffer can write to a different database server, effectively acting as a load balancer.
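A toy sketch of that load-balancing idea, assuming the database and the data model actually allow this kind of scale-out (the psycopg2 DSNs and table name are placeholders):

```python
import itertools

import psycopg2

DSNS = ["host=db1 dbname=metrics", "host=db2 dbname=metrics"]   # placeholders
connections = [psycopg2.connect(dsn) for dsn in DSNS]
next_conn = itertools.cycle(connections)

def write_batch(rows):
    conn = next(next_conn)                  # pick the next server for this batch
    with conn, conn.cursor() as cur:        # commits on success
        cur.executemany(
            "INSERT INTO readings (ts, value) VALUES (%s, %s)", rows)
```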
Use a producer-consumer pattern so the work is not processed sequentially. If the queue gets too long, then you have to slow down the producer (see the sketch below).
Work on the insert speed. For small rows you can get up to 1,000 inserts per second.
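A minimal producer/consumer sketch where a bounded queue provides the backpressure (the queue size, batch size, and simulated insert delay are illustrative):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=10_000)          # hard cap on buffered records

def producer():
    while True:
        buf.put((time.time(), 42))         # blocks when the queue is full

def consumer():
    while True:
        batch = [buf.get()]
        while len(batch) < 1000 and not buf.empty():
            batch.append(buf.get())
        time.sleep(0.05)                   # stand-in for one bulk insert

threading.Thread(target=producer, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()
time.sleep(10)                             # let the sketch run briefly
```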