I have a very big file whose lines are delimited by the character sequence '*L*I*N*E'. The file is on the order of 250 GB, and each line is around 600 to 1000 bytes. I will be performing the following operations on the file:
Read the file line by line and, for each line, hand it to a parser which does some calculation on it and updates some stats. The parser takes roughly 15 microseconds per line.
As of now I am using a BufferedReader to read the lines and pass them to the parser in a single thread. My question is: if I have a separate reader thread that only reads the file and dumps everything into an in-memory queue, and have my parser act on that queue in a separate consumer thread, can I achieve better throughput?
Nothing else changes, except that my parser acts on the in-memory data and another thread handles only the I/O (reading the file and dumping the lines into the queue).
This is one part of a complex piece of software, and that part is what I am trying to speed up, so I am not able to post the actual code.
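To make the idea concrete, here is a minimal sketch of the producer-consumer setup I have in mind (not my actual code; `parseLine` is a placeholder for the real parser, and I am assuming newline-delimited input here, leaving out the custom '*L*I*N*E' delimiter handling):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderParserPipeline {

    private static final String POISON_PILL = "__EOF__"; // sentinel marking end of input

    // Placeholder for the real parser (~15 microseconds per line).
    static void parseLine(String line) {
        // ... calculations and stats updates ...
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded so memory stays in check

        // Reader thread: does only I/O and hands lines to the queue.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);
                }
                queue.put(POISON_PILL);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        // Parser (consumer) thread: works only on in-memory data.
        Thread parser = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(POISON_PILL)) {
                    parseLine(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        parser.start();
        reader.join();
        parser.join();
    }
}
```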
250 GB split into lines of 600 to 1000 bytes gives you at least about 250,000,000 lines (taking 1000 bytes per line). Spending 15 microseconds parsing each of them will take about 3750 seconds, i.e. roughly an hour (one hour is 3600 seconds).
About an hour is therefore the time spent on parsing, interleaved with the reader code, when everything runs in the same thread.
To estimate the throughput gain, I'd next benchmark the single-threaded version. To save time I'd likely benchmark with a smaller file, say 2.5 GB (the timing would then have to be multiplied by 100) or even 250 MB (timing multiplied by 1000).
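Such a benchmark could be as simple as the sketch below (`parseLine` stands in for your actual parser; newline-delimited input is assumed):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SingleThreadedBenchmark {

    // Placeholder for the real parser, assumed to take ~15 microseconds per line.
    static void parseLine(String line) {
        // ... calculations and stats updates ...
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                parseLine(line);
                lines++;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines in " + elapsedMs + " ms");
        // For a 2.5 GB sample, multiply the elapsed time by 100 to estimate the full 250 GB run.
    }
}
```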
With the benchmarked single-threaded timing, I'd get an (optimistic) estimate of the throughput gain:
- a single-threaded timing of 24 h would give me >= 23 h in multithreaded execution
- … 4 h would give me >= 3 h
- … 2 h would give me >= 1 h
- … 1.5 h or less would still give me >= 1 h (yes, below 2 h of total time, parsing becomes the bottleneck instead of file reading)
Actually, with a benchmark below 2 h I'd also check whether the parsing itself could run concurrently, because in that case having more than one parser thread could increase the gain; see the sketch below.
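As a rough illustration of that variant (reusing the bounded queue and poison-pill sentinel from the sketch in the question, and assuming the stats updates can be made thread-safe, e.g. via atomics or per-thread accumulators), the single consumer could be replaced with a small pool of parser threads:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Replaces the single parser thread with a small pool draining the same queue.
// Assumes the parser callback is thread-safe (atomics or per-thread accumulators for the stats).
public class ParserPool {

    public static ExecutorService start(BlockingQueue<String> queue, String poisonPill,
                                        int threads, java.util.function.Consumer<String> parseLine) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(poisonPill)) {
                        parseLine.accept(line);
                    }
                    queue.put(poisonPill); // pass the sentinel on so the remaining workers also stop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        return pool;
    }
}
```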
> Assume 1 MB/s I/O. Can I achieve better throughput in multithreaded mode?
Assuming the above, reading the 250 GB file would take about 250,000 seconds: 250 gigabytes is 250 thousand times 1 megabyte.
250,000 seconds is about three days. Parsing in a separate thread would then save you an hour or less out of those 3 days (roughly 1.5%). It's up to you to decide whether that is worth it. I personally would rather think about something like GFS plus MapReduce to handle stuff like that.