I am writing a program to crawl the web and find frequently used words. What would be a good way to store the data? I was thinking of using an RDBMS with a single table (with columns 'word' and 'count'), but for something so simple that seems like overkill.
I need to achieve some functionality like sort by count/find n letter words etc.
Is there a better way to do it? Or is using RDBMS the way to go?
I have used SQLite for many such small programs. It's a full RDBMS, has a small memory footprint, doesn't need any long-running server process, and has the most permissive license around (public domain).
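As a minimal sketch of the single-table approach using Python's built-in `sqlite3` module (table and column names are just for illustration; the upsert syntax assumes SQLite 3.24+):

```python
import sqlite3

# In-memory database for illustration; pass a filename to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, count INTEGER)")

def add_word(word):
    # Upsert: insert the word, or bump its count if it already exists.
    conn.execute(
        "INSERT INTO words (word, count) VALUES (?, 1) "
        "ON CONFLICT(word) DO UPDATE SET count = count + 1",
        (word,),
    )

for w in ["the", "web", "the", "crawl", "the"]:
    add_word(w)

# Sort by count, most frequent first.
by_count = conn.execute(
    "SELECT word, count FROM words ORDER BY count DESC"
).fetchall()

# Find all n-letter words (n = 3 here).
three_letter = conn.execute(
    "SELECT word FROM words WHERE length(word) = 3"
).fetchall()
```

Both of the query requirements from the question (sort by count, find n-letter words) fall out of plain SQL, which is a point in the RDBMS's favor.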
But really, if all you're doing is storing a mapping of word to count, then almost anything will do as long as your word list fits comfortably in memory. A hash table would be the natural data structure.
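In Python, for instance, `collections.Counter` is exactly this hash table, with the sort-by-count query built in (the word list here is made up for illustration):

```python
from collections import Counter

counts = Counter()
for word in ["the", "quick", "the", "fox", "the"]:
    counts[word] += 1  # hash-table lookup and increment

# Sort by count, most frequent first.
by_count = counts.most_common()

# Find all n-letter words (n = 3) with a linear scan over the keys.
three_letter = [w for w in counts if len(w) == 3]
```

The n-letter-word query is a full scan rather than an indexed lookup, but for an in-memory word list that is rarely a problem.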
Once you get a large sample of your data and you start to see patterns in it, you can start optimizing the data structure depending on your usage.
Two other options that come to mind are Berkeley DB, a high-performance file-based key-value store, and RRDTool, which is highly optimized for counters and graphs.
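To give a flavor of the key-value-store approach without pulling in Berkeley DB bindings, here is a hedged sketch using Python's stdlib `dbm` module, which provides the same kind of file-based key-value storage (the filename is arbitrary; values are stored as bytes, so counts are encoded as strings):

```python
import dbm

# "c" opens the database file, creating it if necessary.
with dbm.open("wordcounts", "c") as db:
    for word in ["web", "crawl", "web"]:
        key = word.encode()
        if key in db:
            db[key] = str(int(db[key]) + 1).encode()
        else:
            db[key] = b"1"
```

Unlike the RDBMS, a plain key-value store gives you only lookup by word; sorting by count or filtering by length means iterating over every key yourself.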