lightweight document indexing to handle fewer than 250k potential records

  softwareengineering

Recently I’ve found myself chafing at the limitations of document indexing engines. I was developing a small website that needed some fairly robust search capabilities, but due to the host’s hardware constraints I couldn’t deploy a Lucene-ish solution (such as Solr or ElasticSearch, as I normally would) to handle this need.

And even then, while I needed to serve up some complex, database-intensive data and calculations, I never needed to handle more than 250k potential records. Deploying an entire Solr or ES instance just for this seemed like a waste.

The more I thought about it, the more it seemed like a fairly widespread problem. Most people handle search requirements solely with SQL: they just run SQL queries against their data and that’s that. Their search capabilities also end up being terrible:

  • Doing a blanket full-text wildcard search can be painfully slow on some systems (shared hosts in particular) and bog down your database, especially if you have complicated queries and lots of joins.

  • You end up running multiple queries for a single user request. You might get around this with ever-more-complicated queries, but see the previous point.

  • You lack the features typically present in full-text engines, such as stemming, relevance ranking, and fuzzy matching.

Databases had the same problem of needing to be deployed as a server; then SQLite came along, and suddenly we could deploy a database that is self-contained in a single file. My Googling has produced nothing, so I wonder whether something like this exists for full-text indexing/searching.

What factors should be taken into account when deciding whether to implement lightweight document indexing (e.g., as explained in the answers to another question) or to keep using SQL in these situations?


You know, I’ve got to say: consider using Redis.

  • Use the idea of context. It’s hard to go into depth without knowing more about the documents, but you can often discern many things from their headings. Profiling each document is the basic first step, just like in web crawling.

  • For each document, count the words that appear in a dictionary of keywords. Keep track of each word’s popularity count across the whole project. Add more weight to this count if you can detect high relevance in a document or set. (See the keyword-count sketch after this list.)

    The first thing this does is give you an all-inclusive list of words in your whole set. For anything NOT found in that list, automatically return ‘no results’. I’d suggest that results ranking lower than the bottom 5–20% of popularity (when running a search query on the index) also simply say ‘no results’.

  • If you do go with something like Redis, or even just roll your own in-memory structure, you can pair documents with descriptor files or a mini-DB file and page objects that describe each specific document back and forth to memory. Keep the common searches in memory, perhaps by having them compete for slots or by giving them a time to live that grows on each search. (See the caching sketch after this list.)

  • To go further, start saving reference data that groups a link/ref/pointer/index/whatever for two or more documents together with a pool of keywords or phrases. Basically, you get a pumped-up tag cloud.

  • Further still, add phrase detection by tracking when a word in your dictionary is commonly followed or preceded by an exact string in documents with similar metadata/titles. This is intensive, but it requires only one pass to render the data. (See the bigram sketch after this list.)

  • The more ways you can segregate your data and keep the groups related to each other in actual usage, the better.

  • Track the likelihood of correctness by recording every time a user clicks a result that is not in the top three. Improve phrase detection by watching user searches that didn’t deliver perfect results. Make your queries adapt to what your clients actually search for. (See the click-tracking sketch after this list.)

  • Do you have to watch for document updates? Cron jobs/shell scripts or scheduled tasks/batch scripts can help. There are various options for scheduling and scripting, though, obviously.

  • Waste disk, gain speed, lose complexity. Save multiple trees of your documents and/or trees of links to the documents. Only search the trees whose criteria have been met, or at least prefer them, to get results quicker in most cases.

  • Make your own lightweight permutation engine, or find one that uses quick character detection and no regex. Or just make one using regex in a few hours, but the performance difference will be noticeable here once searches get heavy enough. (See the trie sketch after this list.)

  • So many things.
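
As a concrete starting point for the keyword-dictionary bullet, here is a minimal sketch in Python. It assumes a local Redis instance and the redis-py client; the key names (kw:popularity, kw:doc:<id>, kw:docs:<word>), the placeholder dictionary, and the 10% popularity cutoff are all invented for illustration.

    import re
    import redis

    r = redis.Redis()  # assumes Redis running on localhost:6379

    # The project-wide keyword dictionary (contents are placeholders).
    DICTIONARY = {"indexing", "search", "document", "query"}

    def index_document(doc_id, text):
        # Count dictionary words per document and bump project-wide popularity.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in DICTIONARY:
                r.hincrby("kw:doc:%s" % doc_id, word, 1)  # per-document count
                r.zincrby("kw:popularity", 1, word)       # whole-project popularity
                r.sadd("kw:docs:%s" % word, doc_id)       # docs containing the word

    def search(word, cutoff=0.10):
        # Anything not in the dictionary is an automatic 'no results'.
        if word not in DICTIONARY:
            return []
        # Words in the bottom slice of popularity also return 'no results'.
        rank, total = r.zrank("kw:popularity", word), r.zcard("kw:popularity")
        if rank is None or (total and rank < total * cutoff):
            return []
        return [d.decode() for d in r.smembers("kw:docs:%s" % word)]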
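
One way to read the “time to live that grows on each search” bullet is a Redis cache whose expiry is extended on every hit, so popular queries stay resident. Again a sketch with redis-py; the TTL numbers and key prefix are arbitrary, and run_query stands in for whatever actually executes the search.

    import json
    import redis

    r = redis.Redis()

    BASE_TTL = 60    # seconds a fresh result lives (arbitrary)
    TTL_BOOST = 60   # extra life granted on each cache hit (arbitrary)

    def cached_search(query, run_query):
        key = "cache:search:%s" % query
        hit = r.get(key)
        if hit is not None:
            # Popular searches survive longer: grow the TTL on every hit.
            r.expire(key, max(r.ttl(key), 0) + TTL_BOOST)
            return json.loads(hit)
        results = run_query(query)  # fall through to the real search
        r.setex(key, BASE_TTL, json.dumps(results))
        return results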
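
The phrase-detection bullet can be approximated in a single pass by counting which exact word commonly follows each dictionary word. Plain Python, no dependencies; grouping by similar metadata/titles is left out to keep the sketch short, and min_count is an arbitrary threshold.

    from collections import Counter

    DICTIONARY = {"document", "search", "index"}  # placeholder dictionary

    def detect_phrases(docs, min_count=3):
        # One pass over all documents: count (dictionary word, next word) pairs.
        pairs = Counter()
        for text in docs:
            words = text.lower().split()
            for a, b in zip(words, words[1:]):
                if a in DICTIONARY:
                    pairs[(a, b)] += 1
        # Pairs seen often enough become candidate phrases.
        return [" ".join(p) for p, n in pairs.items() if n >= min_count]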
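
For the click-tracking bullet, the simplest version is a per-query counter that is bumped whenever a user clicks a result outside the top three, which a later ranking pass can blend in. Redis sorted sets fit naturally here; the key names are invented.

    import redis

    r = redis.Redis()

    def record_click(query, doc_id, position):
        # A click below the top three suggests the ranking was off for this query.
        if position > 3:
            r.zincrby("clicks:%s" % query, 1, doc_id)

    def rerank(query, ranked_doc_ids):
        # Pull documents users actually click toward the front; sorted() is
        # stable, so unclicked documents keep their original order.
        clicks = dict(
            (d.decode(), s)
            for d, s in r.zrange("clicks:%s" % query, 0, -1, withscores=True)
        )
        return sorted(ranked_doc_ids, key=lambda d: clicks.get(d, 0), reverse=True)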
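
One regex-free reading of the “quick character detection” bullet is a small trie that matches dictionary terms by walking characters one at a time. Plain Python, deliberately minimal.

    def build_trie(words):
        # Nested dicts; "$" marks the end of a complete word.
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True
        return root

    def complete(trie, prefix):
        # Walk the prefix, then collect every word below that node.
        # e.g. complete(build_trie(["search", "seat"]), "sea") -> ["search", "seat"]
        node = trie
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        matches = []

        def walk(n, acc):
            if "$" in n:
                matches.append(prefix + acc)
            for ch, child in n.items():
                if ch != "$":
                    walk(child, acc + ch)

        walk(node, "")
        return matches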

These are meant as possible solutions for implementing robust document indexing and searching. The list isn’t all-inclusive. And at that, you’d probably do better to grab a spare box, throw a neural net on it, and spend a couple of days making a nice web interface to that neural net.

