I need to perform quick searches against a combination of tags while including date ranges:
- who have requested notifications
- who did not respond to a notification sent at least 3 days ago
- and who have not been sent any other notification in the past 3 days
The event data structure is pretty simple:
Normalized Database Structure Performance Concerns
- With billions of events, doing a table scan will not work
- The only column that would index well is Date and it will not always be included in every filter
- EventName will not be distributed well for an index (some event names could include 1/4 of the records)
- Doing a simple WHERE query against this normalized table would most likely require a full table scan which won’t be fast enough
LIKE or Full-Text Search
Another approach is to convert these tags into a single text column one per entity.
- EntryType_EntryName_Date_Time1, EntryType_EntryName_Date_Time2, EntryType_EntryName_Date_Time3
Then, I can do a SQL full-text search.
This would reduce the number of rows by at least 10 times, but I cannot figure out how to search by date range:
- CONTAINS (RequestedNotifications*)
- NOT CONTAINS (OpenNotification_ID4*) (Before 3 days ago ???)
- NOT CONTAINS (SentNotification*) (Since 3 days ago ???)
At best, I could reduce the table and then scan a smaller partition, but I don’t think it will help much.
I have thought about creating a dedicated VM with an in-memory data structure of the entire complex data set.
Basically, I would create a dictionary for each tag type with buckets for date-time ranges and keep everything in hashtables for rapid intersections:
// Some structures like these Dictionary<EventType, Dictionary<EventName, HashSet<int>>> nameIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, HashSet<int>>>> dayIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, Dictionary<Hour, HashSet<int>>>>> hourIndex; // To search like this (kind of) var entityIds = filters.Select(f => hourIndex[f.tagType][f.tagName][f.day][f.hour]) .IntersectMultiple() .ToList(); // Note: In order to perform an operation like before 3 days var filterResults = nameIndex[eventType][eventName].except(dayIndex[...][...][today].union(...[today-1]).union(...[today-2]));
On a dedicated server, it could handle around a billion events in-memory.
- Each Event would require ~12 bytes (~4 bytes for each maintained index)
- 1 Billion Events ~ 12 GB memory ~ $250/month A5 VM
This could be scaled by partioning the entities and persistence wouldn’t be too difficult.
But before going that very custom route, I would like to know if there is a simpler way to do this.
- How can I structure an index so that fast searches can be performed against multiple text and date range filters as per my example?
- What Azure solution would be the best way to solve this problem?
One of the methods I have used in the past to tremendously increase query performance is to pre-aggregate the data along one or more attributes (i.e., event type by day, event type by event name, etc.). This would be the equivalent of running a summation (for example) over the last month’s data and storing the results in a separate table.
This will allow you to archive the transactions (each event record) into a backup table for when you need to see the actual data, but still allow for a quick-ish lookup for current data.
This technique appears to be fairly common for reporting.
You will probably want to increase period for the attribute going further into history to minimize the amount of data that you must query against for longer ranges of data. (You might only need the aggregation by monthly data beyond the last few weeks.)