I have a project which uses a rolling window to count events from incoming XML files and rules based on the values of the window. For example if 10 events of X happen in an hour – trigger an alert.
Now further requirements have surfaced where we need to pinpoint the data that caused the alarm to trigger, ie. What 10 XML docs/records caused the alert.
What would be the most efficient way to:
- Store the XML documents into a DB (Oracle XML was too slow). They
need to be in a DB so you can run general queries on the DB. - Reference XML documents against specific windows (ie. XML doc 1 updated window
Ive attempted to store the XML dump into a table with a primary key, and another table that has a many-many relation between the XML and the sliding windows it changes.
I’m thinking maybe a HADOOP solution would be more scalable and faster? What are other options? In terms of requirements I’m looking for:
- simplicity in design
- accessibility such that users can query the XML docs without needing an alarm. Users may wish to find all XML files that have event X in them
- scalability
Extract the primary data you need for searching/querying into its own table.
For example, if users want to search by type of event and timestamp, put that information into a simple table which will be fast to query. Every time a new XML file comes in, process it and add rows to this table.
Once you have a simple table of events and timestamps, it will be trivial to calculate your rolling window.
You can also dump the raw XML files into the database in case more detailed information is needed. Your simple table can include a reference to the original XML file. But if you can narrow down the search first with a simple ordinary database table, it will be much more efficient.
Raw XML data is always going to be slow to query. Queries of large numbers of XML files are something that should be avoided if possible.
If you do need to be able to perform arbitrary queries of complex XML data, and the structure and content of the queries can’t be predicted in advance, something like Hadoop is a good solution. This allows you to simply throw a lot of raw processing power at the XML processing problem and overcome its inherent inefficiency that way. But from your question it sounds like you really want to query some relatively simple data; if that’s true, I would go with a more traditional database solution.
2