My service generates a large, ongoing stream of user events, and we would like to do things like “count occurrences of event type T since date D.”
We are trying to make two basic decisions:
- What to store? Storing every event vs. only storing aggregates
  - (Event log style) log every event and count them later, vs.
  - (Time-series style) store a single aggregated “count of event E for date D” for every day
- Where to store the data
  - In a relational database (particularly MySQL)
  - In a non-relational (NoSQL) database
  - In flat log files (collected centrally over the network via syslog-ng)
What is standard practice / where can I read more about comparing the different types of systems?
Additional details:
- The total event stream is large, potentially hundreds of thousands of entries per day
- But our current need is only to count certain types of events within it
- We don’t necessarily need real-time access to the raw data or aggregation results
IMHO, “log all events to files, crawl them at a later time to filter and aggregate the stream” is a pretty standard UNIX Way, but my Rails-y compatriots seem to think that nothing is real unless it’s in MySQL.
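To make that concrete, here is a minimal sketch of the kind of crawler I have in mind (Python; the one-event-per-line, tab-separated format and the invocation are just assumptions for illustration):

```python
import sys
from datetime import date, datetime

def count_events(paths, event_type, since):
    """Count events of one type on or after `since` across flat log files.

    Assumes one event per line: ISO-8601 timestamp, TAB, event type.
    """
    n = 0
    for path in paths:
        with open(path) as f:
            for line in f:
                ts, _, etype = line.rstrip("\n").partition("\t")
                if etype == event_type and datetime.fromisoformat(ts).date() >= since:
                    n += 1
    return n

if __name__ == "__main__":
    # usage: count_events.py EVENT_TYPE YYYY-MM-DD LOGFILE...
    event_type, since, *paths = sys.argv[1:]
    print(count_events(paths, event_type, date.fromisoformat(since)))
```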
5
It always depends, but I’ll give you my advice to offer you a new perspective.
What to store? Storing every event vs. only storing aggregates
(Event log style) log every event and count them later, vs.
If you don’t want to miss any detail, even details that aren’t relevant right now, this is in my eyes the best approach. Sometimes, as results come in, you find that an event you had dismissed for one reason or another, because it seemed irrelevant or to add no extra information, actually does matter after some analysis, and you need to track it too. Because it was recorded all along, even though it wasn’t being counted, adding it to the picture only costs you some reprocessing time.
(Time-series style) store a single aggregated “count of event E for date D” for every day
If you want to implement it and use it tomorrow, this can work. But if you later get new requirements, or you find a correlation with another event that you omitted for whatever reason, then you have to start recording that new event and wait a long time before it has accumulated nice aggregation levels.
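To illustrate the difference: with the raw log, picking up a newly relevant event type is just one more pass over data you already have. A rough sketch (Python; the tab-separated line format is my assumption):

```python
from collections import Counter

def daily_counts(lines, event_types):
    """One pass over raw event lines -> Counter keyed by (day, event_type).

    When an event type turns out to matter later, add it to `event_types`
    and re-run over the archived log; the history is already recorded.
    """
    counts = Counter()
    for line in lines:
        ts, _, etype = line.rstrip("\n").partition("\t")
        if etype in event_types:
            counts[(ts[:10], etype)] += 1  # ts[:10] is the YYYY-MM-DD day
    return counts

# e.g. daily_counts(open("events.log"), {"signup", "purchase"})
```

With the pre-aggregated store you would instead start counting the new type today and wait for its history to accumulate.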
Where to store the data
In a relational database (particularly MySQL)
Recording all events can be heavy for a DB, so I’m afraid MySQL can become too small for the first option; if you want to go with an RDBMS solution you may need to think bigger, like PostgreSQL, or proprietary options like Oracle or DB2.
For the aggregates, though, it would be a good choice: depending on the load generated, you can aggregate in code and insert only those aggregations into the DB.
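For example, if the table has a composite key on (day, event_type), each batch can be an upsert, so reprocessing simply accumulates. A sketch using Python’s built-in sqlite3 so it runs as-is; in MySQL the equivalent statement would be INSERT ... ON DUPLICATE KEY UPDATE:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect("aggregates.db")
conn.execute("""CREATE TABLE IF NOT EXISTS daily_counts (
                    day        TEXT    NOT NULL,
                    event_type TEXT    NOT NULL,
                    n          INTEGER NOT NULL,
                    PRIMARY KEY (day, event_type))""")

def flush(batch):
    """Write a Counter of (day, event_type) -> n, adding to existing rows."""
    conn.executemany(
        """INSERT INTO daily_counts (day, event_type, n) VALUES (?, ?, ?)
           ON CONFLICT (day, event_type) DO UPDATE SET n = n + excluded.n""",
        [(day, etype, n) for (day, etype), n in batch.items()])
    conn.commit()

flush(Counter({("2011-01-05", "signup"): 1234, ("2011-01-05", "login"): 98765}))
```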
In a non-relational (NoSQL) database
If you go for this solution, you need to decide which approach you want to follow; a nice read on Wikipedia may help you choose. I can’t help you much on that topic because I simply don’t have enough experience; I mostly use RDBMSs.
In flat log files (collected centrally over the network via syslog-ng)
I would personally discourage that option: if a file grows too much, it becomes more difficult to parse. But it still depends on the main purpose, whether it is to continuously follow up on a system or simply to check a log file now and then…
Hope it helps!
1
I think your idea to parse the logs, count, and store the results in a DB is valid. I’m not sure you’d want all those raw logs in the DB anyway (I think that’s what you said your compatriots are suggesting). You’ve already got the logs in files, correct? You could just archive those. I suppose that bit really depends on your use case(s).
Also agree with @Thorbjørn Ravn Andersen about moving your “comment answer” to the question.
Depends on your intended usage. If you have a standard graph or report showing aggregate values, then you’ll want to simply filter the events as they come in and aggregate them into the appropriate bucket. If you need to drill down into specific events, or if you think you might want to go back and re-analyze / re-categorize events later, then you should store the individual events.
If you’ve got the time and space, what I typically like to do is aggregate the data, but store the details in a (compressed) file. The details don’t have to be easily accessible, since I almost never need them, but they’re available for bulk re-processing if the classification criteria change.
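In code that pattern is small. A sketch with Python’s gzip module (the archive file name and the tab-separated line format are my assumptions):

```python
import gzip
import sys
from collections import Counter

counts = Counter()  # the cheap, queryable aggregate

# Stream events from stdin; keep every raw event in a compressed archive.
with gzip.open("events-archive.log.gz", "at") as archive:
    for line in sys.stdin:
        ts, _, etype = line.rstrip("\n").partition("\t")
        counts[(ts[:10], etype)] += 1   # bucket by (day, event_type)
        archive.write(line)             # full detail, kept for re-processing

for (day, etype), n in sorted(counts.items()):
    print(day, etype, n)
```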
3
Any architecture decision should be driven by business needs. In your case, to decide how to store the data, you should have a clearer idea of what information you want to obtain from your log system, how often you will need it, and how long you can afford to wait for the result. This is what drives the design of log collectors, event correlators, and similar applications.
Rather than giving you my opinion, I suggest you look at applications similar to what you are trying to develop. Some of them may be far more powerful than what you intend to build, but it won’t hurt to look at the architecture and storage policies they follow. On the professional side, you have SIEM applications like RSA and ArcSight, and on the open-source side you have initiatives like Kiwi or OSSIM (which also has a professional, appliance-based version).
Another thing to consider: once you start using the results the tool produces, you will very likely start receiving many requests from your management for more, and more detailed, information. So use it carefully and plan with the horizon in view. It may mean more work for you, but you will definitely get a lot of support and visibility (pressure comes in the package)…