This is a membership-based site. Each account is capable of:
Making posts (unlimited, no restriction on frequency)
Liking, sharing, and commenting on other posts
The goal is to provide the finest possible resolution for account stats. Instead of bundling
likes into 15-minute intervals (or something similar), we are storing every like as it happens.
Right now it is a single table tracking likes for all accounts. Something like…
Table: users (Accounts)
    user_id (pk)
    ... remaining data (name, login, etc)

Table: posts
    post_id (pk)
    user_id (fk)
    ... remaining data (time, content, etc)

Table: post_likes
    id (ai, pk)
    post_id (fk)
    user_id (fk)
    action
    timestamp
user_id is that of the user liking the post, and
action is either +1 or -1, for like or dislike, respectively.
Note: a dislike only undoes a like (likes for a post cannot sum to less than zero). There are valid reasons for tracking dislikes, as opposed to just deleting the original like entry.
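Spelled out as DDL it would be roughly the following (the column types, lengths, and key options are my guesses here; only the columns themselves are fixed):

```sql
-- Rough DDL sketch; types and options are assumptions, not the actual schema.
CREATE TABLE users (
    user_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
    -- ... remaining data (name, login, etc)
) ENGINE=InnoDB;

CREATE TABLE posts (
    post_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,
    -- ... remaining data (time, content, etc)
    FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;

CREATE TABLE post_likes (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    post_id    INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    action     TINYINT NOT NULL,        -- +1 = like, -1 = dislike
    timestamp  DATETIME NOT NULL,
    FOREIGN KEY (post_id) REFERENCES posts (post_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;
```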
The DB is currently MySQL, and cannot be changed at this moment. All tables are
InnoDB to facilitate row locking.
Playing a little game of Best Case Scenario, let’s assume we’ve got ourselves 10M users.
-> ‘users’ table has 10M entries
And say, after a year, the average user has made 250 posts
-> ‘posts’ table has 2.5B entries (250 * 10M)
And say that each post has an average of 15 like-operations (likes and dislikes).
-> ‘post_likes’ table has 37.5B entries (250 * 10M * 15)
When a user views their metrics, the query would be something like:
Current likes count:

    SELECT SUM(action) AS cur_likes FROM post_likes WHERE post_id = ?

Comprehensive likes history:

    SELECT action, timestamp FROM post_likes WHERE post_id = ? ORDER BY timestamp ASC
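Presumably there would also be an index to support these (an assumption on my part; the schema above doesn't list any secondary indexes). A composite index on (post_id, timestamp) would let both queries read only the rows for one post, already in timestamp order:

```sql
-- Assumed supporting index, not part of the original schema description.
CREATE INDEX idx_post_likes_post_time ON post_likes (post_id, timestamp);
```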
Billions of entries in a single table seems like a lot to me. Now, I’m no database guru, but a couple things jump out at me as issues:
What happens when the table exceeds the allotted memory?
According to the docs, for InnoDB tables: "The maximum tablespace size is 64TB."
But is that PER table? For all tables? I’ve been looking into InnoDB General Tablespaces, but I’m still unclear on the matter. I understand that such high numbers are a pipe-dream at this point, but I’d rather be safe than sorry.
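From what I can tell (and I may be misreading the docs), whether that limit applies per table or to everything at once depends on whether each table gets its own file-per-table tablespace or they all share the system tablespace; at least the current setting is easy to check:

```sql
-- If innodb_file_per_table is ON (the default in recent MySQL versions),
-- each InnoDB table is stored in its own tablespace file.
SHOW VARIABLES LIKE 'innodb_file_per_table';
```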
Surely the performance of operations on the table suffers with such a high number of entries?
Assuming the previous point is moot (it is unlikely that we will ever reach or exceed 64TB of data), there is a lot of inconsistency in what people write about table size vs. performance. Some will tell you that there is no practical bound on table size, and others say that just a few million rows is already detrimental to performance.
My specialty is not DB administration, and I cannot comfortably say either way on this matter.
Originally, I had a unique table for each post:
Table: post_(post_id)_likes
    user_id
    action
    timestamp
… and was quickly told "not to do that" – nothing else, so I cannot be sure what the suggested alternative would have been. Similar situation for a table tracking all post likes for a single user/account:
Table: posts_(user_id)_likes
    user_id
    post_id
    action
    timestamp
I will always know which account or post to retrieve metrics for, so it made sense to me to create a table specifically for that post or account. Then I know exactly where to look, and the number of records to look through would pale in comparison to the billions described above.
There is a similar post here, but the solutions mimic the gigantic tables I’m worried about.
The stuff I'm doing right now does not have to be perfect, and can always be changed at a later date, but I'd like to get it set up in a way that makes sense, while maintaining performance and keeping it easy enough to alter/fix further down the road. Even if every like were stored in a single table, we'd likely have plenty of time to accommodate growth. But if just a few million rows are enough to slow things down, then I need to look at preventative solutions before digging myself into a hole.
Thanks for making it this far!
Well, first of all, you’re making up fictional scenarios using hypothetical numbers that are wildly optimistic. User activity follows an exponential curve; the vast majority of your user activity will come from a relatively small group of core users. The rest of your users will contribute a sizable amount of traffic, but each individual contribution will be quite small. To put your hypothetical numbers in perspective, half of all registered users on Stack Overflow have a reputation of 1.
Second, you aren't gonna need it. Getting your website to market is far more important than making sure it scales to the entire world. In the unlikely event that your website reaches Facebook or Twitter standing, you'll have the money you need to fix what will unquestionably be a good problem to have.
Third, you cannot predict your performance problems in advance. Software developers are notoriously bad at this kind of prediction. The only sane way to deal with performance issues is to use sensible software design from the start, and deal with performance issues as they occur, using tools that tell you exactly where the performance problem lies.
The design you’re contemplating using MySQL is exactly the kind of design you should be contemplating. It will handle 95% of all traffic scenarios, and if you ever get to the point where you need to build a 5% website to replace the old one, you’ll have the money to do it.
To the already excellent answer of Robert Harvey, I’d like to add a couple of remarks:
- It is very bad database design to create a table for every post, or even a table per user for that user's posts! In a relational database, the structure should in principle be as fixed as possible, and not be content dependent.
- You should not worry about the size. DBMSs are designed to cope with volume. They use special data structures and indexes for that, and they have optimizers to analyze the best way to execute a query based on the table structures, the existing indexes, and the relative sizes of the tables.
- For instance, InnoDB uses B+ trees for indexes. These are similar to binary trees, but designed for paged access to a database file, so the depth of the tree grows only with the logarithm (base = index entries per page) of the row count. Finding any record among 37.5 billion therefore requires at most 11 page accesses if there are 10 index entries per file page (ceil(log10 of 37.5 billion) = 11), and 6 accesses if there are 100 entries per page.
- You need to look for an optimal structure and avoid redundant data. So it could make sense to have one large table for the posts, and another for the other events related to the posts (referring to the post's id). This needs to be analyzed, but you could further distinguish the comments from the likes and the shares, as these all enrich the common event data (user, time, related post) with different payloads (comment text, degree of liking, etc.); see the event-table sketch after this list.
- A database is stored in several tablespaces, each of which is composed of one or several files. According to the documentation that you've referenced, the size of a table is limited by the maximum file size of the operating system (which is smaller than the 64TB that you mention). If I understand correctly, InnoDB can keep all tables in a single shared tablespace, which would put an upper limit on all the tables in your schema combined. Until you reach these limits, you'll have some time to think about a partitioning scheme (to distribute data between two or more databases; see the partitioning sketch after this list), and perhaps you can then afford a more powerful RDBMS (some have limits of 64K tablespaces x 128TB per tablespace).
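Picking up the remark about separating the event types, a sketch of what that could look like (table and column names here are my own illustration, not a prescription):

```sql
-- Illustration: one table per event type, each with the common fields
-- (related post, user, time) plus its own payload.
CREATE TABLE post_comments (
    comment_id  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    post_id     INT UNSIGNED NOT NULL,
    user_id     INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    body        TEXT NOT NULL                -- the comment-specific payload
) ENGINE=InnoDB;

CREATE TABLE post_shares (
    share_id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    post_id     INT UNSIGNED NOT NULL,
    user_id     INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    target      VARCHAR(64) NULL             -- e.g. where it was shared to
) ENGINE=InnoDB;

-- post_likes stays as described in the question, with action (+1/-1) as payload.
```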
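And picking up the remark about partitioning: besides spreading data over several databases, MySQL also offers native table partitioning. A sketch of what range partitioning of the big likes table could look like (the boundaries and types are arbitrary choices of mine):

```sql
-- Illustration only. Note: partitioned InnoDB tables cannot have foreign keys,
-- and the partitioning column must be part of every unique key, hence the
-- composite primary key.
CREATE TABLE post_likes (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    post_id    INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    action     TINYINT NOT NULL,
    timestamp  DATETIME NOT NULL,
    PRIMARY KEY (id, post_id)
) ENGINE=InnoDB
PARTITION BY RANGE (post_id) (
    PARTITION p0 VALUES LESS THAN (1000000),
    PARTITION p1 VALUES LESS THAN (2000000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);
```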