We’re developing an application whose data domain (or at least 90% of it) can be modeled effectively using a relational database. We’ve been using PostgreSQL since the beginning and have had no problems whatsoever. However, now the need arises to store relations (friendships) between users, much like Facebook or Snapchat, and we begin to wonder which of the following two paths is preferable:
- Begin by storing friendships in a traditional relationship table in PostgreSQL and be done with it until scalability problems arise (namely the growth on the number of friendships and the infamous “friend of friend”-type queries).
- Start upfront with a graph database (TitanDB + Cassandra) just to be ready for when the need to scale arises, but face a slower startup on development (which includes learning about TitanDB and Cassandra).
Our target is ~75M users. We don’t really have an idea on what queries we will need to perform on this “graph”—for now, our only need is to store this information. Could PostgreSQL effectively scale to such numbers? Is it preferable to follow the graph approach upfront?
Your project’s success is going to depend much more on the features you put in front of the users you manage to attract. For now, I would suggest that you prioritize that. After all, if you don’t reach 75M users you won’t have a scalability problem anyway, so the effort would be wasted.
To phrase this a different way, scalability issues follow from great levels of adoption. Your first problem is the adoption. Work on that first. If you don’t work on things that will recruit users, your project will fail and the scalability issue will be moot.
Most web applications use a combination of technologies to scale. You can have both a relational database for storing user data and take advantage of the aggregation and fast intersect functions + a graph db for the goodies of graphs.
In the graph database you would store not only friendships but you can also store likes, streams, shares so you can see not only friend-of-friends queries but also what friends liked.
Be sure to check OrientDB and Neo4j.
PostgreSQL scales wonderfully under the right conditions:
- Digital Globe really does handle 1 billion transactions at least hourly on their database with PGGIS to boot
- Verizon uses PostgreSQL as a data store
- It has been shown that PostgreSQL’s jsonb format is faster than MonogDB, plus you get a relational format
The issue is with dynamic relationships. You really need to consider the relationships you store as that is what a graph is, nodes with edges. If you have a lookup table for everything and are approaching 6NF instead of storing large blocks of records, use Neo4J. We use PostgreSQL and Neo4j and the backups really aren’t a problem. We are figuring out a way to export Neo4J into PostgreSQL which would make backups super easy as you can easily migrate from PostgreSQL to Neo4j. User data, user and application meta data, large record blocks with only a few lookups, and other similar data with predefined easy to use relationships are stored in PostgreSQL. Neo4J is used for graphical data, data with a ton of relationships, often defined dynamically.