Currently, I’m working on a graduation project about designing a Document Management System (DMS) that incorporates some functionalities as follows:
- It won’t use folders to organize documents; instead, they will be stored in a flat structure.
- Document organization heavily relies on assigning suitable metadata.
- Searching will involve full-text search (FTS) on the document content and an advanced search based on metadata.
- Metadata is in a key:value format, and there will be some kind of relation between values based on their semantic meaning (for example:
iron man –belongs_to → marvel).
- Document recommendations will be based on the relations among metadata.
Considering these functionalities, Elasticsearch would be the first choice for FTS. Still, to model relationships among metadata, we need to use Neo4j. However, since we are currently using PostgreSQL, I would like to ask if it is best practice to use Elasticsearch, Neo4j, and PostgreSQL in the same system like this.
Note: I’m open to any other suggestions on other approaches to this problem, especially the “relation among metadata” part. Is this approach suitable for the problem, considering that my professor’s instructions about this part are somewhat vague.
I’ve done some research about this, and I’ve only seen people using Elasticsearch together with Neo4j for a knowledge graph. But considering the DMS also has other functionalities that can be added later, such as permissions, workflow, and version tracking, I think these would utilize joins from a relational database such as PostgreSQL.
best practice to use ES, Neo4j, and PostgreSQL in the same system like this?
Only incur the cost of adopting yet another technology
if you’re sure the benefits are worth it.
Here, there seems little justification for
straying from your existing postgres setup.
Searching will involve full-text search (FTS) on the document content
Sounds like a good match to postgresql’s mature support
for indexing text corpora and flexibly
relation between values based on their semantic meaning
Seeing this in a requirements document could perhaps
justify the use of neo4j in a project, to chase edges
through arbitrary and perhaps cyclic graphs.
But your example suggests a (potentially deep) tree structure,
something that a WITH RECURSIVE CTE hierarchical query
can easily tackle.
I have used all three of these technologies in production,
and frankly found ElasticSearch a bit of a nightmare
between its verbosely nested queries and an uptime record
that didn’t measure up to what the docs suggested.
Neo4j is a fine tool if you need it.
Plan on having all nodes memory-resident,
and you’ll still obtain much lower “result rows per second”
throughput than you would from any mature RDBMS.
Time box it. Try to prove postgres can or can’t do
something that appears as a bullet point in your
requirements document. Author an automated test
for it, and write down the observed performance numbers.
Your project needs to be complete prior to graduation date.
Do the simplest thing that could possibly work.
Justified complexity is usually not the target of seemingly anti-complexity guidelines. The overall goal of these guidelines is to cut out unnecessary complexity, which can be countered by properly justifying its existence.
You haven’t really spent much time justifying the need for this complexity. The general question to ask yourself here is:
Are the gains worth the cost?
In other words:
- How much more difficult is it going to be to build a system that relies on all three techs (as opposed to two, or one)?
- What do you stand to gain from doing so?
- How much do you need those gains?
You can really only answer your main question when you find answers to these questions and really consider your priorities here. Some contexts justify any squeezed drop of performance, whereas other contexts are better off with a more easily maintained codebase at the cost of minor performance improvements. I can’t answer that for you.
The alternate solution, i.e. a suggestion that uses less complexity and achieves everything you want anyway, is not something I can offer. I haven’t worked with these technologies so I’m not the right person to answer that.
There is a trade-off between:
- Ease of use.
- Ease of development/maintenance.
- Richness of the feature set.
There is a real risk that you will leak the details of your implementation into your user interface. Pick one of three types of queries and based on your choice, you get different functionality.
You can certainly put additional development effort into hiding this. However, you will quickly run into the N+1 problem: run a query against one data source which generates a set of results. Then you need to run N queries against a different data source and aggregate all the results.
Picking a single data source will reduce the feature set you can provide, but it will dramatically simplify the development effort and you may find that with a bit of clever planning you can satisfy most of the requirements that three datasources would have provided, with only a single source.
When you “justify” your three datasource decision, it is insufficient to simply say that you needed data source X for functionality Y, if one of the other datasources could have delivered most of functionality Y… Did you really need the missing functionality? Is that enough to justify the added complexity?
As an example, I have meet most of your requirements using only Solr (the predecessor of Elasticsearch) although some additional processing was needed when the metadata was fed to the indexing process. Solr can’t deliver the same free form queries as Neo4j, but I knew the queries in advance.
Good engineering is not about always using the ‘perfect’ tool for the job. It’s about recognizing when a tool is good enough to meet your requirements, and then not overthinking things.
Sure you could use Elasticsearch for search indexing, but PostgreSQL has its own FTS engine, and I strongly suspect that that will be good enough for what you’re actually doing here. Yes, it might not scale as well as Elasticsearch would, but you likely don’t need absolute maximal scalability here (and even if you do, it may still be good enough if you can throw enough resources at it). So no, unless you could prove to me that you need the scalability (or some other feature that Elasticsearch offers), I would say it’s not justified to use it here.
As far as metadata handling, you could use Neo4J for that, but a full graph database sounds like it’s beyond overkill here. What your stated requirements seem to describe is RDF, so that’s probably what you should be looking into using. The Redland RDF libraries are probably the first thing I would look into; they do add an extra set of dependencies, but they don’t need a new backend component because they can store their data directly in PostgreSQL. It may still make sense to use Neo4J if you have significant experience working with that though, because saving time on development is generally a good thing if you can do so without compromising quality (and while it’s almost certainly overkill, it should still work here), but this is something I would seriously research first (if you can do everything with RDF, and the performance is good enough, do you really want to deal with yet another backend component?).