Data filtering & requests: fetch all entries or split data?


I cannot decide which approach gives the best performance while remaining maintainable (in the sense of having clear logic).

The question is set in the context of a Django web app, but I figure it applies to any similar scenario.

In my scenario we are visiting a particular route that displays the many matches in a league or tournament associated with a particular season:

URL: season/<season_id>/

Associated Django ORM query: season.match_set.all()

A season has many divisions, and of course, matches are made up of teams. The client can filter by division and/or by team. These filters can also be included in the URL (so users can share it, already filtered), i.e. season/<season_id>/#division=<division_name>, so matches belonging to the specified division are filtered.

However, even when visiting a route including a filter, the entire query is executed: season.match_set.all().

And here is what I cannot decide. In terms of efficiency, it would be much better to fetch only the matches related to that division:

season.match_set.filter(division=division)

However, it might be pretty common for users to use the filters on the page, switch between them, and so on. With the second approach, that would obviously mean additional requests, and therefore extra database hits to retrieve the filtered matches. This would not happen with the first approach, since we have the whole data set from the beginning: just one request and one database hit (although a heavier one).
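
To make the two options concrete, here is a minimal sketch that counts database hits and rows transferred for each. It does not use Django: FakeSeasonDB, Match, MATCHES and CLICKS are hypothetical stand-ins, with all() and filter() only imitating season.match_set.all() and season.match_set.filter(division=division).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    division: str
    home: str
    away: str


class FakeSeasonDB:
    """Hypothetical stand-in for the database; counts hits and rows returned."""

    def __init__(self, matches):
        self._matches = list(matches)
        self.hits = 0
        self.rows_sent = 0

    def all(self):
        # Imitates season.match_set.all(): one hit, every row.
        self.hits += 1
        self.rows_sent += len(self._matches)
        return list(self._matches)

    def filter(self, division):
        # Imitates season.match_set.filter(division=division): one hit, a subset.
        self.hits += 1
        rows = [m for m in self._matches if m.division == division]
        self.rows_sent += len(rows)
        return rows


MATCHES = [
    Match("Division 1", "A", "B"),
    Match("Division 1", "C", "D"),
    Match("Division 2", "E", "F"),
    Match("Division 3", "G", "H"),
]

# The user looks at Division 1, then Division 2, then Division 1 again.
CLICKS = ["Division 1", "Division 2", "Division 1"]

# Approach 1: one heavy fetch up front, then filter in memory on every click.
db_all = FakeSeasonDB(MATCHES)
everything = db_all.all()
views_all = [[m for m in everything if m.division == d] for d in CLICKS]

# Approach 2: one lighter fetch per click.
db_split = FakeSeasonDB(MATCHES)
views_split = [db_split.filter(d) for d in CLICKS]

print("fetch all: ", db_all.hits, "hit(s),", db_all.rows_sent, "rows")
print("per filter:", db_split.hits, "hit(s),", db_split.rows_sent, "rows")
```

With this toy data both approaches show the same matches, but the per-filter approach makes three hits instead of one, and the repeated Division 1 click means it ends up transferring more rows in total. Real numbers depend entirely on the size of the season and on how users actually click.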

We could try to optimize the second approach by storing filtered data as it is requested. For example, if we have a season with three divisions and the user filters by Division 1 (request 1), we store that result somewhere (on the client side, I figure). If he/she then filters by Division 2 (request 2), we do the same and add it to the existing data. Finally, if the user filters by Division 1 again, we just read it from the stored data and spare ourselves request 3.
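
The store-as-requested idea can be sketched as a small per-division cache. This is only an illustration under assumptions: DivisionCache, fake_fetch and SERVER_DATA are hypothetical names, fake_fetch stands in for the real request to the server, and the sketch is in Python to match the rest of the question even though the real store would probably live in client-side JavaScript.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical server-side data, keyed by division name.
SERVER_DATA: Dict[str, List[str]] = {
    "Division 1": ["A vs B", "C vs D"],
    "Division 2": ["E vs F"],
    "Division 3": ["G vs H"],
}


def fake_fetch(division: str) -> List[str]:
    # Stand-in for the real request that returns the filtered matches.
    return list(SERVER_DATA.get(division, []))


class DivisionCache:
    """Fetches each division at most once and reuses the stored result."""

    def __init__(self, fetch: Callable[[str], List[str]]):
        self._fetch = fetch
        self._store: Dict[str, List[str]] = {}
        self.requests = 0  # number of real fetches performed

    def get(self, division: str) -> List[str]:
        if division not in self._store:
            # Cache miss: perform the request and remember the result.
            self.requests += 1
            self._store[division] = self._fetch(division)
        return self._store[division]

    def invalidate(self, division: Optional[str] = None) -> None:
        # Drop one division, or everything, so the next get() refetches.
        if division is None:
            self._store.clear()
        else:
            self._store.pop(division, None)


cache = DivisionCache(fake_fetch)
cache.get("Division 1")  # request 1
cache.get("Division 2")  # request 2
cache.get("Division 1")  # served from the store, no request 3
print(cache.requests)
```

The part that tends to get complicated is not the lookup but deciding when stored data is stale, for example after a match result changes on the server. The invalidate method marks where that decision would plug in; the sketch does not make it for you.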

However, as I mentioned before, I am concerned about keeping the logic and code clear, because this last optimization can easily get really funky and unreliable.

My question: what is the go-to approach? This is a fairly common scenario, so I figure there must be a consensus on which is most efficient: fetching all database entries with a single request, or performing multiple requests and database queries and getting data as it is requested?

No, there isn’t a consensus. You have to use your own judgement.

The reason there isn’t a consensus is that different solutions are appropriate in different situations. What you’re confronting is a question of trade-offs – multiple orthogonal requirements that are all important, but cannot all be satisfied simultaneously. Here are some of the trade-offs involved:

  • memory usage vs. speed: Caching the results of one query for reuse by subsequent queries increases the speed of subsequent queries (not the first one) at the price of using more memory. How to resolve this trade-off depends on how much you value fast responses vs. the cost of buying more RAM.

  • speed of the first query vs. speed of queries in general: fetching all results even for a filtered query takes longer than fetching a subset, but it offers the potential of speeding up subsequent queries. How to resolve this trade-off depends on how many first queries vs. subsequent queries you expect, or on how important fast replies vs. consistently-timed replies are to your users (ask a UX expert: paradoxically, people may prefer a consistently not-very-fast response to a sometimes-fast-sometimes-slow one).

  • convenience of use vs. programming complexity: computing the result of one query from the stored results of a previous query can speed up your response by eliminating expensive I/O, but it requires more complicated programming with more potential for missed deadlines and defects. Is the complexity worth its price? This is a question only project management can answer. If the project has a hard requirement “must answer within 100ms 99% of the time” then the additional logic might be indispensable. If the hardest requirement is “must be live tomorrow” it might not be.

You can see that these are all questions you can answer better than we can. We can only suggest questions to ask yourself.
