Let’s say you are running a query on your DB.

In dev & stage, where the amount of data is small, everything looks good.
Unit & integration tests are also fine and the logic is correct.

But when you get to production, where the DB is much bigger, all of a sudden it turns out the query is not efficient and is running very slowly.

So you might add an index, or change the order of the params in the query or find a different solution.

The question is:

How can you identify such a situation before getting to prod?

Meaning – just like unit tests enable us to “shift left” and catch problematic code early, how can we do the same with queries? Are there known techniques or suggested frameworks for that?

Edit:

Some more information:

  1. You cannot clone the prod DB into the dev/stage env.
  2. Load tests are less relevant here, as the problem is not due to load on the system (and thus on the DB); the problem is a specific query that takes too long even when run on its own.

You should have a test system using hardware and data sizes similar to your production database. Sometimes that is a copy of the production database with any personal/sensitive data anonymized, but it could also be completely generated data; if the data is generated, you need to be careful to ensure it follows similar patterns to the production database.
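
If you go the generated-data route, the important part is reproducing the shape of production data (skew, cardinalities), not just the row count. A minimal sketch, assuming PostgreSQL and psycopg2; the connection string, table, and columns are invented placeholders:

```python
# Sketch: seed a test database with synthetic rows whose distribution roughly
# mimics production, e.g. a small set of "hot" customers owning most orders.
import random
import psycopg2

conn = psycopg2.connect("dbname=perftest user=test")
cur = conn.cursor()

batch = []
for _ in range(1_000_000):
    # Pareto-style skew: a few customer_ids appear very often, most rarely.
    customer_id = int(random.paretovariate(1.2)) % 10_000
    status = random.choice(["new", "paid", "shipped", "cancelled"])
    batch.append((customer_id, status))
    if len(batch) == 10_000:
        cur.executemany(
            "INSERT INTO orders (customer_id, status) VALUES (%s, %s)", batch
        )
        batch.clear()
conn.commit()
```

For really large volumes, COPY or generating rows server-side with generate_series is much faster, but the idea is the same: the test data should stress the same planner assumptions that production data does.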

It may not be practical or cost-effective to have a system totally identical to production, so this becomes a tradeoff: the closer you are to a real production system, the lower the risk of bugs, but at a higher cost. Disk and memory are typically fairly cheap, so it may be feasible to use a realistic database size but with slower disks and fewer CPUs, and assume the number of queries scales linearly with better hardware. But keep in mind that this is just an assumption.

In addition to testing the performance of individual queries, you may want to test performance under load, so you may need some tool to generate a representative query load.

This can be the staging server, or it could be named the “stress-test” system, or whatever you prefer.

In my personal experience, when I have queries that are slow in production, most are SELECT or read queries. When I am building out those queries and I am suspicious of their performance (usually because they have several joins), I’ll just run them manually in production (on a read slave using a read-only account) and monitor their time and CPU usage.

If it takes more than a couple of seconds, I stop it. Or if I expect it to take a while, like a couple of minutes for large aggregation reports, I’ll monitor the CPU usage and kill it if it is likely to affect production traffic. At greater cost, you could have a read slave that doesn’t host production traffic.
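
One way to make that safer is to put a hard time cap on the session, so a runaway query cancels itself instead of needing a manual kill. A minimal sketch, assuming PostgreSQL and psycopg2; the connection details and the query are made up:

```python
# Sketch: run a suspect query against a read replica with a hard time cap.
import psycopg2

conn = psycopg2.connect("host=replica.example dbname=app user=readonly")
cur = conn.cursor()
cur.execute("SET statement_timeout = '2s'")  # PostgreSQL: cancel anything slower than 2 s

try:
    cur.execute(
        """
        SELECT o.id, o.total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        WHERE c.region = %s
        """,
        ("EU",),
    )
    print(cur.fetchmany(10))
except psycopg2.errors.QueryCanceled:
    print("query exceeded 2 s on the replica; look at its plan before shipping it")
```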

I also like PostgreSQL’s timing output and its EXPLAIN and EXPLAIN ANALYZE commands, which give you actual timing information and execution plans. I guess in theory you could write a unit test that checks the execution plans and looks for patterns, such as full table scans, that are usually the cause of long-running queries.

Another option: although you can’t use production data (I’m guessing it is 100s of GB in size, making it impractical to generate similar quantities for a unit test), you could still generate 1000s of rows and then test the queries against those. If normal, fast queries take 100 ms in production but the same queries take 5 ms on your local or test machines, then you could simply require that queries take less than 5 ms.
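
As a sketch of that kind of check, assuming a test database already seeded with a few thousand rows (by a fixture) and a budget calibrated to the test hardware; the query, connection string, and threshold here are examples only:

```python
# Sketch of a latency-budget test (pytest style).
import time
import psycopg2

QUERY_BUDGET_SECONDS = 0.005  # ~5 ms, per the prod-to-test ratio discussed above

def test_recent_orders_query_stays_in_budget():
    conn = psycopg2.connect("dbname=app_test user=test")
    cur = conn.cursor()
    start = time.perf_counter()
    cur.execute(
        "SELECT id, status FROM orders"
        " WHERE customer_id = %s ORDER BY created_at DESC LIMIT 20",
        (42,),
    )
    cur.fetchall()
    elapsed = time.perf_counter() - start
    assert elapsed < QUERY_BUDGET_SECONDS, f"query took {elapsed * 1000:.1f} ms"
```

Wall-clock assertions can be flaky on shared CI machines, so leave a generous margin or prefer a plan-based check like the one sketched further down.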

Those are things I have personally done, but the number of new DB queries my team added was just a few a week, and I had a good enough grasp of the database that I could catch potentially bad queries in code reviews before the code was deployed. If a bad query was deployed, we had a pretty solid revert process.

If this is something you do constantly and absolutely need an automated process for, then the things I would try (theoretical at this point) are static analysis tests that check queries for (see the sketch after this list):

  • where clauses on foreign keys
  • where clauses on indexes
  • execution plans for full table scans
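
The last check is the easiest to automate, and it indirectly covers the first two, since a query that actually uses an index won’t show up as a full scan. A sketch against PostgreSQL, using EXPLAIN (FORMAT JSON) with hypothetical table names and connection details:

```python
# Sketch: fail the test if the plan for a query contains a sequential scan.
import json
import psycopg2

def node_types(plan_node):
    """Recursively collect the 'Node Type' of every node in a JSON plan."""
    yield plan_node["Node Type"]
    for child in plan_node.get("Plans", []):
        yield from node_types(child)

def test_order_lookup_uses_an_index():
    conn = psycopg2.connect("dbname=app_test user=test")
    cur = conn.cursor()
    cur.execute(
        "EXPLAIN (FORMAT JSON) SELECT id FROM orders WHERE customer_id = %s", (42,)
    )
    raw = cur.fetchone()[0]
    doc = raw if isinstance(raw, list) else json.loads(raw)  # psycopg2 may pre-parse json
    plan = doc[0]["Plan"]
    assert "Seq Scan" not in set(node_types(plan)), "query plans a full table scan"
```

One caveat: on a very small test table the planner may legitimately prefer a sequential scan even when an index exists, so this check works best against a realistically sized dataset.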

If you are doing tons of UPDATEs, INSERTs, and/or DELETEs, then that is a little more complicated and I don’t have as much experience there. The little experience I do have is with situations where the processes take 20 minutes.

How can you identify such a situation before getting to prod?

I would argue that isn’t really possible to do safely, reliably, and economically. Instead, you should try to minimize the impact of query changes, quickly detect performance regressions, and build a software engineering process that allows you to fix issues and regressions quickly.

Use feature flagging to limit new features or changes to just a small sample of users, possibly even to known beta users. You could slowly increase the number of users who have access to the feature, monitor for performance issues, and automatically roll back if problems show up.
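
As a sketch of what that gating could look like (the flag name, ramp percentage, and query functions are all hypothetical):

```python
# Sketch: deterministic percentage rollout, so only a small slice of users
# exercises the new (riskier) query path while you watch the metrics.
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Bucket the user into 0-99 based on a stable hash and compare to the rollout %."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def run_new_report_query(user_id: str):
    ...  # hypothetical: the new query under suspicion

def run_old_report_query(user_id: str):
    ...  # hypothetical: the existing, known-good query

def fetch_report(user_id: str):
    # Start at, say, 5% and ramp up; drop back to 0 if monitoring flags a regression.
    if in_rollout(user_id, "new-report-query", percent=5):
        return run_new_report_query(user_id)
    return run_old_report_query(user_id)
```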

Modern telemetry systems make it trivial to implement performance monitoring and to identify which part of the code has performance issues.
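
Even without a full telemetry stack, a thin timing wrapper around query execution gets you most of the way. A minimal sketch; the names and threshold are invented:

```python
# Sketch: log (or emit a metric for) any query slower than a threshold,
# so regressions show up in monitoring shortly after a rollout.
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("db.timing")

@contextmanager
def timed_query(name: str, slow_ms: float = 100.0):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > slow_ms:
            logger.warning("slow query %s: %.1f ms", name, elapsed_ms)

# Usage (cursor and SQL are placeholders):
# with timed_query("orders.recent_for_customer"):
#     cur.execute("SELECT ...", params)
```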

And a solid automated deployment pipeline with a reliable automated regression test suite should make it possible to quickly deploy fixes to queries or schemas, which reduces the need to catch issues before they hit production. Canary releases and good monitoring should further reduce the need for up-front performance testing.

In fact, the querying behaviour of mainstream database engines is practically non-deterministic (from the perspective of a developer), so there is no test that can decisively identify or anticipate every problematic operation.

The reason the behaviour is non-deterministic is to allow the database engine to automatically adapt to the workload it is applied to from time to time.

The alternative to this automatic adaptation would not only be a lot of additional detailed up-front work which the developer would have to perform, but potentially massive rework every time a change occurred (including not just changes in what we think of as the code or configuration, but even mere changes in data volumes, which can tip the balance of certain algorithms or require different approaches to storage).

It’s also very difficult to predict, define, or emulate the exact pattern of concurrent loads that a database engine will experience in production use, and therefore extremely difficult to reproduce in testing how a database engine will react to it.

The timing of everything is often crucial, and brute-forcing an analysis and test of all possible theoretical conflicts would be computationally intensive (probably impossibly so) and still not very informative about what matters.

It’s impossible for a developer to predict exactly what the timings of everything will be in production use under every possible current and future circumstance, so even if you had a brute-force analysis of every possible problem, you wouldn’t know which problems would actually apply to the real world use in production.

In reality, the design of established database engines incorporates a lot of accumulated experience in real production use, and refinement of the technology from that experience.

The problems themselves are considered fairly essential. That is, they don’t arise from poor tooling or inexpert design choices (at least not always), but arise from the very nature of business information systems and from the circumstances they have to cope with.

The need for concurrent working on mutable shared data is really the root of all the difficulty.

I had a very similar situation a few times in my career. The worst was being told “we don’t need to import old data into the new system” until two weeks before we went live, despite my asking for nearly a year. The home page of that app went from loading in less than a second to taking a minute and forty-five seconds once we had 400k+ base records in the system.

Here’s my checklist of what I look for as I build out a project:

  • How many tables are being joined together in a query?
  • Do we have any subqueries in our SELECT statements, especially within columns?
  • When using Entity Framework, or any other ORM, are we including children and collections of children in the queries?
  • Do we have any triggers on the tables when we perform INSERTs or UPDATEs?
  1. In the example I mentioned earlier, the home page was displaying a table that gathered data from FOURTEEN different tables in the database. All of the JOINs were on primary key columns, and no inefficient columns were being used for those connections.

Their DBA suggested we start throwing a ton of Indexes at the tables. However, remember that having too many Indexes can also slow the tables down because of the complexity of the page structures (especially in SQL Server).

Applying all of their DBA’s suggested indexes took the query time from 1:45 down to 1:35, hardly the savings he expected. Instead of throwing indexes at multiple tables, you should take a look at creating a VIEW. A VIEW lets the database engine keep the execution plan in memory for far longer than the minor optimizations it performs when it is sent the same command string repeatedly. Using a VIEW for that home page to grab only the columns we wanted, with the full execution plan cached, dropped the load time back to a single (full) second.
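
Roughly, the shape of that change looks like the sketch below (generic SQL issued through a Python DB-API connection; the tables, columns, and view name are invented stand-ins for the real fourteen-table join):

```python
# Sketch: expose only the columns the home page needs through a view,
# then have the page query the view instead of an ad-hoc multi-table join.
import psycopg2

DDL = """
CREATE VIEW homepage_summary AS
SELECT o.id,
       o.created_at,
       c.name  AS customer_name,
       s.label AS status
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN statuses  s ON s.id = o.status_id
-- ...remaining joins trimmed down to just the columns the page displays
"""

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute("SELECT * FROM homepage_summary ORDER BY created_at DESC LIMIT 50")
    rows = cur.fetchall()
```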

  2. This approach would also apply if you have a complex or subqueried column in a SELECT.

  3. If you happen to be including multiple children, or collections of children, in your queries through an ORM, take a look at just how convoluted the SELECT statements it generates can be. This can cause massive duplication and explode the amount of data being sent back and forth.

In these cases, you’ll be far better off making a separate query for the children using the distinct IDs you get back from those parent objects, then manually assigning the children before returning out of your Data Access layer. It’s a bit more work to do it this way, but if you’re noticing lots of slowdown, you’ll save time overall by doing the logic yourself instead of letting the ORM handle it for you.
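
A sketch of that pattern, with invented parent/child tables and a plain DB-API connection standing in for whatever your data access layer uses:

```python
# Sketch: fetch parents first, then one extra query for all their children,
# and stitch them together in code instead of letting the ORM join everything.
from collections import defaultdict
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# 1) Just the parent rows and columns you actually need.
cur.execute("SELECT id, name FROM parents WHERE active = true LIMIT 100")
parents = [{"id": pid, "name": name, "children": []} for pid, name in cur.fetchall()]

# 2) One round trip for every child of those parents.
parent_ids = [p["id"] for p in parents]
cur.execute(
    "SELECT parent_id, id, label FROM children WHERE parent_id = ANY(%s)",
    (parent_ids,),
)
children_by_parent = defaultdict(list)
for parent_id, child_id, label in cur.fetchall():
    children_by_parent[parent_id].append({"id": child_id, "label": label})

# 3) Manual assignment before returning from the data access layer.
for p in parents:
    p["children"] = children_by_parent[p["id"]]
```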

  4. For the most part, you shouldn’t need triggers on your tables. Especially if they wind up cascading to other tables that have their own triggers, you’re going to drastically increase the time required to write data. If you have other code sending the data to your database, try to do that work yourself before saving your data. Granted, this isn’t always possible.

Finally, the last piece of advice I would give is to use NOLOCK hints on your SELECT queries, if your DB engine supports them. That way, if you still have complex logic in your queries, you won’t be blocking other reads or writes while you get what you need.
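
For reference, in SQL Server that hint is written WITH (NOLOCK) per table (it effectively reads at READ UNCOMMITTED). A sketch with invented table names and a placeholder connection, issued through pyodbc:

```python
# Sketch: SELECT with NOLOCK hints so the read doesn't block (or get blocked by)
# concurrent writes. Connection string and schema are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=reports;UID=readonly;PWD=...")  # placeholder credentials
cur = conn.cursor()
cur.execute(
    """
    SELECT o.id, o.total
    FROM orders o WITH (NOLOCK)
    JOIN customers c WITH (NOLOCK) ON c.id = o.customer_id
    WHERE c.region = ?
    """,
    "EU",
)
rows = cur.fetchall()
```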