Can caching increase p90 response time?


I’m learning about caching and I realized a potential problem I need clarification on.

Suppose the database p90 response time is too high and we need to decrease it. We are considering adding a cache.

DB: p50 (median) is 100ms, p90 is 200ms

Cache: 50ms

We cache the most common requests, but those tend to be the inexpensive ones. The average response time is now closer to 50ms, yet the queries around the p90 typically miss the cache. The p90 will now be 200ms + 50ms = 250ms: the cache lookup plus the database query.

Caching seems to be the wrong tool because it favours the common (or recent) requests at the cost of all else. Even in the right scenario, cache misses in a small cache will increase the p50 response time, especially if requests follow a uniform distribution. Is my understanding correct? Will caching be advantageous, or do we need to optimize at the database level?
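To make the arithmetic concrete, here is a minimal sketch in Python. The latency distribution and the 80% hit rate are assumptions chosen to reproduce the numbers above, not measurements from a real system.

    import random

    random.seed(0)
    CACHE_MS = 50

    def db_latency_ms():
        # Toy DB: 85% of queries are cheap (100 ms), 15% are expensive (200 ms),
        # so p50 is 100 ms and p90 is 200 ms.
        return 100 if random.random() < 0.85 else 200

    def request_latency_ms(cache_hit_rate=0.80):
        db = db_latency_ms()
        # Assume only the cheap, common queries ever end up in the cache.
        if db == 100 and random.random() < cache_hit_rate:
            return CACHE_MS              # cache hit
        return CACHE_MS + db             # miss: cache lookup + DB query

    def percentile(samples, p):
        return sorted(samples)[int(len(samples) * p)]

    before = [db_latency_ms() for _ in range(100_000)]
    after = [request_latency_ms() for _ in range(100_000)]

    print("no cache  : p50", percentile(before, 0.5), "p90", percentile(before, 0.9))
    print("with cache: p50", percentile(after, 0.5), "p90", percentile(after, 0.9))
    # Typical output: p50 drops from 100 ms to 50 ms, while p90 rises from
    # 200 ms to 250 ms, because the expensive queries now also pay for the
    # failed cache lookup.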


This is the trouble with caching: the cache is fast only if this is the second call for that data and it hits the cache. Stray anywhere outside that scenario and the cache solves nothing.

In your case, a cache doesn’t help the expensive queries, so a cache isn’t helpful. If you introduce a cache to speed up expensive queries but you don’t measure a speed boost, then caching is simply the wrong tool for the job. Performance problems require a more rigorous approach, more akin to scientific experimentation.

  1. Measure the unoptimized system first.
  2. Make one change.
  3. Measure the optimized version.
  4. Compare the two measurements. If you haven’t hit your mark, undo the change, go back to step 2, and try something else.

Repeat steps 2 through 4 until you’ve hit your mark.
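A rough sketch of that loop in Python, where run_query is a made-up stand-in for whatever call you are trying to speed up:

    import time

    def run_query():
        # Hypothetical stand-in for the real database call being optimized.
        time.sleep(0.01)

    def measure(fn, runs=200):
        """Return (p50, p90) latency in milliseconds over `runs` calls."""
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1000)
        samples.sort()
        return samples[len(samples) // 2], samples[int(len(samples) * 0.9)]

    # 1. Measure the unoptimized system.
    baseline_p50, baseline_p90 = measure(run_query)

    # 2. Make one change (add an index, add a cache, rewrite the query, ...).

    # 3. Measure again.
    new_p50, new_p90 = measure(run_query)

    # 4. Compare. If the target isn't hit, revert the change and try another.
    print(f"p90: {baseline_p90:.1f} ms -> {new_p90:.1f} ms")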

This, of course, doesn’t answer your question. You will need to identify the part of the system that is running slow, and focus your optimizations on that part. Measure frequently. There is no secret sauce here, unfortunately.

You mention optimizing at the database level. Be careful here. Profile the offending queries first. Most RDBMSs have robust query-profiling tools available. It might take a while to parse the results, but this is where you should start with relational databases. After that, look into indexes, views, materialized views and other optimizations before de-normalizing your tables. Ensure JOINs are done properly, and be suspicious of sub-queries in SQL. Sub-queries are not an automatic performance problem, but in my experience they tend to be the part of a SQL query I need to fix most often. Consider whether a JOIN is more appropriate than a sub-query.
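As a minimal illustration of “profile first, then index”, here is a sketch using Python’s built-in sqlite3 module. The table and query are invented, and a production RDBMS gives you richer profiling tools (EXPLAIN ANALYZE, slow-query logs, and so on), but the workflow is the same: look at the plan, make one change, look again.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
                 " customer_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                     [(i % 1000, i * 1.5) for i in range(50_000)])

    query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"

    # Before: the plan reports a full table scan.
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print(row)

    # One change: add an index on the filtered column.
    conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

    # After: the plan now searches the index. Re-measure to confirm the win.
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print(row)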

Other things to consider:

  • How often do these queries need to run?
    • If not often, and these queries aren’t causing downstream problems, consider not fixing this. Talk with your team, because this might not be a fight worth winning.
    • If these queries run often, does the data change often? If the data doesn’t change often, consider eagerly caching these queries by running them on application startup. I’ve heard this called “warming up the cache”: the first call from an end user is then effectively the second call, so it hits the cache (see the sketch after this list).
  • If you have some heavy-duty reporting needs, you might look at pushing information into a reporting tool or reporting database. Related buzzwords would be “data lake” or “data warehouse”, or “big data” in general. This is significant upfront effort, though; you would really need a lot of reports to justify it.
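Here is the cache-warming sketch referred to above. The query names and the fetch_report function are hypothetical placeholders for your slow, rarely-changing queries.

    cache = {}

    def fetch_report(name):
        # Hypothetical stand-in for a slow query whose data rarely changes.
        return f"result of {name}"

    EXPENSIVE_QUERIES = ["monthly_sales", "top_customers"]

    def warm_cache():
        # Run the expensive queries once at application startup, so the first
        # request from an end user is effectively the second call and hits
        # the cache.
        for name in EXPENSIVE_QUERIES:
            cache[name] = fetch_report(name)

    def get_report(name):
        if name not in cache:            # cache miss: fall back to the database
            cache[name] = fetch_report(name)
        return cache[name]

    warm_cache()   # call during application startup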

And this is only a very short list. The only general advice I can give is:

  • Do some analysis to make sure this is worth your time, and no, hitting an arbitrary benchmark doesn’t make it worthwhile on its own. How does this relate to the bottom line and the user experience? Remember that software engineering time costs money.

  • Measure and profile the system. Performance bottlenecks crop up in the most unusual places, so don’t make any assumptions.

  • If you identify a database query as the slow part, don’t jump straight to de-normalizing your tables. Relational databases have been fine-tuned for decades. Get to know the optimization tools available before you take a wrecking ball to a well-designed schema.

  • If de-normalization is the route you take, talk to your teammates. It can cause architectural problems elsewhere that create just as many headaches. This is a benefits-and-drawbacks comparison that you probably cannot do by yourself. Lean on your team here.


I think measuring the p90 (or p50) carries the implicit assumption that the database is hitting a bottleneck.

It’s only when you notice an increase in the response times from a baseline that it’s worth talking about the average response time in relation to that baseline figure.

In those cases, using a cache to lower the number of requests hitting the database can have a good effect: the database does less work and might fall back to its normal average. You will see a nice graph of your p90 response times going down again.
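A crude way to see that effect is to model the database as a single-server (M/M/1) queue, where the mean time in the system is 1 / (service rate - arrival rate). The rates and hit ratio below are invented numbers and real databases are messier, but the shape of the result holds: shedding load helps even the queries that still miss the cache.

    def mean_response_time_ms(arrivals_per_s, service_per_s):
        # Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
        assert arrivals_per_s < service_per_s, "queue would be unstable"
        return 1000.0 / (service_per_s - arrivals_per_s)

    SERVICE_RATE = 100.0   # the DB can serve ~100 queries/s
    LOAD = 90.0            # offered load: 90 queries/s

    print("no cache    :", round(mean_response_time_ms(LOAD, SERVICE_RATE), 1), "ms")

    # A cache with a 50% hit rate halves the load reaching the database,
    # so even the queries that miss are answered much faster.
    print("50% hit rate:", round(mean_response_time_ms(LOAD * 0.5, SERVICE_RATE), 1), "ms")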

If I just have a really bad query that takes 3 seconds and I optimise it down to 100ms, then I would have to re-baseline my averages.

If I have a super common query that’s uncommonly fast or slow and I remove 90% of its hits by caching, then I would also need to re-baseline.

Both things are still “advantageous”; it just means I can’t do a direct before-and-after comparison using just the average.
