How can I aggregate this large data set to reduce the overhead of calculating the same values over and over?


So, we have a dashboard page where a lot of different values are aggregated and presented. These values are calculated from roughly 500k data points and present different index values based on surveys. To calculate a certain index we need to aggregate all data points (these are divided by question number and user id) and then run a WMA (weighted moving average) for each question and user pair sorted by date, so that the date closest to the current date carries the most weight. We then put this together by transforming our survey scale (0-3) to a 10-point scale. We do this by taking all the answers and dividing by the highest possible outcome. For example, if a user has provided this set of answers to a survey: 2, 3, 3, 1, then this is transformed to our scale by doing: ((2+3+3+1)/(4*3))*10 = 7.5.

A survey answer will never change once it is submitted.

Our problem is that these values should be as close to real-time as possible, and the calculations take a lot of CPU, making this hard to scale for lots of concurrent requests. We have looked into caching these values, which will help to an extent.

I would like to aggregate these values to remove the overhead of iterating over the same 500k data points again and again (this is the bottleneck). My initial thought is to aggregate the values by week, which would reduce the amount of data to iterate over, but I’m not sure how to do this since I need to keep track of the WMA for each user and question pair to get the correct weight before I transform to our 10-point scale. Also, the user can pick any two days to aggregate their data in the dashboard (this we could limit so that they can only browse data points by week and not by day).

Does anyone have any idea how to aggregate these values without having to access all the 500k data points? Is there any smart way I can aggregate these 500k data points into larger chunks to reduce the calculations? The historic data will never change.

Update 2018-07-17

I realize I haven’t explained the WMA involved, which is a deal-breaker for this scenario. Given this example with one user who answered the same question 5 times (the first score (3) is the earliest answer in time and the fifth score (1) is the most recent answer in time):

3,3,1,1,1

This will give a WMA of 1.4. This is calculated by the following code:

public decimal Calculate(int[] data)
{
    decimal aggregate = 0;

    int count = data.Length;

    // Weight each answer by its position: the oldest answer gets weight 1,
    // the most recent answer gets the highest weight (equal to the count).
    for (int weight = 1; weight <= count; weight++)
    {
        aggregate += data[weight - 1] * weight;
    }

    // Divide by the sum of the weights: 1 + 2 + ... + count = count * (count + 1) / 2
    return decimal.Divide(aggregate, decimal.Divide(count * (1 + count), 2));
}

And then this is transformed to our scale by this formula: (1.4/3)*10 = 4.67.
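
For reference, a minimal usage sketch of the method above (the variable names and the inlined scaling step are just for illustration):

int[] answers = { 3, 3, 1, 1, 1 };   // ordered from oldest to most recent

decimal wma = Calculate(answers);    // 1.4
decimal score = (wma / 3) * 10;      // 4.67 on the 10-point scale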

This means that in order to get a true value one can’t simply aggregate by week ignoring the question and user combination needed for WMA.

As I read your answers I am coming to realize that perhaps it isn’t possible to aggregate these values in a way that makes sense, meaning that in order to get an aggregate I need to iterate over all values.

Update 2018-07-19

In response to JimmyJames’ answer below: the following scenario works fine:

User 1:

  • 2018-07-01, Score: 1
  • 2018-07-02, Score: 0

User 2:

  • 2018-07-01, Score: 3
  • 2018-07-02, Score: 2

This will yield a score of 4.43 for 2018-07-01 to 2018-07-02 using my current code for the calculation. The current code does the calculations like this:

  1. Calculate the WMA for each user and question pair: user 1 (scores 1, 0) gives a WMA of 0.33 and user 2 (scores 3, 2) gives a WMA of 2.33; summed: 0.33 + 2.33 = 2.66
  2. Convert to our scale by doing: (2.66/(3*2))*10 = 4.43

If I try to aggregate the same value set as described in the answer I get the following data:

  • 2018-07-01 Total: 4 (1+3), Count: 2
  • 2018-07-02 Total: 2 (0+2), Count: 2

Weighted total: (4*1) + (2*2) = 8

Weighted count: (2*1) + (2*2) = 6

WMA: 8/6 = 1.33

Convert to our scale: (1.33/3)*10 = 4.43

Answer by JimmyJames

Since the weights are based on date, you can structure an aggregate data set of the date, the total for that date, and the count for that date. This means that you will not need to calculate the WMA on a per-question basis; instead you can weight the total for a day, add up all the weighted day totals and then divide by the weighted count across all days.

While there are more elements to calculate over when using days versus using weeks, it’s a max of 366 versus a max of 52. This isn’t going to make much difference in practice.

The other thing I would recommend is to only calculate the most recent N days of data. When calculating a WMA, really old data will have no significant impact on the result; it will just be changing decimal places that you are likely to ignore.

You mention in the question that you need to weight before scaling but I don’t see any reason why that would need to be the case. As long as your scaling is linear, it should not matter whether you scale before you weight or vice versa.
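
For instance, take two scores, 1 and 2, with weights 1 and 2 (a tiny example just to illustrate the linearity argument):

weight first, then scale: (((1 * 1) + (2 * 2)) / (1 + 2)) * (10/3) = (5/3) * (10/3) = 50/9 ≈ 5.56
scale first, then weight: ((1 * (10/3)) * 1 + (2 * (10/3)) * 2) / (1 + 2) = (50/3) / 3 = 50/9 ≈ 5.56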

Here’s an example set of data:

    date    | total | count
----------------------------
 2018-07-15 |  435  |  99
 2018-07-16 |  684  |  123
 2018-07-17 |  324  |  51

And assume for simplicity’s sake we need a 3 day WMA:

weighted total = (435 * 1) + (684 * 2) + (324 * 3) = 2775
weighted count = (99 * 1) + (123 * 2) + (51 * 3) = 498
WMA = 2775 / 498 = 5.57
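
To make this concrete, here is a minimal C# sketch of the calculation over pre-aggregated rows; the DailyAggregate type and class name are assumptions for illustration, not something from the question:

using System;
using System.Collections.Generic;

// Hypothetical pre-aggregated row: one entry per day, summed across all users and questions
public record DailyAggregate(DateTime Date, decimal Total, int Count);

public static class AggregateWma
{
    public static decimal Calculate(IReadOnlyList<DailyAggregate> days)
    {
        // days must be ordered oldest first so the most recent day gets the largest weight
        decimal weightedTotal = 0;
        decimal weightedCount = 0;

        for (int i = 0; i < days.Count; i++)
        {
            int weight = i + 1;
            weightedTotal += days[i].Total * weight;
            weightedCount += days[i].Count * weight;
        }

        return weightedTotal / weightedCount;
    }
}

Running this over the three rows above gives 2775 / 498 ≈ 5.57; the 0-3 to 10-point scaling can then be applied once to the result.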

Another answer

As an aside, I think you’re making your formula harder to read than it needs to be.

((2+3+3+1)/(4*3))*10 

would make more sense when written as

((2+3+3+1)/4)*(10/3)
  • ((2+3+3+1)/4) expresses calculating the average
  • (10/3) expresses reframing the resulting value from a score out of 3 to a score out of 10.

This already suggests a minor improvement, as the 10/3 part can be computed once and doesn’t need to be recalculated repeatedly. I don’t mean you should hardcode 0.3333, because you’d always have rounding issues, but rather that you should ensure it is calculated only once. Taking a C# example:

public const double SCALING_FACTOR = (double)10/3;

Is there any smart way I can aggregate these 500k data points into larger chunks to reduce the calculations?

This is a very common problem when calculating averages. Anyone who has ever tried to calculate a running average will have run into this problem. To elaborate:

Let’s say we want to calculate the average of 12, 15 and 35. But we’re going to receive these values at different times, and we need to always have the latest average value on hand.

Attempt 1 – Storing a running average

This, sadly, doesn’t work. You can’t simply take the average of the previously calculated average and the new number, as that gives different results when the same values are received in a different order:

  • The average of 12 and 15 is 13.5.
    • The average of 13.5 and 35 is 24.25.
  • The average of 12 and 35 is 23.5.
    • The average of 23.5 and 15 is 19.25.

You can test with more values if you want. You’ll notice that as the total number of values increases, the running average becomes increasingly erratic.

The issue here is with the weight of the items.

  • When you average A and B, you are weighting them equally (50% A, 50% B) and storing the result in a single value (AVG).
    • When you then average AVG and C, you are again weighting them equally (50% AVG (= 25% A, 25% B), 50% C).

A correctly calculated average would need to express (33% A, 33% B, 33% C). As you can see, the running average does not calculate it correctly.

The core of the issue is that in order to define the weight of the numbers, you need to remember how many values have been used to calculate the running average.

Attempt 2 – Storing a running average and a running count

This, luckily, does work. Instead of only storing the average, we also store how many values we’ve averaged so far. We change our running average calculation to the following formula:

running_avg = (running_avg * running_total + new_value) / (running_total + 1)

Let’s apply this to the previous example:

  • First we receive 12: running_avg = (0 * 0 + 12) / (0 + 1) = 12
  • Then we receive 15: running_avg = (12 * 1 + 15) / (1 + 1) = 13.5
  • Then we receive 35: running_avg = (13.5 * 2 + 35) / (2 + 1) = 20.66667

And the same values in a different order:

  • First we receive 35: running_avg = (0 * 0 + 35) / (0 + 1) = 35
  • Then we receive 12: running_avg = (35 * 1 + 12) / (1 + 1) = 23.5
  • Then we receive 15: running_avg = (23.5 * 2 + 15) / (2 + 1) = 20.66667

And now we see that we receive the same result.

The reason this works is because running_avg * running_total is a roundabout way of getting the same value as A + B + C (i.e. the sum of all averaged numbers). Therefore, (running_avg * running_total + new_value) is essentially the same as doing A + B + C + D (where D is the new value), thus ensuring that the four values are being weighted correctly.
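
As a minimal C# sketch of this approach (the class and member names are just for illustration):

public class RunningAverage
{
    public double Average { get; private set; }  // running_avg
    public int Count { get; private set; }       // running_total

    public void Add(double newValue)
    {
        // running_avg = (running_avg * running_total + new_value) / (running_total + 1)
        Average = (Average * Count + newValue) / (Count + 1);
        Count++;
    }
}

Feeding it 12, 15 and 35 (in either order) ends up at 20.66667, as in the examples above.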


Before I can get to your solution, one more remark: the current formula relies on you adding one new number to the average at a time.

But notice what happens when you need to add two new values:

running_avg = (running_avg * running_total + new_value_A + new_value_B) / (running_total + 2)

Notice the +2 instead of the +1. Since you’re adding two numbers, the count should increase accordingly.

Taking this a step further, what happens when we want to take the average of two averages?
The solution is this:

running_avg = (average_A * running_total_A + average_B * running_total_B) / (running_total_A + running_total_B)

Long story short, the previous example (where we add one number) is the same as adding an “average of one number” (which is of course equal to the number itself). I simply omitted the * 1 (= * running_total_B) for the sake of simplicity.

But this slightly more complex formula is exactly what you need.
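
To tie this back to the earlier numbers: say average_A = 13.5 with count 2 (from 12 and 15) and average_B = 35 with count 1. Then:

running_avg = (13.5 * 2 + 35 * 1) / (2 + 1) = 62 / 3 = 20.66667

which matches the result of averaging all three values directly.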


My initial thought is to aggregate the values by week

You can calculate the end result from week averages as long as you track both each week’s average and its count.

For example:

  • Week 1
    • Average 3 (from 1+2+3+4+5)
    • Count 5 (because you added 5 values)
  • Week 2
    • Average 50 (from 60+40)
    • Count 2 (because you added 2 values)
  • Week 3
    • Average 0 (no values entered)
    • Count 0 (because you added 0 values)
  • Week 4
    • Average 14 (from 14+2+26)
    • Count 3 (because you added 3 values)

Using your old “calculate everything” approach:

  • 1+2+3+4+5+60+40+14+2+26 = 157 total
  • 157 / 10 = 15.7 average

Using the new “week average” approach:

  • Week 1 = 3 * 5 = 15
  • Week 2 = 50 * 2 = 100
  • Week 3 = 0 * 0 = 0
  • Week 4 = 14 * 3 = 42
  • Add them all up, you get 157 total
  • 157 / 10 = 15.7 average

To prove the performance argument, count how many operations your approach uses (9 additions and 1 division) and compare that to my method (4 multiplications, 3 additions and 1 division).

The more values you have in a given week, the bigger the performance gain from the second approach, since the second approach only performs one multiplication per week rather than one addition per value.


Don’t forget your scaling factor!

I omitted it for now, because you can simply apply it once at the end instead of using it in every week calculation.

For the sake of example, here is the code in C#:

var weeks = GetWeeks();

double values_sum = 0;
int values_count = 0;

double scaling_factor = (double)10/3;

foreach (var week in weeks)
{
    // week.Average * week.ValueCount recovers the sum of the raw values in that week
    values_sum += week.Average * week.ValueCount;
    values_count += week.ValueCount;
}

// values_sum is a double, so this is not integer division
double result_average = values_sum / values_count;

double scaled_result_average = result_average * scaling_factor;

Note: I omitted the code needed to add a value to the weekly (running) average, but that is a simple application of the initial formula I used. I leave it as an exercise to the reader.

Another answer

You can use the software optimization technique called memoization.

  • Use a cryptographic hash function to uniquely (but briefly) identify each question
  • Then wrap access to your computation function with something that runs the hash on the input, and looks up the result in the cache.

This is a very old, and very widely applicable technique for throwing memory at a problem to make it run faster.
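
A minimal C# sketch of that idea (the key derivation and the unbounded dictionary are simplifying assumptions; a real implementation would also want an eviction or expiry policy, and Convert.ToHexString/SHA256.HashData require .NET 5 or later):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Security.Cryptography;

public class MemoizedCalculator
{
    // Results keyed by a hash of the input; since submitted answers never change,
    // a cached result for a given input stays valid indefinitely.
    private readonly ConcurrentDictionary<string, decimal> _cache = new();

    public decimal GetOrCalculate(int[] data, Func<int[], decimal> calculate)
    {
        // Hash the raw input so identical inputs map to the same short key
        byte[] bytes = data.SelectMany(v => BitConverter.GetBytes(v)).ToArray();
        string key = Convert.ToHexString(SHA256.HashData(bytes));

        return _cache.GetOrAdd(key, _ => calculate(data));
    }
}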
