Frequency distribution by range algorithm

03/11/2022 softwareengineering

I have an array of unsorted positive real numbers. I need to create a frequency distribution by their ranges.

The simplest approach goes like this:

for num in numbers
 if (num > 0 and num < 10) a++
 elseif (num >= 10 and num < 20) b++
 ...
 else z++

return [a, b, c, ..., z]

Is there a faster or more efficient way to do this, or is this the best approach in this case?

There are a lot of approaches, but I’m going to start by saying that most of them fall into the realm of micro-optimizations.

The naive solution is O(N × M), where N is the number of input values and M is the number of ranges.

In the worst case, you could have M equal to or larger than N, but that’s unlikely. In all of the bucketing code that I’ve written, M tends to be quite small: around a half-dozen to a dozen values.

In that case, the overhead of repeated if tests is small, and the problem is essentially O(N) (where the constant is the number of tests). You should be able to process several million input values per second (you’ll have to run your own tests to see what “several” means).

If M is large, then there are several optimizations that you can make:

- As Killian Foth said, if the ranges are equal-sized, simply divide by the spacing. This reduces the problem to O(N), although with a different constant that may or may not be smaller than that of the repeated tests, depending on the low-level cost of branching versus division.
- If the range of your input values is constrained, use a counter for each input value and then sum those counters afterward. For example, if you know that each of your input values will be an integer in the range 0-999, just allocate a thousand-element array. How large an input value you can handle depends on how much RAM you have. This is also O(N), with perhaps the lowest constant.
- If you have a lot of arbitrarily sized ranges, you could turn them into a binary tree. This will give you O(N log M).
- If your input values span a very wide range, perhaps you can change the problem to use bucketing by logarithm (e.g., bucket 0 for 0-9, bucket 1 for 10-99, bucket 2 for 100-999, and so on). This will also be O(N), but the constant will be larger, because computing a logarithm is a comparatively expensive operation.
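Three of the optimizations above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the `edges` boundary list, and the half-open `[low, high)` convention are all made up for the example, and the binary tree is replaced here by a binary search over a sorted list of boundaries (the standard `bisect` module), which gives the same O(N log M) behavior.

```python
import bisect
import math

def count_by_value(values, max_value, edges):
    """Bounded integer inputs: count each value, then sum per range.

    Assumes every value is an int in 0..max_value and edges are sorted ints.
    """
    per_value = [0] * (max_value + 1)
    for v in values:
        per_value[v] += 1
    # Range i covers [edges[i], edges[i+1])
    return [sum(per_value[lo:hi]) for lo, hi in zip(edges, edges[1:])]

def count_by_bisect(numbers, edges):
    """Arbitrary ranges: binary-search the sorted boundary list.

    Bucket i holds values in [edges[i], edges[i+1]); values outside
    [edges[0], edges[-1]) are skipped.
    """
    counts = [0] * (len(edges) - 1)
    for num in numbers:
        i = bisect.bisect_right(edges, num) - 1  # O(log M) lookup
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

def count_by_log10(numbers):
    """Logarithmic buckets: key k holds values in [10**k, 10**(k+1)).

    Requires num > 0; values below 1 land in negative keys.
    """
    counts = {}
    for num in numbers:
        key = math.floor(math.log10(num))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Note that the logarithmic version cannot place 0 itself in a bucket, which is not a problem for strictly positive inputs such as the ones in the question.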

In my opinion, the best approach is to use the simple iterative-test implementation unless you have a very good reason not to. One thing that I would do, however, is to implement using a table of ranges rather than explicit tests. I believe that the latter is more prone to bugs, especially if you need to change the ranges.
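A minimal sketch of that table-driven version, in Python, might look like this. The `RANGES` table and its bounds are hypothetical and chosen only for illustration; the ranges are assumed to be half-open `[low, high)` and non-overlapping.

```python
# Hypothetical table of half-open ranges [low, high); edit the table,
# not the loop, when the ranges change.
RANGES = [(0, 10), (10, 20), (20, 50), (50, float("inf"))]

def count_by_table(numbers, ranges=RANGES):
    """Count how many numbers fall into each [low, high) range."""
    counts = [0] * len(ranges)
    for num in numbers:
        for i, (low, high) in enumerate(ranges):
            if low <= num < high:
                counts[i] += 1
                break  # ranges do not overlap, so stop at the first match
    return counts
```

Values that match no range (a negative number, for instance) are silently ignored in this sketch; whether to count them separately or raise an error is a design choice that depends on your data.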

Indeed there is. The key insight is that division abbreviates exactly this iterative approach, in the same way that multiplication abbreviates repeated addition of the same number.

Since your range starts from 0, the solution is especially easy. Just divide by 10 and look at the resulting number – you don’t even have to bother with an offset value.

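// Requires: using System; using System.Collections.Generic;
var dic = new Dictionary<int, int>();
foreach (double num in numbers)
{
    // The floor of num / 10 is the bucket index: 0 for [0, 10), 1 for [10, 20), ...
    int key = (int)Math.Floor(num / 10);
    if (dic.ContainsKey(key))
        dic[key]++;
    else
        dic.Add(key, 1);
}
return dic;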
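If the ranges did not start at 0, the offset mentioned above would simply be subtracted before dividing. A hedged Python sketch of that generalization follows; `count_equal_width`, `width`, and `offset` are illustrative names, not part of the original answer.

```python
import math
from collections import Counter

def count_equal_width(numbers, width=10, offset=0):
    """Equal-width buckets: key k covers [offset + k*width, offset + (k+1)*width)."""
    counts = Counter()
    for num in numbers:
        # Subtract the offset, then divide by the spacing and round down
        counts[math.floor((num - offset) / width)] += 1
    return counts
```

With the defaults this reproduces the divide-by-10 scheme; calling it with `offset=5`, for example, would make key 0 cover 5 up to (but not including) 15.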
