I am translating some code from MATLAB to Julia. For context, I am relatively new to Julia and relatively seasoned in MATLAB. I have been struggling with one specific operation that I perform many times in my code: I need to sum the values of a variable by a set of predefined ids. To make it concrete, think of the following operation:

│ ids   │ values │                      │ ids   │ sum_values │
┼───────┼────────┤                      ┼───────┼────────────┤
│ 2     │ 1948.6 │      converted       │ 1     │ 3995.4     │
│ 1     │ 1994.7 │          to          │ 2     │ 1948.6     │
│ 3     │ 1940.1 │       ======>        │ 3     │ 3844.4     │
│ 1     │ 2000.7 │                      │ 4     │ 1982.0     │
│ 4     │ 1982.0 │                      
│ 3     │ 1904.3 │                      

In reality, though, think of these as two large arrays, ids and values (in Julia speak, a Vector{Int64} and a Vector{Float64}). In my case each of the two arrays has about 4 million observations. Note that ids is not necessarily sorted.

In MATLAB code, they look like the following:


n_obs = 4*10^6; % number of observations
n_ids = 3*10^5; % number of unique ids

values = rand(n_obs,1);
ids = randsample(n_ids, n_obs, true);

and in Julia:

using Random, StatsBase


n_obs = 4*10^6  # number of observations
n_ids = 3*10^5  # number of unique ids

values = rand(n_obs);
ids = sample(1:n_ids, n_obs, replace=true);

So ids groups observations while values are just some values computed in an algorithm. In MATLAB, what I do to compute sum_values is simply to use:

sum_values = accumarray(ids,values);

and I have been having a hard time finding an equally fast implementation of the same operation in Julia. For context, my optimization routine calls this function about 200,000 times, each time with different ids and different values.
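To pin down the behavior I am after, here is a minimal sketch of what accumarray(ids, values) computes, written as a plain Julia loop. It assumes every id is an integer in 1:n_ids. The name sum_by_id and the argument name vals are placeholders for illustration, not functions from any package, and I am not claiming this is the fastest option.

```julia
# Hypothetical helper (not part of any package): sums `vals` by integer id.
# Assumption: every id lies in 1:n_ids. Ids that never occur keep a sum of 0.0,
# which matches what accumarray returns for missing subscripts.
function sum_by_id(ids::AbstractVector{<:Integer},
                   vals::AbstractVector{<:Real},
                   n_ids::Integer)
    length(ids) == length(vals) ||
        throw(DimensionMismatch("ids and vals must have the same length"))
    sums = zeros(Float64, n_ids)      # one accumulator slot per id
    for i in eachindex(ids, vals)
        sums[ids[i]] += vals[i]       # bounds-checked, so a bad id raises an error
    end
    return sums
end

# Small example using the numbers from the table above
demo_ids = [2, 1, 3, 1, 4, 3]
demo_vals = [1948.6, 1994.7, 1940.1, 2000.7, 1982.0, 1904.3]
demo_sums = sum_by_id(demo_ids, demo_vals, 4)
```

The result is a Vector{Float64} indexed by id, so it can go straight into other matrix operations without any sorting step.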

I stumbled upon this 7-year-old answer to a similar question that uses DataFrames, but I don’t know if this would fly in my setting, where the key is to do it fast and then use sum_values in other matrix operations. Note that these are large arrays, so putting them in a DataFrame, calling groupby, then extracting the result and converting it into a Vector{Float64} may not be a great idea computationally (happy to be proven wrong). Using dictionaries sounded like a good idea, but then it involves sorting (for some reason my dictionaries do not end up sorted by ids).
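For concreteness, the dictionary route I have in mind looks roughly like the sketch below. Julia's Dict is hash-based and makes no guarantee about iteration order, which is why the keys have to be sorted explicitly at the end. The function name sum_by_id_dict is a placeholder of mine, and this is only an illustration of the approach, not a tuned implementation.

```julia
# Hypothetical helper: group sums via a Dict, followed by an explicit sort.
# Dict iteration order is unspecified, so the keys are collected and sorted
# before the sums are read out in id order.
function sum_by_id_dict(ids::AbstractVector{<:Integer}, vals::AbstractVector{<:Real})
    sums = Dict{eltype(ids), Float64}()
    for (id, v) in zip(ids, vals)
        sums[id] = get(sums, id, 0.0) + v   # start from 0.0 for unseen ids
    end
    sorted_ids = sort!(collect(keys(sums)))
    return sorted_ids, [sums[id] for id in sorted_ids]
end

# Same small example as in the table above
group_ids, group_sums = sum_by_id_dict([2, 1, 3, 1, 4, 3],
                                       [1948.6, 1994.7, 1940.1, 2000.7, 1982.0, 1904.3])
```

Unlike accumarray, this version returns only the ids that actually occur, so ids with no observations are absent rather than zero.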

What is the most efficient way to sum values by group in Julia? Any advice on how to implement this, or something similar, would be greatly appreciated.