I am translating some code from MATLAB to Julia. For context, I am relatively new to Julia and relatively seasoned in MATLAB. I have been struggling with the following specific operation that I do multiple times in my code. I need to sum the values of a variable by some pre-fixed ids. To make it very concrete think of the following operation:

``````│ ids   │ values │                      │ ids   | sum_values |
┼───────┼────────┤                      ┼───────┼────────────┤
│ 2     │ 1948.6 │      converted       │ 1     │ 3995.4     │
│ 1     │ 1994.7 │          to          │ 2     │ 1948.6     │
│ 3     │ 1940.1 │       ======>        │ 3     │ 3844.4     │
│ 1     │ 2000.7 │                      │ 4     │ 1982.0     │
│ 4     │ 1982.0 │
│ 3     │ 1904.3 │
``````

But in reality think of these two as large arrays, `ids` and `values` (in Julia speak, they are `Int64` and `Float64`). Each of these two arrays in my case have about 4 million observations. Note that `ids` is not necessarily sorted.

In MATLAB code, they look like the following:

``````rng(9491)

n_obs = 4*10^6; # number of observations
n_ids = 3*10^5; # number of unique ids

values = rand(n_obs,1);
ids = randsample(n_ids, n_obs,'true');
``````

and in Julia:

``````using Random, StatsBase

Random.seed!(9491)

n_obs = 4*10^6  # number of observations
n_ids = 3*10^5  # number of unique ids

values = rand(n_obs);
ids = sample(1:n_ids, n_obs, replace=true);
``````

So `ids` groups observations while `values` are just some values computed in an algorithm. In MATLAB, what I do to compute `sum_values` is simply to use:

``````sum_values = accumarray(ids,values);
``````

and I have been having a hard time figuring out what is an extremely time efficient implementation of the same operation. For context, I do about 200,000 calls of the function in different ways (that is, with different `ids`, and different `values`) in my optimization routine.

I stumbled upon this 7-year old answer to a similar question that uses `DataFrames` but I don’t know if this would fly in my setting where the key is to do it fast and then use the `sum_values` in other matrix operations. Note that these are large arrays, so putting them in a DataFrame, then using `groupby` then taking the values out, converting them into a `Vector{Float64}` may actually not be a great idea computationally (happy to be proven wrong). Using dictionaries sounded like a good idea but then it involves sorting (for some reason by dictionaries end up not sorted in ids).

What is the most efficient way to sum values by group in Julia? Any advice on how to implement this, or something similar, would be greatly appreciated.