Chunking an array into letter groups that are generally equal length

Given a list of items, I want to break it up into four groups with as equal length as possible. The items need to stay grouped by the first letter in each item as well.

26 letters / 4 groups would generally cover 6.5 letters in each group. If we had an equal amount of items starting with the same letter in each group it might look like this:

[A-F] (6 letters)
[G-M] (7 letters)
[N-S] (6 letters)
[T-Z] (7 letters)

However, in practice, we may find that our original list is heavy on the items in the [N-S] group.

[A-F] (50 items)
[G-M] (40 items)
[N-S] (70 items)
[T-Z] (40 items)

We may want to push all the items that start with N into group 2 and all the items that start with S into group 4 to achieve balance:

[A-F] (50 items)
[G-N] (50 items)
[O-R] (50 items)
[S-Z] (50 items)

Does anyone have any ideas of where to point me in terms of coming up with an algorithm that can solve this sort of problem.

I will most likely be using javascript on the client to implement whatever solution will work. I’d like to use as functional of an approach as possible.

3

Create 4 lists for the results and 26 lists to hold the interim items by initial letter then apply this pseudo-code:

foreach item in masterlist
    append item to initial letter list
    increment total counter

average bucket size = total/4

iterate letter buckets in order
    if current results bucket size  + length letter bucket < avergae bucket size
        append letter bucket to results bucket
    else
        append letter bucket to results bucket if it will be over average by less than it would be under if you do nothing
        move to next results bucket

Not being given any other requirements and
assuming you want to stick just first letters and
assuming you will stick to a number of groups that is a power of 2:

  • Sort the list alphabetically
  • Cut it in as close to half as you can
  • Cut those two pieces in as close to half as you can

This solves the “read ahead problem” you’re bringing up. Each cut considers the whole list.

Nice bit of reuse in here as well.

A functional way to approach this problem is a closure. Produce a function that closes over the sorted list and takes another list that defines the groups as a parameter. This letter list would be starting and ending letters of the groups. The function produces a new letter-list that doubles the number of groups.

Your letter-lists would progress like this:

[A-Z] [A-N][O-Z] [A-F][G-N][O-R][S-Z]

In pseudo code

h = ItemRepository(sortedListOfItems).Halve;
result = h(h([A-Z]));

4

Maybe I’m missing something but I think this general approach might be what you are looking for:

  1. take all N items and sort them
  2. starting at item N/4 x
  3. does x start with the same letter as the precceeding element?
    1. yes: move to the next element x, repeat 3.
    2. no: x is the beginning of the next group

Repeat this for each group. The main issue here is that if say half the words start with A, it’s not going to give you the results you want. You need to start looking at second letter and so on.

3

Brute force is your best approach for this problem provided that your data size is relatively small. It means you will (a) get a perfect answer every time, not and approximation and (b) have more understandable, maintainable code. The only cost is a bit more processing time, but that is cheap.

You need:

1) An iterator that can generate each grouping and return the next kind of grouping

2) A function that calculate the degree of balance between the groups. This could probably just be a simple standard deviation calculation on the array sizes.

In pseudocode, this will be something like:

For each possible way of grouping {
  Create a new structure for that grouping
  Spread the items among the groups
  Get the count of items in each group
  Calculate the standard deviation of the counts
  If this is the lowest standard deviation so far {
      save it together with the grouping that created it
      }
}
Result is the last grouping saved

Trả lời

Email của bạn sẽ không được hiển thị công khai. Các trường bắt buộc được đánh dấu *