Most efficient way to find top values

Let’s say we’re given n positive integers in random order. What’s the most efficient way to find the m largest elements and what’s the complexity?

For example, given 1000 values, find the top 10.

1

The m largest elements of a sequence of length n can be found with O(n log(m)), assuming that comparing individual elements can be done in constant time.

Start with the trivial case that m = 1.

ROUTINE Maximum
INPUT
    items[1…n] : Ordered
OUTPUT
    max : Ordered
REQUIRES
    n ≥ 1
VARIABLES
    i : Integer
BEGIN
    max ← Ordered.MIN_VALUE
    FOR i ← 1 TO n DO
        IF items[i] > max THEN maxitems[i] FI
    DONE
END

It should be obvious that this algorithm has complexity O(n).

Now replace the single-valued variable max with a constant-size min-heap of m items.

ROUTINE Maxima
INPUT
    items[1…n] : Ordered
OUTPUT
    max[1…m] : Ordered
REQUIRES
    nm
VARIABLES
    i : Integer
BEGIN
    FOR i ← 1 TO m DO max[i] ← Ordered.MIN_VALUE DONE
    FOR i ← 1 TO n DO
        IF items[i] > max[1] THEN
            ;; Replace the smallest of the current m maximum values by the
            ;; new value and restore the heap property if needed.
            max[1] ← items[i]
            CALL MinHeapifyDown(max)
        FI
    DONE
END

The worst-case complexity will be reached when the inputs are sorted in ascending order. In this case, the heap will have to be modified in each iteration, that is, O(n) times. Restoring the heap property of a m-value heap after replacing the top item has complexity O(log(m)). Hence, the overall complexity is no worse than O(n log(m)).

Epilogue:

If m is small, the algorithm shown above will have very good performance and a desirable memory access pattern (small random-access working-set in max and one-time linear forward traversal of items). It also does not require random access or even multi-pass capabilities of the input sequence which means it could be used for linked lists or even online data that is never stored in memory in its entirety. However, if m is on the order of O(n) and items provides random-access, then a partitioning algorithm like Introselect as suggested (or hinted) by Jerry Coffin would be preferred as it achieves O(n) complexity. In C++, it is even available in the standard library.

1

The M largest items can actually be found with O(N) complexity.

Although it’s generally known for finding the median, the “median of medians” algorithm can be used to find the item that would land at any spot in an array of that array were sorted. More importantly (for the situation at hand) it also partitions the array into those items smaller than the one chosen, and those larger than the one chosen.

Although it’s technically something like O(N2), Hoare’s Select algorithm is usually faster, and definitely simpler. It’s very similar to Quicksort (which Hoare also invented). The basic difference is fairly simple: with Quicksort, you partition the array, then recursively sort each half. With Select, you partition the array, then recursively select only in the partition that contains the element you care about.

In the worst case, this only eliminates one element per partition step–but unless the starting order is being set up with malice aforethought, that possibility is fairly rare (especially with reasonable care in choosing your pivot element).

Although I don’t know of anybody having tested it for this application, it seems like the strategy used in the Pattern Defeating Quicksort would apply equally well to this task. I’m not sure exactly how this works out from a theoretical point of view, but from a practical viewpoint it seems likely (at least to me) that it essentially guarantees O(N) overall complexity.

3

Trả lời

Email của bạn sẽ không được hiển thị công khai. Các trường bắt buộc được đánh dấu *