Should quickselect modify the input array or not?

I have recently implemented quickselect, an algorithm for computing the k-th smallest element of an array, which, roughly speaking, works by repeatedly partitioning the array around a pivot and suitably shrinking the array.

The implementation rearranges the input array, which it recevies by reference, to avoid wasting extra memory.

Which of the two options below do you prefer (as someone who uses the algorithm)?

  1. The algorithm makes a copy of the input array and fiddles with the copy.
  2. The algorithm modifies the input array (the user has to take care to pass a copy in case order is important to her).

1

I would normally expect

  1. The algorithm makes a copy of the input array and fiddles with the copy.

However, you might have very good reasons to go the other way:

  1. The array is very large and the user will call your function in a loop.
    If you do this, be sure to call it out in the documentation as it is unusual behavior
  2. The input array is opaque to the user – i.e., the user generated it by calling one of your functions, was never promised a particular order, and may not even know exactly what’s in it.

In other words, you need to determine what’s best for your specific situation.

1, or both if possible

When the language permits the distinction between readonly and mutable formal parameters to functions (e.g. const reference in C++, pass by value array types + references), specific usages may not care that the array gets trashed.

Otherwise, I agree with the other answers that reordering data passed breaks expectations and interacts badly with parallelism.

I’d appreciate 2, since it’s more flexible: that way, I can make a copy if I care about the input data, while I can just pass the original data if I care more about efficiency.

Of course it’s your job to state very clearly that you are going to modify the input sequence.

Or, put in an another way: quick select can be thought as an in-place “partial sorting” algorithm, and interfaces to these kind of algorithms usually do change the input data: think about std::sort and std::make_heap.

In fact, C++ actually implements QuickSelect in its standard library (function std::nth_element). Guess what? It does change the input data 😉

4

Unless very clearly documented, 1 (don’t mess with the input).

Passing an array into a function and trashing my input data is extremely unexpected behaviour. Under usual circumstances I’d call it a major bug.

Even if you manage to reverse the operation, keep in mind that this breaks thread safety (as even reading is now undefined behaviour).

Edit: If you’re after optimisation options (and you are sure that it is needed!), you might add an optional parameter to the function, i.e. bool mayRearrangeArray = false. That way, the user has the option, but is – by default – on the safe side.

6

The answer typically depends on the language you are using and the kind of users you are working with.

In languages where performance is king, it is typical to make the changes in-situ. Allocating memory is expensive, and can be a problem on machines with limited memory, so C developers (who I would consider the “kings” of performance) will always do the sort in place. If you look at pretty much every implementation of a sorting algorithm, you’ll find they do it this way.

On the other extreme, MATLAB is typically far more concerned with clarity than it is with performance. Thus, sorting algorithms in MATLAB often return a newly created array with the new values in it.

If your language supports in-situ modification, I recommend going down that path for a reason akappa made in comments. Its trivial to make a sort which simply copies the array and then passes it to an in-situ sort. It is impossible to go the other way, so by making your sort in-situ, you support more options.

Trả lời

Email của bạn sẽ không được hiển thị công khai. Các trường bắt buộc được đánh dấu *