# Quicksort dilemma

Instead of writing stable partitions with the help of a buffer, the Quicksort family of algorithms is characterized by SWAPing pairs of elements in-place, hence sacrificing^{2} stability. Quicksort is tricky: wrong implementations easily lead to quadratic runtime, particularly if deterministic pivots are used, or even to endless loops, particularly if sentinels are used. Note that its inventor Tony Hoare defined Quicksort as a probabilistic algorithm using random pivots. Many historic attempts to tune Quicksort with deterministic pivots have finally been regretted because they made Quicksort vulnerable to unexpected runtime and to algorithmic complexity attacks. In the following we assume random pivots.

Compared to inefficient approaches like the Lomuto scheme, the brilliant idea of Hoare in *Quicksort1* was to write the two partitions outside-in from the left and the right; this saves a counting scan and promised to isolate pivot-ties in a third partition in the middle (1961a, 1961b). Hoare’s symmetric design obviously aimed at efficient scaling (average \(\mathcal{O}(N\log{N})\) operations) and early termination on ties (\(\mathcal{O}(N\log{D})\), where \(D\) is the number of distinct keys). However, on certain data inputs *Quicksort1* degenerated to \(\mathcal{O}(N^2)\).

Several attempts have been made to avoid quadratic runtime for arbitrary data inputs. Sedgewick (1977) compared Hoare’s algorithm with two different approaches: an asymmetric partitioning and a symmetric partitioning going back to Singleton (1969), in which the two pointers from the left and the right both stop on pivot-ties before SWAPing. This guarantees a balanced partitioning (and hence average \(\mathcal{O}(N\log{N})\)), but it SWAPs ties and it does not early-terminate on ties (hence not \(\mathcal{O}(N\log{D})\)). Sedgewick concluded that the asymmetric approach was vulnerable to quadratic runtime and recommended the symmetric version: *Quicksort2* became the standard for many years.
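The symmetric scheme can be sketched as follows; this is a minimal illustration in C with random pivots (names and details are our assumptions, not Sedgewick’s original code):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of symmetric partitioning in the style of Quicksort2:
   both scan loops stop on pivot-ties and SWAP them, which balances the
   partitions on tied data but prevents early termination on ties. */
void quicksort2(int *x, int l, int r) {
  if (l >= r) return;
  int p = l + rand() % (r - l + 1);     /* random pivot, as Hoare defined */
  int t = x[p]; x[p] = x[l]; x[l] = t;  /* move pivot value to the front */
  int v = x[l];
  int i = l - 1, j = r + 1;
  for (;;) {
    do i++; while (x[i] < v);           /* stops on keys >= v, ties included */
    do j--; while (x[j] > v);           /* stops on keys <= v, ties included */
    if (i >= j) break;
    t = x[i]; x[i] = x[j]; x[j] = t;    /* SWAPs, even if both keys are ties */
  }
  quicksort2(x, l, j);                  /* ties may end up in both halves */
  quicksort2(x, j + 1, r);
}
```

Note that on all-tied data both loops stop at every position, so the SWAPs are wasted work, but the split point stays near the middle, which is exactly the balance guarantee described above.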

In an attempt to fix quadratic runtimes of *Quicksort2* implementations with deterministic pivots, and in order to gain early termination on ties, Wegner (1985) and later Bentley and McIlroy (1993) developed *Quicksort3*^{3}, which collects pivot-ties in a third partition in the middle between the low and the high partition. *Quicksort3* achieves \(\mathcal{O}(N\log{D})\) for tied data, but due to the extra operations for identifying and placing the ties it is slower than *Quicksort2* for untied data. This trade-off between worst-case efficiency and best-case efficiency is what we term the ‘Quicksort dilemma’.
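The middle-partition idea can be illustrated with the simple Dutch-national-flag three-way partitioning, a simpler relative of the Bentley–McIlroy scheme (an illustrative sketch, not the original code):

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way partitioning in the spirit of Quicksort3: keys equal to the
   pivot are collected in a middle partition [lt, gt] that is never
   recursed into, which yields O(N log D) on tied data. */
void quicksort3(int *x, int l, int r) {
  if (l >= r) return;
  int v = x[l + rand() % (r - l + 1)];  /* random pivot */
  int lt = l, i = l, gt = r;
  while (i <= gt) {
    if (x[i] < v) {                     /* belongs below: grow low partition */
      int t = x[i]; x[i] = x[lt]; x[lt] = t; lt++; i++;
    } else if (x[i] > v) {              /* belongs above: grow high partition */
      int t = x[i]; x[i] = x[gt]; x[gt] = t; gt--;
    } else {
      i++;                              /* pivot-tie: stays in the middle */
    }
  }
  quicksort3(x, l, lt - 1);             /* ties in [lt, gt] are done */
  quicksort3(x, gt + 1, r);
}
```

The extra comparisons and moves that maintain the middle partition are precisely the overhead that makes this slower than *Quicksort2* on untied data.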

Yaroslavskiy (2009) achieved a notable improvement with dual-pivot Quicksort (*Dupisort*^{4}) that uses its extra operations to create a real third partition. This is faster and no longer a binary sort: it improves the average \(\mathcal{O}(N\log_2{N})\) algorithm to \(\mathcal{O}(N\log_3{N})\) regarding moves, but the algorithm is strongly serial and difficult to implement branchless. Then Edelkamp and Weiß (2016) published an even faster and simpler binary Block-Quicksort that reduces branch mispredictions. However, Block-Quicksort had only a rudimentary and expensive early-termination mechanism, which we skipped in *Quicksort2B*. At this point, history returned in a huge cycle^{5} to *Quicksort2*, still leaving the Quicksort dilemma unresolved.

Stepping back and analyzing the commonalities of all those attempts, it seems that all authors assume that partitioning must be symmetric, and that for early termination on ties a third partition is necessary. However, the *greeNsort ^{®}* analysis of the problem finds that

- for early termination not three, not two, but one partition is sufficient: the algorithm just needs to be able to diagnose an all-tied partition and stop recursing deeper into this branch
- in order to get pure partitions, where ties do not end up in multiple partitions, the partitioning must be Mutually Exclusive and Collectively Exhaustive (MECE); this requires a clear (asymmetric) decision to which side pivot-ties go
- it is possible to modify an asymmetric partitioning for detection of all-tied data in a provably optimal way (DIET-method)
- it is possible to make asymmetric partitioning reliable against any adverse input pattern (FLIP-method)

## DIET-method

Distinct Identification for Early Termination (DIET) works as follows: in the main loop of asymmetric binary partitioning, when searching for a pair of elements that require SWAPing, one pointer stops at pivot-ties and the other does not stop on pivot-ties. Assume the algorithm begins to search with the pointer that stops on ties. DIET adds a pre-loop that searches from the same side for a non-pivot-tied key. If this pre-loop reaches the other side without finding a non-tie, all data are tied and the algorithm can exit recursion. If the pre-loop finds a non-tie, almost no work is lost and the work can be reused: only the last (non-tie) element needs to be compared a second time against the pivot to find out to which side it belongs, i.e. we just set the pointer back one element and enter the main loop. That’s it. It provably costs exactly one extra comparison (and one extra pointer increment and decrement) per partitioning, hence less than \(N\) extra comparisons in total. That is the unavoidable price of tuning for early termination on ties. If we use this DIET-partitioning in the usual self-recursive manner, we obtain *Zocksort*, which is still vulnerable to certain (asymmetric) inputs. Now we have reached a point similar to Sedgewick’s, but we do not try to fix these beautifully efficient loops (or give up). Instead we FLIP.
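A minimal sketch in C of DIET on top of an asymmetric partitioning that sends pivot-ties to the left partition; the function name, the pre-loop side and the loop details are illustrative assumptions, not the *greeNsort ^{®}* implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Zocksort sketch: asymmetric partitioning (ties go left) with a DIET
   pre-loop. The pre-loop scans tied keys that are already on their
   correct side; only the first non-tie is compared against the pivot a
   second time inside the main loop. */
void zocksort(int *x, int l, int r) {
  if (l >= r) return;
  int v = x[l + rand() % (r - l + 1)];   /* random pivot */

  int i = l;                             /* DIET pre-loop */
  while (i <= r && x[i] == v) i++;
  if (i > r) return;                     /* all keys tied: exit recursion */
  /* x[i] is the first non-tie; the main loop compares it once more */

  int j = r;
  for (;;) {
    while (i < j && x[i] <= v) i++;      /* ties do not stop the left scan */
    while (i < j && x[j] >  v) j--;      /* right scan stops at keys <= v */
    if (i >= j) break;
    { int t = x[i]; x[i] = x[j]; x[j] = t; }
  }
  int m = (x[i] <= v) ? i : i - 1;       /* boundary: x[l..m] <= v < x[m+1..r] */
  zocksort(x, l, m);                     /* this branch contains the pivot-ties */
  zocksort(x, m + 1, r);
}
```

Because ties go left, the left recursion branch eventually becomes all-tied and the pre-loop terminates it without further partitioning.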

## FLIP-method

Fast Loops In Partitioning (FLIP) means embracing the fast loops of the asymmetric partitioning and creating symmetry at the recursion level by left-right mirroring. An input pattern that fools an asymmetric partitioning must itself be asymmetric. For example, a *Zocksort* that partitions pivot-ties to the lower partition is fooled by data with two distinct keys, many high and one low: random pivots are mostly high, hence all data goes to the left partition and the algorithm does not progress. Note that this data input cannot fool a *Zocksort* that partitions pivot-ties to the higher partition. By recursively alternating between an asymmetric partitioning and its left-right-mirrored twin, no asymmetric pattern can fool the resulting algorithm. Zig-zagging between the two twins is called *Zacksort* (‘Zack’ is the German word for ‘zag’ and carries a connotation of ‘quick’). An elaboration is the *Zucksort* algorithm, which flips the partitioning asymmetry only in the recursive branch that can contain the pivot-ties; in the other branch the asymmetry does not matter (‘Zuck’ is the German word for ‘twitch’, hence also carries a connotation of ‘quick’).
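A sketch of the FLIP idea, pairing a ties-left partitioning with its left-right-mirrored, ties-right twin and alternating between them on recursion; again an illustrative assumption in C, not the actual *Zacksort* code:

```c
#include <assert.h>
#include <stdlib.h>

void zack_right(int *x, int l, int r);   /* forward declaration of the twin */

/* Twin A: ties go to the left partition, DIET pre-loop from the left. */
void zack_left(int *x, int l, int r) {
  if (l >= r) return;
  int v = x[l + rand() % (r - l + 1)];
  int i = l;
  while (i <= r && x[i] == v) i++;       /* DIET pre-loop */
  if (i > r) return;                     /* all tied */
  int j = r;
  for (;;) {
    while (i < j && x[i] <= v) i++;
    while (i < j && x[j] >  v) j--;
    if (i >= j) break;
    { int t = x[i]; x[i] = x[j]; x[j] = t; }
  }
  int m = (x[i] <= v) ? i : i - 1;       /* x[l..m] <= v < x[m+1..r] */
  zack_right(x, l, m);                   /* FLIP the asymmetry on recursion */
  zack_right(x, m + 1, r);
}

/* Twin B: the left-right mirror, ties go to the right partition. */
void zack_right(int *x, int l, int r) {
  if (l >= r) return;
  int v = x[l + rand() % (r - l + 1)];
  int j = r;
  while (j >= l && x[j] == v) j--;       /* mirrored DIET pre-loop */
  if (j < l) return;                     /* all tied */
  int i = l;
  for (;;) {
    while (i < j && x[i] <  v) i++;
    while (i < j && x[j] >= v) j--;
    if (i >= j) break;
    { int t = x[i]; x[i] = x[j]; x[j] = t; }
  }
  int m = (x[i] >= v) ? i : i + 1;       /* x[l..m-1] < v <= x[m..r] */
  zack_left(x, l, m - 1);                /* FLIP back */
  zack_left(x, m, r);
}
```

On the adverse pattern described above (many high keys, one low key), whichever twin gets stuck on one side hands the range to its mirror on the next recursion, which makes progress.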

## POET-method

Pre-Order Early Termination (POET) is a tuning that replaces the DIET pre-loop in order to early-terminate on presorted data: instead of looping while keys equal the pivot, the algorithm loops while the data does not violate the desired sorting order. If the loop reaches the other side, the algorithm can early-terminate; if it detects an out-of-order element, it resets the pointer to the starting point and enters the main loop. Note that early termination on ties is a special case of POET. Note also that POET, unlike DIET, is not provably optimal for early termination on ties, because the work of the pre-loop cannot be reused. The resulting algorithm is called *Ducksort*.
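A sketch of POET, here grafted for brevity onto the un-mirrored ties-left partitioning from above (the real *Ducksort* combines POET with the FLIP recursion; names and details are illustrative assumptions):

```c
#include <assert.h>
#include <stdlib.h>

/* POET sketch: the pre-loop scans while the range is non-decreasing.
   A fully presorted range (all-tied ranges included) terminates early;
   otherwise the pointer is reset to the start and the main loop runs. */
void ducksort(int *x, int l, int r) {
  if (l >= r) return;
  int v = x[l + rand() % (r - l + 1)];

  int i = l;                             /* POET pre-loop */
  while (i < r && x[i] <= x[i + 1]) i++;
  if (i == r) return;                    /* already sorted: early termination */
  i = l;                                 /* unlike DIET, pre-loop work is lost */

  int j = r;
  for (;;) {
    while (i < j && x[i] <= v) i++;
    while (i < j && x[j] >  v) j--;
    if (i >= j) break;
    { int t = x[i]; x[i] = x[j]; x[j] = t; }
  }
  int m = (x[i] <= v) ? i : i - 1;
  ducksort(x, l, m);
  ducksort(x, m + 1, r);
}
```

The reset of `i` is exactly why POET is not provably optimal for ties: an out-of-order element invalidates everything the pre-loop learned.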

## Partial sorting

Several variants of partial sorting are easily derived from *Quicksort*. A quite generic version (*Quickpart*) takes two parameters, position `l` and position `r` in sorted order, and guarantees sorted order between `l` and `r`. Other definitions of partial sorting are special cases thereof. For example, the C++ `std::partial_sort` guarantees the smallest elements up to position `middle` to be sorted; this is achieved by setting `l=min` position and `r=middle`. Specifying `l==r` gives the popular *Quickselect* published by Hoare as *FIND* (1961a, 1961c). Note that *Quickselect* still does partial sorting work in order to guarantee that the `l==r`-th element is in the `l==r`-th position, that all elements smaller than the selected element are left of `l`, and that all elements greater than the selected element are right of `r`. Note further that *Quickselect* does not define which element it selects in case of ties. But the well-defined tie-handling of *Zackselect*, *Zuckselect* or *Duckselect* returns the leftmost and rightmost positions of elements tied with the element at the desired position; for some tricky details see the code.
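The window semantics can be sketched as follows, assuming a hypothetical signature `quickpart(x, l, r, pl, pr)` where `l..r` is the array range and `pl..pr` the requested window (parameter names are our assumption): branches lying entirely outside the window are simply not recursed.

```c
#include <assert.h>
#include <stdlib.h>

/* Quickpart sketch: standard partitioning, but recursion descends only
   into branches overlapping the window [pl, pr]. Elements left of the
   window end up <= the window, elements right of it end up >= it. */
void quickpart(int *x, int l, int r, int pl, int pr) {
  if (l >= r) return;
  int p = l + rand() % (r - l + 1);
  int t = x[p]; x[p] = x[l]; x[l] = t;   /* move random pivot to the front */
  int v = x[l];
  int i = l - 1, j = r + 1;
  for (;;) {                             /* symmetric Hoare partitioning */
    do i++; while (x[i] < v);
    do j--; while (x[j] > v);
    if (i >= j) break;
    t = x[i]; x[i] = x[j]; x[j] = t;
  }
  if (pl <= j)     quickpart(x, l, j, pl, pr);      /* overlaps window */
  if (pr >= j + 1) quickpart(x, j + 1, r, pl, pr);  /* overlaps window */
}
```

Setting `pl=l` and `pr=r` degenerates to a full *Quicksort*; setting `pl==pr` gives *Quickselect*-like behavior.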

## Branch-prediction

Like Block-Quicksort from Edelkamp and Weiß (2016), these *greeNsort ^{®}* algorithms such as *Ducksort*, *Duckpart* and *Duckselect* can be tuned to behave branchless, hence faster (named *DucksortB*, *DuckpartB*, *DuckselectB*). Note that *Pdqsort* of Orson Peters (2014, 2015; O. R. L. Peters 2021b, 2021a) is a related asymmetric algorithm that, when necessary, involves a mirrored asymmetric partitioning. It does not achieve high-level symmetry and is not provably optimal, for example it uses some heuristic shuffling. However, Peters’ excellent C++ implementation is branchless, with tuning for ties and for presorted data and little overhead, and it has been formally proven correct [Lammich:2020]; hence for all practical purposes I can highly recommend it as mostly faster than the *greeNsort ^{®}* research implementations. However, once combined with an expensive comparison function such as localized string comparison (`strcoll`), *Pdqsort* becomes slower than *DucksortB*. Note also that *Pdqsort* is implemented with deterministic pivots and a fallback to the potentially slower *Heapsort*, and that the *greeNsort ^{®}* algorithms can be implemented like this as well, without any need for heuristic shuffling. Finally note that the simple branch-parallel *PDucksortB* outperforms *Pdqsort*, which has a serial dependence as the author states himself.^{6}
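The core branchless trick of Block-Quicksort can be illustrated in isolation: the comparison result advances a counter arithmetically instead of steering a branch. A simplified sketch; the hypothetical helper below only collects, for one block, the offsets of keys that belong to the other side:

```c
#include <assert.h>

/* Branchless offset collection in the style of Block-Quicksort:
   each candidate offset is written unconditionally, and the fill
   counter n advances by the 0/1 result of the comparison, so the
   loop body contains no data-dependent branch to mispredict. */
int collect_offsets_gt(const int *x, int block, int v, unsigned char *offs) {
  int n = 0;
  for (int k = 0; k < block; k++) {
    offs[n] = (unsigned char)k;   /* written unconditionally */
    n += (x[k] > v);              /* branch-free conditional advance */
  }
  return n;                       /* number of keys > v in the block */
}
```

A full block partitioner would collect such offsets from both ends and then SWAP the recorded pairs, but the misprediction savings come entirely from this counting pattern.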

## Quicksort conclusion

Embracing low-level asymmetry and addressing high-level bi-symmetry elegantly solves the decades-old Quicksort dilemma in an easily provable way, and suggests applying symmetric recursion to other algorithms as well in order to overcome the limitations of *Quicksorts*. While *Quicksorts* are memory-parsimonious and have good cache-locality, they lack stability, deterministic worst-case guarantees and the ability to sort elements of varying size, and they are difficult to parallelize. Hence let’s turn to *Mergesorts*.

2. Note that SWAPing equally-sized elements directly sacrifices generality; indirect sorting by SWAPing pointers to variable-sized elements causes random access and sacrifices robustness.

3. We report a simpler version that is slightly faster in the worst case of untied keys and slightly slower in the best case of tied keys, see also von Leitner (2007).

4. Our slightly better *Tricksort* removes a dead branch from Yaroslavskiy’s code.

5. Indeed a cycle: the optimization published by Edelkamp & Weiß (2016) was already suggested in a little-known paper of Hoare (1962), in which he gives context and explanations for his 1961 algorithm.

6. “I don’t have plans currently. I would have to do some research on modern standard C++ parallel programming, and there are some tricky things in PDQsort if you want to parallelize. In particular it is assumed the left partition is recursed on first.” https://github.com/orlp/pdqsort/issues/16#issuecomment-823145493