# Further algorithms

## Low-memory merging

In-place merging algorithms have enjoyed huge academic attention. Several in-place Mergesorts have been known for decades but were too slow in practice. In 2008 (or earlier) Sedgewick coined the term “holy sorting grail” for a practically usable stable \(\mathcal{O}(N\log{N})\) in-place Mergesort. In 2013 Astrelin published *Grailsort*, an implementation of Huang and Langston (1992) that was not dramatically slower than other *Mergesorts* and hence deserved its name. Since 2014, Sedgewick has redefined the “holy sorting grail” to additionally require an \(\mathcal{O}(N)\) best case. Anyhow, in-place merging is overrated: Astrelin also published *Sqrtsort*, which is faster than *Grailsort* and needs only a practically negligible \(\mathcal{O}(\sqrt{N})\) buffer. The *greeNsort ^{®}* algorithm *Walksort* also requires an \(\mathcal{O}(\sqrt{N})\) buffer and is faster; its equally fast variant *Jumpsort* uses relocation-moves for distance minimization (like *Crocosort*).

Regarding speed and energy for sorting random data, these \(\mathcal{O}(\sqrt{N})\)-buffer algorithms are inferior to *Frogsort* and *Squidsort*. Regarding Footprint, the ranking for sorting random doubles is *Frogsort2* < *Squidsort2* < *Frogsort3* < *Walksort* < *Jumpsort* < *Frogsort1* < *Squidsort1* < *Sqrtsort* < *Grailsort* < *Knuthsort*. For pre-sorted data the ranking of *Walksort* and *Jumpsort* improves: they are quite adaptive.
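
To make the role of the merge buffer concrete, here is a minimal sketch of a textbook buffer-assisted merge that copies only the left run into auxiliary memory. The \(\mathcal{O}(\sqrt{N})\)-buffer algorithms above get along with a far smaller buffer, but the basic pattern of merging back out of a buffer is the same; function and variable names are illustrative, not greeNsort code.

```c
#include <stddef.h>
#include <string.h>

/* Merge two adjacent sorted runs a[l..m-1] and a[m..r-1],
   using a buffer that holds only the left run (m - l elements).
   Generic textbook technique, not the greeNsort implementation. */
static void merge_with_buffer(double *a, double *buf, size_t l, size_t m, size_t r)
{
  size_t nl = m - l;
  memcpy(buf, a + l, nl * sizeof(double));
  size_t i = 0, j = m, k = l;
  while (i < nl && j < r)
    a[k++] = (buf[i] <= a[j]) ? buf[i++] : a[j++];  /* <= keeps the merge stable */
  while (i < nl)
    a[k++] = buf[i++];
  /* any remaining right-run elements are already in their final place */
}
```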

## Partition&Pool

Some of the *greeNsort ^{®}* learnings can be transferred from *Split&Merge* to *Partition&Pool* algorithms. It is possible to turn *Zacksort* into the stable *Kiwisort* algorithm using 100% distant buffer. By partitioning not only the data but also 100% local buffer we get the distance-reducing *Swansort*. For a very special use case it is even possible to reduce to 50% buffer, see *Storksort*.
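
To illustrate how a buffer yields stable partitioning, here is a minimal sketch of a stable partition with a full-size (100%) buffer: elements smaller than the pivot are compacted to the front in original order, the remaining elements are parked in the buffer in original order and copied back. The pivot handling, buffer placement and distance optimizations of *Kiwisort*, *Swansort* and *Storksort* differ; the code is illustrative only.

```c
#include <stddef.h>
#include <string.h>

/* Stable partition of a[0..n-1] around a pivot value using an
   n-element buffer: elements < pivot stay in a (in original order),
   elements >= pivot are parked in buf (in original order) and copied
   back behind them. Illustrative sketch, not greeNsort code. */
static size_t stable_partition_buffered(double *a, double *buf, size_t n, double pivot)
{
  size_t nl = 0, nr = 0;
  for (size_t i = 0; i < n; ++i) {
    if (a[i] < pivot)
      a[nl++] = a[i];   /* nl <= i, so no unread element is overwritten */
    else
      buf[nr++] = a[i];
  }
  memcpy(a + nl, buf, nr * sizeof(double));
  return nl;  /* size of the left partition */
}
```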

## Size-varying

A popular method for sorting elements of varying size - such as strings, bigints or arbitrary objects with a key field - is indirect sorting of pointers to the elements using *Quicksort*, or better one of *Zacksort*, *Zucksort* or *Ducksort*. However, this is inefficient because it incurs heavy random access (see *UZacksort*), and it becomes even more inefficient if the sort is stabilized by breaking ties on pointer addresses (see *WQuicksort2*). An alternative is directly sorting the size-varying elements using a buffer. Splitting can be done either via helper-pointers to the elements or directly at null-terminators (for the latter see *VKnuthsort* and *VFrogsort1*). These direct sorts are more efficient than the indirect Quicksorts. Only if there are many string duplicates does the unstable *UZacksort* benefit from its \(\mathcal{O}(N\log{D})\) early termination. Hence it seems promising to implement direct stable Partition&Pool algorithms for size-varying data (*VKiwisort* and *VSwansort*, not yet implemented).
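
For contrast, here is a minimal sketch of the popular indirect approach using the generic C library `qsort` (not one of the *greeNsort ^{®}* Quicksorts): every comparison dereferences two pointers into scattered string memory, which is exactly the random access that direct sorts such as *VKnuthsort* and *VFrogsort1* avoid.

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for indirect string sorting: each call follows two
   pointers to the actual strings, causing scattered memory accesses. */
static int cmp_str(const void *x, const void *y)
{
  return strcmp(*(const char * const *)x, *(const char * const *)y);
}

/* Popular indirect approach: sort an array of pointers to C strings.
   Generic libc qsort is used here purely for illustration. */
void sort_string_pointers(const char **ptr, size_t n)
{
  qsort(ptr, n, sizeof(const char *), cmp_str);
}
```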

## Parallel

Doing a task in parallel can save energy; Intel calls this “race to idle”. All Divide&Conquer algorithms are easily implemented with parallel branches (see *PQuicksort2*, *PQuicksort2B*, *PDucksort*, *PDucksortB*). However, the partitioning of the *Quicksort* family is difficult to implement in parallel, and this involves trade-offs. By contrast, parallelizing binary merges is relatively straightforward. *greeNsort ^{®}* uses a method that allows parallelization over an arbitrary number of processes, not only powers of two; see *PKnuthsort*, *PFrogsort0*, *PFrogsort1*, *PFrogsort2*, *PFrogsort3*, *PVKnuthsort* and *PVFrogsort1*. As expected, parallel execution reduces not only runTime but also Energy (although to a lesser extent).
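
To illustrate parallel branches in a Divide&Conquer sort, here is a minimal sketch of a top-down Mergesort whose two recursive halves run as OpenMP tasks above a size cutoff; the scheduling, buffer layout and merge code of the parallel greeNsort implementations differ, and all names here are illustrative.

```c
#include <stddef.h>
#include <string.h>
#include <omp.h>

/* Merge adjacent sorted runs a[l..m-1] and a[m..r-1] via buffer buf. */
static void merge(double *a, double *buf, size_t l, size_t m, size_t r)
{
  size_t i = l, j = m, k = l;
  while (i < m && j < r) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
  while (i < m) buf[k++] = a[i++];
  while (j < r) buf[k++] = a[j++];
  memcpy(a + l, buf + l, (r - l) * sizeof(double));
}

/* Sort a[l..r-1]; ranges larger than 'cutoff' spawn their halves as tasks. */
static void msort(double *a, double *buf, size_t l, size_t r, size_t cutoff)
{
  if (r - l < 2) return;
  size_t m = l + (r - l) / 2;
  if (r - l > cutoff) {
    #pragma omp task shared(a, buf) firstprivate(l, m, cutoff)
    msort(a, buf, l, m, cutoff);
    #pragma omp task shared(a, buf) firstprivate(m, r, cutoff)
    msort(a, buf, m, r, cutoff);
    #pragma omp taskwait
  } else {
    msort(a, buf, l, m, cutoff);
    msort(a, buf, m, r, cutoff);
  }
  merge(a, buf, l, m, r);
}

/* Entry point: one thread starts the recursion, the team executes tasks. */
void parallel_mergesort(double *a, double *buf, size_t n, size_t cutoff)
{
  #pragma omp parallel
  #pragma omp single
  msort(a, buf, 0, n, cutoff);
}
```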

Note that the parallel speed-up of the *Frogsorts* scales almost as linearly with the number of cores as the speed-up of *Knuthsort*; hence the Footprints of the *PFrogsorts* are clearly lower than that of *PKnuthsort*.

Note that the benefit of *Frogsort2* over *Frogsort1* somewhat diminishes when more parallel cores are used.

Note that the linear setup phase of *Frogsort0* parallelizes somewhat better than the recursive setup of *Frogsort1,2,3,6*. Hence it is promising to implement variants of *Frogsort2,3,6* that leverage a linear setup with chunks of a predefined mixture of data and buffer elements.