# Mergesort trilemma

In the Quicksort-dilemma there were trade-offs between efficiency and adaptivity. In merging there are trade-offs between efficiency, adaptivity and buffer memory: one cannot have all three of full \(\mathcal{O}(N\log{N})\) efficiency, full \(\mathcal{O}(N)\) adaptivity and low memory.

## Forgotten art

Let’s rehearse how a generic, efficient, potentially parallel *Mergesort* is organized. Naive implementations allocate a 100% buffer in the merge, copy the data to the buffer (100% moves), merge the two input streams back (100% comparisons and 100% moves) and deallocate the buffer. Obviously it is better to allocate the buffer once before the recursion and deallocate it once done, but what about the 200% moves?
*Timsort* (T. Peters (2002)) allocates not more than 50% buffer, copies not more than 50% out and merges 100% back, that is 150% moves.
Sedgewick (1990) taught a *Nocopy-Mergesort* that only merges (100% moves) without copying out, by alternating between two memory regions of size 100% each.
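Timsort’s merge pattern is easy to see in isolation. The following is a hedged, simplified Python sketch of a merge with only 50% buffer, assuming the left run is the one copied out; it is not Tim Peters’ actual galloping merge, and the function name is illustrative:

```python
def half_buffer_merge(a, lo, mid, hi):
    """Merge the sorted runs a[lo:mid] and a[mid:hi] in place, using a
    buffer of only the left run's size: copy the left run out (<=50% of
    the region's moves), then merge back (100% moves), i.e. 150% total."""
    buf = a[lo:mid]                     # copy out
    i, j, k = 0, mid, lo
    while i < len(buf) and j < hi:      # merge back
        if buf[i] <= a[j]:              # ties from the left run: stable
            a[k] = buf[i]; i += 1
        else:
            a[k] = a[j]; j += 1
        k += 1
    a[k:k + len(buf) - i] = buf[i:]     # right leftovers are already in place
```

Note that when the left run exhausts first, the remaining right elements are already in their final positions, which is what makes skipping work for presorted inputs cheap.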

Sedgewick teaches *Nocopy-Mergesort* with three loop-checks: two on the two input streams and one on the output loop; obviously two loop-checks on the two input streams are enough. Sedgewick explains how to get down to one loop-check by using sentinels in a mutually-recursive *Bitonic-Mergesort*, which sorts the left half ascending and the right half descending (from left); this, however, is not stable. Using the symmetric sorting definition we can fix this by sorting the right half not descending from left but ascending from right, which makes the algorithm nicely symmetric (*Bimesort*). Still, a disadvantage remains: this sentinel approach must move data in each merge, hence cannot skip all merging for presorted data. Interestingly, the sentinel approach is not even necessary to get down to one loop-check: D. Knuth (1973) had taught 20 years earlier that one loop-check is sufficient if only that input sequence is checked for exhaustion from which the last element was merged. Combining this with *Nocopy-Mergesort* gives an efficient algorithm we name *Knuthsort* in honor of Donald Knuth. Note that Katajainen and Träff (1997) showed that the number of loop-checks can be further reduced to ½ by determining which of the two input sequences exhausts first and only checking that one. Be warned that Katajainen’s code reads beyond the last element; we fixed this and named the result *Katasort*. Note further that Katajainen’s optimization costs an extra comparison, hence we consider it tuned and prefer *Knuthsort* as the reference for efficient prior-art binary *Mergesort* in contiguous space.
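Knuth’s one-check trick rests on a simple observation: after an element is merged, only the run it came from can have run dry, so only that run needs testing. A freestanding Python sketch (merging into a separate output list for clarity, unlike the in-place *Knuthsort* merge; the function name is illustrative):

```python
def knuth_merge(left, right):
    """Stable two-way merge with a single exhaustion check per step:
    only the run that supplied the last element is tested."""
    if not left:
        return list(right)
    if not right:
        return list(left)
    out = []
    i = j = 0
    while True:
        if left[i] <= right[j]:          # ties from the left run: stable
            out.append(left[i]); i += 1
            if i == len(left):           # only the source run is checked
                out.extend(right[j:])
                return out
        else:
            out.append(right[j]); j += 1
            if j == len(right):          # only the source run is checked
                out.extend(left[i:])
                return out
```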

For random data *Knuthsort* (and *Katasort*) are much more efficient than *Timsort*. Conventional wisdom assumes that Timsort’s inefficient merge is needed to reduce the buffer to 50% and to achieve \(\mathcal{O}(N)\) adaptivity to presorted data: after the merge, *Timsort* has the data in the same memory region as before the merge, hence it is easy to skip copying-out and merging-back in case of presorted keys. However, conventional wisdom is wrong: we can have one of the two without compromising on efficiency, either buffer-reduction or full adaptivity. Let’s start with the latter.

## Full adaptivity

The alternating merge of *Nocopy-Mergesort* finally completes in the data memory, not in the buffer. However, for odd recursion depths this starts merging from the buffer memory, hence Sedgewick copies the data to the buffer memory before the recursion. Our *Omitsort* no longer uses a pre-determined alternating merge; instead we let the merge function decide whether it merges (or whether it can omit the merge in case of presorted data): the merge function simply returns the location of the data, and the sort function checks whether after its two merges the data are in the same memory region; if not, it copies one to the data region. For fully presorted data no moves are needed anymore, not even Sedgewick’s initial moves. *Omitsort* uses \(\mathcal{O}(N)\) comparisons for diagnosing presortedness as non-overlap of the two input sequences.
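The merge-omission logic can be sketched in Python as follows. This is a simplified reconstruction with a full-size buffer region as in *Nocopy-Mergesort*; all names are illustrative, not the greeNsort implementation:

```python
def omitsort(keys):
    """Sketch of Omitsort: mergesort over a data region `a` and an equally
    sized buffer region `b`. Each merge may be omitted for non-overlapping
    (presorted) halves; the sort function just tracks the region in which
    the data currently lives."""
    n = len(keys)
    a = list(keys)                  # data region
    b = [None] * n                  # buffer region

    def sort(lo, hi):
        """Sort positions [lo, hi); return the region holding the result."""
        if hi - lo <= 1:
            return a
        mid = (lo + hi) // 2
        la, ra = sort(lo, mid), sort(mid, hi)
        if la is not ra:            # bring both halves into one region
            la[mid:hi] = ra[mid:hi]
        src = la
        if src[mid - 1] <= src[mid]:
            return src              # non-overlap: the merge is omitted
        dst = b if src is a else a
        i, j = lo, mid              # stable merge src -> dst
        for k in range(lo, hi):
            if j >= hi or (i < mid and src[i] <= src[j]):
                dst[k] = src[i]; i += 1
            else:
                dst[k] = src[j]; j += 1
        return dst

    res = sort(0, n)
    return list(res[:n])
```

For fully presorted input every merge is omitted and the result never leaves the data region.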

For descending data, an ascending *Omitsort* cannot omit: each merge must do its work, hence the total cost is \(\mathcal{O}(N\log{N})\). Cheaper would be to do nothing and only reverse the total sequence after checking presortedness throughout the recursion. Our *Octosort* achieves this by no longer requiring a specific order during the merging: the resulting order is simply data-driven and returned by the recursive sort function. Only if the left and right recursive sorts return different directions is the desired (ascending) order enforced. For perfectly ascending and descending data there are no moves during the recursion; for descending data only a final reversal is needed. Hence *Octosort* is, like *Timsort*, \(\mathcal{O}(N)\) bi-adaptive for pre-sorted data, but unlike *Timsort* much more efficient for non-sorted data. *Octosort* can also be fully parallelized, while *Timsort* has inherently serial tasks.

## Distance minimization

“For decades, the machine balance of compute speed to memory bandwidth has increased 15%–30% per year […] Projections for future machines only exacerbate the current data movement crisis” (Dongarra 2022).

We all have learned that sorting cost is \(\mathcal{O}(N\log{N})\) … for constant access costs in the RAM-model. We all know that the RAM-model is wrong and access costs are not constant: today’s memory hierarchies are deep, L1, L2, L3, RAM on the local socket, RAM on a foreign socket, and today’s virtualization and cloud techniques might even give us memory on different machines, in different LAN and across WAN networks.

The traditional answers to the data movement crisis are k-ary algorithms (which reduce the number of memory passes), block-access (which is only suitable for one of the cache layers) and cache-oblivious algorithms (which are complicated). *greeNsort ^{®}* takes a different perspective, that of an *Ordinal Machine Model*: there is a cost for moving data over distance, but the exact cost of a move (or access) over distance \(d\) is unknown. It could be anything between \(\mathcal{O}(1)\), \(\mathcal{O}(\log{d})\) and \(\mathcal{O}(d)\). An example for the latter (linear move costs) would be sorting \(N\) cars parked in a row next to a road. Let’s assume another free \(N\) positions of buffer parking space. A buffer next to the \(N\) cars is the best we can hope for, hence merging \(N\) cars to the buffer space costs moving each car on average \(N\) positions on each recursion level, which gives us total move costs of \(\mathcal{O}(N^2\log_2{N})\) for all cars on all recursion levels. That’s expensive.

Note that even *K-ary Mergesort* for our cars would still be quadratic: \(\mathcal{O}(N^2\log_k{N})\). Contrast this to *Quicksort*, where the move distance is halved with each partitioning in the recursion, therefore the total move cost is \(\mathcal{O}(N^2)\). This is a little-appreciated reason why *Quicksort* performs robustly on many different machines. In other words, *Quicksort* zooms into local memory, while *Mergesort* keeps merging over global distances.
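Under the stylized assumption of strictly linear move costs, this back-of-the-envelope accounting can be made explicit: in *Mergesort* each of the \(N\) elements moves roughly a constant distance \(N\) (data region to buffer region) on every level, while in *Quicksort* the distance shrinks with the partition size on level \(k\):

\[
\text{Mergesort: } \sum_{k=1}^{\log_2{N}} N \cdot N \;=\; \mathcal{O}(N^2\log_2{N}),
\qquad
\text{Quicksort: } \sum_{k=0}^{\log_2{N}} N \cdot \frac{N}{2^k} \;<\; 2N^2 \;=\; \mathcal{O}(N^2).
\]

The geometric series of shrinking distances is what keeps *Quicksort* quadratic instead of log-linear-quadratic in this model.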

## Gapped-merging

A simple trick brings locality to *Mergesort*: for \(N\) elements of data take \(2N\) elements of memory and alternate merging between odd and even positions; this reduces the move distances as the algorithm divides deeper. Unfortunately gapped merging comes with a couple of disadvantages: reading \(N\) elements actually scans \(2N\) slots, hence gapped *GKnuthsort* is slower than *Knuthsort* on current CPUs. Also, gapped merging requires all elements to have the same size.
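The alternation between odd and even positions can be sketched as follows. This is a hedged conceptual Python model (the \(2N\) slots are a list with `None` gaps; names are illustrative, not the *GKnuthsort* implementation). Reads use one slot parity and writes the other, so merging never overwrites unread input, and each merge stays inside its own \(2m\)-slot region:

```python
def gapped_mergesort(keys):
    """Sketch of gapped merging: N keys occupy 2N slots, data interleaved
    with gaps. Merging alternates between even and odd slot parities, so
    move distances shrink as the recursion divides deeper."""
    n = len(keys)
    mem = [None] * (2 * n)
    for i, k in enumerate(keys):            # setup: keys at even slots
        mem[2 * i] = k

    def sort_into(lo, m, d):
        """Sort the m keys inside slots [lo, lo+2m), ending at parity d."""
        if m == 1:
            if d == 1:                      # key starts at the even slot
                mem[lo + 1], mem[lo] = mem[lo], None
            return
        m1 = m // 2
        s = 1 - d                           # children deliver the other parity
        sort_into(lo, m1, s)
        sort_into(lo + 2 * m1, m - m1, s)
        i = j = 0
        for k in range(m):                  # read parity s, write parity d
            lv = mem[lo + s + 2 * i] if i < m1 else None
            rv = mem[lo + 2 * m1 + s + 2 * j] if j < m - m1 else None
            if rv is None or (lv is not None and lv <= rv):
                mem[lo + d + 2 * k] = lv; i += 1
            else:
                mem[lo + d + 2 * k] = rv; j += 1

    if n:
        sort_into(0, n, 0)
    return [mem[2 * i] for i in range(n)]
```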

## Buffer-merging

Locality requires starting the merging with local buffer gaps between the data. If we want a contiguous result of the sort with only two regions, data and buffer, this requires that we merge not only the data but also the buffer. Consider this: buffer becomes a first-class object of merging. Divide&Conquer *with* buffer and good locality requires Divide&Conquer *of* buffer. Once this is understood, it looks really weird that in *Mergesort* we merged only data but not buffer for decades. Now let’s look at some ways to merge buffer. Let uppercase letters represent data and lowercase letters represent buffer.

If the result of buffer merging is one contiguous data region and one contiguous buffer region, we have to make a fundamental asymmetric choice: data left and buffer right (`Ab`) or data right and buffer left (`aB`). If we assume `Ab` before splitting and after merging, standard self-recursion implies replicating that pattern across all recursion levels, i.e. we merge from `AbCd` to `ACbd`. In order to merge `A` and `C`, a naive approach would first transport all input streams to the right (`bdAC`) and then merge back to `ACbd`; 100% extra moves are expensive (*TKnuthsort*). A more efficient approach would only relocate the streams in the left half to the gaps in the right half (`dbCA`) before merging to the left (`ACbd`). Note that the order of the streams has changed, hence care is needed to retain stability when merging (*Crocosort* with Knuth’s merge). Unfortunately, with 50% extra moves, this is still as inefficient as *Timsort*.

## Symmetric merging

It is possible to get rid of any extra moves if we leverage symmetric mutual recursion: in the left branch we sort the data to the left (buffer to the right) and in the right branch we sort the data to the right (buffer to the left). What we obtain is the data in the outer regions and the buffer in the inner regions: the two inner buffer regions are contiguous without any extra moves. That’s nice, but even nicer is that not more than 50% buffer is needed: merging is done from the inner to the outer input streams such that the result is aligned at the outer border. This semi-inplace merging has the following adaptivity properties: for perfectly presorted data the number of comparisons and moves is automatically reduced to 50%; with a non-overlap test the number of comparisons can be reduced to 0%, but at least 50% of the data must be moved. Symmetric merging offers two variants: a symmetric variant that sorts one branch left-ascending and one branch right-ascending (*Geckosort*) and an asymmetric variant which uses the same order in both branches (*Frogsort*). *Geckosort* is 25% adaptive to ascending and 25% adaptive to descending keys. *Frogsort* is 50% adaptive to presorting in the implemented order. Note that *FROG* is an acronym of *Flip Recursive Organized Gaps*.^{7}
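The semi-inplace merge of a left-sorting branch can be sketched in Python as follows. This is a hedged illustration, not the greeNsort C code; the layout is assumed as: outer left run, a buffer gap the size of the inner run, then the inner right run. Merging proceeds descending from inner to outer, so writes never overtake unread input:

```python
def symmerge_left(r, n1, n2):
    """Semi-inplace merge for a left-sorting branch. Assumed layout:
    left run r[0:n1], buffer gap r[n1:n1+n2], inner right run
    r[n1+n2:n1+2*n2]. The result lands left-aligned in r[0:n1+n2]."""
    i = n1 - 1                    # tail of the outer (left) run
    j = n1 + 2 * n2 - 1           # tail of the inner (right) run
    for k in range(n1 + n2 - 1, -1, -1):
        if i < 0 or (j >= n1 + n2 and r[j] >= r[i]):
            r[k] = r[j]; j -= 1   # ties from the right run keep stability
        else:
            r[k] = r[i]; i -= 1   # once the inner run is exhausted,
                                  # the remaining moves are self-copies
```

For presorted input the inner run is consumed first and only its \(n2\) elements actually move, while the left run ends in self-copies that an implementation can skip: that is the 50% adaptivity mentioned above.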

## Frogsort0

The simple (and first) variant of *Frogsort* was *Frogsort0* (2010): it operates on *triplets* of memory elements, one buffer element symmetrically in the middle between two data elements, i.e. `AbC`. *Frogsort0* splits and merges the triplets. Rule: if there is an odd number of triplets, the surplus triplet goes to the outer side. The setup of the gapped elements can be done before Split&Merge.

position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---
setup | A | b | C | D | e | F | G | h | I
merge | A | C | b | e | D | F | | |
merge | A | C | D | F | b | e | h | G | I
merge | A | C | D | F | G | I | b | e | h

An odd number of elements can be handled by using one extreme dummy value (see the implementation of *Frogsort0*) or by a specific third recursion function that handles the outermost incomplete triplet (see the implementation of *PFrogsort0*).

## Frogsort1

An alternative is splitting single elements in *Frogsort1*. Rule 1: if there is an odd number of elements, the surplus element goes to the outer side. Rule 2: the size of the buffer is \(N/2\) (integer division). The setup is done in an extra recursion before the Split&Merge recursion.

position | 1 | 2 | 3 | 4
---|---|---|---|---
top-2 | A | b | C |
top-1 | A | C | b | D
top | A | C | D | b
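The two rules and the extra setup recursion can be assembled into a complete Python sketch of *Frogsort1*. This is an illustrative reconstruction under the stated rules, not the greeNsort C implementation; all names and the exact region arithmetic are assumptions:

```python
def frogsort1(keys):
    """Sketch of Frogsort1: N elements in N + N//2 slots (Rule 2),
    surplus elements to the outer side (Rule 1), gapped setup in an
    extra recursion, then symmetric Split&Merge."""
    n = len(keys)
    m = [None] * (n + n // 2)
    it = iter(keys)

    def size(k):                           # region size of a branch of k elements
        return k + k // 2

    def setupL(l, k):                      # place keys gapped, data outer-left
        if k == 0: return
        if k == 1: m[l] = next(it); return
        k1, k2 = (k + 1) // 2, k // 2      # surplus to the outer (left) side
        setupL(l, k1)
        setupR(l + size(k) - size(k2), k2)

    def setupR(l, k):                      # mirror image: data outer-right
        if k == 0: return
        if k == 1: m[l] = next(it); return
        k1, k2 = k // 2, (k + 1) // 2      # surplus to the outer (right) side
        setupL(l, k1)
        setupR(l + size(k) - size(k2), k2)

    def sortL(l, k):                       # result left-aligned at [l, l+k)
        if k <= 1: return
        k1, k2 = (k + 1) // 2, k // 2
        sortL(l, k1)                       # outer child: data to the far left
        sortR(l + size(k) - size(k2), k2)  # inner child: data to the far right
        i, j = l + k1 - 1, l + size(k) - 1 # tails of the two runs
        jlo = l + size(k) - k2
        for w in range(l + k - 1, l - 1, -1):  # merge descending, inner to outer
            if i < l or (j >= jlo and m[j] >= m[i]):
                m[w] = m[j]; j -= 1        # ties from the right run: stable
            else:
                m[w] = m[i]; i -= 1

    def sortR(l, k):                       # result right-aligned, mirror image
        if k <= 1: return
        k1, k2 = k // 2, (k + 1) // 2
        sortL(l, k1)
        sortR(l + size(k) - size(k2), k2)
        e = l + size(k)
        i, j, ihi = l, e - k2, l + k1      # heads of the two runs
        for w in range(e - k, e):          # merge ascending, inner to outer
            if i < ihi and (j >= e or m[i] <= m[j]):
                m[w] = m[i]; i += 1        # ties from the left run: stable
            else:
                m[w] = m[j]; j += 1

    setupL(0, n)
    sortL(0, n)
    return m[:n]
```

Each merge writes only into its own region: the write pointer never overtakes the unread outer run because the gap between them starts exactly at the inner run’s size, which is the safety argument behind semi-inplace merging.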

## Frogsort2

Textbook knowledge tells us that Divide&Conquer should be balanced in order to minimize operations. Symmetric merging, however, allows reducing the buffer to much less than 50%: let the inner branch receive \(p\%\) of the elements, then symmetric merging needs only \(p\%\) buffer. Yes, this increases the maximum recursion depth and the total number of operations, but it reduces the total %RAM. Hence for the sustainable Footprint measure we expect a U-shaped function of \(p\). *greeNsort ^{®}* implemented this as *Frogsort2*. For sorting doubles, surprisingly, not only is there an optimal \(p\) below 10% regarding Footprint, there is also an optimal \(p\) below 20% regarding speed. This is surprising given the usual space-speed trade-off expectation that less RAM implies longer runtime. One reason for this is that imbalanced merging makes for better branch-prediction. With *Frogsort2* we have a nice algorithm that can be tuned to specific hardware features using its parameter \(p\). Note that *Frogsort1* is a special case of *Frogsort2* at \(p=0.5\).

## Frogsort3 and 6

So far we have *split* the buffer between the two branches. An alternative is to *share* the buffer, i.e. first we send all the buffer down the left branch, and then we send it down the right branch. This allows using even less buffer. *Frogsort3* does this until there is enough buffer at a branch to switch to *Frogsort1*. *Frogsort6* does this until there is enough buffer at a branch to switch to *Frogsort2*. *Frogsort1* is a special case of *Frogsort3* with parameter \(p=0.5\). *Frogsort6* has two parameters \(p3\) and \(p2\), and *Frogsorts* 1, 2, 3 are special cases of *Frogsort6*.

## Squidsort

*Frogsort* saves 50% compares and moves for presorted data, but not for reverse-sorted data. *Geckosort* saves 25% compares and moves for presorted and for reverse-sorted data. Tuning *Frogsort* with a non-overlap comparison for presorted data reduces all other compares to 0%. Combining *Frogsort* with the data-driven, lazily enforced order of *Octosort* gives *Squidsort*, which needs 0% comparisons and 50% moves for presorted and reverse-sorted data (see *Squidsort1* and *Squidsort2*). *Squidsort* beats *Timsort* and related algorithms such as *Peeksort* and *Powersort* by Munro and Wild (2018), unless the data is extremely presorted.

Gaps are crucial for stable sorting. As late as 2006, Bender, Farach-Colton, and Mosteiro (2006) published *Librarysort*, a gapped insertion sort. Like people leave gaps between books on bookshelves, gaps reduce insertion costs from \(\mathcal{O}(N^2)\) to amortized \(\mathcal{O}(N\log{N})\) in the RAM-model. While the usage of gaps in *Librarysort* is rather probabilistic, *Frogsort* uses gaps in a deterministic, ‘Organized’ way.