**A portfolio of efficient algorithms for different sorting
disciplines is the basis for sustainable sorting.**

The most generic (and expensive) sorting algorithms have the following features:

- generic sorting sorts by comparison
- generic sorting is stable
- generic sorting can sort elements of varying size
- generic sorting has worst case complexity \(\mathcal{O}(N\log{N})\)
- generic sorting can be implemented with concurrent divide and concurrent conquer

**The greeNsort^{®} Portfolio provides
algorithms for different sorting disciplines, ranked here from most
generic to most specialized.**

A popular prior-art method is an indirect *Quicksort* which swaps
pointers to size-varying elements.^{1} This is relatively memory-parsimonious,
particularly if elements are big; however, it comes with *high
random-access CPU costs*. The *greeNsort^{®}*
research shows that a direct approach, which moves the size-varying
elements themselves, can be more efficient.
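
The indirect approach can be illustrated in R, the language of the benchmarks below (a hedged illustration of the prior-art idea, not a greeNsort algorithm):

```
# Indirect sorting in R: sort an index vector ("pointers") instead of
# moving the size-varying strings themselves.
s <- c("pear", "apple", "plum", "cherry")
i <- order(s)  # only small integer indices are moved, not the strings
s[i]           # the final gather pays the random-access cost
```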

**The size-varying greeNsort^{®} algorithms VFrogsort1 and XFrogsort1 need only 50%
buffer and still fulfill all of the above criteria for a generic sorting
algorithm.**

Most prior-art algorithms restrict themselves to equally-sized elements, which allows faster sorting.

The standard methods for stable sorting which allow for concurrent
execution use 100% buffer^{2}; for example, *Mergesort* is the preferred
method for stable sorting in *C++*
if enough memory is available. *Mergesort* can easily be
tuned for adaptivity: an initial comparison checks whether the two
sequences overlap; if not, the cost of comparing can be skipped, which
reduces the merging to mere copying. **The
greeNsort^{®} algorithm Omitsort also skips the cost
of movements; Octosort
further improves adaptivity to cover reverse-sorted data as well. Both
greeNsort^{®} algorithms beat Mergesort with their \(\mathcal{O}(N)\) best case and beat Timsort with less
copying.**
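
A minimal sketch of this adaptivity tuning in plain R (an illustration under assumptions, not greeNsort's C implementation): if the two presorted runs do not overlap, one initial comparison reduces the whole merge to copying.

```
# Hedged sketch of an adaptive merge of two non-empty presorted runs:
# one initial comparison can skip all further comparisons.
adaptive_merge <- function(a, b) {
  if (a[length(a)] <= b[1]) return(c(a, b))  # runs do not overlap: copy only
  if (b[length(b)] <  a[1]) return(c(b, a))  # reverse-ordered runs: copy only
  r <- vector(mode(a), length(a) + length(b))
  i <- 1L; j <- 1L
  for (k in seq_along(r)) {                  # standard stable two-way merge
    if (j > length(b) || (i <= length(a) && a[i] <= b[j])) {
      r[k] <- a[i]; i <- i + 1L              # ties taken from a: stability
    } else {
      r[k] <- b[j]; j <- j + 1L
    }
  }
  r
}
```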

*Timsort* as used in
Python and Java is a highly tuned adaptive *Natural-Mergesort*
which, in the best case of presorted data, costs only \(\mathcal{O}(N)\) and which needs only 50%
buffer. However, beyond its implementation complexity, the disadvantage of
*Timsort* is its much higher cost
for really unsorted data and its strongly serial logic, which hampers
concurrency. **The greeNsort^{®} algorithms Frogsort1 and Geckosort1 achieve the
generality of Mergesort
but save 50% buffer. Frogsort1 is automatically
(without tuning) adaptive to presorted data, and Geckosort1 is
automatically adaptive to presorted and reverse-sorted data. Squidsort1 is a variant of
Frogsort1 tuned with
techniques from Octosort
to be bi-directionally most adaptive. Like Mergesort and unlike Timsort, the
greeNsort^{®} algorithms are efficient no-copy
algorithms (not more than one move per merge).**

The *greeNsort^{®}* research is not aware of prior-art
algorithms that can match the speed of a good Mergesort with only 50%
buffer.

Academic research was obsessed for decades with finding a true stable
inplace (i.e. 0-buffer) Mergesort that can practically compete with
standard *Mergesort*. In
2008, when Robert Sedgewick termed this goal the *holy sorting
grail*, inplace-mergesort (*IMergesort*) was more than
a factor of 5 slower than *Mergesort*. In 2014^{3} Andrey Astrelin
managed to implement *Grailsort*, based on an
algorithm published only theoretically by Huang&Langston
(1989-1992), such that it was not more than 1.5 times slower than
*Mergesort*. Andrey Astrelin also showed that a true
inplace sort is somewhat academic: he provided a variant (*Sqrtsort*) that uses negligible
\(\mathcal{O}(\sqrt{N})\) buffer and
is only 1.3 times slower than standard *Mergesort*. **The
greeNsort^{®} algorithms Walksort and Jumpsort also work with such
little \(\mathcal{O}(\sqrt{N})\) buffer
and achieve approximately the speed of standard Mergesort; further tuning
for adaptivity is possible**.

It is well known that sacrificing stability enables effective
inplace-partitioning of equally-sized elements. Further sacrificing
worst-case \(\mathcal{O}(N\log{N})\)
for only average-case \(\mathcal{O}(N\log{N})\) (by using random
pivots instead of deterministic median-pivots) enables
inplace-partitioning with *Quicksort* at about the speed
of *Mergesort*. In other
words: when the sacrifices are acceptable, *Quicksort* tends to
be more sustainable than *Mergesort* according to the
*greeNsort^{®}* footprint definition.^{4}

Recently *Super Scalar Sample Sort* (SSSS, or S4 for short) has
been developed with the goal of replacing standard sorting libraries.
While it is faster than current standard sorting libraries, it is not
stable^{5},
it is more complex, it requires more buffer^{6} and its serial^{7}
implementation (*IS4o*) is slower than
the best algorithms in the following section.

**The greeNsort^{®} algorithms Zacksort and Zucksort provide simple and
elegant solutions to the question open since Hoare’s 1961 inception of
Quicksort: how to obtain
early-termination on ties without the cost of extra operations slowing
down the algorithm on non-tied data.**

**A bit of history**: The original *Quicksort* as published by
Hoare (1961) seemed simple and almost perfect, but was vulnerable to
quadratic runtime for certain data inputs. Sedgewick (1977) compared
different alternatives and recommended a (modified) *Quicksort2* of Singleton
(1969) which had average-case \(\mathcal{O}(N\log{N})\) for all data
inputs, but no longer early-terminated to \(\mathcal{O}(N \log{D})\) in the tied case
(where \(D\) is the number of
*distinct* values). 3-way Quicksort by Wegner (1985), improved and
popularized by Bentley&McIlroy (1993), re-introduced early
termination on ties by isolating a third partition with pivot-ties, but
due to extra operations was slower for untied data (*Quicksort3*). Yaroslavskiy’s
(2009) faster dual-pivot Quicksort (*Dupisort*) used such
extra operations to create a real third partition^{8}, but the complex^{9} algorithm
is strongly serial and probably impossible to implement branchless.
Edelkamp&Weiß (2016) published the much faster, simple, branchless
Block-Quicksort (*Quicksort2B*)^{10}, but
only with a rudimentary and expensive early-termination mechanism. Orson
Peters (2015) combined a branchless algorithm with a proper tuning for
ties and a proper tuning for presorted data; the tuning overhead of
extra operations in his *Pattern-Defeating Quicksort* (*Pdqsort*) is very little, so it
is close to optimal, see the C++ implementation on github.
However, once combined with an expensive comparison function such as
localized string comparison `strcoll`, it becomes slower than
*Quicksort2*. **The
greeNsort^{®} algorithms Zacksort and Zucksort achieve early
termination on ties at only one extra comparison per
partitioning task; like Hoare’s original they use random pivots, but they
guarantee convergence for any input pattern at expected cost of \(\mathcal{O}(N \log{D})\). Ducksort additionally is
\(\mathcal{O}(N)\) adaptive to
presorted data**.
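
The principle of early termination on ties can be sketched in a few lines of R with a three-way split in the spirit of *Quicksort3* (a hedged illustration only: not Zacksort's actual partitioning scheme, and not inplace, since R copies vectors):

```
# Early termination on ties via a three-way split with a random pivot:
# with D distinct values the expected cost is O(N log D).
quicksort3 <- function(x) {
  if (length(x) < 2L) return(x)
  p <- x[sample.int(length(x), 1L)]  # random pivot: no vulnerable input pattern
  c(quicksort3(x[x < p]),            # recurse into smaller keys
    x[x == p],                       # all pivot ties are final: no recursion
    quicksort3(x[x > p]))            # recurse into larger keys
}
```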

Sorting algorithms exploiting specific situations play an important
role in reducing the footprint of sorting.
*greeNsort^{®}* focuses on generic algorithms; however,
it is important to know the following classes of specific algorithms,
which do not sort by comparisons.

Analytic sorting algorithms sort elements by incrementally analyzing the keys for recursive partitioning; in other words, they require the specific situation where the sorting order can be determined by analyzing the keys alone, without mutually comparing them. Malte Skarupke (2015) has stressed the importance of offering a convenient API for as many binary data types as possible; his C++ code can be found online.

Incremental analysis of string keys is known to everyone who has seen
a phone book. Strings are recursively partitioned
character-by-character; sorting ends as soon as the number of remaining
elements is one (or small enough to delegate the remaining work to
another algorithm). This method can handle strings of varying lengths,
but unlike generic sorting it cannot handle locale-specific collation
orders. The straightforward way of doing this is partitioning into \(d_i\) buckets, where \(d_i\) is the number of distinct characters
at the i-th position of the string. This works well for small alphabets;
if the number of distinct characters becomes prohibitively high, an
alternative related to *3-way Quicksort* with a fan-out between 2
and 3 is Multi-key
Quicksort.
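
A hedged sketch of this recursion in plain R (bucket order follows R's default character ordering; real implementations such as Multi-key Quicksort work inplace in C):

```
# Recursive character-by-character partitioning of strings (MSD style):
# bucket by the d-th character, then recurse into each bucket.
msd_string_sort <- function(s, d = 1L) {
  if (length(s) < 2L) return(s)        # one element left: sorted
  ch   <- substr(s, d, d)              # d-th character, "" if exhausted
  done <- s[ch == ""]                  # exhausted strings sort first
  rest <- s[ch != ""]
  buckets <- split(rest, substr(rest, d, d))  # one bucket per character
  c(done, unlist(lapply(buckets, msd_string_sort, d = d + 1L),
                 use.names = FALSE))
}
```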

Incremental byte-by-byte analysis of binary types such as integers
works the same way and is called *MSB-Radixsort*. While
*MSB-Radixsort* supports early termination,
*LSD-Radixsort* does not: it always processes all bytes. With a
fixed number of *k* bytes in a specific binary type, the cost of
*LSD-Radixsort* is \(\mathcal{O}(N
\cdot k)\), which makes some people believe that Radixsort has
cost linear in *N*. They claim that the \(\mathcal{O}(N\log{N})\) rule applies only
to generic comparison sorts, but that is a misinterpretation which
ignores that a fixed number of bytes limits the size *D* of the
domain, for example how many different integers the binary type can
represent. For more information see Wikipedia.
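
A hedged sketch of *LSD-Radixsort* for non-negative 32-bit integers in plain R makes the \(\mathcal{O}(N \cdot k)\) cost visible: exactly *k* stable counting passes, one per byte, with no early termination.

```
# LSD-Radixsort sketch: k = 4 stable counting passes over the bytes,
# least significant byte first; every pass processes all N keys.
lsd_radix_sort <- function(x, bytes = 4L) {
  for (b in 0:(bytes - 1L)) {
    d <- bitwAnd(bitwShiftR(x, 8L * b), 255L)  # extract the b-th byte
    counts <- tabulate(d + 1L, nbins = 256L)   # counting pass
    starts <- cumsum(counts) - counts          # exclusive prefix sums
    out <- integer(length(x))
    for (i in seq_along(x)) {                  # stable scatter pass
      j <- d[i] + 1L
      starts[j] <- starts[j] + 1L
      out[starts[j]] <- x[i]
    }
    x <- out
  }
  x
}
```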

Synthetic sorting algorithms sort only keys that stem from a limited domain. This requires that elements contain nothing but keys, that there is a limited number of distinct keys, and that those keys can be projected onto consecutive positions. Typically a linear projection is used, i.e. synthetic sorting handles interval-scale data, but unlike generic sorting algorithms it cannot handle ordinal-scale data (which requires mutual comparisons).

*Counting sort* starts with a vector of counters initialized
to zero, then it takes one pass over the data to project each key to a
position and increase a counter at this position. Once all elements have
been counted, one pass over the counters *synthesizes* the keys
in the order of the counter positions. This obviously has only linear
\(\mathcal{O}(N)\) cost. For more
information see Wikipedia.
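
A hedged sketch in R for integer keys from a known domain 1..m (the limited domain is the precondition of the method):

```
# Counting sort sketch: one pass counts the keys, one pass over the
# counters synthesizes them in order; cost is O(N + m).
counting_sort <- function(x, m = max(x)) {
  counts <- tabulate(x, nbins = m)  # pass 1: count each key
  rep.int(seq_len(m), counts)       # pass 2: emit key i counts[i] times
}
```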

If it is known that each key can only occur once, instead of a vector
of counters a vector of bits is sufficient. After initializing all bits
to 0, *Bit sort* takes one pass over the data to project each key
to a position and set that bit to 1. Once all keys have been set, one
pass over the bits *synthesizes* the unique keys in the order of
the bit positions. This obviously has only linear \(\mathcal{O}(N)\) cost and is very memory
parsimonious. Sorting by marking bits has been described in the chapter
*Cracking the Oyster* of the famous book *Programming
Pearls* by Jon Bentley (1999).
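
A hedged sketch in plain R for unique keys in 1..m (a `logical` vector spends a whole byte per flag; package bit, used below, stores true 1-bit flags):

```
# Bit sort sketch for *unique* integer keys in 1..m.
bit_sort_unique <- function(x, m = max(x)) {
  seen <- logical(m)  # all flags initialized to FALSE
  seen[x] <- TRUE     # one pass: mark the position of each key
  which(seen)         # synthesize the keys in order of the positions
}
```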

The following examples demonstrate the power and limitations of specialized algorithms. Say we want to sort 1 million integers. We use R and our package bit.

```
# package 'bit' version 4.0.0 is not yet on CRAN
# hence installation from github
# devtools::install_github("truecluster/bit")
require(bit)
require(microbenchmark)
```

If the integers are sampled from a small domain (2 million values) with few or no duplicates, the specialized radixsort is faster than quicksort, and the even more specialized bitsort is faster still, particularly if we tell bitsort that there are no duplicates, as in this example:

```
if (!exists("sx")) {
x <- sample(2e6, 1e6)
sx <- rbind(
quicksort = summary(microbenchmark(sort(x, method = "quick"), times = 10), unit = "ms")
, radixsort = summary(microbenchmark(sort(x, method = "radix"), times = 10), unit = "ms")
, multip_bitsort = summary(microbenchmark(bit_sort(x, has.dup = TRUE ), times = 10), unit = "ms")
, unique_bitsort = summary(microbenchmark(bit_sort(x, has.dup = FALSE), times = 10), unit = "ms")
)
}
round(sx[,-1], 1)
```

```
## min lq mean median uq max neval
## quicksort 65.1 66.3 67.3 67.0 68.0 70.3 10
## radixsort 50.9 51.1 52.8 52.0 54.5 56.6 10
## multip_bitsort 28.5 30.1 33.1 33.1 35.9 38.6 10
## unique_bitsort 15.6 16.0 16.8 17.0 17.1 19.0 10
```

However, if we sample from a huge domain (1 billion values), the more specialized bitsort algorithms take significantly longer:

```
if (!exists("sy")) {
y <- sample(1e9, 1e6)
sy <- rbind(
quicksort = summary(microbenchmark(sort(y, method = "quick"), times = 10), unit = "ms")
, radixsort = summary(microbenchmark(sort(y, method = "radix"), times = 10), unit = "ms")
, multip_bitsort = summary(microbenchmark(bit_sort(y, has.dup = TRUE ), times = 10), unit = "ms")
, unique_bitsort = summary(microbenchmark(bit_sort(y, has.dup = FALSE), times = 10), unit = "ms")
)
}
round(sy[,-1], 1)
```

```
## min lq mean median uq max neval
## quicksort 65.9 66.5 67.8 67.4 68.2 72.5 10
## radixsort 36.7 38.1 39.2 38.8 39.6 45.5 10
## multip_bitsort 139.5 140.0 143.0 141.1 146.8 149.9 10
## unique_bitsort 1459.3 1465.5 1491.8 1483.9 1527.2 1535.1 10
```

^{1} although *Quicksort* has only average-case complexity \(\mathcal{O}(N\log{N})\) and although concurrent conquer is difficult.

^{2} A standard optimization for reducing the buffer need of such merging-based algorithms is to allocate 50% buffer, use this buffer first for sorting the left half, then for sorting the right half, and finally for merging both halves; a limitation of this approach is that both halves cannot be sorted concurrently.

^{3} In the same year Robert Sedgewick changed his definition of the *holy sorting grail* from a \(\mathcal{O}(N\log{N})\) to a \(\mathcal{O}(N)\) best case, which denies Andrey Astrelin the recognition for having achieved the *holy sorting grail*. See Sedgewick's lectures of the years 2008, 2009, 2010, 2011, 2012, 2013 and 2014. Sadly, Andrey Astrelin passed away on January 13th 2017 at the age of 48.

^{4} This implies that if in a given system sufficient memory is available that is currently not needed for other tasks, then investing negligible buffer into *Walksort* or little buffer into *Frogsort2* can eliminate the remaining risk of quadratic runtime.

^{5} The original publications do not specify stability, but the implementors clarify: *“So far there is no stable implementation planned. A stable implementation might be difficult due to the algorithm’s design.”*

^{6} The inplace version needs at least \(\mathcal{O}(\sqrt{N})\) buffer.

^{7} Sorting doubles, the parallel implementation IPS4o is faster; we have not compared against a parallel implementation of *Pattern-Defeating Quicksort*.

^{8} This improves the average \(\mathcal{O}(N\log_2{N})\) algorithm to \(\mathcal{O}(N\log_3{N})\) regarding moves.

^{9} *greeNsort^{®}* research discovered unneeded moves in certain situations.

^{10} Fun fact: the optimization published by Edelkamp&Weiß (2016) was already suggested in a little-known paper of Hoare (1962) in which he gives context and explanations for his 1961 algorithm.
