Chapter 3 Methods&Measurement

3.1 Implementation

The basic philosophy of the greeNsort^® test-bed is implementing all algorithms with a equally low degree of tuning for fair scientific comparison. Some algorithms have additionally implemented with well-defined tuning to presorted values (postfix ‘A’) or a simple approach to avoid branch-misprediction (postfix ‘B’). The latter works well on partitioning, with few compiler-versions on merging. All algorithms have been thoroughly tested (also for stability) and checked for memory-leaks through valgrind.

Each sorting algorithm has been implemented as a static function in a separate C module that contains all required subroutines, this insures fair code locality and reduces linker artifacts. In production code the implied code-duplication for multiple used functions such as Insertionsort is usually avoided, here in the test-bed this redundancy gives more fair comparisons.

Older implementations often have two versions of the sorting algorithm (one rather using pointers, one rather using indexed arrays). It turned out that pointer implementations are more practical and tended to be faster in merging, indexed array implementations were more readable and not slower in partitioning, hence newer implementations did not duplicate and compare this anymore. There are usually two context-embeddings of the sorting algorithm:

a simpler ex-situ version which assigns new memory for data and buffer, copies data to new memory, sorts, and copies back (copying back allows quality assurance)
a more complex in-situ version that only allocates new memory for buffer and re-uses original memory for sorting (which can be more complicated)

To get the picture, see the C and R code for Knuthsort and Frogsort1 in the src and R folder of the greeNsort R-package.

3.2 Simulation

All measurements in this document refer to in-situ versions. For the Split&Merge algorithms with less than 100% buffer, the final in-situ merge is necessarily imbalanced and potentially more expensive, hence choosing the in-situ setting gives conservative evaluations of some greeNsort^® algorithms.

For details see the scripts serial.R, parallel.R and string.R in the inst folder of the greeNsort R-package.

3.3 Basic measurement

For Energy measurement the test-bed uses the (R)unning (A)verage (P)ower (L)imit, Energy measurement of Intel (RAPL) counters of the linux powercap kernel module, see lib_energy.h, lib_energy.c and perf.R. The RAPL counters occasionally overflow, the test-bed detects and corrects overflow (assuming no multiple overflow, which does not happen for the duration of the measurements). RAPL has separate counters for RAM energy (dram) and package energy, which is the sum of CPU energy (core), GPU energy (uncore) and a rest that is not counted separately. After overflow-treatment, the test-bed decomposes those counters into core, unco and rest base. All sorting algorithms in the greeNsort^® test-bed return performance data:

n: number of elements
b: average bytes per element
p: number of active processes (usually 1, 0 for perfsleep)
t: number of active threads per process
size: %RAM relative to data size
secs: RUNtime in seconds from a high-resolution clock⁷
base: basic Energy
core: CPU Energy
unco: GPU Energy⁸
dram: RAM Energy

3.4 Calibration

An idle computer still consumes energy, for keeping the RAM alive, keeping the CPU ready, for background services etc. The test-bed provides functions that measure idle energy (perfsleep, calibrate), estimate idle energy for a given RUNtime (see function calibrate) and optionally adjust measurements for idle energy. The R function optperf returns measurements according to settings in options("greensort_perf_calc") by calling one of rawperf (raw unchanged measurement), difperf (difference to idle measurements) or adjperf (keeps total energy constant by reducing all energy measurements but base and adding the reduction to base). The evaluations in this paper use the greensort_perf_calc='adj' setting.

3.5 Evaluation

The test-bed provides a couple of functions extracting well-defined KPIs from the measurements. For sorting algorithms GPU-energy is not interesting, hence bcdEnergy, the sum of base, core, dram is a reasonable energy measurement (and is not influenced by options("greensort_perf_calc")). The pcdEnergy measurement is similar but adds only the thread-fraction of base (counting only hardware-threads, not hyper-threads, it is influenced⁹ by options("greensort_perf_calc")). This paper uses:

3.6 Benchmark data

The algorithms are measured on the following input data patterns:

permut: randomly permuted numbers from \(1 \dots n\)
tielog2: random sample of \(\log_2{n}\) distinct values
ascall: n distinct ascending numbers
asclocal: n distinct numbers randomly put into \(\sqrt{n}\) presorted sequences of length \(\sqrt{n}\)
ascglobal: n distinct numbers cut into ascending \(\sqrt{n}\) quantiles of length \(\sqrt{n}\) and randomly permuted per quantile

Measurements for these 5 patterns are averaged to a TOTAL KPI for ranking of algorithms. Furthermore the following 3 patterns are measured

descall: n distinct descending numbers
desclocal: n distinct numbers randomly put into \(\sqrt{n}\) reverse-sorted sequences of length \(\sqrt{n}\)
descglobal: n distinct numbers cut into descending \(\sqrt{n}\) quantiles of length \(\sqrt{n}\) and randomly permuted per quantile

The latter are interesting with regard of the symmetry of adaptivity, but not included in the former TOTAL KPI. The decision to include ascending and exclude descending has been made on the rationale that adaptivity to ascending data is more important than to descending because ascending data arises more naturally (e.g. ascending datetime) or can be created at little cost (by reversing). Furthermore, including another three presorted patterns would give too much weight on easy patterns.

3.7 Scripts and results

The measurement are done in three parts serial, string and parallel, most important results are integrated into this document, for more details see:

serial.R with results of serial algorithms on double data in serial.RData (n=2^21, 100 replications)
string.R with results of serial algorithms on null-terminated strings in string.RData (n=2^17, 100 replications)
parallel.R with results of parallel algorithms on double data in parallel.RData (n=2^24, 50 replications for each combination of 1,2,4,8 processes with 1,2,4,8 threads)

Complementing the charts in this paper, we have prepared PDFs (serial.pdf, string.pdf, parallel.pdf) with charts that visualize raw data for exploratory checks such as outliers or suspicious trends (to us the raw data looks good). To deal with few expected outliers, we report the median instead of the mean of measurements.

3.8 Hardware, OS, compiler

All measurements reported here are done on an Intel¹⁰ i7-7700 CPU under ubuntu.20.04 with the 5.13.0-44-generic kernel and compiling the test-bed with gcc.9.4.0. The CPU is run with hyper-threading switched-off in the BIOS: this goes with a clear expectation that algorithms can successfully satisfy up to 4 cores (more threads are unlikely to speed-up) and provides a clear model for the attribution of the base-energy to 4 cores in pcdEnergy.

see lib_timing.h and lib_timing.c↩︎
on my Intel i7-7700 not in RAPL, hence 0↩︎
Note that the reported energy-ratios are more or less the same whatever choice of energy-measure and idle-correction is used↩︎
Measurements on AMD® Ryzen 7 pro 6850u were also done giving more or less similar results, however, RAPL measurements on Intel are more precise. The 8-core AMD has hyperthreading switched on, I report some results of simple tests (one measurement of big problem after 1 second CPU sleep), see Update: Powersort and Update: AMD ↩︎