
PLF C++ Library - plf::reorderase

Introduction

Reorderase (a portmanteau of 'reorder' and 'erase') is an evolution of the classic "swap-and-pop" idiom commonly used to speed up singular erasures in deques and vectors. In essence that technique involves swapping the element you want to erase with the back element of the container, then using pop_back() to erase it. Since popping the back element involves no element relocation, whereas erasing from a non-back location in a vector/deque requires moving every single element after the erased element backwards by 1, this has a strong performance advantage, particularly when size() is large and/or the element type is large or non-trivially-copyable. The only caveat is that you have to be okay with the order of elements in the container changing when you erase.
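As a rough illustration, here is a minimal sketch of the classic idiom (the function name and index-based signature below are my own, purely for illustration, not part of the library):

#include <cstddef>
#include <utility>
#include <vector>

// Classic swap-and-pop: erase vec[index] without shifting the tail elements,
// at the cost of changing element order.
template <typename T>
void swap_and_pop(std::vector<T> &vec, std::size_t index)
{
	std::swap(vec[index], vec.back()); // bring the back element into the hole
	vec.pop_back();                    // discard the now-redundant back element
}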

We can optimize this technique further by removing the unnecessary swap and simply copying or moving the back element onto the element we wish to erase, then popping. This reduces a five-step operation (construct a temporary, copy the erasure target to the temporary, copy the back element to the erasure location, copy the temporary to the back, pop) down to a two-step operation (move/copy + pop) and removes the construction/destruction, and any allocation/deallocation, required for the swap's temporary. Reorderase (single erasure) does this and also accommodates trivial types, non-movable types and types which might throw when copying (the latter requires a temporary buffer in case the copy fails).
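A minimal sketch of the optimised single-element technique follows (again illustrative only - the name and index-based signature are not the library's, and the real implementation also deals with trivial types, non-movable types and throwing copy constructors as noted above):

#include <cstddef>
#include <utility>
#include <vector>

// Move-and-pop: overwrite the element being erased with the back element,
// then pop - one move/copy plus a pop, with no temporary swap object.
template <typename T>
void move_and_pop(std::vector<T> &vec, std::size_t index)
{
	if (index != vec.size() - 1)            // guard against self-move-assignment
		vec[index] = std::move(vec.back());
	vec.pop_back();
}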

Reorderase goes further: the same technique used for single erasures can be extended to ranged erasures (i.e. erase(first, last)) with great results - again, provided that one does not care about element reordering. It can also be used to construct alternatives to std::erase_if, std::erase and std::partition which have strong performance advantages with larger element types.
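To illustrate the ranged technique, here is a simplified, index-based sketch (my own, not the library's implementation): rather than shifting everything after the erased range backwards, the gap is filled with elements moved from the tail of the vector, then the vector is shrunk.

#include <algorithm>
#include <cstddef>
#include <vector>

// Unordered range erase of [first, last): fill the gap with elements moved
// from the end of the vector, then shrink. Element order is not preserved.
template <typename T>
void unordered_range_erase(std::vector<T> &vec, std::size_t first, std::size_t last)
{
	const std::size_t gap = last - first;
	const std::size_t tail = vec.size() - last;   // elements after the erased range
	const std::size_t fill = std::min(gap, tail); // tail elements to relocate
	std::move(vec.end() - fill, vec.end(), vec.begin() + first);
	vec.resize(vec.size() - gap);                 // drop the moved-from/erased tail
}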

In addition, a few new algorithms become apparent: an unordered erase_if-style erasure for only a sub-range of a container's elements, which basically takes elements from the back of the container to fill in the holes. A slightly different method of partitioning from the three major vendor implementations is also introduced (plf::partition), which performs better on some CPU architectures and worse on others (older architectures tend to benefit more), as well as its destructive equivalent (plf::destructive_partition), which leaves moved-from elements at the end of the range instead of swapping.
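A rough sketch of the first of these (the sub-range erasure) follows - an illustrative, index-based sketch under my own naming, not the library's implementation. Matching elements within the sub-range are overwritten with elements taken from the back of the container, and the container shrinks accordingly:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// erase_if-style unordered erasure restricted to the sub-range [first, last):
// holes are filled from the back of the container.
template <typename T, typename Pred>
void subrange_unordered_erase_if(std::vector<T> &vec, std::size_t first, std::size_t last, Pred pred)
{
	last = std::min(last, vec.size());

	for (std::size_t i = first; i < last; )
	{
		if (!pred(vec[i])) { ++i; continue; }

		// Is the filler (the current back element) itself within the sub-range?
		const bool filler_in_subrange = (vec.size() <= last);

		if (i != vec.size() - 1)
			vec[i] = std::move(vec.back());
		vec.pop_back();

		if (last > vec.size()) last = vec.size();
		if (!filler_in_subrange) ++i; // filler came from outside the sub-range - don't test it
		// otherwise re-test vec[i], since the filler was an as-yet-untested sub-range element
	}
}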

A performance summary when using std::vector follows, using averages from benchmarks across four different element types (4-byte, 8-byte, 40-byte and 490-byte) and numbers of elements ranging from 10 to ~920000/870000 (depending on test). The numbers of elements are increased by a factor of 1.1 for each measurement, hence there are more measurements at lower numbers of elements and the results are thereby skewed toward low numbers of elements. Reorderase's performance advantages increase with larger numbers of elements, hence the averages below can be seen as very conservative. To clarify, 100% faster == twice as fast, 200% faster == three times as fast:

See the individual benchmarks below for more specifics on different types and numbers of elements.

Motivation

Although the swap-and-pop idiom is relatively well-known in computing circles, a lot of people do not know about it. In addition, many who do know the technique use the original swap approach rather than the optimized move-and-replace one. This implementation is the first I've seen for ranged erasures or erase_if alternatives (I have subsequently seen Brent Friedman's implementation from the 2015 unstable_remove paper, which is similar in terms of logic but not as optimized). Non-back vector erasure and non-front/back deque erasure have appalling performance, but many programmers in my experience do not appear to realise the extent to which this hampers them.

License

plf::reorderase is under a permissive zlib license. This means: the software is used on an 'as-is' basis. It can be modified and used in commercial software. Authors are not liable for any damages arising from its use. The distribution of a modified version of the software is subject to the following restrictions:

The origin of the software must not be misrepresented.
Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
The license notice may not be removed or altered from any source distribution.

Download

Download here (4kb zip file) or view the repository

The reorderase function library is a simple .h header file, to be included with a #include directive.
In addition, if you are interested in benchmarking, you can also download the plf benchmark suite.
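By way of a usage sketch (the function names below are those used on this page, but the header name, exact overloads and return types are assumptions - check the downloaded header):

#include <vector>
#include "plf_reorderase.h" // assumed header name - see the download above

int main()
{
	std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8};

	plf::reorderase(v, v.begin() + 2);            // unordered single erasure
	plf::reorderase(v, v.begin(), v.begin() + 3); // unordered range erasure
	plf::reorderase_all_if(v, [](int i) { return (i & 1) == 0; }); // erase_if-style removal
}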

Function Descriptions and syntax

const_iterators are not supported, for the obvious reason that the elements referred to by the position and first iterators will be written to.

You might ask, why return an iterator? With the regular vector/deque erase member functions, the location of the next element post-erasure is always the same as the location of the first element you erased: for singular erasures, position is the next element's location post-erasure, as every subsequent element gets shifted back by one, and for range erasures first is likewise always the returned location. Returning an iterator from those erase functions is therefore a waste of CPU cycles. Unfortunately, the optimisations this algorithm makes for deque (erasing from the front, where appropriate, instead of moving elements) mean that the return value isn't necessarily first or position - it could be position + 1, or last.
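This matters when erasing while iterating: continue from the returned iterator rather than incrementing past the erased position, since the element now occupying that slot has not yet been examined. A sketch, assuming the single-erasure call form shown earlier:

#include <vector>
#include "plf_reorderase.h" // assumed header name

void erase_multiples_of_three(std::vector<int> &v)
{
	for (auto it = v.begin(); it != v.end(); )
	{
		if (*it % 3 == 0)
			it = plf::reorderase(v, it); // returned iterator == next element to examine
		else
			++it;
	}
}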

Okay, last question: these algorithms are only for random_access containers - is there any point in making them compatible with bidirectional containers like colony, std::list, etc.? No. Those containers do not have the performance issues with erasure that deque/vector/static_vector do; it is only deques and vectors which suffer the massive amount of element relocation caused by erasures.

Benchmarks

Benchmark results for reorderase v1.00 under GCC 11.2 x64 on an Intel i7 9750 are below. The benchmark test code can be downloaded above. Benchmark methodology is the same as that on the main benchmarks page. The numbers of elements measured increase on a logarithmic scale, hence the averages presented in the text are skewed toward lower numbers of elements, since there are more samples at those sizes. Higher numbers of elements consistently show the strongest performance improvements from reorderase, so this skew works against reorderase.

Singular erasures comparison

To start we'll analyse singular element erasures. Here we erase 25% of all elements in a container while iterating.


test result graph

Singular erasures of ints result in a skewed average of reorderase being 104014% faster than vector.erase, with a maximum of 1165378% faster for 870000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 3000 elements. The minimum speed increase is 10%, at 10 elements.

test result graph

Singular erasures of doubles result in a skewed average of reorderase being 197548% faster than vector.erase, with a maximum of 2098764% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 2000 elements. The minimum speed increase is 15%, at 10 elements.

test result graph

Singular erasures of small structs result in a skewed average of reorderase being 1255606% faster than vector.erase, with a maximum of 12990540% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 300 elements. The minimum speed increase is 52%, at 10 elements.

test result graph

Singular erasures of large structs result in a skewed average of reorderase being 2485098% faster than vector.erase, with a maximum of 25974010% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 55 elements. The minimum speed increase is 135%, at 10 elements.

The skewed average across all types and numbers of elements is 2375914% faster for reorderase.

Range erasures comparison

Here we erase 25% of all elements by erasing random ranges of elements, with a maximum random range size of the square root of the number of elements.


test result graph

Range erasures of ints result in a skewed average of reorderase being 10826% faster than vector.erase, with a maximum of 92351% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 13957 elements. The minimum speed increase is 20%, at 10 elements.

test result graph

Range erasures of doubles result in a skewed average of reorderase being 15327% faster than vector.erase, with a maximum of 90208% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 6500 elements. The minimum speed increase is 49%, at 10 elements.

test result graph

Range erasures of small structs result in a skewed average of reorderase being 15315% faster than vector.erase, with a maximum of 63566% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 1050 elements. The minimum speed increase is 98%, at 10 elements.

test result graph

Range erasures of large structs result in a skewed average of reorderase being 12128% faster than vector.erase, with a maximum of 74438% faster for 920000 elements. The first measurement at which reorderase is a factor of ten faster than vector.erase is at 450 elements. The minimum speed increase is 117%, at 10 elements.

The skewed average across all types and numbers of elements is 13624% faster for reorderase.

std::erase_if comparison

Here we erase 25% of all elements via std::erase_if, the reorderase equivalent being reorderase_all_if.


test result graph

erase_if of ints results in a skewed average of reorderase_all_if being 7% faster than std::erase_if, with a lot of variability. Numbers of elements under 19 perform marginally (up to 2%) better with std::erase_if, probably due to the low amount of element movement and greater simplicity of the erase_if algorithm.

test result graph

erase_if of doubles results in a skewed average of reorderase_all_if being 7% slower than std::erase_if, with a lot of variability. It is uncertain why doubles perform worse than ints here, but the finding is consistent across multiple machines; reorderase_all_if is not worthwhile in this case.

test result graph

erase_if of small structs results in a skewed average of reorderase_all_if being 22% faster than std::erase_if, with a minimum of 7% and a maximum of 52%.

test result graph

erase_if of large structs results in a skewed average of reorderase_all_if being 154% faster than std::erase_if, with a minimum of 50% and a maximum of 282%.

The skewed average across all types and numbers of elements is 43% faster for reorderase, though this is heavily weighted by the large struct results.

Version history

Contact:
plf:: library and this page Copyright (c) 2023, Matthew Bentley