
PLF C++ Library - Benchmarks

Last updated 21-8-2017, plf::colony v4.00, plf::list v1.00

  1. Test setup
  2. Tests design
  3. Raw performance benchmarks (against standard containers)
  4. Comparative performance benchmarks (against modified standard containers)
  5. Unordered low-modification scenario test
  6. Unordered high-modification scenario test
  7. Ordered low-modification scenario test
  8. Ordered high-modification scenario test
  9. Referencing scenario test
  10. Sort performance
  11. Overall performance conclusions

Test machine setup

The test setup is an Intel Xeon E3-1241 (Haswell core) with 8GB RAM, using GCC 7.1 x64 as the compiler. The OS is a stripped-back Windows 7 x64 SP1 installation with most services, background tasks (including explorer) and all networking disabled. Build settings are "-O2 -march=native".

The source code for the benchmarks can be found in the colony page downloads section.

General test design

The first (roughly) 10% of all test runs are discarded in order to 'warm up' the cache and get statistically-meaningful results. Tests are based on a sliding scale of number of runs vs number of elements, so a test with only 10 elements in a container may average 100000 runs to guarantee a more stable result average, whereas a test with 100000 elements may only average 10 runs. This tends to give adequate results without overly lengthening test times. I have not included results involving 'reserve()' as yet.
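As a rough illustration of the run scaling and warm-up discard described above, here is a minimal sketch. The names runs_for and average_seconds are hypothetical, and the fixed budget of 1000000 is an assumption chosen only because it reproduces the two data points given (10 elements with 100000 runs, 100000 elements with 10 runs); the actual benchmark source may scale differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical sliding scale: keeps runs * elements near a fixed budget.
// The budget value is an assumption, not taken from the benchmark source.
std::size_t runs_for(std::size_t element_count)
{
    assert(element_count > 0);
    const std::size_t budget = 1000000;
    const std::size_t runs = budget / element_count;
    return (runs == 0) ? 1 : runs;
}

// Calls test() run_count times, ignores the first ~10% of runs as warm-up,
// and returns the mean duration in seconds of the remaining runs.
template <class Test>
double average_seconds(Test test, std::size_t run_count)
{
    assert(run_count > 0);
    const std::size_t warm_up = run_count / 10;
    double total = 0.0;

    for (std::size_t run = 0; run != run_count; ++run)
    {
        const auto start = std::chrono::steady_clock::now();
        test();
        const auto stop = std::chrono::steady_clock::now();

        if (run >= warm_up) // warm-up runs are executed but not counted
        {
            total += std::chrono::duration<double>(stop - start).count();
        }
    }

    return total / static_cast<double>(run_count - warm_up);
}
```

A test for N elements would then be timed with something like average_seconds(test, runs_for(N)).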

Insertion: is into empty containers for the raw tests, entering single elements at a time. For the 'scenario' tests there is also ongoing insertion at random intervals.
Erasure: initially takes place in an iterative fashion for the raw tests, erasing elements at random as we iterate through the container. The exception to this is the tests involving a remove_if pattern (pointer_deque and indexed_vector) which have a secondary pass when using this pattern.
Iteration: is straightforward iteration from the start to end of any containers. Typically there are more runs for iteration than the other tests due to iteration being a much quicker procedure, so more runs deliver a more stable average.

Raw performance tests

Before we begin measuring colony against containers or container modifications which do not invalidate links on erasure or insertion, we need to identify which containers are good candidates for comparison based on raw performance without regard to linkage invalidation. With that in mind the following tests compare colony against the main standard library containers. Tests are carried out on the following types: (a) an 8-bit type ie. char, (b) a 32-bit type ie. int, (c) a 64-bit type ie. double, (d) a small struct containing two pointers and four scalar types (40 bytes on x64), and (e) a large struct containing 2 pointers, 4 scalar types, a large array of ints and a small array of chars (490 bytes on x64).

The first test measures time to insert N elements into a given container, the second measures the time taken to erase 25% of those same elements from the container, and the third test measures iteration performance after the erasure has taken place. Erasure tests avoid the remove_if pattern for the moment to show standard random-access erasure performance more clearly (this pattern is explored in the second test). Both linear and logarithmic views of each benchmark are provided in order to better show the performance of lower element amounts.

Erasure tests forward-iterate over each container and erase 25% of all elements at random. If (due to the variability of random number generators) 25% of all elements have not been erased by the end of the container iteration, the test will reverse-iterate through the container and randomly erase the remaining necessary number of elements until that 25% has been reached.
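The erasure procedure above could be sketched roughly as follows. erase_quarter is a hypothetical name, and two details are assumptions rather than facts from the benchmark source: the forward pass stops once the 25% target is hit, and the reverse pass is repeated if a single reverse sweep still falls short.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <random>

// Erases 25% of the elements of c at random and returns the number erased.
// Works with containers whose erase() returns the following iterator
// (std::list, std::vector, std::deque).
template <class Container>
std::size_t erase_quarter(Container &c, std::mt19937 &rng)
{
    const std::size_t target = c.size() / 4;
    std::size_t erased = 0;
    std::bernoulli_distribution pick(0.25);

    // Forward pass: each element has a 25% chance of being erased.
    for (auto it = c.begin(); it != c.end() && erased != target;)
    {
        if (pick(rng))
        {
            it = c.erase(it);
            ++erased;
        }
        else
        {
            ++it;
        }
    }

    // Reverse pass(es): make up any shortfall left by the random generator.
    while (erased != target)
    {
        for (auto it = c.end(); it != c.begin() && erased != target;)
        {
            --it;
            if (pick(rng))
            {
                it = c.erase(it); // points at the element after the erased one
                ++erased;
            }
        }
    }

    assert(erased == target);
    return erased;
}
```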

Insertion Performance


test result graph
test result graph
test result graph
test result graph
test result graph

plf::colony and plf::list show strong insertion performance, with plf::colony slightly ahead. std::vector performs well for small types, and std::deque performs well until large structs (libstdc++'s deque implementation has a memory block limit of 512 bytes, making it a glorified linked-list in this case). plf::list outperforms std::list by 333% on average.

Erase Performance


test result graph
test result graph
test result graph
test result graph
test result graph

Without remove_if, std::deque and std::vector perform poorly, while plf::colony and plf::list lead the way. plf::list has the best erasure performance until very large amounts of memory are involved (the cause of this drop-off is unclear), on average outperforming std::list by 81%.

Post-erasure Iteration Performance


test result graph
test result graph
test result graph
test result graph
test result graph

std::vector easily has the best iteration performance, although for all but large structs std::deque has almost equal performance. plf::colony comes in third for larger-than-scalar types, while plf::list comes third for scalar types, 16% faster on average than std::list.

Comparative performance tests

This is an additional test purely for colony. Colony is primarily designed for scenarios where good insertion/erasure/iteration performance is required while guaranteeing stability for outside elements referring to elements within the container, and where ordered insertion is unimportant. The two containers from the raw performance tests which may compare both in performance and usage (after modification) are std::deque and std::vector. std::list does not meet these requirements as it has poor insertion and iteration performance. plf::list does, but its comparative performance is already known, so it will not be included here.

pointer_deque and indexed_vector

Because std::deque does not invalidate pointers to elements upon insertion to the back or front, we can guarantee that pointers won't be invalidated during unordered insertion. This means we can use a modification called a 'pointer-to-deque deque', or pointer_deque. Here we take a deque of elements and construct a secondary deque containing pointers to each element in the first deque. The second deque functions as an erasable iteration field for the first deque ie. when we erase we only erase from the pointer deque, and when we iterate, we iterate over the pointer deque and access only those elements pointed to by the pointer deque. In doing so we reduce erase times for larger-than-scalar types, as it is computationally cheaper to reallocate pointers (upon erasure) than larger structs. By doing this we avoid reallocation during erasure for the element deque, meaning pointers to elements within the element deque stay valid.
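To make the idea concrete, here is a minimal sketch of a pointer_deque. It is not the benchmark's actual implementation; the class layout and the member names (insert, erase, for_each) are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical minimal pointer_deque: elements are only ever pushed to the
// back of 'elements', so their addresses stay valid; erasure and iteration
// go through the secondary deque of pointers.
template <class T>
class pointer_deque
{
    std::deque<T> elements;   // never erased from
    std::deque<T *> pointers; // the erasable iteration field

public:
    T *insert(const T &value)
    {
        elements.push_back(value); // does not invalidate existing element pointers
        pointers.push_back(&elements.back());
        return pointers.back();
    }

    // Erases the element at the given position in iteration order.
    // Only a pointer is removed; the element itself stays in memory.
    void erase(std::size_t position)
    {
        assert(position < pointers.size());
        pointers.erase(pointers.begin() + static_cast<std::ptrdiff_t>(position));
    }

    std::size_t size() const { return pointers.size(); }

    template <class Function>
    void for_each(Function f)
    {
        for (T *p : pointers)
        {
            f(*p);
        }
    }
};
```

Note that the erased element's memory is never reclaimed in this sketch, which is the source of the memory growth seen for pointer_deque in the scenario tests.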

We cannot employ quite the same technique with std::vector because it reallocates during insertion to the back of the vector upon capacity being reached. But since indexes stay valid regardless of whether a vector reallocates, we can employ a similar tactic using indexes instead of pointers, which we'll call an indexed_vector. In this case we use a secondary vector of indexes to iterate over the vector of elements, and only erase from the vector of indexes. This strategy has the advantage of potentially lower memory usage, as the bitdepth of the indexes can be reduced to match the maximum known number of elements, but it will lose a small amount of performance due to the pointer addition necessary to utilise indexes instead of pointers. In addition, outside objects referring to elements within the indexed_vector must use indexes instead of pointers to refer to the elements, and this means the outside object must know both the index and the container it is indexing, whereas a pointer approach can ignore this and simply point to the element in question.
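A comparable sketch for indexed_vector follows. Again this is an illustration under assumptions, not the benchmark's code: the Index template parameter stands in for the reduced index bitdepth mentioned above, and the member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal indexed_vector: outside objects hold an Index, which
// remains valid even when 'elements' reallocates on push_back.
template <class T, class Index = std::uint32_t>
class indexed_vector
{
    std::vector<T> elements;    // never erased from
    std::vector<Index> indexes; // the erasable iteration field

public:
    Index insert(const T &value)
    {
        elements.push_back(value); // may reallocate; indexes are unaffected
        const Index index = static_cast<Index>(elements.size() - 1);
        indexes.push_back(index);
        return index;
    }

    // Erases the element at the given position in iteration order.
    void erase(std::size_t position)
    {
        assert(position < indexes.size());
        indexes.erase(indexes.begin() + static_cast<std::ptrdiff_t>(position));
    }

    // The caller needs both the index and the container to reach an element.
    T &operator[](Index index) { return elements[index]; }

    std::size_t size() const { return indexes.size(); }

    template <class Function>
    void for_each(Function f)
    {
        for (Index i : indexes)
        {
            f(elements[i]);
        }
    }
};
```

With at most 65536 elements, for example, indexed_vector<T, std::uint16_t> would halve the size of the index field compared to the 32-bit default used here.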

We will also compare these two container modifications using a remove_if pattern for erasure vs regular erasure, by adding an additional boolean field to indicate erasure to the original stored struct type, and utilizing two passes - the first to randomly flag elements as being ready for erasure via the boolean field, the second using the remove_if pattern.
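The two-pass approach might look something like the sketch below, shown for the pointer deque case. small_struct, flag_at_random and erase_flagged are hypothetical names, and the struct is cut down to a single value plus the erasure flag for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <random>

// Hypothetical stand-in for the stored struct, with the added boolean field.
struct small_struct
{
    int value;
    bool erased;
};

// Pass 1: randomly flag elements as ready for erasure.
std::size_t flag_at_random(std::deque<small_struct> &elements, double probability, std::mt19937 &rng)
{
    assert(probability >= 0.0 && probability <= 1.0);
    std::bernoulli_distribution pick(probability);
    std::size_t flagged = 0;

    for (small_struct &element : elements)
    {
        if (!element.erased && pick(rng))
        {
            element.erased = true;
            ++flagged;
        }
    }

    return flagged;
}

// Pass 2: the remove_if pattern, applied to the pointer deque only.
// All surviving pointers are shifted down in one sweep, then the tail is cut.
void erase_flagged(std::deque<small_struct *> &pointers)
{
    pointers.erase(
        std::remove_if(pointers.begin(), pointers.end(),
                       [](const small_struct *p) { return p->erased; }),
        pointers.end());
}
```

The benefit over erasing one pointer at a time is that each surviving pointer is moved at most once per pass, rather than once per erasure that precedes it.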

vector_bool and deque_bool

A second modification approach, which we'll call a vector_bool, is a very common approach in a lot of game engines - a bool or similar type is added to the original struct or class, and this field is tested against to see whether or not the object is 'active' (true) - if inactive (false), it is skipped over. We will also compare this approach using a deque.
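In its simplest form the vector_bool approach amounts to the following sketch (entity, deactivate and sum_active are hypothetical names; the real test struct has more fields):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type with the added 'active' field.
struct entity
{
    int value;
    bool active;
};

// 'Erasure' only flips the flag; nothing is moved or deallocated.
void deactivate(std::vector<entity> &entities, std::size_t index)
{
    assert(index < entities.size());
    entities[index].active = false;
}

// Iteration visits every element and skips the inactive ones.
long sum_active(const std::vector<entity> &entities)
{
    long total = 0;

    for (const entity &e : entities)
    {
        if (e.active) // one branch per element, erased or not
        {
            total += e.value;
        }
    }

    return total;
}
```

Because inactive elements are still visited, iteration cost does not fall as erasures increase, and the per-element branch depends on how well the CPU can predict the pattern of active flags.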

packed_deque

packed_deque is an implementation of a 'packed_array' as described in the motivation section earlier, but using deques instead of vectors or arrays. As we've seen in the raw performance benchmarks, (GCC) libstdc++'s deque is almost as fast as vector for iteration, but about twice as fast for back insertion and random location erasure. It also doesn't invalidate pointers upon insertion, which is also a good thing. These things become important when designing a container which is meant to handle large numbers of insertions and random-location erasures. Although in the case of a packed-array, random location erasures don't really happen, the 'erased' elements just get replaced with elements from the back, so erasure speed is not as critical, but insertion speed is critical as it will always consume significantly more CPU time than iteration.

With that in mind my implementation uses two std::deque's internally: one containing structs which package together the container's element type and a pointer, and one containing pointers to each of the 'package' structs in the first deque. The latter is what is used by the container's 'handle' class to enable external objects to refer to container elements. The pointer in the package itself in turn points to the package's corresponding 'handle' pointer in the second deque. This enables the container to update the handle pointer when and if a package is moved from the back upon an erasure.

Anyone familiar with packed array-style implementations can skip this paragraph. For anyone who isn't, this is how it works when an element is erased from packed_deque, unless the element in question is already at the back of the deque. It:

  1. Uses the pointer within the package to find the 'handle' pointer which pointed to the erased element, and adds it to a free list.
  2. Moves the package at the back of the container to the location of the package containing the element being erased.
  3. Uses the pointer in the package which has just been moved to update the corresponding handle pointer, to correctly point to the package's new location.
  4. Pops the back package off the first deque (should be safe to destruct after the move - if it's not, the element's implementation is broken).

In this way, the data in the first deque stays contiguous and is hence fast to iterate over. And any handles referring to the back element which got moved stay valid after the erasure.
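The four erasure steps can be sketched as below. This is a simplified illustration rather than the benchmarked implementation: the type names (package, slot, handle), the use of a std::vector as the free list, and the member functions are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

template <class T>
class packed_deque
{
    struct package;
    typedef package *slot; // a 'handle pointer' stored in the second deque

    struct package
    {
        T element;
        slot *handle_slot; // points back at this package's slot
    };

    std::deque<package> packages;  // first deque: kept contiguous
    std::deque<slot> slots;        // second deque: what handles point into
    std::vector<slot *> free_slots; // free list of reusable slots

public:
    typedef slot *handle; // external objects hold a pointer to a pointer

    handle insert(const T &value)
    {
        slot *s;
        if (!free_slots.empty())
        {
            s = free_slots.back();
            free_slots.pop_back();
        }
        else
        {
            slots.push_back(nullptr);
            s = &slots.back();
        }
        packages.push_back(package{value, s});
        *s = &packages.back();
        return s;
    }

    void erase(handle h)
    {
        assert(h != nullptr && *h != nullptr);
        package *target = *h;

        // Step 1: find the handle pointer via the package, add it to the free list.
        slot *erased_slot = target->handle_slot;
        assert(erased_slot == h);
        *erased_slot = nullptr;
        free_slots.push_back(erased_slot);

        package *last = &packages.back();
        if (target != last)
        {
            *target = std::move(*last);      // Step 2: move the back package into the gap.
            *(target->handle_slot) = target; // Step 3: repoint the moved package's handle.
        }

        packages.pop_back(); // Step 4: pop the (moved-from or erased) back package.
    }

    T &get(handle h) { return (*h)->element; }

    std::size_t size() const { return packages.size(); }

    template <class Function>
    void for_each(Function f)
    {
        for (package &p : packages)
        {
            f(p.element);
        }
    }
};
```

Note that every access through a handle costs two dereferences (handle to slot, slot to package), which is what the referencing scenario test later on is designed to measure.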

Tests

Since neither indexed_vector nor pointer_deque will have erasure time benefits for small scalar types, and because game development is predominantly concerned with storage of larger-than-scalar types, we will only test using small structs from this point onwards. In addition, we will test 4 levels of erasure and subsequent iteration performance: 0% of all elements, 25% of all elements, 50% of all elements, and 75% of all elements.

Insertion


test result graph

plf::colony outperforms the others above 100 elements.

Erasure


test result graph
test result graph
test result graph

For more than 100 elements, and for erasure percentages less than 75%, plf::colony comes in third, behind vector_bool and deque_bool. Above 50% erasures its erasure performance is eclipsed by indexed_vector with a remove_if idiom.

Post-erasure Iteration


test result graph
test result graph
test result graph
test result graph

Here we see that plf::colony's iteration performance is weaker than the others when no erasures have occurred, but progressively becomes stronger than vector_bool and deque_bool as erasures increase. The drop in vector_bool and deque_bool's iteration performance around the 2500 element mark is due to the limits of the Haswell processor's branch decision history table. When compared to the results on a Core2 processor (which has a very small branch decision history table), it becomes clear that vector_bool and deque_bool's iterative performance for low numbers of elements will drop as soon as there is competition for the CPU from multiple processes or multiple containers. As such their better performance at lower numbers of elements can be discounted.

'Real-world' scenario testing - unordered low modification

While testing iteration, erasure and insertion separately can be useful, it doesn't tell us how containers perform under real-world conditions, as in most use-cases we will not simply be inserting a bunch of elements, erasing some of them, then iterating once. To get more valid results, we need to think about the kinds of use-cases we may have for different types of data, in this case, video-game data.

In this test we simulate the use-case of a container for general video game objects, actors/entities, enemies etc. Initially we insert a number of small structs into the container, then simulate in-game 'frames'. To do so we iterate over the container elements every frame, and erase (at random locations) or insert 1% of the original number of elements for every minute of gametime ie. 3600 frames, assuming 60 frames-per-second. We measure the total time taken to simulate this scenario for 108000 frames (half an hour of simulated game-time, assuming 60 frames per second), as well as the amount of memory used by the container at the end of the test. We then re-test this scenario with 5% of all elements being inserted/erased, then 10%.
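The frame loop for this scenario could be sketched roughly as follows. simulate_frames and low_modification_run are hypothetical names, elements are reduced to plain ints, timing and memory measurement are omitted, and the 50/50 choice between inserting and erasing at each modification point is an assumption, since the text only says that one or the other happens.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Iterates over c once per frame; every 'interval' frames it either inserts
// or erases (at random locations) 'percent' percent of the original size.
// Returns the running total read during iteration.
template <class Container>
long long simulate_frames(Container &c, unsigned frames, unsigned interval,
                          unsigned percent, std::mt19937 &rng)
{
    assert(interval > 0);
    const std::size_t batch = c.size() * percent / 100;
    std::bernoulli_distribution insert_this_time(0.5); // assumed 50/50 split
    long long total = 0;

    for (unsigned frame = 1; frame <= frames; ++frame)
    {
        for (const auto &element : c) // the per-frame iteration pass
        {
            total += element;
        }

        if (frame % interval != 0)
        {
            continue;
        }

        if (insert_this_time(rng) || c.size() < batch)
        {
            for (std::size_t i = 0; i != batch; ++i)
            {
                c.insert(c.end(), typename Container::value_type());
            }
        }
        else
        {
            for (std::size_t i = 0; i != batch; ++i)
            {
                std::uniform_int_distribution<std::size_t> position(0, c.size() - 1);
                auto it = c.begin();
                std::advance(it, static_cast<std::ptrdiff_t>(position(rng)));
                c.erase(it);
            }
        }
    }

    return total;
}

// The first configuration from the text: 1% per 3600 frames, 108000 frames.
long long low_modification_run(std::vector<int> &objects, std::mt19937 &rng)
{
    return simulate_frames(objects, 108000, 3600, 1, rng);
}
```

The 5% and 10% variants only change the percent argument, and passing an interval of 1 gives the per-frame modification used by the high-modification scenario.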

In these tests we will be measuring up to 100000 elements, as higher numbers use too much memory with the indexed_vector and pointer_deque containers.


Performance results

test result graph
test result graph
test result graph

Overall packed_deque takes the lead, with plf::colony close behind. plf::list performs poorly.

Memory results

test result graph
test result graph
test result graph

Understandably the pointer_deque and indexed_vector implementations consume more memory, as they cannot free it on the fly.

'Real-world' scenario testing - unordered high modification

Same as the previous test but here we erase/insert 1% of all elements per-frame instead of per 3600 frames, then once again increase the percentage to 5% then 10% per-frame. This simulates the use-case of continuously-changing data, for example video game bullets, decals, quadtree/octree nodes, cellular/atomic simulation or weather simulation.

Performance results


test result graph
test result graph
test result graph

Once again packed_deque and plf::colony are in front; however, above 3% erasures per frame and 1000 elements, plf::colony takes the lead outright. plf::list has good performance until (presumably) the limit of the Haswell branch decision history table is reached (around 2500 decisions), at which point its jumps can no longer be cached and overall performance declines. However it still outperforms pointer_deque and indexed_vector at high modification levels.

Memory results

test result graph
test result graph
test result graph

Once again we see the high memory usage of pointer_deque and indexed_vector corresponding to the level of modification.

'Real-world' scenario testing - ordered low modification

These tests are primarily for plf::list. Here we repeat the low modification test previously displayed, but with ordered (non-back) insertion as well as non-back erasures. Since linked lists excel at non-back insertion we can expect to see different performance results to the unordered tests above. We will ignore memory results as these have already been explored above. The only new container introduced here is "pointer_colony" - a modification of plf::colony to allow for ordered usage. What it does is very similar to the pointer_deque - there is a deque of pointers to elements within the colony, whose locations stay stable because a colony never invalidates pointers to non-erased elements. However unlike the pointer_deque, the pointer_colony will free up and re-use erased element memory on the fly, meaning you will not have the memory bloat associated with the indexed_vector and pointer_deque.

Since we already established in the unordered tests above that the remove_if idiom performs worse in low modification scenarios, we include only non-remove_if variants in this test. It should be noted that while these results are for small structs, the same tests involving ints performed almost exactly the same in terms of performance differences between containers and gave the same overall results.

Performance results


test result graph
test result graph
test result graph

plf::list out-performs std::list by roughly 25% on average, but is outperformed by everything else. std::deque appears to have the best performance up until around 50000 elements, when pointer_colony starts to overtake.

'Real-world' scenario testing - ordered high modification

Here we repeat the high modification test, but with ordered insertion. Since we have already established in the unordered tests that remove_if variants work better in the high modification scenario, we include only those results here.

Performance results


test result graph
test result graph
test result graph

plf::list outperforms all contenders at high levels of modification. Again, the difference in performance between std::list and plf::list is 25% on average.

'Real-world' scenario-testing: referencing with interlinked containers

In order to completely test plf::colony against a packed-array-style implementation, it is necessary to measure performance while exploiting both containers' linking mechanisms - for colony this is pointers or iterators, for a packed array this is a handle, or more specifically a pointer to a pointer (or the equivalent index-based solution). Because that additional dereferencing is a performance loss and potential cache miss, any test which involves a large amount of inter-linking between elements in multiple containers should lose some small amount of performance when using a packed_deque instead of a colony. Since games typically have high levels of interlinking between elements in multiple containers (as described in the motivation section), this is relevant to performance concerns.
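The difference between the two linking mechanisms comes down to one extra dereference, as in this small sketch (sprite, entity_with_pointer and entity_with_handle are hypothetical types, not those used in the test):

```cpp
#include <cassert>

struct sprite
{
    int frame;
};

// colony-style link: a plain pointer to the element.
struct entity_with_pointer
{
    sprite *component;
};

// packed-array-style link: a handle, ie. a pointer to a pointer which the
// container updates whenever the element is moved.
struct entity_with_handle
{
    sprite **component;
};

int current_frame(const entity_with_pointer &e)
{
    assert(e.component != nullptr);
    return e.component->frame; // one dereference
}

int current_frame(const entity_with_handle &e)
{
    assert(e.component != nullptr && *e.component != nullptr);
    return (*e.component)->frame; // two dereferences, two potential cache misses
}
```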

Consequently, this test utilizes four instances of the same container type, each containing different element types:

  1. A 'collisions' container (which could represent collision rectangles within a quadtree/octree/grid/etc)
  2. An 'entities' container (which could represent general game objects) and
  3. Two subcomponent containers (these could be sprites, sounds, or anything else).
test class diagram

This is actually a very low number of inter-related containers for a game, but we're reducing the number of components in this case just to simplify the test. Elements in the 'entities' container link to elements in all three of the other containers. In turn, the elements in the collisions container link back to the corresponding elements in the entities container. The subcomponent containers do not link to anything.

In the test, elements are first added to all four containers and interlinked (as this is a simplified test, there's a one-to-one ratio of entity elements to 'collision' and subcomponent elements). The core test process after this point is similar to the modification scenarios tested earlier: we iterate over the entity container once every frame, adding a number from both the entities and subcomponents to a total. We also erase a percentage of entities per-frame (and their corresponding subcomponents and collision blocks) - similar to the earlier tests.

However during each frame we also iterate over the 'collisions' container and erase a percentage of these elements (and their corresponding entities and subcomponents) at random as well. This could be seen as simulating entities being removed from the game based on collisions occurring, but is mainly to test the performance effects of accessing the subcomponents via a chain of handles versus a chain of pointers. Then, again during each frame, we re-add a certain number of entities, collision blocks and subcomponents back into the containers based on the supplied percentage value. Since both colony and packed_deque essentially re-use erased-element memory locations, this tests the efficacy of each container's mechanism for doing so (packed_deque's move-and-pop + handle free-list, versus colony's stack + skipfield).

Since neither container will grow substantially in memory usage over time as a result of this process, a longer test length is not necessary like it was for the earlier modification scenario-testing with indexed_vector and pointer_deque. Testing on both plf::colony and plf::packed_deque showed that both of their test results increased linearly according to the number of simulated frames in the test (indexed_vector and pointer_deque have a more exponential growth). Similarly to the modification tests above, we will start with 1% of all elements being erased/inserted per 3600 frames, then 5% and 10%, then move up to 1% of all elements being erased/inserted per-frame, then 5% per-frame, then 10% per-frame.

Performance results


test result graph
test result graph
test result graph
test result graph
test result graph
test result graph

For all results plf::colony outperforms packed_deque.

Sort performance

std::vector and std::deque are sorted using std::sort, while plf::colony, plf::list and std::list are sorted using their internal sorting functions.
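For the standard containers, the dispatch described above looks roughly like the sketch below (sort_container and sort_and_check are hypothetical helper names). plf::colony and plf::list are left out so that the example only depends on the standard library; per the text, they are sorted via their own member sort functions in the same way std::list is here.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <list>
#include <vector>

// Random-access containers: std::sort.
template <class T>
void sort_container(std::vector<T> &c)
{
    std::sort(c.begin(), c.end());
}

template <class T>
void sort_container(std::deque<T> &c)
{
    std::sort(c.begin(), c.end());
}

// std::list has no random-access iterators, so std::sort cannot be used;
// its member sort relinks nodes instead of moving elements.
template <class T>
void sort_container(std::list<T> &c)
{
    c.sort();
}

// Sorts with the appropriate method and verifies the result.
template <class Container>
void sort_and_check(Container &c)
{
    sort_container(c);
    assert(std::is_sorted(c.begin(), c.end()));
}
```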


test result graph
test result graph
test result graph
test result graph
test result graph

For every type except large structs std::vector + std::sort performs best, then for large structs plf::list performs best, with std::list and plf::colony coming 2nd and 3rd. plf::list is 72% faster than std::list when averaging all tests.

Overall Performance Conclusions

While individual context graphs (insertion, erasure, iteration) can show interesting results, when real-world situations are taken into account, plf::colony outperforms all contenders in unordered situations where elements are being inserted and removed on the fly and where maintaining stable links (eg. iterators/pointers/indexes/etc) to elements within a container is critical.

plf::list shows better performance than std::list across all tests, and finds its place in terms of performance in high-modification real-world tests where both insertion and erasure are ordered. In this scenario it outperforms all other containers, though at low levels of modification other containers outperform it. plf::list's sort is 72% faster than std::list's when averaged across all sort tests.

plf:: library and this page Copyright (c) 2017, Matthew Bentley