PLF C++ Library - Blog

26-10-2020 - Colony 6 is released

Colony v6.00 is designed to be more in line with the current proposed draft for the standard library. It also contains numerous improvements such as:

For reference, one should not expect perfect performance results under GCC9/10 presently - there's a performance bug which's been discovered here which gets in the way, at least on a small subset of tests. Hopefully it will be fixed. Thankfully the bug does not exist in GCC8 nor any other compiler and overall, performance is better.

For anyone who's been following colony development you already know that some of the constructors have been modified to take a minimal struct called plf::limits - this contains a min and max size_t member, and a constructor. It allows block capacity limits to be supplied by the user to constructors without creating overload conflicts with the fill constructor. It is also used now by the reshape() and block_limits() functions - which are the new names for the set_block_capacity_limits() and get_block_capacity_limits() functions. In consultation with SG14, the feeling was that 'reshape' better described the function's behaviour, as it will adjust existing memory block sizes where necessary and reallocate elements.

In addition, there were several helper functions which were easily replicated by combinations of other functions - these have been removed:

The main point of doing this was to present a minimal interface for the user, and only provide supplementary functions where critical - for example, get_iterator_from_pointer() still exists because it is not replicable with other functions.

I can expect this implementation to change subtley as the standardisation process continues - if it continues. For anyone wanting a more concrete solution which will not change (but without the above benefits as have been explained), the final version of colony v5 + it's documentation will be maintained here. Any serious bugs that are discovered, will be fixed in this version as well going forward.

Stay safe, keep healthy, and keep 'colonising'-

Matt out.

26-06-2020 - Indiesort

plf::indiesort grew out of the plf::list project, which used a similar technique for it's internal sort function. Unlike indiesort, plf::list's sort is able to just change the previous/next pointers rather than reallocating elements, so it's a lot faster. But indiesort has it's place outside of plf::list. For one, it's much faster for use with std::list than std::list's own internal sort function, asides from when dealing with large types. And for other containers, it does a great job of cutting down sort time (compared to std::sort) when dealing with large or non-trivally-copyable/movable types - up to 130% faster in a vector or array.

How does it achieve this? The basic approach is, create an array with pointers to each element, then sort the pointers instead of the elements, via the value of the elements they point to. The difficult part was then finding a way of redistributing the elements again without just brute force copying them all into a temporary buffer then back to the range/container (which is what I initially did with colony's sort()). Eventually I worked out the most efficient solution, which only requires one element to be copied into a memory buffer as opposed to all of them.

Hence, most elements only get copied once using this technique, some twice, and some not at all (if they haven't moved during sorting). So for any situation where copying an element is expensive, it shines. And of course, the sort techniques which rely on random-access are typically much, much faster than those which don't (another reason why std::list::sort() is slow).

11-12-2019 - Naming things is hard

There's a point in time in every man's life where he reflects upon the mistakes he made. Today is not that day. Today is the day I tell you about how I'm fudging my mistake. Without further ado, here is the problem:

The original skipfield pattern used by colony was the 'advanced jump-counting skipfield pattern'. How did it get that name? Well, for starters, a jump-counting skipfield pattern does exactly that - it counts the jumps between one non-skipped item and the next, in an efficient manner. The difference between what I previously called the 'advanced jump-counting skipfield' and the 'simple jump-counting skipfield' was that one was, to my way of thinking at the time, simpler. The 'advanced' pattern did some more complicated stuff. It was also (to my way of thinking at the time, again) more useful essentially - more versatile.

However several years on I started to think about using the 'simple' version in more advanced scenarios - thinking about this only became possible because of some of the different ways I was organising colony on a structural level - but I started to notice that, actually, when viewed in a slightly different manner, it wasn't so much 'simpler' as 'different'. Doing the same things that I did with the 'advanced' version was possible, but required additional data.

To be specific I'm talking about changing a skipped node to an unskipped node. Originally I thought this could not be done with the simple version with any efficiency - while the advanced version stored information in it's nodes about their distance from the beginning of a run of skipped nodes (a 'skipblock'), the simple version didn't. Then I realised it could be done at least for the start and end nodes of skipblock, which do store information. Then I realised (much later) that you could, if you had prior knowledge of the start and end locations of a given skipblock, do anything you wanted to any of the nodes within that skipblock.

As my understand of this pattern grew, and I replaced the skipfield in colony with the 'simple' version, I faced a problem. For starters, it was no longer simpler than the 'advanced' version - it was merely different. But I'd already named, and published, a paper on the advanced version which explicitly stated aspects of the simple version, which were no longer true from my understanding. For this reason, because I needed a new name for this more complex version of the 'simple' pattern, I just called it the Bentley pattern as a working title.

Now, if you know me, or you've been to New Zealand you're realise that this kind of self-aggrandisement (naming an algorithm after oneself) doesn't necessarily sit well with me, or my culture - even though, in computer circles that's what we do all the time - that's where the term 'boolean' comes from (George Boole). But I couldn't think of anything else. However now I feel like I've come up with a solution, although it invalidates the original paper, given that I'm changing both the names for the advanced jump-counting skipfield and the 'simple' jump-counting skipfield.

The fudge is changing 'advanced' to 'high complexity' and 'simple' to 'low complexity', although in this case complexity explicitly refers to time-complexity, not overall complexity of the algorithms. Specifically the low-complexity jump-counting pattern allows for mostly O(1) completion of all operations, but requires additional operational data, which the high-complexity algorithm does not. Many of the procedures of the high-complexity jump-counting skipfield have undefined time complexity, as the amount of work done is determined by the length of any skipblocks involved.

In actual operational practice the performance difference between using the low-complexity and high-complexity algorithms in colony was good but not maximal, as time complexity is not a very good predictor of performance, but any performance increase is worth having. At any rate, I hope that clears things up. For future purposes the current algorithm driving the skipfields in colony is the low-complexity jump-counting pattern, and the previous one was the high-complexity jump-counting pattern. The papers written have been updated to reflect this, and hopefully someday I will find a publisher for the second. They are both available on the main page.

On the plus side I've now been able to re-integrate some of the data I had to cut out of the high-complexity jump-counting paper, in order to make it fit within the original journal's page limits. So now there's info on intersection, parallel processing... etcetera etcetera. Don't get me wrong, they were great - they just had strict limits on how much they could publish. With other journals I've dealt with, the general academic 'stance' is chronically bad, prone to valuing style over substance and valuing the worst criteria as the highest criteria. A lot of grandstanding and adolescent behaviour with a desire to look 'authoritative'. And from what I hear, a lot of academia is prone to this.

24-10-2019 - Yet another guest post :)

Once again Arne has allowed me to spool off my brain noise into bits and bytes, representing characters and other assorted connotations of meaning. Much thanks to him and the post is here, this time the topic is Time complexity and why it's relevance is entirely dependant on hardware, essentially.

01-8-2019 - SIMD and colony

I made a modification to plf::colony which enables it to be used with SIMD gather techniques, using the skipfields as masks for processing. I'd like some feedback on how useful this is, and whether it's worth keeping or throwing away. Feel free to email me (see main page).

The way it works in colony is that the skipfield is 0 when an item is unskipped, non-zero when it is skipped. Providing a developer direct access to the skipfield allows them to use SIMD to create a gather mask by transforming the skipfield in parallel to whatever the particular architecture requires eg. for AVX2 and above, this's a vector of integers where each element whose highest bit is one, will be read into SIMD registers.

The new function (with the underwhelming name of "get_raw_memory_block_pointers()"), returns a pointer to a dynamically-allocated struct of the following type:

struct raw_memory_block_pointers : private uchar_allocator_type { // array of pointers to element memory blocks: aligned_pointer_type *element_memory_block_pointers; // array of pointers to skipfield memory blocks: skipfield_pointer_type *skipfield_memory_block_pointers; // array of the number of elements in each memory block: skipfield_type *block_sizes; // size of each array: size_type number_of_blocks; };

There is a destructor so all the sub-struct deallocations are taken care of upon 'delete' of the returned struct.

By using these data sets a programmer can create an appropriate mask for each memory block by processing the skipfield into a separate array/vector, then using that mask perform a Gather operation on the element memory blocks to fetch active elements into SIMD registers. And if desired, and if the processor supports it, a subsequent Scatter back into the colony element memory blocks.

I am wondering how useful this is in comparison to manually creating a mask by iterating over the colony conventionally. The latter approach also ignores memory block boundaries, so might be more useful in that respect, even if you can't use SIMD to create the mask in that case.

There's a demo of how to use the new function in the test suite, which I've just updated in the repo. Let me know how you get on.

05-6-2019 - Win10 Simplifier

This is a tool I've been working on for a while, and deem ready for public service. It automates a lot of changes I make to typical Win10 installations, as well as automating the running of a load of tools such as ShutUp10 and Win10 Debloater. Hope you find it useful.

16-4-2019 - Constexpr all the things, including That Thing

I never really 'got' constexpr, being a construct that largely does what an optimizing compiler does anyway, but then I read some article by some guy, and it made a bit more sense. Still, I've never found a use for it, much less one that resulted in a performance difference. Then I ran across one scenario where a user had a moveable-but-non-copyable type, and that caused issues with some of the functions which consolidated a colony. Reason was that they were using a function which copy-constructed the existing colony into a new colony, then moved the new colony to the old. This is a shrink-to-fit as far as a colony is concerned.

The copy-construction was obviously a problem for a non-copyable type, so I told the function to move the elements instead if they were movable. However, the branch where elements were copy-constructed was still compiled, resulting in type violation errors, despite the decision being for-all-intents-and-purposes compile-time. This is where constexpr came in handy. With "if constexpr" you can guarantee that the path not taken will not actually be compiled, and by using type traits in conjunction with it you can avoid type errors for branches with operations which would be illegal on that type.

That's, of course, if your compiler supports "if constexpr" properly. Up until some time last year, MSVC had a ton of bugs around constexpr, and all compilers currently seem to require a flag specifying that C++17 is required before they'll enable "if constexpr". With any luck, before the year is out, that might change (being 2 years on since 2017). Regardless, all three constainers - plf::stack, plf::list and plf::colony - now use constexpr at all type-traits-determined if statements (plus a couple of other places) when C++17 is available/specified. In some scenarios I found this also reduces code size and increases performance, depending on the functions the program uses. Yell out if you get any bugs.

30-1-2019 - Memory retention

I've been toying with the idea of retaining some or all memory blocks within colony when they become empty, then simply reusing them when the rest of the blocks become full. The problem with this is that it adds a bit of code, fills up your memory faster, and doesn't have any performance advantages. I've benchmarked this on core2 and haswell, and basically the best, most optimized solution I could come up with (which was - only retain blocks of maximum size, or the last block in the chain), just scraped by as being equal to the version of colony without block retention.

Unfortunately, that performance gain you get from not deallocating and then reallocating later, is pretty small when compared to the small amount of extra code and handling necessary to store the list of empty groups. So what's my recommendation? Well, use an allocator. Colony, like most containers, takes an allocator as a template argument, so you can use a pool allocator or similar to reuse old memory space rather than freeing to the system all the time. Of course, the allocator has it's own (small) amount of overhead, but it's better than bogging colony down with the code that 90% of people won't use.

For anyone who's interested, the benchmark results are here. Peace out.

16-1-2019 - Full benchmarks for v5 available + plf::timsort removed + containers updated

Here. The biggest noticable difference, as the previous benchmarks were done with colony v4.00, is erasure. Big, big improvements there.

Also, plf::timsort has been removed as fairly major changes are coming through for the original gfx::timsort and the changes I made to the code are not substantial enough to justify maintaining a fork. Instead the original non-forked project, gfx::timsort is now supported internally in plf::colony and plf::list, and will be used instead of std::sort whenever the header for that code is included before the colony/list header in any project. However, as always, in most cases you should stick to using std::sort, except in the areas where timsort excels (sequences which are already partially sorted, or partially/fully reversed).

In addition, all containers have had their compiler/library detection code updated to better reflect the status of libc++ (should not disable type trait support when using clang + libc++ now). Reverse() within plf::list has been simplified/optimized, with the following results on haswell:
Where erasures have occured prior to reversal, up to 38% performance increase for lower numbers of elements and 7% performance increase on average. Where erasures have not occured prior to reversal, variable results but no change to performance on average. For between 10 and 120 elements, average roughly 8% performance increase, between 120 and 1000 elements average 3% performance loss, for between 1000 and 100000 elements on average there is no change to performance.

Interestingly for me, std::list's reverse() performance got one hell of a lot better between whatever version of libstdc++ GCC 7.1 and 8.2 use respectively. At 7.1, plf::list was ~492% faster than std::list's reverse. Now, even though plf::list's reverse() has gotten faster, for gcc 8.1 it's only 70% faster on average than std::list. I'd have to go back and re-run the tests to make sure this wasn't some benchmarking mistake, but eh - better things to do...

10-1-2019 - A summary of performance differences between colony v4.5 and v5.01

Tests in question were run on a haswell processor, under GCC 8.1.

Referencer tests (multiple collections of interlinked small structs): Widely variable performance differences depending on number of elements, but on average 2% performance gain for higher modification rates and 0.8% performance loss for lower modification rates.
General use tests (insertion, erasure and iteration on the fly measured over time): Up to 6% performance gain for high modification rates and numbers of elements under 19000. Average gain of 0.5%.

Raw tests:
Large struct: no change to insertion or iteration, 3% average performance improvement for erasure with up to 9% improvement for very large numbers of elements.
Small struct: no change to insertion or iteration, for erasure there was up to 3% performance decrease for numbers of elements under 30, up to 5% performance improvement for numbers of elements over 200000, on average no significant difference.
Int: no change to iteration, insertion: up to 9% worse performance for under 150 elements, up to 25% better performance for more than 130000 elements, erasure: 2% worse performance on average.
Double: no change to iteration, widely variable results for insertion, on average 2% worse, for erasure more consistently 1% better on average.
Char: 45% faster iteration on average (not sure how that works but I imagine bumping the minimum alignment width to 2 * sizeof(short) ie. 32-bits has something to do with it), widely variable results for insertion but 1% slower on average, 2% faster erasure on average.

The datasheets for the results above are here.

9-1-2019 - Differences between performance of colony using char vs short for skipfield

Basically colony is usually most effective with the default unsigned short skipfield type - I've come across no benchmark scenarios yet where unsigned int gives a performance advantage, even for vast numbers of elements. There's a number of reasons for that - cache space saving, the removal of memory blocks when they become empty and the skipfield-type-imposed limitation on the size of those memory blocks and subsequent statistical likelihood of them being empty. Basically, the skipfield patterns used for colony require that the number of elements per block cannot be larger than the maximum size representable by the skipfield type ie. 65535 for unsigned short, 255 for unsigned char. But, sometimes you only want a small number of elements - under 1000 to be specific. Where this is the case, you may want to swap to using unsigned char instead of unsigned short for your colony skipfield type.

Skipfield type is the second parameter in the colony template, so plf::colony<small_struct, unsigned char> temp; will give you an unsigned char skipfield type for your container instance. How does this work out performance-wise? Well, it varies from CPU to CPU. For core2 you can sometimes get a 2%-20% performance advantage over unsigned short, depending on the number of elements you have and ratio of insertion/erasure-to-iteration for your container usage. Very high levels of modification don't benefit significantly from an unsigned char skipfield regardless of number of elements. On ivy bridge, there is seldom a performance advantage for using unsigned char. Mostly it is a performance detriment.

Here's the results for Haswell (GCC 8.1) when using unsigned char skipfields instead of unsigned short skipfields:

Referencer tests (multiple collections of interlinked small structs): Up to 500% (!) loss for > 65000 elements, high modification rates, on average 26% loss. Up to 5% gain for numbers of elements under 1000 and low modification rates.
General use tests (insertion, erasure and iteration on the fly measured over time): Up to 4% loss for large numbers of elements. Average loss of 2%.

Raw tests (large struct and int not measured):
Small struct: up to 10% loss for iteration, average 6%. Widely variable results for insertion, on average 4% gain. Erasure 1% loss on average.
Double: up to 14% loss for iteration, average 3%. Widely variable results for insertion, on average 1% loss. Erasure 3% loss on average.
Char: up to 10% gain for iteration, average 6%. Widely variable results for insertion, on average 3% loss. No change for erasure.

Memory usage is a different story again and is of course consistent across CPUs. Across the board, in terms of the benchmarks I was running (with small structs around 48 bytes) the amount of memory saving going to an unsigned char skipfield was about 2% for under 1000 elements. Of course this will be more if you're storing a scalar type of some sort, and hence the skipfield takes up a larger part of the overall memory cost. But once you get up to larger numbers of elements, you can sometimes end up saving 50% memory overall - this is not due to the skipfield type per se, merely due to the effect it has on limiting the size of the element memory blocks. You can also accomplish this with the change_block_sizes() functions. Since with an unsigned char skipfield type you can't have a memory block larger than 255 elements, this means that your unused capacity in a new memory block is never going to be more than 254 elements. However, with an unsigned short skipfield type the unused space in a new memory block could be up to 65535 elements wide. This can be a significant amount of memory if your element type is large.

It should be noted that the largest performance decrease above 1000 elements when using a skipfield type of unsigned char instead of unsigned short, was 2% on core2 and 7% on ivy bridge. And the bulk of the performance increases for using an unsigned char type occured on core2 and under 256 elements. Hence my recommendation is: if you want performance, stick with the default skipfield type for the most part. But depending on your CPU and context, for under 1000 elements you may or may not get a significant performance increase using an unsigned char skipfield type. And if your focus is on saving memory, change to unsigned char and/or change your minimum and maximum memory block sizes.

To throw an additional spanner into the works, I decided to customise a version of colony for small sizes and unsigned char skipfields, so that the size_type and difference_type typedefs used throughout the template did not allow for more than 32767 elements in the colony total. Basically this is to see whether a specialized 'small colony' container would be useful as opposed to having one container with a template parameter for the skipfield. I found that this was not worthwhile - the performance saving was generally 0 while the memory saving was between 4% and 2% when the number of elements was under 40, and less than 1% if the number of elements was over 40 (some variability but I put this down to undiagnosed bugs arising from not static_cast'ing the new size_types in all circumstances - and I don't have time or interest in pinning it down). Adding that to the additional overhead required to keep two versions of colony maintained, it turned out to be a worthless gesture, but one worth trying at least.

The datasheets for the results above are here.

13-10-2018 - Making in a film in C++ then upscaling it with weird nerd sh*t

~11 years ago a made a film called o2 in C++ - rather, I generated the visuals in C++ by using SDL and just breaking things. Mostly I used sin & cos & rgb values (which is immediately apparent when you watch it). Sometimes I ended up using the values left over in memory from previous programs to accidentally make random stuff (like in the opening sequence) - I'm not actually sure if you can still do this on modern OS's (some of my experiences suggest not).

Anyway, this footage was cut together and I used it to make a 15-minute music video comprising five of the songs from this electronic album. Unfortunately the video seems to be a litmus test for how awful your video codec is. Divx (remember that?) couldn't compress it at all, Xvid did the best job in 2007, but it took several years before h264 became widespread enough (and the encoders for it good enough) so I could actually make a decent non-lossless compress of the film.

Unfortunately youtube dials down the bitrate hard enough that even with a great encode, the video still turned it into an unviewable sea of shifting mpeg blocks. Even when upscaled to 720p with lanczos, the bitrate was not sufficient to avoid the clustertruck of minecraftification. But lately I've been wondering if it was ever going to be possible to make this viewable online. Luckily, I discovered Waifu2x. Waifu2x is an attempt to use a neural net to upscale anime videos & pictures using sample sets to tell the upscaler what the result should look like at a higher resolution. It is extremely precise and crazy. Plus, Super-Nerdcore.

So, I thought I'd give that a shot - it needs an nvidia GPU and CUDA and A BUNCH OF CUDA STUFF but it all works well. The process was surprisingly easy, once I found the right tools. Anyway, I upscaled this film from it's original SD resolution to 4k - and it worked. If anything, it made the upscaled film look better. So, I finally uploaded it to youtube, it converted to 1080p (the film's still in 4:3 aspect ratio, and youtube don't like that for 4k), and it's still not great - though sufficiently more viewable due to the higher bitrate afforded by 1080p, there's still quite a lot of minecraftification happening. So instead, here's a 3GB download of the film at 4k: enjoy.

For reference, the lossless-compressed (UTvideo) version of the film is 106GB.

05-10-2018 - Colony 5 - the new mechanisms

As mentioned on the colony page and in countless as-yet unaccepted standards proposals, any colony implementation is structured around 3 aspects:

In colony v4.5 the third of these was changed from a stack of memory locations to a series of per-memory-block free lists. In colony v5.0 both the second and third change; the third changes from per-memory-block singly-linked free lists of erased element locations, to per-memory-block doubly-linked free lists of skipblocks. A skipblock, in colony terminology, refers to any sequence of contiguous erased elements - including sequences of only one erased element. By recording skipblock locations instead of singular erased element locations, we improve performance by decreasing the number of skipfield nodes which need to be altered upon erasure or insertion. But, this also enables a further improvement.

By only storing skipblocks we are able to change the second aspect above (the skipfield) to the Bentley pattern, instead of using the Advanced Jump-counting skipfield pattern. Without going into it in great detail, the Bentley pattern is a refinement of the simple variant of the jump-counting skipfield pattern. It does not require the middle nodes of a skipblock to be updated upon changes to the skipblock - however, only the beginning and end nodes of the skipblock may be changed. But since we're only keeping records of skipblocks instead of individual erased nodes, this means we can choose which erased element within the skipblock to reuse, and only change a beginning or end node.

In addition, reusing the beginning or end node of a skipblock (current implementation reuses the beginning because of a benchmarked performance advantage in terms of multiple insertions and thus reusing multiple sequential nodes from left to right instead of from right to left) is advantageous for iteration performance; reusing a middle node means splitting a skipblock in two, increasing the overall number of skips during iteration.

Without writing a paper here (I am writing a paper - just not right here), the simple upshot of this is that all skipfield updates for single insertions and erasures are now O(1) amortized, and skipfield updates for multple insertions/erasures are O(1) per-memory-block affected - as opposed to previous versions where all these were undefined in terms of time complexity. Insertion, erasure and iteration become faster in the context of multiple erasures and insertions over time; as the number of skipblocks - and the subsequent number of jumps required during iteration - is reduced substantially.

A lingering question might be, why the change from a singly-linked per-memory-block free list to doubly-linked? Well, turns out if you're storing/reusing singular erased element locations, singly-linked is fine - you make the newly-erased element the new head of the free list and point it's 'next' index to the previous head - regardless of scenario. However if you're storing whole skipblocks, when you reach the scenario where you're erasing an element that is directly between two skipblocks, you have to join the skipblocks and remove the record of one of the skipblocks from the free list - which requires either:

  1. Traversing the whole singly-linked free list to find the skipblock previous to the removed skipblock, and changing it's 'next' index to point to the skipblock after the removed skipblock, or
  2. Changing the singly-linked free list to a doubly-linked free list and updating 'next' and 'previous' indexes respectively for the skipblocks before and after the removed skipblock in the free list sequence.

Across multiple benchmarks the second solution works out to be faster as the first requires jumping all over the memory block in the case of many non-contiguous erasures and subsequently many skipblocks. When the elements in question are large, this can have huge cache effects, since the free list index info is stored within the erased element's memory space, not it's skipfield node. The performance cost of having to update and maintain both a previous and next index is minimal by contrast.

A side-effect of this is that elements are now over-aligned to (sizeof(skipfield type) * 2) (or the type alignment if that is larger). So, assuming the default skipfield type of unsigned short, a colony storing char or short has that type overaligned to 32bits (assuming unsigned short is 16 bits on that platform), in order to store both a next and previous index when an element gets erased and becomes part of a free list. The performance cost of this is minimal, the storage cost is larger (though not as large as it was before v4.5 when colony was still using a pointer stack), but, colony was never designed for small types - if you store chars or shorts in a colony you are already wasting space due to the skipfield node being larger/as large as the element itself. So not a huge loss from my perspective.

So yeah - big update, big change, for the better. Hope everyone gets some use out of it!

02-08-2018 - In the name of Elegance

I did another guest post on Arne's blog about the dangers of letting area-specific 'elegant' solutions dominate a language.

10-07-2018 - The results of Assumptions

The next time someone tells you that std::fill_n or std::fill is as fast as memset, get them to benchmark it.

My results:

Xubuntu 18, Core2Duo E8500 CPU, GCC 7.3

Results in release mode (-O2 -march=native): ============================================= 2018-07-10 21:28:11 Running ./google_benchmark_test Run on (2 X 3800.15 MHz CPU s) CPU Caches: L1 Data 32K (x2) L1 Instruction 32K (x2) L2 Unified 6144K (x1) ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 16488 ns 16477 ns 42460 memory_fill 16493 ns 16493 ns 42440 memory_memset 8414 ns 8408 ns 83022 Results in debug mode (-O0): ============================= ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 87209 ns 87139 ns 8029 memory_fill 94593 ns 94533 ns 7411 memory_memset 8441 ns 8434 ns 82833 Results in -O3 (clang at -O2 is much the same): ================================================ ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 8437 ns 8437 ns 82799 memory_fill 8437 ns 8437 ns 82756 memory_memset 8436 ns 8436 ns 82754
The code (using google benchmark):
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;

static void memory_memset(benchmark::State& state)
	int ints[50000];

	for (auto _ : state)
		std::memset(ints, 0, sizeof(int) * 50000);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_filln(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill_n(ints, 50000, 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_fill(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill(std::begin(ints), std::end(ints), 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

// Register the function as a benchmark

int main (int argc, char ** argv)
    benchmark::Initialize (&argc, argv);
    benchmark::RunSpecifiedBenchmarks ();
	printf("Total = %f\n", total);
    return 0;

Note: The for-loop counting the array contents is necessary for the benchmark loops not to get optimized out by clang (which detects that the arrays are unused and removes them).

10-06-2018 - Google benchmark and some strange results

So I took part in an exercise to try and understand google benchmark, as a small part of the larger task of understanding the google toolchains en generale. Google documentation is... lets say, "sparse", sometimes inaccurate and typically lacking in reasonable explanations for newbs, so it took quite a bit of searching to understand what the different macros were doing. Final result: figured out they were running each process a certain number of times to get an estimate as to how many iterations would be necessary to achieve statistically-meaningful results, then running the benchmark post-warmup. Plan to submit a merge request to the google benchmark docs, at some point when I have time.

At any rate, I was surprised to find the insertion (push_back) results did not match my own benchmarks in windows (I was in this case running goog-bench on xubuntu 18 on a core2). Deque, bless it's little libstdc++ heart, was outperforming it by a factor of 2 - which was unusual. For 100000 insertions, libstdc++'s deque should've been performing approximately 781 allocations per-run, as it allocates memory blocks in 512-byte chunks; whereas colony, with it's memory block growth factor of 2, should've performed approximately 12 allocations total. Memory allocation being an expensive operation in general, under my windows benchmarks colony had largely outperformed deque in terms of insertion. So the only conclusion I could come to...

... was that windows memory allocation sucks. I fired up Win7, checked my libstdc++/gcc version numbers (gcc 7.3 x64 on both linux and windows/mingw), downloaded python, cmake, google benchmark & test, and recompiled the test under nuwen mingw. Sure enough, same GCC, same optimization flags (-o2 -march=native), same compilation under Codelite, but now colony outperformed deque by a factor of 2. I re-ran tests both on linux and windows, just to make sure there weren't any fluke values, but it was consistent. BTW, all speedstep/power-saving/services/AV/networking/UAC/etcetera/etcetera/etcetera are disabled on both setups. I also tested a box with an Ivy bridge CPU and Windows 10, and the results were exactly the same. So there you have it; memory allocation under windows is breathtakingly slow. But that wasn't the only surprise. Take a look at the comparitive results below:

Windows results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 897430145 benchmark_deque_creation 128 ns 128 ns 4985723 benchmark_vector_creation 1 ns 1 ns 897430145 benchmark_colony_insertion 281589 ns 275333 ns 2493 benchmark_deque_insertion 546001 ns 546003 ns 1000 benchmark_vector_insertion 736900 ns 736903 ns 1122 benchmark_colony_erasure 713045 ns 713048 ns 897 benchmark_deque_erasure 183300495 ns 183301175 ns 4 benchmark_deque_erasure_remove_if 336472 ns 328826 ns 2040 benchmark_vector_erasure 312000513 ns 312002000 ns 2 benchmark_vector_erasure_remove_if 260000 ns 260002 ns 2640 benchmark_colony_iterate_and_sum 121685 ns 121686 ns 6410 benchmark_deque_iterate_and_sum 95949 ns 95949 ns 7479 benchmark_vector_iterate_and_sum 69534 ns 69535 ns 8974
Linux results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 885853600 benchmark_deque_creation 42 ns 42 ns 16841287 benchmark_vector_creation 1 ns 1 ns 886048673 benchmark_colony_insertion 427091 ns 427091 ns 1639 benchmark_deque_insertion 253210 ns 253210 ns 2764 benchmark_vector_insertion 700716 ns 700709 ns 997 benchmark_colony_erasure 743468 ns 743458 ns 939 benchmark_deque_erasure 94806854 ns 94806240 ns 7 benchmark_deque_erasure_remove_if 297223 ns 297222 ns 2355 benchmark_vector_erasure 96288165 ns 96288299 ns 7 benchmark_vector_erasure_remove_if 169655 ns 169655 ns 4100 benchmark_colony_iterate_and_sum 120794 ns 120793 ns 5801 benchmark_deque_iterate_and_sum 95539 ns 95539 ns 7300 benchmark_vector_iterate_and_sum 69133 ns 69133 ns 10100

Note: Both remove_if and non-remove_if erase benchmarks randomly remove 1 in every 8 elements from the total in the container.
You can see non-remove_if erasure (erasing randomly during iteration) in both vector and deque is 2-3x as slow under windows compared to linux - this indicates that, not only are memory Allocations slower, but also memory Copies are slower. However this particular result did not show up on the Win10/Ivy Bridge setup (where the non-remove_if erase results were comparitively equal between linux and windows), indicating that either Win10 is better at memory copies or windows is better at handling memory copies on Ivy Bridge CPUs than Core2's. The "creation" benchmarks (instantiate an instance of a template, check it's size, then destroy it) show the initial allocation/deallocation of deque (unlike vector and colony, libstdc++'s deque implementation allocates it's first memory block upon instantiation instead of upon first insertion) was roughly 4x slower in windows! Lastly, vector erasure with remove_if is almost twice as fast under linux as opposed to windows.

So how did windows memory system get so inefficient? Or perhaps a better question, how did linux's memory management get so efficient? Perhaps that's the right question. Millions of contributors worldwide with an emphasis on performance and stability not "new features" or shiny interfaces, probably. One might ask, why not test this under MSVC? Well, I already did that with my regular benchmarks, and they showed much the same results as mingw, only slower, so that's probably not necessary. Given that colony was initially designed for game engines and games almost universally are designed for windows on the desktop/laptop, this is not a great loss for colony, at least in it's original use-case. And of course it's non-remove_if erasure results are still the best out of all containers. The last thing to note from these results is that colony insertion is noticably slower under linux than under windows. I'm not sure why this should be, on the same compiler, same processor etc. So there's something to look into there.

For anyone wanting to run these benchmarks themselves, here is the code:

UPDATE 19-06-2018: I tried doing a simple google benchmark with memory allocation/deallocation only. This was consistent with the findings for the allocation/deallocation of deque - about 37ns for Xubuntu (GCC) and about 176ns for Windows 7 (mingw-GCC - ~220ns under MSVC2017). The results were almost exactly the same on the Ivy Bridge/Win10 setup. In addition I also tested memset, which was 20-30% slower on windows, depending on the CPU/platform. The .cpp for this test has been included in the zip above.

20-04-2018 - Colony 4.5 vs colony 4.0 benchmark results

Performance increase/decrease for raw performance tests on average across all numbers of elements, erasing 25% of all elements during erasure:
Operation Char Int Double Small Struct Large Struct
Insert +2% +3% +5% +4% 0%
Erase +0% +26% +13% +10% +5%
Iterate -43% -1% 0% 0% 0%

The negative result for char iteration is due to the fact that the char type has to be overaligned to the same size as the skipfield type (unsigned short) in order for the free list mechanism to work in v4.5 (v4.0 used a pointer stack). This can be mitigated by using unsigned char for the skipfield type in colony's template arguments (only really useful for small collections < 1024 elements).

Erasure performance for small structs, across all numbers of elements, erasing 75% of all elements:

+35% average, +58% max (low number of elements), +20% min (high number of elements)

General use-case benchmarking, small struct, on average across all numbers of elements and rates of modification:

Low-modification: +1%,
High-modification: +2.5%

Referencer use-case benchmarking, small struct, on average across all numbers of elements and levels of modification:

+2%, increased lead vs packed-array types

02-04-2018 - Colony 4.5 released - major update.

So here's the thing. colony 4.5 is a lot faster than it's predecessor, due to ditching a stack for erased locations and employing per-memory-block free-lists (plus a global intrusive list of blocks-with-erasures) instead. It also uses less memory, and enables erase to be more exception-safe due to a lack of allocation. Also it supports overaligned types! Here's how that all came about:

Someone at the container-focused C++ committee meeting in Jacksonville said colony shouldn't use a stack for erasures because that could create an allocation exception upon expanding the stack. Now that's partially true, erase can throw exceptions anyway because (a) a destructor might throw an exception - for example, if an object has a file open and can't close it on destruction for some reason - and (b) an invalid/uninitialized iterator could be supplied to the function. Regardless, it's a good point, and I started looking into how to make that not happen. I came back to an old idea I'd had for using per-memory-block free lists instead of a stack for recording erased/reusable locations for future insertions.

The main reason I didn't go with that in the first instance was because I didn't really understand how explicit casting works in C++; originally I thought that I would need to use unions to make a free list work in this case, which don't support basic object-oriented functionality like constructors etc. When in reality, if you have a pointer to something in C++, you can reinterpret_cast anything to it, provided you don't overflow bit boundaries and mess something up. I don't particularly see why C++ doesn't allow one to do this with just regular objects, but I'm sure there's some eldritch reasoning.

The second reason I thought it wouldn't work was because free lists typically utilize pointers, and while fast, pointers are large(ish) - so using colony with scalar types would mean overaligning them to 64-bit most of the time - not cool man!!! Because of my experiences writing plf::list, I was more experienced with thinking through this approach with pointers in mind. But, I realised eventually that since the free list is per-block I could use indexes relative to the start of that memory block, instead of pointers, and that these indexes only had to be of the same type as the skipfield (unsigned short by default) - so this would work fine for everything except char - where the type would have to be overaligned to unsigned short. Not a huge loss since I don't see people using char with colony much, if at all... (and people can always change the skipfield type to unsigned char for small collections and get better performance anyway)

So what I wound up with was (a) per-block free-lists of indexes via reinterpret_cast'ing of the 'next' free list index into the current index's erased element's memory space, and (b) per-block free list heads (also an index) and (c) a global intrusive singly-linked list of all memory blocks which have erasures (with a global list head). Pretty cool. No additional memory allocations necessary to record the erasures. No wasted memory (asides from the extra per-block head, and 'next' pointer for the global groups-with-erasures list). Better performance due to a lack of allocations upon erasure. Lastly, no having to process the stack to remove erased locations when a memory block is freed to the OS. When you get rid of a memory block now, it's free list of course goes with it- so all you have to do is remove it from the global intrusive list of memory blocks with erasures (pretty quick).

The nice thing about this exercise is how easy it was - I didn't expect the upgrade to be anything but a bundle of heavily-bugged freak-outs, but the initial work was done in about 3 hours, while the rest of it - fixing up capacity(), setting up overaligning for char under C++11 etc, finding the edge-cases, optimizing, took about a week or so. The side-benefit of setting up for overaligning when the skipfield type is larger than type T ended up being support for over-aligned types, which is also cool. I don't forsee a hell of a lot of use of SIMD with colony (for the reasons stated here) but at least it's more possible now.

By the way, rules for colony behaviour when the skipfield type is larger than the element type are as follows:

Also, colony is now exactly three years old since first release! ... Happy Birthday?!

3-12-2017 - 'auto': Dangerous Gimmick, or merely the innocuous result of too many programmers huffing the 'terse code' paint?

For 30 years, comp sci teachers have been trying to educate programmers out of using meaninglessly terse variable, class and object names. But now, the battle appears to be being lost, as too many programmers are failing to recognise that typenames are not an exception.

For the sake of argument, let's say I'm stupid. I like to write code like "int i = 10;". 'i' has an inherent meaning. Sure, I can look at the code and infer the meaning, but that increases cognitive load - or maybe I'm the kind of programmer that takes a perverse pride in my ability to rapidly infer meaning out of meaningless code. If the latter, I'm even dumber than I thought because I failed to realise that those brain cycles could be better spent on other things.

Let's separate this out into 4 concepts: "terseness", "aesthetics", "readability" and "meaningfulness". Compare and contrast the following two paragraphs:

"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements."


"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements. This makes it beneficial for programs where there is substantial interlinking between different sets of elements and where elements are being erased and inserted continuously in real time."

The first paragraph is more terse. Aesthetically, there is no substantial difference between the two, unless you aesthetically prefer shorter writings - this is because aesthetics are largely subjective and meaningless, typically based more on familiarity than logic. The readability of both paragraphs is the same - it's just that the second paragraph makes you read more. But the meaningfulness of the second paragraph is greater. The second sentence in the paragraph *can* be inferred from the first, but doing so (a) increases the cognitive load of the reader and (b) increases the chance they might create an incorrect inference in their own minds, purely through the flexibility of the human brain and thinking.

So we can see that readability and aesthetics do not necessarily have anything to do with terseness or meaningfulness, but terseness and meaningfulness can be opposed. And while there are certainly ways of writing which are both more terse and more meaningful at the same time, often there's a tradeoff. Many programmers appear to side on the 'fewer keystrokes' way of thinking, but this is a mistake if you understand that code is always read more than it is written, even if it's a program that only you yourself will ever read. Going over and over your code is made more tedious and more time-consuming if you waste cycles decrypting the semantics of it.

For this reason, teachers of comp sci will tell you to give your variables and classnames real names, ones which indicate their purpose. Even if "i" is a counter in a loop, just calling it "counter" negates other possible meanings of the variable, and makes the code more meaningful as a result. So why, now, are programmers in multiple languages opting for code similar to "auto i = 0;"?

Let's, again, break this down into multiple categories. The two I'm familiar with are "type cascading" and "terseness". We already understand how terseness is problematic, but to go into more depth, look at the following two sections of C++ code:

for (auto i = some_stuff.begin(); i != some_stuff.end(); ++i)
   // do stuff with 'i'


for (vector<int>::iterator current_stuff = some_stuff.begin(); current_stuff != some_stuff.end(); ++current_stuff)
   // do stuff with 'current_stuff'

So, we're forcing the reader to infer three things in the first section: the type of 'some_stuff', the type of 'i' (iterator or pointer to an element, , presumably) and the meaning of 'i' in the enclosed code. Now, if the loop is reasonably short and uncomplicated, perhaps the omission of a meaningful name for 'i' is not that important - it's right there, after all. If the loop contains a lot of code however, 'i' makes it less meaningful, particularly if there are other meaningless variable names like 'j'. But the worse of the two is the type inference. Not only does the reader have to infer the type of 'i' by going back through the code and finding out what 'some_stuff' refers to, but they must now also infer whether 'i' is an iterator or a pointer.

So while we save keystrokes with the first section, and from a certain point of view it could be considered more aesthetically " pleasing", it is also much more meaningless, and creates more cognitive load, rather than less. This is bad code: not because it's terse, but because you waste more human energy and brain cycles in the code's lifetime.

Type-cascading is another anti-pattern. If you're using 'auto' across multiple functions which call each other in a heirarchy, where the initial type is specified at point 'A', all that means is that at point 'B' you have to navigate back through multiple function calls to find the actual type - wasting time and human energy. It's something that can be done more meaningfully using typedef's, but overall in terms of code meaningfulness should be avoided, where possible. Templates are an example of useful type cascading, but it can be taken to extremes, and when it does, the signal-to-noise ratio becomes strong.

Now, I'm not saying 'auto' is incredibly worthless or that there couldn't be some scenario where it's even necessary - but the way it's used in general in C++ is wrong - it increases terseness at the cost of meaningfulness, and sometimes it doesn't even increase terseness (see 'int', 'char')! It comes across to me as a gimmick, something to make the language more attractive to programmers who're used to the same feature being present in another language. But it's a mistake to think that in all those years promoting meaningful naming in code, that we should leave type definitions out of that equation.

Addendum: I wrote a related article called "Terseness: how little is too much?" on Arne Mertz's blog, which explores the concept of meaningful naming in general.

Contact: footer
plf:: library and this page Copyright (c) 2017, Matthew Bentley