Thursday, August 25, 2016

strings as types with c++17 constexpr lambdas

Recently I stumbled upon a question by @arne_mertz of Simplify C++ fame (if you don't read that blog, start now!) about using string literals as types. In 2013 I wrote about strings as types, and the technique used works, but it's not exactly elegant.

The problem is to get from "string literal" to something like

template <char... c>
class String
{
  ...
};

String<'s', 't', 'r', 'i', 'n', 'g', ' ', 'l', 'i', 't', 'e', 'r', 'a', 'l'>

This, it turns out, is not very easy.

C++11 gave us constexpr functions, and C++14 added some helpful utilities like integer_sequence<>, so you may think code like the one below would do the trick.

template <char... c>
class String
{
  ...
};

template <std::size_t N, std::size_t ... I>
constexpr auto make_string_type(char const (&array)[N], std::index_sequence<I...>)
{
  return String<array[I]...>{};
}

#define STRING_TYPE(x) make_string_type(x, std::make_index_sequence<sizeof(x)>{})

Unfortunately things aren't that simple. Although make_string_type() is constexpr, the parameter array loses its constexpr property inside the function, so operator[] does not give a constexpr result, and thus String<array[I]...> is ill-formed and gives a compilation error.

A possible way to get around that is to let the macro create and call a lambda, and have the lambda contain the string literal.

#define STRING_TYPE(x) [](){ /*something*/ x /*something*/}()

Looking into the crystal ball, we see C++17 offering constexpr lambdas, and that gives an opening.

#define STRING_TYPE(x)                                         \
  string_builder([](std::size_t i) constexpr { return x[i]; }, \
                 std::make_index_sequence<sizeof(x)>{})

The idea is that string_builder() calls the lambda for each index in the std::index_sequence<>, from 0 up to, but not including, sizeof() of the string literal (so the terminating '\0' is included). The constexpr lambda returns the character at position i of the string literal.

A possible implementation of string_builder() is

template <typename F, std::size_t ... I>
constexpr auto string_builder(F f, std::index_sequence<I...>)
{
  return String<f(I)...>{};
}

This almost seems like magic, but it's not that weird. f is the constexpr lambda that returns a char for a position in the string literal. I... are all indexes from 0 to sizeof the string literal. Since f is constexpr, String<f(I)...> is well formed.
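
A quick sanity check, assuming the definitions above are in scope (and <type_traits> is included): two uses of the same literal yield the same type, since the type depends only on the characters, not on the closure type of the lambda.

int main()
{
  auto a = STRING_TYPE("foo");
  auto b = STRING_TYPE("foo");
  static_assert(std::is_same<decltype(a), decltype(b)>{},
                "identical literals give identical types");
}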

So, what's the cost of doing this? Well, the cost is build time. Runtime has no overhead compared to a fixed string global somewhere.

Look at this example:

template <char ... c>
struct String
{
  static char const buffer[sizeof...(c)];
};

template <char ... c>
char const String<c...>::buffer[sizeof...(c)] = { c... };

void func(char const*);

int main()
{
  auto n = STRING_TYPE("nonsense");
  func(n.buffer);
}

Using clang++ (svn trunk on 2016-08-25) the output from clang++ -std=c++1z str.cpp -O1 -S is

.text
        .file   "str.cpp"
        .globl  main
        .p2align        4, 0x90
        .type   main,@function
main:                                   #@main
        .cfi_startproc
# BB#0:
        pushq   %rax
.Ltmp0:
        .cfi_def_cfa_offset 16
        movl    String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer, %edi
        callq   func(char const*)
        xorl    %eax, %eax
        popq    %rcx
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc

        .type   String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer,@object # @String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer
        .section        .rodata._ZN6StringIJLc110ELc111ELc110ELc115ELc101ELc110ELc115ELc101ELc0EEE6bufferE,"aG",@progbits,String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer,comdat
        .weak   String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer
String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer:
        .asciz  "nonsense"
        .size   String<(char)110, (char)111, (char)110, (char)115, (char)101, (char)110, (char)115, (char)101, (char)0>::buffer, 9


        .ident  "clang version 4.0.0 (trunk 279733)"
        .section        ".note.GNU-stack","",@progbits

The above is perhaps not obvious, but in the movl instruction in main, the address of the beginning of String<>::buffer is loaded into register edi, in preparation for the callq to func().

The .asciz directive near the end shows that the buffer is the string "nonsense".

So, there is no run time overhead whatsoever. However, each unique string used is its own symbol, which may increase link time, and the compile time computation required to get the type from the string literal is of course not free.

Wednesday, May 18, 2016

Succinct and helpful C++ template compilation errors

We've all experienced them, the long and unhelpful compilation errors from templates, usually referring to some internal header you didn't even know existed. Finding the source of the error can be painful, and not unusually the clue, if there is any, is somewhere in the middle of the long list of messages.

Yes yes, concepts are coming to C++. GCC 6 has them, and they are in a TS. Concepts can help get the error message to the source of the error, and sometimes give a good idea of why the error occurred.

This post will show one technique available from C++11 and later. It, and some variants of it, are used extensively in the Trompeloeil C++14 mocking framework to provide short compilation error messages that explain to the user what went wrong.

In all fairness, this technique is not my invention, and unfortunately I've forgotten where I first learned of it. If you know who pioneered it, please let me know so I can attribute it properly.
[edit: Shafik Yaghmour directed me to discussions on reddit pointing to Eric Niebler being the likely originator of this technique, first used in the Boost/Proto library.]
[edit again: Eric Niebler himself isn't entirely sure he pioneered the technique, but this youtube video makes it likely.]

Problem introduction


Here's a simple toy example. Imagine that everything except the main() function is a template library.

#include <string>
#include <iostream>
#include <utility>

void internal_func(const std::string& s)
{
  std::cout << s << '\n';
}

template <typename ... T>
void do_with_string(T&& ... t)
{
  internal_func({std::forward<T>(t)...});
}

int main()
{
  do_with_string("foobar", 3U);
  do_with_string(std::string("bar"));
}

The function do_with_string() accepts any parameters that a std::string can be constructed from. As long as it's used correctly, it works nicely.

But when a programmer makes a mistake, things turn unpleasant:

int main()
{
  do_with_string("foobar", 3U);
  do_with_string(std::string("bar"));
  do_with_string(3.1); // Error, can't make a string from a double
}

Here's what clang++-3.8 says:


t.cpp:13:18: error: type 'double' cannot be narrowed to 'char' in initializer list [-Wc++11-narrowing]
  internal_func({std::forward<T>(t)...});
                 ^~~~~~~~~~~~~~~~~~
t.cpp:20:3: note: in instantiation of function template specialization 'do_with_string<double>' requested here
  do_with_string(3.1);
  ^
t.cpp:13:18: note: insert an explicit cast to silence this issue
  internal_func({std::forward<T>(t)...});
                 ^~~~~~~~~~~~~~~~~~
                 static_cast<char>()
1 error generated.

There are three huge problems here.
  1. The highlighted problems refer to internal functions that, in a real world library, would likely be something that the user of the library never even knew existed.
  2. The highlighted information doesn't say what failed. The complaint that 'double' cannot be narrowed to 'char' is true but nonsensical. The actual problem is that the entire parameter pack expansion cannot be used to construct a std::string, and that information is completely missing.
  3. Even though the root location of the error is shown, the do_with_string(3.1) call, this information is in the middle. Again, in a real world template library this could be in the middle of hundreds of lines of disinformation.

g++-5.2 does ever so slightly better in this case, in that the failing call site is the first line of the error list, but the rest of the problems are there for g++-5.2 as well.


t.cpp: In instantiation of ‘void do_with_string(T&& ...) [with T = {double}]’:
t.cpp:20:21:   required from here
t.cpp:13:16: warning: narrowing conversion of ‘std::forward<double>((* & t#0))’ from ‘double’ to ‘char’ inside { } [-Wnarrowing]
   internal_func({std::forward<T>(t)...});
                ^
t.cpp:13:16: warning: narrowing conversion of ‘std::forward<double>((* & t#0))’ from ‘double’ to ‘char’ inside { } [-Wnarrowing]
   ^


static_assert()


A popular method to give more information is to use static_assert(). With it you can provide a helpful error message that gives the user an explanation.

template <typename ... T>
void do_with_string(T&& ... t)
{
  constexpr auto legal = std::is_constructible<std::string, T&&...>{};
  static_assert(legal, "it must be possible to form a std::string from the args");
  internal_func({std::forward<T>(t)...});
}

The constant legal becomes an instance of either std::true_type or std::false_type, depending on whether std::is_constructible<> concludes that a std::string can be constructed from T&&... or not.

The static_assert() causes a compilation error with the explicit error message string if legal is false.

The difficulty is figuring out methods to check whether the function can succeed or not. In this case it was relatively easy.
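
When no ready-made trait like std::is_constructible<> fits, one can often be hand-rolled with the detection idiom. Here is a minimal sketch of such a trait, checking whether a type can be written to a std::ostream (the trait and its name are my own illustration here):

#include <ostream>
#include <type_traits>
#include <utility>

template <typename T, typename = void>
struct is_streamable : std::false_type {};

template <typename T>
struct is_streamable<
  T,
  decltype(void(std::declval<std::ostream&>() << std::declval<T const&>()))>
  : std::true_type {};

struct not_streamable {};
static_assert(is_streamable<int>{}, "");
static_assert(!is_streamable<not_streamable>{}, "");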

Here's what clang++-3.8 says when the faulty main() is compiled:


t.cpp:14:3: error: static_assert failed "it must be possible to form a std::string from the args"
  static_assert(legal, "it must be possible to form a std::string from the args");
  ^             ~~~~~
t.cpp:22:3: note: in instantiation of function template specialization 'do_with_string<double>' requested here
  do_with_string(3.1); // Error, can't make a string from a double
  ^
t.cpp:15:18: error: type 'double' cannot be narrowed to 'char' in initializer list [-Wc++11-narrowing]
  internal_func({std::forward<T>(t)...});
                 ^~~~~~~~~~~~~~~~~~
t.cpp:22:3: note: in instantiation of function template specialization 'do_with_string<double>' requested here
  do_with_string(3.1); // Error, can't make a string from a double
  ^
t.cpp:15:18: note: insert an explicit cast to silence this issue
  internal_func({std::forward<T>(t)...});
                 ^~~~~~~~~~~~~~~~~~
                 static_cast<char>()
2 errors generated.

It is an improvement, but there is a lot of noise and disinformation. The call to do_with_string(3.1) is also blamed twice, for different offenses.

g++-5.2 is slightly better in that it doesn't try to give conflicting reasons for the failure, but the extra warnings are misleading at best.


t.cpp: In instantiation of ‘void do_with_string(T&& ...) [with T = {double}]’:
t.cpp:22:21:   required from here
t.cpp:14:3: error: static assertion failed: it must be possible to form a std::string from the args
   static_assert(legal, "it must be possible to form a std::string from the args");
   ^
t.cpp:15:16: warning: narrowing conversion of ‘std::forward<double>((* & t#0))’ from ‘double’ to ‘char’ inside { } [-Wnarrowing]
   internal_func({std::forward<T>(t)...});
                ^
t.cpp:15:16: warning: narrowing conversion of ‘std::forward<double>((* & t#0))’ from ‘double’ to ‘char’ inside { } [-Wnarrowing]

Both compilers give the necessary helpful message, but there's a lot of unhelpful cruft. In a real world example, the helpful message might be difficult to find among all the uninteresting ones.

The problem is that even though the static_assert() is triggered, the compiler continues trying to make sense of the function, fails, and this causes the unhelpful extra messages.

The solution, as so often, is another level of indirection.

Tag dispatch


The trick is to not call internal_func() directly from do_with_string(), but to use an indirection via a function that takes an extra parameter saying whether the call can succeed or not. Since the flag legal is already of type std::true_type or std::false_type, the indirection functions can be selected on those types. Note that it is really the type that differs, not just different boolean values.

template <typename ... T>
void do_with_string_(std::false_type, T&& ...);

template <typename ... T>
void do_with_string_(std::true_type, T&& ... t)
{
  internal_func({std::forward<T>(t)...});
}

template <typename ... T>
void do_with_string(T&& ... t)
{
  constexpr auto legal = std::is_constructible<std::string, T&&...>{};
  static_assert(legal, "it must be possible to form a std::string from the args");
  do_with_string_(legal, std::forward<T>(t)...);
}

The two overloads of the new indirection function do_with_string_() are selected on the type of the first parameter, which is provided by the call at the end of do_with_string(). Note that the failure overload doesn't have to be implemented, just declared.

This removes the cruft.

clang++-3.8 says:


t.cpp:23:3: error: static_assert failed "it must be possible to form a std::string from the args"
  static_assert(legal, "it must be possible to form a std::string from the args");
  ^             ~~~~~
t.cpp:31:3: note: in instantiation of function template specialization 'do_with_string<double>' requested here
  do_with_string(3.1); // Error, can't make a string from a double
  ^
1 error generated.

Excellent!

g++-5.2 is equally helpful:


t.cpp: In instantiation of ‘void do_with_string(T&& ...) [with T = {double}]’:
t.cpp:31:21:   required from here
t.cpp:23:3: error: static assertion failed: it must be possible to form a std::string from the args
   static_assert(legal, "it must be possible to form a std::string from the args");
   ^

Both give the helpful message, and the source of the failing call, and not much else, and specifically no misleading messages from internal functions that the user doesn't even know about.

Unfortunately I do not feel comfortable enough with concepts to show examples in a future C++, but perhaps someone else can complement this post and write an alternative article from the viewpoint of a concepts world?

Friday, January 1, 2016

A flexible lexicographical comparator for C++ structs

We've all hand crafted comparison operators for structs with many members, and we've all cursed the tedium. It's all right for equality comparison, but lexicographical ordering relations are a different story when there are more than two members.

Hopefully all C++ developers have by now learned about the std::tie()-idiom.
struct S
{
  int a;
  int b;
  int c;
};

bool operator<(const S& lh, const S& rh)
{
  return std::tie(lh.a, lh.b, lh.c)
       < std::tie(rh.a, rh.b, rh.c);
}

std::tie() creates a std::tuple<> of references to the parameters, and the comparison operators for std::tuple<> do their work memberwise, creating a lexicographical comparison. It's all done inline and is as efficient as can be.

But what if you want different ordering relations at different times? Take as an example the classic employee record, where you may sometimes want to sort employees by name, sometimes by age, sometimes by employee number, sometimes by years employed, sometimes by... you get the idea.

You can, of course, write comparator classes for each, implementing each function with std::tie() as above, but it's still a repetitive tedium. Isn't that what computers are good at?

What if we could create a comparator by listing the ordering relation and which members, like this?
struct S
{
  int a;
  int b;
  int c;
};

std::vector<S> ss = ...

auto comparator = lex_compare<std::less<>>(&S::a, &S::b, &S::c);

std::sort(std::begin(ss), std::end(ss), comparator);

Pointers to members are an old, obscure construct in C++, so rarely used that they deserve a short intro before continuing. &S::a is a pointer to the member a in a struct S. It is used to access the member a in any instance of struct S. A short example:
struct S
{
  int a;
  int b;
  int c;
};

auto ptr = &S::a;

S s1;
S s2;

s1.*ptr = 1; // s1.a = 1
s2.*ptr = 2; // s2.a = 2

ptr = &S::b;

s1.*ptr = 3; // s1.b = 3
s2.*ptr = 4; // s2.b = 4

Above, the type of ptr is int S::*, which means that it can be reassigned to point to any member of struct S that is of type int. That's why the reassignment to &S::b works; both members a and b are of type int.
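
Member pointers also deduce nicely through function templates, which is a useful stepping stone towards the comparator. A small sketch, where access() is just a hypothetical helper for illustration:

template <typename T, typename M>
M& access(T& obj, M T::*ptr)
{
  return obj.*ptr; // M is deduced from the member pointer
}

S s{};
access(s, &S::a) = 1; // s.a = 1
access(s, &S::c) = 3; // s.c = 3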

Now it is not a far stretch of the imagination to see that with a number of pointers to members, we can access the member variables to compare in the desired order. For example, pointers to members in a std::tuple<> can be used like this:
struct S
{
  int a;
  int b;
  int c;
};

S s = ...

auto ms = std::make_tuple(&S::a, &S::c);

s.*std::get<0>(ms) = 1; // s.a;
s.*std::get<1>(ms) = 1; // s.c;

This doesn't look like much of an improvement, but it's a start to get the compiler engaged in generating code for us. Time to switch to templates.
template <typename ... Ms>
class comp
{
public:
  comp(std::tuple<Ms...> m) : ms(m) {}
  template <typename T>
  bool operator()(T const& lh, T const& rh) const
  {
     return // magic
  }
private:
  std::tuple<Ms...> ms;
};

Above, Ms is a template parameter pack. At this stage we know nothing about it, except that it is 0 or more types. The member variable ms holds values of all those types, as provided by the constructor. This is all that is known currently. The previous example also showed how the members of a std::tuple<...> can be accessed, but this requires knowledge at compile time of their position in the std::tuple<...>.

All that is needed to reveal the magic is available, but only indirectly. A starting point is the sizeof...(Ms) operator, which tells, as a constant expression usable as a template argument, how many elements the parameter pack, and thus the std::tuple<Ms...>, holds.

C++14 introduced in the standard library a set of templates that are really handy for this. Two nearly indispensable tools are std::index_sequence<Ns...> and its helper std::make_index_sequence<N>.
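
Concretely, assuming <utility> and <type_traits> are included, the generated sequence can be spelled out:

static_assert(std::is_same<std::make_index_sequence<3>,
                           std::index_sequence<0, 1, 2>>{}, "");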

These two take us almost all the way to the magic.
template <typename ... Ms>
class comp
{
public:
  comp(std::tuple<Ms...> m) : ms(m) {}
  template <typename T>
  bool operator()(T const& lh, T const& rh) const
  {
     using idxs = std::make_index_sequence<sizeof...(Ms)>;
     return compare(idxs{}, lh, rh);
  }
private:
  std::tuple<Ms...> ms;
};

Disappointed? You shouldn't be. idxs is a std::index_sequence<...> type that holds all indexes, from 0 up until, but not including, the number of elements in the std::tuple<Ms...>, in sequence. This is all that is required to pick the members and do the comparison. It is, however, necessary to go through the extra indirection. The compare function then becomes:
template <size_t ... Ns, typename T>
bool compare(std::index_sequence<Ns...>, T const& lh, T const& rh) const
{
  return std::tie(lh.*std::get<Ns>(ms)...)
       < std::tie(rh.*std::get<Ns>(ms)...);
}

Wait, what? Is that it? Yes, it is. Although perhaps a little explanation is useful.

Earlier I mentioned that std::make_index_sequence<N> generates a sequence of all indexes from 0 up until, but not including, N. Those are the Ns parameter pack above. The construction lh.*std::get<Ns>(ms)... is repeated for every index in Ns, so the function effectively becomes (provided the parameter pack size is 3):
template <typename T>
bool compare(T const& lh, T const& rh) const
{
  return std::tie(lh.*std::get<0>(ms), lh.*std::get<1>(ms), lh.*std::get<2>(ms))
       < std::tie(rh.*std::get<0>(ms), rh.*std::get<1>(ms), rh.*std::get<2>(ms));
}

This is what's wanted, isn't it?

Getting to the end goal is more or less trivial. What remains is a function to create the comparator object, and a comparator template argument instead of always using operator<.
template <typename Comparator, typename ... Ms>
class memberwise_comparator : private Comparator
{
public:
  memberwise_comparator(std::tuple<Ms...> t) : ms(t) {}
  template <typename T>
  bool operator()(T const& lh, T const& rh) const
  {
    using idxs = std::make_index_sequence<sizeof...(Ms)>;
    return compare(idxs{}, lh, rh);
  }
private:
  template <size_t ... Ns, typename T>
  bool compare(std::index_sequence<Ns...>, T const& lh, T const& rh) const
  {
    const Comparator& comp = *this;
    return comp(std::tie(lh.*std::get<Ns>(ms)...),
                std::tie(rh.*std::get<Ns>(ms)...));
  }
  std::tuple<Ms...> ms;
};

template <typename Comp, typename ... T>
auto lex_compare(T ... t)
{
  return memberwise_comparator<Comp, T...>{std::make_tuple(t...)};
}

Using inheritance for the comparator is just to enable the empty base class optimization. As written there is no way to provide a comparator object that may hold state, and most stateless comparators have no data, so inheriting reduces the size of the comparator object slightly.
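
As a final usage sketch, assuming <algorithm> and <functional> are included and reusing ss from the earlier example: swapping in std::greater<> gives a descending order without any new boilerplate.

auto by_c_desc = lex_compare<std::greater<>>(&S::c, &S::a);
std::sort(std::begin(ss), std::end(ss), by_c_desc); // descending on c, then on a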

What about performance then? After all, this is C++ and we care about performance, and also everybody knows that this type of encapsulated abstraction and templates cause bloat and kill the optimizer? Or?

Here's a short example function, compiled in a separate translation unit, so that it can be followed.

bool is_sorted(std::vector<S> const& v)
{
  auto comparator = lex_compare<std::less<>>(&S::b, &S::a);
  return std::is_sorted(std::begin(v), std::end(v), comparator);
}

When compiled with g++ 5.2 for x86_64 and optimized with -O3, the resulting code is:
        .file   "sorts.cpp"
       .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
       .text
.LHOTB0:
       .p2align 4,,15
       .globl  _Z9is_sortedRKSt6vectorI1SSaIS0_EE
       .type   _Z9is_sortedRKSt6vectorI1SSaIS0_EE, @function
_Z9is_sortedRKSt6vectorI1SSaIS0_EE:
.LFB7573:
       .cfi_startproc
       movq    8(%rdi), %r8
       movq    (%rdi), %rdx
       cmpq    %rdx, %r8
       je      .L11
       leaq    12(%rdx), %rax
       cmpq    %rax, %r8
       je      .L3
       movl    16(%rdx), %esi
       movl    4(%rdx), %edx
       cmpl    %esi, %edx
       jle     .L6
       jmp     .L3
       .p2align 4,,10
       .p2align 3
.L13:
       movl    %ecx, %esi
.L6:
       cmpl    %edx, %esi
       jg      .L8
       movl    -12(%rax), %edi
       cmpl    %edi, (%rax)
       jl      .L3
.L8:
       addq    $12, %rax
       cmpq    %rax, %r8
       je      .L3
       movl    4(%rax), %ecx
       movl    %esi, %edx
       cmpl    %esi, %ecx
       jge     .L13
.L3:
       cmpq    %rax, %r8
       sete    %al
       ret
       .p2align 4,,10
       .p2align 3
.L11:
       movl    $1, %eax
       ret
       .cfi_endproc
.LFE7573:
       .size   _Z9is_sortedRKSt6vectorI1SSaIS0_EE, .-_Z9is_sortedRKSt6vectorI1SSaIS0_EE
       .section        .text.unlikely
.LCOLDE0:
       .text
.LHOTE0:
        .ident  "GCC: (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010"
       .section        .note.GNU-stack,"",@progbits

The entire loop is between .L13 and .L3, and the two-element comparison is between .L6 and .L8. All overhead of indirection is gone. Compilers are good at this.

I thought this was cool and just wanted to share. Work remains to allow stateful comparators, and to ensure more helpful compiler error messages in case of usage mistakes.

Monday, August 3, 2015

Cache optimizing a priority queue


I must begin with saying that if you found this because you have a performance problem, you should almost certainly look elsewhere. It is highly unlikely that your performance problem is caused by your priority queue. If, however, you are curious, or you have done careful profiling and found out that the cache characteristics of your priority queue are causing your performance problem, and you cannot fix that by altering your design, by all means read on.

A priority queue is typically implemented as a binary heap. The std::priority_queue<> class template in the C++ standard library is such an example. There is a reasonably good explanation for how they work on Wikipedia, but I'll go through some operations anyway since it leads naturally to the optimization process.

The heap is a partially sorted tree-like structure. Below is a heap with the letters 'a'-'j'. A parent node always has higher priority than its children. In this example, 'a' has the highest priority. There is no order between the children, though. Either can have higher priority.


The numbers next to the nodes are the node indexes. An interesting property of the binary heap is that the index of a parent is always half the index of a child (rounded down). Another interesting property of the binary heap is that it is always packed towards lower indexes. There are never any holes in the structure.
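
In code, assuming the 1-based indexing above, the hops are plain multiplications and divisions by 2 (the helper names are mine, for illustration):

std::size_t parent(std::size_t idx)      { return idx / 2; }
std::size_t left_child(std::size_t idx)  { return idx * 2; }
std::size_t right_child(std::size_t idx) { return idx * 2 + 1; }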

If a new node is added, space is made for index 11. Then the new value is compared with that of the parent, i.e. index 5. If the new value has higher priority, the parent is moved to index 11, and the new value is compared with the parent of index 5, i.e. index 2. If index 2 has higher priority, the new value is inserted at index 5 and the insertion is done.
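
A swap-based sift-up sketch of that insertion, over a 1-based std::vector<> with slot 0 unused, where higher(a, b) means a has higher priority than b. The description above uses the equivalent hole-moving variant, which saves some stores:

template <typename T, typename Compare>
void heap_push(std::vector<T>& heap, T value, Compare higher)
{
  if (heap.empty()) heap.resize(1); // keep slot 0 unused
  heap.push_back(std::move(value));
  for (std::size_t i = heap.size() - 1;
       i > 1 && higher(heap[i], heap[i / 2]);
       i /= 2)
  {
    std::swap(heap[i], heap[i / 2]);
  }
}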

Removing the highest priority element is more involved. When 'a' is removed above, index 1 becomes free, which is forbidden, so the child with the highest priority is moved to it. So 'b' is moved to index 1 and index 3 becomes free. The child of index 3 with the highest priority is 'c', which is moved to index 3, which makes index 6 free. Since index 6 has no children, the phase shifts to insertion of the last element at index 6. So the last element is compared with the parent of index 6, i.e. index 3. Since index 3 (which now holds 'c') has higher priority than 'j', the job is done and 'j' is moved from index 10 to index 6. Index 10 becomes free, but since it's the last element, that is legal, and the heap shrinks so that index 9 becomes the last element.
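
And a classic sift-down sketch for popping the top of the same heap. The description above uses a slightly different variant, percolating the hole down to a leaf before sifting the last element up:

template <typename T, typename Compare>
void heap_pop(std::vector<T>& heap, Compare higher)
{
  std::size_t last = heap.size() - 1;
  if (last > 1) heap[1] = std::move(heap[last]); // move the last element to the root
  heap.pop_back();
  --last;
  for (std::size_t i = 1; 2 * i <= last;)
  {
    std::size_t child = 2 * i;
    if (child < last && higher(heap[child + 1], heap[child]))
      ++child; // pick the child with the highest priority
    if (!higher(heap[child], heap[i])) break;
    std::swap(heap[i], heap[child]);
    i = child;
  }
}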

Both insertion and popping the top element are O(log2(n)).

The clever thing about this is that it fits very well in an array. The above heap could as well have been drawn as below.


Traversing the nodes by doing division or multiplication by 2 on indexes is very fast compared to following pointers in a traditional tree structure.

A drawback, however, is that when the number of nodes becomes large, the distance between parent and child becomes such that only the ones nearest the root share a cache line. Assume for the sake of illustration that each cache line can hold 8 nodes. With a large heap, the result will become as illustrated below, where each color is one cache line in size.


The top 3 levels are fine, as they are all in one cache line, but after that each step towards a child or towards a parent is on a new cache line. This can become a performance problem since the caches are severely underutilized. You read a whole cache line only to access 1/8 of it. If you pop the highest priority element, the first cache-line will get accesses to at least 5/8 of its data, which is good, but the rest will get at most 2/8 accessed.

This leads to the idea of another arrangement, a heap of mini-heaps, or B-heap as it is also called. There is a brief explanation on Wikipedia.

If the mini-heaps hold 7 elements, the cache graph above instead becomes like this.


The grey triangles each represent a 7 element mini-heap. This is better, but not that good, and the index calculations become very complicated. Within each mini-heap, you do the multiplication or division by two, but the index of the parent of the top of a mini-heap and the indexes of the children in the last row of a mini-heap are computed very differently. Both involve modulo 7 calculations, which are not very fast. I implemented this and ran some benchmarks, but it was slower than std::priority_queue<>, despite having substantially fewer cache misses, which was rather disappointing.

Another idea is to deliberately waste space: although each mini-heap will hold only 7 values, it consumes the space of 8 values and thus a whole cache line. The result is illustrated below.


This is much better. The lower image illustrates the layout in an array, where the numbers to the left are the array offsets to the beginning of each mini-heap. Offset 0 in each mini-heap is not used, which makes the parent/child calculations within a mini-heap a multiplication or division by 2. With a mini-heap size of 8 (or any power of two), the checks for mini-heap root or last child in a mini-heap become simple bit masking operations, which are very cheap, as sketched below. Calculating hops between mini-heaps is more elaborate, but is limited to multiplications and divisions by 8, and those are fast.
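
A sketch of those cheap tests and of the within-block parent calculation, assuming 8-slot mini-heaps with slot 0 of each block unused (the helper names are mine):

bool is_block_root(std::size_t idx) { return (idx & 7) == 1; } // offset 1 in its block
bool is_block_leaf(std::size_t idx) { return (idx & 4) != 0; } // offsets 4..7, the bottom row

std::size_t parent_in_block(std::size_t idx)
{
  return (idx & ~std::size_t(7)) | ((idx & 7) / 2);
}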

In the above case, inserting a new element is guaranteed to touch at most 2 cache lines, even if the added element will replace the root. Compare with the first binary heap layout which would touch 4 cache lines to replace the root. Also, the more elements in the B-heap, the better it becomes, because of the huge fan out. Each mini-heap has 8 mini-heap children, so the number of cache-lines to touch when traversing between leaf and root are log8(n).

After implementation, I ran a number of performance measurements. The graphs presented here are generated on an Intel i7-4500U CPU running Ubuntu 15.04. The test programs are compiled with g++ 4.9.3 using the flags -O3 and -std=c++14. The CPU has a 32K L1d cache, and each cache line is 64 bytes.

Of the different measurements made, the most telling one is where a heap is filled to a certain number of elements, then operated on. In pseudo-code:

for (size) queue.push() 
for (cycles) {
  for (n) queue.push()
  for (n) queue.pop() 
}
I have rather arbitrarily chosen 'n' to be 320. It's large enough to exercise the memory a bit. Pushing is done from a large array pre-populated with random numbers.

The first measurement is a priority queue of ints. There is room for 16 elements in each cache line, so measurements are made with mini-heaps of 8, 16, 32 and 64 elements. The graph displays the relative speed of the B-heaps with different mini-heap sizes over std::priority_queue<int>.

Two interesting observations here. It's always faster than std::priority_queue<int>, and the size of the mini-heaps does not seem to matter much.

However, a priority queue of just integers is almost always uninteresting. Usually the priority is associated with something, a message for example. A reasonable implementation is to let the priority queue hold a std::pair<int, std::unique_ptr<msg>>, and prioritize on the int.


This is wild and difficult to say anything about. There are even examples of worse performance than std::priority_queue<>. I do not understand what causes this.

This leads to another idea, though. The priority queue could hold two data structures, one being the B-heap of priorities, the other an array holding the pointers. They are connected via the indexes. There are two advantages with this. One is that it keeps the B-heap of priorities tight. Popping the top object always involves moving the highest priority child up, so reading the value of the lesser prioritized child is a wasted read that can evict more valuable data from the L1 cache. With a tighter B-heap structure that waste is reduced. Also, the structure holding the pointers is only touched when actual moves are made, not during the priority comparisons. Together, these are likely to give better performance.


Well, it is still all over the place, but at least it's now consistently outperforming std::priority_queue<>. We can also see that a mini-heap size of only 8 elements is consistently bad. I expected a mini-heap size of 16 elements to be the best performer, since 16 ints take up exactly one cache line, but this is obviously not the case. The bigger mini-heaps outperform it most of the time. I do not have an answer for why this is so.

Yet another idea is to continue with the separate storage for the values, but keep the unique_ptr<> in an array outside the priority queue, and use the index into that array as the data associated with the priority. This has two advantages in addition to the advantages outlined above - it further shrinks the data set being operated on, and ints are very cheap to move, relative to unique_ptr<>.


This is consistently much faster than std::priority_queue<std::pair<int,int>>. It is still difficult to say much about the preferred mini-heap size, except that 8 is too small and suffers, and that my idea of 16 being the optimum was wrong.

Source code for the priority queue, including the benchmark program, is available from github.

Friday, June 5, 2015

Performance observations on a C++ vector of lambdas

[ edit 2015-Jun-7 : The source code is available on github ]

When writing unit tests, you typically don't care much about execution speed, but compile time performance is important. After all, if building your unit test program takes 3 minutes, Test Driven Development becomes so painful it requires super human determination, but if the build takes 3 seconds, TDD becomes a joy.

I became interested in this in part from observations when making the Trompeloeil header only C++14 mocking framework, and also from the interesting blog post "Template Code Bloat Revisited: A Smaller make_shared" by Jason Turner (@lefticus.) Here's hoping that the information presented can aid other developers of C++ unit test tools in shortening the test program build times.

With the set up in place, I decided to measure run time performance as well, since the extra effort was very minor and makes the findings interesting to a much wider audience than just makers of unit test tools.

Setup


This simple lambda is repeated 1000 times in the source.

[](int n) { return n + __COUNTER__; }

__COUNTER__ is not needed, but it makes sure each function body differs a bit, making it more difficult for the compiler to be clever and just make one lambda type.

The lambdas are stored in a std::vector<>. There are two representations of the vector. The simple and straightforward representation is:
std::vector<std::function<int(int)>> v;

The less straightforward representation uses an interface class:
template <typename sig>
struct func;

template <typename R, typename ... A>
struct func<R(A...)>
{
  virtual ~func() = default;
  virtual R call(A...) = 0;
};

template <typename Lambda, typename R, typename ... A>
struct func_n : public func<R(A...)>
{
  template <typename U>
  func_n(U&& u) : t(std::forward<U>(u)) {}
  R call(A... a) override { return t(a...); }
  Lambda t;
};

std::vector<std::unique_ptr<func<int(int)>>> v;

This version has two variants. The first is to create std::unique_ptr<func<int(int)>> and populate the vector with those. The second is to create std::unique_ptr<func_n<Lambda, int, int>> and let the converting constructor of std::unique_ptr<> do the work, as sketched below. The second version is inspired by the blog post by Jason Turner (@lefticus), where the idea is that the more specialized std::unique_ptr<> is costly to create and convert from, even though it is never needed.
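
A sketch of the two variants, as I read the description, using the func/func_n templates above:

auto lambda = [](int n) { return n + __COUNTER__; };
using base = func<int(int)>;
using impl = func_n<decltype(lambda), int, int>;

v.push_back(std::unique_ptr<base>(new impl(lambda))); // pointer to the interface, created directly
v.push_back(std::unique_ptr<impl>(new impl(lambda))); // pointer to the derived class, converted on insertion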

To complicate matters further, there are separate measurements of populating the vector with .push_back() and with .emplace_back(). The idea is that the potential run time performance gain with .emplace_back() comes with a compile time penalty.
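
For the std::function<> representation, the two population styles look like this minimal sketch:

std::vector<std::function<int(int)>> fs;
fs.push_back([](int n) { return n; });    // a std::function temporary is created, then moved in
fs.emplace_back([](int n) { return n; }); // the std::function is constructed in place from the lambda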

The vector is not pre-reserved to capacity, which should not matter for compile time or evaluation time performance, but almost certainly impacts population time performance.

The contenders in this shootout are g++4.9, g++-5.1 and clang++3.6. clang++ uses libstdc++ from g++4.9. I have measurements with clang++3.6 using libc++, but they came out so exceptionally unflattering that I suspect my libc++ installation is flawed somehow.

The machine that the measurements are made on is an Intel X980 i7@3.33GHz with 12G ram, running X86_64 Gentoo.

Compile time measurements are from a single run of each type since they take several seconds. The run time measurements are repeated 20 times, and the shortest time found is used, hoping that it has the least disturbance from other activities on the computer.

Compilation performance


Here the time (in seconds) required to compile the single source file program is measured.


Unoptimized builds


All builds are with -std=c++14 -ggdb

push_back
It is clear that compilation time with std::function<sig> is roughly twice as long as with the hand crafted function class template, regardless of compiler. The expected difference between creating a std::unique_ptr<> to the interface and a std::unique_ptr<> to the derived template and converting is nowhere to be seen.

emplace_back
Here it is obvious that std::function<sig> suffers even more when emplace_back() is used, and now requires more than 3 times longer to compile than the unique_ptr<> to hand crafted interface alternative. For the latter, the compile times don't seem affected.

Optimized builds


All builds are with -std=c++14 -O

push_back
This one is weird. For g++4.9 and g++5.1, compilation times are slightly shorter with optimized builds using std::function<> than with unoptimized builds. clang++3.6 on the other hand needs quite a long time to compile the source. Again the expected difference between std::unique_ptr<> to the interface and std::unique_ptr<> to the derived class is not seen.

emplace_back
Clang++3.6 works very hard with optimized builds using std::function<>. It looks bad when you glance at the graph, but read the numbers: it shoots out of the graph, an order of magnitude above the g++ compilers, requiring a whopping 338 seconds to compile the source. The std::unique_ptr<> alternatives seem unaffected by optimization. For g++ there is extra suffering compared to push_back() when using std::function<>, even more so than with unoptimized builds.

Populate time


Here the time needed to populate the vector with the 1000 lambdas is measured. The program is run 20 times, and the time from the fastest run is shown. Only optimized builds are measured.

push_back (µs)
It is not obvious if the differences are in the compilers or in the library versions, but given that clang++ uses libstdc++ from g++4.9 and shows very similar performance, I'm leaning towards a library improvement for std::function<> in g++5.1 being shown.


emplace_back (µs)
The expected performance boost of emplace_back() does turn out to be a disappointing performance pessimization when using g++. I guess item 42 in Effective Modern C++ is correct that while emplacement can improve performance, it might not always do so. With clang++3.6 a performance improvement can be seen, but it is very minor. However, given that the vector capacity was not pre-reserved, the move constructor called when growing the vector is likely to dominate. Perhaps the improvement with clang++3.6 would be greater with pre-reserved capacity.

Evaluation performance


Here the time is measured to call every lambda in the vector. Again, only optimized builds are measured, and the time shown is from the fastest run of 20.

evaluate all (µs)
Here std::function<> shines. I have not tried to find the cause for the performance difference, but I am guessing that the improvement comes from better locality of reference if these small lambdas can be stored in the std::function<> objects themselves, instead of being spread all over the heap as is the case with the std::unique_ptr<> implementations.

Conclusions


Estimating performance is difficult, there is no other way to put it. That enabling optimizations can reduce compilation time, as is shown with g++ and push_back() of std::function<>, is baffling. That emplacement slows down compilation time is perhaps not surprising, but neither is the size of the cost obvious. std::function<> is very fast to call, at least in this example, but it is obviously costly to create both in run time and compilation time. Clang++ choking completely on optimized builds with std::function<> is rather unexpected, and that it doesn't show any run time performance advantage for it is disappointing.

The expected conversion cost from unique_ptr<derived> to unique_ptr<base> is not seen at all, but perhaps the situation would be different if the source was split into several files.

The only things clear are that std::function<> is slow to compile and fast to call.