The surprising results of a performance comparison between std::vector<> operator= and memcpy() with a raw array.
An application profiling effort by a colleague and me led us to investigate the performance tradeoffs between std::vector<> and a raw array (uint8_t blah[]). The application was a C++ rewrite of a C# application. The rewrite was undertaken because the overhead of running a C# application on our memory- and CPU-constrained ARM Linux system was affecting system performance and responsiveness. Using Mono to run C# applications on Linux resulted in ~40MB of RAM overhead for our application, which itself allocated tens of thousands of bytes at most. Startup time also left much to be desired: it took a few seconds to start the application from MMC on the 1GHz ARM processor. We verified the long startup time by printing an “I’m started” string as the first statement in Main. We tried upgrading to a newer version of Mono to see if that would improve performance. We last used a 4.x version, both to build and run the application, but we didn't see any significant improvement in startup time or RAM overhead.
Startup time of the overall system is important because the product regularly starts up from a fully powered-off state. The C# application is one of several running in the system. Users end up waiting while the system starts up, and waiting users can become irritated users. In addition, we were seeing periodic CPU spikes. These spikes were likely due to issues with our implementation rather than with the Mono runtime or garbage collection. We could probably have fixed them, but given the startup time and memory overhead issues we decided to focus on the rewrite into C++ instead.
The rewrite ended up being very effective in improving performance and reducing latency. Button responsiveness was greatly improved due to the reduced latency. Other applications benefited from the additional available CPU time. Switching from C# to C++ reduced CPU load from 7-13% down to 0-2%. RAM usage went from 40MB down to below 100KB. The C++ version also starts up much more quickly, possibly because of the reduced overhead of not loading the .NET memory management and other runtime systems. However, the C++ version didn’t start off at 0-2% CPU. That came after several hours of profiling and puzzling over the code.
During the C++ rewrite we were keen to see how well we were doing as soon as we had a partially functional system. Using ‘top’ and ‘htop’ on our embedded target indicated that the C++ app was using between 7% and 10% CPU. This was an improvement over the 7-13% of the C# implementation, but it still seemed like a lot for an application that was communicating periodically via serial and sleeping in the interim.
As this is a Linux application, we took advantage of being able to develop and run it on desktop/laptop Linux during our profiling efforts. Running on the desktop hardware made it much easier to profile, rebuild, and retest.
Profiling with valgrind’s callgrind tool and kcachegrind for visualization revealed that a lot of time was being spent in std::copy<>.
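For reference, that profiling setup is along these lines (the binary name here is a placeholder):
$ valgrind --tool=callgrind ./your_app
$ kcachegrind callgrind.out.<pid>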
The application communicates via serial, using a request/response protocol with a piece of hardware. Each of these messages, typically several bytes to a couple of dozen bytes in length, was represented by a std::vector<uint8_t> that was initially being passed around by value.
Switching to passing the std::vector<>s by reference made the application more efficient and brought its overall CPU usage down from 7-10% to 0-2%. That still left the question of whether the use of std::vector<> was introducing overhead that wouldn’t have been present had a raw uint8_t message[] been used.
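The application's message-handling code isn't shown here, but a minimal sketch of the kind of change involved, with hypothetical names, looks like:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical message type and handlers; the real application code differs.
using Message = std::vector<uint8_t>;

// Before: taking the vector by value copies it on every call, so each
// request/response message paid for a heap allocation plus a byte copy.
std::size_t checksumByValue(Message msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}

// After: a const reference reads the same bytes without copying them.
std::size_t checksumByRef(const Message& msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}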
There are a range of ways to measure the performance differences between std::vector<> and a raw array.
I had created an initial test. A colleague of mine, Matt, rewrote the tests and pulled in the Python pandas library for visualization, in this case via a box plot. The tests were then extended to cover a range of array sizes, with CMake targets for generating the data sets and the images used here. The link to the GitHub project is given below.
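The timed functions appear later in this article as _Z15TimeToCopyArrayii and _Z16TimeToCopyVectorii, i.e. each takes two ints. A sketch of what they plausibly look like (an assumption on my part; the repository's exact code may differ):

#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

using Clock = std::chrono::steady_clock;

// Copies a 'size'-byte buffer 'iterations' times via memcpy() and returns
// the elapsed nanoseconds.
long long TimeToCopyArray(int size, int iterations)
{
    std::vector<uint8_t> src(size, 0xAB), dst(size);
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        std::memcpy(dst.data(), src.data(), size);
    auto stop = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Same measurement, but the copy goes through std::vector<>::operator=().
long long TimeToCopyVector(int size, int iterations)
{
    std::vector<uint8_t> src(size, 0xAB), dst;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        dst = src;
    auto stop = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

A real harness also has to make sure the optimizer can't elide the copies, e.g. by reading back a byte of dst after the loop.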
These plots show the time spent copying a byte array via memcpy() vs. std::vector<>::operator=(). Here are the results:
[Box plots of copy times for memcpy() vs. std::vector<> assignment across the tested array sizes.]
We didn't expect the result we got: std::vector<> assignment was only slightly slower than memcpy(). We had expected the use of a C++ container to have a more significant performance impact, due to the per-element nature of a std::vector<> copy. Instead the overhead was relatively small. But why?
C++ templates and GCC's optimizer
std::vector<>'s assignment operator uses std::copy. std::copy looks like this (from http://www.cplusplus.com/reference/algorithm/copy/):
template <class InputIterator, class OutputIterator>
OutputIterator copy (InputIterator first, InputIterator last, OutputIterator result)
{
  while (first != last) {
    *result = *first;
    ++result; ++first;
  }
  return result;
}
The compiler appears to look at the loop and the types involved in the std::vector<> template code and convert them into a memcpy(), because the element type, uint8_t, is an integer type and not a class. This conversion to memcpy() means that while you are using a std::vector<> container for its capabilities, you are also getting an optimized implementation for assigning/copying a std::vector<uint8_t>.
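libstdc++ gets there by dispatching at compile time on the element type: for trivially copyable elements behind pointer-like iterators, the per-element loop collapses into a single bulk move. A simplified sketch of that idea, written with C++17's if constexpr for brevity (the library's actual implementation uses internal helper templates, not this code):

#include <cstddef>
#include <cstring>
#include <type_traits>

template <typename T>
T* copy_sketch(const T* first, const T* last, T* result)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    if constexpr (std::is_trivially_copyable_v<T>) {
        // Trivially copyable elements (like uint8_t) can be moved in bulk.
        std::memmove(result, first, n * sizeof(T));
    } else {
        // Class types with nontrivial assignment fall back to a loop.
        for (std::size_t i = 0; i < n; ++i)
            result[i] = first[i];
    }
    return result + n;
}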
The source code for these tests and for generating the images can be found in a GitHub repository.
After cloning the repository you can build like:
$ mkdir build
$ cd build
$ cmake ../
$ make
You can generate the images in this article with:
$ make images
and you can build the assembly files like:
$ make TimeToCopyArray.s
$ make TimeToCopyVector.s
Note that these results may be highly compiler dependent. Here on Fedora 22 gcc is:
gcc (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
Your assembly output and results may be different depending on the compiler you are using.
Open up main.cpp.s and find the two functions under test: _Z16TimeToCopyVectorii (std::vector<>) and _Z15TimeToCopyArrayii (memcpy()).
In the std::vector<> version there are 4 separate calls to memcpy:
.L19:
movq 8(%rsp), %rdx
leaq (%r14,%rcx), %rsi
subq %rsi, %rdx
movq %rdx, %rax
sarq $2, %rax
testq %rax, %rax
je .L55
movq %r8, %rdi
addl $1, %ebp
movq %r9, 16(%rsp)
call memcpy
cmpl %ebp, %r15d
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jne .L21
.p2align 4,,10
.p2align 3
.LEHB2:
call _Znwm
testq %r12, %r12
movq %rax, %rcx
je .L14
movq %r13, %rdx
movq %r14, %rsi
movq %rax, %rdi
call memcpy
movq %rax, %rcx
.L61:
movq %rcx, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 32(%rsp)
movq %r8, 24(%rsp)
movq %rcx, 16(%rsp)
call memcpy
movq 32(%rsp), %r9
movq 24(%rsp), %r8
movq 16(%rsp), %rcx
jmp .L19
.p2align 4,,10
.p2align 3
.L58:
movq %r13, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 16(%rsp)
call memcpy
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jmp .L16
The memcpy() version, by contrast, has a single call to memcpy in its loop:
.L8:
movq %r13, %rdx
movq %rbp, %rsi
movq %r12, %rdi
addl $1, %ebx
call memcpy
cmpl %ebx, %r14d
jne .L8
These are the sections of assembly that represent the inner loops of the two functions under test.
Comparing the two sections of assembly makes it clear that while std::vector<uint8_t> assignment executes a handful of additional instructions, and has separate code paths that may depend on the vector's size or memory alignment, it ultimately calls memcpy() to copy the elements. Those additional instructions are likely responsible for the extra time spent in std::vector<>'s assignment operator versus a straight memcpy().
Surprisingly, std::vector<> assignment appears to be nearly as efficient as memcpy(). For a large number of elements, the difference in performance between std::vector<> assignment and memcpy() shrinks to almost zero.
My recommendation is to begin your implementation using std::vector<>. Only after profiling and identifying an issue with std::vector<> assignment should you consider switching to raw arrays. Using std::vector<> gives you bounds-checking protection (via at()) and other helpful things, like bundling your data and its length together. You also retain the option of retrieving the raw array from the std::vector<> via std::vector<>::data().
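For the times a C API wants that raw buffer, data() hands it over without giving up the container. A small sketch (the POSIX write() here just stands in for whatever serial interface you use, and sendMessage is a hypothetical helper):

#include <cstdint>
#include <unistd.h> // write(); used here only as an example byte sink
#include <vector>

// Hypothetical helper: send a message over an already-open serial fd.
// The std::vector<> keeps the bytes and their length together; data()
// exposes the raw array when the C API demands one.
bool sendMessage(int fd, const std::vector<uint8_t>& msg)
{
    return write(fd, msg.data(), msg.size()) ==
           static_cast<ssize_t>(msg.size());
}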
I'd be interested in your thoughts on the topic. Please consider forking the GitHub repository and/or commenting.