
Performance of std::vector vs. arrays

The surprising results of a performance comparison between std::vector<> operator= and memcpy() with a raw array.

An application profiling effort by a colleague and me led us to investigate the performance tradeoffs between std::vector<> and a raw array (uint8_t blah[]).

The application was a C++ rewrite of a C# application. The rewrite was undertaken because the overhead of running a C# application on our memory- and CPU-constrained ARM Linux system was affecting system performance and responsiveness. Using Mono to run C# applications on Linux resulted in ~40 MB of RAM overhead for our application, which itself allocated tens of thousands of bytes at most. Startup time also left much to be desired: it took a few seconds to start the application from MMC on the 1 GHz ARM processor. We verified the long startup time by printing an “I’m started” string as the first statement in Main. We tried upgrading to a newer version of Mono to see if that would improve performance. We last used a 4.x version, both to build and run the application, but we didn't see any significant improvement in startup time or RAM overhead.

Startup time of the overall system is important because the product regularly starts up from a fully powered-off state. The C# application is one of several running in the system, so users end up waiting while the system starts up, and waiting users can become irritated users. In addition, we were seeing periodic CPU spikes. These spikes were likely due to issues with our implementation rather than the Mono runtime or garbage collection. We could probably have fixed them, but given the startup time and memory overhead issues we decided to focus on the rewrite in C++ instead.

The rewrite ended up being very effective in improving performance and reducing latency. Button responsiveness was greatly improved due to reduced latency. Other applications benefited from the additional available CPU time. Switching from C# to C++ reduced CPU load from 7-13% down to 0-2%. RAM usage went from 40 MB down to below 100 kB. The C++ version also starts up much more quickly, possibly because it does not have to load the .NET memory management and other runtime systems. However, the C++ version didn’t start off at 0-2% CPU. That came after several hours of profiling and puzzling over the code.

During the C++ rewrite we were keen to see how well we were doing as soon as we had a partially functional system. Using ‘top’ and ‘htop’ on our embedded target indicated that the C++ app was using between 7% and 10% CPU. This was an improvement over the 7-13% of the C# implementation, but it still seemed like a lot for an application that was communicating periodically via serial and sleeping in the interim.

As this is a Linux application, we took advantage of being able to develop and run it on desktop/laptop Linux during our profiling efforts. Running on the desktop hardware made it much easier to profile, rebuild, and retest.

Profiling with valgrind’s callgrind tool and kcachegrind for visualization revealed that a lot of time was being spent in std::copy<>.
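
For reference, the basic workflow looks like this (the application name is a placeholder); callgrind writes its results to a callgrind.out.<pid> file, which kcachegrind can open:

$ valgrind --tool=callgrind ./your_app
$ kcachegrind callgrind.out.<pid>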

The application communicates with a piece of hardware over a serial port using a request/response protocol. Each of these messages, typically several bytes to a couple of dozen bytes in length, was represented by a std::vector. An oversight in the first versions of the rewrite was that std::vector<>s were being passed by value in several places. Passing by value meant that lots of std::vector<>s were being constructed and copied into, and that copying went through std::copy<>.

Switching to passing references to the std::vector<>s made the application more efficient and brought its overall CPU usage down from 7-10% to 0-2%. There was still the question of whether the use of std::vector<> was introducing overhead that wouldn’t have been present had a raw uint8_t message[] been used.
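
For illustration, the difference looks something like this (Message and the checksum functions are hypothetical names, not the application's actual code):

#include <cstddef>
#include <cstdint>
#include <vector>

using Message = std::vector<uint8_t>;

// Pass by value: the entire payload is copied (via std::copy) on every call.
std::size_t ChecksumByValue(Message msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}

// Pass by const reference: only a reference is bound, no copy is made.
std::size_t ChecksumByRef(const Message& msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}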

There are a range of ways to measure the performance differences between std::vector<> and uint8_t message[], but we were specifically interested in the cost of copying a std::vector<> vs. a uint8_t message[].

I had created an initial test. A colleague of mine, Matt, rewrote the tests and pulled in the Python pandas library for visualization, in this case via a box plot. That was extended to test a range of array sizes, with CMake targets for generating the data sets and the images used here. The GitHub project is linked below.

The resulting plots show the time spent copying a byte array via memcpy() vs. std::vector<>::operator=().
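
The essence of the measurement is timing repeated copies of the same buffer through both paths. Below is a minimal sketch of that idea; it is not the repository's actual TimeToCopyArray()/TimeToCopyVector() code, and the buffer size and iteration count are arbitrary:

// g++ -O2 -std=c++11 copy_bench.cpp -o copy_bench
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t kSize = 4096;   // one of the sizes tested below
    const int kIterations = 100000;

    // Raw array path, copied with memcpy().
    static uint8_t raw_src[kSize];
    static uint8_t raw_dst[kSize];
    std::memset(raw_src, 0xAB, kSize);

    // std::vector path, copied with operator=().
    std::vector<uint8_t> vec_src(kSize, 0xAB);
    std::vector<uint8_t> vec_dst;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIterations; ++i)
        std::memcpy(raw_dst, raw_src, kSize);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIterations; ++i)
        vec_dst = vec_src;            // std::vector<>::operator=()
    auto t2 = std::chrono::steady_clock::now();

    // Touch the destinations so the optimizer is less likely to drop the
    // copies; a real benchmark should guard against this more carefully.
    volatile uint8_t sink = raw_dst[0] + vec_dst[0];
    (void)sink;

    using us = std::chrono::microseconds;
    std::cout << "memcpy:            "
              << std::chrono::duration_cast<us>(t1 - t0).count() << " us\n"
              << "vector assignment: "
              << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
    return 0;
}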

Here are the results:

64 bytes, std::vector slower by 14.3%
128 bytes, std::vector slower by 11.4%
256 bytes, std::vector slower by 25.3%
1024 bytes, std::vector slower by 63.2%
2048 bytes, std::vector slower by 60.1%
4096 bytes, std::vector slower by 57.5%
8192 bytes, std::vector slower by 13.5%
16384 bytes, std::vector slower by 12.4%
32768 bytes, std::vector slower by 11.7%
65535 bytes, std::vector slower by 8.9%
1048576 bytes, std::vector slower by 1.8%

We didn't expect the result that std::vector<> assignment would be only slightly slower than memcpy(). We had expected that using a C++ container would have a more significant performance impact due to the per-element nature of copying a std::vector<>. Instead the overhead was relatively small. But why?

C++ templates and GCC’s optimizer

std::vector<>'s assignment operator uses std::copy. A reference implementation of std::copy looks like this:

From http://www.cplusplus.com/reference/algorithm/copy/
template <class InputIterator, class OutputIterator>
OutputIterator copy(InputIterator first, InputIterator last, OutputIterator result)
{
    while (first != last) {
        *result = *first;
        ++result;
        ++first;
    }
    return result;
}


The compiler and the library implementation appear to recognize that the loops in the std::vector<> template code are copying uint8_t, a trivial integer type rather than a class, and convert the element-by-element copy into a memcpy(). This conversion to memcpy() means that while you are using a std::vector<> container for its capabilities, you are also getting an optimized implementation for assigning/copying a std::vector<uint8_t>.
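
The idea can be illustrated with a simplified dispatch; this is not the actual libstdc++ source, and it uses C++17's if constexpr for brevity:

#include <cstring>
#include <type_traits>

// Simplified: the real std::copy works with arbitrary iterator types and uses
// the library's own dispatch machinery rather than this exact code.
template <typename T>
T* copy_elements(const T* first, const T* last, T* result)
{
    if constexpr (std::is_trivially_copyable<T>::value) {
        // Trivially copyable types (such as uint8_t) can be copied byte-wise
        // in a single bulk operation.
        std::memmove(result, first, (last - first) * sizeof(T));
        return result + (last - first);
    } else {
        // Class types with non-trivial copy assignment take the generic
        // element-by-element path.
        while (first != last)
            *result++ = *first++;
        return result;
    }
}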

The source code for these tests and for generating the images can be found in a GitHub repository.

After cloning the repository you can build it like so:
$ mkdir build
$ cd build
$ cmake ../
$ make


You can generate the images in this article with:
$ make images

and you can build the assembly files like:

$ make TimeToCopyArray.s
$ make TimeToCopyVector.s

Note that these results may be highly compiler-dependent. Here on Fedora 22, gcc is:
gcc (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
Your assembly output and results may be different depending on the compiler you are using.

Open up main.cpp.s:


In _Z16TimeToCopyVectorii (the std::vector<> version) there are 4 separate instances of memcpy being called:
.L19:
movq 8(%rsp), %rdx
leaq (%r14,%rcx), %rsi
subq %rsi, %rdx
movq %rdx, %rax
sarq $2, %rax
testq %rax, %rax
je .L55
movq %r8, %rdi
addl $1, %ebp
movq %r9, 16(%rsp)
call memcpy
cmpl %ebp, %r15d
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jne .L21
.p2align 4,,10
.p2align 3


.LEHB2:
call _Znwm
testq %r12, %r12
movq %rax, %rcx
je .L14
movq %r13, %rdx
movq %r14, %rsi
movq %rax, %rdi
call memcpy
movq %rax, %rcx

.L61:
movq %rcx, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 32(%rsp)
movq %r8, 24(%rsp)
movq %rcx, 16(%rsp)
call memcpy
movq 32(%rsp), %r9
movq 24(%rsp), %r8
movq 16(%rsp), %rcx
jmp .L19
.p2align 4,,10
.p2align 3


.L58:
movq %r13, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 16(%rsp)
call memcpy
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jmp .L16


In _Z15TimeToCopyArrayii (the memcpy() version):

.L8:
movq %r13, %rdx
movq %rbp, %rsi
movq %r12, %rdi
addl $1, %ebx
call memcpy
cmpl %ebx, %r14d
jne .L8

These are the sections of assembly that represent the inner loops for the two functions under test.

Comparing the two sections of assembly code makes it clear that while std::vector<uint8_t> has a handful of additional instructions, and separate implementations that may depend on the vector size or memory alignment, it is ultimately calling memcpy() to copy the elements. Those extra instructions are likely responsible for the additional time spent in the std::vector<> assignment operator vs. a straight memcpy().

Surprisingly, std::vector<> assignment appears to be nearly as efficient as memcpy(). For large element counts the difference in performance between std::vector<> assignment and memcpy() shrinks to almost zero.

My recommendation is to begin your implementation using std::vector<>. Only after profiling and identifying an issue with std::vector<> assignment should you consider switching to raw arrays. Using std::vector<> gives you bounds-checking protection and other helpful things like bundling your data and its length together. You also still have the option of retrieving the raw array from the std::vector<> via std::vector<>::data().
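
For example, handing a message to a C-style API remains straightforward; write_serial() below is a hypothetical function, not part of the project:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical C-style API taking a pointer and a length.
extern "C" int write_serial(const uint8_t* buf, std::size_t len);

int SendToDevice(const std::vector<uint8_t>& msg)
{
    // The vector bundles the data pointer and the length together.
    return write_serial(msg.data(), msg.size());
}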

I'd be interested in your thoughts on the topic. Please consider forking the GitHub repository and/or commenting.

Comments

  1. Very interesting. Is memcpy also called when a std::vector of a primitive type is resized?

