The surprising results of a performance comparison between std::vector<> operator= and memcpy() with a raw array.
An application profiling effort by a colleague and me led us to investigate the performance tradeoffs between std::vector<> and a raw array (uint8_t blah[]). The application was a C++ rewrite of a C# application. The rewrite was undertaken because the overhead of running a C# application on our memory- and CPU-constrained ARM Linux system was affecting system performance and responsiveness. Using Mono to run C# applications on Linux resulted in ~40MB of RAM overhead for our application, which itself allocated tens of thousands of bytes at most. Startup time also left much to be desired: it took a few seconds to start the application from MMC on the 1GHz ARM processor. We verified the long startup time by printing an “I’m started” string as the first statement in Main. We tried upgrading to a newer version of Mono to see if that would improve performance. We last used a 4.x version, both to build and run the application, but we didn't see any significant improvement in startup time or RAM overhead.
Startup time of the overall system is important because the product regularly starts up from a fully powered-off state. The C# application is one of several running in the system. Users end up waiting while the system starts up, and waiting users can become irritated users. In addition, we were seeing periodic CPU spikes. These spikes were likely due to issues with our implementation rather than with the Mono runtime or garbage collection. We could probably have fixed them, but given the startup time and memory overhead issues we decided to focus on the rewrite into C++ instead.
The rewrite ended up being very effective in improving performance and reducing latency. Button responsiveness was greatly improved due to the reduced latency. Other applications benefited from the additional available CPU time. Switching from C# to C++ reduced CPU load from 7-13% down to 0-2%. RAM usage went from 40MB down to below 100KB. The C++ version also starts up much more quickly, possibly because of the reduced overhead of not loading the .NET memory management and other runtime systems. However, the C++ version didn’t start off at 0-2% CPU. That came after several hours of profiling and puzzling over the code.
During the C++ rewrite we were keen to see how well we were doing as soon as we had a partially functional system. Using ‘top’ and ‘htop’ on our embedded target indicated that the C++ app was using between 7% and 10% CPU. This was an improvement over the 7-13% of the C# implementation, but it still seemed like a lot for an application that was communicating periodically via serial and sleeping in the interim.
As this is a Linux application, we took advantage of being able to develop and run it on desktop/laptop Linux during our profiling efforts. Running on the desktop hardware made it much easier to profile, rebuild, and retest.
Profiling with valgrind’s callgrind tool and kcachegrind for visualization revealed that a lot of time was being spent in std::copy<>.
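For reference, that profiling setup is along these lines (the binary name here is a placeholder):
$ valgrind --tool=callgrind ./your_app
$ kcachegrind callgrind.out.<pid>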
The application communicates via serial, using a request/response protocol with a piece of hardware. Each of these messages, typically several bytes to a couple of dozen bytes in length, was represented by a std::vector<uint8_t> that was initially being passed around by value.
Switching to passing the std::vector<>s by reference made the application more efficient and brought its overall CPU usage down from 7-10% to 0-2%. That still left the question of whether the use of std::vector<> was introducing overhead that wouldn’t have been present had a raw uint8_t message[] been used.
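The application's message-handling code isn't shown here, but a minimal sketch of the kind of change involved, with hypothetical names, looks like:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical message type and handlers; the real application code differs.
using Message = std::vector<uint8_t>;

// Before: taking the vector by value copies it on every call, so each
// request/response message paid for a heap allocation plus a byte copy.
std::size_t checksumByValue(Message msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}

// After: a const reference reads the same bytes without copying them.
std::size_t checksumByRef(const Message& msg)
{
    std::size_t sum = 0;
    for (uint8_t b : msg) sum += b;
    return sum;
}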
There are a range of ways to measure the performance differences between std::vector<> and a raw array.
I had created an initial test. A colleague of mine, Matt, rewrote the tests and pulled in the Python pandas library for visualization, in this case via a box plot. The tests were then extended to cover a range of array sizes, with CMake targets for generating the data sets and the images used here. The link to the GitHub project is given below.
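The timed functions appear later in this article as _Z15TimeToCopyArrayii and _Z16TimeToCopyVectorii, i.e. each takes two ints. A sketch of what they plausibly look like (an assumption on my part; the repository's exact code may differ):

#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

using Clock = std::chrono::steady_clock;

// Copies a 'size'-byte buffer 'iterations' times via memcpy() and returns
// the elapsed nanoseconds.
long long TimeToCopyArray(int size, int iterations)
{
    std::vector<uint8_t> src(size, 0xAB), dst(size);
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        std::memcpy(dst.data(), src.data(), size);
    auto stop = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Same measurement, but the copy goes through std::vector<>::operator=().
long long TimeToCopyVector(int size, int iterations)
{
    std::vector<uint8_t> src(size, 0xAB), dst;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        dst = src;
    auto stop = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

A real harness also has to make sure the optimizer can't elide the copies, e.g. by reading back a byte of dst after the loop.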
These plots show the time spent copying a byte array via memcpy() vs. std::vector<>::operator=(). Here are the results:
[Box plots of copy times for memcpy() vs. std::vector<> assignment across the tested array sizes.]
We didn't expect the result we got: std::vector<> assignment was only slightly slower than memcpy(). We had expected the use of a C++ container to have a more significant performance impact, due to the per-element nature of a std::vector<> copy. Instead the overhead was relatively small. But why?
C++ templates and GCC's optimizer
std::vector<>'s assignment operator uses std::copy. std::copy looks like this (from http://www.cplusplus.com/reference/algorithm/copy/):
template <class InputIterator, class OutputIterator>
OutputIterator copy (InputIterator first, InputIterator last, OutputIterator result)
{
  while (first != last) {
    *result = *first;
    ++result; ++first;
  }
  return result;
}
The compiler appears to look at the loop and the types involved in the std::vector<> template code and convert them into a memcpy(), because the element type, uint8_t, is an integer type and not a class. This conversion to memcpy() means that while you are using a std::vector<> container for its capabilities, you are also getting an optimized implementation for assigning/copying a std::vector<uint8_t>.
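libstdc++ gets there by dispatching at compile time on the element type: for trivially copyable elements behind pointer-like iterators, the per-element loop collapses into a single bulk move. A simplified sketch of that idea, written with C++17's if constexpr for brevity (the library's actual implementation uses internal helper templates, not this code):

#include <cstddef>
#include <cstring>
#include <type_traits>

template <typename T>
T* copy_sketch(const T* first, const T* last, T* result)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    if constexpr (std::is_trivially_copyable_v<T>) {
        // Trivially copyable elements (like uint8_t) can be moved in bulk.
        std::memmove(result, first, n * sizeof(T));
    } else {
        // Class types with nontrivial assignment fall back to a loop.
        for (std::size_t i = 0; i < n; ++i)
            result[i] = first[i];
    }
    return result + n;
}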
The source code for these tests and for generating the images can be found in a GitHub repository.
After cloning the repository you can build like:
$ mkdir build
$ cd build
$ cmake ../
$ make
You can generate the images in this article with:
$ make images
and you can build the assembly files like:
$ make TimeToCopyArray.s
$ make TimeToCopyVector.s
Note that these results may be highly compiler dependent. Here on Fedora 22 gcc is:
gcc (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
Your assembly output and results may be different depending on the compiler you are using.
Open up main.cpp.s and find the two functions under test: _Z16TimeToCopyVectorii (std::vector<>) and _Z15TimeToCopyArrayii (memcpy()).
In the std::vector<> version there are 4 separate calls to memcpy:
.L19:
movq 8(%rsp), %rdx
leaq (%r14,%rcx), %rsi
subq %rsi, %rdx
movq %rdx, %rax
sarq $2, %rax
testq %rax, %rax
je .L55
movq %r8, %rdi
addl $1, %ebp
movq %r9, 16(%rsp)
call memcpy
cmpl %ebp, %r15d
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jne .L21
.p2align 4,,10
.p2align 3
.LEHB2:
call _Znwm
testq %r12, %r12
movq %rax, %rcx
je .L14
movq %r13, %rdx
movq %r14, %rsi
movq %rax, %rdi
call memcpy
movq %rax, %rcx
.L61:
movq %rcx, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 32(%rsp)
movq %r8, 24(%rsp)
movq %rcx, 16(%rsp)
call memcpy
movq 32(%rsp), %r9
movq 24(%rsp), %r8
movq 16(%rsp), %rcx
jmp .L19
.p2align 4,,10
.p2align 3
.L58:
movq %r13, %rdx
movq %r14, %rsi
movq %rbx, %rdi
movq %r9, 16(%rsp)
call memcpy
leaq (%rbx,%r13), %r8
movq 16(%rsp), %r9
jmp .L16
The memcpy() version, by contrast, has a single call to memcpy in its loop:
.L8:
movq %r13, %rdx
movq %rbp, %rsi
movq %r12, %rdi
addl $1, %ebx
call memcpy
cmpl %ebx, %r14d
jne .L8
These are the sections of assembly that represent the inner loops of the two functions under test.
Comparing the two sections of assembly makes it clear that while std::vector<uint8_t> assignment executes a handful of additional instructions, and has separate code paths that may depend on the vector's size or memory alignment, it ultimately calls memcpy() to copy the elements. Those additional instructions are likely responsible for the extra time spent in std::vector<>'s assignment operator versus a straight memcpy().
Surprisingly, std::vector<> assignment appears to be nearly as efficient as memcpy(). For a large number of elements, the difference in performance between std::vector<> assignment and memcpy() shrinks to almost zero.
My recommendation is to begin your implementation using std::vector<>. Only after profiling and identifying an issue with std::vector<> assignment should you consider switching to raw arrays. Using std::vector<> gives you bounds-checking protection (via at()) and other helpful things, like bundling your data and its length together. You also retain the option of retrieving the raw array from the std::vector<> via std::vector<>::data().
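For the times a C API wants that raw buffer, data() hands it over without giving up the container. A small sketch (the POSIX write() here just stands in for whatever serial interface you use, and sendMessage is a hypothetical helper):

#include <cstdint>
#include <unistd.h> // write(); used here only as an example byte sink
#include <vector>

// Hypothetical helper: send a message over an already-open serial fd.
// The std::vector<> keeps the bytes and their length together; data()
// exposes the raw array when the C API demands one.
bool sendMessage(int fd, const std::vector<uint8_t>& msg)
{
    return write(fd, msg.data(), msg.size()) ==
           static_cast<ssize_t>(msg.size());
}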
I'd be interested in your thoughts on the topic. Please consider forking the GitHub repository and/or commenting.