TBB concurrent memory allocation problems and solutions

In the multi-threaded program, common memory allocation will become a serious performance bottleneck. This article describes how to use the Threading Building Blocks scalable memory allocator to avoid memory allocation competition and false sharing.

Memory allocation is not only the programming of the basic tasks, but also in multi-core programming affect the efficiency of a challenge. In C + + where we can use a custom memory allocator instead of std:: allocator, Threading Building Blocks to provide a std:: allocator compatible and scalable memory allocator.

Each memory allocator has its own characteristics. TBB scalable allocator is committed to the scalability and speed. In some cases, this comes at a price: a waste of virtual space. Specifically, when the distribution of 9k to 12k block can be that it would waste more space.

Memory allocation problem

In the multi-threaded program, common memory allocation will become a serious performance bottleneck. This is because the common memory allocator to use a global lock from a single global heap allocate and free memory blocks, each thread allocate and free memory will lead to competition. It is this competition, and frequent memory allocation procedures may reduce the multi-core brings advantages. Using C + + Standard Library (STL) program may be more serious, because their memory allocation is often hidden.

Here sat a piece of code, it concurrent implementation of a memory application and release operations function:

#include <iostream>
#include <tbb/parallel_for.h>
#include <tbb/tick_count.h>
#include <vector>

using namespace tbb;

void alloctask( int )
 //vector Structure and the destructor to free up space when you apply for and  
 std::vector<int> data(100);

int main()
 tick_count t1 = tick_count::now(); // Used to record the time spent  
 parallel_for(0,100000,1,alloctask); // Hundreds of time  alloctask( Concurrent  )
 tick_count t2 = tick_count::now();

 std::cout << (t2-t1).seconds() << std::endl;
 return 0;

You can run this code see the time it takes, we can easily put the above code modified to use the TBB scalable memory allocator, this rate will speed up almost double (in the dual dual-core CPU, is expected to more cores CPU will perform better).

Note: If the previous read site on the TBB cycle of the article, people may wonder, there are no task_scheduler_init, and parallel_for parameters are different. In fact, the code is based on the latest TBB2.2 version, this version is no longer mandatory for task_scheduler_init. parallel_for also a few more overloaded for convenience.

"Fake sharing" is a concurrent program in another serious problem, it often occurs when multiple threads using the memory block close with you. In the processor core has called "cache lines" of the high-speed cache, it can only be by the same thread access the same cache, otherwise it will lead to cache switching, which can easily cause hundreds of clock cycles of the waste .

To illustrate why the false sharing such a performance loss, we can see when two threads access close together memory caused by the additional overhead. Assume that the buffer has 64 bytes, two threads sharing the cache.

First, the program defines two arrays, containing 1000 float (4 bytes):

float A_array [1000];
float B_array [1000];

As the compiler of the order of distribution, the two arrays is likely to be close together. Consider the following actions:

  1. Thread A writes A_array [999];
    Processor will contain A_array [999] The elements of 64 bytes into the cache
  2. Thread B is written B_array [0];
    Additional costs: the processor must flush the cache, the A_array [999] Save to memory. To contain B_array [0] of 64 bytes included in the cache and set the thread A, the cache is marked as invalid.
  3. Continue to work, thread A writes A_array [1];
    Additional costs: the processor must flush the cache in order to B_array [0] saved in memory. Back to thread A load cache and set the thread B's cache is marked as invalid.

See, even if the thread A and thread B to use their own memory or will cause significant overhead. Solution to leave the shared approach is to array according to the cache boundary alignment.

Memory allocator

TBB scalable memory allocator can be used to solve the above-mentioned problems, TBB provides two distributors: scalable_allocator and cache_aligned_allocator, they were defined in tbb / scalable_allocator.h and tbb / cache_aligned_allocator.h years.

  • scalable_allocator solve the distribution of competition, it did not completely prevent false sharing. However, each thread from a different memory pool access memory, this can also be a certain procedure to avoid false sharing occurs.
  • cache_aligned_allocator solve the distribution of competition and false sharing. As the allocation of memory is the cache size in multiples of it takes more space, particularly the allocation of a large number of small space. We should determine the false sharing has become a performance bottleneck when using cache_aligned_allocator. In your program are using the two distributor to test performance to determine the ultimate use of which is a good idea.

In the STL container to use dispenser

scalable_allocator and cache_aligned_allocator and std:: allocator are compatible, we can use std:: allocator as to use them. The following example demonstrates the use of cache_aligned_allocator as std:: vector's allocator.

std::vector< int, cache_aligned_allocator<int> >;

Now we can put the previous code had changed a bit:

void alloctask( int )
    //vector Structure and the destructor to free up space when you apply for and  
    std::vector<int, scalable_allocator<int> > data(100);

Comparing the efficiency of your computer, how much it improved ^ _ ^

Place of malloc, free, realloc and calloc

TBB as malloc, free, realloc and calloc can provide the corresponding extended version:

#include "tbb\scalable_allocator.h"

void * scalable_malloc (size_t size);
void   scalable_free (void* ptr);
void * scalable_realloc (void* ptr, size_t size);
void * scalable_calloc (size_t nobj, size_t size);

Instead of new and delete

To completely overload C + + in the new and delete, we are to achieve the following four pairs of this new / delete operations:

void* operator new(std::size_t size) throw(std::bad_alloc);
void* operator new(std::size_t size, const std::nothrow_t&) throw( );
void* operator new[](std::size_t size) throw(std::bad_alloc);
void* operator new[](std::size_t size, const std::nothrow_t&) throw( );
void  operator delete(void* ptr) throw( );
void  operator delete(void* ptr, const std::nothrow_t&) throw( );
void  operator delete[](void* ptr) throw( );
void  operator delete[](void* ptr, const std::nothrow_t&) throw( );

We can use the front said the scalable_malloc () and scalable_free () to achieve these steps:

#include "tbb\scalable_allocator.h"

void* operator new (size_t size) throw (std::bad_alloc)
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc (size))
        return ptr;
    throw std::bad_alloc ( );
void* operator new[] (size_t size) throw (std::bad_alloc)
    return operator new (size);
void* operator new (size_t size, const std::nothrow_t&) throw ( )
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc (size))
        return ptr;
    return NULL;
void* operator new[] (size_t size, const std::nothrow_t&) throw ( )
    return operator new (size, std::nothrow);
void operator delete (void* ptr) throw ( )
    if (ptr != 0) scalable_free (ptr);
void operator delete[] (void* ptr) throw ( )
    operator delete (ptr);
void operator delete (void* ptr, const std::nothrow_t&) throw ( )
    if (ptr != 0) scalable_free (ptr);
void operator delete[] (void* ptr, const std::nothrow_t&) throw ( )
    operator delete (ptr, std::nothrow);
标签: lt, performance bottleneck, scalability, vector, concurrent implementation, free memory, serious performance, stl, memory blocks, iostream, core programming, c standard library, threading building blocks, memory allocator, virtual space, allocation problem, memory allocation problems, global heap, custom memory, allocation procedures
分类: CPP
时间: 2010-07-29


  1. DB2 insufficient memory allocation

    DB2 insufficient memory allocation 1, the phenomenon, the problem description Erected in our own DB2 database ...
  2. Memcached memory allocation mechanism introduced

    Slab Allocation Mechanism: order memory for reuse By default, memcached recently introduced a mechanism called ...
  3. JVM 系列一:Native memory allocation 导致JVM Crash

    JVM Crash抛出如下信息: # # There is insufficient memory for the Java Runtime Environment to continue. # Native memor ...
  4. Java memory management (1. Memory allocation)

    On the Java memory allocation, and many problems are vague and can not run through a comprehensive understandi ...
  5. Fundamentals of Java memory management - Java memory allocation

    1. Introduction Now extract a Java5 memory management white paper statement: One strength of the Java ™ 2 Plat ...
  6. Memory allocation methods and their differences

    1, from the static storage area allocation. Memory when compiled in the program has allocated well, this piece ...
  7. C language memory allocation: malloc () function and alloc () function

    This article describes the C language, malloc () function and alloc () function. C language with the memory al ...
  8. SUN JVM memory management and server memory allocation and optimization of non-

    Part for a briefing on SUN JVM memory management mechanism. This mainly explains the JVM with performance-rela ...
  9. Thinking C + + memory allocation

    Quote: Yang Lihui the C + + memory allocation of the five methods Do not know whether it was her copy of the a ...
  10. Tomcat memory leak causes and solutions

    Tomcat causes memory overflow Tomcat in a production environment is very vulnerable to bad memory settings of ...
  11. [Daily small note] memory allocation method and common mistakes

    Memory operations for application developers, it is always a minefield. In this area, always keep the laying o ...
  12. C + + memory allocation Cheats-new, malloc, GlobalAlloc Detailed

    C + + memory allocation Cheats-new, malloc, GlobalAlloc Detailed _______ Just for memory allocation and can no ...
  13. Oracle database data files to grow automatically Problems and Solutions

    Oracle database data files to grow automatically Problems and Solutions Recently encountered a rather distress ...
  14. advanced programming unix environment Notes - memory allocation calls and stack space

    # Include <stdlib.h> void * malloc (size_t size); allocates size bytes of space void * calloc (size_t no ...
  15. JAVA virtual machine memory allocation and recovery mechanisms (transfer)

    Java to the memory divided into two kinds: One is the stack memory, heap memory, one is. In the function defin ...
  16. java heap and the stack (memory allocation strategy)

    Compiled in accordance with principles of the memory allocation strategy point of view, the program run-time m ...
  17. java memory allocation mechanisms in

    Java the memory is divided into two kinds: One is the stack memory, heap memory, one is. Defined in the functi ...
  18. C in the memory alignment problems

    Yesterday the interview, test a few channel data memory alignment problems, and now brought analyze. 1. Look a ...
  19. Frequent distribution of free memory performance problems caused by the analysis of (Reprinted)

    Phenomenon A pressure testing, found that the performance measured object is not ideal, the specific performan ...