Global synchronisation in OpenCL

There is a lot to recommend OpenCL for GPGPU programming, but there is not a lot of sympathy for people trying to synchronise between workgroups.

The problem

Officially speaking, work items (I'll use the CUDA term "threads" from now on because it's a bit more intuitive) within a single workgroup can talk to each other all they like through local and global memory, and can synchronise within that workgroup using calls to barrier(...) as required. Workgroups, however, can't talk to each other, so the only opportunity for global synchronisation comes when the kernel terminates and control returns to the host environment.
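
For reference, intra-workgroup synchronisation looks something like this (a throwaway illustration, not part of the code later in this post):

__kernel void
reverse_within_workgroup(__global int *data, __local int *scratch)
{
    const size_t lid = get_local_id(0);

    // Stage a value in fast local memory ...
    scratch[lid] = data[get_global_id(0)];

    // ... wait until every thread in this workgroup has written ...
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... then safely read a value written by a different thread
    data[get_global_id(0)] = scratch[get_local_size(0) - lid - 1];
}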

This is done on purpose, of course, because although OpenCL is commonly used as a wrapper for GPU programming, it's designed with a view to massively parallel computation across devices, potentially mixing GPUs, CPUs, a herd of sharks with electrocortexes fitted, whatever you have lying around. It allows other sexy assumptions that make programming easier, too -- since each workgroup executes its batch of work independently, in no guaranteed order, we can assign N batches of work to M compute units even when N is much greater than M. It just doesn't matter: even with a single compute unit we will get all the work done eventually, with new work dispatched as compute units become available.

The standard advice is, quite rightly, to split work into separate kernels and synchronise in the host code. Absolutely all the time this is completely the correct thing to do. Global synchronisation is completely the wrong thing to do and you're a bad person for even thinking about it; you're not getting into the spirit of OpenCL; you should just restructure the code with the workgroup paradigm more carefully in mind, and if that doesn't work then you should get away from the computer immediately and take up goat herding instead.
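
For the record, the virtuous approach amounts to something like this on the host side (a sketch with setup and error checking omitted; queue, kernel_pass1, kernel_pass2 and the sizes are placeholders):

// With an in-order queue, the second pass cannot start until every
// workgroup of the first has finished: a free global synchronisation
// point, at the cost of a kernel launch.
clEnqueueNDRangeKernel(queue, kernel_pass1, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel_pass2, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
clFinish(queue);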

BUT ... it sure is tempting to take advantage of the relatively high speeds of private and local memory compared to global memory! The super-fast L1 and L2 caches get wiped out between kernel invocations; once again, this follows naturally from the idea that workgroups are not necessarily executing in parallel. Global memory access is slooooow and generally a bottleneck, so it does seem like a heck of a waste to blow away our cache if we only need synchronisation to compute a couple of values, as in many iterative matrix calculations.

What if we can guarantee that we have enough compute units that all the work can be done in parallel? If we hack our computation environment to ensure that everything can run simultaneously, then with the stars aligned everything may just run at the same time and maybe we can get some work done. We're losing portability, but we already know we're bad programmers, so let's try it anyway. The worst possibility is that the hardware doesn't support the hack, and the GPU locks up because a thread is deadlocked waiting on something that it's supposed to do itself, and you have to reboot. If that's the worst thing that happens to you today then you're living a blessed life anyway.
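
Checking what we actually have is easy enough (a sketch; device is assumed to be the cl_device_id you're already working with):

cl_uint compute_units;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);

// One workgroup per compute unit is a conservative guess at what can
// be resident simultaneously; launching more than that risks deadlock.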

The plausible solutions

Jnana Panuganti provides a nice survey of approaches to this problem, along with OpenCL implementations (and benchmarks) of some CUDA code from a couple of existing research papers.

Efficient Synchronization Primitives for GPUs describes several mutex-based solutions with an API and a lot of measurements of efficiencies on some common devices:

  • spin-lock (loop until an atomic read-swap-store operation succeeds; sketched after this list)
  • back-off (a spin-lock with a memory access thrown in to introduce a delay between attempts)
  • fetch-and-add (a first-come-first-served extension of the spin-lock using an atomic increment operation)
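
A minimal spin-lock along these lines might look like the following (a sketch rather than the paper's actual API; it assumes the cl_khr_global_int32_base_atomics extension and a mutex initialised to 0 by the host):

static void
lock(volatile __global int *mutex)
{
    // atom_xchg atomically writes a 1 and returns the old value;
    // we hold the lock when the old value was 0
    while (atom_xchg(mutex, 1) != 0) ;
}

static void
unlock(volatile __global int *mutex)
{
    atom_xchg(mutex, 0);
}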

A different approach altogether is described in Inter-Block GPU Communication via Fast Barrier Synchronization, which uses barrier(...) calls to synchronise accesses to global memory. This doesn't require any atomic operations, and while it does require busy-waiting, this is restricted to only a few threads. Panuganti's benchmarks indicate that this is certainly the most lightweight and scalable of the considered techniques, so let's look at it more carefully.

Decentralised barrier synchronisation

Consider this simple synchronisation technique (sketched in code after the list):

  1. Initialise a global variable counter to the total number of threads
  2. Each thread does an atomic decrement on counter
  3. Each thread loops until counter reaches 0
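
As a sketch (again assuming the cl_khr_global_int32_base_atomics extension, with counter initialised by the host as described):

static void
naive_global_sync(volatile __global int *counter)
{
    // Check in ...
    atom_dec(counter);

    // ... then busy-wait until every other thread has checked in too
    while (*counter != 0) ;
}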

Intuitively this seems a little wasteful since we already have the ability to synchronise threads within a single workgroup, so perhaps we can make use of that (again sketched below):

  1. Initialise a global variable counter to the total number of workgroups
  2. One thread (the "master thread") per workgroup does an atomic decrement on counter
  3. Each master thread loops until counter reaches 0
  4. All threads wait on a memory barrier
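
In code, roughly (a sketch under the same assumptions as before, with counter initialised by the host to the number of workgroups):

static void
centralised_global_sync(volatile __global int *counter)
{
    // Only the master thread of each workgroup checks in ...
    if (get_local_id(0) == 0) {
        atom_dec(counter);

        // ... and busy-waits until every other workgroup has too
        while (*counter != 0) ;
    }

    // The rest of the workgroup waits for its master thread here
    barrier(CLK_GLOBAL_MEM_FENCE);
}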

This centralised barrier synchronisation does work, but it still requires that atomic operation (and competition for the single memory location), which is pretty slow, particularly on the Nvidia Tesla available with Amazon's EC2 GPU instance. What if we eliminate the competition for that memory location by allocating a global array of flags with one flag per workgroup:

  1. The master thread in the Ith workgroup sets the flag in the Ith position of flags
  2. The Ith thread in the master workgroup waits until the Ith flag is set
  3. All threads in the master workgroup wait on a global memory barrier
  4. The Ith thread in the master workgroup clears the flag in the Ith position of flags
  5. The master thread in the Ith workgroup waits until the Ith flag is cleared
  6. All threads in each workgroup wait on a global memory barrier, taking care of intra-group synchronisation

For simplicity, the "master thread" is thread #0 in each workgroup, and the "master workgroup" is workgroup #0. We can implement the waits in steps 2 and 5 with busy-waits. Here's the code:

static void
global_sync(volatile __global int *flags)
{
    const size_t thread_id = get_local_id(0);
    const size_t workgroup_id = get_group_id(0);

    // Step 1: the master thread of each workgroup raises its flag
    if (thread_id == 0) {
        flags[workgroup_id] = 1;
    }

    if (workgroup_id == 0) {
        // Step 2: thread I of the master workgroup waits for flag I ...
        if (thread_id < get_num_groups(0)) {
            while (flags[thread_id] != 1) ;
        }
        // Step 3: ... then the master workgroup synchronises internally
        barrier(CLK_GLOBAL_MEM_FENCE);

        // Step 4: thread I of the master workgroup clears flag I
        if (thread_id < get_num_groups(0)) {
            flags[thread_id] = 0;
        }
    }

    // Step 5: each master thread waits for its flag to be cleared ...
    if (thread_id == 0) {
        while (flags[workgroup_id] != 0) ;
    }
    // Step 6: ... and every workgroup synchronises internally
    barrier(CLK_GLOBAL_MEM_FENCE);
}

We do assume that each entry in the flags array is not equal to 1 going in; a good way to ensure that is to allocate a zeroed array in the first place. After synchronisation the flags are all reset to 0, so the array is ready to go again. Note also that this code assumes that the number of workgroups is no greater than the number of threads per workgroup, but if you're doing all this synchronisation work it would be pretty weird not to max out the number of threads per workgroup.
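
Host-side, that zeroed allocation might look like this (a sketch; context and num_groups are placeholders):

int *zeroes = calloc(num_groups, sizeof(int));
cl_mem flags = clCreateBuffer(context,
                              CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              num_groups * sizeof(int), zeroes, NULL);
free(zeroes);

// The buffer starts zeroed and global_sync(...) leaves it zeroed,
// so the same buffer can be reused across launches as-is.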

We can test the code by introducing some delays on a per-thread basis, easily done using (for instance) some global atomic operation:

// Remember the global atomic extension pragma:
// #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void
test_sync(
    __global int *scratch,
    __global int *tmp,
    __global int *out,
    __global int *flags
)
{
    int id = get_global_id(0);

    // Every thread finishes the atomic op at different times
    atom_add(scratch, id);

    // Do some work dependent on each workgroup ...
    tmp[id] = get_group_id(0);

    // ... synchronise ...
    global_sync(flags);

    // ... do some work that depends on a different workgroup
    out[id] = tmp[get_global_size(0) - id - 1];
}


__kernel void
test_nosync(
    __global int *scratch,
    __global int *tmp,
    __global int *out,
    __global int *flags
)
{
    // The same as test_sync(...) but without synchronisation

    int id = get_global_id(0);
    atom_add(scratch, id);
    tmp[id] = get_group_id(0);
    out[id] = tmp[get_global_size(0) - id - 1];
}
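
For completeness, the host-side launch for either kernel might look something like this (a sketch with buffer creation omitted; 10 workgroups of 16 threads each match the output below):

const size_t local_size = 16;                       // threads per workgroup
const size_t num_groups = 10;                       // must not exceed local_size!
const size_t global_size = local_size * num_groups;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &scratch);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &tmp);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &out);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &flags);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
clFinish(queue);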

Fire up the kernels with several workgroups and several threads per workgroup (making sure at all times that there are at least as many threads per workgroup as there are workgroups!), and we can see the difference:

test_sync:

9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

test_nosync:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 6 6 0 0 0 0 0 6 6 6 0 0 0 0 
5 5 5 5 0 0 0 0 5 5 5 5 0 0 0 0 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
3 3 3 3 0 0 0 0 3 3 3 3 3 3 3 3 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We see that without synchronisation, the workgroups with higher IDs tend not to have executed tmp[id] = get_group_id(0) before that value is used by the workgroups with lower IDs in the following line. This also gives us some minor insight into the vagaries of the scheduling and dispatch systems at the heart of OpenCL.

Final thoughts

For certain memory usage patterns, global synchronisation may offer some very significant efficiencies, but if you're going to use it you should be well aware that it's a dirty, dirty hack. It is very much at odds with the OpenCL design toiled over by the good folks at Khronos, and by the same token there's little chance that they will introduce a "proper way" of doing global synchronisation within a kernel in the future.

Bearing that in mind it should come as no surprise if it fails horribly on a particular device. The approach described here is relatively benign but I still wouldn't assume that it is likely to work on older hardware. Nvidia seems to have a good reputation for playing nicely with those who play fast and loose between workgroups, and there are tantalising hints even in AMD's own documentation that inter-group communication is plausible (albeit with those newer global atomic operations).

Most importantly, watch out for the ways in which this method can deadlock your system: the main failure modes are allocating more workgroups than threads per workgroup, and allocating more workgroups than can be executed simultaneously on your hardware. It should come as no surprise that I had to reboot this computer a few times while testing this code!

tomm


2 Responses to "Global synchronisation in OpenCL"

  1. Unknown says:
    1 January 2016 at 09:44

    Hi, TOMM.

    After reading your article and the original paper, I've decided to test the code described by you and the other authors. All of them seem pretty simple and easy to implement, but I have two questions that I hope you can answer:

    1. Your code is slightly different from the one presented in “Section B. GPU Lock-Free Synchronization”. Why? It seems to me that the original version, using two arrays, is the right way to implement the barrier. Why did you change it?
    2. I'm using an R9 290X to test both versions and, sadly, they freeze up my GPU all the time. I followed all the instructions you gave in your article but I'm still unable to properly run the code. I'm sure I'm doing something stupid but I can't see what. Can you help me?

    Thanks in advance.

  2. jlgreathouse says:
    24 March 2016 at 22:44

    Walid, to answer your second question: the reason the example code deadlocks your R9 290X is that the L1 caches in the compute units of AMD GCN GPUs are not coherent with one another.

    What this means is that the accesses to the flags[] array within global_sync() will not be coherent between workgroups. By way of example: if workgroup X gets to "while (flags[thread_id] != 1) ;" and reads flags[thread_id] as zero, that value will be cached in its compute unit's L1 cache. Later, workgroup Y writes a non-zero value into that same flags[thread_id] location. However, workgroup X is still spin-looping on its locally cached value. It will never see the new value, and will thus spin-loop forever.

    The way to get around this on AMD hardware is to replace accesses to synchronization variables like this with atomic functions.

    In other words, if you replace the four accesses to flags[] within the global_sync() function with atomics, this should work. For example:
    Instead of "flags[workgroup_id] = 1;" you could do "atomic_or(&flags[workgroup_id], 1);"
    Instead of "while (flags[thread_id] != 1) ;", you could do "while (atomic_or(&flags[thread_id], 0) != 1);".

    This should work on AMD GCN GPUs. However, I think this also demonstrates what's mentioned in the article: global synchronization in OpenCL 1.x is dangerous and non-portable.
