The C++ Memory Model

Data Race: Load Modify Store

  • A data race is what the memory model is there to help you understand and prevent

  • Here: a load-modify-store conflict

  • ⟶ incrementing an integer is not atomic

#include <thread>
#include <array>
#include <iostream>
#include <cassert>

int main()
{
    const unsigned long numtimes = 1000*1000*1000UL;
    unsigned long counter = 0;

    auto threadfunc = [&counter]() {
        unsigned long n = numtimes;
        while (n--)
            counter++;
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "expected " << threads.size()*numtimes << ", real " <<  counter << '\n';

    return 0;
}
$ g++ -O0 load-modify-store.cpp
$ ./a.out
expected 2000000000, real 1066359623

Data Race: Load Modify Store (Explained)

  • Innocent counter++ is

    • Load from memory

    • Increment in CPU register

    • Write back to memory
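
  • In source form, the three steps look roughly like this (a sketch of the decomposition, not actual compiler output):

unsigned long tmp = counter;                           // <-- load from memory into a register
tmp = tmp + 1;                                         // <-- increment in the CPU register
counter = tmp;                                         // <-- write back to memory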

CPU A             CPU B             Memory
Instr    Loc      Instr    Loc      Glob
load     42                         42
         42       load     42       42
inc      43                42       42
         43       inc      43       42
         43       store    43       43
store    43                43       43

  • Both CPUs write back 43 ⟶ one increment is lost (memory holds 43 instead of 44)

Optimizations

  • Your program does not run the way you wrote it

  • Compiler optimizations (see here for an overview)

  • CPU optimizations: e.g. read and write reordering

  • ⟶ the order in which writes become visible to other threads/CPUs is not what you think

  • (Similar for reads)

  • ⟶ no sequential order guaranteed

CPU Optimization: Write Reordering

  • The following code has “undefined behavior”

  • A thread that accesses the object through global_widget might see the pointer in memory before the Widget’s memory itself (i.e. before its initialization is visible)

  • This reordering is done by the CPU

Widget* w = new Widget(...);                           // <-- write: memory allocation, Widget initialization
global_widget = w;                                     // <-- write: global_widget
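
  • A common remedy (a sketch, not part of the original example): publish the pointer through a std::atomic<Widget*> with a release store and read it with an acquire load, so the Widget’s initialization becomes visible before the pointer does

#include <atomic>
#include <iostream>

struct Widget { int value = 0; };                      // <-- minimal stand-in for the real Widget

std::atomic<Widget*> global_widget{nullptr};

void publisher()
{
    Widget* w = new Widget{};                          // <-- write: memory allocation, Widget initialization
    global_widget.store(w, std::memory_order::release);// <-- write: global_widget; earlier writes visible first
}

void consumer()
{
    Widget* w = global_widget.load(std::memory_order::acquire);
    if (w)                                             // <-- if the pointer is visible ...
        std::cout << w->value << '\n';                 // <-- ... so is the initialized Widget
}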

Another Race: std::list Is Not Threadsafe

#include <list>
#include <array>
#include <thread>
#include <iostream>

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;

    auto threadfunc = [&ints](){
        unsigned long n = numtimes;
        while (n--)
            ints.push_back(666);                       // <-- non-atomic two-pointer update
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}
$ g++ -O0 list.cpp
$ ./a.out
#elements: 1476027
  • The list’s internal pointers have been updated concurrently and incorrectly

  • ⟶ A typical case for a mutex

Protecting std::list With A Mutex

#include <list>
#include <array>
#include <thread>
#include <mutex>
#include <iostream>

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;
    std::mutex ints_lock;

    auto threadfunc = [&ints, &ints_lock](){
        unsigned long n = numtimes;
        while (n--) { 
            ints_lock.lock();                          // <-- acquire
            ints.push_back(666);
            ints_lock.unlock();                        // <-- release
        }
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}
$ g++ -O0 list.cpp
$ ./a.out
#elements: 2000000
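
  • As a side note (not in the listing above): the idiomatic form uses std::lock_guard, which releases the mutex automatically, even if push_back throws

auto threadfunc = [&ints, &ints_lock](){
    unsigned long n = numtimes;
    while (n--) {
        std::lock_guard<std::mutex> guard(ints_lock);  // <-- acquire; released at end of scope
        ints.push_back(666);
    }
};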

What’s A Mutex?

  • A boolean flag, with states “open” and “closed”

  • Operations “acquire” and “release”

  • Acquire

    • Flag is “open”

      • Set it “closed”

    • Flag is “closed”

      • Enqueue yourself onto list of waiters

      • Enter the wait state atomically with enqueuing (otherwise ⟶ Lost Wakeup Problem)

      • Retry on wakeup

  • Release

    • Reset flag to “open”

    • Wake any waiters
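
  • A minimal sketch of these two operations, using C++20 std::atomic_flag wait/notify in place of a real waiter queue (toy_mutex is an illustrative name, not a standard type):

#include <atomic>

class toy_mutex
{
public:
    void lock()
    {
        while (_flag.test_and_set(std::memory_order::acquire))  // <-- already "closed"?
            _flag.wait(true, std::memory_order::relaxed);        // <-- wait until the flag changes, then retry
    }
    void unlock()
    {
        _flag.clear(std::memory_order::release);                 // <-- reset flag to "open"
        _flag.notify_one();                                      // <-- wake one waiter
    }
private:
    std::atomic_flag _flag;                                      // <-- "open" after default construction (C++20)
};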

What if there is something wrong with the flag itself ⟶ what if reads and writes of the flag are reordered like any other memory access?

This is what memory orders are for:

Memory Order

enum class memory_order
{
    relaxed,
    consume,
    acquire,                                        // <-- cf lock
    release,                                        // <-- cf unlock
    acq_rel,
    seq_cst                                         // <-- atomic ++
};
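
  • The order is passed per operation on a std::atomic object; a small sketch (relaxed is an illustrative choice for a pure event counter, not taken from the listings):

#include <atomic>

std::atomic<unsigned long> counter{0};

void count_event()
{
    counter.fetch_add(1, std::memory_order::relaxed);  // <-- counting only, no ordering of other data required
    // counter++ is equivalent to counter.fetch_add(1, std::memory_order::seq_cst)
}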

Release And Acquire Ordering

Mutex: memory/flag requirements

  • Release ordering: when the flag is released (set to “open”), the next thread to acquire it must see all writes that happened before the release

    ⟶ the CPU must not reorder earlier writes past the write of the flag

  • Acquire ordering: the acquiring side must cooperate and tag its read of the flag accordingly

    ⟶ the CPU must not reorder later reads before the read of the flag
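
  • A minimal sketch of such a release/acquire pair without a mutex (the names data and ready are made up for illustration):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                         // <-- ordinary write
    ready.store(true, std::memory_order::release);     // <-- "open": writes above become visible before the flag
}

void consumer()
{
    while (!ready.load(std::memory_order::acquire))    // <-- reads below cannot be moved before this load
        ;                                              // <-- spin until the flag is seen
    assert(data == 42);                                // <-- guaranteed to hold
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}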

Release And Acquire Ordering: A Naive Spinlock

  • Spinlock: does not put the locking thread to sleep when the lock is taken, but rather spins actively until it becomes free

    • ⟶ fewer context switches

    • Good for short critical sections

    • Not good if both contenders are scheduled on the same CPU

  • std::atomic_flag is guaranteed to be lock-free

  • std::atomic<> specializations are not

#include <list>
#include <array>
#include <thread>
#include <atomic>
#include <iostream>

class spinlock
{
public:
    void lock()
    {
        while (true) {
            bool oldstate = _flag.test_and_set(std::memory_order::acquire);
            if (oldstate == false)                     // <-- flag was "open", is now "closed" -> we hold the lock
                break;
        }
    }
    void unlock()
    {
        _flag.clear(std::memory_order::release);       // <-- send "open" to reader
    }
private:
    std::atomic_flag _flag;                            // <-- default-initialized to clear ("open") since C++20
};

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;
    spinlock ints_lock;                                // <-- mutex replacement

    auto threadfunc = [&ints, &ints_lock](){
        unsigned long n = numtimes;
        while (n--) { 
            ints_lock.lock();                          // <-- acquire
            ints.push_back(666);
            ints_lock.unlock();                        // <-- release
        }
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}

Sequential Consistency: Resolving The Load-Modify-Store Conflict

#include <atomic>
#include <thread>
#include <array>
#include <iostream>
#include <cassert>

int main()
{
    const unsigned long numtimes = 1000*1000*1000UL;
    std::atomic<unsigned long> counter = 0;

    auto threadfunc = [&counter]() {
        unsigned long n = numtimes;
        while (n--)
            counter++;
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "expected " << threads.size()*numtimes << ", real " <<  counter << '\n';

    return 0;
}

And Other Memory Orders?