The C++ Memory Model

Data Race: Load Modify Store

  • A data race is what the memory model is there to help you understand and prevent

  • Here: a load-modify-store conflict

  • ⟶ incrementing an integer is not atomic

#include <thread>
#include <array>
#include <iostream>
#include <cassert>

int main()
{
    const unsigned long numtimes = 1000*1000*1000UL;
    unsigned long counter = 0;

    auto threadfunc = [&counter]() {
        unsigned long n = numtimes;
        while (n--)
            counter++;
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "expected " << threads.size()*numtimes << ", real " <<  counter << '\n';

    return 0;
}
$ g++ -O0 load-modify-store.cpp
$ ./a.out
expected 2000000000, real 1066359623

Data Race: Load Modify Store (Explained)

  • Innocent counter++ is

    • Load from memory

    • Increment in CPU register

    • Write back to memory
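
  • In source form, the three steps look roughly like this (a sketch of the decomposition, not actual compiler output):

unsigned long tmp = counter;                           // <-- load from memory into a register
tmp = tmp + 1;                                         // <-- increment in the CPU register
counter = tmp;                                         // <-- write back to memory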

CPU A             CPU B             Memory
Instr    Loc      Instr    Loc      Glob
load     42                         42
         42       load     42       42
inc      43                42       42
         43       inc      43       42
         43       store    43       43
store    43                43       43

  • Both CPUs write back 43 ⟶ one increment is lost (memory holds 43 instead of 44)

Optimizations

  • Your program does not run the way you wrote it

  • Compiler optimizations (see here for an overview)

  • CPU optimizations: e.g. read and write reordering

  • ⟶ the order in which writes become visible to other threads/CPUs is not what you think

  • (Similar for reads)

  • ⟶ no sequential order guaranteed

CPU Optimization: Write Reordering

  • The following code has “undefined behavior”

  • A thread that accesses the object through global_widget might see the pointer in memory before the Widget’s memory itself (i.e. before its initialization is visible)

  • This reordering is done by the CPU

Widget* w = new Widget(...);                           // <-- write: memory allocation, Widget initialization
global_widget = w;                                     // <-- write: global_widget
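
  • A common remedy (a sketch, not part of the original example): publish the pointer through a std::atomic<Widget*> with a release store and read it with an acquire load, so the Widget’s initialization becomes visible before the pointer does

#include <atomic>
#include <iostream>

struct Widget { int value = 0; };                      // <-- minimal stand-in for the real Widget

std::atomic<Widget*> global_widget{nullptr};

void publisher()
{
    Widget* w = new Widget{};                          // <-- write: memory allocation, Widget initialization
    global_widget.store(w, std::memory_order::release);// <-- write: global_widget; earlier writes visible first
}

void consumer()
{
    Widget* w = global_widget.load(std::memory_order::acquire);
    if (w)                                             // <-- if the pointer is visible ...
        std::cout << w->value << '\n';                 // <-- ... so is the initialized Widget
}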

Another Race: std::list Is Not Threadsafe

#include <list>
#include <array>
#include <thread>
#include <iostream>

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;

    auto threadfunc = [&ints](){
        unsigned long n = numtimes;
        while (n--)
            ints.push_back(666);                       // <-- non-atomic two-pointer update
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}
$ g++ -O0 list.cpp
$ ./a.out
#elements: 1476027
  • The list’s internal pointers have been updated concurrently and incorrectly

  • ⟶ A typical case for a mutex

Protecting std::list With A Mutex

#include <list>
#include <array>
#include <thread>
#include <mutex>
#include <iostream>

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;
    std::mutex ints_lock;

    auto threadfunc = [&ints, &ints_lock](){
        unsigned long n = numtimes;
        while (n--) { 
            ints_lock.lock();                          // <-- acquire
            ints.push_back(666);
            ints_lock.unlock();                        // <-- release
        }
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}
$ g++ -O0 list.cpp
$ ./a.out
#elements: 2000000
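
  • As a side note (not in the listing above): the idiomatic form uses std::lock_guard, which releases the mutex automatically, even if push_back throws

auto threadfunc = [&ints, &ints_lock](){
    unsigned long n = numtimes;
    while (n--) {
        std::lock_guard<std::mutex> guard(ints_lock);  // <-- acquire; released at end of scope
        ints.push_back(666);
    }
};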

What’s A Mutex?

  • A boolean flag, with states “open” and “closed”

  • Operations “acquire” and “release”

  • Acquire

    • Flag is “open”

      • Set it “closed”

    • Flag is “closed”

      • Enqueue yourself onto list of waiters

      • Enter the wait state atomically with enqueuing (otherwise ⟶ Lost Wakeup Problem)

      • Retry on wakeup

  • Release

    • Reset flag to “open”

    • Wake any waiters
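
  • A minimal sketch of these two operations, using C++20 std::atomic_flag wait/notify in place of a real waiter queue (toy_mutex is an illustrative name, not a standard type):

#include <atomic>

class toy_mutex
{
public:
    void lock()
    {
        while (_flag.test_and_set(std::memory_order::acquire))  // <-- already "closed"?
            _flag.wait(true, std::memory_order::relaxed);        // <-- wait until the flag changes, then retry
    }
    void unlock()
    {
        _flag.clear(std::memory_order::release);                 // <-- reset flag to "open"
        _flag.notify_one();                                      // <-- wake one waiter
    }
private:
    std::atomic_flag _flag;                                      // <-- "open" after default construction (C++20)
};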

What if there is something wrong with the flag itself ⟶ what if reads and writes of the flag are reordered like any other memory access?

This is what memory orders are for:

Memory Order

enum class memory_order
{
    relaxed,
    consume,
    acquire,                                        // <-- cf lock
    release,                                        // <-- cf unlock
    acq_rel,
    seq_cst                                         // <-- atomic ++
};
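
  • The order is passed per operation on a std::atomic object; a small sketch (relaxed is an illustrative choice for a pure event counter, not taken from the listings):

#include <atomic>

std::atomic<unsigned long> counter{0};

void count_event()
{
    counter.fetch_add(1, std::memory_order::relaxed);  // <-- counting only, no ordering of other data required
    // counter++ is equivalent to counter.fetch_add(1, std::memory_order::seq_cst)
}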

Release And Acquire Ordering

Mutex: memory/flag requirements

  • Release ordering: when the flag is released (set to “open”), the next thread to acquire it must see all writes that happened before the release

    ⟶ the CPU must not reorder earlier writes past the write of the flag

  • Acquire ordering: the acquiring side must cooperate and tag its read of the flag accordingly

    ⟶ the CPU must not reorder later reads before the read of the flag
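
  • A minimal sketch of such a release/acquire pair without a mutex (the names data and ready are made up for illustration):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                         // <-- ordinary write
    ready.store(true, std::memory_order::release);     // <-- "open": writes above become visible before the flag
}

void consumer()
{
    while (!ready.load(std::memory_order::acquire))    // <-- reads below cannot be moved before this load
        ;                                              // <-- spin until the flag is seen
    assert(data == 42);                                // <-- guaranteed to hold
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}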

Release And Acquire Ordering: A Naive Spinlock

  • Spinlock: does not put the locking thread to sleep when the lock is taken, but rather spins actively until it becomes free

    • ⟶ fewer context switches

    • Good for short critical sections

    • Not good if both contenders are scheduled on the same CPU

  • std::atomic_flag is guaranteed to be lock-free

  • std::atomic<> specializations are not

#include <list>
#include <array>
#include <thread>
#include <atomic>
#include <iostream>

class spinlock
{
public:
    void lock()
    {
        while (true) {
            bool oldstate = _flag.test_and_set(std::memory_order::acquire);
            if (oldstate == false)                     // <-- flag was "open", is now "closed" -> we hold the lock
                break;
        }
    }
    void unlock()
    {
        _flag.clear(std::memory_order::release);       // <-- send "open" to reader
    }
private:
    std::atomic_flag _flag;                            // <-- default-initialized to clear ("open") since C++20
};

int main()
{
    const unsigned long numtimes = 1000*1000UL;
    std::list<int> ints;
    spinlock ints_lock;                                // <-- mutex replacement

    auto threadfunc = [&ints, &ints_lock](){
        unsigned long n = numtimes;
        while (n--) { 
            ints_lock.lock();                          // <-- acquire
            ints.push_back(666);
            ints_lock.unlock();                        // <-- release
        }
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "#elements: " << ints.size() << '\n';

    return 0;
}

Sequential Consistency: Resolving The Load-Modify-Store Conflict

#include <atomic>
#include <thread>
#include <array>
#include <iostream>
#include <cassert>

int main()
{
    const unsigned long numtimes = 1000*1000*1000UL;
    std::atomic<unsigned long> counter = 0;

    auto threadfunc = [&counter]() {
        unsigned long n = numtimes;
        while (n--)
            counter++;
    };

    std::array threads{
        std::thread(threadfunc),
        std::thread(threadfunc),
    };
    for (auto& t: threads)
        t.join();

    std::cout << "expected " << threads.size()*numtimes << ", real " <<  counter << '\n';

    return 0;
}

And Other Memory Orders?