We have the Owner’s code working. Now we need to implement the Thief’s logic, `steal_top`.
This function is called by other threads when they run out of work. They look at our deque and try to grab the oldest task.
The Thief’s steal_top
The logic here mirrors the Owner’s pop_bottom. It also needs to participate in the seq_cst handshake to handle the race for the last item.
A note on the ABA Problem: In many lock-free data structures, if a value changes from A to B and back to A, a thread might think nothing changed. Here, the CAS targets `top`, a 64-bit counter that only ever increases (`bottom` can step down during a pop, but a given counter value is never seen twice). A 64-bit counter will practically never wrap: at a billion increments per second, it would take roughly 580 years. This makes the ABA problem impossible for all practical purposes, simplifying our logic significantly compared to a pointer-based stack.
```cpp
template<typename T>
bool WSDeque<T>::steal_top(T* out) {
    // 1. Load `top`. ACQUIRE synchronizes with the CAS that last
    //    advanced it, so we start from a consistent snapshot.
    std::uint64_t t = m_top.load(std::memory_order_acquire);

    // 2. FENCE. This matches the fence in `pop_bottom`.
    //    It guarantees our load of `top` is ordered before our load of
    //    `bottom` in the single global order shared with the owner.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // 3. Load `bottom`. ACQUIRE ensures that if we see a new bottom,
    //    we also see the data written to the buffer.
    std::uint64_t b = m_bottom.load(std::memory_order_acquire);

    if (t < b) {
        // There is at least one item.
        // It is safe to read from the buffer *before* the CAS because
        // we know the item exists at index `t`.
        T item = m_buffer[t & m_mask];

        // 4. CAS. We try to increment `top` to claim the item.
        //    This is the final arbiter: if multiple thieves (or the
        //    owner) are fighting for this item, exactly one CAS succeeds.
        if (m_top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // Success! We stole the item.
            *out = item;
            return true;
        }
    }

    // Failed: either the deque was empty, or we lost the race.
    return false;
}
```
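To make the thief’s role concrete, here is a minimal sketch of a worker loop that prefers local work and falls back to stealing. It assumes the `WSDeque` from this series, with `pop_bottom` taking an out-pointer like `steal_top` does; `Task`, `execute`, and the victim list are illustrative placeholders, not part of the deque.

```cpp
#include <thread>
#include <vector>

// Sketch only. Assumes the WSDeque<T> built in this series; `Task` and
// `execute` are placeholders, and `pop_bottom(T*)` is assumed to mirror
// the `steal_top(T*)` signature shown above.
void worker_loop(WSDeque<Task>& my_deque,
                 const std::vector<WSDeque<Task>*>& victims) {
    Task task;
    for (;;) {
        // Prefer our own work: popping the bottom keeps caches hot.
        if (my_deque.pop_bottom(&task)) {
            execute(task);
            continue;
        }
        // Local deque is empty: try to steal the oldest task from a victim.
        bool stole = false;
        for (WSDeque<Task>* victim : victims) {
            if (victim->steal_top(&task)) {
                execute(task);
                stole = true;
                break;
            }
        }
        // Found nothing anywhere: back off (see the scheduler tips below).
        if (!stole) {
            std::this_thread::yield();
        }
    }
}
```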
How It All Fits Together
The correctness of this deque relies on the interactions between the fences and atomic operations.
1. Push vs. Steal (Producer-Consumer)
- Owner: Writes data -> `release` store to `bottom`.
- Thief: `acquire` load of `bottom` -> reads data.
- Result: Safe. The thief is guaranteed to see the valid data if it sees the updated index.
2. Steal vs. Steal (Thieves Fighting)
- Two thieves see the same `top`.
- Both try to CAS `top` -> `top + 1`.
- Result: The hardware guarantees only one CAS succeeds. The winner gets the task. The loser gets nothing.
3. Pop vs. Steal (The Critical Race)
- Owner: Decrements `bottom`, runs the `seq_cst` fence, checks `top`.
- Thief: Runs the `seq_cst` fence, checks `bottom`.
- Result: The `seq_cst` fences enforce a single global order:
  - If the Owner’s fence runs first, the Thief sees the decremented `bottom` and backs off.
  - If the Thief’s fence runs first, the Owner sees the old `top` (or fails the CAS) and backs off.
  - It is impossible for both to think they won the last item.
Practical Tips for the Scheduler
Now that you have a deque, you need a scheduler to use it. Here are two tips for building a robust work-stealing thread pool.
1. Random Victim Selection
When a thief needs to steal, which deque should it target?
Don’t use a complex heuristic. Just pick a random thread. It’s statistically robust and simple.
Crucial: Do not use a shared stateful random number generator (like `std::mt19937` or `rand()`) protected by a lock. That introduces the same contention we just fought to remove. Instead, use a `thread_local` generator for each thread or a stateless hash function.
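As a sketch of what that looks like, the helper below keeps a `thread_local` `std::mt19937` in each calling thread. The function name `pick_victim` and the seeding scheme are illustrative, not part of the deque.

```cpp
#include <cstddef>
#include <random>

// Sketch: one generator per thread means no locks and no shared state.
// Assumes num_threads >= 2; the seeding scheme is illustrative.
std::size_t pick_victim(std::size_t self, std::size_t num_threads) {
    thread_local std::mt19937 rng(std::random_device{}());
    // Draw from [0, num_threads - 2], then skip our own index so we
    // never try to steal from ourselves.
    std::uniform_int_distribution<std::size_t> dist(0, num_threads - 2);
    std::size_t victim = dist(rng);
    if (victim >= self) {
        ++victim;
    }
    return victim;
}
```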
2. Exponential Backoff
If a thief tries to steal from everyone and fails, the system is likely under load (or empty). Don’t spin tightly in a loop. You’ll burn CPU cycles and slow down the working threads by contending for cache lines. Implement an exponential backoff:
- Yield (`std::this_thread::yield()`).
- Sleep for 1 microsecond.
- Sleep for 2 microseconds, doubling up to a cap (e.g., 1 ms).
- Reset the backoff as soon as you find work (see the sketch below).
Conclusion
We started this series with a problem: a dynamic, unpredictable task tree that couldn’t be efficiently managed by a single shared queue. We saw how locks create bottlenecks and how false sharing silently kills performance.
By building this lock-free deque, we’ve created a structure that allows every thread to work independently on its own branch of that tree, only interacting when absolutely necessary. This pattern powers systems like Intel TBB and Go’s scheduler.