
# Concurrent Programming

#lecture note based on 15-213 Introduction to Computer Systems

## Classical problems in concurrent programmes

  • Race - what happens when the scheduler does arbitrary things
    • e.g. multiple threads incrementing the same counter
  • Deadlock - resource allocation may prevent progress (in any thread)
    • e.g. printf in a signal handler: the process acquires printf’s lock, and the handler may interrupt it and try to acquire the same lock again
  • Livelock / Starvation / Fairness - external events prevent progress, such as the scheduler keeping you waiting in line
    • Livelock looks like making progress but never escaping from some loop, like threads requesting the same resource back and forth but none ever getting to use it because of the interference
    • Starvation is when something keeps using some resource, so others are stuck
    • e.g. a side street stuck when there is heavy traffic on the main street

## Concurrent Server Approaches

  1. Process based - the common solution in the past - e.g. create a child process to serve each client
    • Make a child process, kernel takes care of interleaving
    • Private address space
    • See Processes
  2. Event based
    • Programmer manually interleaves
    • Same address space
    • I/O multiplexing
  3. Thread based
    • Same address space
    • Kernel interleaves
    • See Threads below

### Example problem: server

If a non-concurrent server is already connected to a client, a second client gets stuck on read after connecting and sending some write. Worse still, the first client can wait without closing its connection, which leaves the server stuck…

  1. Process based
    • Pros
      • Easy code :) - see 30
      • Need to clean up resources :)
    • Cons
      • Can’t share data between processes :(
        • a child can only report back via an int exit status (and only the low 8 bits of that int survive)
      • Forking is expensive :(
  2. Event based
    • Idea: server keeps a set of active connections (a connfd array) - see the select() sketch after this list
    • Repeat:
      • Check which descriptors have pending input (ask the kernel)
      • If the listening descriptor is ready, accept the connection
      • Respond to all the pending inputs
    • Pros
      • One address space
      • Easier to debug
      • Cheaper because no forking
    • Cons
      • More complex to implement
      • Can’t use multicore
      • Less control over concurrency at e.g. partial HTTP request headers
        • Need to hold on to the partial data and switch to another request…
  3. Thread based
    • Pros
      • Less expensive than a process to create and reap
      • Easy data sharing
    • Cons
      • Very easy to get wrong
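
A minimal sketch of the event-based loop above, using select() to ask the kernel which descriptors have pending input. The listening descriptor is assumed to be set up elsewhere, the echo behaviour is illustrative, and error handling is elided:

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listenfd) {
    int connfd[FD_SETSIZE];              /* set of active connections */
    for (int i = 0; i < FD_SETSIZE; i++)
        connfd[i] = -1;                  /* -1 marks a free slot */

    while (1) {
        /* build the set of descriptors we care about */
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(listenfd, &readfds);
        int maxfd = listenfd;
        for (int i = 0; i < FD_SETSIZE; i++)
            if (connfd[i] >= 0) {
                FD_SET(connfd[i], &readfds);
                if (connfd[i] > maxfd)
                    maxfd = connfd[i];
            }

        /* ask the kernel which descriptors have pending input */
        select(maxfd + 1, &readfds, NULL, NULL, NULL);

        /* if the listening descriptor is ready, accept the connection */
        if (FD_ISSET(listenfd, &readfds))
            for (int i = 0; i < FD_SETSIZE; i++)
                if (connfd[i] < 0) {
                    connfd[i] = accept(listenfd, NULL, NULL);
                    break;
                }

        /* respond to all the pending inputs (here: echo) */
        for (int i = 0; i < FD_SETSIZE; i++)
            if (connfd[i] >= 0 && FD_ISSET(connfd[i], &readfds)) {
                char buf[1024];
                ssize_t n = read(connfd[i], buf, sizeof buf);
                if (n <= 0) {            /* EOF or error: drop connection */
                    close(connfd[i]);
                    connfd[i] = -1;
                } else {
                    write(connfd[i], buf, n);
                }
            }
    }
}
```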

## Threads

  • Recall
    • Process
      • context + code + data + stack
      • parent-child hierarchy
    • Signals
      • interrupting execution
      • shares state
      • returns to normal execution
      • so not a peer thread executing!
  • Thread
    • Independent for each thread:
      • thread context - private to the thread - own copy of:
        • registers
        • condition codes
        • stack pointer
        • programme counter
      • nominally its own stack, so local variables too - but the stack is NOT actually private
        • the kernel keeps thread stacks separated by some distance so they are less likely to overflow into each other
      • thread id (TID)
    • Shared among threads
      • kernel context
        • VM structure
        • file descriptor table
        • brk pointer
      • shared lib
      • heap
      • read/write data
      • read-only data and code
    • No hierarchy: threads are peers associated with the process (we usually designate one as the main thread)
    • Different threads can run in parallel on different cores

### POSIX thread interface

  • pthread_create - like fork
  • pthread_join - like waitpid
  • pthread_self - get TID
  • pthread_exit - done with thread (note calling exit exits the process)
  • pthread_mutex_lock
  • pthread_mutex_unlock
  • pthread_detach(pthread_self()) - make the thread not joinable
    • by default, threads need to be joined
    • detach to tell kernel this thread will not get joined
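
A minimal create/join sketch of the interface above (compile with -pthread; the cast of pthread_self() to unsigned long is a common but non-portable trick for printing):

```c
#include <pthread.h>
#include <stdio.h>

/* thread routine: takes and returns a void pointer */
void *hello(void *vargp) {
    printf("Hello from thread %lu\n", (unsigned long)pthread_self());
    return NULL;  /* same as calling pthread_exit(NULL) here */
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, hello, NULL);  /* like fork */
    pthread_join(tid, NULL);                  /* like waitpid */
    return 0;
}
```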

### Challenges with threads

  • Unintended sharing - multiple peer threads pointing to the same thing in the main thread’s stack, like an &connfd on a peer thread’s stack pointing at connfd on the main stack
  • Race condition
    • e.g. a thread reading a value that may have changed by the time it runs
      • possible solution: maybe allocate a separate argument for each thread
    • e.g. … (many others)

### Passing arguments to a thread

  • Malloc and free (sketched below)
    • producer mallocs, puts the argument in, and makes the thread
    • consumer gets the data and frees it
  • Casting an int
    • producer fits all the info into a void pointer and calls pthread_create on that
    • consumer casts it back and interprets it
  • Pointing to a stack slot - bad
    • producer puts the data in its own stack
    • consumer looks at the producer’s stack
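
A minimal sketch of the malloc-and-free pattern (the 42 stands in for a real value, such as a connfd from accept):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *consumer(void *vargp) {
    int connfd = *(int *)vargp;  /* consumer gets the data... */
    free(vargp);                 /* ...and frees it */
    printf("got %d\n", connfd);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int *argp = malloc(sizeof(int));  /* producer mallocs a private slot */
    *argp = 42;                       /* puts the argument in */
    pthread_create(&tid, NULL, consumer, argp);  /* and makes the thread */
    pthread_join(tid, NULL);
    return 0;
}
```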

### Shared Variables

A variable is shared iff multiple threads have a reference to it.

Since thread stacks are not private… we can have many problems

Variable mapping:

  • Global
    • VM has exactly one instance
  • Local automatic variable (those without static)
    • Each thread has one instance of each
  • Local static variable
    • VM has exactly one instance, like a global
    • So one copy shared by all threads!
  • Exceptions
    • errno can be unique per thread

### Synchronising Threads

Potential problem: a thread reads from memory, changes the value, and writes back… but maybe two threads each read, each change the value, and each write back… and things break (one update is lost)
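
A minimal sketch of this lost-update race, in the style of the course’s counter example (NITERS is arbitrary; cnt++ is really a load, an add, and a store, so interleavings can lose increments):

```c
#include <pthread.h>
#include <stdio.h>

#define NITERS 1000000
volatile long cnt = 0;  /* shared, unprotected */

void *count(void *vargp) {
    for (long i = 0; i < NITERS; i++)
        cnt++;          /* load, add, store: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, count, NULL);
    pthread_create(&t2, NULL, count, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("cnt = %ld (want %d)\n", cnt, 2 * NITERS);  /* usually less */
    return 0;
}
```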

Fixing races: enforcing synchronisation

#### Interleaving graph

![[Pasted image 20230808221601.png]]

#### Mutex - Mutual exclusion

  • Usually called a lock
  • Prevents the unsafe regions on interleaving graphs
  • operations
    • lock(m)
      • lock it if not locked
      • else, wait for some other thread to unlock and lock
    • unlock(m)
      • unlock
  • Slow
  • Doesn’t solve everything!
    • a pthread_mutex is not allowed to be locked and unlocked from different threads, by design
      • e.g. a parent incrementing a counter and forking, passing the lock across (this could be solved via a semaphore instead)
    • not all races involve threads
      • TOCTOU bug (time of check to time of use): e.g. check a file’s permission then open the file, but someone (another user, another process…) may have changed the file in between
        • Solution: don’t check before use, or synchronise the whole (check, use) block
      • Signal handling races
  • Edge cases
    • Double lock on same thread -> deadlock
    • Double unlock -> undefined
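
With a mutex, the counter race from the sketch above goes away; this is a drop-in replacement for its count() routine:

```c
#include <pthread.h>

extern volatile long cnt;  /* the shared counter from the race sketch */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *count(void *vargp) {
    for (long i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* waits if another thread holds m */
        cnt++;                     /* now a critical section */
        pthread_mutex_unlock(&m);  /* unlock from the same thread only */
    }
    return NULL;
}
```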

#### Semaphore - generalised mutex

  • unsigned int, but not for math
  • operations on some resource $s$
    • $P(s)$ “Prolaag” (reduce)
      • If $s$ is 0, wait for a $V$ operation to happen
      • Then subtract 1 from $s$ and return
      • sem_wait(&lock);
    • $V(s)$ “Verhogen” (increase)
      • Add 1 to $s$
      • If any thread is waiting inside $P$, resume one of them
      • sem_post(&lock);
  • Like a resource count
    • Mutex is just a resource count of 0 or 1
    • But does not have an interface for getting the current resource count
    • No ordering requirement between $P$ and $V$ calls
  • Even slower than a mutex
  • Use example
    • Give each thread a semaphore; the main thread can check whether they made progress by consuming the semaphore, or the other way around

##### Queues, Producers, Consumers

A synchronisation pattern to implement with semaphore

  • Sync pattern
    • Producer(s) put things in the queue and notify the consumer
    • Consumer waits for a thing, removes it from the queue, and notifies the producer
  • Use cases
    • Multi-threaded media processing
  • Implementations (the n-entry version is sketched below)
    • 1 entry queue
      • full and empty semaphores, keeping track of the full/empty slot
    • n entry queue
      • slots to count available slots in the queue
      • items to count available items in the queue
      • mutex to make sure only one thread accesses the queue at a time
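
A minimal sketch of the n-entry queue, modelled on CS:APP’s sbuf package (fixed capacity N, int items, error handling elided):

```c
#include <semaphore.h>

#define N 16  /* queue capacity (illustrative) */

typedef struct {
    int buf[N];       /* the queue itself */
    int front, rear;  /* buf[(front+1) % N] is the first item */
    sem_t mutex;      /* only one thread touches the queue at a time */
    sem_t slots;      /* counts available slots */
    sem_t items;      /* counts available items */
} sbuf_t;

void sbuf_init(sbuf_t *sp) {
    sp->front = sp->rear = 0;
    sem_init(&sp->mutex, 0, 1);
    sem_init(&sp->slots, 0, N);  /* initially N empty slots */
    sem_init(&sp->items, 0, 0);  /* and zero items */
}

/* producer: put a thing in the queue and notify a consumer */
void sbuf_insert(sbuf_t *sp, int item) {
    sem_wait(&sp->slots);               /* wait for a free slot */
    sem_wait(&sp->mutex);
    sp->buf[(++sp->rear) % N] = item;
    sem_post(&sp->mutex);
    sem_post(&sp->items);               /* announce the new item */
}

/* consumer: wait for a thing, remove it, and notify a producer */
int sbuf_remove(sbuf_t *sp) {
    sem_wait(&sp->items);               /* wait for an available item */
    sem_wait(&sp->mutex);
    int item = sp->buf[(++sp->front) % N];
    sem_post(&sp->mutex);
    sem_post(&sp->slots);               /* announce the free slot */
    return item;
}
```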

#### Read Write Lock

  • Problem: most threads read, occasionally something writes
  • Implementation (sketched after this list)
    • pthread_rwlock_t type
    • pthread_rwlock_rdlock(pthread_rwlock_t *rwlock) to get a read lock
    • pthread_rwlock_wrlock(pthread_rwlock_t *rwlock) to get the write lock
    • pthread_rwlock_unlock(pthread_rwlock_t *rwlock) to release either lock
  • Use cases
    • Many threads reading the same data
    • Reservation system that reads most of the time
  • Problems
    • Fairness
      • many readers keep reading, writer starve
      • many writers keep writing, readers starve
    • Need algorithms to ensure fairness (i.e. no starvation)
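
A small usage sketch of the interface (the table and the lookup/update names are illustrative):

```c
#include <pthread.h>

static int table[100];
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

int lookup(int i) {                  /* many readers can hold the lock */
    pthread_rwlock_rdlock(&rwlock);
    int v = table[i];
    pthread_rwlock_unlock(&rwlock);
    return v;
}

void update(int i, int v) {          /* a writer gets exclusive access */
    pthread_rwlock_wrlock(&rwlock);
    table[i] = v;
    pthread_rwlock_unlock(&rwlock);
}
```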

#### Atomic memory operations - hardware level

Telling the compiler that an operation should be done in a single instruction, with a lock prefix:

```c
static _Atomic unsigned long cnt = 0;
cnt++;  /* compiles to one locked instruction */
```

into assembly:

```asm
lock addq $1, cnt(%rip)
```

### Thread safe APIs

thread safe - a function that always produces correct results when called from multiple concurrent threads

Making things thread safe:

  • Wrapping things inside an internal lock
    • Slows things down
  • Things like rand have an internal seed…
    • -> Make the caller responsible for allocating the state
  • Things like itoa have an internal buffer for the result string
    • -> Make the caller provide the buffer
  • If a function calls some thread-unsafe function… then it’s also thread unsafe

A reentrant function is a thread-safe function that doesn’t share state with calls from other threads

  • rand_r - reentrant, because every call works on its own independent state
  • malloc - not reentrant; it’s thread safe just by using a lock

reentrant functions $\subset$ thread safe functions
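
A sketch contrasting the two (the die-rolling is illustrative): rand keeps a hidden global seed, so calls from different threads share state; rand_r is reentrant because the caller allocates and owns the seed:

```c
#include <stdlib.h>

int roll(void) {
    return rand() % 6 + 1;         /* hidden shared seed: thread unsafe */
}

int roll_r(unsigned int *seedp) {
    return rand_r(seedp) % 6 + 1;  /* caller-owned seed: reentrant */
}
```

Each thread declares its own `unsigned int seed` and passes &seed, so no state is shared across threads.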

## Thread Parallelism

There are actually multiple cores in the processor… and things can get complicated

First, different cores may have different caches:

![[Pasted image 20230808222057.png]]

And hyperthreading in each core…

![[Pasted image 20230808222806.png]]

  • Instruction streams
    • the CPU takes instructions and puts them into operation streams
    • operation streams go into functional units (viz. cores) and execute in parallel
  • Hyperthreading
    • each functional unit can have multiple independent operation queues and register sets
    • the scheduler figures out how to interleave threads on the same core safely
  • Not just about gigahertz
    • out-of-order processor structure
      • how well the CPU predicts branches
    • cache performance

### Coherence and consistency

  • memory consistency - does memory behave well when running on multiple cores?
    • Without coordination between caches, core 1 doesn’t see core 2’s cached copy when some var is updated by the thread running on core 2.
    • Some tagging mechanism helps get bytes from one cache to another:
      • invalid - don’t use the value
      • shared - readable copy
      • modified - writable copy
  • sequential consistency - is the execution order as written in each thread?
    • Hardware can swap operation order as long as it still looks sequentially consistent on a single thread!

A good memory model:

  • sequentially consistent
    • viz. each thread in order, and threads interleave any way
  • cache & memory behave correctly
  • good intra-thread ordering constraints
  • good synchronisation

### Thread programming disclaimers

You are responsible for writing race-free code:

```
// thread 1
X = 1
if (Y == 0) print Hello

// thread 2
Y = 1
if (X == 0) print World
```

This can print Hello World!!

> Quantum physics of computer science
> — Prof Railing

What happened? Do the threads have different caches? Nope, the Hello World thing happens even without caches.

Hardware optimisation might have switched the order of the read and write operations: they look the same on a single thread, but can lead to superposition-like weird behaviour across threads. Buffering a read before a write is 20-40 times faster, so x86 does this often. (ARM is even worse, as it reorders almost every memory operation. Hardware only ensures the order makes sense on a single thread.) Therefore memory is NOT sequentially consistent.

Solution: use synchronisation

### Parallel programming examples

#### Summing many numbers

Try having all threads share the same global counter?

  1. Make lots of threads and have them all increment it
    • 3 times faster on 16 threads, but plateaus as thread count increases
      • Why? The threads aren’t sharing a cache… a 4th type of cache miss (coherence miss) - we need to go to another core’s cache to get the counter value
    • Wrong answer
  2. Mutex
    • 300 times slower…
    • Correct
  3. Semaphore
    • 900 times slower…
    • Correct

Bad. Accumulate into different places in memory instead. Solutions:

  • Adjacent memory accumulators - threads don’t write to the same place - correct and faster
    • false sharing - hardware still thinks the cache line is shared, not because the threads actually share bytes but because their bytes are in the same block
    • note we do want true sharing for good locality.
  • Spaced memory accumulators - make sure the cache blocks don’t overlap - even faster if we manage to store each value in a different block

Even better: put the accumulator inside a register (sketched below)

  • 2x faster than spaced memory
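
A sketch of that fastest variant (thread count, problem size, and the range split are illustrative): each thread sums into a local variable, which the compiler keeps in a register, and writes to memory only once at the end:

```c
#include <pthread.h>

#define NTHREADS 16
#define N (1L << 30)   /* total count; assumed divisible by NTHREADS */

long psum[NTHREADS];   /* one result slot per thread */

void *sum_thread(void *vargp) {
    long myid = (long)vargp;
    long lo = myid * (N / NTHREADS), hi = lo + N / NTHREADS;
    long sum = 0;       /* local accumulator: lives in a register */
    for (long i = lo; i < hi; i++)
        sum += i;
    psum[myid] = sum;   /* a single write to memory at the end */
    return NULL;
}
```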

#### Parallel sort

Recall sequential quicksort: pick a pivot, partition, sequentially sort each half

  • Parallel:
    • model
      • some thread produces sorting tasks for array segments
      • a pool of threads consumes the tasks and does the sorting
    • procedure (sketched below)
      • If the array is short
        • Making a thread has overhead (the threshold may depend on the machine), so do a sequential sort
      • Else
        • pick a pivot
        • partition
        • sort each half in a different thread
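
A spawn-and-join sketch of the procedure (the lecture’s model uses a task queue plus a thread pool instead; MIN_PAR is a made-up, machine-dependent threshold):

```c
#include <pthread.h>

#define MIN_PAR 8192  /* below this, a thread costs more than it saves */

typedef struct { long *a; long lo, hi; } task_t;

static void swap(long *x, long *y) { long t = *x; *x = *y; *y = t; }

/* partition around the last element; returns the pivot's final index */
static long partition(long *a, long lo, long hi) {
    long pivot = a[hi], i = lo;
    for (long j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);
    return i;
}

void qsort_par(long *a, long lo, long hi);

static void *qsort_task(void *vargp) {
    task_t *t = vargp;
    qsort_par(t->a, t->lo, t->hi);
    return NULL;
}

void qsort_par(long *a, long lo, long hi) {
    if (lo >= hi)
        return;
    long p = partition(a, lo, hi);   /* pick pivot, partition */
    if (hi - lo < MIN_PAR) {         /* short array: sort sequentially */
        qsort_par(a, lo, p - 1);
        qsort_par(a, p + 1, hi);
    } else {                         /* long array: one half per thread */
        pthread_t tid;
        task_t t = { a, lo, p - 1 }; /* stack slot is safe here: we join
                                        below, before t goes out of scope */
        pthread_create(&tid, NULL, qsort_task, &t);
        qsort_par(a, p + 1, hi);     /* sort the other half ourselves */
        pthread_join(tid, NULL);
    }
}
```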

### Amdahl’s law

  • $T$ total sequential time
  • $p$ fraction of total time that can be sped up
  • $k$ speed up factor

Resulting performance:

$T_{k} = p \frac{T}{k} + (1-p)T$

Maximum performance when $k = \infty$
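
For example (numbers illustrative): with $p = 0.9$ and $k = 10$, $T_{10} = 0.9\,T/10 + 0.1\,T = 0.19\,T$, a speedup of about $5.3$; even with $k = \infty$ the run time is $(1-p)T = 0.1\,T$, so the speedup can never exceed $\frac{1}{1-p} = 10$.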

Exception: maybe speeding up one part of code speeds up something else