Lecture notes courtesy of Eddie Kohler, Robert Morris and Frans Kaashoek
Overview
- Why synchronization?
Multiple cooperating execution contexts may adversely affect each other
when executing in a single system.
- Common ways to achieve cooperation among processes:
- share-nothing:
Different processes operate in separate address spaces and pass data through files or messages.
- shared-memory:
Different processes directly share a single address space (threads).
- We look at the shared memory model today.
- Synchronization is the coordination of individual actions to ensure that data consistency is maintained.
- Synchronization is our weapon against concurrency.
The danger of concurrency: Race condition
- Several processes access and manipulate the same data concurrently. Outcome becomes
non-deterministic (depends on the particular order in which the access takes place).
- Our goal: ensure that only one process at a time can be manipulating this shared data (in the critical section).
- Where does concurrency come from?:
On a uni-processor:
- User-level: different threads (e.g. Lab1 and Lab2's miniprocesses) access shared data. The OS time-multiplexes
the two threads.
- Kernel: interrupt stops current execution context and starts the interrupt handling context.
(Why does weensyos2's kernel have no synchronization?)
On a multi-core processor:
- Obvious...
- Example code for implementing bank deposits and withdrawals
int balance;

void
deposit()
{
    balance++;
}

void
withdrawal()
{
    balance--;
}
Let's start with initial balance = 0 and two threads (T1 and T2) making deposits of 1. What could happen?
Intuitive outcome: balance = 2
Danger of race condition (what could happen): balance = 1!!
What happened?
A closer look at the banking example with ``objdump -d a.out''
08048344 <deposit>:
...
8048347: a1 6c 95 04 08 mov 0x804956c,%eax
804834c: 83 c0 01 add $0x1,%eax
804834f: a3 6c 95 04 08 mov %eax,0x804956c
...
08048356 <withdrawal>:
...
8048359: a1 6c 95 04 08 mov 0x804956c,%eax
804835e: 83 e8 01 sub $0x1,%eax
8048361: a3 6c 95 04 08 mov %eax,0x804956c
...
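The schedule that loses a deposit can be replayed deterministically by hand-simulating each thread's load/add/store steps from the disassembly above (a sketch; the interleaving order is one bad schedule we picked, not taken from the notes):

```c
#include <assert.h>

static int balance;

/* Simulate the three machine steps of deposit() for two threads,
 * interleaved in the order that loses one update:
 *   T1: mov balance,%eax  (reads 0)
 *   T2: mov balance,%eax  (also reads 0, before T1 stores)
 *   T1: add $1 ; mov %eax,balance  (writes 1)
 *   T2: add $1 ; mov %eax,balance  (overwrites with 1)
 */
int lost_update_demo(void) {
    balance = 0;
    int t1_eax = balance;   /* T1 loads 0 */
    int t2_eax = balance;   /* T2 loads 0 too */
    t1_eax += 1;
    balance = t1_eax;       /* balance = 1 */
    t2_eax += 1;
    balance = t2_eax;       /* balance = 1 again: T1's deposit is lost */
    return balance;
}
```

Running both deposits under this schedule yields 1, not 2.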
Fix withdrawal underflow (don't let the balance go negative):
bool
withdrawal()
{
    if (balance > 0) {
        balance--;
        return true;
    } else {
        return false;
    }
}
balance = 1, two threads making withdrawals at the same time,
Intuitive outcome: T1=false,T2=true, balance=0 (or the opposite)
Possible outcome: T1=true,T2=true, balance=0
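The T1=true, T2=true, balance=0 outcome can be replayed the same way: both threads pass the check and both load the old balance before either stores (a hand-simulated sketch of one bad schedule):

```c
#include <assert.h>
#include <stdbool.h>

/* Replay one bad schedule of two concurrent withdrawal() calls
 * against an initial balance of 1. */
void double_withdrawal_demo(bool *t1_ok, bool *t2_ok, int *final_bal) {
    int balance = 1;
    bool t1_check = (balance > 0);  /* T1 sees 1 > 0 */
    bool t2_check = (balance > 0);  /* T2 also sees 1 > 0 */
    int t1_eax = balance;           /* T1 loads 1 */
    int t2_eax = balance;           /* T2 loads 1 */
    balance = t1_eax - 1;           /* T1 stores 0 */
    balance = t2_eax - 1;           /* T2 also stores 0 */
    *t1_ok = t1_check;
    *t2_ok = t2_check;
    *final_bal = balance;           /* 0, yet both calls report success */
}
```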
balance = 1, T1 making a deposit, T2 making a withdrawal
Intuitive outcome: balance =1
Possible outcome: balance=2?
Reasoning about correctness
Sequential consistency: 2 or more sequences of operations
are sequentially consistent iff the result of any execution
could be obtained by executing the operations one at a time
in some serial order. Furthermore, in the serial order, read ops must reflect the most recent write.
In the banking example, two threads T1 and T2 are executing the following sequence of operations:
T1 T2
--------
D D
W W
Sequential consistency implies that if the initial balance=0, the result will always be 0!
If the execution is sequentially consistent, it is not possible to get balance=1
What does it take to achieve sequential consistency?
Two requirements:
- ensure before-after atomicity for individual operations
- sequentially consistent operations at the h/w level (atomic instructions and sequentially consistent memory access)
Synchronization is the act of building sequential consistency for APPLICATION operations using the sequential consistency provided by lower level operations.
Why is synchronization difficult?
- Correctness is difficult to debug: race condition bugs are usually difficult to find and fix.
- Trade-off between an easy, correct implementation and utilizing the full performance.
Another banking example:
int
readbal()
{
    return balance;
}

void
clearbal()
{
    balance = 0;
}
Two threads execute the following ops concurrently:
T1 T2
-------
R C
R C
Initial balance=10. Can a sequentially consistent execution produce R=10? R=5? R=0?
Is the above code sequentially consistent?
- Yes: guaranteed by h/w atomic ops and sequentially consistent memory access
Change the clear balance function to
void
clearbal()
{
    while (balance > 0)
        balance--;
}
Is it still correct?
The disassembled code for clearbal():
...
804836b: eb 0d jmp 804837a
804836d: a1 8c 95 04 08 mov 0x804958c,%eax
8048372: 83 e8 01 sub $0x1,%eax
8048375: a3 8c 95 04 08 mov %eax,0x804958c
804837a: a1 8c 95 04 08 mov 0x804958c,%eax
804837f: 85 c0 test %eax,%eax
8048381: 7f ea jg 804836d
...
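With the loop, each decrement is a separate memory read and write, so a concurrent readbal() can land between iterations and observe an intermediate value such as 5. One such schedule, hand-simulated (the step ordering is our choice for illustration):

```c
#include <assert.h>

/* Replay one schedule of readbal() against the looping clearbal():
 * T2 has completed five loop iterations when T1's read lands. */
int midway_read_demo(void) {
    int balance = 10;
    for (int i = 0; i < 5; i++)
        balance--;           /* T2: five iterations of the while loop */
    int r = balance;         /* T1: readbal() observes 5 */
    while (balance > 0)      /* T2 finishes clearing afterwards */
        balance--;
    return r;
}
```

So the rewritten clearbal() is still sequentially consistent at the instruction level, but the application-level "clear" is no longer atomic.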
Hardware sequential consistency (x86)
An ATOMIC operation is an operation that cannot be split apart. The operation executes indivisibly, so no other thread can view an intermediate state.
register operations are atomic (because registers are private to each individual thread or execution context)
How about memory operations?
movl $0, %eax
addl $1, %edx
In x86, read operations within a single cache line and write operations within a single cache line are atomic.
cache line: The unit of caching, typically 64 bytes.
Modern machines almost always have two levels of cache, L1 and L2. L1 is on-chip, fast but small (tens of KB). L2 is slower but much bigger (1-10 MB).
Atomic h/w instructions should only use one cache line. Operations spanning multiple cache lines are not guaranteed to be atomic.
Better not to gamble on cache line alignment.
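One way not to gamble on alignment is to force a shared variable onto its own cache-line boundary, so it can never straddle two lines. A sketch using C11's alignment specifier (the 64-byte line size is a common value, not something x86 guarantees):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Force the counter onto a 64-byte boundary so a single access
 * never spans two cache lines. */
struct counter {
    alignas(64) int value;
};
```

Any `struct counter` object, on the stack or heap, now starts at a 64-byte boundary.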
Hardware Synchronization primitive
h/w synchronizing instructions
atomically perform actions that would otherwise not satisfy sequential consistency.
test-and-set
Conceptually, test and set implies
int
test_and_set(int *a, int v)
{
    int tmp = *a;
    *a = v;
    return tmp;
}
x86 instruction: xchgl (exchange)
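The conceptual C above is NOT atomic if compiled as written. GCC and Clang expose an atomic version as a builtin, which on x86 compiles down to an xchg-style instruction (a sketch; the demo function is ours):

```c
#include <assert.h>

/* __sync_lock_test_and_set atomically stores the new value and
 * returns the previous contents, like the conceptual test_and_set. */
int demo_test_and_set(void) {
    int lock = 0;
    int old = __sync_lock_test_and_set(&lock, 1);  /* old = 0, lock = 1 */
    assert(old == 0 && lock == 1);
    old = __sync_lock_test_and_set(&lock, 1);      /* a second caller sees it taken */
    return old;                                    /* 1 */
}
```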
Compare and swap (cmpxchg)
Conceptually, it implies
int
compare_and_swap(int *a, int old, int new)
{
    if (*a == old) {
        *a = new;
        return 1;
    } else
        return 0;
}
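GCC and Clang also expose compare-and-swap as a builtin, which on x86 compiles to lock cmpxchg. A quick demonstration of its semantics (the demo function is ours):

```c
#include <assert.h>
#include <stdbool.h>

/* __sync_bool_compare_and_swap swaps only if *a still holds the
 * expected old value, and reports whether the swap happened. */
int demo_cas(void) {
    int a = 5;
    bool ok = __sync_bool_compare_and_swap(&a, 5, 7);  /* matches: a becomes 7 */
    assert(ok && a == 7);
    ok = __sync_bool_compare_and_swap(&a, 5, 9);       /* stale expectation: fails */
    assert(!ok && a == 7);
    return a;
}
```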
Critical sections
A critical section is a set of instructions for which sequential consistency requires
before-after atomicity, i.e. at most one thread's instruction pointer is in the set at any moment.
CS goals
- Safety (aka mutual exclusion): no more than one thread in CS at any time
- Liveness (aka progress): a thread trying to access CS will eventually succeed
- Bounded waiting: if a thread has requested access to the CS, it will succeed after at most N other threads finish the CS.
- possibly fairness: if two threads are both trying to access CS, they have equal chances of success
Prefer the smallest critical section possible. This gives the best performance and utilization.
bool
withdrawal()
{
    if (balance > 0) {   //--- CS
        balance--;       //--- CS
        return true;
    } else {
        return false;
    }
}

void
deposit()
{
    balance++;   //--- CS
}
Any smaller CS possible?
Enforcing CS using locks
The lock abstraction
A lock is an object with two operations: acquire(L) and release(L).
The lock has state: it is either locked or not locked.
A call to acquire(L) waits until L is not locked, then changes its state to locked and returns. A call to release(L) marks L as not locked.
bracket every one of your CS with acquire(L) and release(L)
Lock L;

bool
withdrawal()
{
    acquire(&L);
    if (balance > 0) {   //--- CS
        balance--;       //--- CS
        release(&L);
        return true;
    } else {
        release(&L);
        return false;
    }
}

void
deposit()
{
    acquire(&L);
    balance++;   //--- CS
    release(&L);
}
How to implement acquire and release?
struct Lock {
    int locked;
};

void
acquire(Lock *l)
{
    while (1) {
        if (l->locked == 0) {   //C
            l->locked = 1;      //D
            break;
        }
    }
}

void
release(Lock *l)
{
    l->locked = 0;
}
Does it work?
Assume threads execute on different CPUs
Problem: Two acquire()s on the same lock on different CPUs might both execute line C, and then both execute D. Then both will think they have acquired the lock.
The correct implementation uses h/w primitive
void
acquire(Lock *l)
{
    while (1) {
        if (xchg(&l->locked, 1) == 0)
            break;
    }
}

void
release(Lock *l)
{
    l->locked = 0;
}
Why does this work? If two CPUs execute xchg at the same time, the hardware ensures that one xchg does both its steps before the other one starts. So we can say "the first xchg" and "the second xchg". The first xchg will read a 0 and set lock->locked to 1 and the acquire() will return. The second xchg will then see lock->locked as 1, and the acquire will loop.
This style of lock is called a spin-lock, since acquire() stays in a loop while waiting for the lock.
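A runnable sketch of this spin-lock, using the GCC `__sync_lock_test_and_set` builtin in place of hand-written xchg assembly (function names and iteration counts are ours). Two threads each make 100000 locked deposits, and no update is lost:

```c
#include <pthread.h>

struct spinlock { volatile int locked; };

static void spin_acquire(struct spinlock *l) {
    /* Atomically exchange 1 into locked; loop until we read a 0,
     * which means we were the one to take the lock. */
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
        ;
}

static void spin_release(struct spinlock *l) {
    __sync_lock_release(&l->locked);   /* store 0, releasing the lock */
}

static struct spinlock lk;
static int balance;

static void *depositor(void *arg) {
    for (int i = 0; i < 100000; i++) {
        spin_acquire(&lk);
        balance++;                     /* critical section */
        spin_release(&lk);
    }
    return 0;
}

int run_depositors(void) {
    pthread_t t1, t2;
    balance = 0;
    pthread_create(&t1, 0, depositor, 0);
    pthread_create(&t2, 0, depositor, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return balance;   /* 200000: no deposits lost */
}
```

Compile with -pthread. Without the lock, the same run typically loses thousands of deposits.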
Shouldn't need to use spin-lock on a single-core
On a single core machine, you can cheat!
void
acquire()
{
    cli();   // disable interrupts
}

void
release()
{
    sti();   // enable interrupts
}
The Linux kernel heavily used this trick in its single-core days (it does not work on multiprocessors)!
Spin-locks are good when protecting short operations: increment a counter,
search in i-node cache, allocate a free buffer. In these cases acquire()
won't waste much CPU time spinning even if another CPU holds the lock.
Better alternatives exist for threads that need to wait for a long CS.
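One such alternative is a blocking lock: a pthread mutex puts a contending thread to sleep instead of spinning, so a long critical section does not burn CPU. A sketch using the standard pthread API (function names and counts are ours):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int balance;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);    /* sleeps, rather than spins, if contended */
        balance++;                 /* critical section */
        pthread_mutex_unlock(&m);
    }
    return 0;
}

int run_workers(void) {
    pthread_t t1, t2;
    balance = 0;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return balance;   /* 200000 */
}
```

The interface is the same as our spin-lock's acquire/release; only the waiting strategy changes.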