- In the last lecture, we discussed:
- Race conditions
- H/W synchronization primitives (test-and-set, compare-and-swap)
- Spin-lock implementation
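- For reference in what follows, here is a minimal spin-lock sketch built on test-and-set. This is an assumption for illustration (using GCC's __sync builtins), not necessarily the exact implementation from last lecture; the names match the acquire/release calls used below:
typedef struct {
    volatile unsigned locked; /* 0 = free, 1 = held */
} spinlock_t;

void acquire(spinlock_t *l) {
    /* atomically set locked to 1 and get its old value; keep spinning
       until the old value was 0, i.e. until we took a free lock */
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
        ;
}

void release(spinlock_t *l) {
    __sync_lock_release(&l->locked); /* atomically reset locked to 0 */
}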
- This lecture will cover:
- Building high-level primitives out of low-level spin-locks and atomic operations (mutex, semaphore, condition variable)
- The kernel's synchronization overview
Using spin-locks inside the kernel
- Let's look at the implementation of an example kernel service, the pipe buffer, which allows one process to read what another process writes
- A pipe buffer is a bounded buffer (draw pic, P1, P2, bounded buffer inside kernel).
- When the buffer is full, the writer is blocked until space is available
- When the buffer is empty, the reader is blocked until data is available
- The implementation:
#define N 8
typedef struct {
    char buf[N]; /* N = 8 */
    unsigned head;
    unsigned tail;
    spinlock_t lock;
} pipebuf_t;
void writec(pipebuf_t *p, char c) {
    while (1) {
        acquire(&p->lock);
        if ((p->tail - p->head) < N) {
            p->buf[p->tail % N] = c;
            p->tail++;
            release(&p->lock);
            return; //simplified return from kernel to user space
        }
        release(&p->lock);
    }
}
char readc(pipebuf_t *p) {
    while (1) {
        acquire(&p->lock);
        if (p->tail != p->head) {
            char c = p->buf[p->head % N];
            p->head++;
            release(&p->lock);
            return c; //simplified return from kernel to user space
        }
        release(&p->lock);
    }
}
- Race condition?
- Does overflow in head and tail still allow the code to work correctly? (See the sketch below.)
- The lines shown in red in lecture (the acquire/release calls) were added to get rid of the race condition
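- A minimal standalone sketch (not from the lecture; the values are hypothetical) illustrating the overflow question: unsigned arithmetic wraps modulo 2^32, so tail - head and tail % N stay correct across wraparound, but only because N is a power of two:
#include <stdio.h>
#define N 8
int main(void) {
    /* place head near the top of the unsigned range so tail wraps past 2^32 */
    unsigned head = 0xFFFFFFFDu;
    unsigned tail = head + 5; /* wraps around to 2 */
    printf("%u\n", tail - head); /* still prints 5: the element count is correct */
    /* head % N and tail % N also advance consistently across the wrap,
       because N divides 2^32; this breaks if N is not a power of two */
    return 0;
}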
- Problems with this implementation?
- No bounded wait (i.e. possible starvation)
- With bounded wait, all processes waiting for a lock wake up eventually. In other words, if a given lock is released an infinite number of times, then any acquire operation will eventually succeed.
- Inefficient spinning
- How to ensure bounded waiting and avoid spinning inefficiency?
- Use a wait queue!
- Wait queue contains a queue of processes waiting for a lock
- Processes in the queue are blocked.
- When the lock is released, the next process in queue order is activated and acquires the lock.
- This type of synchronization object is known as a mutex, short for mutual exclusion.
Implementing a Mutex
- Let's define the mutex data structure as a lock and a singly linked list of waiting processes:
typedef struct wait {
    procdescriptor_t *p; /* process to wake */
    struct wait *next; /* points to the next element of the wait queue */
} wait_t;
typedef struct {
    unsigned locked;
    wait_t *wfirst; /* for quick access to the head of the wait queue */
    wait_t *wlast; /* for quick access to the tail of the wait queue */
    spinlock_t l;
} mutex_t;
- Mutex lock operation: if the lock cannot be acquired immediately, the calling process is added to the list
- Mutex unlock operation: When the lock is released, the next process on the list, if any, is woken up so it can acquire the lock.
- A candidate implementation
void
mutex_lock(mutex_t *m)
{
    /* check if the lock is free and no one is waiting; if so, simply lock the lock and return */
    acquire(&m->l);
    if (m->locked == 0 && m->wfirst == NULL) {
        m->locked = 1;
        release(&m->l);
        return;
    }
    /* lock is not free; add the current process to the end of the wait queue */
    wait_t w; //why can we put a stack-allocated variable on the linked list?
    w.p = current;
    w.next = NULL;
    if (m->wlast)
        m->wlast->next = &w;
    else
        m->wfirst = &w;
    m->wlast = &w;
    /* set the status of the current process as blocked */
    current->state = BLOCKED;
    release(&m->l);
    /* schedule a different, runnable process */
    schedule();
    /* when the current process is woken up, it resumes here */
    acquire(&m->l);
    /* the woken-up process grabs the lock and removes itself from the linked list */
    assert(m->locked == 0);
    assert(m->wfirst == &w);
    m->locked = 1;
    if (w.next != NULL)
        m->wfirst = w.next;
    else
        m->wfirst = m->wlast = NULL;
    release(&m->l);
}
void mutex_unlock(mutex_t *m) {
    acquire(&m->l);
    if (m->wfirst)
        m->wfirst->p->state = RUNNABLE; /* the woken process dequeues itself in mutex_lock */
    m->locked = 0;
    release(&m->l);
}
- Notice that the current process releases the spinlock before it blocks, so it never holds the spinlock while blocked; otherwise a deadlock would result. In general, spinlocks cannot be held across calls to schedule().
- How did our mutex implementation solve the problems with spinlocks pointed out earlier? (The FIFO wait queue gives bounded wait, and blocking instead of spinning avoids wasted CPU, except inside the short spinlock-protected critical sections.)
- Historically, this functionality took a long time to design correctly. There are many places for subtle race conditions to creep in. An example is the sleep-wakeup race.
The Sleep-Wakeup Race
- A very subtle race resulting from improper implementation of blocking
... in the mutex_lock function ....
release(&m->l);
current->state = BLOCKED;
schedule();
acquire(&m->l);
- The sleep-wakeup race example:
- Process A reaches the snippet above and releases the spinlock
- Process B, running on another core and owning the mutex, tries to release the mutex. It succeeds because the spinlock is free
- Process B, when releasing the mutex, grabs the first process on the wait queue, which is Process A, and sets its state to RUNNABLE
- Process A executes next, sets its state to BLOCKED (overwriting the RUNNABLE that B just set), and calls schedule(), waiting to be woken up
- The mutex is now free, but Process A is still blocked on the wait queue; its wakeup was lost, so it will remain blocked indefinitely.
- The bug is subtle because the process's state is part of the critical section, even though the process's state is not part of the mutex_t structure.
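- For contrast, the correct ordering (what the full mutex_lock above actually does) marks the process BLOCKED before releasing the spinlock, so B's wakeup can never be overwritten:
... in the mutex_lock function ...
current->state = BLOCKED; /* set while still holding the spinlock */
release(&m->l);           /* B can only run mutex_unlock after this point, */
schedule();               /* so its RUNNABLE always lands after our BLOCKED */
acquire(&m->l);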
- We can replace the spinlocks in the bounded buffer implementation with mutexes. However, the main source of inefficiency in readc and writec still remains: the polling loop to determine whether the pipe buffer is ready for reading or writing.
Condition variables
- Examine the readc() implementation. Ideally, we want the process to block until
- No other process is in the critical section, AND
- The buffer is non-empty, i.e. p->head != p->tail
- We will build another higher-level primitive, the condition variable, that realizes this requirement.
- Condition variable: a variable or object that stands for a condition.
- The interface of a condition variable:
condvar_t cv; //declaration of a condition variable
void cond_wait(condvar_t *cv, mutex_t *m) {
    //1. release mutex
    //2. block until cond_notify(cv) is called
    //THE FIRST TWO OPERATIONS OCCUR ATOMICALLY AS ONE
    //3. acquire mutex
}
void cond_notify(condvar_t *cv) {
    //wake up the first process blocked in cond_wait
}
void cond_broadcast(condvar_t *cv) {
    //wake up all processes blocked in cond_wait
}
- Why is there a need to have a mutex in cond_wait()?
- To prevent the sleep-wakeup race; a sketch of the failure mode follows.
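- Here is what could go wrong otherwise, using a hypothetical cond_wait_no_mutex() that takes no mutex (not a real interface, just for illustration):
/* BROKEN: the reader checks the condition without holding the mutex */
while (p->tail == p->head) {
    /* <-- a writer can run here: it adds a character and calls
       cond_notify(&p->nonempty), but we are not on the wait queue
       yet, so the notification is lost and we may block forever */
    cond_wait_no_mutex(&p->nonempty); /* hypothetical mutex-free wait */
}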
- The conditions for blocking in the bounded buffer?
- nonempty (for readc())
- nonfull (for writec())
typedef struct {
    char buf[N]; /* N = 8 */
    unsigned head;
    unsigned tail;
    mutex_t l;
    condvar_t nonempty; //indicates the buffer is not empty
    condvar_t nonfull; //indicates the buffer is not full
} pipebuf_t;
void writec(pipebuf_t *p, char c) {
    mutex_lock(&p->l); //acquire the lock outside the while loop
    while (1) {
        if (p->tail - p->head < N) {
            p->buf[p->tail % N] = c;
            p->tail++;
            cond_notify(&p->nonempty);
            mutex_unlock(&p->l);
            return;
        }
        cond_wait(&p->nonfull, &p->l); //block because the buffer is full
    }
}
char readc(pipebuf_t *p) {
    mutex_lock(&p->l);
    while (1) {
        if (p->tail != p->head) {
            char c = p->buf[p->head % N];
            p->head++;
            cond_notify(&p->nonfull); //we know that the buffer is not full!
            mutex_unlock(&p->l);
            return c;
        }
        cond_wait(&p->nonempty, &p->l); //block because the buffer is empty
    }
}
- Why is it correct? Two scenarios to think about:
- What if we put cond_notify() after the mutex_unlock() statement?
- Assume that cond_wait() did not release and re-acquire the mutex. If we instead placed a mutex_unlock() before and a mutex_lock() after the cond_wait() call, would it work?
- How to implement a condition variable?
- Similar to how the mutex is implemented. Try to work out the details yourself; one possible sketch follows for reference.
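- One possible sketch (an assumption about one workable design, not the only one), reusing wait_t, the spinlock pattern, and the BLOCKED-before-release rule from the mutex section:
typedef struct {
    wait_t *wfirst; /* head of wait queue */
    wait_t *wlast;  /* tail of wait queue */
    spinlock_t l;
} condvar_t;

void cond_wait(condvar_t *cv, mutex_t *m) {
    acquire(&cv->l);
    /* enqueue the current process, exactly as in mutex_lock */
    wait_t w;
    w.p = current;
    w.next = NULL;
    if (cv->wlast)
        cv->wlast->next = &w;
    else
        cv->wfirst = &w;
    cv->wlast = &w;
    current->state = BLOCKED; /* mark BLOCKED before dropping the spinlock */
    release(&cv->l);
    mutex_unlock(m); /* released only after we are queued: no lost wakeup */
    schedule();
    mutex_lock(m); /* re-acquire the mutex once woken */
}

void cond_notify(condvar_t *cv) {
    acquire(&cv->l);
    if (cv->wfirst) {
        cv->wfirst->p->state = RUNNABLE; /* wake the first waiter... */
        cv->wfirst = cv->wfirst->next;   /* ...and dequeue it here */
        if (cv->wfirst == NULL)
            cv->wlast = NULL;
    }
    release(&cv->l);
}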
Semaphores
- A semaphore is a synchronized integer with two atomic operations:
- P(s): probeer te verlagen ("try to decrease"). Used for acquiring a resource
- This operation blocks while the value of s is negative, then decrements s when it becomes unblocked.
- V(s): verhoog ("increase"). Used for releasing a resource; it increments s.
- In Linux kernel lingo, P and V are called down and up, respectively.
- Semaphores are similar to locks, but are more versatile
- One can use semaphores to grant "resources" to more than one thread, e.g. to prevent a queue from overflowing its bounds.
- One can implement semaphores using spin-locks, as is done with mutexes (a sketch follows the smutex example below)
- One can also implement a mutex with a semaphore:
typedef struct {
    semaphore_t s; /* initialized to 0, i.e. the free state */
} smutex_t;
void smutex_lock(smutex_t *m) {
    P(&m->s);
}
void smutex_unlock(smutex_t *m) {
    V(&m->s);
}
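- And, as claimed above, semaphores themselves can be built from spin-locks. A minimal sketch following the lecture's convention (P blocks while s is negative) and reusing wait_t from the mutex section; note this simple version does not guarantee bounded wait:
typedef struct {
    int value; /* the semaphore's integer */
    wait_t *wfirst, *wlast; /* wait queue, as in mutex_t */
    spinlock_t l;
} semaphore_t;

void P(semaphore_t *s) {
    acquire(&s->l);
    while (s->value < 0) { /* block while the value is negative */
        wait_t w;
        w.p = current;
        w.next = NULL;
        if (s->wlast)
            s->wlast->next = &w;
        else
            s->wfirst = &w;
        s->wlast = &w;
        current->state = BLOCKED; /* BLOCKED before releasing: no lost wakeup */
        release(&s->l);
        schedule();
        acquire(&s->l); /* re-check the value: another P may have raced us */
    }
    s->value--; /* decrement once unblocked */
    release(&s->l);
}

void V(semaphore_t *s) {
    acquire(&s->l);
    s->value++;
    if (s->wfirst) {
        s->wfirst->p->state = RUNNABLE; /* wake and dequeue the first waiter */
        s->wfirst = s->wfirst->next;
        if (s->wfirst == NULL)
            s->wlast = NULL;
    }
    release(&s->l);
}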
- We can initialize a semaphore to different values:
- 0: initializes the semaphore in the free state, as a binary lock
- -1: initializes it in the locked state
- 1: free state; allows 2 threads to obtain the resource
- Using semaphores to implement the bounded buffer:
typedef struct {
    unsigned head;
    unsigned tail;
    char buf[N];
    semaphore_t lock; /* initialized to 0 (free) */
    semaphore_t nonempty; /* initialized to -1: initially "locked" */
    semaphore_t nonfull; /* initialized to N-1: only the first N P()s succeed */
} pipebuf_t;
void
writec(pipebuf_t *p, char c) {
    P(&p->nonfull); //block until there is space in the buffer
    P(&p->lock);
    p->buf[p->tail % N] = c;
    p->tail++;
    V(&p->lock);
    V(&p->nonempty); //increment because there is now an item in the buffer
}
char
readc(pipebuf_t *p) {
    P(&p->nonempty); //block until there is something to read
    P(&p->lock);
    char c = p->buf[p->head % N];
    p->head++;
    V(&p->lock);
    V(&p->nonfull); //increment because we opened space for another character in the buffer
    return c;
}
- This implementation of writec and readc using semaphores rather than mutexes + condition variables is not better in terms of starvation or utilization, but it allows us to combine claiming space and blocking into one variable type.
- Recommended exercise: implement your own read/write mutex.
Lock granularity
- The OS kernel is a large, complicated software system. What should the locks protect?
- Two big questions to ponder: correctness and performance.
- The easiest way to get correct locking is to have just a single lock that every function grabs at the start and releases at the end.
- In the old days of Linux, this is what was used: the "big kernel lock"
- A big lock means that only one CPU at a time can execute anywhere in your code. If OS code is called a lot, this may reduce the performance of an expensive multiprocessor to that of a single CPU.
- Finer lock granularity increases concurrency and gets better performance.
- How best to reduce lock granularity is a bit of an art; a hypothetical example follows.
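- As an illustration of finer granularity (all names here are invented for this sketch): a hash table with one lock per bucket, so CPUs touching different buckets never contend:
#define NBUCKET 64

struct entry {
    unsigned key;
    struct entry *next;
};

struct bucket {
    spinlock_t lock; /* fine-grained: protects only this bucket's list */
    struct entry *head;
};

struct bucket table[NBUCKET];

void table_insert(struct entry *e) {
    struct bucket *b = &table[e->key % NBUCKET];
    acquire(&b->lock); /* CPUs working on other buckets proceed in parallel */
    e->next = b->head;
    b->head = e;
    release(&b->lock);
}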
General strategies in Linux for synchronizing access to data structures
- Below is an excerpt from "Understanding the Linux Kernel", Table 5-8.
Kernel control paths accessing the data structure    Protection primitive
---------------------------------------------------------------------------------------
exceptions only (e.g. syscalls)                      semaphore
interrupts (e.g. disk or network interrupts)         local interrupt disabling + spin lock
exceptions + interrupts                              local interrupt disabling + spin lock
...
- Why not a semaphore for the interrupts or exceptions+interrupts cases? (Hint: P() may block, but an interrupt handler is not a schedulable process and cannot be put to sleep.)