Building Distributed Systems, esp. those that deal with a lot of data
(acknowledgements: rtm@pdos.lcs.mit.edu)

Example systems:
  Yahoo mail
    250m users
    1 GB per-user quota
  YouTube
    tens of millions of videos, each >10 MB
  Flickr
  Google

Why distributed systems?
  No single server is capable of
  * holding hundreds of terabytes of data.
  * serving millions of requests per second.

Goal: stable performance under high load
  Example: Starbucks.
    5 seconds to write down an incoming request.
    10 seconds to make the drink.
    [graph: x = requests/minute, y = output/minute]
    Max thruput at 4 drinks/minute.
    What happens at 6 req/min? Thruput goes to zero at 12 requests/minute.
  Efficiency *decreases* with load -- bad.
  Careful system design can avoid this -- a flat line at 4 drinks.
    Peet's, for example.
  Better: build systems whose efficiency *increases* w/ load,
    e.g. with batching, disk scheduling.

Goal: scalable performance
  What if there are more clients than one YahooMail server can handle?
  How to use more servers to handle more clients?
  Idea: partition users across servers.
    bottlenecks: how to ensure incoming mail arrives at the right server?
    scaling: will 10 servers allow us to handle 10x as many users?
    load balance: what if some users get much more mail than others?
    layout: what if we want to detect spam by looking at all mailboxes?

Goal: high availability
  Can I get at my mailbox if some servers / networks are down?
  Yes: replicate the data.
    Problem: replica consistency. Delete mail, and it re-appears.
    Problem: physical independence vs communication latency.
    Problem: partition vs availability. E.g. airline reservations.
  Tempting problem: can 2 servers yield 2x availability AND 2x performance?

Goal: security
  Old view: a trustworthy central party authenticates users
    and protects them from each other.
  The Internet has changed the focus. There is no central trustworthy party:
  * you fetch a new Firefox binary; how do you know it hasn't been hacked?
  * how do you know that was Amazon you gave your credit card number to?
  * whom do you trust your data with? Your company? Your online
    backup company? Google?
  Global exposure to random attacks from millions of bored students and
    evil-doers. (Software is buggy; admins are not good enough to keep
    up with security updates.)
  * how do we build systems with a small trusted code base?

Need to build systems with the right interface
  The choice of interface creates extra opportunities or extra constraints:
  - Disk (block-level) interface
  - File system interface
  - Database-like interface (tables, rows, columns, etc.)
  - Application-specific interface

-------------

Pop quiz
The constants of systems:
  - disk access time: 10 ms (changes slowly)
    (in contrast, memory access: 100 ns)
  - disk throughput: 40-50 MB per second
  - network latency: coast-to-coast 100 ms, to Asia 250 ms,
    within a datacenter < 1 ms
  - network b/w: within a datacenter 1 Gbps; T1 1.5 Mbps;
    DSL 1 Mbps down, 300 Kbps up

--------------

O/S kernel overview
  The context in which you build distributed systems.
  The o/s has a big impact on design, robustness, performance.

What problems does the o/s solve?
  sharing hardware resources
  protection
  communication
  hardware independence
  (everyone faces these problems)

UNIX abstractions
  process
    address space
    thread of control
    user ID
  file descriptors, used to refer to an I/O stream:
    on-disk file
    pipe
    network connection
    device

System call interface to kernel abstractions
  looks like a function call, but special
  fork, exec
  open, read, creat

Standard picture
  app (show two of them, mark addresses from zero)
  libraries
  -----
  FS
  disk driver
  (mention address spaces, protection boundaries)
  (mention that the h/w runs the kernel address space w/ special permissions)

Life-cycle of a simple UNIX system call
  Interesting points:
    protected transfer (the TRAP instruction)
    h/w allows a process to get kernel permissions,
      but only by jumping to a *known* entry point in the kernel
    the process is suspended until the system call finishes

What if the system call needs to wait, e.g. for the disk?
  We care: this is what busy servers do.
  sys_open(path)
    for each pathname component
      start read of directory from disk
      sleep waiting for the disk read
      process the directory contents
  sleep()
    save *kernel* registers to PCB1 (including SP)
    find runnable PCB2
    restore PCB2 kernel registers (SP...)
    return
  Note: each user process has its own kernel stack.
    The kernel stack contains the state of a partially executed system call;
    the "kernel half" trap handler must execute on the right stack.
    This is a "blocking system call".

What happens when the disk completion interrupt occurs?
  The device interrupt routine finds the process waiting for that I/O,
  marks the process as runnable, and returns from the interrupt.
  Someday the process scheduler will switch to the waiting process.

Example: server_1, the web server in the handout
  Problem: draw time-lines for CPU, disk, network.
    The server alternates waiting for each of them;
    CPU, disk, and network are each idle much of the time.
  OK if there is only one client.
  Not OK if there are clients waiting for service:
    we may have lots of work AND idle resources. Not good.
  The s/w structure forces one-at-a-time processing.

How can we use the system's resources more efficiently?
  What we want is *I/O concurrency*:
    the ability to overlap I/O wait with other useful work.
    In the web server case, I/O wait is mostly for net transfer to the client.
    Could be disk I/O: compile the 1st part of a file while fetching the 2nd.
    Could be user interaction: emacs GC while waiting for you to type.
  The performance benefits of I/O concurrency can be huge:
    suppose we're waiting 10 milliseconds on the disk for client one.
    We can probably serve 100 other clients from cache during that time!

Typical ways to get concurrency.
  This is about s/w structure. There are any number of alternatives:
  1. Multiple processes
  2. One process, many threads
  3. Event-driven
  One process can be better than you think!
    The O/S provides I/O concurrency transparently when it can:
    read-ahead into the cache, write-behind from the buffer.
    Works for disk and network connections.

1. I/O concurrency with multiple processes
  Start a new UNIX process for each client connection / request (fork()).
  A master process hands out connections to the child processes.
  (e.g. server_2() in the handout; a rough sketch of the idea follows below)
  + Isolated: a bug triggered by one client does not crash the whole server.
  + If > 1 CPU, CPU concurrency as a side effect.
  - The cost of starting a new process (fork()) may be high:
    e.g. ~300 microseconds on my computer.
  - Context switching between processes is expensive.
  - Processes are isolated (e.g. they do not share memory).
    What if you want a web cache? It must be shared among processes.
    Or even just keeping statistics?
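A minimal sketch of the fork()-per-connection pattern in C -- this is not
the handout's actual server_2 code; the port number, handle_client(), and
the omitted error handling are all assumptions:

  /* Sketch: master process accepts connections, forks a child per client.
   * handle_client() is a hypothetical request handler; real code would
   * check every return value. */
  #include <string.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <signal.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  static void handle_client(int fd) {
    char buf[512];
    read(fd, buf, sizeof(buf));       /* read request (no HTTP parsing) */
    const char *reply = "HTTP/1.0 200 OK\r\n\r\nhello\r\n";
    write(fd, reply, strlen(reply));
  }

  int main(void) {
    signal(SIGCHLD, SIG_IGN);         /* auto-reap exited children */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);      /* assumed port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);
    for (;;) {
      int c = accept(s, NULL, NULL);  /* master: wait for a connection */
      if (c < 0) continue;
      if (fork() == 0) {              /* child: serve this client only */
        close(s);
        handle_client(c);
        close(c);
        exit(0);
      }
      close(c);                       /* master: the child owns the socket */
    }
  }

Note where the isolation comes from: the child gets a *copy* of the
address space, which is why a crash is contained but also why there is
no shared web cache.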
2. I/O concurrency with (kernel-supported) threads
  Looks a bit like multiple processes (use clone() instead of fork()),
    but clone() leaves the address space alone: all threads share memory.
  Per-thread resources: user stack, kernel stack, kernel state.
  (Per-process resources: addr space, file descriptors.)
  + Multiple threads are much cheaper than multiple processes.
  + Potentially easier to share data among threads
    (e.g. a shared web cache); however,
  - need to be careful about races and deadlocks involving data
    shared among threads.
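A minimal sketch of thread-per-connection, using POSIX threads rather
than raw clone(); the shared hit counter stands in for a shared web
cache, and client_thread()/serve() are made-up names. The mutex is what
"careful about races" means in practice:

  /* Sketch: one thread per connection, with shared mutable state. */
  #include <pthread.h>
  #include <unistd.h>

  static long hits;                   /* shared among all threads */
  static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

  static void *client_thread(void *arg) {
    int fd = (int)(long)arg;
    pthread_mutex_lock(&mu);          /* without the lock, hits++ races */
    hits++;
    pthread_mutex_unlock(&mu);
    /* ... read request from fd, consult shared cache, write reply ... */
    close(fd);
    return NULL;
  }

  /* Called by an accept() loop for each new connection fd. */
  void serve(int fd) {
    pthread_t t;
    pthread_create(&t, NULL, client_thread, (void *)(long)fd);
    pthread_detach(t);                /* thread cleans itself up on exit */
  }

Compare with the fork() version: no copy of the address space, so
threads are cheap and the cache is shared, but every shared variable
now needs a locking discipline.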
3. Event-driven programming
  A typical O/S provides only partial support for event notification:
    yes: new TCP connections, arriving TCP/pipe/tty data
    no: file-system operation completion
  Similarly, not all system call operations can be started w/o waiting:
    yes: connect(), socket read(), write()
    no: open(), stat()
    maybe: disk read()
  Event-driven programming (asynchronous programming with one process):
    Organize the s/w around the arrival of events.
    Write the s/w in state-machine style:
      when this event occurs, execute this function.
    Library support to register interest in events.
    The point: this preserves the serial nature of the events.
      The programmer sees events/functions occurring one at a time.
  - Need to break the s/w into a series of event handlers,
    with no blocking operations in each handler.
  - Hard for async programs to maintain state:
    no longer able to keep track of things w/ the stack
    (the stack does not persist across different handlers).
  - Cannot make use of more than one CPU
    (because there is only one process!).
  + Cheap & fast.
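A minimal sketch of the state-machine style built on select(); struct
conn, its READING/WRITING states, and event_loop() are hypothetical,
and accept() handling plus all error checks are elided. Note the
explicit state field standing in for the stack that does not persist
across handlers:

  /* Sketch: one process, one event loop, non-blocking handlers. */
  #include <unistd.h>
  #include <sys/select.h>

  enum state { READING, WRITING };
  struct conn {
    int fd;
    enum state st;                    /* explicit per-connection state,
                                         since no stack survives between
                                         handler invocations */
    char buf[512];
    int n;
  };

  void event_loop(int listen_fd, struct conn *conns, int nconns) {
    for (;;) {
      fd_set rfds, wfds;
      FD_ZERO(&rfds);
      FD_ZERO(&wfds);
      FD_SET(listen_fd, &rfds);
      int maxfd = listen_fd;
      for (int i = 0; i < nconns; i++) {
        if (conns[i].st == READING) FD_SET(conns[i].fd, &rfds);
        else                        FD_SET(conns[i].fd, &wfds);
        if (conns[i].fd > maxfd) maxfd = conns[i].fd;
      }
      select(maxfd + 1, &rfds, &wfds, NULL, NULL);  /* the only wait */
      if (FD_ISSET(listen_fd, &rfds)) {
        /* accept() a new connection; add a conn in state READING */
      }
      for (int i = 0; i < nconns; i++) {
        if (conns[i].st == READING && FD_ISSET(conns[i].fd, &rfds)) {
          conns[i].n = read(conns[i].fd, conns[i].buf, sizeof(conns[i].buf));
          conns[i].st = WRITING;      /* advance the state machine */
        } else if (conns[i].st == WRITING && FD_ISSET(conns[i].fd, &wfds)) {
          write(conns[i].fd, conns[i].buf, conns[i].n);
          conns[i].st = READING;
        }
      }
    }
  }

Each pass through the loop runs every ready handler to completion and
then blocks only in select(): events appear one at a time, exactly the
serial view described above.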