Building Distributed Systems, esp. those that deal with a lot of data
(acknowledgements: rtm@pdos.lcs.mit.edu)

Example systems:
  Yahoo mail
    250m users
    1 GB per-user quota
  YouTube
    tens of millions of videos, each >10 MB
  Flickr
  Google

Why distributed systems?
  No single server is capable of
  * holding hundreds of terabytes of data.
  * serving millions of requests per second.

Goal: stable performance under high load
  Example: Starbucks.
    5 seconds to write down an incoming request.
    10 seconds to make the drink.
    [graph: x = requests/minute, y = output/minute]
    Max thruput at 4 drinks/minute.
    What happens at 6 req/min? Thruput goes to zero at 12 requests/minute.
  Efficiency *decreases* with load -- bad.
  Careful system design can avoid this -- a flat line at 4 drinks.
    Peet's, for example.
  Better: build systems whose efficiency *increases* w/ load,
    e.g. with batching, disk scheduling.

Goal: scalable performance
  What if there are more clients than one YahooMail server can handle?
  How to use more servers to handle more clients?
  Idea: partition users across servers.
    bottlenecks: how to ensure incoming mail arrives at the right server?
    scaling: will 10 servers allow us to handle 10x as many users?
    load balance: what if some users get much more mail than others?
    layout: what if we want to detect spam by looking at all mailboxes?

Goal: high availability
  Can I get at my mailbox if some servers / networks are down?
  Yes: replicate the data.
    Problem: replica consistency. Delete mail, and it re-appears.
    Problem: physical independence vs communication latency.
    Problem: partition vs availability. E.g. airline reservations.
  Tempting problem: can 2 servers yield 2x availability AND 2x performance?

Goal: security
  Old view: a trustworthy central party authenticates users
    and protects them from each other.
  The Internet has changed the focus. There is no central trustworthy party:
  * you fetch a new Firefox binary; how do you know it hasn't been hacked?
  * how do you know that was Amazon you gave your credit card number to?
  * whom do you trust your data with? Your company? Your online
    backup company? Google?
  Global exposure to random attacks from millions of bored students and
    evil-doers. (Software is buggy; admins are not good enough to keep
    up with security updates.)
  * how do we build systems with a small trusted code base?

Need to build systems with the right interface
  The choice of interface creates extra opportunities or extra constraints:
  - Disk (block-level) interface
  - File system interface
  - Database-like interface (tables, rows, columns, etc.)
  - Application-specific interface

-------------

Pop quiz
The constants of systems:
  - disk access time: 10 ms (changes slowly)
    (in contrast, memory access: 100 ns)
  - disk throughput: 40-50 MB per second
  - network latency: coast-to-coast 100 ms, to Asia 250 ms,
    within a datacenter < 1 ms
  - network b/w: within a datacenter 1 Gbps; T1 1.5 Mbps;
    DSL 1 Mbps down, 300 Kbps up

--------------

O/S kernel overview
  The context in which you build distributed systems.
  The o/s has a big impact on design, robustness, performance.

What problems does the o/s solve?
  sharing hardware resources
  protection
  communication
  hardware independence
  (everyone faces these problems)

UNIX abstractions
  process
    address space
    thread of control
    user ID
  file descriptors, used to refer to an I/O stream:
    on-disk file
    pipe
    network connection
    device

System call interface to kernel abstractions
  looks like a function call, but special
  fork, exec
  open, read, creat

Standard picture
  app (show two of them, mark addresses from zero)
  libraries
  -----
  FS
  disk driver
  (mention address spaces, protection boundaries)
  (mention that the h/w runs the kernel address space w/ special permissions)

Life-cycle of a simple UNIX system call
  Interesting points:
    protected transfer (the TRAP instruction)
    h/w allows a process to get kernel permissions,
      but only by jumping to a *known* entry point in the kernel
    the process is suspended until the system call finishes

What if the system call needs to wait, e.g. for the disk?
  We care: this is what busy servers do.
  sys_open(path)
    for each pathname component
      start read of directory from disk
      sleep waiting for the disk read
      process the directory contents
  sleep()
    save *kernel* registers to PCB1 (including SP)
    find runnable PCB2
    restore PCB2 kernel registers (SP...)
    return
  Note: each user process has its own kernel stack.
    The kernel stack contains the state of a partially executed system call;
    the "kernel half" trap handler must execute on the right stack.
    This is a "blocking system call".

What happens when the disk completion interrupt occurs?
  The device interrupt routine finds the process waiting for that I/O,
  marks the process as runnable, and returns from the interrupt.
  Someday the process scheduler will switch to the waiting process.

Example: server_1, the web server in the handout
  Problem: draw time-lines for CPU, disk, network.
    The server alternates waiting for each of them;
    CPU, disk, and network are each idle much of the time.
  OK if there is only one client.
  Not OK if there are clients waiting for service:
    we may have lots of work AND idle resources. Not good.
  The s/w structure forces one-at-a-time processing.

How can we use the system's resources more efficiently?
  What we want is *I/O concurrency*:
    the ability to overlap I/O wait with other useful work.
    In the web server case, I/O wait is mostly for net transfer to the client.
    Could be disk I/O: compile the 1st part of a file while fetching the 2nd.
    Could be user interaction: emacs GC while waiting for you to type.
  The performance benefits of I/O concurrency can be huge:
    suppose we're waiting 10 milliseconds on the disk for client one.
    We can probably serve 100 other clients from cache during that time!

Typical ways to get concurrency.
  This is about s/w structure. There are any number of alternatives:
  1. Multiple processes
  2. One process, many threads
  3. Event-driven
  One process can be better than you think!
    The O/S provides I/O concurrency transparently when it can:
    read-ahead into the cache, write-behind from the buffer.
    Works for disk and network connections.

1. I/O concurrency with multiple processes
  Start a new UNIX process for each client connection / request (fork()).
  A master process hands out connections to the child processes.
  (e.g. server_2() in the handout; a rough sketch of the idea follows below)
  + Isolated: a bug triggered by one client does not crash the whole server.
  + If > 1 CPU, CPU concurrency as a side effect.
  - The cost of starting a new process (fork()) may be high:
    e.g. ~300 microseconds on my computer.
  - Context switching between processes is expensive.
  - Processes are isolated (e.g. they do not share memory).
    What if you want a web cache? It must be shared among processes.
    Or even just keeping statistics?
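A minimal sketch of the fork()-per-connection pattern in C -- this is not
the handout's actual server_2 code; the port number, handle_client(), and
the omitted error handling are all assumptions:

  /* Sketch: master process accepts connections, forks a child per client.
   * handle_client() is a hypothetical request handler; real code would
   * check every return value. */
  #include <string.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <signal.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  static void handle_client(int fd) {
    char buf[512];
    read(fd, buf, sizeof(buf));       /* read request (no HTTP parsing) */
    const char *reply = "HTTP/1.0 200 OK\r\n\r\nhello\r\n";
    write(fd, reply, strlen(reply));
  }

  int main(void) {
    signal(SIGCHLD, SIG_IGN);         /* auto-reap exited children */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);      /* assumed port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);
    for (;;) {
      int c = accept(s, NULL, NULL);  /* master: wait for a connection */
      if (c < 0) continue;
      if (fork() == 0) {              /* child: serve this client only */
        close(s);
        handle_client(c);
        close(c);
        exit(0);
      }
      close(c);                       /* master: the child owns the socket */
    }
  }

Note where the isolation comes from: the child gets a *copy* of the
address space, which is why a crash is contained but also why there is
no shared web cache.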
2. I/O concurrency with (kernel-supported) threads
  Looks a bit like multiple processes (use clone() instead of fork()),
    but clone() leaves the address space alone: all threads share memory.
  Per-thread resources: user stack, kernel stack, kernel state.
  (Per-process resources: addr space, file descriptors.)
  + Multiple threads are much cheaper than multiple processes.
  + Potentially easier to share data among threads
    (e.g. a shared web cache); however,
  - need to be careful about races and deadlocks involving data
    shared among threads.
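A minimal sketch of thread-per-connection, using POSIX threads rather
than raw clone(); the shared hit counter stands in for a shared web
cache, and client_thread()/serve() are made-up names. The mutex is what
"careful about races" means in practice:

  /* Sketch: one thread per connection, with shared mutable state. */
  #include <pthread.h>
  #include <unistd.h>

  static long hits;                   /* shared among all threads */
  static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

  static void *client_thread(void *arg) {
    int fd = (int)(long)arg;
    pthread_mutex_lock(&mu);          /* without the lock, hits++ races */
    hits++;
    pthread_mutex_unlock(&mu);
    /* ... read request from fd, consult shared cache, write reply ... */
    close(fd);
    return NULL;
  }

  /* Called by an accept() loop for each new connection fd. */
  void serve(int fd) {
    pthread_t t;
    pthread_create(&t, NULL, client_thread, (void *)(long)fd);
    pthread_detach(t);                /* thread cleans itself up on exit */
  }

Compare with the fork() version: no copy of the address space, so
threads are cheap and the cache is shared, but every shared variable
now needs a locking discipline.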
3. Event-driven programming
  A typical O/S provides only partial support for event notification:
    yes: new TCP connections, arriving TCP/pipe/tty data
    no: file-system operation completion
  Similarly, not all system call operations can be started w/o waiting:
    yes: connect(), socket read(), write()
    no: open(), stat()
    maybe: disk read()
  Event-driven programming (asynchronous programming with one process):
    Organize the s/w around the arrival of events.
    Write the s/w in state-machine style:
      when this event occurs, execute this function.
    Library support to register interest in events.
    The point: this preserves the serial nature of the events.
      The programmer sees events/functions occurring one at a time.
  - Need to break the s/w into a series of event handlers,
    with no blocking operations in each handler.
  - Hard for async programs to maintain state:
    no longer able to keep track of things w/ the stack
    (the stack does not persist across different handlers).
  - Cannot make use of more than one CPU
    (because there is only one process!).
  + Cheap & fast.
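A minimal sketch of the state-machine style built on select(); struct
conn, its READING/WRITING states, and event_loop() are hypothetical,
and accept() handling plus all error checks are elided. Note the
explicit state field standing in for the stack that does not persist
across handlers:

  /* Sketch: one process, one event loop, non-blocking handlers. */
  #include <unistd.h>
  #include <sys/select.h>

  enum state { READING, WRITING };
  struct conn {
    int fd;
    enum state st;                    /* explicit per-connection state,
                                         since no stack survives between
                                         handler invocations */
    char buf[512];
    int n;
  };

  void event_loop(int listen_fd, struct conn *conns, int nconns) {
    for (;;) {
      fd_set rfds, wfds;
      FD_ZERO(&rfds);
      FD_ZERO(&wfds);
      FD_SET(listen_fd, &rfds);
      int maxfd = listen_fd;
      for (int i = 0; i < nconns; i++) {
        if (conns[i].st == READING) FD_SET(conns[i].fd, &rfds);
        else                        FD_SET(conns[i].fd, &wfds);
        if (conns[i].fd > maxfd) maxfd = conns[i].fd;
      }
      select(maxfd + 1, &rfds, &wfds, NULL, NULL);  /* the only wait */
      if (FD_ISSET(listen_fd, &rfds)) {
        /* accept() a new connection; add a conn in state READING */
      }
      for (int i = 0; i < nconns; i++) {
        if (conns[i].st == READING && FD_ISSET(conns[i].fd, &rfds)) {
          conns[i].n = read(conns[i].fd, conns[i].buf, sizeof(conns[i].buf));
          conns[i].st = WRITING;      /* advance the state machine */
        } else if (conns[i].st == WRITING && FD_ISSET(conns[i].fd, &wfds)) {
          write(conns[i].fd, conns[i].buf, conns[i].n);
          conns[i].st = READING;
        }
      }
    }
  }

Each pass through the loop runs every ready handler to completion and
then blocks only in select(): events appear one at a time, exactly the
serial view described above.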