Lec 4: OS structure (cont)
Lecture notes courtesy of Eddie Kohler, Frans Kaashoek, Robert Morris
In the last lecture, we talked about
- virtual CPU w/ threads
- virtualize memory w/ address spaces
In this lecture, we will discuss
- virtualize I/O
- Use H/w privilege
- protected control transfer (syscall)
Previous lecture: virtualize CPU and memory
We have virtualized CPU and memory w/ minimal performance overhead
- CPU virtualization: timer interrupts fire only every 1-100ms (on the order of a million instructions), at which point the kernel saves the execution context (i.e. CPU registers etc.)
- Memory virtualization: MMU translates virtual address to physical address (happens in h/w, very fast)
Process is a (group of) thread(s) in a separate address space
OS needs to keep per-process state necessary for virtualizing CPU and memory
Structure of a process descriptor (one entry in the process table):
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
| ... |
+--------------------------+
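The descriptor above can be sketched as a C struct. This is a hypothetical sketch: the field names and layout are illustrative assumptions, not taken from any real kernel.

```c
#include <stdint.h>

/* Hypothetical sketch of a process descriptor: saved register
   state plus the address-space mapping. */
typedef struct procdescriptor {
    uint32_t eax, ebx, ecx, edx;     /* saved general registers */
    uint32_t esp, ebp, eip, eflags;  /* stack/instruction pointers, flags */
    void *page_directory;            /* root of this process's address space */
} procdescriptor_t;
```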
Virtualize I/O
Three main considerations determine how OS handles I/O
- Compared to the CPU and memory, most I/O devices are slow (especially those in direct contact with humans - keyboard, mouse), so the overhead of handling I/O devices tends to be acceptable.
- It is dangerous to allow an application complete control over an I/O device. For instance, a process with raw access to the hard drive could bypass file protection.
- It is good to have a similar high-level interface that works for all devices, regardless of manufacturer.
For these reasons, modern OSes provide abstract interfaces for I/O devices.
What's a good interface for accessing I/O?
Important I/O devices: Hard Disk, CD-ROM, Network, and Keyboard
They differ in characteristics (random access vs. stream etc.). One interface for each device?
UNIX's idea: treat everything as a file.
Our previous interface for reading a file (in L3):
int read_file(char *filename, int offset, char *buffer, int len)
Better interface:
int read(int fd, char *buffer, int len)
Whether the file descriptor refers to a stream or a random-access file, the next len bytes are read.
The offset is moved to a separate function that handles random access:
off_t lseek(int fd, off_t offset, int whence)
where whence is one of SEEK_SET, SEEK_CUR, SEEK_END.
Attempts to call lseek on a streaming file (e.g. a network socket) will fail.
The write interface is similar
int write(int fd, const void *data, int len)
Before I/O can be accessed, a file descriptor needs to be created using the open syscall
int open(const char *name, int mode);
Each process's descriptor contains a file descriptor table (otherwise, the OS does not know what fd=3 refers to!)
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
|file descriptor table |
+--------------------------+
+--file descriptor table--------------+
| ... |
| "wordfile.txt", O_RDONLY, offset=10 |
| tcp_socket, O_RDWR, .... |
| ... |
+-------------------------------------+
close frees up space in file descriptor table
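A short user-level example of this interface: write a few bytes, lseek back to the start, and read them again. The scratch path passed in is a hypothetical example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Round-trip through the file interface; returns 1 on success.
   The path is a hypothetical scratch file. */
int fd_demo(const char *path)
{
    char buf[6] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 0;
    if (write(fd, "hello", 5) != 5)      /* offset advances to 5 */
        return 0;
    if (lseek(fd, 0, SEEK_SET) != 0)     /* random access: rewind */
        return 0;
    if (read(fd, buf, 5) != 5)           /* read the same 5 bytes back */
        return 0;
    close(fd);                           /* frees the fd table slot */
    return strcmp(buf, "hello") == 0;
}
```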
Waiting on I/O
In L3, the OS wastes a lot of time doing nothing while I/O is slowly doing its thing
Recall the read_ide_sector function as part of sys_read
read_ide_sector()
{
    /* poll the IDE status register (port 0x1F7) until the disk
       is ready (BSY bit clear, DRDY bit set) */
    while ((inb(0x1F7) & 0xC0) != 0x40)
        /* do nothing */;
    /* ... then read the sector data */
}
Busy waiting defeats our goal of high utilization.
Better alternative: have a process that is waiting on I/O yield the CPU to other processes.
Keep track of what processes are blocked (on I/O) and which are runnable.
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
|file descriptor table |
|process state |
+--------------------------+
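The blocked/runnable bookkeeping can be sketched in C; the names and the fixed-size process table are assumptions for illustration, not from any real kernel.

```c
/* A process blocked on I/O is skipped by the scheduler until the
   device interrupt marks it runnable again. */
enum proc_state { RUNNABLE, BLOCKED };

struct proc { int pid; enum proc_state state; };

#define NPROC 4
struct proc ptable[NPROC];

void block_on_io(struct proc *p) { p->state = BLOCKED;  /* then yield CPU */ }
void io_complete(struct proc *p) { p->state = RUNNABLE; /* from irq handler */ }

/* Round-robin: pick the next runnable process after cur; -1 if none. */
int pick_next(int cur)
{
    for (int i = 1; i <= NPROC; i++) {
        struct proc *p = &ptable[(cur + i) % NPROC];
        if (p->state == RUNNABLE)
            return p->pid;
    }
    return -1;
}
```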
Basic Process interaction: Fork and Exec
Part of minilab1's task is to implement process creation
The application code that uses fork
pid_t pid;
pid = fork();
if (pid == 0) {
    // do child code
} else {
    // do parent code
}
Fork can be implemented easily:
- create new procdescriptor_t entry for the new process
- copy memory, registers, file descriptor table.
- For parent, set %eax=child's process ID, for child, set %eax = 0.
exec loads a program from file into memory
Copying memory from parent to child is wasteful if the child is going to immediately execute a new program. (Trick: copy-on-write)
Processes also need to synchronize with each other: kill, wait, ...
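A minimal runnable sketch of fork plus wait (the exit code 42 is an arbitrary example): the child sees a return value of 0, the parent sees the child's pid and collects its exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child and wait for it; returns the child's exit code, -1 on error. */
int fork_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                 /* child: fork() returned 0 (%eax = 0) */
        _exit(42);
    int status;                   /* parent: fork() returned the child's pid */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```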
H/w Privilege
In L3, we talked about how the kernel can only properly isolate (enforce modularity)
different application processes with h/w support. For example,
- timer interrupt to force an application process to give up CPU
- virtual memory to enforce memory protection
Applications are not only buggy through negligence; they could also be malicious.
How to prevent application processes from disabling timer interrupts,
modifying page tables, directly communicating with h/w?
H/w support for privilege levels. Main idea: Running code has "privilege"
associated with it. High privilege code (the operating system) can run any
instruction it likes. Low privilege code (user code) can only run "safe"
commands.
x86 supports four privilege levels (0-3), of which a typical OS like Linux uses only two. (0: most privileged, 3: least privileged)
Each application has a current privilege level (CPL) (encoded in two bits in the Code Segment(CS) register)
Before executing a dangerous instruction, the processor performs a check that looks like the following pseudo-code:
if (CPL != 0)
    raise exception;
else
    execute instruction;
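Since the CPL is encoded in the low two bits of %cs, the check above can be modeled in C. This is a sketch of the logic only, not how hardware is written.

```c
/* The CPL is encoded in the low two bits of the %cs selector. */
int cpl_of_cs(unsigned cs) { return cs & 3; }

/* Model of the hardware check: dangerous instructions run only at CPL 0. */
int may_execute_dangerous(unsigned cs) { return cpl_of_cs(cs) == 0; }
```

(The selector values used when exercising this, e.g. 0x08 or 0x1B, are arbitrary examples.)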
What instructions should be "protected" (i.e. "dangerous")?
User-level applications cannot set the CPL nor directly execute "dangerous" instructions. If they try, the h/w processor generates an exception and gives control back to the kernel.
Kernel must set the CPL correctly before jumping to user code.
Protected control transfer
Application processes cannot directly execute "dangerous" x86 instructions like inb/outb but must invoke OS services (functions) instead.
App cannot directly call kernel functions. Why?
Apps invoke kernel services (syscalls) by going through a protected control transfer using the x86 instruction int
int generates a trap exception. (Intel classifies software generated interrupts as exceptions)
Interrupts are handled through interrupt gates or trap gates
Each interrupt is associated with a number used to identify its corresponding gate
When an interrupt occurs, the following steps are performed:
- Processor switches to a numerically lower (i.e. more privileged) privilege level, e.g. with int
- Processor performs a stack switch if switching to a numerically lower privilege level
- Processor saves the EFLAGS, CS, EIP registers on the stack
- Processor jumps to the interrupt/trap gate based on the interrupt #.
- Kernel stores the general registers (%eax,%ebx,...) into memory
- Kernel performs the rest of interrupt handling logic
- Kernel invokes the iret instruction to return to interrupted procedure (restores saved EIP, CS, EFLAGS, performs stack switching, resets privilege)
Note which of the tasks are done by h/w (processor) and which are done by the kernel.
System call
Linux invokes syscalls using int 0x80 (interrupt # 128)
The syscall number is passed in %eax
In the interrupt handling routine, the kernel invokes the right syscall function based on the syscall #.
The return value of the syscall is returned in %eax
Linux enables tracing of the syscalls invoked by a user-level process, e.g. strace -p 6778 where 6778 is the process ID.
Tracing is implemented by invoking tracing functions upon entering and exiting the syscall handling routines.
Note that a user application usually uses a library wrapper function to access syscalls.
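On Linux, the libc syscall(2) function exposes this mechanism directly: you pass the syscall number yourself instead of going through a wrapper like write. A minimal sketch:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the write syscall by number, bypassing the libc wrapper;
   the return value comes back the same way (via %eax/%rax). */
long raw_write(int fd, const char *s)
{
    return syscall(SYS_write, fd, s, strlen(s));
}
```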
Minilab1 implements syscall differently from Linux (or any other typical modern OS). Does it use a single interrupt number for all its syscalls?
H/w vs. software fault isolation
H/w privilege level support allows us to forbid apps from executing "dangerous" instructions.
Software-enforced isolation idea 1: Why not parse app code to throw away "dangerous" instructions?
- not possible for arbitrary x86 code (unaligned code, self-modifying code)
- arbitrary x86 code is difficult to disassemble correctly.
- possible if we enforce lots of structure in x86 code (read-only code segment, aligned code etc. it's an active research area)
Software idea 2: why not create an intermediate "instruction set" (and an associated runtime system) to forbid "dangerous" instructions?
- E.g. the JVM runtime executes Java bytecode. The JVM can refuse to execute a "dangerous" instruction.
- Bad: slow. Forces all apps to be written in Java (or at least compiled to Java bytecode).
Summary: Organization of a Modern Operating System
Picture of user-level processes, OS, syscall interfaces
Kernel runs with full privilege over the hardware.
Above is the traditional OS organization: monolithic OS
Kernel is a big program occupying a single address space
All kernel code runs w/ full h/w privilege (CPL=0)
good: fast, easy for sub-systems to cooperate (e.g. paging and file system) via simple function calls
bad: no isolation within kernel. One buggy component affects everything else.
Alternative organization: microkernel
Split up kernel subsystems into server processes
- servers: VM, FS, TCP/IP, Print, Display
- some servers have privileged access to h/w, some don't
app communicates w/ servers via IPC
Kernel's task: implement fast IPC
Good: simple/efficient kernel, sub-systems isolated, better-enforced modularity
bad: cross-sub-system optimization harder, lots of IPCs may be slow
Monolithic OS remains the most popular today.
Digression: H/w emulation