Lec 4: OS structure (cont)
Lecture notes courtesy of Eddie Kohler, Frans Kaashoek, Robert Morris
In the last lecture, we talked about
- virtual CPU w/ threads
- virtualize memory w/ address spaces
In this lecture, we will discuss
- virtualize I/O
- Use H/w privilege
- protected control transfer (syscall)
Previous lecture: virtualize CPU and memory
We have virtualized CPU and memory w/ minimal performance overhead
- CPU virtualization: timer interrupts fire only every 1-100ms (on the order of a million instructions), at which point the kernel saves the execution context (i.e. CPU registers etc.)
- Memory virtualization: MMU translates virtual address to physical address (happens in h/w, very fast)
Process is a (group of) thread(s) in a separate address space
OS needs to keep per-process state necessary for virtualizing CPU and memory
Structure of a process descriptor (one entry in the process table):
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
| ... |
+--------------------------+
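The descriptor above can be sketched as a C struct. This is a hypothetical sketch: the field names and layout are illustrative assumptions, not taken from any real kernel.

```c
#include <stdint.h>

/* Hypothetical sketch of a process descriptor: saved register
   state plus the address-space mapping. */
typedef struct procdescriptor {
    uint32_t eax, ebx, ecx, edx;     /* saved general registers */
    uint32_t esp, ebp, eip, eflags;  /* stack/instruction pointers, flags */
    void *page_directory;            /* root of this process's address space */
} procdescriptor_t;
```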
Virtualize I/O
Three main considerations determine how OS handles I/O
- Compared to the CPU and memory, most I/O devices are slow (especially those in direct contact with humans - keyboard, mouse), so the overhead of handling I/O devices tends to be acceptable.
- It is dangerous to allow an application complete control over an I/O device. For instance, a process with raw access to the hard drive could bypass file protection.
- It is good to have a similar high-level interface that works for all devices, regardless of manufacturer.
For these reasons, modern OSes provide abstract interfaces for I/O devices.
What's a good interface for accessing I/O?
Important I/O devices: Hard Disk, CD-ROM, Network, and Keyboard
They differ in characteristics (random access vs. stream etc.). One interface for each device?
UNIX's idea: treat everything as a file.
Our previous interface for reading a file (in L3):
int read_file(char *filename, int offset, char *buffer, int len)
Better interface:
int read(int fd, char *buffer, int len)
Whether the file descriptor refers to a stream or a random-access file, the next len bytes are read.
The offset is moved to a separate function that handles random access:
off_t lseek(int fd, off_t offset, int whence)
where whence is one of SEEK_SET, SEEK_CUR, SEEK_END.
Attempts to call lseek on a streaming file (e.g. a network socket) will fail.
The write interface is similar
int write(int fd, const void *data, int len)
Before I/O can be accessed, a file descriptor needs to be created using the open syscall
int open(const char *name, int mode);
Each process's descriptor contains a file descriptor table (otherwise, the OS does not know what fd=3 refers to!)
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
|file descriptor table |
+--------------------------+
+--file descriptor table--------------+
| ... |
| "wordfile.txt", O_RDONLY, offset=10 |
| tcp_socket, O_RDWR, .... |
| ... |
+-------------------------------------+
close frees up space in file descriptor table
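A short user-level example of this interface: write a few bytes, lseek back to the start, and read them again. The scratch path passed in is a hypothetical example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Round-trip through the file interface; returns 1 on success.
   The path is a hypothetical scratch file. */
int fd_demo(const char *path)
{
    char buf[6] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 0;
    if (write(fd, "hello", 5) != 5)      /* offset advances to 5 */
        return 0;
    if (lseek(fd, 0, SEEK_SET) != 0)     /* random access: rewind */
        return 0;
    if (read(fd, buf, 5) != 5)           /* read the same 5 bytes back */
        return 0;
    close(fd);                           /* frees the fd table slot */
    return strcmp(buf, "hello") == 0;
}
```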
Waiting on I/O
In L3, the OS wastes a lot of time doing nothing while I/O is slowly doing its thing
Recall the read_ide_sector function as part of sys_read
read_ide_sector()
{
    /* poll the IDE status register (port 0x1F7) until the disk
       is ready (BSY bit clear, DRDY bit set) */
    while ((inb(0x1F7) & 0xC0) != 0x40)
        /* do nothing */;
    /* ... then read the sector data */
}
Busy waiting defeats our goal of high utilization.
Better alternative: have a process that is waiting on I/O yield the CPU to other processes.
Keep track of what processes are blocked (on I/O) and which are runnable.
+--procdescriptor_t--------+
| .... |
|registers (%eax,%esp,...) |
| address space |
|file descriptor table |
|process state |
+--------------------------+
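The blocked/runnable bookkeeping can be sketched in C; the names and the fixed-size process table are assumptions for illustration, not from any real kernel.

```c
/* A process blocked on I/O is skipped by the scheduler until the
   device interrupt marks it runnable again. */
enum proc_state { RUNNABLE, BLOCKED };

struct proc { int pid; enum proc_state state; };

#define NPROC 4
struct proc ptable[NPROC];

void block_on_io(struct proc *p) { p->state = BLOCKED;  /* then yield CPU */ }
void io_complete(struct proc *p) { p->state = RUNNABLE; /* from irq handler */ }

/* Round-robin: pick the next runnable process after cur; -1 if none. */
int pick_next(int cur)
{
    for (int i = 1; i <= NPROC; i++) {
        struct proc *p = &ptable[(cur + i) % NPROC];
        if (p->state == RUNNABLE)
            return p->pid;
    }
    return -1;
}
```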
Basic Process interaction: Fork and Exec
Part of minilab1's task is to implement process creation
The application code that uses fork
pid_t pid;
pid = fork();
if (pid == 0) {
    // do child code
} else {
    // do parent code
}
Fork can be implemented easily:
- create new procdescriptor_t entry for the new process
- copy memory, registers, file descriptor table.
- For parent, set %eax=child's process ID, for child, set %eax = 0.
exec loads a program from file into memory
Copying memory from parent to child is wasteful if the child is going to immediately execute a new program. (Trick: copy-on-write)
Processes also need to synchronize with each other: kill, wait, ...
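A minimal runnable sketch of fork plus wait (the exit code 42 is an arbitrary example): the child sees a return value of 0, the parent sees the child's pid and collects its exit status.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child and wait for it; returns the child's exit code, -1 on error. */
int fork_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)                 /* child: fork() returned 0 (%eax = 0) */
        _exit(42);
    int status;                   /* parent: fork() returned the child's pid */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```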
H/w Privilege
In L3, we talked about how the kernel can only properly isolate (enforce modularity)
different application processes with h/w support. For example,
- timer interrupt to force an application process to give up CPU
- virtual memory to enforce memory protection
Applications are not only buggy through negligence; they could also be malicious.
How to prevent application processes from disabling timer interrupts,
modifying page tables, directly communicating with h/w?
H/w support for privilege levels. Main idea: Running code has "privilege"
associated with it. High privilege code (the operating system) can run any
instruction it likes. Low privilege code (user code) can only run "safe"
commands.
x86 supports four privilege levels (0-3), of which a typical OS like Linux uses only two. (0: most privileged, 3: least privileged)
Each application has a current privilege level (CPL) (encoded in two bits in the Code Segment(CS) register)
Before executing a dangerous instruction, the processor performs a check that looks like the following pseudo-code:
if (CPL != 0)
    raise exception;
else
    execute instruction;
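Since the CPL is encoded in the low two bits of %cs, the check above can be modeled in C. This is a sketch of the logic only, not how hardware is written.

```c
/* The CPL is encoded in the low two bits of the %cs selector. */
int cpl_of_cs(unsigned cs) { return cs & 3; }

/* Model of the hardware check: dangerous instructions run only at CPL 0. */
int may_execute_dangerous(unsigned cs) { return cpl_of_cs(cs) == 0; }
```

(The selector values used when exercising this, e.g. 0x08 or 0x1B, are arbitrary examples.)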
What instructions should be "protected" (i.e. "dangerous")?
User-level applications cannot set the CPL nor directly execute "dangerous" instructions. If they try, the h/w processor generates an exception and gives control back to the kernel.
Kernel must set the CPL correctly before jumping to user code.
Protected control transfer
Application processes cannot directly execute "dangerous" x86 instructions like inb/outb but must invoke OS services (functions) instead.
App cannot directly call kernel functions. Why?
Apps invoke kernel services (syscalls) by going through a protected control transfer using the x86 instruction int
int generates a trap exception. (Intel classifies software generated interrupts as exceptions)
Interrupts are handled through interrupt gates or trap gates
Each interrupt is associated with a number used to identify its corresponding gate
When an interrupt occurs, the following steps are performed:
- Processor switches to a numerically lower (i.e. more privileged) privilege level, e.g. with int
- Processor performs a stack switch if switching to a numerically lower privilege level
- Processor saves the EFLAGS, CS, EIP registers on the stack
- Processor jumps to the interrupt/trap gate based on the interrupt #.
- Kernel stores the general registers (%eax,%ebx,...) into memory
- Kernel performs the rest of interrupt handling logic
- Kernel invokes the iret instruction to return to interrupted procedure (restores saved EIP, CS, EFLAGS, performs stack switching, resets privilege)
Note which of the tasks are done by h/w (processor) and which are done by the kernel.
System call
Linux invokes syscalls using int 0x80 (interrupt # 128)
The syscall number is passed in %eax
In the interrupt handling routine, the kernel invokes the right syscall function based on the syscall #.
The return value of the syscall is returned in %eax
Linux enables tracing of the syscalls invoked by a user-level process, e.g. strace -p 6778 where 6778 is the process ID.
Tracing is implemented by invoking tracing functions upon entering and exiting the syscall handling routines.
Note that a user application usually uses a library wrapper function to access syscalls.
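On Linux, the libc syscall(2) function exposes this mechanism directly: you pass the syscall number yourself instead of going through a wrapper like write. A minimal sketch:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the write syscall by number, bypassing the libc wrapper;
   the return value comes back the same way (via %eax/%rax). */
long raw_write(int fd, const char *s)
{
    return syscall(SYS_write, fd, s, strlen(s));
}
```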
Minilab1 implements syscall differently from Linux (or any other typical modern OS). Does it use a single interrupt number for all its syscalls?
H/w vs. software fault isolation
H/w privilege level support allows us to forbid apps from executing "dangerous" instructions.
Software-enforced isolation idea 1: Why not parse app code to throw away "dangerous" instructions?
- not possible for arbitrary x86 code (unaligned code, self-modifying code)
- arbitrary x86 code is difficult to disassemble correctly.
- possible if we enforce lots of structure in x86 code (read-only code segment, aligned code etc. it's an active research area)
Software idea 2: why not create an intermediate "instruction set" (and an associated runtime system) to forbid "dangerous" instructions?
- E.g. the JVM runtime executes Java bytecode. The JVM can refuse to execute a "dangerous" instruction.
- Bad: slow. Forces all apps to be written in Java (or at least compiled to Java bytecode).
Summary: Organization of a Modern Operating System
Picture of user-level processes, OS, syscall interfaces
Kernel runs with full privilege over the hardware.
Above is the traditional OS organization: monolithic OS
Kernel is a big program occupying a single address space
All kernel code runs w/ full h/w privilege (CPL=0)
good: fast, easy for sub-systems to cooperate (e.g. paging and file system) via simple function calls
bad: no isolation within kernel. One buggy component affects everything else.
Alternative organization: microkernel
Split up kernel subsystems into server processes
- servers: VM, FS, TCP/IP, Print, Display
- some servers have privileged access to h/w, some don't
app communicates w/ servers via IPC
Kernel's task: implement fast IPC
Good: simple/efficient kernel, sub-systems isolated, better-enforced modularity
bad: cross-sub-system optimization harder, lots of IPCs may be slow
Monolithic OS remains the most popular today.
Digression: H/w emulation