Lecture notes courtesy of Eddie Kohler, Robert Morris and Frans Kaashoek
Scheduling
- What is scheduling?
- OS policies and mechanisms to allocate resources to entities.
- Scheduling makes sense when re-assignment of resources
can happen with some frequency: CPU cycles, network
bandwidth, disk accesses, memory when swapping to disk.
- Not so applicable to more static resources like disk space.
- Entities that you might want to give resources to: users,
processes, threads, web requests, or MIT accounts.
- Why scheduling?
- Scheduling unavoidable with shared resources (whether implicit
or explicit), but particularly interesting when demand exceeds
available resources (e.g. demand for CPU is infinite with
compute-bound jobs).
- A good scheduling policy ensures that
each entity gets the resources it needs.
- Many policies for allocating resources to entities are possible:
strict priority, divide equally, shortest job first, minimum
guarantee combined with admission control.
- The importance of clever scheduling.
- Scheduling was popular in the days of time sharing, when
there was a shortage of resources all around.
- Many scheduling problems become uninteresting when you can
simply buy a faster CPU or a faster network.
- Exception: web sites and large-scale networks often cannot
be made fast enough to handle peak demand (flash crowds,
attacks) so scheduling becomes important again.
For example, one may want to prioritize paying customers, or
to counter denial-of-service attacks.
- Exception 2: some scheduling decisions have non-linear
effects on overall system behavior, not just different
performance for different users. For example, see pitfalls.
- Exception 3: real-time systems, where jobs must meet deadlines.
- Key problems
- Gap between desired policy and available mechanism.
High-level desired policies often don't have a one-to-one
mapping onto available scheduling mechanisms. Scheduler
can approximate policy, or implement it under certain
assumptions, etc.
- Conflicting goals, e.g. low latency, high throughput,
and fairness. The scheduler must make a trade-off between
the goals.
- Interaction between different schedulers.
Just optimizing the CPU scheduler may do little for
the overall desired policy.
- General plan for scheduling mechanisms
- Understand where scheduling is occurring.
- Expose scheduling decisions, allow control.
- Account for resource consumption, to allow intelligent control.
Basic scheduling algorithms
In its simplest form, the OS must decide which process from the queue of runnable processes to run next.
Two possibilities for when to schedule:
- cooperative (non-preemptive) scheduling: the running process decides when to yield its resources; equivalently, the current
running process finishes before another is started.
- preemptive scheduling: the kernel can stop a running (and potentially misbehaving) process and force it to give up control of its resources.
Preemptive scheduling is used in most modern operating systems for user-level applications; however, cooperative scheduling is still often used by the kernel when it manages its own internal jobs.
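A minimal sketch of the cooperative, run-to-completion case in C (the task functions and names are made up for illustration): each task keeps the CPU until it voluntarily returns; nothing can force it off.

    /* Run-to-completion (cooperative) scheduling sketch. */
    #include <stdio.h>

    typedef void (*task_fn)(void);

    static void task_a(void) { printf("task A runs to completion\n"); }
    static void task_b(void) { printf("task B runs to completion\n"); }

    int main(void) {
        task_fn queue[] = { task_a, task_b };   /* FIFO of runnable tasks */
        for (int i = 0; i < 2; i++)
            queue[i]();  /* next task starts only after the previous yields */
        return 0;
    }

A preemptive scheduler would instead rely on a timer interrupt to take the CPU away mid-task.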
Performance considerations for scheduling algorithms
- Throughput: the rate of useful work done for a given workload. Ideal scenario (draw diagram): as offered load increases, throughput increases linearly until it hits capacity. The higher the resource utilization (i.e., the lower the overhead), the better the throughput.
- Latency (or turn-around time): the total amount of time it takes a job to complete its execution. This includes all the time the job spends waiting for other jobs to complete.
- Wait time: the time before the scheduler allows the given job to start executing.
- Fairness: does each job get an equal share (or a share proportional to its priority)?
- Average vs. worst case.
Scheduling is hard because different performance goals usually conflict with each other.
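To make the definitions concrete, a small C sketch that computes average wait and turn-around time from arrival/start/finish times; the numbers are taken from the FCFS example below, with context-switch overhead x = 0:

    #include <stdio.h>

    struct job { const char *name; double arrival, start, finish; };

    int main(void) {
        /* Times from the FCFS example below, with overhead x = 0. */
        struct job jobs[] = {
            { "A", 0, 0, 3 }, { "B", 1, 3, 8 }, { "C", 3, 8, 10 },
        };
        int n = sizeof jobs / sizeof jobs[0];
        double wait_sum = 0, turn_sum = 0;
        for (int i = 0; i < n; i++) {
            wait_sum += jobs[i].start  - jobs[i].arrival;  /* wait time */
            turn_sum += jobs[i].finish - jobs[i].arrival;  /* turn-around */
        }
        printf("avg wait %.2f, avg turn-around %.2f\n",
               wait_sum / n, turn_sum / n);    /* prints 2.33 and 5.67 */
        return 0;
    }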
Algorithm #1: FCFS (first-come-first-served)
Organize the set of runnable processes (or jobs) in a FIFO queue.
A new job is added to the tail of the queue.
Run the job at the head of the queue until it finishes, then run the next job, and so on.
Example:
Let x refer to the context switching overhead
Job  Arrival-Time  Amount-of-work  Start  Finish  Wait  Turn-around
A    0             3               0      3       0     3
B    1             5               3+x    8+x     2+x   7+x
C    3             2               8+2x   10+2x   5+2x  7+2x
Timeline: +---A---+----------B---------+----C----+
          0       3+x                  8+2x
Avg waiting time: (7+3x)/3 = 2.33+x
Avg turn-around: (17+3x)/3
The context-switch overhead, x, is not negligible (flushing the TLB, saving/restoring registers, plus running the scheduling algorithm itself!).
Good: the scheduler's own running time is low: O(1) per decision.
Bad: average waiting time is big, because a long-running job makes everyone else wait longer.
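A sketch of the FCFS mechanism in C (struct proc and the fixed-size circular buffer are illustrative simplifications; a real kernel has its own process structure and overflow handling):

    #include <stdio.h>

    #define NPROC 64

    struct proc { int pid; };

    static struct proc *queue[NPROC];
    static int head, tail;                      /* circular FIFO indices */

    void enqueue(struct proc *p) { queue[tail++ % NPROC] = p; } /* tail */
    struct proc *dequeue(void) {                                /* head */
        return head == tail ? NULL : queue[head++ % NPROC];
    }

    int main(void) {
        struct proc a = {1}, b = {2}, c = {3};
        enqueue(&a); enqueue(&b); enqueue(&c);
        /* FCFS: run each job to completion, in arrival order. */
        for (struct proc *p; (p = dequeue()) != NULL; )
            printf("running pid %d to completion\n", p->pid);
        return 0;
    }

Both operations are O(1), which is the "Good" point above.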
What if we want to optimize average waiting time?
Algorithm #2: Shortest-job-first
When the time comes to dispatch a job, execute the one with the shortest running time.
It's idealized: how can we predict a job's running time beforehand?
Example:
Job  Arrival-Time  Amount-of-work  Start  Finish  Wait  Turn-around
A    0             3               0      3       0     3
B    1             5               5+2x   10+2x   4+2x  9+2x
C    3             2               3+x    5+x     x     2+x
Timeline: +---A---+--C--+------------B------------+
          0       3+x   5+2x
Avg waiting time: (4+3x)/3 = 1.33+x
Avg turn around time: (14+3x)/3
Good: we can prove that shortest-job-first scheduling yields the lowest possible average waiting time (among all non-preemptive scheduling algorithms).
Bad: need to predict running time of a job somehow.
Bad: more expensive as OS needs to keep a sorted queue of jobs (Worst-case is O(n) to insert a new job)
Bad: starvation (i.e. non-fairness): a long-running job may never get an opportunity to run because short jobs are preferred.
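A sketch of the SJF dispatch step in C, assuming (unrealistically, as noted above) that predicted running times are known. Here all three jobs from the example are runnable at once, so the dispatch order is C, A, B, unlike the example above, where B and C had not yet arrived when A started:

    #include <stdio.h>

    struct job { const char *name; int predicted_work; int done; };

    struct job *pick_shortest(struct job *jobs, int n) {
        struct job *best = NULL;
        for (int i = 0; i < n; i++)          /* O(n) scan per dispatch */
            if (!jobs[i].done &&
                (best == NULL || jobs[i].predicted_work < best->predicted_work))
                best = &jobs[i];
        return best;
    }

    int main(void) {
        struct job jobs[] = { {"A", 3, 0}, {"B", 5, 0}, {"C", 2, 0} };
        for (struct job *j; (j = pick_shortest(jobs, 3)) != NULL; j->done = 1)
            printf("dispatch %s (work %d)\n", j->name, j->predicted_work);
        return 0;
    }

A sorted queue would make dispatch O(1) at the cost of O(n) insertion, as noted above; either way something pays O(n).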
Algorithm #3: Round-robin
Pre-emptive RR: break each job's execution into slices, each of which runs for no more than a fixed time value (the quantum).
Keep a FIFO queue of runnable jobs: execute the job at the head for up to one quantum, then move the pre-empted job to the end of the queue.
Execution timeline (quantum = 1): A B A B C A B C B B
Job  Arrival-Time  Amount-of-work  Start  Finish  Wait  Turn-around
A    0             3               0      6+5x    0     6+5x
B    1             5               1+x    10+9x   x     9+9x
C    3             2               4+4x   8+7x    1+4x  5+7x
Avg waiting time: (0+x+1+4x)/3 = 0.33+1.67x
Avg. turn-around time: (20+21x)/3
Good: short average waiting time (because each quantum is short).
Good: easy and efficient to implement.
Bad: more overhead for long jobs (due to more context switching)
Bad: longer average turn-around time than shortest-job first
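A sketch of the RR mechanism in C. For simplicity all jobs are made runnable at time 0 and switch overhead is ignored, so the printed order differs slightly from the timeline above, which models arrival times:

    #include <stdio.h>

    #define QUANTUM 1

    struct job { const char *name; int work_left; };

    int main(void) {
        struct job jobs[] = { {"A", 3}, {"B", 5}, {"C", 2} };
        struct job *queue[16];
        int head = 0, tail = 0;
        for (int i = 0; i < 3; i++) queue[tail++ % 16] = &jobs[i];

        while (head != tail) {
            struct job *j = queue[head++ % 16];
            int slice = j->work_left < QUANTUM ? j->work_left : QUANTUM;
            j->work_left -= slice;         /* run for at most one quantum */
            printf("%s ", j->name);
            if (j->work_left > 0)
                queue[tail++ % 16] = j;    /* preempted: back to the tail */
        }
        printf("\n");                      /* prints: A B C A B C A B B B */
        return 0;
    }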
Algorithm #4: Priority scheduling
Priorities can be defined either externally or internally.
- Internal priorities: the priority of a process is derived from measurable internal factors such as time limits, memory requirements, the number of open files, etc.
- External priorities: set by criteria external to the OS.
Strict priority scheduling: the runnable process with the best priority value is serviced first (in many systems, a numerically lower value means higher priority). Round-robin scheduling is employed among processes of equal priority.
Variants:
- Preemptive priority scheduling: a newly arrived process's priority is compared with that of the currently running process. If the new process is more important, the current process is preempted; otherwise it continues.
- Non-preemptive priority scheduling: the newly arrived process is put at the head of the ready queue (if its priority warrants it), but the currently running process is allowed to continue.
Good: can potentially implement different policies (e.g., emacs should be given preference over a backup job).
Bad: starvation (a low-priority job might never gain access to the CPU).
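A sketch of strict priority dispatch in C, with round-robin among equal priorities; the process names and the lower-value-means-more-important convention are illustrative:

    #include <stdio.h>

    #define NPROC 4

    /* Lower prio value = more important. */
    struct proc { const char *name; int prio; int runnable; };

    static struct proc procs[NPROC] = {
        { "emacs",  0, 1 },
        { "shell",  1, 1 },
        { "backup", 2, 1 },
        { "mailer", 2, 1 },
    };
    static int last;                 /* rotation point for round-robin */

    struct proc *pick(void) {
        struct proc *best = NULL;
        for (int i = 0; i < NPROC; i++) {
            /* Scan starting just after the last pick, so equal-priority
             * processes take turns when they are the best level. */
            struct proc *p = &procs[(last + 1 + i) % NPROC];
            if (p->runnable && (best == NULL || p->prio < best->prio))
                best = p;
        }
        if (best) last = (int)(best - procs);
        return best;
    }

    int main(void) {
        for (int t = 0; t < 4; t++) {
            struct proc *p = pick();
            if (!p) break;
            printf("run %s\n", p->name);  /* emacs wins every time: the
                                           * starvation problem above */
        }
        return 0;
    }

If emacs and shell were blocked, backup and mailer would alternate, which is the round-robin-among-equals rule.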
Common scheduling pitfalls
Pitfall: Priority Inversion
- Assume policy is strict priority.
Thread T1: low priority
(e.g. check to see if I have new mail periodically).
Thread T2: medium priority.
Thread T3: high priority (e.g. real-time game).
- T1: acquire(l)
context switch to T3 (maybe T3 wakes up and needs to run:
it has priority)
T3: acquire(l)... must wait for T1 to release(l)...
context switch to T2 (again, perhaps T2 needs to run &
has prio over T1)
T2 computes for a while
T3 is indefinitely delayed despite high priority
- Can solve if T3 lends its priority to the holder of the lock
it is waiting for, so T1 runs instead of T2 (priority
inheritance; see the sketch below).
- This is really a multiple scheduler problem,
since locks schedule access to locked resource.
- i.e. lock schedules access according to FCFS
- CPU schedules according to priority.
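A minimal sketch of priority inheritance in C; the thread and lock structures are toy stand-ins, and the actual blocking and wakeup a kernel would do are elided:

    #include <stdio.h>

    struct thread { const char *name; int base_prio, eff_prio; };
    struct lock   { struct thread *holder; };

    /* Donate the waiter's priority to the lock holder (the core of
     * priority inheritance); a real kernel would then sleep the waiter. */
    void donate(struct lock *l, struct thread *waiter) {
        if (l->holder && l->holder->eff_prio < waiter->eff_prio)
            l->holder->eff_prio = waiter->eff_prio;
    }

    void release(struct lock *l) {
        l->holder->eff_prio = l->holder->base_prio; /* drop donation */
        l->holder = NULL;                           /* ...then wake waiters */
    }

    int main(void) {
        struct thread t1 = { "T1", 1, 1 };   /* low:  mail check */
        struct thread t3 = { "T3", 3, 3 };   /* high: real-time game */
        struct lock l = { &t1 };             /* T1 holds the lock */

        donate(&l, &t3);                     /* T3 blocks on l, lends prio */
        printf("T1 now runs at prio %d (beats T2's prio 2)\n", t1.eff_prio);
        release(&l);
        printf("T1 back to prio %d\n", t1.eff_prio);
        return 0;
    }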
Pitfall: Receive Livelock
- Scheduling is not confined to user processes.
- Typical UNIX execution contexts:
- process, user half (e.g. web server)
- process, kernel half (e.g. the read() write() syscall)
- soft interrupts (e.g. TCP ACKS, ARP responses etc.)
- device (hard) interrupts (e.g. reception of a packet)
- Rules for context transitions:
- Hardware can force a hard interrupt at almost any time, except during another hard interrupt
- A soft interrupt can be invoked at any time except during a soft or hard interrupt
- An interrupt must complete before the interrupted code can continue, since interrupts share its stack
- Never run a user half if any kernel half wants to run
- User halves are pre-emptable via the timer interrupt
- (Implicit) scheduling priority structure: hard intr > soft intr > kernel half > user half
- Why soft interrupts?
- For time-consuming processing in response to hard interrupts,
which should run relatively soon, and might not be associated
with any specific kernel half (e.g. TCP ACKs, ARP responses,
etc.)
- Hard interrupt does time-critical stuff, schedules soft interrupt,
more hard interrupts can occur during soft int.
- Short-running hard interrupts avoid input overruns.
- Receive interrupt (hard intr)
- Copy incoming packet to memory
- Append packet pointer to IP input queue (ipintrq)
- Schedule IP soft interrupt
- IP soft interrupt
- For each packet in IP input queue,
- Decide what to do with it
- E.g. deliver to a user program
- Append packet pointer to socket Q
- Wake up process
- Transmit interrupt (hard intr)
- Free last packet buffer
- Copy packet from if_snd queue to h/w
- Tell h/w to transmit
- Network interface hardware (e.g. ethernet)
- Executes concurrently with the main CPU
- Has small receive buffer on interface h/w (a few packets)
- Raises receive interrupt for each received packet
- Has small output buffer (a few packets)
- Raises transmit-complete interrupts when buffers open up
- Why the input queue (ipintrq)?
- Why the output queue (if_snd)?
- Is this a good arrangement?
- Draw diagram of web request served vs. requests issued.
- Possible solutions:
- Ideal: when the IP input queue fills up, disable receive interrupts for the network device. When the queue is drained, re-enable receive interrupts (sketched below).
- Practical: polling. Whenever the IP input queue opens up, poll the device for received packets.
- General principle:
- Don't spend time on new work before completing existing work.
- Or give new work lower priority than partially-completed work.
- Corollary: If you might discard, do it as early as possible.
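A sketch of the "ideal" solution in C; the driver hooks are stubs, a counter stands in for ipintrq, and the simulation loop in main stands in for real interrupt dispatch:

    #include <stdio.h>
    #include <stdbool.h>

    #define IPINTRQ_MAX 4          /* tiny queue to make the demo visible */

    static int ipintrq_len;        /* packets queued for the IP soft intr */
    static bool rx_intr_enabled = true;
    static int hw_pending = 10;    /* packets sitting in the device buffer */

    /* Stub driver hooks; a real driver would touch device registers. */
    static void hw_disable_rx_intr(void) { puts("rx interrupts off"); }
    static void hw_enable_rx_intr(void)  { puts("rx interrupts on");  }
    static bool hw_next_packet(void) {
        if (hw_pending == 0) return false;
        hw_pending--;
        return true;
    }

    static void rx_hard_intr(void) {
        while (ipintrq_len < IPINTRQ_MAX && hw_next_packet())
            ipintrq_len++;                  /* append to ipintrq (elided) */
        if (ipintrq_len == IPINTRQ_MAX) {   /* full: new work would only  */
            hw_disable_rx_intr();           /* starve old work, so stop   */
            rx_intr_enabled = false;        /* taking receive interrupts  */
        }
    }

    static void ip_softintr(void) {
        while (ipintrq_len > 0)
            ipintrq_len--;                  /* deliver one packet (elided) */
        if (!rx_intr_enabled) {
            hw_enable_rx_intr();            /* queue drained: resume       */
            rx_intr_enabled = true;         /* taking receive interrupts   */
        }
    }

    int main(void) {
        while (hw_pending > 0 || ipintrq_len > 0) {
            if (rx_intr_enabled)
                rx_hard_intr();
            ip_softintr();  /* gets CPU time because rx intrs are masked */
        }
        return 0;
    }

Masking the interrupt is what gives the soft interrupt (the partially-completed work) priority over new arrivals.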
Pitfall: Multiple Interacting Schedulers.
- Suppose you want your emacs to have priority over
everything else. Give it high CPU priority.
Does that mean nothing else will run if emacs wants to run?
- No: Disk scheduler might not know to favor emacs's disk I/Os.
Typical UNIX disk scheduler favors disk efficiency, not process
prio.
- Suppose emacs needs more memory. Other processes have dirty
pages; emacs must wait. Disk scheduler doesn't know these other
processes' writes are high prio.
Pitfall: Server Processes
- Suppose emacs uses X windows to display.
The X server must serve requests from many clients.
- Does it know that emacs's requests should be given priority?
- Does the OS know to raise X's priority when it is serving emacs?
- Similarly for DNS and NFS. Does the network know to give
emacs's NFS requests priority?