Introduction

In this lab, you will implement Paxos, the distributed agreement protocol. Make sure you read the paper "Paxos Made Simple" and attend the corresponding lecture. Paxos is an extremely useful primitive in distributed systems for achieving automatic machine failover while preserving consistency guarantees. In theory, one can use Paxos to reach agreement on anything. In practice, the most important agreement in a system is on the sequence of membership changes (i.e. view changes). Once the members of a system can agree on a unique history of view changes, agreeing on other matters (such as the primary) and ensuring consistency across view changes is straightforward. We call the part of the system that uses Paxos to reach agreement on view changes the configuration service. The configuration service you build in this lab will be used in the next lab to build a replicated state machine that turns the lock service into a fault-tolerant service.

Getting Started

Begin by initializing your new lab directory with your implementation from the previous lab, then update it with the new Lab 5 infrastructure code from http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-lab5.tgz.

% wget -nc http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-lab5.tgz
% rsync -av l4/* l5
% tar xzvf yfs-lab5.tgz

This will add the new files paxos_protocol.h, paxos.{cc,h}, log.{cc,h}, config_smain.cc, and rsm_tester.pl to your l5/ directory, and overwrite the Makefile from your previous lab, so you may have to redo any changes you made to it.

In the files paxos.{cc,h}, you will find a sketch implementation of the paxos class, which uses the Paxos protocol to agree on view changes. The file paxos_protocol.h defines the suggested RPC protocol between instances of Paxos running on different replicas, including structures for arguments and return types, and marshalling code for those structures.

The files log.{cc,h} provide a full implementation of a log class, which your paxos class should use to log important Paxos events to disk. That way, if a node fails and later re-joins, it has some memory of past views of the system. Please do not make any changes to this class, as we will use our own original versions of these files during testing.

config_smain.cc is a wrapper program used to start a configuration server (an instance of the paxos class). The lab tester rsm_tester.pl will automatically start several configuration servers, kill and restart some of them, and check that you have implemented the view change protocol correctly.

Understanding how Paxos is used for view changes

The paxos class performs view changes among the set of configuration servers using the Paxos protocol. The first view of the system is specified manually when each configuration server is started (as shown in the arguments of the paxos constructor). Subsequent view changes rely on Paxos to agree on a unique next view to replace the current view.

When the system starts from scratch, the first node (as specified in the arguments of the paxos constructor) creates view 1 containing only itself, i.e. view_1={1}. When node 2 joins after the first node, it starts out in view 1 with the first node (as specified in the arguments of the paxos constructor) as the only member. It will use Paxos to propose to the nodes in view_1={1} a new view_2 containing nodes 1 and 2. When Paxos succeeds, view_2 is formed, i.e. view_2={1,2}. When node 3 joins, it also starts with view_1={1} but attempts to propose view_2={1,3}. It will be informed by node 1 that view_2={1,2} already exists, and will then attempt to propose to the nodes in view 2 a new view_3={1,2,3}. And so on.

The paxos class will also initiate a view change when it discovers that some configuration server in the current view is not responding. In particular, the node with the smallest id periodically sends heartbeat RPCs to all others (and all other servers periodically send heartbeats to the node with the smallest id). If a heartbeat RPC times out, paxos calls the start_paxos() method to kick the Paxos protocol into action. Because each node independently decides whether it should run Paxos, there may be several instances of Paxos running simultaneously; Paxos sorts that out correctly.
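
For illustration, the failure-detection loop might look roughly like the sketch below. This is not the skeleton's code: the heartbeat_interval constant, the smallest_id() and send_heartbeat() helpers, and the member names views, vid_h, and me are assumptions; only start_paxos() is named by this lab.

// Illustrative sketch of failure detection inside the paxos class; all helper
// names other than start_paxos() are assumptions, not part of the skeleton.
void paxos::heartbeater()
{
  while (1) {
    sleep(heartbeat_interval);                        // assumed period, e.g. a few seconds
    std::vector<std::string> view = views[vid_h];     // members of the current view
    if (me == smallest_id(view)) {
      // the smallest-id node pings every other member
      for (unsigned i = 0; i < view.size(); i++) {
        if (view[i] != me && !send_heartbeat(view[i])) {
          start_paxos();                              // a member looks dead: propose a new view
          break;
        }
      }
    } else if (!send_heartbeat(smallest_id(view))) {
      // every other member pings only the smallest-id node
      start_paxos();
    }
  }
}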

The paxos class keeps track of whether the current view is stable or not (using the paxos::stable variable). If the current view is stable, there are no on-going Paxos view change attempts by this node or others. When the current view is unstable, the node is either initiating the Paxos protocol or participating in a Paxos run initiated by another node.

The Paxos protocol is run by the manager thread inside the paxos class. The manager thread operates in a loop, waiting to run the Paxos protocol. In simple pseudocode:

paxos::manager() { 
   while (1) { 
      if (stable) 
         pthread_cond_wait(manager) 

      // run Paxos; if it succeeds, switch to the new view and become stable
   } 
}

This thread is kicked into action from the start_paxos() method or from the paxos constructor, to initiate Paxos when the node joins or detects the failure of other nodes.
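
A concrete (but illustrative) C++ version of this loop is sketched below; the mutex and condition-variable member names (pxs_mutex, manager_cond) and the run_paxos() helper are assumptions, not the skeleton's actual names.

// Sketch of the manager loop and of how start_paxos() wakes it up.
void paxos::manager()
{
  while (1) {
    pthread_mutex_lock(&pxs_mutex);
    while (stable)
      pthread_cond_wait(&manager_cond, &pxs_mutex);   // sleep until someone starts Paxos
    pthread_mutex_unlock(&pxs_mutex);

    // run phases 1-3; on success the decide step installs the new view and
    // sets stable back to true, otherwise we loop and retry after a delay
    run_paxos();                                      // hypothetical helper for phases 1-3
  }
}

void paxos::start_paxos()    // called by the heartbeat code or the constructor
{
  pthread_mutex_lock(&pxs_mutex);
  stable = false;
  pthread_cond_signal(&manager_cond);
  pthread_mutex_unlock(&pxs_mutex);
}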

Each configuration server logs important Paxos events as well as a complete history of all views on disk. At any time a node can reboot and when it re-joins, it may be many views behind. Unless the node brings itself up-to-date on the current view, it won't be able to participate in Paxos. By remembering all views, the other nodes can bring this re-joined node up to date.

The Paxos Protocol

Now let's examine the Paxos protocol itself, which you need to implement. Below is the pseudocode for Paxos. The paxos skeleton class and protocol contain member variables, RPCs, and RPC handlers corresponding to this code.

state:
  n_a, v_a: highest proposal # and its corresponding value this node has accepted
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # the node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: a map of past view numbers to values
  stable: "false" when running Paxos, "true" when agreement has been reached for the current view 

on each view change, initialize state 
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = () // empty list

when a node initiates Paxos (upon join or detecting failures), 
  stable = false
  proceed to Paxos Phase 1

Paxos Phase 1
  a node (maybe more than one) decides to be leader (need not be in current view):
    my_n = max(n_h, my_n)+1, append node ID  // unique proposal number
    sends prepare(vid_h+1, my_n) to all nodes in {views[vid_h], initial contact node, itself}

  if node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      stable = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets reject():
    delay and restart paxos
  else if leader gets prepareres from majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = non-empty value v_i corresponding to highest n_i received
    else leader gets to choose a value:
      v = set of live nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart paxos

  if node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return acceptres()
    else:
      return reject()

Paxos Phase 3
  if leader gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart paxos
  else if leader gets acceptres from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart paxos

  if node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      views[vid] = v
      vid_h = vid
      view change
      stable = true

Note that there are two types of numbers: view numbers and proposal numbers. For a given view, potentially many nodes can make proposals for a new view, and each of these proposals has a unique proposal number. When comparing different proposals, the highest proposal number wins. To ensure that each proposal number is unique, each proposal consists of a number and the node's identifier. We provide you with a struct prop_t in paxos_protocol.h that you should use for proposal numbers; we also provide the > and >= operators for it.
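
For illustration only, a comparison along the following lines makes proposal numbers totally ordered and unique per node. The real prop_t and its operators are already defined in paxos_protocol.h; the field names used here (n, m) are assumptions.

#include <string>

// Illustrative version of a (counter, node-id) proposal number.
struct prop_t {
  unsigned n;      // monotonically increasing proposal counter
  std::string m;   // node identifier; breaks ties so every proposal number is unique
};

inline bool operator>(const prop_t &a, const prop_t &b) {
  return a.n > b.n || (a.n == b.n && a.m > b.m);
}

inline bool operator>=(const prop_t &a, const prop_t &b) {
  return a.n > b.n || (a.n == b.n && a.m >= b.m);
}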

At any time a node can decide it wants to start a view change and kick Paxos off. If nothing goes wrong, and there are no concurrent proposals for the next view, Paxos clearly reaches agreement. However, many nodes can become leaders at the same time, creating conflicts that prevent an agreement from being reached. Thus, we would like to ensure, with good probability, that there is only one leader at a time. To achieve this, each leader delays a random amount of time before beginning phase 1; furthermore, if a leader learns of another instance of Paxos started with a higher proposal number for the same view, it delays for a random amount of time and then attempts to lead another proposal. In this way, the system will eventually have only one active leader with high probability. The skeleton paxos class already provides a dodelay() method for this purpose.

Each replica must log certain changes to its Paxos state (in particular the n_a, v_a, and n_h fields), as well as log every view change. The provided log class does this for you; please use it without modification, as the test program depends on its output being in a particular format.

Your Job

The measure of success for this lab is to pass tests 0-7 of rsm_tester.pl (the remaining tests are reserved for the next lab). The tester starts 3 or 4 configuration servers, kills some of them, restarts some of them, and checks that all servers indeed go through a unique sequence of view changes by examining their on-disk logs.
% ./rsm_tester.pl 0 1 2 3 4 5 6 7 
test1...
...
test2...
...
test3...
...
test4...
...
test5...
...
test6...
...
test7...
...
test8...
...
tests done OK

Important: If rsm_tester.pl fails in the middle of a test, the remaining config_server processes are not killed and the log files are not cleaned up (so you can debug the cause). Make sure you run 'killall config_server; rm -f *.log' to clean up the lingering processes and log files before running rsm_tester.pl again.

Detailed Guidance

We guide you through a series of steps to get this lab working incrementally.

Step One: Implement Paxos

Implement the Paxos protocol listed in the pseudocode and log view changes to disk, but do not worry about failures yet. At the end of this step, you only need to be able to pass 'rsm_tester.pl 0'. This test starts three configuration servers one after another and checks that all servers go through the same three views.

First, fill in the paxos constructor so that it starts with the correct initial view (obtained either from the on-disk log or from the default first view). When starting from scratch (with blank on-disk logs), the first node should initialize view 1 to contain only itself (without going through Paxos) and log view 1 to disk. When the second node starts, it also initializes view 1 to contain only the first node (as specified in the constructor's arguments) and logs view 1 to disk. However, since the second node does not find itself in view 1, it will kick the Paxos thread into action to propose view 2. And so on for the third node.
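
A sketch of this start-up logic is shown below. It is not the skeleton's real code: views, vid_h, first, me, the log object l, the isamember() helper, and the exact logview() signature are all assumptions; adapt them to the actual skeleton.

// Sketch of the end of the paxos constructor, after the on-disk log (if any)
// has been replayed. All names here are assumptions.
if (views.empty()) {                  // nothing recovered from disk: start from scratch
  std::vector<std::string> v1;
  v1.push_back(first);                // the default first view contains only the first node
  views[1] = v1;
  vid_h = 1;
  l->logview(1, v1);                  // remember view 1 across crashes
}
if (!isamember(me, views[vid_h])) {
  // this node is not in the current view, so wake the manager thread
  // to propose the next view (view 2 for the second node, and so on)
  stable = false;
  pthread_cond_signal(&manager_cond);
}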

Next, fill in the Paxos implementation. Try to follow the pseudocode provided above, and use the RPC protocol we provide in paxos_protocol.h. Note that although the pseudocode shows different types of responses to each kind of RPC, our protocol combines these responses into a single return structure. For example, the prepareres struct can act as a prepareres, an oldview, or a reject message, depending on the situation.
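
For example, the prepare handler can signal all three outcomes through flags in that single reply structure. The sketch below follows the Phase 1 pseudocode; the handler signature, the reply field names (oldinstance, accept), and the print_members() helper are assumptions, so check paxos_protocol.h for the real definitions.

// Illustrative prepare handler; reply field and helper names are assumptions.
paxos_protocol::status
paxos::preparereq(unsigned vid, prop_t n, paxos_protocol::prepareres &r)
{
  if (vid <= vid_h) {                      // caller is behind: act as an oldview reply
    r.oldinstance = true;
    r.v_a = print_members(views[vid]);     // hypothetical helper that serializes a view
  } else if (n > n_h) {                    // highest proposal seen so far: promise it
    n_h = n;
    stable = false;
    r.oldinstance = false;
    r.accept = true;                       // act as a prepareres reply
    r.n_a = n_a;
    r.v_a = v_a;
  } else {
    r.oldinstance = false;
    r.accept = false;                      // act as a reject reply
  }
  return paxos_protocol::OK;
}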

Whenever Paxos has successfully agreed on a new view, log the new view to disk. We have provided the logview method (in log.cc) for you to do this.

The log class writes its content to a file in the current directory called paxos-[port].log. Note that rsm_tester.pl will remove these logs when a test finishes successfully, unless you comment out the second line of the cleanup() subroutine in the script. rsm_tester.pl also redirects the stdout and stderr of your configuration server to config_server-[arg1]-[arg2].log. You might find these logs useful for debugging.

Upon completing this step, you should be able to pass 'rsm_tester.pl 0'.

Step Two: Simple failures

Next, you should handle the simple failure case of a single configuration server failing. Recall that when dealing with failed nodes, paxos calls start_paxos() to kick the Paxos thread into action.

Once this works, you should be able to run 'rsm_tester.pl 0 1 2'.

Step Three: Logging Paxos state

In addition to logging new views, modify your Paxos implementation to use the log class to log changes to n_h, n_a, and v_a when they are updated. Convince yourself why these three values must be logged to disk if we want to restart a previously crashed node correctly. We have provided the code to write and read logs in log.cc (see log::loghigh() and log::logprop()), so you just have to make sure to call the appropriate methods at the right times.
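
Concretely, the logging calls belong right where the pseudocode mutates the corresponding state. The sketch below assumes signatures roughly like l->loghigh(n_h), l->logprop(n_a, v_a), and l->logview(vid, v); check log.cc for the real ones.

// Where to log, following the pseudocode (signatures are assumptions; see log.cc):

// in the prepare handler, right after "n_h = n":
l->loghigh(n_h);

// in the accept handler, right after "n_a = n; v_a = v":
l->logprop(n_a, v_a);

// in the decide handler, right after "views[vid] = v; vid_h = vid":
l->logview(vid, v);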

Now you can run tests that involve restarting a node after it fails. In particular, you should be able to pass 'rsm_tester.pl 3 4'. In test 4, rsm_tester.pl starts three servers, kills the third server (the remaining two nodes should be able to agree on a new view), kills the second server (the remaining node tries to run Paxos, but cannot succeed since no majority of nodes in the current view is present), restarts the third server (which does not help reach agreement, since the third server is not in the current view), kills the third server again, restarts the second server (now agreement can be reached), and finally restarts the third server.

Step Four: Complicated failures

Finally, you need to verify that your code handles some of the tricky corner cases that Paxos is supposed to deal with. Our test scripts do not test all possible corner cases, so you could still have a buggy Paxos implementation after this step, but you will have a good feel for the protocol.

In paxos.cc, we provide two methods: breakpoint1() and breakpoint2(). Your Paxos code must call breakpoint1() just after completing Phase 1, but before starting Phase 2. Similarly, it must call breakpoint2() between Phases 2 and 3. The tester sends SIGUSR1 or SIGUSR2 to a configuration server to cause it to exit at the respective breakpoint. (You can try this manually on the command line with a command like 'kill -USR1 [pid]', but rsm_tester.pl also exercises these cases automatically.)
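
In the proposer's code path the two calls sit between the phases, roughly as sketched below; prepare_majority(), accept_majority(), and decide() are hypothetical stand-ins for your own phase code, while breakpoint1() and breakpoint2() come from paxos.cc.

// Sketch of where the breakpoint calls belong in one proposer round.
if (prepare_majority(nextvid, accepts, v)) {    // Phase 1: gather prepare responses
  breakpoint1();                                // tester may kill this node here (SIGUSR1)
  if (accept_majority(nextvid, accepts, v)) {   // Phase 2: gather accept responses
    breakpoint2();                              // tester may kill this node here (SIGUSR2)
    decide(nextvid, accepts, v);                // Phase 3: announce the decision to all
  }
}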

By now, your code should reliably pass all the required tests, i.e. 'rsm_tester.pl 0 1 2 3 4 5 6 7'.

Handin procedure

Tar your l5/ files with the following commands:

cd ~/yfs/l5
make clean
cd ..
tar czvf yfs-lab5.tgz l5/
Go to the submit site to upload yfs-lab5.tgz.
For questions or comments, email dss-staff