Introduction

In this series of labs, you will implement a fully functional distributed file server with the architecture described below.

The architecture involves multiple file clients, each labeled yfs and running on a different machine. Because we use the FUSE user-level file system toolkit, yfs appears to local applications as a mounted file system. Instead of storing file system data locally, all yfs clients store their data on a single extent server, which allows data to be shared across multiple yfs clients.

This architecture is appealing because, in principle, it shouldn't slow down much as you add client hosts. Most of the complexity is in the yfs program, so new clients use their own CPUs rather than competing with existing clients for the server's CPU. The extent server is shared, but hopefully it's simple and fast enough to handle a large number of clients. In contrast, a conventional NFS server is quite complex (it contains a complete file system implementation), so it is more likely to become a bottleneck when shared by many NFS clients. The only fly in the ointment is that the yfs clients need a locking service to avoid inconsistent updates.

In this lab, you'll implement the lock service. The core logic of the lock service is quite simple: it consists of two modules, the lock client and the lock server, which communicate via RPCs. A client requests a specific lock from the lock server by sending an acquire request. The lock server grants each lock to at most one client at a time. When a client is done with a granted lock, it sends a release request to the server so that the server can grant the lock to another client waiting to acquire it.

In addition to implementing the lock service, you'll also augment the provided RPC library to ensure at-most-once execution by eliminating duplicate RPC requests. Duplicates arise because the RPC system must re-transmit lost RPCs over lossy network connections, and such re-transmissions often lead to duplicate RPC delivery when the original request turns out not to be lost after all, or when the server reboots. Duplicate RPC delivery, when not handled properly, can violate application semantics. Here's an example of duplicate RPCs causing incorrect lock server behavior: a client sends an acquire request for lock x; the server grants the lock; the client releases it with a release request; a duplicate of the original acquire request then arrives at the server, and the server grants the lock again, but the client will never release it, since the second acquire was just a duplicate. Such behavior is clearly incorrect.

Getting Started

For this lab, you should be able to use any Linux/BSD/MacOS machine. First, create a directory for your labs (in our example, we'll call it "yfs"), then download the lab 1 code skeleton from http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-lab1.tgz and the RPC library from http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-rpc.tgz:

% mkdir yfs
% cd yfs
% wget -nc http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-lab1.tgz
% tar xzvf yfs-lab1.tgz
% wget -nc http://www.news.cs.nyu.edu/~jinyang/fa08/labs/yfs-rpc.tgz
% tar xzvf yfs-rpc.tgz

In the directory l1/, we provide you with a skeleton RPC-based lock server, a lock client interface, a sample application that uses the lock client interface, and a tester. Now compile and start up the lock server, giving it a port number on which to listen for RPC requests. You'll need to choose a port number that other programs aren't using. For example:

% cd l1
% make
% ./lock_server 3772
Now open a second terminal on the same machine and run lock_demo, giving it the port number on which the server is listening:
% cd ~/yfs/l1
% ./lock_demo 3772
stat request from clt 16283 for lock a
stat returned 0
% 

lock_demo asks the server for the number of times a given lock has been acquired, using the stat RPC that we have provided. In the skeleton code, this always returns 0. The lock client skeleton does not yet do anything for the acquire and release operations, and the lock server does not implement any form of lock granting or releasing. Your job in this lab is to fill in the client and server functions and the RPC protocol between the two processes.

Your Job

Your first job is to implement a correct lock server assuming a perfect underlying network. In the context of a lock service, correctness means obeying this invariant: at any point in time, at most one client holds a lock of a given name.

In addition to being correct, we also require that the RPC handlers at lock_server run to completion without blocking. That is, a server thread should not block on condition variables or remote RPCs. Of course, a thread may wait to acquire a mutex, as long as we are certain that the mutex will not be held by another thread across an RPC, and once it has acquired the mutex it should run to completion. This requirement ensures your lock server can be replicated correctly in a later lab.
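To make the non-blocking requirement concrete, here is a minimal sketch of one possible design; all names here are hypothetical, not the lab's actual lock_client/lock_server API. Instead of blocking inside the handler until a held lock becomes free, acquire returns immediately with either OK or RETRY, and the caller is expected to try again. The only thing a handler waits on is a mutex that is never held across an RPC:

```cpp
#include <map>
#include <string>
#include <pthread.h>

// Hypothetical return codes; the real lab defines its own protocol values.
enum lock_status { OK, RETRY };

class toy_lock_server {
  pthread_mutex_t m_;                 // protects held_; never held across an RPC
  std::map<std::string, bool> held_;  // lock name -> currently granted?
public:
  toy_lock_server() { pthread_mutex_init(&m_, NULL); }

  // Never blocks on a condition variable: if the lock is busy, the
  // handler returns RETRY immediately and runs to completion.
  lock_status acquire(const std::string &name) {
    pthread_mutex_lock(&m_);
    lock_status r;
    if (held_[name])
      r = RETRY;                      // busy: caller must retry later
    else {
      held_[name] = true;             // grant the lock to this caller
      r = OK;
    }
    pthread_mutex_unlock(&m_);
    return r;
  }

  lock_status release(const std::string &name) {
    pthread_mutex_lock(&m_);
    held_[name] = false;              // lock becomes available again
    pthread_mutex_unlock(&m_);
    return OK;
  }
};
```

This is only one way to satisfy the requirement; whatever design you choose, the key property is that every handler invocation finishes without waiting on a condition variable or a remote RPC.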

We will use the program lock_tester to check the correctness invariant, i.e. whether the server grants each lock to just one client at any given time, under a variety of conditions. You run lock_tester with the same arguments as lock_demo. A successful run of lock_tester (with a correct lock server) will look like this:

% ./lock_tester 3772
simple lock client
acquire a release a acquire a release a
acquire a acquire b release b release a
test2: client 0 acquire a release a
test2: client 2 acquire a release a
. . .
./lock_tester: passed all tests successfully
If your lock server isn't correct, lock_tester will print an error message. For example, if lock_tester complains "error: server granted a twice!", the problem is probably that lock_tester sent two simultaneous requests for the same lock, and the server granted the lock twice (once for each request). A correct server would have sent one grant, waited for a release, and only then sent a second grant.

Your second job is to augment the RPC library in directory yfs/rpc to guarantee at-most-once execution. We simulate a lossy network on a local machine by setting the environment variable RPC_LOSSY. If you can pass both the RPC system tester and the lock_tester, you are done. Here's a successful run of both testers:

% ./rpctest
simple test
. . .
rpctest OK

% killall lock_server
% export RPC_LOSSY=5
% ./lock_server 3772 &
% ./lock_tester 3772
simple lock client
acquire a release a acquire a release a
. . .
./lock_tester: passed all tests successfully

For this lab, your lock server and RPC augmentation must pass both rpctest and lock_tester; you should make sure they pass several times in a row to catch rare bugs. You should only modify the files rpc.{cc,h}, lock_client.{cc,h}, lock_server.{cc,h} and lock_smain.cc. We will test your code with our own copy of the rest of the source files and testers. You are free to add new files to the directory as long as the Makefile compiles them appropriately, but you should not need to.

For this lab, you will not have to worry about server failures or client failures. You also need not be concerned about security issues, such as malicious clients releasing locks that they don't hold.

Detailed Guidance

In principle, you can implement whatever design you like as long as it satisfies all the requirements in the "Your Job" section and passes the testers. To be nice, we provide detailed guidance and tips on a recommended implementation plan. You do not have to follow our recommendations, although doing so makes your life easier and allows maximal design/code re-use in later labs. Since this is your first lab, you should also read the general programming tips on the lab overview page.

Step One: implement the lock_server assuming a perfect network

First, you should get the lock_server running correctly without worrying about duplicate RPCs under lossy networks.

Step two: Implement at-most-once delivery in RPC

After your lock_server has passed lock_tester under a perfect network, enable RPC_LOSSY by typing "export RPC_LOSSY=5", restart your lock_server, and run lock_tester again. If you implemented lock_server in the simple way described previously, you will see lock_tester fail (or hang indefinitely). Try to understand exactly why lock_tester fails when re-transmissions cause duplicate RPC delivery.

Read the RPC source code in rpc/rpc.{cc,h} and try to grasp the overall structure of the RPC library on your own before reading the hints below.

The rpcc class handles the client side of an RPC. At its core lies the rpcc::call1 function, which accepts a marshalled RPC request for transmission to the RPC server. We can see that call1 attaches additional RPC fields to each marshalled request:

// add RPC fields before req
m1 << clt_nonce << svr_nonce << proc << myxid << xid_rep_window.front() << req.str();
What's the purpose of each of these fields? (Hint: most of them will help you implement at-most-once delivery.) After call1 has finished preparing the final RPC request, it sits in a "while(1)" loop that repeatedly updates the timeout value for the next retransmission and waits for the corresponding RPC reply or timeout to happen.

The rpcs class handles the server side of an RPC. It creates a separate thread (executing rpcs::loop) that continuously tries to read RPC requests from the underlying channel (e.g. a TCP connection). Once a request is read successfully, it spawns a new thread to dispatch the request to the registered RPC handler. The function rpcs::dispatch implements the dispatch logic: it extracts various RPC fields from the request, including the RPC procedure number, which is used to find the corresponding handler. These fields also provide sufficient information for you to eliminate all duplicate requests at the server.
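The dispatch idea itself is simple, and can be sketched as a lookup table from procedure numbers to handlers. This toy version (all names hypothetical) omits the unmarshalling of arguments and the per-request threads that the real rpcs class performs:

```cpp
#include <map>
#include <string>

// A handler takes a (notionally marshalled) argument and fills in a reply.
typedef int (*handler_fn)(const std::string &arg, std::string &rep);

class toy_dispatcher {
  std::map<int, handler_fn> handlers_;   // proc number -> registered handler
public:
  // Register a handler under a procedure number (like rpcs's reg calls).
  void reg(int proc, handler_fn h) { handlers_[proc] = h; }

  // Look up the handler for an incoming request and invoke it.
  // Returns false if no handler is registered for proc.
  bool dispatch(int proc, const std::string &arg, std::string &rep) {
    std::map<int, handler_fn>::iterator it = handlers_.find(proc);
    if (it == handlers_.end())
      return false;                      // unknown procedure number
    it->second(arg, rep);
    return true;
  }
};

// Example handler (hypothetical): echoes its argument back as the reply.
int echo_handler(const std::string &arg, std::string &rep) {
  rep = arg;
  return 0;
}
```

The real dispatch path also consults the duplicate-detection state (described next) before invoking a handler, so that a retransmitted request is answered from a remembered reply instead of being executed again.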

How do we ensure at-most-once delivery? A strawman approach is to make the server remember every unique RPC it has ever received. Each unique RPC is identified by both its xid (unique across a client instance) and its clt_nonce (unique across all client instances). In addition to the RPC ids, the server must also remember the actual values of the corresponding replies so that it can re-send a (potentially lost) reply upon receiving a duplicate request, without actually executing the RPC handler again. This strawman guarantees at-most-once execution, but is not ideal since the memory holding the RPC ids and replies can grow indefinitely. A better alternative is to have the server keep a sliding window of remembered RPCs. Such an approach requires the client to generate xids in a strict sequence, i.e. 0, 1, 2, 3... When can the server safely forget about a received RPC and its response, i.e. slide the window forward?
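The sliding-window bookkeeping for a single client might look like the following toy sketch. The names are hypothetical (the real logic belongs in rpcs::checkduplicate_and_update and rpcs::add_reply), and it assumes xids start at 1 and that each request carries xid_rep, the highest xid for which the client has already received every reply up to and including it (0 meaning none yet):

```cpp
#include <map>
#include <string>
#include <utility>

// Possible outcomes when a request arrives at the server.
enum rpcstate { NEW, INPROGRESS, DONE, FORGOTTEN };

class atmostonce_window {
  // xid -> (handler finished?, remembered reply value)
  std::map<unsigned int, std::pair<bool, std::string> > replies_;
  unsigned int forgotten_below_;  // all xids below this have been trimmed
public:
  atmostonce_window() : forgotten_below_(1) {}

  // Called when a request (xid, xid_rep) arrives at the server.
  rpcstate check(unsigned int xid, unsigned int xid_rep, std::string &rep) {
    // Slide the window: the client will never retransmit any xid <= xid_rep,
    // so those remembered replies can be safely forgotten.
    replies_.erase(replies_.begin(), replies_.upper_bound(xid_rep));
    if (xid_rep + 1 > forgotten_below_)
      forgotten_below_ = xid_rep + 1;
    if (xid < forgotten_below_)
      return FORGOTTEN;               // stale retransmission; ignore it
    std::map<unsigned int, std::pair<bool, std::string> >::iterator it =
        replies_.find(xid);
    if (it == replies_.end()) {       // first time we see this xid
      replies_[xid] = std::make_pair(false, std::string());
      return NEW;                     // caller should run the handler
    }
    if (!it->second.first)
      return INPROGRESS;              // duplicate; handler still running
    rep = it->second.second;          // duplicate; re-send remembered reply
    return DONE;
  }

  // Called when the handler finishes, to remember its reply value.
  void add_reply(unsigned int xid, const std::string &rep) {
    replies_[xid] = std::make_pair(true, rep);
  }
};
```

The real implementation must keep one such window per clt_nonce, since xids are only sequential within a single client instance.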

Once you have figured out the basic design for at-most-once delivery, go ahead and implement it in rpc.cc (rpc.cc is the only file you should be modifying). Hints: you need to add code in three places: the rpcc::rpcc constructor, to create a thread that handles retransmissions; rpcs::add_reply, to remember the RPC reply values; and rpcs::checkduplicate_and_update, to eliminate duplicate xids and update the information that lets the server safely forget about certain received RPCs.

After you are done with step two, first test your RPC implementation with RPC_LOSSY set to zero ("export RPC_LOSSY=0") and make sure ./rpctest passes all tests. Then run rpctest again after enabling loss ("export RPC_LOSSY=5"). Once your RPC implementation passes all these tests, test your lock server again in a lossy environment by restarting your lock_server and lock_tester after setting "RPC_LOSSY=5".

Handin procedure

Tar your l1/ and rpc/ directories together as follows:

cd ~/yfs/l1
make clean
cd ..
tar czvf yfs-lab1.tgz rpc/ l1/ 
Then go to the submit site to upload yfs-lab1.tgz.
For questions or comments, email dss-staff@cs.nyu.edu.