Introduction

In this lab you will build a server and client that cache locks at the client, reducing the load on the server and improving client performance. For example, when client 1 asks for lock 42 repeatedly and no other client wants the lock, then all acquire and releases can be performed on client 1 without having to contact the server.

The challenge in the lab is the protocol between the clients and the server. For example, when client 2 acquires a lock that client 1 has cached, the server must revoke that lock from client 1 by sending a revoke RPC to client 1. The server can give client 2 the lock only after client 1 has released the lock, which may be a long time after sending the revoke (e.g., if a thread on client 1 holds the lock for a long period). The protocol is is further complicated by the fact that packets may be lost, duplicated, and delivered out of order. The at-most-once RPC server you implemented in Lab 1 will take care of most of these problems, but RPCs may still be delivered out of order. We will test your lock server code with RPC_LOSSY set to 5, like in Lab 1.

Another reason the lock server must work in these lossy conditions is because in later labs we will make the lock service fault tolerant to failures of individual lock servers, and then these conditions can show up. Of course, in practice, similar conditions could happen with the extent server, but we won't make the extent server fault tolerant in the labs and will assume the network works well; thus, we will ignore these problems for the extent server and will not test for it with the extent server.

To make the lock service fault tolerant, we will want to put a constraint on the lock server operations in later labs. This constraint will have implications for the protocol between the lock clients and lock server. The constraint is that handlers on the server should run to completion without blocking. That is, a server thread should not block on condition variables or remote RPCs. Of course, the thread can wait to take out locks as long as it can be sure that the lock will never be held by another thread across an RPC, and once it has acquired the lock it should run to completion.

To avoid having to re-implement the lock server again for these later labs, your implementation of the lock server for this lab should adhere to this non-blocking constraint. We won't test in this lab whether your implementation adheres to the constraints, but Lab 7 will.

Your server will be a success if it manages to operate out of its local lock cache when reading/writing files and directories that other hosts aren't looking at, but maintains correctness when the same files and directories are concurrently read and updated on multiple hosts. We will test with both RPC_LOSSY set to 0 and RPC_LOSSY set to 5.

Getting Started

Since you are building on the past labs, ensure the code in your Lab 4 directory passes all tests for Labs 1, 2, 3 and 4 before starting in on this lab.

Begin by merging your solution to lab 4 with the new code for lab 5:

% cd ~/ds-class/lab
% git commit -am 'my solution to lab4'
Created commit ...
% git pull
remote: Generating pack...
...
% git checkout -b lab5 origin/lab5
Branch lab5 set up to track remote branch refs/remotes/origin/lab5.
Switched to a new branch "lab5"
% git merge lab4

This will add these new files to your lab directory:

We have also made changes to the following files:

In order to evaluate the caching performance of your caching lock server and clients, the RPC library has a feature that counts unique RPCs arriving at the server. You can set the environment variable RPC_COUNT to N before you launch a server process, and it will print out RPC statistics every N RPCs. For example, in the bash shell you could do:

% export RPC_COUNT=25
% ./lock_server 3772
RPC STATS: 7001:23 7002:2 
...
This means that the RPC with the procedure number 0x7001 (acquire in the original lock_protocol.h file) has been called 23 times, while RPC 0x7002 (release) has been called twice.

Testing Performance

Our measure of performance is the number of acquires that your lock clients send to the lock server. You can use RPC_COUNT, as mentioned above, to print out these stats every N calls.

The workload on which we'll evaluate your lock protocol's performance is generated by test-lab-4-c. It creates two subdirectories and creates/deletes 100 files in each directory, using each directory through only one of the two YFS clients.

Using the lock server of lab 4, you will see that the number of acquires is at least a few thousand. With the caching lock client, you will see that the number of acquires is only a few hundred (i.e., less than a thousand). We are a bit vague in this performance goal because the exact numbers depend a bit on how you use locks in yfs_client. Suffice it to say, the drop in acquires should be significant.

Of course, your system must also remain correct. We will require the code you hand in to pass all of the lab 1 and lab 4 testers as well as getting good performance on the test-lab-4-c tester.

Protocol and Implementation Hints

We strongly recommend you implement a protocol in the style suggested below. You may think there's a simpler protocol, and you're probably right; you'll have to trust us that our protocol makes things easier when we replicate the lock server for fault tolerance.

This protocol has most of the complexity on the client. All the handlers on the server run to completion and threads wait on condition variables on the client when a lock is taken out by another thread (on this client or another client). This allows the server to be replicated using the replicated state machine approach in labs 7 and 8. If you change the protocol, make sure that handlers on the server run to completion.

On the client a lock can be in several states:

In many of the states there may be several threads waiting for the lock, but only one thread per client ever needs to be interacting with the server; once that thread has acquired and released the lock it can wake up other threads, one of which can acquire the lock (unless the lock has been revoked and released back to the server). If you need a way to identify a thread, you can use its thread id (tid), which you can get using pthread_self().

When a client asks for a lock with an acquire RPC, the server grants the lock and responds with OK if the lock is not owned by another client (i.e., the lock is free). If the lock is owned by another client, the server responds with RETRY. At some point later (after another client has released the lock using a release RPC), the server sends the client a retry RPC. The retry RPC informs the client that the lock may be free, and therefore it ought to try another acquire RPC.

Note that RETRY and retry are two different things. On the one hand, RETRY is the value the server returns for a acquire RPC to indicate that the requested lock is not currently available. On the other hand, retry is the RPC that the server sends the client when a previously requested lock becomes available.

Once a client has acquired ownership of a lock, the client caches the lock (i.e., it keeps the lock instead of sending a release RPC to the server when a thread releases the lock on the client). The client can grant the lock to other threads on the same client without interacting with the server. The server will inform the client when it wants a lock back.

The server sends the client a revoke RPC to get the lock back. This request tells the client that it should send the lock back to the server when it releases the lock or right now if no thread on the client is holding the lock.

For your convenience, we have defined a new RPC protocol called rlock_protocol in lock_protocol.h) to use when sending RPCs from the server to the client. This protocol contains definitions for the retry and revoke RPCs.

A good way to implement releasing locks on the client is using a separate releaser thread (as mentioned above, the skeleton code already launches one for you). When receiving a revoke request, the client adds the revoke request to a list and wakes up the releaser thread. The releaser thread will release the lock (i.e., send a release RPC to the server) when the lock becomes free on the client. Using a separate thread is good because it avoids potential distributed deadlocks and ensures that revoke RPCs from the server to the client run to completion on the client.

On the server, handlers shouldn't block either. A good way to implement this on the server is to have revoker and retrier threads that are in charge of sending retry and revoke RPCs, respectively. When a client asks for a lock that is taken by another client, the acquire handler adds a revoke request to a queue and wakes up the revoker thread. When a client releases a lock, the release handler adds a retry request to a list for the retrier (if there are clients who want the lock) and wakes up the retrier thread. This design ensures that the handlers run to completion. Blocking operations are performed by the retrier and revoker threads, and those blocking operations are just RPCs to the client, whose handlers also should run to completion without blocking.

A challenge in the implementation is that retry and revoke RPCs can be out of order with the acquire and release requests. That is, a client may receive a retry request before it has received the response to its acquire request. Similarly, a client may receive a revoke before it has received a response on its acquire request.

A good way to handle these cases is to assign sequence numbers to all requests. That is each request should have a unique client ID (e.g., a random number or the id string) and a sequence number. For an acquire, the client picks the first unused sequence number and supplies that sequence number as an argument to the acquire RPC, along with the client ID. You probably want to send no additional acquires for the same lock to the server until the oustanding one has been completed. The corresponding release (which may be much later because the lock is cached) should probably carry the same sequence number as the last acquire, in case the server needs to tell which acquire goes along with this release. This approach requires the server to remember at most one sequence number per client per lock. You may be able to think of a strategy that doesn't require sequence numbers. We suggest that you use sequence numbers anyway, for two reasons. One is that sequence numbers are easy to reason about, whereas thinking of all possible ways in which reordering might cause problems is harder. The second reason is that you will need sequence numbers anyway when we replicate the lock service to handle lock server failures.

Unless your server in Lab 1 has non-blocking handlers already and uses retry and revoke RPCs, you are probably best off starting from scratch for this lab. Lab 1 didn't require you to write much code, and morphing it into something that is suitable for lab 5 is likely to be more work than doing it right from scratch.

As in the previous labs, remember not to hold pthreads mutexes across remote procedure calls. Holding mutexes across RPCs might seem to work, and it might even pass our tests, but it's a really bad idea! Suppose the lock server does an RPC while holding a mutex. Then the server receives an RPC from a client, and the RPC handler tries to acquire the same mutex. The result is that the client's RPC can't even be processed until the first RPC returns. Not only is this bad for performance, but it can lead to distributed deadlock, which is one of the most tortuous problems you may ever have to debug.

Pthreads mutexes are intended for synchronizing multiple threads' access to shared memory. If you use them correctly, you will only need to hold the mutex for a few microseconds, and this in turn means that you could have a single pthreads mutex protecting the entire state of the lock server without limiting its scalability. This will make your job much simpler.

Step One: Design the Protocol

You should design the protocol and basic system structure on paper (after playing perhaps a little bit around with the code). In particular, carefully think through the different scenarios due to reordered messages. Changing the basic system structure and tracking down errors in your implemented protocol is painful. If you have thought through all scenarios before you start implementing and have the right system structure, you can save yourself much time.

The following questions might help you with your design (they are in no particular order):

Step Two: Lock Client and Server, and Testing with RPC_LOSSY=0

A reasonable first step would be to implement the basic design of your acquire protocol on both the client and the server, including having the server send revoke messages to the holder of a lock if another client requests it. This will involve registering RPC handlers on the client, and devising a way for the server to receive and remember each client's location address (i.e., the id variable in lock_client_cache) and using it to send the client RPCs.

Next you'll probably want to implement the release code path on both the client and the server. Of course, the client should only inform the server of the release if the lock has been revoked. This will also involve having the server send a retry RPC to the next client in line waiting for the lock.

Also make sure you instantiate a lock_server_cache object in lock_smain.cc, and correctly register the RPC handlers.

Once you have your full protocol implemented, you can run it using the lock tester, just as in Lab 1. For now, don't bother testing with loss:

% export RPC_LOSSY=0
% ./lock_server 3772

Then, in another terminal:

% ./lock_tester 3772

Run lock_tester. You should pass all tests and see no timeouts. You can hit Ctrl-C in the server's window to stop it.

Step Three: Testing the Lock Client and Server with RPC_LOSSY=5

Now that it works without loss, you should try testing with RPC_LOSSY=5. Here you may discover problems with reordered RPCs and responses.

% export RPC_LOSSY=5
% ./lock_server 3772

Then, in another terminal:

% export RPC_LOSSY=5
% ./lock_tester 3772

Step Four: Run File System Tests

In the constructor for your yfs_client, you should now instantiate a lock_client_cache object, rather than a lock_client object. You will also have to include lock_client_cache.h. Once you do that, your YFS should just work under all the Lab 4 tests. We will run your code against all 3 tests (a, b, and c) from Lab 4.

You should also compare running your YFS code with the two different lock clients and servers, with RPC count enabled at the lock server. For this reason, it would be helpful to keep your Lab 4 code around and intact, the way it was when you submitted it. As mentioned before, you can turn on RPC statistics using the RPC_COUNT environment variable. Look for a dramatic drop in the number of acquire (0x7001) RPCs between your Lab 4 and Lab 5 code during the test-lab-4-c test.

The file system tests should pass with RPC_LOSSY set as well. You can pass a loss parameter to start.sh and it will enable RPC_LOSSY automatically:

% ./start.sh 5          # sets RPC_LOSSY to 5
If you're having trouble, make sure that, e.g., the Lab 3 tester passes. If it doesn't, then the issues are most likely with YFS under RPC_LOSSY, rather than your caching lock client.

Evaluation Criteria

If your caching lock protocol performs well and your system passes all the tests, you're done. In particular, we will be checking the following:

Bonus Exercises

Here are a few things you can do if you finish the lab early and feel like improving your code. These are not required, and there are no associated bonus points, but some of you may find them interesting. Note that you should only consider extensions to the lab if your code passes all the tests with no errors. If you decide to add extensions and you are concerned that you might have introduced new bugs, you can submit a basic version for grading purposes and keep the extended version separate. The extended version could be in a separate directory or a separate git branch, but make sure that the active branch in the lab directory is the one you want us to test, or otherwise make it clear what we should evaluate.

Collaboration policy

You are welcome (and encouraged!) to discuss design issues on the class mailing list. However, you must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each others' code.

Handin procedure

Prepare a tar file by executing these commands:
% cd ~/ds-class/lab
% ./stop.sh
% rm core*
% make clean
% cd ..
% tar czvf yfs-lab5.tgz lab/
That should produce a file called yfs-lab5.tgz in your ds-class/ directory. Go to submit site to upload yfs-lab5.tgz