Distributed Systems - Fall 2019

Lab 4: Fault-tolerant Key/Value Service

Due: 12/4 at 11:59pm

Introduction

In this lab you will build a fault-tolerant key-value storage service using your Raft library from lab 2. You will build your key-value service as a replicated state machine, consisting of several key-value servers that coordinate their activities through the Raft log. Your key/value service should continue to process client requests as long as a majority of the servers are alive and can communicate, in spite of other failures or network partitions.

Your system will consist of clients and key/value servers, where each key/value server also acts as a Raft peer. Clients send Put(), Append(), and Get() RPCs to key/value servers (called kvraft servers), which then place those calls into the Raft log and execute them in order. A client can send an RPC to any of the kvraft servers, but should retry by sending to a different server if the server is not currently a Raft leader, or if there's a failure. If the operation is committed to the Raft log (and hence applied to the key/value state machine), its result is reported to the client. If the operation failed to commit (for example, if the leader was replaced), the server reports an error, and the client retries with a different server.

This lab has two parts. In part A, you will implement the service without worrying that the Raft log can grow without bound. In part B, you will implement snapshots (Section 7 in the paper), which will allow Raft to garbage collect old log entries. Please submit both parts by the lab deadline.

This lab doesn't require you to write much code, but you will most likely spend a substantial amount of time thinking and staring at debugging logs to figure out why your implementation doesn't work. Debugging will be more challenging than in the Raft lab because there are more components that work asynchronously of each other. Start early.
You should reread the extended Raft paper, in particular Sections 7 and 8. For a wider perspective, have a look at Chubby, Raft Made Live, Spanner, Zookeeper, Harp, Viewstamped Replication, and Bolosky et al.

Getting Started

Do a git pull to get the latest lab software. We supply you with skeleton code and tests in src/kvraft. You will need to modify kvraft/client.go, kvraft/server.go, and perhaps kvraft/common.go.

To get up and running, execute the following commands:

$ cd ~/golabs
$ git pull
...
$ cd src/kvraft
$ GOPATH=$HOME/golabs
$ export GOPATH
$ go test -v
...
$

When you're done, your implementation should pass all the tests in the src/kvraft directory:

$ go test
Test: One client ...
... Passed
Test: concurrent clients ...
... Passed
Test: unreliable ...
... Passed
...
PASS
ok  	kvraft	345.032s

Part A: Key/value service without log compaction

The service supports three RPCs: Put(key, value), Append(key, arg), and Get(key). It maintains a simple database of key/value pairs. Put() replaces the value for a particular key in the database, Append(key, arg) appends arg to key's value, and Get() fetches the current value for a key. An Append to a non-existant key should act like Put.

You will implement the service as a replicated state machine consisting of several kvservers. Your kvraft client code (Clerk in src/kvraft/client.go) should try different kvservers it knows about until one responds positively. As long as a client can contact a kvraft server that is a Raft leader in a majority partition, its operations should eventually succeed.

Your first task is to implement a solution that works when there are no dropped messages, and no failed servers. Your service must ensure that Get(), Put(), and Append return results that are linearizable. That is, completed application calls to the Clerk.Get(), Clerk.Put(), and Clerk.Append() methods in kvraft/client.go must appear to all clients to have affected the service in the same linear order, even in there are failures and leader changes. A Clerk.Get(key) that starts after a completed Clerk.Put(key, …) or Clerk.Append(key, …) should see the value written by the most recent Clerk.Put(key, …) or Clerk.Append(key, …) in the linear order. Completed calls should have exactly-once semantics.

A reasonable plan of attack may be to first fill in the Op struct in server.go with the "value" information that kvraft will use Raft to agree on (remember that Op field names must start with capital letters, since they will be sent through RPC), and then implement the PutAppend() and Get() handlers in server.go. The handlers should enter an Op in the Raft log using Start(), and should reply to the client when that log entry is committed. Note that you cannot execute an operation until the point at which it is committed in the log (i.e., when it arrives on the Raft applyCh).

You have completed this task when you reliably pass the first test in the test suite: "One client". You may also find that you can pass the "concurrent clients" test, depending on how sophisticated your implementation is.

Your kvraft servers should not directly communicate; they should only interact with each other through the Raft log.

After calling Start(), your kvraft servers will need to wait for Raft to complete agreement. Commands that have been agreed upon arrive on the applyCh. You should think carefully about how to arrange your code so that it will keep reading applyCh, while PutAppend() and Get() handlers submit commands to the Raft log using Start(). It is easy to achieve deadlock between the kvserver and its Raft library.
Your solution needs to handle the case in which a leader has called Start() for a client RPC, but loses its leadership before the request is committed to the log. In this case you should arrange for the client to re-send the request to other servers until it finds the new leader. One way to do this is for the server to detect that it has lost leadership, by noticing that a different request has appeared at the index returned by Start(), or that the term reported by Raft.GetState() has changed. If the ex-leader is partitioned by itself, it won't know about new leaders; but any client in the same partition won't be able to talk to a new leader either, so it's OK in this case for the server and client to wait indefinitely until the partition heals.
You will probably have to modify your client Clerk to remember which server turned out to be the leader for the last RPC, and send the next RPC to that server first. This will avoid wasting time searching for the leader on every RPC, which may help you pass some of the tests quickly enough.
A kvraft server should not complete a Get() RPC if it is not part of a majority (so that it does not serve stale data). A simple solution is to enter every Get() (as well as each Put() and Append()) in the Raft log. You don't have to implement the optimization for read-only operations that is described in Section 8.
It's best to add locking relatively early in development because the need to avoid deadlocks sometimes affects overall code design. Check that your code is race-free using go test -race.

In the face of unreliable connections and server failures, a client may send an RPC multiple times until it finds a kvraft server that replies positively. If a leader fails just after committing an entry to the Raft log, the client may believe that the request failed, and re-send it to another leader. Each call to Clerk.Put() or Clerk.Append() should result in just a single execution, so you will have to ensure that the re-send doesn't result in the servers executing the request twice.

Add code to cope with duplicate client requests, including situations where the client sends a request to a kvraft leader in one term, times out waiting for a reply, and re-sends the request to a new leader in another term. The client request should always execute just once. To pass part A, your service should reliably pass all tests through TestPersistPartitionUnreliable().

You will need to uniquely identify client operations to ensure that the key/value service executes each one just once.
Your scheme for duplicate detection should free server memory quickly, for example by having each RPC imply that the client has seen the reply for its previous RPC. It's OK to assume that a client will make only one call into a clerk at a time.

Part B: Key/value service with log compaction

As things stand now with your lab code, a rebooting server replays the complete Raft log in order to restore its state. However, it's not practical for a long-running server to remember the complete Raft log forever. Instead, you'll modify Raft and kvraft to cooperate to save space: from time to time kvraft will persistently store a "snapshot" of its current state, and Raft will discard log entries that precede the snapshot. When a server restarts (or falls far behind the leader and must catch up), the server first installs a snapshot and then replays log entries from after the point at which the snapshot was created. Section 7 of the extended Raft paper outlines the scheme; you will have to design the details.

You should spend some time figuring out what the interface will be between your Raft library and your service so that your Raft library can discard log entries. Think about how your Raft will operate while storing only the tail of the log, and how it will discard old log entries. You should discard them in a way that allows the Go garbage collector to free and re-use the memory; this requires that there be no reachable references (pointers) to the discarded log entries.

The kvraft tester passes maxraftstate to your StartKVServer(). maxraftstate indicates the maximum allowed size of your persistent Raft state in bytes (including the log, but not including snapshots). You should compare maxraftstate to persister.RaftStateSize(). Whenever your key/value server detects that the Raft state size is approaching this threshold, it should save a snapshot, and tell the Raft library that it has snapshotted, so that Raft can discard old log entries.

Your raft.go probably keeps the entire log in a Go slice. Modify it so that it can be given a log index, discard the entries before that index, and continue operating while storing only log entries after that index. Make sure you pass all the Raft tests after making these changes.

Modify your kvraft server so that it detects when the persisted Raft state grows too large, and then saves a snapshot and tells Raft that it can discard old log entries. Save each snapshot with persister.SaveSnapshot() (don't use files).

Modify your Raft leader code to send an InstallSnapshot RPC to a follower when the leader has discarded the log entries the follower needs. When a follower receives an InstallSnapshot RPC, your Raft code will need to send the included snapshot to its kvraft. You can use the applyCh for this purpose — see the UseSnapshot field. A kvraft instance should restore the snapshot from the persister when it re-starts. Your solution is complete when you pass the remaining tests reliably.

The maxraftstate limit applies to the GOB-encoded bytes your Raft passes to persister.SaveRaftState().

Think about when a kvserver should snapshot its state and what should be included in the snapshot. You must store each snapshot in the persister object using SaveSnapshot(), not along with the other Raft state in SaveRaftState(). You can read the latest stored snapshot using ReadSnapshot().
Remember that your kvserver must be able to detect duplicate client requests across checkpoints, so any state you are using to detect them must be included in the snapshots. Remember to capitalize all fields of structures stored in the snapshot.
Make sure you pass TestSnapshotRPC before moving on to the other Snapshot tests.
A common source of bugs is for the Raft state and the snapshot state to be inconsistent with each other, particularly after re-starts or InstallSnapshots. You are allowed to add methods to your Raft for kvserver to call to help handle InstallSnapshot RPCs.

Hand in procedure

Once you have finished both parts, perform all the tests many times (bugs in concurrent programs often only surface for specific runs of the program) to ensure that your program can pass the tests stably.

Hand in your code by pushing to the github repository before the deadline.

$ git commit -am "Finish lab-4"
$ git push

Please post questions on Piazza.