Distributed Storage Systems

Project information

Introduction

You will form a group of 2-3 students for the final project. You will design and build a system of your choice. There are 5 deadlines:

Team list. Send your instructor names of the people in your project group by Feb 18.
Project proposal. The proposal is a one-page description of what your project will be. It should state the problem you are solving, why it is an interesting or useful problem, what software/system you will build, and what the expected results will be. The proposal is not graded; it is there to help you get started and get some feedback from us. It is due on Feb 24. We will return the marked proposals in class on Feb 26. We will hold our first project conference, where we critique each other's projects based on the written proposals, in class on Mar 5.
Draft report. This report should include a draft of your report's abstract, introduction, related work, and design sections. These sections should be in good shape and close to what they will look like in the final report. Be sure that the draft's introduction clearly states what your project's goals are, why those goals are worthwhile, and how you're going to achieve them. In addition to these mature sections, the report should also have an implementation and evaluation plan. Describe how you plan to implement the system (especially the details of how it fits into the OS environment) and what experiments you will run on your final system. You should email us the draft in Postscript or PDF by Mar 31st. We will return our comments on Apr 2nd.
Demo day. You will give a presentation, followed by a demonstration of your system in action to the entire class. We'll supply a laptop projector, so you should run your demo from your laptop. Demo day is in class on Apr 30.
Final report. You should email us the final project report in Postscript or PDF by May 10, noon.

Grades

When evaluating your project, we look for the following:

usefulness of the system you've built.
quality of the report.
the extent to which your design is a good fit for the problem you are solving.
how useful your new ideas and techniques might be to others building distributed storage systems.

We expect your report to answer the following questions:

What problem are you solving?
What is the motivation for solving that problem? Why is the problem interesting and challenging? Why would a solution be useful?
How have you solved the problem -- how is your system designed?
Why is your design good? What key decisions and trade-offs have you made? Is your design the simplest reasonable design?
Does your solution fit well with the rest of the system? If your solution requires modifying every piece of hardware, software, and data in sight, it won't be credible, unless you can come up with a very good story why everything needs to be changed.
What new ideas or techniques have you developed as part of your design? What can others learn from your work?
How does your implemented system work?
Can you demonstrate that your system does indeed solve the original problem? Typically you'll do this with an experimental evaluation, and present quantitative results.
What is the relationship between your work and previous solutions to similar problems? Your report should include a Related Work section outlining the existing work that's closest to your project, and explaining how your design is different or better.
A good report will also be well written:
- Is the report easy to understand?
- Is it well organized and coherent?
- Does it use diagrams where appropriate?

Ideas

You should feel free to choose any project you like, as long as it is related to storage systems, distributed systems, or operating systems. It must have a substantial system-building and evaluation component. A successful class project usually has very well-defined goals and is modest in scope (remember, you only have 2.5 months to finish it). You could look for inspiration about hot topics in the on-line proceedings of recent SOSP, OSDI, Usenix, and NSDI conferences. Here's a list of ideas that we think could lead to interesting projects.

Make NFS work in the wide area.
Conventional wisdom says NFS won't work over the wide area network. Why? What are the operations that make the system unbearably slow? Measure existing NFS performance over the wide area and implement your improvements.
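As a starting point for the measurement step, here is a minimal sketch of a metadata microbenchmark. It is only an illustration: the function name time_stat_calls is made up, and the directory you pass in is assumed to be a path on an NFS mount that you have set up yourself. It times one os.stat() call per directory entry and reports the average.

```python
import os
import time


def time_stat_calls(directory):
    """Return the average seconds per os.stat() call over the entries
    of `directory` (hypothetical helper; point it at an NFS mount)."""
    names = os.listdir(directory)
    if not names:
        return 0.0
    start = time.perf_counter()
    for name in names:
        # Each stat may cost one round trip to the server if the
        # client has no cached attributes for the file.
        os.stat(os.path.join(directory, name))
    elapsed = time.perf_counter() - start
    return elapsed / len(names)
```

Keep in mind that NFS clients cache file attributes, so a second run over the same directory may not go to the server at all. Comparing a cold run against a warm run, and a local-area mount against a wide-area mount, is one way to see which operations are paying for round trips.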
A peer to peer shared file system over the wide area Internet.
It would be neat to share your files with friends over the wide area Internet through a file system interface instead of using special p2p applications. This usage scenario is different from NFS in the sense that there is no single server (or set of servers) to store files on. Each node should be responsible for its own file storage but also maintain a user-specific namespace that automatically incorporates peers' files. How can you exploit this usage scenario to avoid many of the performance killers of a cluster file system, at the cost of relaxing certain semantics?
Erasure coded distributed file system.
While storage space on an individual's PC seems unnecessarily massive, the demand for storage space on a shared storage cluster tends to outpace the supply. Implement a storage saver module that works with the storage system (e.g. the distributed file system) to transparently compress and erasure-code (instead of replicating) files on disks.
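To make the space argument concrete, here is a minimal sketch of the simplest possible erasure code: k data fragments plus one XOR parity fragment, which survives the loss of any single fragment. This is an illustration only, not a recommended design; the function names are made up, and a real project would more likely use a code that tolerates several losses, such as Reed-Solomon.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data, k):
    """Split `data` into k equal-size fragments (zero padded) and
    append one parity fragment, giving k + 1 fragments in total."""
    size = max(1, -(-len(data) // k))  # ceiling division
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    return fragments + [xor_blocks(fragments)]


def recover(fragments):
    """Rebuild the single missing fragment, marked as None."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost fragment")
    present = [f for f in fragments if f is not None]
    repaired = list(fragments)
    # The XOR of all surviving fragments equals the lost one.
    repaired[missing[0]] = xor_blocks(present)
    return repaired
```

With k = 4, the five fragments occupy 1.25 times the size of the original data, whereas keeping one extra replica occupies 2 times the size; both survive one lost disk. The caller must remember the original length in order to strip the zero padding after reassembly.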
An energy-saving file system using flash memory.
Mechanical hard disks consume significant energy (you should take what I say with a grain of salt and do the necessary background research, and preferably experiments, to verify it). Could you add a layer of cache in front of the hard disk using cheap and large flash memory (2 GB, $30) to allow you to spin down the disk completely most of the time? Your system might solve that "oh no, I'm working on a presentation on the airplane and my laptop battery is running out in 2 minutes" problem.
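One way to reason about this idea before building anything is a toy model that counts how often the disk would have to spin up. The sketch below is an assumption-laden simulation, not a design: the class name FlashCachedDisk, the block-granularity interface, and the flush-everything-when-full policy are all made up for illustration.

```python
class FlashCachedDisk:
    """Toy model: a flash cache absorbs writes so the disk can stay
    spun down. Dictionaries stand in for the flash and the disk."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of blocks held in flash
        self.flash = {}           # block number -> data
        self.dirty = set()        # blocks in flash not yet on disk
        self.disk = {}            # block number -> data
        self.spin_ups = 0         # times the disk had to be woken

    def write(self, block, data):
        # Writes go to flash; the disk is touched only when flash is full.
        if block not in self.flash and len(self.flash) >= self.capacity:
            self.flush()
        self.flash[block] = data
        self.dirty.add(block)

    def read(self, block):
        if block in self.flash:
            return self.flash[block]  # served without the disk
        self.spin_ups += 1            # cache miss wakes the disk
        return self.disk.get(block)

    def flush(self):
        # One spin-up writes back every dirty block, then empties flash.
        self.spin_ups += 1
        for block in self.dirty:
            self.disk[block] = self.flash[block]
        self.dirty.clear()
        self.flash.clear()
```

Replaying a real file system trace through a model like this, with a smarter eviction policy than the one assumed here, would give a first estimate of how long the disk could stay spun down.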
Build a file synchronization tool.
We all have multiple computers that store copies of our data. A file synchronizer brings these copies of files in sync with each other as they are updated. File conflicts are very annoying. How should you reduce the possibilities for conflicts as much as possible? For example, can your synchronizer be more proactive with its syncing? Additionally, traditional file synchronizers sync at the granularity of individual files. Will syncing at a finer grain be better?
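To illustrate what syncing at a finer grain could mean, here is a minimal sketch that compares two versions of a file block by block and reports which blocks would need to be transferred. The function names and the 4096-byte block size are assumptions chosen for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return the indices of blocks in `new` that are absent from, or
    differ from, the block at the same position in `old`."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        i for i, digest in enumerate(new_hashes)
        if i >= len(old_hashes) or digest != old_hashes[i]
    ]
```

This fixed-offset comparison handles in-place edits and appends well, but inserting a single byte near the start of a file shifts every later block and makes them all look changed. rsync addresses that case with a rolling checksum that can match blocks at arbitrary offsets. A real synchronizer would also have to send the new file length so that truncations are applied.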
Build a more full-featured version of the distributed NFS server lab, for example, in the spirit of Frangipani and Petal.
Build a file system interface to another data system, e.g. cvs or web.
For example, in a cvs FS, each file is represented as a directory with the different versions of the file in it.
For example, you can access and upload your videos on YouTube via a file system interface.
A file system utilizing multiple disks.
It's like a user-level RAID. Think of this usage scenario, where I store recorded TV shows using MythTV. Right now, I cannot make use of the space on my 2 hard disks because I cannot have a directory that spans multiple disks. Your user-level FS should enable that. Furthermore, if you can stripe large file writes across multiple disks, you might also increase write performance by keeping all of the disks busy at once.
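The striping part of this idea comes down to address arithmetic. Here is a minimal sketch, assuming round-robin striping with a fixed stripe size; the function names and parameters are made up for illustration.

```python
def locate(offset, stripe_size, num_disks):
    """Map a byte offset in the logical file to a pair
    (disk index, byte offset on that disk)."""
    stripe = offset // stripe_size
    disk = stripe % num_disks
    disk_offset = (stripe // num_disks) * stripe_size + offset % stripe_size
    return disk, disk_offset


def split_write(offset, length, stripe_size, num_disks):
    """Break one logical write into per-disk pieces, each a tuple
    (disk index, offset on that disk, number of bytes)."""
    pieces = []
    end = offset + length
    while offset < end:
        disk, disk_offset = locate(offset, stripe_size, num_disks)
        # Never let a piece cross a stripe boundary.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((disk, disk_offset, chunk))
        offset += chunk
    return pieces
```

For example, with a 4-byte stripe and 2 disks, a 10-byte write at offset 0 becomes three pieces: 4 bytes on disk 0, 4 bytes on disk 1, and the last 2 bytes back on disk 0 at offset 4. Pieces that land on different disks can be issued in parallel, which is where any write speedup would come from.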
A cryptographic file system.
A file system search tool.
One can use the rich pool of access information available in the file system to improve search quality. Examples of such valuable information include how long each file has been kept open since its creation, how many users have accessed it, how frequently it is accessed, etc.