Session 1: Data, Data, Data
Keypad: An Auditing File System for Theft-Prone Devices
The challenge is that mobile devices are prone to theft and loss, and encryption is not sufficient, because people have a habit of attaching the password to the device on a post-it, and it is vulnerable to social and hardware attacks. The aim is to know what (if any) data is compromised in the event of a loss, and to prevent future compromises. The solution is to force remote auditing on every file access (with encryption) by storing keys on an auditing server; this is done in the file system. File system metadata are stored on the trusted server. There are significant challenges in making this performant: caching/prefetching/preallocation are used to optimize key requests, but file creation is more challenging to optimize due to file system semantics. Blocking filename registrations have correct semantics, but poor performance; vice versa for non-blocking registrations. To reconcile this, force a thief to use blocking semantics while allowing the user to use non-blocking semantics (as much as possible), which is based on using filenames as public keys. The second challenge is allowing disconnected access: the idea is to use multiple devices carried by the user to cross-audit file accesses, which still requires devices to hoard keys before going disconnected. - dgm36
Mobile devices get lost or stolen, and encryption systems are not sufficient to address this. Furthermore, failure of encryption is "silent", a drawback that is addressed in this work. The goal is to audit compromises and inform the user, preventing future compromises. All of this should work while there is no network connectivity. KeypadFS is an auditing encrypted file system. It requires users to retrieve file-specific keys from a server in order to decrypt and access any file. Even when a device is stolen, access is still granted, but an audit trail is left on the server.
File names are also held on the audit server; locally, files are identified using 'long random numbers'. Files are encrypted under a symmetric key L_F, and the key itself is stored in the file header as E_{R_F}(L_F), where R_F must be retrieved from the audit server.
The main issue with this is of course performance; they have implemented a bunch of optimizations (key caching, prefetching, preallocation), but still spend 56% of the time on registering file names with the audit server (on creation and renaming, which are frequent operations). Doing the registration in a non-blocking fashion hides network latency, but has weaker semantics with regard to message loss and ordering of updates. So how can we get both strong semantics and good performance? The answer is to get the *thief* to reveal the correct file name -- although they only go down that path if a non-blocking registration failed previously. Identity-based encryption (IBE) is used as a means to enforce this: the file header is encrypted under IBE, using the file name as the public key. On create/rename, L_F is cached in memory for 1s, and they fall back to using IBE if the non-blocking registration request fails.
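The key-wrapping read path described above can be sketched as a toy protocol. This is a minimal illustration, not the paper's implementation: the class and method names are assumptions, and a SHA-256-based XOR keystream stands in for real authenticated encryption. The point is that decrypting a file forces a key request to the audit server, which leaves an audit record.

```python
import hashlib
import os

def keystream(key: bytes, n: int) -> bytes:
    """Toy SHA-256 counter-mode keystream (NOT real crypto)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """XOR stream cipher: the same function encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

class AuditServer:
    """Holds the per-file wrapping keys R_F and logs every key request."""
    def __init__(self):
        self.keys = {}   # file id -> R_F
        self.log = []    # audit trail of (device, file id) accesses

    def register(self, file_id: str) -> bytes:
        self.keys[file_id] = os.urandom(32)
        return self.keys[file_id]

    def fetch(self, file_id: str, device: str) -> bytes:
        self.log.append((device, file_id))   # every access is audited
        return self.keys[file_id]

server = AuditServer()
L_F = os.urandom(32)                  # per-file symmetric key
R_F = server.register("file-42")      # wrapping key held only by the server
header = xor_crypt(R_F, L_F)          # E_{R_F}(L_F), stored in the file header
body = xor_crypt(L_F, b"secret payload")

# Read path: the device must contact the server, leaving an audit record.
r = server.fetch("file-42", device="laptop")
plain = xor_crypt(xor_crypt(r, header), body)
assert plain == b"secret payload"
assert server.log == [("laptop", "file-42")]
```

Even a thief with the device in hand cannot decrypt without triggering `fetch`, which is what makes the audit trail complete.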
The second challenge is ensuring that the system works even in disconnected mode: they observe that users carry multiple devices, and use other co-located devices as an audit proxy. Hoarding policies ensure that even in a (not unlikely, e.g. in a mugging) double-loss case, the damage is restricted. A side-effect of this is that 3G latency can be mostly hidden. Evaluation using mostly anecdotal evidence and comparison to EncFS shows that the performance overhead from this is not enormous. - ms705
Database Engines on Multicores: Why Parallelize When You Can Distribute?
This is about data and multi-cores -- and about a system that scales DBs to more cores. It turns out that when CPU-bound, DBMSs (in this case, PostgreSQL) do not scale linearly even if you add more cores. In particular, in the example, a single low-throughput (in tps) workload brings down combined performance. Unsurprisingly, the degradation is related to lock contention and various other interactions between parallel workloads. The usual approach to fixing this is to identify and remove bottlenecks until the issue goes away -- done in most commercial DBMSs. This takes time, money and effort. An alternative is to throw everything out and replace the DBMS with a completely different system. Problem: that takes time and is only a future solution, while they want a short-to-medium-term solution. The answer: single-master data replication within a machine. Replicate the data, and run multiple instances of the DBMS, one per partition (core). They have one master DB, which takes all the writes and has strong ACID properties, and a large number of satellites which are full or partial replicas and take read operations.
Multimed is middleware that sits between the client and the distributed DBMS, transparently routing requests. They deployed this on a 48-core AMD machine ("Magny Cours"), placing the master and middleware on one socket, and satellites on the others (isolating L3 caches, moving heavy queries to particular satellites and isolating them from other cores). Results show that they achieve almost linear scalability, albeit at a constant overhead. Performance can be increased further by removing durability guarantees from the satellites (which only serve reads anyway), and by using more satellites with partial instead of full replication. However, this only works well for read-heavy workload mixes (e.g. TPC-W) and main-memory-resident datasets (!). In a 20% update workload (as opposed to 5% before), the naive version scales worse than default PostgreSQL, but after removing durability, the Multimed system outperforms PostgreSQL. - ms705
A system that improves database scalability on multicores. When database performance becomes CPU-bound, you would expect linear scaling with more cores, but this only continues to about 10 cores, and tails off after that. The problem is unexpected, and due to interaction between workloads, which causes contention on shared data structures and interference in long scans. Databases are designed to be concurrent, but not parallel (yet). The approach is to take single-master data replication, and do it within a single machine. The satellite nodes can be optimized differently to support different workload mixes. Use topology-aware layout to run this on a 48-core machine (48-core AMD Magny Cours as an example). Achieves "linear" scalability of throughput as cores increase, with a slight gap between Multimed and the idealized PostgreSQL case; there is a slight cost in latency due to the routing layer, but less contention on the satellites outweighs this in normal use. The target workloads are read-heavy, but include updates (5 to 20 percent), and are mostly stored in RAM. Naïve replication gets better throughput than PostgreSQL on TPC-W (5% updates); removing durability guarantees from only the satellites leads to better performance; partial replicas do even better. With 20% updates, you need to remove satellite durability to exceed PostgreSQL performance. - dgm36
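The single-master routing scheme both summaries describe can be sketched in a few lines. This is a toy, not Multimed's actual middleware: all class names are assumptions, and replication to satellites is applied eagerly here, whereas Multimed propagates updates asynchronously from the master's write-ahead log.

```python
import itertools

class Replica:
    """A toy DBMS instance pinned to one partition/core."""
    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Router:
    """Middleware between clients and replicas: writes go to the master,
    reads fan out over the satellites."""
    def __init__(self, master: Replica, satellites: list):
        self.master = master
        self.satellites = satellites
        self._rr = itertools.cycle(satellites)   # round-robin read routing

    def write(self, key, value):
        self.master.apply(key, value)   # strong ACID at the single master
        for s in self.satellites:       # asynchronous in the real system;
            s.apply(key, value)         # eager in this toy sketch

    def read(self, key):
        return next(self._rr).data.get(key)

master = Replica("master")
sats = [Replica(f"sat{i}") for i in range(3)]
router = Router(master, sats)
router.write("x", 1)
assert all(router.read("x") == 1 for _ in range(3))
```

The design choice is visible even in the toy: reads never touch the master, so read-heavy workloads scale with the number of satellites while the master stays the single point of write serialization.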
DepSky: Dependable and Secure Storage in a Cloud-of-Clouds
If you have critical data to store, moving to the cloud is desirable, but one cloud is not enough. Replicating across multiple clouds has benefits, so the aim is to build an object store that runs across multiple clouds. The challenges are to implement efficient replication with passive storage nodes (i.e. existing cloud storage services), and then how to make it affordable. The system has to deal with Byzantine faults, so there will be four clouds. However, Byzantine replication does not solve the problems of confidentiality, and it multiplies your cost by 4. Confidentiality is provided using erasure codes and (f+1)-secret-sharing. The consistency provided is the same as the weakest consistency provided by any of the clouds. Whole thing can be implemented in 3000 lines of code, and provides a REST/HTTPS interface. Writes of 1MB files cost twice as much as reads, because reads only take metadata from multiple clouds. DepSky read latency is close to that of the cloud with the best latency (depends on client geolocation), but write latency is close to the worst latency. A useful optimization is to read the data from the cloud that returned the metadata fastest. - dgm36
Cloud storage is cool, but inappropriate for some use cases, e.g. where confidentiality is required. Two options to improve on this: (1) improve infrastructure, (2) use multiple cloud providers ("cloud-of-clouds"), and distribute/replicate across them: compensates outages, no vendor lock-in, better read performance, avoids data corruption and contains attacks.
As a principle, they do not trust any single cloud provider and assume that they cannot modify the server-side code. They implement the standard access operations on "data units", plus concurrency operations and garbage collection, and assume that individual clouds exhibit Byzantine faults. They use 3f+1 clouds (where in practice, f=1). MRSW-style access is provided by default, and MRMW can be implemented using locking. Their data module stores two files inside a container for each data unit stored: a meta-data file and a data file. On write, only the meta-data is overwritten (after successfully writing the data into a version of the data unit). Reads by default go to a single cloud as I understand it, but they can use additional clouds if the request takes too long. On read, meta-data is read first, then data.
However, thus far, the cloud providers still have access to the data and we need N*size space (where N=3f+1). So add erasure codes and secret sharing.
'Interesting' observation: this works with multiple consistencies -- namely, those provided by the underlying clouds. The consistency provided by DepSky is always the weakest out of those provided by the clouds.
In the evaluation, they find that data transfer cost depends on the cloud providers used, while the storage cost scales linearly, and erasure-code-enabled DepSky costs about 2x the cost without DepSky. Their read latency is close to the latency provided by the cloud with the best latency ("read optimization"; presumably as they prefer this, or just make parallel read requests). Writing takes a long time and is close to the worst cloud (although they can skip the 3f+1th as they only need a quorum of 2f+1). Per-client throughput is orders of magnitude worse than LAN BFT storage systems: 64-1480 (!) kb/s for reading, and 3-108 kb/s for writing (!). They also did an availability analysis and found that some cloud providers do not appear to give the promised availability guarantees; sometimes they only get ~0.95 availability for individual clouds. DepSky improves on this by replicating across several providers. - ms705
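The two-file write/read protocol over 3f+1 clouds can be sketched as follows. This is a minimal illustration under assumptions (dictionaries stand in for cloud object stores, every cloud answers, and no crypto or erasure coding is shown): a write stores the data file first and only then overwrites the metadata file, and a read picks the highest version confirmed by a quorum of 2f+1 metadata replies.

```python
F = 1               # tolerated Byzantine faults
N = 3 * F + 1       # number of clouds

# Each "cloud" is a passive key->value object store (an assumption standing
# in for S3-like services; DepSky cannot run code on the provider side).
clouds = [{} for _ in range(N)]

def write(du: str, version: int, payload: bytes):
    """Write the data file to every cloud first, then the metadata file."""
    for c in clouds:
        c[f"{du}/data-{version}"] = payload
    for c in clouds:
        c[f"{du}/meta"] = version   # metadata flips readers to the new version

def read(du: str):
    """Read metadata from all clouds, pick the highest version that at
    least 2f+1 clouds report, then fetch the matching data file."""
    metas = sorted(m for c in clouds
                   if (m := c.get(f"{du}/meta")) is not None)
    version = metas[-(2 * F + 1)]   # (2f+1)-th largest reported version
    for c in clouds:                # in DepSky, prefer the fastest cloud
        data = c.get(f"{du}/data-{version}")
        if data is not None:
            return data

write("unit1", 1, b"v1")
write("unit1", 2, b"v2")
assert read("unit1") == b"v2"
```

Ordering data before metadata is what makes the protocol safe with passive storage: a reader can never see a metadata version whose data file has not been fully written somewhere.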
Session 2: Not Your Textbook Solution
Feature Consistency in Compile-Time-Configurable System Software: Facing the Linux 10,000 Feature Problem
Large numbers of features are problematic from a compile-time configuration point of view, as they increase complexity, which can be a source of bugs. Bugs come from differences between configuration declaration and implementation, mainly in #ifdef blocks and #define directives. There is a lot of work on static analysis of systems source code -- but these tools only look at a single configuration instance, as they are applied to the enabled code paths at compile time.
They differentiate between symbolic and logic inconsistencies. An example of a symbolic inconsistency is CPU_HOTPLUG vs. HOTPLUG_CPU, while an example of a logic inconsistency is in code relating to NUMA and DISCONTIGMEMORY, where the former depends on the latter: in this piece of code, an #else branch is unreachable. They specify a model to be used and then extract a logic expression from the source code. Using a SAT solver, they check whether the extracted expression satisfies the model (which is extracted from configuration files?). Challenges: accurately identify inconsistencies, make sure all code is covered (22 architectures), and performance (run quickly on incremental builds). After identifying inconsistencies, they match them against a "won't fix" whitelist and finally generate a defect report if not matched.
Using this, they found 1776 configurability issues, submitted 123 patches for 364 defects. 20 of them were confirmed as new bugs, and they managed to point out 5129 lines of cruft code.
Future work: improve the configuration model extraction, support for #define, integrate into (currently) configuration-agnostic static analysis tools. - ms705
The complexity arising from the configurability of modern systems can be a large source of bugs, taking Linux KConfig as an example. The source of inconsistencies is the separation between configuration and implementation. Static analysis can't help here, because it only looks at the implementation. However, we can identify a correspondence between symbols and constraints in the configuration and the implementation, which helps to identify bugs due to non-existent CONFIG defines in the source code. There are also logical inconsistencies due to the dependencies between configurations (leading to redundant ifdef/else clauses). To resolve these, use a SAT solver to identify tautologies or inconsistencies. Challenges were to get high accuracy (got perfect accuracy) and coverage (derived a configuration model for all 22 architectures). The checker also needed to achieve sufficiently high performance to use it in incremental builds: a complete run on Linux took 15 minutes, and program slicing can make this shorter. The Linux kernel maintainers offered many reasons why these seeming inconsistencies should be left in the code, so the system includes a whitelist filter. Net results: 1776 issues, 123 patches, 364 defects, 20 new bugs and 5129 lines of cruft removed. - dgm36
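The logic-inconsistency check can be illustrated with the NUMA example from the talk. This sketch makes assumptions: a brute-force truth-table search stands in for the real SAT solver, and the formulas are hand-written rather than extracted from Kconfig and the preprocessor. The idea is that an #else branch is dead code iff (configuration model AND branch condition) is unsatisfiable.

```python
from itertools import product

def satisfiable(formula, variables):
    """Brute-force SAT check: try every truth assignment (toy stand-in
    for a real SAT solver; fine for a handful of variables)."""
    return any(formula(dict(zip(variables, bits)))
               for bits in product([False, True], repeat=len(variables)))

# Configuration model extracted from Kconfig: NUMA depends on DISCONTIGMEMORY,
# i.e. NUMA -> DISCONTIGMEMORY.
model = lambda v: (not v["NUMA"]) or v["DISCONTIGMEMORY"]

# Condition under which the #else branch of "#ifdef DISCONTIGMEMORY",
# nested inside "#ifdef NUMA", would be compiled.
else_branch = lambda v: v["NUMA"] and not v["DISCONTIGMEMORY"]

dead = not satisfiable(lambda v: model(v) and else_branch(v),
                       ["NUMA", "DISCONTIGMEMORY"])
assert dead   # no valid configuration ever compiles the #else block
```

Symbolic inconsistencies (CPU_HOTPLUG vs. HOTPLUG_CPU) are even simpler: the #ifdef symbol just never appears in the configuration model at all.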
A Case for Scaling Applications to Many-core Platforms with OS Clustering
Aim is to keep things looking like GNU/Linux while enjoying the "deliciousness" of multi-kernel operating systems. The idea is to cluster multiple operating system kernels on top of a VMM. The architecture supports a "super process" abstraction that runs across virtual machines, with threads running in different VMs. To support super processes, there needs to be a way to spawn processes or threads remotely on different VMs. Some syscalls can be handled natively (if they are stateless), and Cerberus implements a custom syscall handler for the rest. Super processes can share address space between VMs using the shared page table approach suggested in Corey. File system and network sharing is achieved by caching critical kernel state (e.g. a sentry) and dispatching most syscalls locally. Implemented on Xen with a 1800-LOC patch to the hypervisor. Evaluated on an 8x6 AMD Opteron, comparing Linux 2.6.18, XenoLinux Dom0 and Cerberus (this system). The main advantage comes from more efficient implementation of mutex-related syscalls (histogram), and less contention on file system operations (dbench). Limitations creep in when processes are short-lived, remote requests are common and memory-mapping sizes are small. - dgm36
This is about partitioning multi-core systems into batches of cores that are assigned to different OSes. They implement this as several OSes on top of a VMM, and have a thing called a "super process" that can run across multiple OSes running the Cerberus daemon. Processes can migrate between different VMs, as can threads. They have some kernel- and hypervisor-level support for crossing VM boundaries (including a syscall interception layer). This is all built on top of Xen 3.3.
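The syscall interception layer can be sketched as a routing decision. This is a hypothetical illustration, not Cerberus's code: the class names, the "owner" placement function and the fake syscalls are all assumptions. The structural point from the talk survives: stateless syscalls are handled natively in the local VM, while stateful ones go through a custom handler that may dispatch them to the VM owning the relevant kernel state.

```python
# Syscalls with no cross-VM state can always run in the local kernel.
STATELESS = {"getpid", "gettimeofday"}

class VM:
    """A toy guest kernel holding some per-VM state."""
    def __init__(self, vm_id: int):
        self.vm_id = vm_id
        self.files = {}   # stand-in for per-VM kernel file-table state

    def handle(self, syscall: str, *args):
        if syscall == "open":
            self.files[args[0]] = f"fd@vm{self.vm_id}"
            return self.files[args[0]]
        if syscall == "getpid":
            return 1234   # same super-process pid everywhere
        raise NotImplementedError(syscall)

class SuperProcess:
    """Threads of one process spread across VMs; syscalls are intercepted."""
    def __init__(self, vms: list, owner):
        self.vms = vms
        self.owner = owner   # assumed policy mapping a path to its home VM

    def syscall(self, local_vm: int, name: str, *args):
        if name in STATELESS:
            return self.vms[local_vm].handle(name)   # handled natively
        home = self.owner(args[0])                   # custom handler path:
        return self.vms[home].handle(name, *args)    # dispatch to owner VM

vms = [VM(0), VM(1)]
sp = SuperProcess(vms, owner=lambda path: len(path) % 2)  # toy placement
fd = sp.syscall(0, "open", "/tmp/x")
assert fd == "fd@vm0"
assert sp.syscall(1, "getpid") == 1234
```

This also shows why short-lived processes and frequent remote requests hurt: every stateful syscall that misses its local VM pays a cross-VM dispatch.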
--- Unfortunately my laptop battery ran out at this point, so please look at Derek's excellent summaries for the remainder of the day --- (ms705)
Refuse to crash with Re-FUSE
It is hard to write file systems in the kernel, taking about 5 to 10 years of skilled developer effort. This has led to the popularity of FUSE for implementing file systems in user space. The problem is that, now that anyone can develop a file system and its reliability is no longer critical, they tend to crash a lot more. Re-FUSE is restartable FUSE, which provides a mechanism for file systems to restart in a transparent (to applications) and stateful manner. Simple re-execution does not work because it can lead to an inconsistent file system. Undoing operations is not always possible. To enable re-execution, Re-FUSE performs dummy re-execution for operations that have already completed, then re-executes the remaining operations. The main components of Re-FUSE perform fault detection, fault anticipation and fault recovery. System call logging enables the system to exhibit the correct behavior on replay: a table in the kernel records the parameters and a sequence number from the issuance of a request to the receipt of a response. Re-FUSE required no more than 10 lines of code added to existing FUSE FSs. It merely needed to daemonize local processes and notify some local state updates. Fault injection evaluation showed that consistency is maintained for a wide range of faults. - dgm36
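The dummy re-execution idea can be sketched with a request log keyed by sequence number. This is a hypothetical illustration (class and operation names assumed, and a dictionary stands in for the in-kernel table): on replay, a completed request is answered from the log instead of being re-applied, so side effects that already reached stable storage are not duplicated.

```python
class ReFuseDaemon:
    """Toy restartable FS daemon with kernel-side request logging."""
    def __init__(self):
        self.log = {}        # seq -> (op, args, result): the kernel table
        self.state = {}      # stand-in for the file system's durable state
        self.executed = 0    # counts *real* executions, for illustration

    def request(self, seq: int, op: str, *args):
        if seq in self.log:               # dummy re-execution on replay:
            return self.log[seq][2]       # return the logged result
        result = self._execute(op, *args) # first time: really execute
        self.log[seq] = (op, args, result)
        return result

    def _execute(self, op: str, *args):
        self.executed += 1
        if op == "write":
            path, data = args
            self.state[path] = data
            return len(data)
        raise NotImplementedError(op)

fs = ReFuseDaemon()
assert fs.request(1, "write", "/a", b"hi") == 2

# The daemon crashes and restarts; in-flight requests are replayed.
assert fs.request(1, "write", "/a", b"hi") == 2   # served from the log
assert fs.executed == 1                           # side effects not redone
assert fs.request(2, "write", "/b", b"bye") == 3  # new request runs for real
```

The sequence number is what distinguishes "already completed, replay the answer" from "still outstanding, execute for real", which is exactly why naive re-execution without the log can corrupt the file system.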
Session 3: BFT Systems Made Practical
Increasing Performance in Byzantine Fault-Tolerant Systems with On-Demand Replica Consistency
Drawbacks of existing BFT include high resource usage and performance overheads. The RE-FIT project is looking at resource efficient BFT. To solve this, we can either optimize resource usage or optimize performance. The approach taken here optimizes performance. Given the split into agreement and execution stages, past papers have improved the agreement stage down to 1 to 2 milliseconds overhead, and the execution stage now dominates the execution time of non-trivial requests. The insight is that, in the absence of faults, only f+1 replies are needed to a request. The architecture adds a selector stage between agreement and execution, which selects whether or not to perform a request. Need requests to carry metadata about which objects they access, in order to make the protocol work. This can be like a file name or handle, which makes some categories of service easy to adapt. The protocol ensures that each object is maintained on f+1 replicas and unmaintained on others. If an object is unmaintained, it may be outdated. The selector ensures that requests only execute on machines that maintain the relevant objects, and other replicas merely log the request, which cuts the workload of each node; a checkpointing scheme allows garbage collection of the log. To handle requests on multiple objects, replicas maintaining at least one of the objects will execute the request, and will have to update the unmaintained objects (called a cross-border request). The aim now is to minimize cross-border requests by optimizing the object distribution: for NFS this might assign files and their parent directories to the same node. To handle faults, additional replicas must handle the request in order to generate enough replies. Evaluated using the Postmark benchmark with varying numbers of clients: regular BFT has >50s overhead over unreplicated for all numbers of clients, and increasing; ODRC does better than unreplicated beyond 20 clients; with optimized placement it does even better. 
With a replica fault, only some clients see degraded performance, due to the distribution of files. - dgm36
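The selector stage can be sketched as a placement function plus a per-replica decision. This is a toy under stated assumptions: the hash-based placement policy, class names, and reply format are invented, and the agreement stage, checkpointing and cross-border handling are omitted. What it shows is the core invariant: each object is maintained on exactly f+1 of the 3f+1 replicas, which execute the request, while the others merely log it.

```python
F = 1
REPLICAS = list(range(3 * F + 1))

def maintainers(obj: str) -> set:
    """Assumed placement policy: f+1 consecutive replicas per object
    (the paper optimizes placement, e.g. co-locating NFS files with
    their parent directories)."""
    start = hash(obj) % len(REPLICAS)
    return {(start + i) % len(REPLICAS) for i in range(F + 1)}

class Replica:
    def __init__(self, rid: int):
        self.rid = rid
        self.state = {}   # maintained objects, kept up to date
        self.log = []     # requests logged for unmaintained objects

    def select(self, obj: str, op: str):
        """The selector: execute only if this replica maintains obj."""
        if self.rid in maintainers(obj):
            self.state[obj] = op          # execute and reply
            return ("reply", op)
        self.log.append((obj, op))        # otherwise just log
        return None

replicas = [Replica(r) for r in REPLICAS]
replies = [r.select("fileA", "write-1") for r in replicas]
assert sum(x is not None for x in replies) == F + 1   # f+1 replies suffice
```

In the fault-free case those f+1 matching replies are all the client needs; on a fault, a logged replica replays from its log (since the last checkpoint) to catch up and produce the missing reply.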
Efficient Middleware for Byzantine Fault-tolerant Database Replication
Unlike previous systems, a BFT database with no centralized component and good performance, by using techniques including (but not limited to) snapshot isolation. Byzantium uses PBFT underneath a modified JDBC driver to achieve BFT. A database transaction becomes four BFT operations: BEGIN, READ, WRITE and COMMIT. However, this limits concurrency and leads to high BFT overhead. The insight is that just the BEGIN and COMMIT operations can be done using BFT, and the READs and WRITEs can execute concurrently. Since most databases use locks to avoid conflicts, Byzantium must avoid deadlocks in the presence of multiple transaction masters. Evaluated using TPC-C on the single- and multiple-master versions of Byzantium (versus a single server and PBFT). Read-only transactions can be optimized by running them on only f+1 replicas and ensuring that all responses are the same. - dgm36
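The BEGIN/COMMIT-only insight can be sketched as follows. This is a toy under assumptions (class names invented, PBFT replaced by a loop over replicas, and the snapshot-isolation validation at commit omitted): reads and writes are buffered locally with no agreement round, and only COMMIT goes through BFT, replaying the operation batch deterministically on every replica.

```python
class Replica:
    """One database replica in the BFT group."""
    def __init__(self):
        self.db = {"x": 0}

    def apply(self, ops):
        for op, key, val in ops:
            if op == "write":
                self.db[key] = val

class BFTGroup:
    """Stand-in for the PBFT layer: agreement is simulated by applying
    the same batch to all 3f+1 replicas in the same order."""
    def __init__(self, n: int):
        self.replicas = [Replica() for _ in range(n)]

    def agree_and_commit(self, ops):
        for r in self.replicas:
            r.apply(ops)

class Transaction:
    def __init__(self, group: BFTGroup):
        self.group = group    # BEGIN: only local setup, no BFT round here
        self.ops = []

    def write(self, key, val):
        self.ops.append(("write", key, val))   # buffered, no BFT round

    def commit(self):
        self.group.agree_and_commit(self.ops)  # the single BFT operation

g = BFTGroup(4)   # 3f+1 replicas with f = 1
t = Transaction(g)
t.write("x", 42)
t.commit()
assert all(r.db["x"] == 42 for r in g.replicas)
```

Batching the whole transaction into one agreement round is where the concurrency win comes from: the expensive BFT protocol runs once per transaction instead of once per statement.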
ZZ and the Art of Practical BFT
In ZZ, when there are no faults, f+1 executing replicas can make progress. Fault detection is invoked when executing replicas disagree, which wakes up and spawns more executing replicas. ZZ assumes there will be multiple applications running in BFT virtual machines in your data center, and each server will run multiple BFT VMs. When replicas disagree, state is recovered on demand by replaying requests from the last checkpoint.
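ZZ's fault-free fast path can be sketched as a reply-matching loop. This is a hypothetical illustration (function names and reply format assumed, and the checkpoint-based state recovery for newly woken replicas is omitted): only f+1 replicas execute in the steady state, and extra sleeping replicas are woken one at a time until f+1 replies match.

```python
F = 1   # tolerated Byzantine faults

def run(replies_by_replica):
    """Start with f+1 executing replicas; on disagreement, wake another
    replica (up to 2f+1 total) until f+1 replies match."""
    active = F + 1                          # fault-free steady state
    while True:
        replies = replies_by_replica[:active]
        for candidate in set(replies):
            if replies.count(candidate) >= F + 1:
                return candidate, active    # f+1 matching replies found
        active += 1                         # fault detected: spawn one more

# Replies each replica *would* give; replica 0 is Byzantine here.
replies = ["bad", "ok", "ok"]
result, used = run(replies)
assert result == "ok"
assert used == 3   # one extra replica had to be woken up
```

With honest replicas the loop returns immediately with `active == f+1`, which is why ZZ can pack many BFT-protected applications onto the same servers: the extra replicas cost (almost) nothing until a fault actually occurs.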