HotOS 2011 LiveBlog Day 3

11May/110

HotOS 2011 LiveBlog Day 3

What If You Could Actually Trust Your Kernel?

seL4 is our new toy: it's formally verified. It's guaranteed not to crash or suffer from security holes. (As long as you can trust the compiler.) Hypervisors are really very large and full of bugs. We could use this for isolating web browser processes as in IBOS (OSDI 2010). We can make TPMs useful (they currently aren't, because PCRs get updated for any app loaded; DRTM/TXT/SVM isn't much better). seL4 provides a trusted environment with a trusted loader that can be verified. It can also improve performance for things like TPC-C by providing a crash-proof virtual storage device that can be less aggressive in writing out to disk. - dgm36

What can you do with a verified kernel? Do a VMM right, do browser sandboxing right. More interesting: use a TPM for real. Existing remote attestation is problematic because it doesn't guard against the measured code being itself buggy, and it only works if you have a complete trusted chain. To use this for e.g. home banking you'd need to make the user boot a complete bootloader+VMM+kernel+app chain all of which measures their successor. Better: DRTM (?) which suspends the present system and does a parallel boot of an untainted kernel which can be measured -- but, suspends the original install which will do interesting things to hardware. Much better: run seL4 as a VMM and construct a trusted, measured chain through a mini OS and banking app. Another idea: can discard the sync-writes in a DBMS that are intended to ensure persistence past an OS crash. Of course this also buys you vulnerability to power failure! (but they assert you can use the capacitors in your PSU as a tiny UPS!) (cs)

Provable Security: How feasible is it?

Let's prove useful properties of real systems! Example of such a property: prove that software provides confidentiality. Verifications should be done against real C. Prove that the C is an implementation of a formal model. Aiming to prove properties of >MLOC systems with difficult code intended for performance. They've proved that seL4 can provide integrity guarantees. They anticipate building systems using chunks the size of seL4 in which each is proven functionally correct, making whole-system proofs easier. Problem: they want to prove confidentiality properties, but they can't make guarantees about timing channels without hardware that deliberately works to eliminate them. (cs)

It's very feasible, but it might not be easy. Real proofs are machine-checked and tell you something that you didn't already know. If you're looking for buffer overflows, you're not trying hard enough. Proofs should cover high-level security goals like integrity and confidentiality. It has to apply to real code (C or assembler), which has been written to be run, not to be proved. You need to structure your large system so that you don't need to prove the correctness of every one of millions of lines of code. 10kloc is a feasible size of component to verify on the present horizon. Proving things about timing effects is still too hard (we'd need a formal model of the hardware for that). - dgm36

Toward Practical and Unconditional Verification of Remote Computations

Want to verify that the server has executed a function correctly given the result, and without re-executing the code. This might be useful for cloud and volunteer computing. PCPs are probabilistically checkable proofs. These are very expensive to perform (hundreds of thousands of years), and the proposal is to make them more efficient. The impracticality comes from using Boolean circuits to represent computations, which makes the proof huge. The idea is to do PCPs interactively, whereby the client sends the checking locations to the server, and only generates those parts of the proof; the server uses cryptographic commitment so that it cannot change the function. Also, use arithmetic circuits instead of Boolean circuits. The refinements for polynomial evaluation went from 130000 years to 11.5 hours. - dgm36

Want to check that an untrusted remote party correctly calculated some function of its input without just plain old recomputing. There exists somthing called Probabilistically Checkable Proofs which allege to do this, but the cost of checking is huge! Basically they work by turning the desired function into a boolean circuit, then checking that a few randomly picked input values map to what the client gave us. More test values --> less probability of error. They use "arithmetic circuits" (dataflow graphs, I guess?) instead of boolean ones to save time computing results. Apparently you can also make major gains by having your remote worker do a bunch of parallel work and batch-verifying the results. The time savings are about 10 orders of magnitude. (cs)

MOMMIE Knows Best: Systematic Optimizations for Verifiable Distributed Algorithms

It sucks that often you have to mangle your ideal algorithm in the name of efficiency. Similarly layering can imply suckitude -- e.g. you might happen to be running a protocol with integrity properties on top of another one; could elide one of those checks. If only we had reasonable devices for composition! They define MOMMIE, which is a way to specify your desired algorithm which is compiled down to TLA+ (a logic for verification) or C++ (for execution). It's a sort of imperative specification... ish. It includes useful primitives for gathering messages etc. (cs)

Abstractions are good. In theory, P2 was a great abstraction; in practice, we wrote the equivalent of assembly code in a relational database. BFT is a great abstraction, but you end up having to cram a lot of implementation details into your proofs because you want it to be efficient. People need a way to do composition, so that proofs carry over into the implementation. MOMMIE is a high-level language for writing your system; it can be turned into a TLA+ specification and a C++ program, and composed with a bunch of implementation modules. A MOMMIE statement has an issuer, a verifier and an auditor, followed by immutable content in c-structs. Program looks like event-condition-actions. Actions are imperative code with loops and variables, etc. - dgm36

The Case for VOS: The Vector Operating System

A web server handling two simultaneous requests will make the same system calls multiple times. The similar system calls should be made only once, with vector arguments, and the vectorisation can be pushed down into the kernel implementation. This will cut down on syscall overhead, by batching the calls and using SIMD parallelism in execution. Memory protection with mprotect: some NVRAM work required doing this very often, getting up to 375k page protections per second. A vectorised mprotect was much better by amortizing context switches, optimizing data structures used and eliminating TLB flushes: 3x better. A challenge is handling convergence and divergence (where different elements require different work): we need some kind of fork-join parallelism or message passing. There is also a tradeoff in deciding when to join or to do replicated work. It will be hard to get the full benefits of this in Linux. The OS should be a staged event system instead, with message passing as an underlying primitive. - dgm36

Parallel webservers do lots of similar syscall sequences (accept, open, read, sendfile)... how about making an accept_vector thing? Could vectorise the rest of the thing too, like doing userland work to compress for Transfer-Encoding, or encrypt for HTTPS. Could ameliorate syscall costs, and could open opportunities to use SSE etc. Particular useful: vector-lookup-directory would make one walk not many. Another example: vector-mprotect using an async shared queue for U->K data movement. Things pushed into the queue are sorted so that again we make a single walk through the VMA-list. They think that lots of the gains were made by avoiding a TLB flush per syscall. Better still: pass flowgraphs down into the kernel which are used to route incoming packets or other hardware events. Kind of like sys_splice, or Vino-style SFI-in-kernel. (cs)

Operating Systems Must Support GPU Abstractions

GPUs are currently tricky to use. The OS conceives of them as an I/O device; we need a better abstraction to make them usable. There are user-mode libraries providing interesting interfaces, but the OS doesn't expose much of an interface (ioctl is all we get). We need a better interface because the kernel isn't providing fairness or isolation, and there's no interface usable by the kernel (without going via a userland surrogate). The schedulers for the CPU and GPU aren't integrated. Similarly there's no concept of a "background" GPU task, e.g. F@H vs. the user interface. Another problem: we can't easily pipeline GPU applications, because there's no OS-level "pipe" that's implemented by passing refs to GPU memory. Their proposed design does the logical thing of generalising pipes and also providing some support for event routing to facilitate a path from say a USB device straight to the graphics card without an indirection via system memory. (cs)

We need support for dataflow in the OS. The problem is that GPUs are underutilized, because they are difficult to program (SIMD execution model, no access to main memory and treated as an I/O device). Programmer-visible interface is things like processes, pipes and files, which map directly onto OS abstractions. For GPU programming, all you get is an ioctl, and this isn't enough for things like fairness, isolation, etc. There is no kernel-facing interface which inhibits the OS from using the GPU and managing it. It also leads to suboptimal data movement. A high CPU load will inhibit GPU performance because the CPU scheduler is not integrated with the GPU scheduler. This might be a problem when the GPU is used to do things like rendering the mouse pointer. Why can't we compose a graphics processing pipeline with pipes? Because of user/kernel mode crossings and the amount of data copying. The ideal programming interface is dataflow between tasks. - dgm36

Multicore OSes: Looking Forward from 1991, er, 2011

We have been saying that parallel programming is hard since 1991. High-level lessons are things like shared-memory with locks doesn't scale; shared-nothing hardware will prevail; and programming will involve message passing. The proposed model is lightweight messages and channels, like Erlang, CSP, etc. Message sending will be like procedure calling (or at least as efficient) and there will be a non-deterministic choice operator. In a channel-based OS, system calls become messages, like in a microkernel. There are some foreseeable issues including the implementation of choice, waiting for channels, and implementing protection. There is an assumption that there is hardware-level messaging. Microkernels use "middleweight messages" whereas this will use "lightweight messages". - dgm36

Hardware isn't getting faster; parallelism is the only way forwards. Hardware is heading towards shared-nothing; programming is likely to need explicit message passing. They advocate erlang-style programming with communicating lightweight threads, which can be assigned to OS threads as we desire. We need a complete system built along these lines, including the kernel (so, Barrelfish). Questions: what's the right model for virtual memory (shared-something, like Corey?); how should we deal with partial failure (i.e. one bit of your multikernel explodes). (cs)