syslog
11 May 2011

HotOS 2011 LiveBlog Day 3

Posted by Chris Smowton

What If You Could Actually Trust Your Kernel?

seL4 is our new toy: it's formally verified. It's guaranteed not to crash or suffer from security holes. (As long as you can trust the compiler.) Hypervisors are really very large and full of bugs. We could use this for isolating web browser processes as in IBOS (OSDI 2010). We can make TPMs useful (they currently aren't, because PCRs get updated for any app loaded; DRTM/TXT/SVM isn't much better). seL4 provides a trusted environment with a trusted loader that can be verified. It can also improve performance for things like TPC-C by providing a crash-proof virtual storage device that can be less aggressive in writing out to disk. - dgm36

What can you do with a verified kernel? Do a VMM right, do browser sandboxing right. More interesting: use a TPM for real. Existing remote attestation is problematic because it doesn't guard against the measured code being itself buggy, and it only works if you have a complete trusted chain. To use this for e.g. home banking you'd need to make the user boot a complete bootloader+VMM+kernel+app chain, all of which measures their successor. Better: DRTM (?), which suspends the present system and does a parallel boot of an untainted kernel which can be measured -- but it suspends the original install, which will do interesting things to the hardware. Much better: run seL4 as a VMM and construct a trusted, measured chain through a mini OS and banking app. Another idea: you can discard the sync-writes in a DBMS that are intended to ensure persistence past an OS crash. Of course this also buys you vulnerability to power failure! (but they assert you can use the capacitors in your PSU as a tiny UPS!) (cs)

Provable Security: How feasible is it?

Let's prove useful properties of real systems! Example of such a property: prove that software provides confidentiality. Verification should be done against real C: prove that the C is an implementation of a formal model. The aim is to prove properties of million-line systems whose code is hard to reason about because it's written for performance. They've proved that seL4 can provide integrity guarantees. They anticipate building systems out of chunks the size of seL4, each proven functionally correct, making whole-system proofs easier. Problem: they want to prove confidentiality properties, but they can't make guarantees about timing channels without hardware that deliberately works to eliminate them. (cs)

It's very feasible, but it might not be easy. Real proofs are machine-checked and tell you something that you didn't already know. If you're looking for buffer overflows, you're not trying hard enough. Proofs should cover high-level security goals like integrity and confidentiality. It has to apply to real code (C or assembler), which has been written to be run, not to be proved. You need to structure your large system so that you don't need to prove the correctness of every one of millions of lines of code. 10kloc is a feasible size of component to verify on the present horizon. Proving things about timing effects is still too hard (we'd need a formal model of the hardware for that). - dgm36

Toward Practical and Unconditional Verification of Remote Computations

Want to verify that the server has executed a function correctly, given the result, without re-executing the code. This might be useful for cloud and volunteer computing. PCPs are probabilistically checkable proofs. These are very expensive to perform (hundreds of thousands of years), and the proposal is to make them more efficient. The impracticality comes from using Boolean circuits to represent computations, which makes the proof huge. The idea is to do PCPs interactively, whereby the client sends the checking locations to the server, which only generates those parts of the proof; the server uses cryptographic commitment so that it cannot change the function. Also, use arithmetic circuits instead of Boolean circuits. For polynomial evaluation, these refinements brought the cost down from 130,000 years to 11.5 hours. - dgm36

Want to check that an untrusted remote party correctly calculated some function of its input without just plain old recomputing. There exists something called Probabilistically Checkable Proofs which allegedly do this, but the cost of checking is huge! Basically they work by turning the desired function into a boolean circuit, then checking that a few randomly picked input values map to what the client gave us. More test values --> less probability of error. They use "arithmetic circuits" (dataflow graphs, I guess?) instead of boolean ones to save time computing results. Apparently you can also make major gains by having your remote worker do a bunch of parallel work and batch-verifying the results. The time savings are about 10 orders of magnitude. (cs)
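
Not their protocol, but a tiny taste of the "verify probabilistically instead of recomputing" idea: Freivalds' classic check accepts a claimed matrix product after O(n^2) work per round, rather than redoing the O(n^3) multiplication. A minimal C sketch (all names mine):

```c
/* Freivalds' check: given n-by-n matrices A, B and a claimed product C,
 * verify A*B == C with high probability in O(n^2) per round, instead of
 * recomputing the O(n^3) product. NOT the PCP protocol from the talk --
 * just a classic, simple example of probabilistic verification. */
#include <stdio.h>
#include <stdlib.h>

/* One round: pick a random 0/1 vector r and check A*(B*r) == C*r. */
static int freivalds_round(int n, const double *A, const double *B,
                           const double *C)
{
    double *r   = malloc(n * sizeof *r);
    double *Br  = calloc(n, sizeof *Br);
    double *ABr = calloc(n, sizeof *ABr);
    double *Cr  = calloc(n, sizeof *Cr);
    int ok = 1;

    for (int i = 0; i < n; i++)
        r[i] = rand() & 1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Br[i] += B[i * n + j] * r[j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            ABr[i] += A[i * n + j] * Br[j];
            Cr[i]  += C[i * n + j] * r[j];
        }
    for (int i = 0; i < n; i++)
        if (ABr[i] != Cr[i]) ok = 0;

    free(r); free(Br); free(ABr); free(Cr);
    return ok;   /* 1 = consistent this round, 0 = definitely wrong */
}

int main(void)
{
    /* Claimed: [1 2; 3 4] * [5 6; 7 8] == [19 22; 43 50]. */
    double A[] = {1, 2, 3, 4}, B[] = {5, 6, 7, 8}, C[] = {19, 22, 43, 50};
    int ok = 1;
    for (int round = 0; round < 20; round++)   /* wrong answers escape each round with p <= 1/2 */
        ok &= freivalds_round(2, A, B, C);
    printf("claimed product %s\n", ok ? "accepted" : "rejected");
    return 0;
}
```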

MOMMIE Knows Best: Systematic Optimizations for Verifiable Distributed Algorithms

It sucks that often you have to mangle your ideal algorithm in the name of efficiency. Similarly layering can imply suckitude -- e.g. you might happen to be running a protocol with integrity properties on top of another one; could elide one of those checks. If only we had reasonable devices for composition! They define MOMMIE, which is a way to specify your desired algorithm which is compiled down to TLA+ (a logic for verification) or C++ (for execution). It's a sort of imperative specification... ish. It includes useful primitives for gathering messages etc. (cs)

Abstractions are good. In theory, P2 was a great abstraction; in practice, we wrote the equivalent of assembly code in a relational database. BFT is a great abstraction, but you end up having to cram a lot of implementation details into your proofs because you want it to be efficient. People need a way to do composition, so that proofs carry over into the implementation. MOMMIE is a high-level language for writing your system; it can be turned into a TLA+ specification and a C++ program, and composed with a bunch of implementation modules. A MOMMIE statement has an issuer, a verifier and an auditor, followed by immutable content in C structs. Programs look like event-condition-action rules; actions are imperative code with loops, variables, etc. - dgm36

The Case for VOS: The Vector Operating System

A web server handling two simultaneous requests will make the same system calls multiple times. The similar system calls should be made only once, with vector arguments, and the vectorisation can be pushed down into the kernel implementation. This will cut down on syscall overhead, by batching the calls and using SIMD parallelism in execution. Memory protection with mprotect: some NVRAM work required doing this very often, getting up to 375k page protections per second. A vectorised mprotect was much better by amortizing context switches, optimizing data structures used and eliminating TLB flushes: 3x better. A challenge is handling convergence and divergence (where different elements require different work): we need some kind of fork-join parallelism or message passing. There is also a tradeoff in deciding when to join or to do replicated work. It will be hard to get the full benefits of this in Linux. The OS should be a staged event system instead, with message passing as an underlying primitive. - dgm36

Parallel webservers do lots of similar syscall sequences (accept, open, read, sendfile)... how about making an accept_vector thing? Could vectorise the rest of the thing too, like doing userland work to compress for Transfer-Encoding, or encrypt for HTTPS. Could ameliorate syscall costs, and could open opportunities to use SSE etc. Particularly useful: a vectorised directory lookup would make one walk, not many. Another example: vector-mprotect using an async shared queue for user-to-kernel data movement. Things pushed into the queue are sorted so that, again, we make a single walk through the VMA list. They think that a lot of the gains came from avoiding a TLB flush per syscall. Better still: pass flowgraphs down into the kernel which are used to route incoming packets or other hardware events. Kind of like sys_splice, or Vino-style SFI-in-kernel. (cs)
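
To make the idea concrete, here's a hypothetical sketch of what vectorised syscall entry points might look like; the names, structs and signatures are invented for illustration and are not the authors' interface:

```c
/* Hypothetical sketch of a vectorised syscall interface (names and
 * signatures invented, not from the paper). The point is amortising the
 * user/kernel crossing: one trap services many independent requests. */
#include <stddef.h>
#include <sys/types.h>

struct vec_open_req {
    const char *path;
    int         flags;
    mode_t      mode;
    int         fd;      /* out: resulting fd, or -errno */
};

/* One kernel crossing, n opens; the kernel can also walk a shared path
 * prefix (e.g. a common directory) once instead of n times. */
long sys_open_vector(struct vec_open_req *reqs, size_t n);

struct vec_mprotect_req {
    void  *addr;
    size_t len;
    int    prot;
    int    err;          /* out: 0 or -errno */
};

/* Requests sorted by address so the kernel makes a single pass over the
 * VMA list and, ideally, one TLB shootdown instead of n. */
long sys_mprotect_vector(struct vec_mprotect_req *reqs, size_t n);
```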

Operating Systems Must Support GPU Abstractions

GPUs are currently tricky to use. The OS conceives of them as an I/O device; we need a better abstraction to make them usable. There are user-mode libraries providing interesting interfaces, but the OS doesn't expose much of an interface (ioctl is all we get). We need a better interface because the kernel isn't providing fairness or isolation, and there's no interface usable by the kernel itself (without going via a userland surrogate). The schedulers for the CPU and GPU aren't integrated. Similarly, there's no concept of a "background" GPU task (e.g. Folding@home vs. the user interface). Another problem: we can't easily pipeline GPU applications, because there's no OS-level "pipe" that's implemented by passing refs to GPU memory. Their proposed design does the logical thing: generalise pipes, and provide some support for event routing so that data can flow from, say, a USB device straight to the graphics card without an indirection via system memory. (cs)

We need support for dataflow in the OS. The problem is that GPUs are underutilized because they are difficult to program (SIMD execution model, no access to main memory, treated as an I/O device). The programmer-visible interface on the CPU side is things like processes, pipes and files, which map directly onto OS abstractions; for GPU programming, all you get is an ioctl, and this isn't enough for things like fairness, isolation, etc. There is no kernel-facing interface, which prevents the OS itself from using or managing the GPU. It also leads to suboptimal data movement. A high CPU load will hurt GPU performance because the CPU scheduler is not integrated with the GPU scheduler; this is a problem when the GPU is used to do things like rendering the mouse pointer. Why can't we compose a graphics processing pipeline with pipes? Because of user/kernel mode crossings and the amount of data copying. The ideal programming interface is dataflow between tasks. - dgm36
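
As a thought experiment (invented API, not the paper's), the kind of kernel-facing GPU pipe the talk argues for might look something like this:

```c
/* Hypothetical sketch of a "GPU-aware pipe" (names invented, not the
 * paper's API): like pipe(2), but what flows through it are references
 * to GPU memory, so stages of a GPU pipeline compose without bouncing
 * data through host RAM or a userland surrogate. */

typedef struct gpu_buf gpu_buf_t;     /* opaque handle to device memory */

/* Create a pipe whose payload is GPU buffer references, not bytes. */
int gpipe(int gpfd[2]);

/* Producer hands off a device buffer; no host-side copy. */
int gpipe_write(int gpfd, gpu_buf_t *buf);

/* Consumer (another GPU task, or a kernel subsystem routing events from
 * e.g. a USB camera) receives the reference. */
int gpipe_read(int gpfd, gpu_buf_t **buf);
```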

Multicore OSes: Looking Forward from 1991, er, 2011

We have been saying that parallel programming is hard since 1991. High-level lessons are things like shared-memory with locks doesn't scale; shared-nothing hardware will prevail; and programming will involve message passing. The proposed model is lightweight messages and channels, like Erlang, CSP, etc. Message sending will be like procedure calling (or at least as efficient) and there will be a non-deterministic choice operator. In a channel-based OS, system calls become messages, like in a microkernel. There are some foreseeable issues including the implementation of choice, waiting for channels, and implementing protection. There is an assumption that there is hardware-level messaging. Microkernels use "middleweight messages" whereas this will use "lightweight messages". - dgm36

Hardware isn't getting faster; parallelism is the only way forwards. Hardware is heading towards shared-nothing; programming is likely to need explicit message passing. They advocate Erlang-style programming with communicating lightweight threads, which can be assigned to OS threads as we desire. We need a complete system built along these lines, including the kernel (so, Barrelfish). Questions: what's the right model for virtual memory (shared-something, like Corey?); how should we deal with partial failure (i.e. one bit of your multikernel explodes)? (cs)
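
For flavour, a minimal user-space sketch of the channel primitive such a system would be built on, done here with pthreads (my sketch, not theirs); in the proposed OS this, plus a nondeterministic choice operator omitted here, would be the native primitive, and a "system call" would just be a message to a kernel thread:

```c
/* Minimal bounded, blocking channel built on pthreads, as an
 * illustration of the "lightweight messages and channels" model. */
#include <pthread.h>

#define CHAN_CAP 16

struct chan {
    void           *buf[CHAN_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
};

void chan_init(struct chan *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
}

/* Send blocks when the channel is full; ideally about as cheap as a
 * procedure call, which is the bar the talk sets. */
void chan_send(struct chan *c, void *msg) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CHAN_CAP)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = msg;
    c->tail = (c->tail + 1) % CHAN_CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

void *chan_recv(struct chan *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    void *msg = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return msg;
}
```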

10 May 2011

HotOS 2011 LiveBlog Day 2

Posted by Chris Smowton

Disk-Locality in Datacenter Computing Considered Irrelevant

Hadoop et al try to put tasks where the disk containing their input data is. But network bandwidth > disk bandwidth! And net bandwidth is alleged to be a plentiful resource! And disks aren't getting faster at anything like the rate of the network! (SSDs?) Memory locality is still worthwhile, though, but even modern datacentres can't just keep everything in memory (Facebook example: mem_bytes = data_bytes/200). Sadly this isn't much use: datacentre jobs typically have very large working sets, so it's all-or-nothing when it comes to in-memory caching. Still, 96% of FB jobs can fit entirely in memory. Cache replacement algorithms need to be adapted to fit this sort of use case. (cs)

The key idea in datacenter computing is co-locating computation with its input, and much work has gone into this. Looking at the Facebook production cluster, 85% of jobs see tasks reading from the network running just as fast as disk-local tasks; also, disk-local reads are not faster than rack-local reads. Networks are getting faster and over-subscription is reducing; however, disk speeds are plateauing. Memory locality matters a lot, because memory is two orders of magnitude faster than disk. The problem for caching is that the working set is close to the size of the whole input, and you need to cache everything because a job with even a single from-disk task will be much slower. Apparently, 96% of jobs at Facebook could fit in memory. - dgm36

Optimising Data Partitioning for Data-Parallel Computing

Data partitioning controls the degree of parallelism. We have many possible choices of partition function, and degrees of partitioning which can have a big effect on performance and cost. There are problems of data and computation skew. Also, a balanced workload does not necessarily give optimal performance because we can get more cross-node traffic with more partitions. The problem is harder than for databases because we need to understand user-defined code, and data with less structure than simple tables. Analyzing user-defined code may be tricky if the amount of processing is data-dependent. Perform cost modeling and optimization to get a low estimated cost. Two approaches: white box (analytical formulation) and black box (based on down-sampling). - dgm36

Suppose we've got a boatload of data to process: how many chunks should we split it into to achieve data parallelism, and how should we perform the split? We want to avoid skew in data size, and also in the compute cost applied to that data (i.e., one record is not necessarily like another). Perfect balance might not be the best idea either -- we have to consider the volume of cross-node traffic that will result. Challenge: analyse the user's code to figure out, ahead of execution, how long it will take to process a given record. Then do a global cost analysis using the predicted runtimes for each partition and the supposed properties of the network, etc. (cs)
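
A toy cost model, just to illustrate the shape of the optimisation problem (the formulation is mine, not the authors'):

```c
/* Toy version of the kind of cost model the optimiser needs: given a
 * predicted per-record cost from code analysis and a candidate
 * partitioning, estimate job cost as the slowest partition plus a
 * penalty for cross-node traffic. */
#include <stddef.h>

struct partition {
    double records;          /* estimated records in this partition */
    double cost_per_record;  /* predicted from the user-defined code */
    double remote_bytes;     /* data it must pull from other nodes */
};

double estimate_cost(const struct partition *p, size_t n,
                     double secs_per_remote_byte)
{
    double slowest = 0.0, network = 0.0;
    for (size_t i = 0; i < n; i++) {
        double compute = p[i].records * p[i].cost_per_record;
        if (compute > slowest)
            slowest = compute;           /* stragglers dominate */
        network += p[i].remote_bytes * secs_per_remote_byte;
    }
    /* More partitions can shrink 'slowest' but inflate 'network' --
     * exactly the tradeoff described in the talk. */
    return slowest + network;
}
```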

Disks Are Like Snowflakes: No Two Are Alike

Historically, disks were fairly precisely spec'd, and performed the same as others of their type notwithstanding partial failures. Now they're much more variable, because disks are built using "adaptive zoning". Disks have always used zoning: concentric rings with differing numbers of sectors per track. This means that ordinary disks show a bulk throughput rate that varies like a staircase with logical block number, as we move between zones. More modern drives vary much more unpredictably with logical offset because they've been adaptively zoned -- presumably fitting as many sectors as possible without exceeding some intolerable error rate. Apparently the variance stems mostly from the head -- how rapidly it responds to the changing magnetic properties of the underlying disk surface. Different platters have different heads, so you can see differing rotation rates / zoning density per platter, as well. Therefore we should expose this sort of knowledge to RAID drivers and the like, as disks are not as alike as you might suppose, either within a disk or between disks. (cs)

In the past, buying a batch of disks, you could measure one and assume that the rest would perform the same. Now, disks are unique by design. In 2002-era disks, the performance would be the same. Disk zoning techniques provide bandwidth proportional to the number of bits along the track, so store more data on the outer tracks. In 2006-era disks, there is much more variability, and this trend has increased in 2008-era disks. Adaptive zoning figures out the density capability adaptively based on how good the disk head is. This means there is no longer such a thing as a homogeneous set of disks, which has implications for striping, equal-work partitioning and other low-level disk techniques (about placement, etc.). - dgm36
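
A rough way to see the zoning staircase (and the newer unpredictability) for yourself: time large reads at a spread of logical offsets on a raw disk. This sketch is mine, not from the talk; run it against a whole-disk device with care.

```c
/* Read a chunk at a series of logical offsets across a raw disk and
 * report MB/s at each. O_DIRECT avoids the page cache; run against
 * e.g. /dev/sdb as root. Illustration only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (64 * 1024 * 1024UL)   /* 64 MB per sample */
#define SAMPLES 20

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <device-size-bytes>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }
    unsigned long long size = strtoull(argv[2], NULL, 10);

    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) return 1;

    for (int i = 0; i < SAMPLES; i++) {
        off_t off = (off_t)((size - CHUNK) * (double)i / (SAMPLES - 1));
        off &= ~4095ULL;                      /* keep O_DIRECT alignment */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t got = pread(fd, buf, CHUNK, off);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (got > 0)
            printf("offset %3.0f%%: %6.1f MB/s\n",
                   100.0 * off / size, got / secs / 1e6);
    }
    close(fd);
    return 0;
}
```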

The Case for Power-Agile Computing

Mobile devices run a variety of applications with differing power needs. We want a mobile device that can behave like anything from a mote to a laptop, desktop or supercomputer as required. The idea is to build an ensemble of different storage and processing elements and power up the right subset. There are challenges in measuring efficiency, predicting different ensembles' performance, selecting the appropriate ensemble, and transitioning between ensembles. - dgm36

Enter the genie. Actually. Man dressed as a genie. Pretty good sketch comedy for two ensues, the point of which is that it'd be neat if your mobile phone could scale between the characteristics of tiny mote-like devices all the way up to a desktop. Proposed solution: slam a bunch of diverse components into the same box, and power up the appropriate one at the right time. Like sellotaping my good ol' fashioned Nokia 6310 (for most of the time) to an iPhone (for the odd occasion I really need it). (cs)

Mobile Apps: It's Time to Move Up to CondOS

Phones have a bunch of sensors. Can figure out the user's current activity, location, "are they interruptible" (?). Can't do it with an app, because mobile apps are typically not allowed to run all the time. So let's do it in the OS. Or, just permit a background app, since that's functionally the same. Supply this sort of stuff to subscribers. Could use this sort of info to preload apps when we think they'll need them. Or adjust ringer volume based on background noise. Or customise the way the WLAN card operates depending on whether you're moving (done at NSDI). Apparently you could selectively apply password protection if you think they're away from home (!) Or, predict time to recharge and do power conservation based on that. (cs)

Devices are very programmable these days, and they have a bunch of different sensors on board. The question is how new apps might use new sensors. Sensors can provide contextual information that applications might use. Since there are some risks to this, we can't let applications have unfettered access, and cloud offload might be expensive; CondOS will do it in the OS. Sensors export "context data units" rather than raw sensor data, and these can be composed into a graph (like e.g. the Click modular router). If context is OS-managed, system services can use it too (e.g. memory management to make applications more responsive, security, energy management, etc.). This abstraction can also aid sensor privacy, by attaching policy decisions to how sensor data is exposed. - dgm36

Free Lunch: Exploiting Renewable Energy for Computing

Part of a grander scheme from Andy Hopper about Computing for the Future of the Planet. Computing uses a lot of energy, and some companies have responded by installing renewable energy technology. Wind and solar energy at a given location is intermittent, so the idea is to migrate workloads and data to the energy rather than moving the energy to the data. Data centers must be connected with high speed links. Challenges include predicting VM migration time and storage synchronization (solved in MASCOTS papers), and scheduling and placement. Migration over a 10 Gbps link incurs about 0.5s of downtime for memory and disk state. With 615 migrations a year, the downtime is 415 seconds, which fits within a 99.95% SLA. The energy consumption of migrating one VM is 57.5 kilojoules, assuming 7.5GB of RAM and 20GB of modified disk. - dgm36

Renewable energy == good. Renewable availability varies with time (sun, wind). So migrate your VMs to where the power is. Challenges: predict migration time, virtualise net+storage well enough that you can do this without causing service outages. They estimate that downtime will be okay if you've got a 10Gbps link between the data centres. (cs)

Debug Determinism: The Sweet Spot for Replay-Based Debugging

Logging for replay debugging is expensive, especially if it's a data race. Existing systems use "relaxed determinism": produce a replay that's not quite the same but good enough. The best so far is "ESD", which guarantees only that the same failure will occur. Can be too relaxed, though -- they might produce the same output by back-inferring reasonable inputs, rather than unreasonable inputs plus a data race. It'll only really work if the output observed can *only* occur due to a data race. We need higher fidelity than that. They propose a system which has high debugging fidelity (indicating the right root cause for a problem) and low overhead. They propose to do this by recording with high fidelity the root cause and the failure, but we don't know what's going to be a root cause in advance. Need a heuristic for where the root cause is likely to lurk. (cs)

Non-Deterministic Parallelism Considered Useful

Distributed execution engines benefit from nondeterministic parallelism! They have some useful properties, like taking care of synchronisation and scheduling. Most distributed execution engines provide guaranteed termination; Ciel sacrificed this in the name of flexibility and expressiveness. Should we sacrifice our support for fault tolerance along similar lines? Yes! Because fault tolerance as we know it (Jim) requires deterministic tasks, but the real world tends to perturb your programs and introduce nondeterminism (e.g. network delays, competing processes)... Requiring us to be deterministic in the face of such things requires us to sacrifice neat things like wait-for-multiple-handles and async interprocess signals... But what would we do when failures occur? We could fail the whole job, or at the other end we could record all nondeterministic events for replay; there are also a few intermediate points on this spectrum. Non-determinism is useful for other things too: handling interactivity, adapting the job to cluster conditions... (cs)

Finding Concurrency Errors in Sequential Code—OS-level, In-vivo Model Checking of Process Races

Process race: when multiple processes race for shared system resources like the filesystem. Neat example: ps aux | grep XYZ | wc might return 0 or 1 depending on whether grep is already running when ps runs, and, if ps is started first, who wins out of the shell spawning grep and ps getting through /proc's directory listing. They did a survey of bug reports and found process races to be at least as common as thread races, and often to have serious consequences. Challenge: detect process races. Syscall interactions are complex... Their solution: do in-kernel record/replay of process/syscall events, and do offline analysis of the log, looking for racey interactions. They try flipping the order of apparently racey interactions and test the program (failure detection?). They have a prototype that's found some real races. (cs)

Privacy Revelations for Web and Mobile Apps

What OS support is needed to track personal information exposure? The research community has wasted time on coming up with clever (e.g. side-channel) attacks. The real problem is things that people in fact authorize without knowing it. The OS should give us transparency about what information about you is being sent over the network when we use mobile and web apps. The paper talks more about the idea of "privacy revelations" which are OS-supported, high-level descriptions of application exposure. They need to be relevant to the user, and actionable by the user, which current tracing frameworks don't give you. - dgm36

It's the TaintDroid people! Question: what OS support do you need to be able to control information release? Assertion: the OS community has done privacy control badly. It's been identifying clever attacks rather than helping to plug the simple vulnerabilities, or helping users understand their privacy. The OS should make people aware when something they value is becoming known; then it's the user's job to decide whether they're happy or not. (cs)

Do You Know Where Your Data Are? Secure Data Capsules for Deployable Data Protection

We need to share data in ways which we might not always be aware of, e.g. medical records. How would we manage propagation of sensitive health data without tying up your doctors? Need to support existing apps, as there's a huge glut of them out there in the medical community, and perform well enough / be reliable enough that we never leave doctors without the information they need. Solution: wrap your app in a taint tracker to detect when stuff escapes, and wrap network-transmitted data with unforgeable provenance information. This is essentially a way to implement the goals of the NHS National Programme for IT (a disaster, incidentally), with the OS providing enforcement that I suspect the NPfIT does by trusting its applications. They're working on formally proving the intermediate components -- the network communication wrapper and the application sandboxing kit. (cs)

Critical data (e.g. healthcare records) often needs to be shared between different stakeholders, often out of your custody. Today the choice is between allowing anyone to do anything with your records undetected, or withholding them and (in the extreme) dying for lack of information. Data Use Controls are the owner-set policy. We need to support current OSs and applications without limiting choice: getting this to work with legacy is the killer app. Take the OS out of the TCB with Overshadow. Take the application out of the TCB with system call interposition and taint tracking. Need to store provenance and policy with the data. The DUC engine is a module in the OS that releases access to data based on policy. Formally validating and verifying the TCB is one of the other things being looked at. - dgm36

Making Programs Forget: Enforcing Lifetime for Sensitive Data

Lingering data is a key obstacle to privacy: it might be swapped out to disk, or lying around in a cache. Closing the application doesn't work; neither does rebooting. Can we rely on the application? Probably not. The aim of this talk is to come up with a system that can guarantee that all sensitive data, and data derived from it, will disappear after a given time limit. No application support is expected. Observe that there are equivalent states of the program that don't depend on sensitive data. The approach is "state reincarnation": equivalent states are derived by deterministic replay with perturbed input. Picking the replacements for sensitive data is tricky -- it can't just be arbitrary garbage -- as is knowing that you've scrubbed everything completely. You could use random garbage to replace a sensitive email, but there might be some constraints on what we can use. Recording was implemented at user level and gave a 1.2x slowdown on bash. - dgm36

Data you might want erased often gets copied all over the place. They want to allow the user to put a constraint on when their data will be destroyed, without requiring application support. They have a prototype: it tries to replace "sensitive state" with equivalent non-sensitive state. They use deterministic replay gubbins to replay your session without the sensitive input (presumably network traffic in most cases). Challenges: how do you track all the sensitive stuff? How do you keep the costs of using replay down? Problems: the "equivalent state" is often more complicated than just removing the sensitive packets, e.g. it might affect later sequence numbers. (cs)

More Intervention Now!

We want to predict what will happen if we intervene in a system to change some of its hardware. Observation only shows correlation; for causation we need intervention. We need some math, some theory and some engineering. We want to see if we can discover some causality from observational data when we don't have a full functional model. A Bayesian network is the formalism we need, and the do-calculus gives us a set of rewriting rules we can use. A Dryad cluster with a few hundred machines gives us enough observational data for statistical significance. - dgm36

make world

Programs are rubbish, and you should improve them by using partial evaluation: meaning programs like Eclipse, Firefox and OpenOffice, which repeat a lot of work that they've done before (either in the same session or in a previous session). The example is a spelling checker, which takes the global and a local dictionary, converts them into a machine-readable form and then uses them for checking. It wastes time by doing the conversion every time I use it. We could amortize this by converting only once each session, but it might be hard to identify what a "session" is. Or we could rebuild the dictionary using a process like `make` each time. What if we could do this automatically? Partial evaluation does this for us: we can treat the dictionary as a quoted constant, then do constant propagation through the rest of the program. Whole-program optimization provides the mechanism for doing this. A prototype exists, and it can completely evaluate away a program that does word counting on a known file: loop-peeling optimizations evaluate the loop against the constant input and completely unwind it. There are challenges in adaptive optimization (with respect to the program's environment), handling IPC between multiple processes (reads from pipes or sockets), and making it efficient. - dgm36
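
A tiny C illustration of the idea (mine, not the talk's): if the input is a compile-time constant, loop peeling plus constant propagation can evaluate the whole computation away.

```c
/* Before specialisation: counts words in an arbitrary string. */
static int word_count(const char *s)
{
    int words = 0, in_word = 0;
    for (; *s; s++) {
        if (*s == ' ' || *s == '\n' || *s == '\t')
            in_word = 0;
        else if (!in_word) { in_word = 1; words++; }
    }
    return words;
}

/* If the "file" is a quoted constant... */
static const char DOC[] = "the quick brown fox\n";

int word_count_of_doc(void) { return word_count(DOC); }

/* ...then peeling and fully unwinding the loop over DOC lets a
 * whole-program optimiser reduce word_count_of_doc() to:  return 4;
 * The same idea, applied to the spell checker, would bake the parsed
 * dictionary in as a constant instead of rebuilding it on every run. */
```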

9 May 2011

HotOS 2011 LiveBlog Day 1

Posted by Chris Smowton

Mind the Gap: Reconnecting Architecture and OS Research

Arch people don't think about the OS. OS people overemphasise running on commodity hardware. OS people should ask for features they need; arch people don't have much in the way of useful benchmarks to indicate whether their hardware is useful for OS development. They have simulators, but they're not geared for such huge whole-system sims. Benchmarks like TPC-W are ideal, but they're hard to get working because you'll need drivers etc that work with your new hardware. He sets some goals for arch designers: support multiplexing, don't couple features unless you must, don't introduce arbitrary limits, avoid mandatory abstractions e.g. HyperThreading (heh, Exokernel for hardware). List of wants: lightweight inter-core messaging, fast SYSENTER, software-controlled cache, better performance counters. (cs)

Operating System Implications of Fast, Cheap, Non-Volatile Memory

Bulk NVRAM is on the way. It's disruptive, because it's usable as main memory and also as secondary storage. Could integrate the stuff by replacing the disk (SSD+++), or by integrating it into a big shared address space. They're actually looking at a more radical approach: replace all memory with NV. This introduces a bunch of interesting stuff, e.g. processes are likely to be long-term persistent. It makes it harder to achieve a "reboot", and much easier to hibernate. Lack of a clean boot also has security problems. Another question: how do you do virtual memory in such a system? Is your "disk" space in the VAS? Is "disk" mediated by the MMU? A merger between SHM and shared FS seems likely. They suggest the "reboot" thing should be a rollback to a checkpoint, of which there might be many. On the one hand there's an increased chance of finding a "bug" in the rollback algo; on the other hand, current-day "reboots" don't always reset everything we need -- e.g. some file is corrupted in such a way that it trashes the process next time it starts. (cs)

Virtually Cool Ternary Content Addressable Memory

TCAMs are useful for all manner of search operations because they provide an O(1) matching operation. Used regularly in routers; could use them for implementing part of a SQL query for example. Sucks up power, though. Goal: make something that provides fast search without the power devoural. Idea: provide a virtual match-o-matic implemented using some real TCAMs and some backing store implemented as ordinary memory ("location-addressable memory", by contrast with CAM). Can implement an appropriate data structure in the LAM, e.g. an ordinary hash table. Then to find something you'd hash to find the right "page" of memory, load it into CAM, then do the O(1) last step. Or, could use another CAM to do the find-next-level-page operation. So you'd do a CAM operation to find the pages you're after, then page each into second-level CAM and do the search for each. I think. This is pretty complicated! (cs)
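
My best guess at the lookup path, sketched in C with invented names; cam_load_page() and cam_search() stand in for hypothetical hardware/driver primitives:

```c
/* Sketch of a "virtual TCAM": keep the full table in ordinary
 * location-addressable memory (LAM), hashed into CAM-sized pages; to
 * search, hash the key to pick a page, load that page into the small
 * physical TCAM, then do the O(1) match there. All names invented. */
#include <stdint.h>
#include <stdbool.h>

#define CAM_ENTRIES 256     /* size of the real, power-hungry TCAM */
#define NUM_PAGES   4096    /* pages of backing LAM */

struct cam_entry { uint64_t value, mask; int result; };

/* Hypothetical hardware/driver primitives. */
void cam_load_page(const struct cam_entry *page, int n);
bool cam_search(uint64_t key, int *result);

static struct cam_entry lam[NUM_PAGES][CAM_ENTRIES];
static int lam_used[NUM_PAGES];

static unsigned page_hash(uint64_t key)
{
    return (unsigned)((key * 0x9E3779B97F4A7C15ULL) >> 52) % NUM_PAGES;
}

bool vtcam_lookup(uint64_t key, int *result)
{
    unsigned pg = page_hash(key);
    /* Page the relevant slice of the table into the real TCAM... */
    cam_load_page(lam[pg], lam_used[pg]);
    /* ...then let the TCAM do its constant-time ternary match. */
    return cam_search(key, result);
}
```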

The Best of Both Worlds with On-Demand Virtualisation

Virtualisation isn't ubiquitous. They argue this is because it imposes overhead at all times, and can cripple your h/w e.g. graphics, or anything else not using an easy-to-forward device interface. So: let's virtualise on demand -- pay the costs when you use it. Use cases: in the data centre, virtualise when utilisation is low. In desktop environment, use it when you need to kernel-debug, or do migration, or... Main challenge: how to migrate to/from VMM. Propose to use hibernate/suspend since they record the complete machine state in memory. Device switching to be implemented by running a stub "virtual disk" --> real disk driver most of the time, which is capable of transparent hotplugging. Device migration is made easy by the hibernate operation. Obviously some devices will mysteriously go missing during the hibernate, e.g. the GPU; this is probably acceptable as the OS needs to anticipate this during ordinary hibernation. (cs)

Resource consolidation, checkpointing and debugging are advantages, but virtualisation is not being used everywhere (e.g. Google/Facebook/your laptop). This is because the features are only used occasionally, but the overhead of virtualisation is paid all the time. There is a lack of functionality (e.g. GPUs, WLAN, etc., which instead rely on the host OS) and of performance for I/O and virtual memory operations. On-demand virtualisation gives native execution normally, with overhead paid only when the features are being used. This has applications in datacenter and desktop environments. (Also security -- introspection!) Idea: pause the OS, transfer its state into a VM and resume its execution. Several challenges: how do we transfer it into a VM? How does it continue to run unmodified? Hibernate/resume provides exactly the facility we need by snapshotting the state to disk. But we need to resume with the same hardware profile, and the VMM will not provide everything. Device hotplug supports this: detach devices before hibernate, and scan for virtual devices after resume. Another problem is maintaining application connections to devices, so we need to transfer device bindings across different models and types of device -- the VMM tends to lag behind physical hardware in its capabilities. Solve this with another level of indirection: use logical devices to retain state and hide changes (e.g. the bonding driver for network, device mapper for block devices). Costs arise from setting aside a VMM partition on disk (though there is no cost of preallocation in CPU or memory) and from hardware access through logical devices (for devices whose state must be preserved). A prototype is implemented for Linux 2.6.35 on KVM (with the TuxOnIce hibernation patch). It supports a one-way conversion from physical to virtual machine; an open SCP/SSH connection survives the transition, which takes about 90 seconds. Future work: hibernate to RAM, exec, and devirtualisation. - dgm36

Repair from a Chair: Computer Repair as an Untrusted Cloud Service

Computer repair today resembles TV repair: you bring your computer to a retail service, which is inconvenient and insecure. Remote desktops aren't ideal, because you either have to monitor the repairer or you end up with the same problem. Most repairs are software-only, virtualisation is common, and outsourcing is big. The customer runs their OS on a hypervisor, ships an image to the repairer, and the repairer ships back a repaired image. We want the customer to be able to keep using the machine while it is in the virtual repair shop. They studied Geek Squads in Austin, TX to find out what sort of repairs they had to do, and went incognito to Apple Genius bars with broken laptops to ask what they had to deal with. Apparently, Geniuses don't have to deal with malware. UT IT services had 15,500 trouble tickets, of which 76% were software-related and 24% were hardware-related. A module called the repair helper lives between the OS and hypervisor, but they don't talk about implementation here. For integrity/availability, use action history and dependency tracking, then merge the customer's and repairer's changes; the idea is equivalent to a git rebase. They also have a story for rollback, to allow the customer to get rid of undesired fixes and replay any of their subsequent actions. The hope is that repairs could be encoded in a canonical representation (e.g. a pruned action graph), which enables auditing/signing of clean repairs and so encourages trust. Some insight about machine learning comes from an OSDI 2010 paper (KarDo). - dgm36

Repair guys get very invasive access to your PC. It'd be good not to trust them, while they retain the ability to mess with your machine to the necessary degree. A market survey of what Best Buy's Geek Squad and Apple's (ptooie) Genius Bar minions deal with found that about 75% of problems are pure-software, so it should generally be possible to use a virtual computer repair service. They propose to use existing dependency-tracking work to merge changes made by the repairer and by the customer during the repair time. They generally want the repairer's changes to happen first, so that future customer errors are prevented. Relatedly, they want to support rollback by essentially rebasing the customer's post-repair changes on top of the timeline up to the repair. The repairer could help by expressing his changes in a way that's easy to merge with concurrent modifications, e.g. raising the level of abstraction from FS operations to the invocation of a standard script. The hope is that many repairs are standard and so can be verified easily. (cs)

Structuring the Unstructured Middle with Chunk Computing

We can distributed-program using GPUs, multicore, cloud... all require difficult problem-fragmentation by the programmer. Idea: we should use something more structured than ASM to describe the task to the processor. They used linked Chunks, which are slabs of data with explicitly declared connections to other chunks. They can be used as hints to localise related computation. Could represent the stack of a thread in execution using linked chunks; a chunk per frame. A powerful hint would be two threads sharing a chunk -- they'll likely need coherence, so we should run them on the same core / related core / same machine / same rack etc. Chunks are fixed-size in order to make chunk-count a good predictor of migration-cost. (cs)

GPGPU, massively multi-core, and cluster and cloud systems are new computing substrates that promise an "unlimited" supply of processing elements. However, there is a management cost, which is not well understood, and various different strategies exist for dealing with them. Both programs and the machine have structure, but these are orthogonal. Assembly code is an unstructured intermediate that is not well suited to very specialized machines. The chunk is a proposed structural abstraction: a fixed-size block of memory that abstracts program structure and is individually mapped onto machine structure. Chunks have a fixed number of slots, and slots are typed and can contain a scalar or a link to another chunk. Explicit links expose the structure of the chunk graph, which the developer is forced to provide by the restricted size of a chunk. The idea is that you can migrate computation from one set of processing elements to another; links between chunks can cross architectural boundaries. You need garbage collection for chunks, and some way to figure out which chunks should be migrated. Threads start out being represented by a thread chunk with a link to a (possibly linked-list) stack of chunks. Stack chunks may point to function chunks, or to object chunks. The links give us both a distance and a size metric, so nearby chunks should be collocated; use a distance-k neighborhood of the thread object to identify the important chunks to colocate. If these neighborhoods overlap, we can infer things like necessary synchronization, contention, etc. Distant threads can run without interference and need scheduling to avoid false sharing/contention. - dgm36
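
A rough C rendering of the chunk abstraction as described (field names and sizes invented):

```c
/* A chunk: a fixed-size block of typed slots, each holding either a
 * scalar or an explicit link to another chunk. */
#include <stdint.h>

#define SLOTS_PER_CHUNK 16

enum slot_type { SLOT_EMPTY, SLOT_SCALAR, SLOT_LINK };

struct chunk;

struct slot {
    enum slot_type type;
    union {
        uint64_t      scalar;
        struct chunk *link;   /* explicit edge in the chunk graph */
    } u;
};

/* Fixed size, so "number of chunks" is a meaningful cost/distance metric
 * for deciding what to colocate or migrate. */
struct chunk {
    struct slot slots[SLOTS_PER_CHUNK];
};

/* A thread might be: thread chunk -> stack chunks (one per frame)
 * -> function/object chunks. Two threads whose distance-k neighbourhoods
 * overlap probably need to be scheduled near each other. */
```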

Macho: Programming with Man Pages

Programming is tedious, because you have to give a very precise and detailed description of your solution in your code. We want to automate programming: but existing tools (code snippet search etc.) don't have enough context to help you automatically. Macho goes from natural language text to candidate programs and hence to solutions. The example given is a high-level specification of `ls`. Programmer's labels (variable names) can act as clues to match against the input. The type system is also helpful here. Many candidate programs are kept in case you need to debug, and a previous version is more suitable as a starting point for fixing the bug. - dgm36

Programming is hard! Let's fully automate it! They've made an insane thing that tries to take a natural-language description to real code. It works by stitching together provided code snippets that do useful things, and using an iterative unit-testing-style loop to figure out whether the result works. They have an example that can generate "ls" given a simple description, including omission of hidden files, error messages for not-a-directory... They can extract specifics like that last one from concrete examples of correct execution. (cs)

Pursue Robust Indefinite Scalability

The guarantees that supposedly apply to computation (e.g. determinism, reasonable input...) are lies! Our original mistake was to delegate reliability to hardware, whilst software was to be efficient. Problem: these two goals are at odds! We should be as inefficient as possible given our requirements! Then he introduces a peculiar model of physically inspired computation. I cannot possibly explain this. (cs)

Computation must be born again. Determinism and correctness are childish mathematical fantasies; maybe the answer can be wrong. The Faustian pact was to have reliable hardware and efficient software, and we have so internalized the commandment that software shall be efficient. However, efficiency and robustness are mortal enemies. Indefinite scalability is the idea that you have an indefinite spatial tiling of machines that could continue to increase performance. For example, cellular automata achieve some of this, but we need them to be programmable. Here we have a demonstration of an object-oriented cellular automaton: each tile is 64 bits associated with an event (a bit of compiled code) that atomically modifies all of its neighbors in 2-D space. - dgm36

Benchmarking File System Benchmarking: It *IS* Rocket Science

Looking at tools that are used for file system benchmarking. Compile-based benchmarks are CPU-bound so it's unlikely that they will stress the file systems. Ad hoc benchmarks are the most common. The fundamental problem is that we don't know what a file system benchmark is or should be. File systems are multi-dimensional and should be evaluated as such, trying to define and isolate dimensions, then evaluating them separately. No single number is sufficient to characterize file system behavior, but we do want to combine some of the statistics into a usefully comparable quantity. A case study shows how random-read throughput varies over time. Latency follows a bimodal distribution, so reporting an average is meaningless: we need something more like 3-D histograms. - dgm36

A study of medical research showed that new findings are *highly* likely to be dodgy! Let's try to avoid that in CS... They surveyed recent papers and found that a large majority of them used ad-hoc benchmarks. Those who used standard benchmarks often used ones with known problems, or ones which were CPU-bound! Benchmarks are ill-characterised -- what do they stress about the FS, and why? Examples: what is its raw throughput? How efficient is allocation/deallocation? Does it use caches well? etc. They experiment with ext2/3 and XFS running FileBench, and find that different run lengths benefit different FSes. Which should we report? He recommends we report the whole thing, as some parts are I/O-bound, others CPU-bound, and others expose FS details. In another example, as they varied transfer size they exposed multiple levels of caching; a simple average, as often reported, would obscure these results. (cs)

Multicore OS Benchmarks: We Can Do Better

Multicore OS benchmarks don't measure isolation between multiple applications. We need a proper mixed-applications benchmark that stresses the OS and is sensitive to resource availability. They automatically produce mixes by doing sensitivity analysis on candidate apps, then mixing them to produce reasonable resource occupancy. Sensitivity analysis means measuring how total runtime varies as each resource is restricted. They want to keep resource usage high enough to expose the OS's performance, but low enough that performance isn't mainly constrained by external resources. (cs)

Current benchmarks don't evaluate performance isolation between independent apps. We need to extend existing benchmarks with the idea of a workload mix. A good mix must use a significant amount of system resources (to get the OS involved), not overcommit resources (to limit the overhead to that caused by the OS), and be sensitive to resource availability. The mixer takes candidates, does sensitivity analysis, and chooses an optimal mix based on sensitivity; it then runs the mixed workload and evaluates the results. The sensitivity analysis sees how changing each available resource (CPU, cache, memory, disk, network) affects an application-specific "goodness score". The mixer optimizes the mix using an integer linear programming solver, computing an optimal mix per OS, then compares the mixed-workload goodness scores with running the applications in isolation: if the difference is low, the OS is doing a good job. Currently they use microbenchmarks, but this needs to improve to deal with real applications; there are open issues around real applications, bursty applications and dynamic workloads. The aim is to extend existing benchmark suites, not replace them altogether. - dgm36

It's Time for Low Latency

Web applications in the future will need low latency: they will access more data, and more pieces of inter-dependent data. Facebook is a representative example, keeping huge datasets in DRAM with small requests and low locality. A simple RPC takes 300-500us in a current datacenter. Commodity NICs are optimized for bandwidth and not low latency. Furthermore there is kernel/syscall overhead. Server execution and propagation delay are not the limiting factor. Low latency is available for HPC in the form of Infiniband (100ns switching time, OS bypass, 1us NIC latencies, etc.). There has been some migration of ideas into commodity Ethernet, but we need to pull the rest of the ideas into the OS. The goal is 5--10us RTTs soon. Infiniband lets the application talk directly to the NIC, but the protocol is very complicated and baked into hardware. We'd like to use something that is compatible with the sockets interface. Sockets over Infiniband isn't particularly fast. One idea is to get the NIC on the memory bus, to get rid of PCIe and memory access latencies. - dgm36

Assertion: future web apps will need low-latency data access, as they'll use more bandwidth and more independent pieces of data. Example: Facebook keeps loads of data in DRAM at all times, and is limited to 100-150 dependent accesses to serve one page. RPCs take 300-500 microseconds because of store-and-forward switches, NICs optimised for bandwidth, the OS network stack, and lots of syscalls; server execution time accounts for only 1us of all that. Infiniband can do this better -- 100ns switches, for example. This stuff is beginning to migrate over to Ethernet: we have 500ns switches, and RDMA in NICs. Problem with Infiniband: it's too complicated, and RDMA is too low-level. It has neat stuff though, like TX/RX queues mapped into userland to dodge system calls. He asserts that the NIC will soon become the bottleneck, adding 10us at each end. Future designs will write directly to cache rather than having to do a main-memory DMA followed by a fetch into cache. (cs)

1 April 2011

NSDI 2011 Day 3 LiveBlog

Posted by Chris Smowton

Morning all once more. LiveBlogging traditional style today: check back here after each talk for our summary/remarks/ranting.

31 March 2011

NSDI 2011 Day 2 LiveBlog

Posted by Chris Smowton

Hey all,

Today's LiveBlog is even liver: Join us on IRC at chat.freenode.net, channel #nsdi11 (go here for web IRC access).