HotOS 2011 LiveBlog Day 1

Posted by Chris Smowton

Mind the Gap: Reconnecting Architecture and OS Research

Arch people don't think about the OS. OS people overemphasise running on commodity hardware. OS people should ask for features they need; arch people don't have much in the way of useful benchmarks to indicate whether their hardware is useful for OS development. They have simulators, but they're not geared for such huge whole-system sims. Benchmarks like TPC-W are ideal, but they're hard to get working because you'll need drivers etc that work with your new hardware. He sets some goals for arch designers: support multiplexing, don't couple features unless you must, don't introduce arbitrary limits, avoid mandatory abstractions e.g. HyperThreading (heh, Exokernel for hardware). List of wants: lightweight inter-core messaging, fast SYSENTER, software-controlled cache, better performance counters. (cs)

Operating System Implications of Fast, Cheap, Non-Volatile Memory

Bulk NVRAM is on the way. It's disruptive, because it's usable as main memory and also as secondary storage. Could integrate the stuff by replacing the disk (SSD+++), or by integrating it into a big shared address space. They're actually looking at a more radical approach: replace all memory with NV. Introduces a bunch of interesting stuff, e.g. processes are likely to be long-term persistent. Makes it harder to achieve a "reboot", much easier to hibernate. Lack of a clean boot also has security problems. Another question: how do you do virtual memory in such a system? Is your "disk" space in VAS? Is "disk" mediated by the MMU? Merger between SHM and shared FS seems likely. They suggest the "reboot" thing should be a rollback to a checkpoint of which their might be many. On the one hand there's an increased chance of finding a "bug" in the rollback algo; on the other hand current-day "reboot"s don't always reset everything we need -- e.g. some file is corrupted in such a way that it trashes the process next time it starts. (cs)

Virtually Cool Ternary Content Addressable Memory

TCAMs are useful for all manner of search operations because they provide an O(1) matching operation. Used regularly in routers; could use them for implementing part of a SQL query for example. Sucks up power, though. Goal: make something that provides fast search without the power devoural. Idea: provide a virtual match-o-matic implemented using some real TCAMs and some backing store implemented as ordinary memory ("location-addressable memory", by contrast with CAM). Can implement an appropriate data structure in the LAM, e.g. an ordinary hash table. Then to find something you'd hash to find the right "page" of memory, load it into CAM, then do the O(1) last step. Or, could use another CAM to do the find-next-level-page operation. So you'd do a CAM operation to find the pages you're after, then page each into second-level CAM and do the search for each. I think. This is pretty complicated! (cs)

The Best of Both Worlds with On-Demand Virtualisation

Virtualisation isn't ubiquitous. They argue this is because it imposes overhead at all times, and can cripple your h/w e.g. graphics, or anything else not using an easy-to-forward device interface. So: let's virtualise on demand -- pay the costs when you use it. Use cases: in the data centre, virtualise when utilisation is low. In desktop environment, use it when you need to kernel-debug, or do migration, or... Main challenge: how to migrate to/from VMM. Propose to use hibernate/suspend since they record the complete machine state in memory. Device switching to be implemented by running a stub "virtual disk" --> real disk driver most of the time, which is capable of transparent hotplugging. Device migration is made easy by the hibernate operation. Obviously some devices will mysteriously go missing during the hibernate, e.g. the GPU; this is probably acceptable as the OS needs to anticipate this during ordinary hibernation. (cs)

Resource consolidation, checkpointing and debugging are advantages, but it is not being used everywhere (e.g. Google/Facebook/your laptop). This is because the features are only being used occasionally, but the overhead of virtualisation is all the time. There is a lack of functionality (e.g. GPUs, WLAN, etc. instead relying on the host OS) and performance for I/O and virtual memory operations. On-demand virtualisation gives normally native execution, and overhead paid only when features are being used. This has applications in the datacenter and desktop environments. (Also security - introspection!) Idea: pause the OS, transfer its state into a VM and resume its execution. Several challenges: how do we transfer it into a VM? How does it continue to run unmodified? Hibernate/resume provides exactly the facility we need by snapshotting the state to disk. But we need to resume with the same hardware profile, and the VMM will not provide everything. Device hotplug supports this: detach devices before hibernate, and scan for virtual devices after resume. Another problem: maintaining application connections to various devices, so need to transfer device bindings across different models and types of devices -- the VMM tends to lag behind physical hardware in its capabilities. Solve this with another level of indirection: use logical devices to retain states and hide changes (e.g. bonding driver for network, device mapper for block devices). Costs arise from setting aside a VMM partition on disk (though no cost of preallocation in CPU or memory), hardware access through logical devices (for devices whose state must be preserved). Prototype implemented for Linux 2.6.35 on KVM (with TuxOnIce hibernation patch). Supports a one-way conversion for physical to virtual machine. SCP/SSH remains open. Takes about 90 seconds to do the transition. Future work: hibernate to RAM, exec and devirtualisation. - dgm36

Repair from a Chair: Computer Repair as an Untrusted Cloud Service

Computer repair today resembles TV repair: you bring your computer to a retail service which is inconvenient and insecure. Remote desktops aren't ideal because you either have to monitor the repairer or you end up in the same problem. Most repairs are software-only, virtualisation is common, and outsourcing is big. The customer runs their OS on a hypervisor, ships an image to the repairer and the repairer ships back a repaired image. Want customer to continue to be able to use the machine while it is in the virtual repair shop. Studied Geek Squads in Austin TX to find out what sort of repairs they had to do. Went incognito to Apple Genius bars with broken laptops and asked what they had to deal with. Apparently, Geniuses don't have to deal with malware. UT IT services had 15,500 trouble tickets, of which 76% were software-related and 24% were hardware-related. A module called the repair helper lives between the OS and hypervisor, but don't talk about implementation here. For integrity/availability, use action history and dependency tracking, then merge customer's and repairer's changes. Idea is equivalent to a git rebase. Also have a story for rollback to allow the customer to get rid of undesired fixes and allow them to replay any of their subsequent actions.
Hope that repairs could be encoded in a canonical representation (pruned action graph e.g.). Benefit is that it enables auditing/signing of clean repairs which encourages trust. Some insight about machine learning from an OSDI 2010 paper (KarDo). - dgm36

Repair guys get very invasive access to your PC. It'd be good not to trust them, whilst they retain the ability to mess with your machine to the necessary degree. Found using a market survey to find out what Best Buy's Geek Squad and Apple's (ptooie) Genius Bar minions dealt with about 75% pure-software problems, so it should generally possible to use a virtual computer repair service. Propose to use existing dependency tracking work to merge changes made by the repairer and by the customer during the repair time. They generally want to have the repairer's changes happen first so that future customer errors are prevented. Relatedly they want to support rollback by essentially rebasing the customer's changes since the repair on top of the timeline up to the repair. The repairer could help by expressing his changes in an easy way to merge with concurrent mods, e.g. raise the level of abstraction from FS operations to the invocation of a standard script. Hope that many repairs are standard and so can be verified easily. (cs)

Structuring the Unstructured Middle with Chunk Computing

We can distributed-program using GPUs, multicore, cloud... all require difficult problem-fragmentation by the programmer. Idea: we should use something more structured than ASM to describe the task to the processor. They used linked Chunks, which are slabs of data with explicitly declared connections to other chunks. They can be used as hints to localise related computation. Could represent the stack of a thread in execution using linked chunks; a chunk per frame. A powerful hint would be two threads sharing a chunk -- they'll likely need coherence, so we should run them on the same core / related core / same machine / same rack etc. Chunks are fixed-size in order to make chunk-count a good predictor of migration-cost. (cs)

GPGPU, massively multi-core and cluster and cloud systems are new computing substrates that promise you an "unlimited" of processing elements. However, there is a management cost, which is not well understood, and various different strategies for dealing with them. Both programs and the machine have structures, but these are orthogonal. Assembly code is an unstructured intermediate that is not well-suited to very specialized machines. The chunk is a proposed structural abstraction. A chunk is a fixed-size block in memory that abstracts program structure and are individually mapped onto machine structure. Chunks have a fixed number of slots (fixed-size), and slots are typed and can contain a scalar or a link to a chunk. Explicit links expose the structure in the chunk graph, which the developer is forced to do by the restricted size of a chunk. Idea is that you can migrate computation from one set of processing elements to another set. Links between chunks can cross architectural boundaries. Need garbage collection for chunks, and some way to figure out what chunks should be migrated. Threads start out being represented by a thread chunk with a link to a (possibly linked-list) stack of chunks. Stack chunks may point to function chunks, or to object chunks. The links give us both a distance and a size metric. Therefore, nearby chunks should be collocated. Use a distance-k neighborhood of the thread object to show the important chunks to colocate. If these neighborhoods overlap, we can infer things like necessary synchronization, contention, etc. Distant threads can run without interference and need scheduling to avoid false sharing/contention. - dgm36

Macho: Programming with Man Pages

Programming is tedious, because you have to give a very precise and detailed description of your solution in your code. We want to automate programming: but existing tools (code snippet search etc.) don't have enough context to help you automatically. Macho goes from natural language text to candidate programs and hence to solutions. The example given is a high-level specification of `ls`. Programmer's labels (variable names) can act as clues to match against the input. The type system is also helpful here. Many candidate programs are kept in case you need to debug, and a previous version is more suitable as a starting point for fixing the bug. - dgm36

Programming is hard! Let's fully automate it! They've made an insane thing that tries to take a natural language description to real code. It works by stitching together provided code snippets that do useful things, and using an iterative unit-testing style thing to figure out whether it works. They have an example that can generate "ls" given a simple description, including omission of hidden files, error messages for not-a-directory... They can extract the specifics like that last from specific examples of correct execution. (cs)

Pursue Robust Indefinite Scalability

The guarantees that supposedly apply to computation (e.g. determinism, reasonable input...) are lies! Our original mistake was to delegate reliability to hardware, whilst software was to be efficient. Problem: these two goals are at odds! We should be as inefficient as possible given our requirements! Then he introduces a peculiar model of physically inspired computation. I cannot possibly explain this. (cs)

Computation must be born again. Determinism and correctness are childish mathematical fantasies. Maybe the answer can be wrong. The Faustian pact was to have reliable hardware and efficient software, and we have so internalized the importance of shall be efficient. However, efficiency and robustness are mortal enemies. Indefinite scalability is the idea that you have an indefinite spatial tiling of machines that could continue to increase performance. For example, cellular automata achieve some of this, but we need them to be programmable. Here we have a demonstration of an object-oriented cellular automaton. Each tile is 64 bits associated with an event (a bit of compiled code) that atomically modifies all of its neighbors in 2-D space. - dgm36

Benchmarking File System Benchmarking: It *IS* Rocket Science

Looking at tools that are used for file system benchmarking. Compile-based benchmarks are CPU-bound so it's unlikely that they will stress the file systems. Ad hoc benchmarks are the most common. The fundamental problem is that we don't know what a file system benchmark is or should be. File systems are multi-dimensional and should be evaluated as such, trying to define and isolate dimensions, then evaluating them separately. No single number is sufficient to characterize file system behavior, but we do want to combine some of the statistics into a usefully comparable quantity. A case study shows how random-read throughput varies over time. Latency follows a bimodal distribution, so reporting an average is meaningless: we need something more like 3-D histograms. - dgm36

A study of medical research showed that new findings are *highly* likely to be dodgy! Let's try to avoid that in CS... They surveyed recent papers and found that a large majority of them used ad-hoc benchmarks. Those who used benchmarks often used ones with known problems, or which were CPU-bound! Benchmarks are ill-characterised -- what do they stress about the FS? Why? Examples: how is their raw throughput? What about efficiency of allocation/deallocation? Do they use caches well? etc. They experiment with ext2/3 and XFS running FileBench; find that different run-lengths benefit different FSes. Which should we report? Recommends we report the whole thing, as some parts are I/O bound, others CPU bound, and others expose FS details. In another example as they varied transfer size they exposed multiple levels of caching; a simple average, often reported, would obscure these results. (cs)

Multicore OS Benchmarks: We Can Do Better

Multicore OS benchmarks don't measure isolation between multiple applications. Need a proper mixed-applications benchmark that stresses the OS and is sensitive to resource availability. They automatically produce mixes by doing sensitivity analysis on candidate apps then mixing them to produce reasonable resource occupancy. Sens. analysis means measuring how total runtime varies with restriction of each resource. Want to keep their resource usage high enough to expose the OS performance but low enough that preformance isn't mainly constrained by external resources. (cs)

Current benchmarks don't evaluate performance isolation between independent apps. We need to extend existing benchmarks with an idea of a workload mix. A good mix must use a significant amount of system resources (to get the OS involved), not overcommit resources (to limit overhead to that caused by the OS), and be sensitive to resource availability. The mixer takes candidates, does sensitivity analysis, and chooses an optimal mix based on sensitivity. It the runs the mixed workload and evaluates the results. The sensitivity analysis sees how changing each available resource (CPU, cache, memo, disk, net) affects the application-specific "goodness score". The mixer optimizes the solution using an integer linear programming solver. Comparing the mix workload goodness scores and running the applications in isolation: if the difference is low, then the OS is doing a good job. Mixer computes an optimal mix per OS, then compares result with running it in isolation. Currently doing microbenchmarks, but need to improve this to deal with real applications. There are open issues due to real applications, bursty applications and dynamic workloads. The aim is to extend existing benchmark suites and not replace them altogether. - dgm36

It's Time for Low Latency

Web applications in the future will need low latency: they will access more data, and more pieces of inter-dependent data. Facebook is a representative example, keeping huge datasets in DRAM with small requests and low locality. A simple RPC takes 300-500us in a current datacenter. Commodity NICs are optimized for bandwidth and not low latency. Furthermore there is kernel/syscall overhead. Server execution and propagation delay are not the limiting factor. Low latency is available for HPC in the form of Infiniband (100ns switching time, OS bypass, 1us NIC latencies, etc.). There has been some migration of ideas into commodity Ethernet, but we need to pull the rest of the ideas into the OS. The goal is 5--10us RTTs soon. Infiniband lets the application talk directly to the NIC, but the protocol is very complicated and baked into hardware. We'd like to use something that is compatible with the sockets interface. Sockets over Infiniband isn't particularly fast. One idea is to get the NIC on the memory bus, to get rid of PCIe and memory access latencies. - dgm36

Assertion: future web apps will need low latency data access as they'll use more bandwidth and more independent pieces of data. Example: Facebook: keeps loads of data in DRAM at all times, and is limited to 100-150 dependent accesses to serve one page. RPCs take 300-500 microseconds because of the store-and-forward switch, NIC optmised for bandwidth, OS net stack, lots of syscalls. Server execution time accounts for only 1us of all that. Infiniband can do this better -- 100ns switch, for example. This stuff is beginning to migrate over to Ethernet: we have 500ns switches, and RDMA in NICs. Problem with Infiniband: too complicated, and RDMA is too low-level. Has neat stuff though, like TX/RX queues mapped into userland to dodge system calls. Asserts that the NIC will soon become the bottleneck, adding 10us at each end. Future designs will write directly to cache rather than having to do main memory DMA followed by a fetch to cache. (cs)

Filed under: Uncategorized 3 Comments