This year’s OSDI in Hollywood is entering its final day; as usual, we will be covering the sessions live on syslog.
Continue reading below the fold for talk-by-talk coverage.
Session 8: Debugging and Testing
SymDrive: Testing Drivers without Devices
Matthew J. Renzelmann, Asim Kadav, and Michael M. Swift, University of Wisconsin—Madison
Driver stability is critical. Many approaches to improving it have been proposed, but few widely deployed. The prevalent methodology is simply testing the driver and performing code reviews. Testing does, however, require access to the actual physical hardware, and many drivers cover dozens of devices. Kernel evoluation also occasionally necessitates changes to many drivers, all of which then must be tested again. Existing approaches all fall foul: formal specifications are too large an effort, though finding many bugs; static analysis scales well and requires little effort, but only catches some bugs; testing and code reviews are somewhere in between on both effort and effectiveness dimensions. Their system, SymDrive, aims to achieve high effectiveness at low effort. This is achieved using symbolic execution of driver code.
Three main goals: (1) find deep, meaningful bugs, including those spanning multiple entry points and involving pointer/object queues; (2) do this without significant developer effort; (3) enable broader patch testing and apply to many classes as driver. To achieve this, they require a model of the device behaviour, so that access to the real hardware is not required. SymDrive builds on S²E, a symbolic execution engine that provides a symbolic device, and kernel modules for symbolic buses etc. The challenge with symbolic execution, as always, is avoiding path explosion. They use “special, invalid x86 opcodes” generated statically that instruct the execution engine to favour certain paths over others. For example, success paths are prioritized. Loops are instrumented with opcodes informing the engine of the fact that it is entering an iterating loop, which can then be elided. They also have a special “high coverage mode”, which will try to explore all paths, and forks on control flow branches. All of this magic is inserted into the driver source by a static analysis tool called SymGen.
Another challenge is to define what the correct driver behaviour is. For this purpose, SymDrive provides “checkers”, which are essentially sophisticated assertions. Somehow, these are chosen automatically by something called the “test framework” (I did not fully understand how this works). Checkers, however, are limited to verifying properties at the kernel-driver interface, and cannot check if the device works as expected.
Evaluation: 39 bugs found across 26 Linux/FreeBSD drivers on five buses; all of these verified, some patched. Of the 39, 22 were found by checkers, and 17 by symbolic execution in the kernel. Code coverage is high, with the median >80%, and only 1 annotation on average (max: 7) was required. Median runtime for SymDrive is ~25 minutes, although massive outliers (>8h) exist.
Q (Philip Levis, Stanford U): How do you handle the path explosion problem for interrupts, which can happen at any point in the execution stream?
A: This approach will miss some bugs, since we do not model interrupts at arbitrary points, but only fixed time intervals.
Q (someone from EPFL): What if you have multiple loops, and want to go through some of them multiple times, rather than taking the shortest exit? How do you deal with this?
A: The loops that SymDrive is most concerned with are those that iterate repeatedly and generate new paths. Most of the loops we found that generate states could be broken out of early. Worst case example here is a checksum loop, which prevents the driver from executing correctly if it does not fully execute, but this will cause a warning asking for annotation to be printed.
Q: Why do you favour success paths? It seems that bugs are more likely to linger on poorly tested error handling paths…
A: True. For this, we provide high coverage mode, and the LED bug we showed actually used that. The Linux driver development model is one of continuous patching and patch testing, and this matches SymDrive’s behaviour well.
Q: False positives?
A: Could occur if you have incorrect checkers, or hardware-dependent bugs (i.e. depending on unexpected behaviour). We strive to minimize false positives at all costs, and we have not really found this to be an issue.
Be Conservative: Enhancing Failure Diagnosis with Proactive Logging
Ding Yuan, University of Illinois at Urbana-Champaign and University of California, San Diego; Soyeon Park, Peng Huang, Yang Liu, Michael M. Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage, University of California, San Diego
Errors in production settings are notoriously hard to diagnose, since they can only be reconstructed if execution environment and inputs are exactly replicated. However, customers are often reluctant to give this information to vendors. Logs, however, are considered less sensisitive and often more readily available. That said, a real-world problem today is that software is not producing enough diagnostic log output, and log messages are often only added reactively (in response to hard-to-debug error reports). Hypothesis: there are many missed opportunities for developer to add logging output. Indeed, these opportunities map to a small set of classes, and code can automatically be instrumented with appropriate log messages.
They gathered 250 bug reports from various open source software repositories. Found that only 43% of them have error log messages associated with them — i.e. more than half fail silently. However, 77% have “easy-to-log” opportunities. Example: Apache developers fail to log errors returned by open() syscall as it occurs in wrapper function. However, good practice ways of dealing with this exist (e.g. SVN’s SVN_ERR macro). There is something called the Fault-Error-Failure model, which formalizes how errors should be detected, handled and communicated. They claim that conservatively “over-logging” is the right action, and present manual analysis that shows that developers in many of the 250 cases considered missed such opportunities.
Errlog (their system) automatically detects exception patterns (e.g. syscall returns) in source code and inserts log messages where they do no exist. They can also learn “application specific” error conditions, and instrument those that do not have log messages. But there can be false positives: e.g. error return from stat() when used to verify that a file does not exist. They use a heuristic to deal with these: only log each 2^nth occurrence of the error to avoid logspam. Errlog can successfully add log messages covering 65% of the identified failures; the remaining 35% still fail silently. To do so, they add 60% more log messages. When compared to existing manual log messages, Errlog covers 83% of these cases. Of course, there is some overhead: a few noisy error messages during normal execution are introduced, but this is only 1.4% on average, and under 5% in the worst case (OpenSSH). Also did a user study with 20 programmers to evaluate Errlog. Gave them some bugs, partly reproducible, partly not. Found that Errlog reduces diagnosis time by 61% on average (though with large error bars). Got positive feedback from users.
Limitations: maybe not representative, since only looked at five projects, all written in C/C++. Still have 35% silent failures. Semantics of auto-generated log messages not as good as manually written ones.
Q: How many of the silent failures would disappear if you turned on the “verbose” option (which is usually turned off in production)?
A: Indeed. Logging overhead with verbosity on is 90%, but would help. Undesirable, though, since developers have to ask users to turn it on and this involves an extra round of user interaction.
Q: How far could you get without access to source code? (e.g. LD_PRELOAD tricks etc.)
A: Probably could cover some bits, such as syscall error conditions, but others require more sophisticated static analyses, so must have source code.
Q (Margo Seltzer, Harvard U): Shouldn’t we be teaching people to write besser log messages?
Q: Can you tool add too many log messages and overwhelm the developer?
A: This is not about debug messages, but about error messages, which are by definition rare, as they only occur if error conditions are present. Still, could happen — but could use frequency-based ranking or similar techniques.
Q: Any suggestions on better tools to help developers logging?
A: Ideally, would use something like Errlog as an IDE plugin, and automatically generate template error messages.
X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software (Best Student Paper)
Mona Attariyan, University of Michigan and Google; Michael Chow and Jason Flinn, University of Michigan
Performance problems are hard to debug, especially by end users. Tools for end users are different from tools for developers — particularly in that they do not want to (or are able to) look at source code. Often, configuration files have a great influence, but are complex to understand. What is missing from current tools is an explanation of “what” went wrong, and what the root cause of a problem is. Profilers are a brute-force approach to this, by attributing costs to everything. X-ray improves on this by automatically attributing costs to root causes, and presenting a list of these. To do so, it uses a combination of deterministic replay. Recording overhead at runtime is low, so could have this on all the time. When performance problem occurs, the user can take the recorded log and send it to an offline analysis machine, which will then return a prioritized lists of root causes.
They found existing deterministic replay systems to be unsuitable for their purpose, as the recoded and the replay execution different (why? surely the whole point of deterministic replay is for it not to do that?). They solved this by modifying the replay tool to be “instrumentation aware” (it did not become entirely clear what this means). X-ray can work at different granularities: entire execution, timeslice, or individual requests. For the last option, especially, a challenge is to identify what code relates to which request. They have various methods of doing so, some of which are based on taint-tracking for data. Once they have established a mapping from basic blocks to requests, they attribute costs to them, and finally map costs to root causes.
To find out why two requests differed in performance, they use differential performance analysis. X-ray then extracts the control flow graphs for each request, merges them and figures out where the execution paths differ. This is at conditionals, and they quantify the extra cost of taking a different path by comparing it to the cost of the shortest path to exit. Various bits of cleverness deal with comparing many requests; ultimately, the costs are attributed to root causes and a list of root causes is presented.
Note that X-ray is limited to identifying root causes which are configuration settings or inputs. Evaluation: looked at 4 applications (Apache, Postfix, PostgreSQL, lighttpd). For 17 selected performance bugs, X-ray ranked the correct root cause first or tied first-second in 16 our of 17 cases. These results are better than they actually expected themselves. Runtime for all of the applications is on the order of <10 minutes.
Q (someone from Harvard): Experienced any situations in which certain performance bugs mask other bugs?
A: They will all turn up in the performance summarization list, ordered by their impact. Differential analysis helps excluding issues that the user is not interested in.
Q: Is this work also applicable to multi-component and distributed systems?
A: Yes, can run on multiple machines, but X-ray itself does not communicate.
Session 9: Isolation
Pasture: Secure Offline Data Access Using Commodity Trusted Hardware
Ramakrishna Kotla and Tom Rodeheffer, Microsoft Research; Indrajit Roy, HP Labs; Patrick Stuedi, IBM Research; Benjamin Wester, Facebook
[I missed the first half of this talk due to having to check out of the hotel; this is something about using crypto and TPMs to ensure that data remains private and accessible even when offline. Goal appears to be to be able to give data to people, and be able to audit if they accessed it, even if they did so while off-line. Fairly heavy on crypto; they also support access revocation. Application appears to be something like secure movie rental, with logged access. The Pasture library can be linked into any application, and they show an example of integration into Outlook. Common operations are fast, uncommon ones take on the order of seconds, key generation is most expensive with ~5s.]
Dune: Safe User-level Access to Privileged CPU Features
Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis, Stanford University
Privileged CPU features can actually be quite useful to applications. For example, garbage collection, intra-process privilege separation and safe native code execution in browsers could all benefit from having access to privileged instructions. One way to expose them is to patch the kernel — but this does not scale, as the patches are application-specific. How about using an extensible OS, like Exokernel? Works, but need to rewrite entire OS. What about virtualization with custom kernels? Strict partitioning and isolation of VMs means that inter-application communication is inhibited. What Dune does is to provide safe user-level access to such privileged features, while maintaining the standard POSIX process semantics and the same kernel-process interface. This is achieved by using existing virtualization hardware, and existing kernel features to access it. It is used in a very different way to how it’s used in virtualization, though.
Let’s consider the example of GC. The application will have its own page table, implemented using the guest PT hardware support for virtualization, while the kernel continues to use the separate host PTs. They leverage VT-x, which gives them access to four types of privileged CPU features: privilege modes, virtual memory, exceptions and segmentation. Note that this enables user processes to receive e.g. exception notifications and traps in a very efficient way (delivered by hardware). The kernel runs in host mode (VMX root mode on Intel; need access to the VT-x instructions), while processes run in guest mode (which does give access to privileged instructions as it would for VMs). 2.5k LOC kernel module for Dune, which manages the virtualization hardware and provides a process abstraction and forwards syscalls/page faults etc. from guest mode to host mode kernel. Processes link libDune, which is 10k LOC, but untrusted code. It’s essentially a utility library that helps leveraging the privileged instructions.
Let’s look into the kernel module. First, as part of the process abstraction, it needs to provide memory management. Address translation works as guest virtual -> guest physical and then uses the EPT to ensure safety of the guest physical memory (restricting access; fairly standard virtualization stuff). Unlike virtualization, they configure the EPT to reflect the entire process address space. For syscalls, the Dune processes will trap back into themselves (using libDune’s syscall handler), and then invoke a VMCALL to the kernel system call code. This means that we can run untrusted code inside the process, and have it use syscalls to interact with the outside world (and leverage things like supervisor mode PT bits to protect parts of the process’s memory from the untrusted code). Singal handling injects interrupts into the process, effecting a switch to ring 0 for delivery.
There were many challenges when implementing this — reducing overheads, dealing with insufficient EPT space, reconciling POSIX process semantics and VM semantics, etc. — details in paper. Overhead caused by VMX transitions and EPT translations — for example, getpid syscall now takes 895 cycles instead of 138. Page faults are also twice as expensive. However, this is partly cancelled out by the opportunities for optimization in application code that can now use privileged features: ptrace from user mode is now ~27x faster than on normal Linux (due to fewer context switches). Exception/trap deliverly 587 cycles instead of ~2.8k, virtual memory manipulation ~7x faster. Macro-eval using three example applications: app sandbox, GC and privilege separation system (multiple protection domains within a single process). Sandbox overheads are low on SPEC CPU2000 (<10%), except for outliers due to TLB misses, which can be fixed by adding large page backing of large memory allocations. This actually transforms the slowdown into a speedup! For lighttpd, the slowdown is around 2% in connections/second; this is way higher than e.g. running inside VMware player, where two network stacks are used, whereas Dune just uses a single kernel network stack. 40% improvement in GC performance, 3x faster privilege separation (as compared to using multiple Linux processes).
Q: Thought about making IO devices that support virtualization and exposing them through Dune?
A: Yes, not done for this paper, but it could be done. Could safely expose network devices, storage controllers etc. to user programs, and even partition them safely between different Dune processes.
Performance Isolation and Fairness for Multi-Tenant Cloud Storage
David Shue and Michael J. Freedman, Princeton University; Anees Shaikh, IBM T.J. Watson Research Center
Predictable performance in cloud storage is hard. Co-located tenants (sharing the same K-V store in this thought experiment) lead to resource contention, and heterogeneity in tenant workloads can stress different parts of the system. Their system, Pisces, basically provides weighted fair sharing of a distributed K-V store. Related work is Amazon’s DynamoDB, which assumes per-tenant provisioned rates, uniform object popularity and a single uniform resource request format. Pisces makes none of these assumptions, and yet provides max-min-fairness (and is work-conserving). As their guarantees are system-wide, they need a mechanism to translate global weights into local (per-machine) weights.
Randomly placing data partitions (of the key space) on different machines, they can be placed according to fairness constraints and bin-packed onto machines (avoiding to overload any machine; how do they know this in advance? or is this reactive?). Allocating equal local is not ideal — instead, they make reciprocal swaps between nodes so that global weights still match, but local weights may differ. Replica routing is also integrated with the weighting system, aiming to saturate allocations. Since resources are multi-dimensional, they employ dominant resource fair queuing/sharing. [I missed some detail here]
Evaluation: does Pisces provide the fairness it’s designed to provide? Even system-wide fairness, weighted system-wide fairness, local dominant resource fairness? Experiment: 8 tenants, 8 clients, 8 storage nodes, membase. Unmodified fairness is pretty poor, in Pisces, almost everyone gets the fair share almost all the time. If the tenant request rates are unbalanced, fair queuing along does not reach fairness, bbut the combination of all of their features (+partition placement, weight allocation, replica selection) does. The overhead added by Pisces is small — <5% for 1kB requests, but ~19% for 10B requests (CPU-bound). For different tenant weights, fairness is mostly good, although low-weight tenants get somewhat poor fairness. Fixing this is WIP, workaround is to set their weights higher (?). Local dominant resource fairness is also provided; finally, Pisces adapts well to different demand shapes (constant, diurnal, bursty).
[missed two questions here]
Q (someone from Google): How do you deal with caches on the server; do you partition them?
A: Don’t explicitly deal with this kind of resource.