EuroSys 2011, day two

Session 4: Joules and Watts

Energy Management in Mobile Devices with the Cinder Operating System

A new mobile device OS, whose aim is to allow users to control their energy use, and to allow applications to become more energy-efficient. The first abstraction is throttling, which limits the draw that a particular application may have. However, energy use is bursty, so this uses a reserve buffer that allows an application to use more energy if it has been running below maximum for a while. A process with an empty reserve will not be scheduled. To prevent hoarding of energy, the reserve drains with multiplicative decrease (e.g. 10%/sec). Reserves may be nested, for example to isolate the energy usage of a plugin like Adobe Flash. Energy may also be ring-fenced in "virtual batteries" for uses such as emergency calls. The OS abstraction is a process launcher called "enwrap", which launches an application with an allocation of power consumption. Background applications draw power from a smaller virtual battery to prevent unexpected power draw from applications you can't see; this is managed via a custom window manager. Development issues arose from implementing on the HTC Dream, which uses a binary-blob shared object to interact with the secure ARM9 core and exposes the battery level only as an integer from 0 to 100; this led to concerns that future mobile phones will be more difficult to develop research OSs for, as there is a move towards greater use of secure cores and signed code. As a result of these frustrations, they moved to implement their abstractions in Linux, giving Cinder-Linux. One challenge was IPC: it was necessary to attribute energy use in daemons to the process making the IPC request. (This was easier in Cinder due to the use of gates, based on the same mechanism in HiStar.) One application developed was an energy-aware photo gallery, which modulated its download rate depending on energy properties.
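The reserve-and-decay scheme lends itself to a small sketch; the 10%/sec multiplicative decrease is from the talk, while the class name, the tap rate and the tick/draw interface are invented for illustration:

```python
class EnergyReserve:
    """Sketch of a Cinder-style reserve: a tap adds energy at a fixed
    rate, draws subtract from it, and a multiplicative decrease (10%/sec
    in the talk) prevents hoarding. Names and units are assumptions."""

    def __init__(self, tap_rate_mw, decay_per_sec=0.10):
        self.tap_rate_mw = tap_rate_mw   # energy flowing in from the tap
        self.decay = decay_per_sec       # fraction drained per second
        self.level_mj = 0.0              # buffered energy

    def tick(self, dt_sec):
        # The tap fills the reserve, then the multiplicative decrease drains it.
        self.level_mj += self.tap_rate_mw * dt_sec
        self.level_mj *= (1.0 - self.decay) ** dt_sec

    def draw(self, amount_mj):
        # A process with an empty reserve is not scheduled.
        if amount_mj > self.level_mj:
            return False
        self.level_mj -= amount_mj
        return True

reserve = EnergyReserve(tap_rate_mw=100)
reserve.tick(1.0)             # 100 mJ in, 10% decayed away: ~90 mJ buffered
assert reserve.draw(50)       # a burst above the tap rate is allowed
assert not reserve.draw(500)  # but the reserve bounds how big a burst can be
```

The decay term is what stops a long-idle application from saving up an arbitrarily large burst.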
The next step is working out how to use these primitives, in terms of UI design (presenting a breakdown of energy use to users), energy modeling (they currently use a simple energy model based on offline profiling, but could use something more sophisticated, such as the approach described in the following talk), userspace code instrumentation, and running Android (Dalvik) on Cinder.

Q: Is it not limiting to assume that background applications will not use much energy? The example was over-simplistic: it is possible to allocate more energy to background reserves, and applications can draw from multiple reserves (e.g. for a streaming music app). However, by default, you don't want random applications drawing energy in the background; it might be better to have a whitelist.

Q: How do you actually measure the energy usage and attribute it to processes, and what's the overhead of doing this? Cinder doesn't innovate in this regard, and there is a lot of research going on in this area. See for example the next talk. The profile is based on the measured power states of the device when doing various things. This all boils down to an XML file specifying the amount of energy used per packet (based on signal strength, for example), etc. The model is not perfectly accurate, but it will be easy to integrate other research on this topic.

Q: How far could you go in getting all of these abstractions into userspace? This is easier in Cinder, where, for example, even the networking stack is in userspace. You'd probably have to modify the socket interface to intercept power-critical information; this may also be easier in Android, which does much more in user-space. - dgm36

People use smartphones a lot. Energy is a constrained resource on them, but there is no OS-level support for energy management. They developed Cinder, an OS with the ability to track energy consumption on a per-task level. They forked something called HiStar and modified it, rather than forking Android. Their system also runs on x86_64 desktops. They throttle per-application energy consumption, e.g. rate-limiting the web browser. However, the browser is a bursty application and consumes large amounts (>700mW) of energy in short bursts. So they introduced a "buffer" (a token-bucket-like scheme for reserving energy). If a process has used up its energy allocation, the CPU scheduler will disallow it from running. However, they need to prevent applications from accumulating energy from their "tap" for a long time and then consuming a huge amount very quickly. So they have a "flow" of energy back out of the per-application buffer, which limits how much tapped energy can be accrued. They support fine-grained application policies, e.g. nesting, so they can put a browser plugin as a child of the browser, and the browser will allocate it a share of energy from its buffer. They've also implemented a tool called "ewrap" to wrap legacy UNIX applications in an energy-aware sandbox.
From a user perspective, their principle is that apps that cannot be seen should not cause battery drain. However, this assumption doesn't hold in the general case, because apps have background operations going on. For this purpose, they have a "background" energy buffer, for which apps compete. They ensure that the foreground window always has the largest (700mW) energy allocation, and allocate 10% of this to the background buffer.
They then spoke about the perils of developing on modern mobile hardware. On the G1, there are two isolated ARM cores, and the battery is controlled by the "secure" core. Only a binary blob is provided to drive the interface between the cores. He complained that the community is being locked out by being unable to get full access to devices.
Due to the frustration with development, they moved to a Linux platform, in addition to developing ground-up themselves. One problem there is that daemons may be doing work on behalf of applications, so it is tricky to bill the correct app for the energy use. In HiStar-based Cinder, this is not a problem as they support a primitive called "gates", which I think are special code entry points that will cause accounting to happen. On Linux, this was difficult, because IPC mechanisms are not built with a facility to identify the caller in mind. They have some ideas how this could be fixed (new IPC mechanisms, or custom communication mechanisms passing information using a user-level protocol), but haven't done any of this yet.
Future work: UI to make simple, supporting "training" and "enforcement" modes; more sophisticated energy models; better abstractions on Linux; Android on Cinder.
I asked them how exactly they conduct their energy measurements, and what the overhead was. It turns out they're not measuring anything live at all -- instead, they made an "energy model" from micro-measurements for things like energy used to transmit a network packet, and extrapolate from that. He implied that this was standard practice and pointed towards the next talk, which goes into detail about power modelling using system call tracing. -ms705

Fine-Grained Power Modeling for Smartphones Using System Call Tracing

They want to improve on the state of the art in smartphone power modelling. The first scheme they talked about I missed due to posting my previous summary to the blog. The second scheme is utilization-based modelling, which makes some bogus assumptions (only utilization causes energy to be consumed, energy scales linearly, power is additive). They debunk these assumptions one by one using experimental results. They've also found that many device drivers have intelligent power control rules, and that system calls appear to play a role in power consumption.
Hence, they decided to use system calls as power triggers: they can be traced back to the caller, and allow for non-utilization-based energy accounting. They use a finite state machine for their power model, based on an approach that is charmingly called "systematic brute-force": (1) model a single system call, (2) model multiple system calls for the same component, (3) model multiple components (the entire phone). So, for example, they profiled the read() syscall, then straightened and simplified the power utilization curve and extracted states (levels) from it, which can be converted directly into an FSM. They observe that a component can only have a small finite number of power states, and use this assumption to extend their approach to multiple components. However, different components may also interact with each other and influence power states. So they use brute-force to model all possible combinations.
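The read()-style FSM can be sketched as a transition table; the base/start/tail state sequence follows the talk, but the trigger names and the mW figures are invented placeholders:

```python
# States follow the base -> start -> tail -> base sequence profiled for
# read(); the triggers and the mW levels are invented for illustration.
TRANSITIONS = {
    ("base", "read_enter"): "start",   # initial high peak
    ("start", "read_exit"): "tail",    # lower-level consumption
    ("tail", "timeout"):    "base",    # component returns to idle
}
POWER_MW = {"base": 5, "start": 700, "tail": 120}

def estimate_energy(events):
    """Walk the FSM over (trigger, seconds-spent-in-resulting-state)
    pairs and integrate power to estimate energy."""
    state, energy_mj = "base", 0.0
    for trigger, duration in events:
        state = TRANSITIONS.get((state, trigger), state)
        energy_mj += POWER_MW[state] * duration
    return state, energy_mj

final, energy = estimate_energy(
    [("read_enter", 0.01), ("read_exit", 0.5), ("timeout", 1.0)])
assert final == "base"
assert abs(energy - 72.0) < 1e-6   # 700*0.01 + 120*0.5 + 5*1.0
```

The point of the FSM over a utilization model is visible here: the tail state keeps consuming power after the syscall has returned, which a purely utilization-based model would miss.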
This has been implemented on Windows Mobile 6.5 by extending CeLog and on Android using system taps and extended debug logging. They compared their FSM-based model to something based on linear regression in terms of the energy estimation error (how do they get the ground truth?!), and found that they see an error of less than 4%, while LR has about 10-20%. In the paper, they have some more stuff: e.g. a call-graph energy profiler that generates source-code heat maps for applications. -ms705

The key question is where energy is being spent: can we go down as far as function-level profiling? The approach is to develop an online power model, because power metering is too expensive and also holistic (and so hard to break down). Modelling has two phases: training and prediction. The state of the art uses linear regression and superposition to compute utilization based on observed triggers. However, this makes the mistaken assumption that only active utilization will lead to power consumption, which is not true on smartphones. The second mistaken assumption is that energy consumption scales linearly with the amount of work (e.g. 20x data sent = 20x energy consumed, which is not true). Even the assumption that power consumption composes linearly is mistaken, due to interactions between different components; this may be due to adaptive behavior in the OS (such as changing the transmit rate for different amounts of data). The key idea is to use system calls as triggers in power modeling, as these can be traced back to the function that invoked them. They build FSMs for system calls, e.g. read(fd, buf, size), which has a sequence of states: base, start (initial high peak), tail (lower-level consumption), base. Since a component can only have a small finite number of power states, it is possible to compose the FSMs. To model multiple components, a brute-force approach is used to get into different combinations of states. Implemented for WinMobile 6.5 (using CeLog), and Android. The estimation error is almost always much lower than using linear regression: <4% on Android for five applications, and <4% on WM6.5 for ten applications. Across the applications, the 80th-percentile error is less than 10% for all apps, which is much better than linear regression.
Q: Are you running applications in isolation for your experiments, or do you consider concurrent applications? We run one application at a time, but we monitor system-wide syscalls. We can still model the power, but it becomes harder to attribute the application responsible for the power consumption.

Q: How do you account for simultaneous use of the same hardware device? A good example of this is the Android GPS daemon. The answer would be to implement this above and below the daemon, but we haven't yet implemented this for the buffer cache.

Q: Is LR really the state of the art? For example, see "currentcy" (about ten years ago)? LR is the state of the art for smartphones, based on the last three years of research.

Q: How can we manage this in the future when we have even more devices? Is there a way to automate this, or should we hope for more information from the hardware? Hardware approaches don't provide sufficient granularity of accounting, so even with perfect hardware, we would need some modeling. Currently this is done manually, but we are looking at an automatic approach. However, even the manual approach is a one-time effort per handset. A MobiSys paper has started to look at automating this, but it makes some of the mistaken assumptions that we have discredited. - dgm36

Sierra: practical power-proportionality for data center storage

In data centers, server utilization is subject to diurnal variations, peaks and troughs. A zero-load server draws 60% of the power of a fully-loaded server. The aim is to have a power-proportional data center but, since hardware doesn't provide this, we'll have to do it in software. However, storage state is hard to migrate (CPU and network are much easier), so how do you turn servers off while keeping data available? The context is a replicated storage system where compute and storage are collocated (cf. Azure, GFS, Ursa Minor), there is a highly-available metadata service, and a client library for accessing these. Sierra supports variable levels of replication, where the "gear level" (available replicas) is less than or equal to the number of stored replicas. Sierra uses r-way replication, with careful placement to maximize the number of servers that can be powered down. For write availability and read/write consistency, a distributed virtual log is used. Naïve random placement gives worse power-down possibility than naïve grouped; naïve random gives better rebuild parallelism than naïve grouped. Sierra achieves the sweet spot between these (for large clusters). For the rack and switch layout, allocation can be rack-aligned (allowing whole racks to be turned off) or rotated (giving better thermal balance). The distributed virtual log enables offloading (when an object is in low gear, i.e. some secondaries are offline) and reclaim (when secondaries come back online). When replicas fail and an object is in low gear, there is brief unavailability while a server is woken up. Server power cycling is rotated among servers, so that there are no hotspots of power cycling (which might lead to higher failure rates). The whole system (chunk servers, metadata server, distributed virtual log and client library) was implemented in around 30k lines of C.
It was evaluated on 48-hour I/O request traces from eight Hotmail backend servers, and the testbed was provisioned to keep similar performance to the real trace. The performance during steady state and down-shift is similar, and slightly worse when there is an up-shift. - dgm36

Data centres use a lot of power, but workloads are not always maxing out the resources. In fact, servers still use 60% of their peak energy when idle. Also, workloads don't always compose neatly (correlated spikes may cause issues). They have developed Sierra, a storage system that allows them to dynamically put servers in a cluster to sleep in a no-load situation.
Gear level g: there are g replicas available, where g ≤ r. Since they can have any server asleep at any time, they need to carefully place their data in such a way that it remains available. In order to reconcile the storage of servers that have been asleep, they use a distributed virtual log and reconcile it overnight.
If they use naive random data placement with 2-way replication, they can only ever safely turn off a single server. In a naive grouped layout, they make sure that replica placements are correlated, so that N(r-g)/r servers can be turned off (as opposed to r-g before). Sierra combines the two approaches: it replicates diagonally across replicas, but not vertically across replica groups. They can still turn off N(r-g)/r servers, but also get good load-balancing, at the cost of the rebuild operation being more expensive.
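The power-down arithmetic behind these layouts is simple to check; a sketch with N servers, r replicas and gear level g, where the function merely evaluates the counts quoted above:

```python
def powered_down(n_servers, r, g, layout):
    """Servers that can sleep while g of r replicas stay available. With
    naive random placement almost any server may hold the last live copy
    of some chunk, so only r - g servers can safely sleep; grouped and
    Sierra layouts free N(r - g)/r servers."""
    if layout == "random":
        return r - g
    return n_servers * (r - g) // r   # holds for "grouped" and "sierra"

# 2-way replication at gear level 1, on a 100-server cluster:
assert powered_down(100, r=2, g=1, layout="random") == 1
assert powered_down(100, r=2, g=1, layout="grouped") == 50
assert powered_down(100, r=2, g=1, layout="sierra") == 50
```

The counts for grouped and Sierra coincide; the difference between them is in load balance and rebuild parallelism, not in how many servers may sleep.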
For write availability, they use the distributed virtual log (DVL). Any write to a single primary replica in offloading mode is replicated to three (virtual) loggers. In "reclaim mode", the loggers send their recorded writes to the replicas. In practice, they could either have dedicated loggers or co-locate virtual loggers with the replicas, and they do the latter for better multiplexing. For load prediction and gear scheduling, they use a simple history-based prediction model. I didn't really manage to follow their evaluation as he was going through it very quickly, but it looks as though the system works well and changes between gears as required, while not incurring a large access latency increase (although one variant of Sierra was quite a bit slower than the baseline). -ms705
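The offload/reclaim behaviour of the DVL can be sketched in a few lines; the data model (chunk-to-value maps) and the class interface are invented for illustration:

```python
class LowGearObject:
    """Sketch of offload/reclaim: while a secondary sleeps, its writes go
    to a log standing in for the DVL loggers; on wake-up the log is
    replayed to it. The chunk/value data model is invented."""

    def __init__(self, n_secondaries):
        self.primary = {}
        self.secondaries = [{} for _ in range(n_secondaries)]
        self.awake = [True] * n_secondaries
        self.log = []   # (secondary index, chunk, value) records

    def write(self, chunk, value):
        self.primary[chunk] = value
        for i, store in enumerate(self.secondaries):
            if self.awake[i]:
                store[chunk] = value
            else:
                self.log.append((i, chunk, value))   # offload mode

    def wake(self, i):
        # Reclaim mode: the loggers send recorded writes to the replica.
        self.awake[i] = True
        for j, chunk, value in self.log:
            if j == i:
                self.secondaries[i][chunk] = value
        self.log = [e for e in self.log if e[0] != i]

obj = LowGearObject(2)
obj.write("a", 1)
obj.awake[1] = False        # down-shift: secondary 1 goes to sleep
obj.write("b", 2)           # this write is offloaded to the log
obj.wake(1)                 # up-shift: replayed from the log
assert obj.secondaries[1] == {"a": 1, "b": 2}
```

In Sierra the loggers are themselves replicated (three-way, per the talk), which this single-log sketch elides.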

Session 5: Testing for the Real World

Parallel Symbolic Execution for Automated Real-World Software Testing

Current industrial software testing practices are suboptimal: they are expensive and not as effective as we would like. They have looked into improving the scalability and usability of symbolic execution, an automated testing approach that currently has limited adoption outside academia. The exponential explosion of symbolic execution trees can be ameliorated by parallel exploration on multiple machines. Cloud9 distributes symbolic execution in this way, and hence enables testing of programs whose code size previously prohibited symbolic execution.
A challenge in distributing this is that the growth of execution subtrees is not balanced, and a single subtree can still grow very large. They use KLEE as the local symbolic execution engine on the workers. In order to keep the work balanced, they have a load balancer that monitors reports from the workers and instructs them to negotiate redistribution of work in a P2P fashion if they become imbalanced. The division of the global execution tree is done using "fence" nodes on the boundaries, and "candidate" nodes at the leaves, where new nodes are added. If, say, worker 2 has a subtree and creates nodes in a part of the tree that is assigned to worker 1, it puts fence nodes there and notifies worker 1 to place candidate nodes in these positions. No part of the tree is explored redundantly, and they use an efficient binary encoding to keep the messages small.
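The compact binary encoding presumably works along these lines: one bit per branch on the root-to-node path. This is an illustrative sketch, not Cloud9's actual wire format:

```python
def encode_path(branches):
    """Pack a root-to-node path as one bit per branch (0 = one side of
    the fork, 1 = the other), prefixed with a 4-byte length so leading
    zero bits survive the round trip. Illustrative format only."""
    value = 0
    for taken in branches:
        value = (value << 1) | int(taken)
    n_bytes = (len(branches) + 7) // 8 or 1
    return len(branches).to_bytes(4, "big") + value.to_bytes(n_bytes, "big")

def decode_path(data):
    length = int.from_bytes(data[:4], "big")
    value = int.from_bytes(data[4:], "big")
    return [bool((value >> (length - 1 - i)) & 1) for i in range(length)]

path = [True, False, True, True, False]
assert decode_path(encode_path(path)) == path
assert len(encode_path(path)) == 5   # 4-byte length + 1 byte for 5 branches
```

A path hundreds of branches deep fits in a few dozen bytes this way, which is what keeps the candidate/fence update messages small.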
For usability, they provide a POSIX environment model. A common problem with real programs is that symbolic execution engines cannot directly execute code beyond the boundaries of the program itself, e.g. fork(). They play the familiar trick of plugging in a behavioural model in this case. In this way, they support models for pthreads, processes, pipes, signals and networking.
They also made some changes to the symbolic execution engine to support multithreading and scheduling. They added deterministic and symbolic schedulers, and a non-preemptive execution model, as well as address space isolation and copy-on-write. This functionality is available to the models mentioned previously, which also support fault injection.
They've used Cloud9 to test big pieces of software like Apache, memcached, coreutils, lighttpd, the Python interpreter and others. They found that code coverage increases significantly after adding workers (or rather, that the time taken to execute is reduced so far that they can raise the code coverage target). Cloud9 achieves linear scalability in the number of workers. -ms705

Industrial software testing lags behind the academic state of the art, because of three challenges: scalability, applicability and usability. Cloud9 is the first parallel symbolic execution engine to scale on commodity clusters, support full POSIX and provide an easy-to-use platform API. Conventional symbolic execution runs into a CPU bottleneck and memory explosion. Parallelizing tree exploration in a scalable manner is an open research problem: the structure of the execution tree is not known a priori, and may be very unbalanced. Each worker runs KLEE locally, and the tree is split up, with candidate nodes that may be explored and fence nodes (which are candidate nodes on a different worker) that stop exploration of other subtrees. Work stealing is supported, which requires workers to update their candidate and fence node sets in a peer-to-peer fashion. Path-based binary encoding makes the representation of paths very compact, which cuts down on network bandwidth. Many POSIX calls cannot be symbolically executed directly, so a model is used to support equivalent functionality. Cloud9 adds support (above KLEE) for threads, multiple processes, pipes/message-passing, signals and networking. For multithreading/scheduling, Cloud9 assumes deterministic and symbolic scheduling, without preemption. Address spaces are modeled by copy-on-write domains for memory sharing. Cloud9 was used to test Apache, Memcached, GNU coreutils, lighttpd, Python and others. Adding workers gives speedup, and increased code coverage in coreutils is possible. Instruction throughput scales linearly with the number of workers. The macro experiment took a client process and a server, and tested the "whole world" symbolically. - dgm36

Striking a New Balance Between Program Instrumentation and Debugging Time

This paper addresses the trade-off between debugging time and the instrumentation overhead of recording branches taken in a buggy program. Static and dynamic analyses are used to minimize instrumentation. Only branches that depend directly on input need to be instrumented, which cuts down the number of "symbolic branches" that must be symbolically executed. In experiments, only 10% of branches are input-dependent, and most branches are either always symbolic or always concrete. Static analysis marks variables as "symbolic", starting with argv, and propagating using data-flow and points-to analysis; however, the analyses are imprecise and tend to over-estimate, and they don't scale to libraries such as libc. Dynamic analysis is precise and works for libraries, but coverage is limited to the branches taken in a given run of the code. The approach is to use dynamic analysis for branches taken at run-time, and static analysis for the rest. This generates a log along with bug reports, which may be shipped to the vendor, who can use symbolic execution to identify the true path to the bug. To deal with non-determinism in system calls, e.g. select(), return any set of ready file descriptors. There are microbenchmarks in the paper, but the evaluation in the talk focuses on uServer (32 KLOC): dynamic+static instrumentation achieves 26% overhead (compared to 334% for static analysis, 19% for dynamic, and 345% for instrumenting all branches). In terms of replay time, all-branches and static take 170s, dynamic takes over one hour, and dynamic+static takes 532s. Logging syscall return values makes the replay time much shorter. Therefore, dynamic+static strikes a good balance between the approaches. - dgm36

This is about sending useful, yet not overly invasive, bug reports to vendors (basically with all private information removed). To do this, they send only the information required to enable the vendor to run symbolic execution to reproduce the bug. They define branches that depend on the input as "symbolic branches" (as they require symbolic execution). In experiments, they found that only 10% of branches are symbolic, so there's a big opportunity for optimization here. The aim of this work is to instrument only symbolic branches, and they've implemented static, dynamic and hybrid approaches. In the static approach, they mark variables as symbolic. Drawbacks of this are that the analysis is imprecise (it overestimates and instruments too many branches) and that it doesn't scale (e.g. to libc). The dynamic analysis approach executes the code at least once and marks branches. Drawback: limited coverage. The hybrid approach simply runs the dynamic approach first and then uses static analysis to complement it for full coverage and high precision.
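The hybrid selection boils down to set operations over branch identifiers; a minimal sketch (the branch IDs and the particular split between the analyses are invented for illustration):

```python
def branches_to_instrument(static_symbolic, dynamic_symbolic,
                           dynamic_covered, all_branches):
    """Hybrid selection sketch: trust the precise dynamic result for
    branches the run actually covered, and fall back to the conservative
    static result for the rest. All arguments are sets of branch IDs."""
    uncovered = all_branches - dynamic_covered
    return dynamic_symbolic | (static_symbolic & uncovered)

all_b   = {1, 2, 3, 4, 5, 6}
covered = {1, 2, 3, 4}      # branches the profiling run actually took
dyn_sym = {2}               # covered branches that depended on input
sta_sym = {2, 3, 5, 6}      # static over-approximation
# Branch 3 is dropped (dynamic analysis proved it concrete); 5 and 6 stay
# in, because only the static analysis ever saw them.
assert branches_to_instrument(sta_sym, dyn_sym, covered, all_b) == {2, 5, 6}
```

This is where the precision gain comes from: every branch the dynamic run demonstrated to be concrete is removed from the static over-approximation.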
Their work allows a "branch log" to be submitted to a vendor when a bug occurs (presumably via something a la Windows crash reporting), which the vendor can then run in a symbolic execution engine.
So what are the approximations here? If they over-approximate and treat too many branches as symbolic, the instrumentation overhead is increased. If they underestimate, the vendor has to explore all paths and backtrack when not hitting the bug, which is expensive.
They also need to deal with non-determinism (e.g. from system calls), but I didn't quite catch how they do that. Branch logs are represented in binary, with each bit denoting a branch (taken/not taken).
Evaluation shows that logging *all* branches results in a 345% overhead. Static analysis is almost as bad (334%), but dynamic does much better (19%) and hybrid is only a bit slower (26%). For replay, the performance is better when more information about branches is known: 170s for both the all-branches and static approaches, but many hours for the dynamic version (in one case; still large in the other); hybrid again strikes the best balance by being only slightly slower than static/all-branches.
Limitations of their work: no support for starting replay from check-points for long-running programs, no support for multi-threaded applications. - ms705

Finding complex concurrency bugs in large multi-threaded applications

Complex bugs fall into two classes: semantic bugs (return incorrect results) and latent bugs (silently corrupt state and erupt later). Detecting these bugs is hard. One possible way is to write a specification (full or partial, e.g. assertions), but again, this is hard for programmers. In this work, they use concurrency itself to detect concurrency bugs. Unlike other systems, they don't purely focus on detecting data races.
Basic hypothesis: A correct execution behaves in the same way as one of the sequential executions. How do we work out which one? Can check externally visible behaviour (highlights semantic bugs) or internal behaviour, i.e. application state (highlights latent bugs). They run multiple concurrent executions (using something called the PCT algorithm) and check for each one if it is linearizable. The main challenges they faced were checking application state and dealing with false positives.
For the former, simple bitwise comparison of course doesn't work due to pointers, so they need an abstraction. They use state summary functions (written by the programmer), specifying what state they expect to exist. Not entirely clear on how this would detect corruption, though.
False positives are caused by deliberate violations of linearizability, and they deal with them by "inspecting" every result found, inserting a filter into their tool if it was a false positive and otherwise generating a bug report.
They've evaluated their tool, Pike, against stable MySQL (360 KLOC). For the tests, they manually adapted MySQL's testing suite to provide concurrent tests (in future work, they plan to look at automated test generation). They created state summary functions for six important data structures in MySQL (~600 LOC; two man-months to understand and annotate the MySQL code). Initially, they saw 30% false positives, which turned out to be caused by application caches and MySQL's conservative conflict policy on those. After filtering those out, only 27 false positives remained. As for actual bugs, they found 12 tests that triggered bugs (out of 400 interleavings of ~1550 tests). - ms705

The aim is to take advantage of concurrency to identify concurrency bugs. This paper considers bugs that might not be caused by data races. The hypothesis is that a correct concurrent execution behaves the same way as one of the sequential executions; thus concurrency bugs can be found by checking for linearizability (though, of course, there may be many linearizations of a given concurrent program; for example, imagine serving two concurrent requests sequentially). For analyzing the state of the application, bitwise comparison is insufficient to capture equivalence (e.g. a set may have elements stored in a different order, but the sets are semantically equivalent), so the programmer must write simple "state summary functions". For experience, Pike (this tool) was applied to MySQL. For test generation, they used MySQL's sequential tests, and adapted them to make the test suite concurrent. State summary functions were created for caches, indices, etc., and amounted to around 600 lines of code and two man-months of work. Initially, one third of the tests led to false positives (due to caches in the application), so two filters were added to check for containment instead of equality. This left only 27 false positives, most of which were easy to rule out. The evaluation ran 400 interleavings of each of 1550 tests. 12 tests triggered concurrency bugs, with 8 instances of state corruption and 12 instances of inconsistent results. For example, making DROP and SHOW TABLE STATUS requests concurrently could lead to invalid fields in the result from SHOW TABLE STATUS. Concurrent SELECT and INSERT requests could see subsequent SELECTs returning stale results. - dgm36
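The linearizability check at the heart of this approach can be sketched by brute force over sequential orderings; the operation model (functions from state to a new state and an output) is invented for illustration and is not the actual tool:

```python
from itertools import permutations

def linearizable(initial_state, ops, final_state, outputs):
    """Accept a concurrent execution iff its final state and per-operation
    outputs match SOME sequential ordering of the same operations. Each op
    is a (name, fn) pair, where fn maps state -> (state, output)."""
    for order in permutations(ops):
        state, outs = initial_state, {}
        for name, fn in order:
            state, outs[name] = fn(state)
        if (state, outs) == (final_state, outputs):
            return True
    return False

incr = ("incr", lambda s: (s + 1, "ok"))
read = ("read", lambda s: (s, s))
# Final state 1 with read returning 0 matches the order read-then-incr:
assert linearizable(0, [incr, read], 1, {"incr": "ok", "read": 0})
# Final state 1 with read returning 2 matches no sequential order: a bug.
assert not linearizable(0, [incr, read], 1, {"incr": "ok", "read": 2})
```

Comparing final states directly is what the state summary functions replace in the real system, since raw memory comparison fails in the presence of pointers.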

Session 6: Hardcore OS

Operating System Support for Application-Specific Speculation

People are implementing their own custom versions of speculative execution for each new application; this paper describes a general systems approach to supporting these applications. Implementing speculation in the application takes a lot of developer effort (and its benefits end at the application boundary); doing it in the OS is overly conservative, because the OS doesn't have enough semantic understanding of what's going on in the application. The idea is to separate the problem into the mechanism of isolation, and the policy that describes customizations. The mechanism has a common implementation with wide scope, while the policy contributes semantic understanding. The customizable policy describes task creation, safe outputs and a commit policy (what is acceptable to commit). The mechanism checkpoints and logs processes, files, IPC and so on, and is exposed by a spec_fork() system call: a control process actually runs the application-level task, while a second, speculative process makes a prediction and carries on as if the prediction were true. The policy is implemented as get_prediction() (creation policy), equiv() (commit policy) and set_output_policy() (output policy). The first application is predictive launching in Bash, which uses machine learning to guess the next command that will be input: the commit policy strips spaces from the end of a line, and the output policy allows some safe X11 messages (those that don't actually modify the screen). Another application allows Firefox to speculate past SSL certificate verification, but this requires an output policy to stop private data being sent; this saves 60ms. A third application uses PBFT with client speculation to speculate using the first reply from the replicas. The performance approaches that of an application-specific version (within 8%). - dgm36

Fast prediction of task results provides additional opportunities for parallelization. This work is about making speculation a primitive, i.e. an OS service that applications can use. Consider four different designs:

  1. in-application speculation: just leave it to the application developer.
    + complete information
    + can predict arbitrary operations
    + safe operations allowed
    - no reuse at all, needs to be re-designed for every application
    - limited scope, unsafe operations block (e.g. external output, side-effects)
  2. generic OS speculation: transparently provided by OS.
    + applications need no modifications at all
    + wide scope: unsafe operations taint
    - OS lacks semantic understanding
    - can predict syscalls only
    - handles apps conservatively
    To get the best of both extremes, we want to separate policy and mechanism. This allows apps to expose semantic information to the system.
  3. expose predictions: apps tell the OS what actions they predict to happen
    + predict arbitrary ops
    + reuse OS mechanisms (with app assistance!)
    + wide scope for taint propagation
    - speculative external output never allowed (e.g. network comms)
    - commit on identical results
  4. expose safety:
    as in (3)
    + safe operations (e.g. external output) allowed by policy

Implementation is based on the Speculator kernel, which checkpoints and logs processes, files, IPC etc. Policies are expressed using the syscall API. spec_fork() breaks a process into a speculative process and a control process. The speculative process takes the prediction of task A's output and runs task B; the control process first runs A and then commits if the prediction came true. Otherwise, it aborts the speculative process and re-runs task B. In an example, they speculate on the value of a variable x by adding a few lines of code. In order to finally print x, however, they have to set a policy that allows output (as it's an unsafe side-effect that would be forbidden for applications running in speculative mode).
They advocate predictive shell command launching as a possible application (this literally means that bash runs commands ahead of time in case you decide to run them, based on a machine-learning-trained model). They evaluate the usefulness of this using anecdotal evidence from a couple of example applications and conclude that they can hide 86% of application execution time.
Second application is predictive establishment of SSL connections in Firefox. They speculatively parallelize certificate validation and connection establishment (session key exchange etc.). This requires a couple of output policies: need to allow connection establishment, and block access to private data (i.e. don't send out the request itself before the validity of the certificate has been established).
Third application example is speculation in BFT (speculate that consensus on the first reply will be achieved). I probably would have mentioned this one first as it is the most compelling example (however, slightly worrying in case the "work" scheduled depending on agreement is launching a nuclear rocket! Again, this would require appropriate output policies, I suppose).
They show that the overhead of doing this vs. application-specific speculation is low. - ms705

SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores

There are different types of multicore applications: VM-intensive (accessing virtual memory pages a lot), and file-intensive (mainly accessing buffered file blocks). When applications of different types are co-run, the last-level cache (shared between cores, usually L3) may experience thrashing as the buffer cache blocks evict the VM cache blocks. Their goal is to implement an OS-level buffer management technique to prevent this.
Conventional OS buffer cache usually maintains a list of file blocks in random colours (assuming the VM system does page colouring), and as misses happen, blocks are brought in (and some evicted, if necessary). They claim that this has a particularly bad effect on the LLC (although they aren't presenting experiments to substantiate that yet). If too many colours are given to the buffer cache, the VM pages are hurt (--> high miss rates), and if too few are given to it, the BC for file blocks is too small (--> high miss rates).
SRM stands for "selected region mapping": they assign BC colours on a temporal basis (instead of going, say, blue, orange, green, white, they go blue, blue, blue, white, ...). This avoids the drawbacks of fixed colour (= region) assignments, yet doesn't limit BC size. First challenge is to identify which BC blocks are going to be accessed in sequence/together. They describe some heuristics of how this can be done. Second challenge is they want to minimize LLC thrashing without sacrificing either VM or BC performance. They've invented something called a "colour zone", which holds lists of same-colour pages to be allocated. I didn't quite understand how this interacts with the rest of the system, but somehow they assign same-colour sequences to BC pages and sequential (different) colours to VM pages.
Evaluation on Linux 2.6.30 with a max. sequence length of 256 pages. They used a quad-core Xeon with two 3 MB shared L3s and a quad-core i7 with 4 MB shared L3. They ran two data-warehousing workloads on them, a VM-intensive one (SJ) and a file-intensive one (SS). Co-running them without SRM-buffer results in some slowdown (around 5%), but when running multiple SJ together, it is much higher, and SRM-buffer manages to reduce it by 33%. Another query-based benchmark shows similar results. For a combination of file-intensive workloads (e.g. grep) and VM-intensive scientific computing applications, they manage to reduce the slowdown over the baseline (applications running concurrently with themselves?) from 64% to 28%. They also show that there are no significant overheads from SRM-buffer. - ms705

Applications can be divided into virtual memory-intensive and file-intensive (buffer cache-intensive) applications. Running applications from both classes at once can cause cache-thrashing and will degrade performance. The problem is the lack of coordination between the last-level hardware cache and the software buffer cache. How can an OS-level technique prevent this thrashing? Page color bits (lower bits of the physical page number, corresponding to higher bits of the cache index) can be used to manage the cache more explicitly. Therefore, a small number of dedicated page colors can be given to the buffer cache. However, this gives a very small (128MB) buffer cache, which will lead to a high page miss ratio. The aim is to coordinate demand from VM and buffer cache to correctly allocate page colors. SRM-Buffer is the Selective Region Mapping buffer. A subordinate aim is to identify sequences, so that they can be assigned to the same color, since they will be accessed together. Sequences are determined using heuristics: same-file and same-application. The policy has been implemented in Linux 2.6.30, and evaluated on a 2x4-core Xeon and a 4-core i7 workstation. The experiments looked at PostgreSQL running a mix of queries, and gave a slowdown improvement of up to 33%. Other experiments considered TPC-H, scientific computing and file-intensive (e.g. PostMark) benchmarks. - dgm36

Is Co-scheduling Too Expensive for SMP VMs?

If we have SMP virtual machines, with dependent processes on different cores that share spinlocks, we need some sort of co-scheduling to avoid long synchronization delays on the locks. However, simply gang scheduling all the CPUs for a VM may lead to wasted CPU if the VMs are not fully loaded. The idea here is "balance scheduling", which never puts sibling vCPUs in the same runqueue, and doesn't force gang scheduling. It causes no CPU fragmentation and can improve the performance of SMP VMs to the same extent as co-scheduling. The evaluation is compared to the Completely Fair Scheduler (CFS, default in KVM), an affinity-based algorithm and a co-scheduling algorithm. The setup is 14 VMs with 4 vCPUs each on a machine with 4 physical CPUs. One observed benefit is that high spinlock latency can trigger TCP retransmissions, which decline under balance scheduling. - dgm36

This is about VMs with multiple virtual CPUs (i.e. the VM is SMP). They assume that if vCPUs run dependent tasks (e.g. accessing the same I/O resource), the vCPUs are dependent by transitivity and should not be co-scheduled, as one will have to wait for the other to release the resource.
Their experiments show that with multiple 4-vCPU VMs on a 4-core host, when only running 1 VM, there is only a small (5%) chance of vCPU stacking, but at 2 VMs, it already goes up to 43% (for a CPU spin benchmark).
Co-scheduling of a VMs vCPUs can lead to physical CPUs being idle while they wait for vCPU gangs to be scheduled. Consequently, VMMs have relaxed co-scheduling in the past (VMware) or moved to selective co-scheduling (Xen).
They propose "balance scheduling" instead: vCPUs have lists of allowed CPUs (those not currently running another vCPU of the same VM) and pick the one with the shortest run queue. This means that the vCPU will get to run as soon as possible, but not necessarily at the same time as the others.
They compare against various other scheduling algorithms and find that the run queue length is the same on average (possibly better in the maximum). Spin-lock latency on the TPC-W benchmark is much better than some of the competition. Another benchmark shows significant speedups on various applications compared to KVM's CFS (Completely Fair Scheduler), even for single-threaded applications in guest VMs, although more significant on multi-threaded apps. They also have a bunch of other benchmarks (bonnie++, ping+H264, different hypervisors). Synchronization-intensive performance on their hacked-up KVM is much better than default CFS and approximately as good as VMware ESX, and they do best on a synchronization-free workload. - ms705
