Session 7: Better Clouds
Kaleidoscope: Cloud Micro-Elasticity via VM State Coloring
The problem is that load on internet services fluctuates wildly throughout the day, but the bursts are very short (median around 20 minutes) and cloud providers are becoming "less elastic" (bigger VMs up for longer), and cannot support such short bursts because VMs are too heavyweight. The solution is based on VM cloning (SnowFlock), but the lazy propagation of state in SnowFlock leads to lots of blocking after the clone (for TPC-H). The solution is to do page coloring to work out the probable role of the page (code vs data, kernel vs user, etc.), and then tune the prefetching by color (such as read-ahead for cached files). Kaleidoscope also reduces the footprint of cloned VMs by allocating memory on-demand, and performing de-duplication. Most server apps tolerate cloning (only change is a new IP for the clones), and SPECweb, MySQL, httperf work fine. The experiments involved running Apache and TPC-H. Blocking decreases from 2 minutes to 30 seconds. TPC-H takes 80 seconds on a cold Xen VM, 20 seconds on a warm one, 130 seconds on a SnowFlock clone, and 30 seconds on a Kaleidoscope clone. Based on a simulation of an AT&T hosting service, Kaleidoscope achieved 98% fewer overheads using a 50% smaller data center. - dgm36
Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
Data-intensive computing thrives on data-locality: map phases take 59% of job life times and 60% of the resources in the Bing cluster, and map tasks use data-local scheduling. A problem arises when you have a free machine and a task to run, but the task's data is not on that machine. In the Bing Dryad cluster, 21% of tasks get evicted and re-run, wasting 1/5 of the cluster. Facebook's Hadoop cluster achieves node-locality for only 57% of map tasks. Since storage is becoming a large proportion of the cost for a data center, it is no longer possible to replicate as much; Microsoft is going from 3 to 2 replicas, and Facebook is considering going to 1 replica + Reed-Solomon coding. Skew in object popularity leads to contention for access. Scarlett controls replication automatically, while conforming to a storage budget. The replication factor is calculated by counting the number of concurrent accesses to each file in the cluster, and taking the minimum of that and a "base replication factor". The replication budget limits the overhead of the system, based on either a priority (weighted by file size) or round-robin scheme. Replication must be efficient, due to limited cross-rack bandwidth, so the amount of link capacity consumed is constrained, and the replication traffic is compressed. In the evaluation, jobs speed up by a median 20.2%. The locality with 1 replica per block plus a 13% Scarlett budget has equivalent performance to triple-replication. - dgm36
CloneCloud: Elastic Execution Between Mobile Device and Cloud
Mobile apps are becoming more elaborate, making extensive use of compute resources. However, mobile devices are underpowered. Powered devices are well-powered, and network connectivity is widely-available. The approach is to split apps dynamically, so that the system adapts as app demands change. The strategy is to partition apps automatically, with seamless local/remote execution, and optimized time/energy by adaptation. Virtualization is used to make this feasible. The partitioning framework does static analysis to identify valid points at which to split the code; a profiler constructs a cost model; then the best choice is taken to optimize the objective (time or energy). For distributed execution, thread migration is used. To reconcile reference IDs, which are not globally unique, a reference table is added to the migration process. Three test apps were built: image search, virus scanning and privacy-preserving user-profiling. As the number of images to be processed grows, the benefits of CloneCloud for image search become more apparent. - dgm36
Session 8: Off the Beaten Path
Symbolic Crosschecking of Floating-Point and SIMD Code
SIMD implementations of algorithms are usually manually ported from serial/scalar code. There are various subtleties that arise from the use of floating point. For example, min() and max() are not commutative in floating point: depending on where a NaN is, the result may differ; it's also not associative. They use symbolic execution to check if the SIMD version of the code is identical to the scalar version. Challenges: huge number of paths (the usual), and symbolic execution tools' lack of support for floating point and SIMD code.
To cut down on the number of paths to be explored, they do something called "static path merging", which aggregates both sides of conditional branches into a single basic block (note that the code must be side-effect-free).
Adding floating point support to constraint solver and symbolic execution engine amounted to adding a canonicalization step before path equivalence test, which presumably transforms FP values to a form in which they can be compared using simple expressions.
They evaluated this on OpenCV, verifying 58 SIMD/SSE implementations against scalar implementations, finding some to be identical up to a bounded image size, and some inconsistencies (and some false positives). They only managed to run up to an image size of 16x16, due to memory limitations. Currently working on extending KLEE-FP to support GPU languages. - ms705
Scheduling Large Jobs by Abstraction Refinement
Background in formal verification community. In general, formal verification is undecidable, but using abstractions, it can be made tractable by ignoring irrelevant information. In this work, they consider the "cloud scheduling problem". They assume a task graph where each task knows its worst-case runtime and edges are labeled with the data transferred. The cloud is represented as a graph too, with corresponding capacity annotations. Conventionally, dynamic scheduling is used, and this is the only option if the work done is not known in advance. However, if that is known, a schedule can be generated in advance, giving the user a deadline for job completion. Working out the optimal schedule is NP-hard though, and even heuristics are still quite expensive.
They overestimate the computation performed and underestimate the resources provided by the cloud, merging tasks into larger abstracted groups (in order to speed up the scheduling). They abstracted inputs are fed to the scheduler, and the resulting schedule is investigated. If it is "good enough", it is used, otherwise the abstraction is reduced and the scheduling phase re-run.
FISCH and BLIND are examples of abstraction-refinement schedulers. FISCH is the standard greed scheduler: if all dependencies of a task are satisfied and there is an idle node, the task is scheduled. However, nodes are abstract, so they need to carry information about the concrete node utilization, and in order to schedule M tasks of duration D, we need to find M gaps of length at least D in the schedule. They use inverted indices (gap size --> (node, time)) to make this efficient.
In the BLIND scheduler, they start with a single (!) abstract cloud node and X-similar topological job abstractions. As tasks are scheduled, the cloud abstraction is refined, but it is abstracted again when nodes are idle (i.e. it's always just as detailed as required).
Metrics used to evaluate this are "cloud utilization" (basically, cumulative relative idle time) and "makespan". Unsurprisingly, they evaluated this using simulation, considering different job types (MR, Wavefront, matrix multiply, FFT) and a simple cloud model for a two-tier cloud with 210 nodes (two types of nodes, 50/50). Comparison to Hadoop used static scheduling with backfilling (as they need some dynamic scheduling; basically, they start the next job early to fill gaps). 10-15% improvement in job completion time compared to Hadoop. They attribute this to Hadoop framework scheduling overhead. - ms705
Cycles, Cells and Platters: An empirical analysis of hardware failures on a million consumer PCs
-- BEST PAPER --
Key question: do hardware faults actually matter? (especially micro-faults such as DRAM bit flips). They found that failure rates for consumer-grade machines are non-trival (chance of 1 in 190 over 8 month interval), that DRAM faults are spatially co-located, that over- and underclocking have an effect on failure rate and finally that faults are recurrent (97% reccuring failures within 10 days of previous failure).
Data comes from Windows Error Reporting logs, and they filtered it down to hardware failures, covering a pool of 1M machines. They focus on three types of failures:
- CPU subsystem failure: this is detected by the CPU throwing a machine-check exception (MCE)
- disk subsystem failure: failure to satisfy a critical disk request (e.g. read from page file)
- 1-bit DRAM failure: based on the mini-dump sent with the error report; they diff against the known-good code at MS (only considered kernel code)
Note that they define Total Accumulated CPU Time (TACT) as the cumulative time a machine has been on for. For example, for a 5 day TACT, the likelihood of CPU subsystem failure is 1 to 330, and with 30 days TACT it goes up to 1 in 190. The conditional probability of a second failure given a first failure is 1 in 3.3 and 1 in 2.9 respectively! (Third even worse: 1 in 1.8 and 1.7). Also found that faults are not rare over time.
They found that DRAM bit flips have spatial locality, and the same bit is likely to flip again (so not caused by alpha particles or cosmic rays!).
Furthermore, they looked at dependencies between the three fault categories and found that CPU/DRAM and CPU/Disk are dependent, but DRAM/Disk are independent. Over-clocked CPUs (more than 5% above design speed) have higher failure rates (as would be expected), but over-clocking is also corelated with higher DRAM flip rates (not so expected). Conversely, under-clocked CPUs (even by as little as 1%) were much more reliable (80% over normal design speed machines).
Looking at system manufacturers, "branded" (defined as from one of the top 20 manufacturers by sales) machines are more reliable in CPU and DRAM category, but there is no difference in disk category. They also, surprisingly, found that laptops are more reliable than desktop machines.
For non-overclocked machines, they found that when related to TACT, faster CPUs are less reliable, but since they run more cycles, normalizing this to probability of failure per cycle makes sense, and after doing that, there is no difference. Using the BIOS release date as an approximation to machine age, they find that younger CPUs are more likely to fail than older CPUs, DRAM exhibits a bathtub curve of high failure rates in the extremes and, unsurprisingly, older disks are more likely to fail than newer ones. Comparing to previously published results, they find that they can confirm most previous results, and add some new ones. As a conclusion, they wonder if it is time for a hardware fault-tolerantOS. - ms705
This is an empirical study into the effect of hardware faults on OS reliability. Across consumer machines, the failure rates due to hardware (CPU/one-bit DRAM/disk) are non-trivial, recurrent failures are common, CPU speed has a large effect and DRAM faults have spatial locality. The data set is based on Windows Error Reporting logs, filtered for failures that can confidently be attributed to hardware. CPU failures are identified by machine-check exception (MCE). Disk failures are identified by a failure to read data within the critical kernel code. 1-bit DRAM failures are identified by a mini-dump captured around the instruction pointer, diffed against the canonical version of the code held by Microsoft. After a hardware failure, a machine is almost certain to crash with the same failure within 30 days. The locality of 1-bit DRAM faults (79% of machines that crashed, crashed again at the same physical address) makes it unlikely that this was a stray alpha-particle. Intriguingly, CPU/Disk and CPU/DRAM failures are interdependent, but Disk/DRAM failures are independent. Between two major vendors of CPUs, overclocking tends to increase the chance of CPU failure (and also 1-bit DRAM flips); underclocking CPUs lead to an 80% decrease in failure probability. "Brand name" machines (from the top 20 OEMs by sales) tend to be more reliable than "white box" machines (except for disks); laptops tend to be more reliable than desktops. Faster CPUs are more likely to fail, and all machines have the same probability of failure per cycle. Younger CPUs are more likely to fail than older CPUs; DRAM failures exhibit more of a bathtub curve; older disks are more likely to fail. The results may make the case for a hardware fault-tolerant OS. - dgm36