With some delay owed to a laptop battery running out and a busy travel schedule following the event, here are my summaries for (part of) the second day of HotCloud in Boston.
Again, I had to duck out of parts of some of the sessions in order to do last-minute work on an imminent paper submission, but I at least covered two complete sessions.
Some of these summaries are going to turn up in USENIX's ;login: magazine (albeit in a more polished form); I am very grateful to USENIX for supporting my attendance at HotCloud with a generous travel grant.
[I skipped part of these sessions to work on my paper submission, and attended some USENIX ATC talks, but did not take notes -- sorry!]
Session 7: Working with Big Data
Big Data Platforms as a Service: Challenges and Approach
James Horey, Edmon Begoli, Raghul Gunasekaran, Seung-Hwan Lim, and James Nutaro, Oak Ridge National Laboratory
This talk is about the "big data" work done at ORNL, motivate and describe it. ORNL has ~4k scientists, and some of the world's fastest supercomputers; also extensively collaborate with federal agencies. They're interested in "big data" and "cloud computing" -- fed govt is interested in processing large data, and "mandates" move to the cloud. Many agencies have "big data needs", but the actual nature of the data varies quite a bit. Data sets in the 100 GB -> PB range, mostly 10 TB-ish. Application requirements also vary, with everything from business analytics to visualization and webapps. Bottom line: no single "sponsor need", so they use cloud computing in order to be flexible and satisfy sponsor needs. They have found that they run into limitations with all of the IaaS, PaaS and SaaS paradigms. What their sponsors really want is "Big Data Analytics as a Service". And they want to use these distributed services using standard methods and commands they are familiar with. Key insight: treat VMs like processes, and services like applications. So want something like a cloud package manager. They implemented "cloud-get", and apt-get inspired concept of a repository for services like Hadoop/Cassandra etc. Example command: "cloud-get install cassandra --nodes=20 --storage=20TB". So what they want to achieve is standardization of deployment procedures, and provide common infrastructure that package writers can hook into. Also want to support elasticity, meaning that they can "re-configure" packages with different numbers of nodes or amounts of storage. Finally, moving data between systems is hard today, and requires ad-hoc scripts. What they would like is "data-get", standardizing the process of moving data, akin to a named pipe. Example: "data-get data_package --from=cass_inst1 --to=cass_inst2". Components of their system are service packages, data packages and brokers (user-facing). They do, however, rely on an external cloud manager (in their prototype, Xen + ClockStack). Service package authors must define VM types, dependencies between classes and event handlers. Note that this is a slightly higher level than IaaS; users should not need to know about VM or storage placement. They don't necessarily control scheduling and co-location, but nonetheless need to make QoS guarantees (it remains unclear how they do so). Summary: proposal of the package manager/repository concept for cloud computing, would like to get feedback from the community, and put system out for people to make packages.
Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica, University of California, Berkeley
Most jobs are small (82% of Hadoop jobs at FB have fewer than 10 tasks). Assert these are interactive, latency-constrained jobs, e.g. data analysis. Those small jobs are, of course, more sensitive to stragglers. Previous approaches to migitation: black-list bad machines, but this doesn't help with non-deterministic stragglers. Speculation can help, but it is hard to make the call for speculation, as we need to model complex systems. LATE and Mantri are existing approaches, but only really help with large jobs -- small jobs still have 6-8x difference in runtime between median and slowest task. In a nutshell, speculative execution techniques work in two steps: (1) wait and monitor progress, (2) launch speculative tasks. This doesn't work well for small jobs, since they don't run long enough to get good predictions for whether a task is a straggler. Also, with small jobs, all tasks run in parallel, so there is no amortization over time. In contrast, what they propose is to proactively launch entire job clones, and then pick results from the earliest clone to finish. This "probabilistically migitates stragglers". Now, question is, is this feasible in terms of the resources used? Turns out the answer is yes, since most clusters are under-utilized (~20% at FB, as provisioned for peak utilization). Lots of people propose energy-efficiency optimizations as a result of this, but they are not actually deployed and not popular with real cluster operators. Nonetheless, this could lead to a tragedy of the commons, where everyone is using extra resources on the assumption of low utilization. But: 90% of jobs only use 6% of resources, so we can easily support doubling that, given the low overall utilization. It does, however, turn out that we need well over 3 clones in order to migitate straggler effects, which leads to contention on input data when naively implementing redundancy at job-level granularity. Instead, if we do task-level cloning and pick the first task to succeed out of every equivalence class, far fewer clones are needed. They evaluate this against the baseline of LATE, using a trace-driven simulator, on a month-long Facebook trace, with a 5% cloning budget. Headline number: on small (<10 tasks) jobs, they get ~42% reduction in completion time, with task-level cloning massively outperforming job-level cloning. An unresolved issue with task-level cloning, though, is contention on the intermediate data, since e.g. the early-finished map outputs will be used by all reduce task clones. This is, as of yet, an open question.
Predicting Execution Bottlenecks in Map-Reduce Clusters
Edward Bortnikov, Yahoo! Labs, Haifa, Israel; Ari Frank, Affectivon Inc. Kiryat Tivon, Israel; Eshcar Hillel, Yahoo! Labs, Haifa, Israel; Sriram Rao, Yahoo! Labs, Santa Clara, US
Similar question addressed to previous talk. Stragglers are bottlenecks. People try to combat this using avoidance (reduce probability of occurence, e.g. data locality) and detection (identification of stragglers, then speculative execution). All previous approaches are heuristics, though -- in this work, they use machine learning to predict stragglers. Motivation: speculation is actually quite wasteful. 90% of speculative tasks are killed because the original ended up finishing first (e.g. only transient bottleneck). When looking at the 15% top straggler nodes over time, they find that there are some consistently pathological nodes. So why not take advantage of historic information to predict clusters? Jobs are often re-run -- 95% of mappers and reducers are part of jobs that run 50+ times over 5 months. They introduce a new metric: the task slowdown factor, which is the ratio between a task's running time and the median running time among the sibling tasks in the same job. Root causes of stragglers: data skew (although >4x is rare, apparently), and hotspots (hardware issues, contention). A sample over ~50k jobs show that 1% of jobs had mappers with more than 5x slowdown, and 5% among those jobs with more than 1000 mappers (60% due to skew, 40% due to hotspots). With reducers, 5% overall, but 50% (!) among large jobs (10% due to skew, 90% due to hotspots). Proposal: a slowdown predictor ("oracle"). Takes as input the node features and task features, and produces a slowdown estimate as its output. Large number of proposed features (see paper). For slowdown prediction of mappers, they find an R^2 value of 0.79, for reducers, R^2 = 0.401 (where 1.0 is optimal). So map stragglers are better predicted than reduce stragglers. Might be able to alleviate this using stage-specific predictors.Summary: data skew is important signal for straggler detection, but node hardware and network traffic are also very important. Note that this work focuses on the "average straggler", rather than the pathological outlier.
Session 10: Scheduling
Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources
Jeongseob Ahn, Changdae Kim, and Jaeung Han, KAIST; Young-ri Choi, UNIST; Jaehyuk Huh, KAIST
In a virtualized environment, VMs share resources. Contention can lead to performance degradation. Multi-core increases the degree of resource sharing. Especially cache contention is a problem. Conherence protocols mean that excessive misses on one core can cause evictions on others. Previous work tries to migitate this by balanching threads to homogeneize miss rate and minimize interference in shared caches. However, NUMA makes such cache-aware scheduling harder, since some memory requests will go to remote memory controllers (expensive!). But: cloud virtualization systems open up a new opportunity for load-balancing at the VM level using live-migration. The goal is to try and minimize global LLC miss count, and/or global remote page access count. They discuss some kind of interleaved memory allocation system, which achieves performance somewhere between the best and worst case. Take-away: there is significant benefit to be had from placing VMs in an architecture-aware manner. They propose a system with a "front-end node" that makes global scheduling decisions, using a cache-aware or memory-aware cloud scheduler. Backend nodes report their statistics to this node. There are "local" and "global" phases, the former moving VMs between soeckts inside a machine, the latter moving them globally across the cluster. In their experiments, they use a 4+1 node testbed, with 12 MB shared LLC between 4 cores in each machine. They run 8 VMs with 1 GB memory and 1 CPU on each machine. They find that the cache-aware and NUMA-aware schedulers can give significant improvements over the worst case (?!) when combining CPU-bound with memory-bound workloads. Less improvement if homogeneous workloads (e.g. all CPU-bound).
North by Northwest: Infrastructure Agnostic and Datastore Agnostic Live Migration of Private Cloud Platforms
Navraj Chohan, Anand Gupta, Chris Bunch, Sujay Sundaram, and Chandra Krintz, University of California, Santa Barbara
Private PaaS brings "cloud technology" (elasticity, distribution, fault tolerance, high availability) into the local environment, and thus enables programmer productivity, as they no longer have to worry about "scaffolding" work. Focus on their thing, AppScale (GAE clone-ish), which is infrastructure- and datastore-agnostic. This is achieved by abstraction layers, allowing high-level use of GQL and Google's datastore API. ZooKeeper for coordination and transactional semantics. But the PaaS systems are being upgraded all the time, both at the underlying hardware layers and at OS-level and datastore software layers. Goal: be able to sub in another datastore underneath the system, eliminating lock-in. Do this with minimal downtime and overhead, in a real-system, with backwards compatibility and transactional semantics, and without data loss. Consider an example: from Cassandra on OpenStack to HBase on Eucalyptus. Steps: 1) intialize new deployment, 2) synchronize ZooKeeper meda-data, 3) warm up a mem-cache, in order to minimaize load on DB (storage backend). Do this by doing a copy-on-write and copy-on-read for each key. 4) initiate a snapshot of the data store, transfer and load into new system, 5) data proxy more, 6) handover. Some results: a single-node deployment, moving from Cassandra to Hypertable. Find that the per-request overhead is less than 1%, because the memcache is very fast. ZooKeeper sync takes about 45 sec for 100,000 locks. Latency difference between normal operation and migration situation is small.
Automated Diagnosis Without Predictability Is a Recipe for Failure
Raja R. Sambasivan and Gregory R. Ganger, Carnegie Mellon University
Systems research is all about trade-offs. There are some commonly-agreed metrics: correctness, performance, reliability, power... but this talk argues that predictability has been ignored, but is very important. This is especially true for distributed systems, as it affects our ability to optimize other metrics. There is also no single answer -- doing so requires lots of hard work when building systems! To get predictability, low variance is essential. If we don't have predictability, we waste resources (because the slowest resource dominates), and also have a harder time providing SLA's. Finally, it is also important for the success of automated diagnosis. Many automated diagnostics tools are developed, but few are making it to production use. The problem is not the tools, it's the system, or rather, the layering of many systems on top of each other. Diagnosis tools focus on deviations in metrics to localize performance problems. Consider for example the sample distributions of two metrics. If they do not overlap much, they probably represent different underlying distributions. But if they do, because of high variance, we can't really tell. With a complex setup of interacting systems, a performance tool's life is incredibly hard. Even if the tool can analyze the services in isolation, it may not get conclusive data about the source of a performance anomaly because of high variance in some of the distributions. Take-away: the usefulness of a diagnosis tool is really limited unless the entire system has good predictability.Unfortunately, there exists no secret sauce here, just hard work. Developers must identify sources of high variance, and rigorously isolate them. They explain the "three I's" of variance: inadvertent variance is unintentional, but can be reduced; intrinsic variance is fundamental (e.g. disk performance), so we must isolate it; intentional variance is a result of a trade-off made by developer (e.g. low-latency scheduling), and we must isolate it. Still, many open questions: how much reduction in variance do we need to achieve predictability properties? if intrinsic variance is significant, maybe we need to change something about hardware? if intentional variance is significant, maybe we need to re-evaluate trade-offs?
[Again, I sat in this session, but worked on my paper, so no notes -- apologies.]