I am a visiting preacher at HotCloud today, delivering a sermon about the seven deadly sins of cloud computing research. Of course, syslog provides you with the usual live-blogging service, although some sessions may be missing, as I nip out to work on a soon-to-be-submitted paper!
Here goes the first day...
Session 1: Cloud risks
I presented in this session, so there is no blog coverage. You can read the papers, though :)
Session 2: Cloud hardware
Saving Cash by Using Less Cache
Timothy Zhu, Anshul Gandhi, and Mor Harchol-Balter, Carnegie Mellon University; Michael A. Kozuch, Intel
This is about using RAM to cache stuff in the cloud (mostly for serving workloads, with a DB as the backend). The contribution is that the cost impact of caching (extra VMs) can be reduced by scaling down the size of the cache in periods of low load. So this is really making use of elasticity of cloud resources, which is nice.
They look at three questions: (1) does reducing the cache size increase cache misses to the extent that the backend DB gets overwhelmed. Their solution is to shrink only so far that the workload at the DB remains the same (but since the overall request rate has gone down, a lower cache hit rate can be tolerated).
(2) are the savings significant? Answer is that it depends on the data popularity distribution. Uniform looks good, but only achieves a small reduction in cache size. In reality, popularity is often Zipfian, and in that case, small hit rate reduction translates into large savings in cache size.
They have a theoretical model that shows that at a 4x peak-to-low load ration, they 50% cache size reduction is possible. Experimental results validate this.
(3) how do you actually implement this, and how do you migrate "hot" data? Initially, they don't migrate anything, in fact, and fault it in from the DB. After a few minutes of slowdown, the response rate stabilizes at the previous level again. But if they migrate the hot data explicitly, this response time spike can be avoided. They evaluate a naive scheme, partitioning cache servers into a "primary" and "retiring" group, and migrate hot items into the primaries before killing the retiring group.
Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2
Zhonghong Ou, Hao Zhuang, Jukka K. Nurminen, and Antti YlÃ¤-JÃ¤Ã¤ski, Aalto University, Finland; Pan Hui, Deutsch Telekom Laboratories, Germany
Amazon seem to have heterogeneous hardware platforms in their data centres. They have some evidence for time-variance, suggesting rolling replacements. Also confirm that different availability zones have different proportions of certain processor models. The find the CPU information by looking at /proc/cpuid, and using the non-trapping "cpuid" instruction (so they can look through the hypervisor).Microbenchmarks (they did a total of 10) show consistently heterogeneous performance of the different architectures. On a more macro level, they find that Redis (in-memory K-V store) performance differs, and consistently so across different numbers of cliends accessing. Httperf (HTTP benchmark) is slightly different: there is no diffence up to a certain level of request rate, but then starts diverging. Overall take-away is that the difference in performance can be as high as 40-60% for some (CPU-bound!) workloads.They also did some kind of cost analysis, quantifying the cost savings one can get by dropping badly performing instances (of course, this is only worthwhile on long-running jobs, as the charging granularity on EC2 is per-hour).
RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store
Yiming Zhang, National University of Defense Technology; Chuanxiong Guo, Microsoft Research Asia; Rui Chu, National University of Defense Technology; Guohan Lu, Yongqiang Xiong, and Haitao Wu, Microsoft Research Asia
RAM caching is of increasing interest in the cloud world. RAMCloud has introduced the idea of fault-tolerant replicated in-memory data storage. But their approach is not ideal for data-centre networks, as recoveries cause traffic spikes. They are investigating the performance of a RAM-based storage system on a known data centre topology, with servers forming a directly connected tree (primary at root, recovery servers 2nd level, backups at 3rd level). This builds on top of an existing system called BCube, onto which they map their key space (keys naming data items in K-V pairs) using a multi-layered ring. Each primary server has a recovery ring, and each recovery server has a backup ring. There's a bunch of heart-beating and probing going on in order to discover failed servers. Their principle is that backups are always direct neighbours (i.e. have a 1-hop connection), so that traffic spikes are localized.Â Evaluation shows that their performance is similar to RAMCloud and BCube in the no-failure case, and the maximum recovery time is about 9 seconds, with ~456 MB/s for the first couple of seconds (compared to a 720 MB/s ideal on 10GbE). They can also deal with a (single) switch failure, at the cost of a small degradation in performance. This is a preliminary prototype and they continue evolving it. Thinking about including SSDs and low-latency Ethernet technology, and utilizing multiple cores.
Session 3: Networking
I missed the first three talks of this session. Sorry!
Opening Up Black Box Networks with CloudTalk
Costin Raiciu, Mihail Ionescu, and Dragos Niculescu, University Politehnica of Bucharest
GRIN: Utilizing the Empty Half of Full Bisection Networks
Alexandru Agache and Costin Raiciu, University Politehnica of Bucharest
EyeQ: Practical Network Performance Isolation for the Multi-tenant Cloud
Vimalkumar Jeyakumar, Mohammad Alizadeh, David MaziÃ¨res, and Balaji Prabhakar, Stanford University; Changhoon Kim, Windows Azure
(I missed most of this talk.)
Something about 10 GbE performance when sharing a link between bursty UDP flows and high-throughput TCP flows.
A Case for Performance-Centric Network Allocation
Gautam Kumar, Mosharaf Chowdhury, Sylvia Ratnasamy, and Ion Stoica, University of California, Berkeley
Users of data-parallel frameworks often do not have an idea of what the network resource requirements of their code are. Existing work does explicit accounting for network resources, or focus on providing fairness and/or isolation. The drawback of the former is that the precise requirements are often not known, and of the latter that it does not give users any performance guarantees.
What they propose instead are performance-centric allocations.Â Â Consider two types of communication patterns: shuffle and broadcast. Observation: as you have twice as many workers ("scaling the application") in shuffle, the total data moved across the network remains constant, but in the broadcast case, it increases by 2x! We will need to allocate more network resources to the application. Considering a shuffle-only cluster, per-flow allocations end up wasting resources (less data communicated than expected; didn't quite get why), but a proportional allocation fixes this (didn't quite catch this bit). In a broadcast-only cluster, proportional allocation is under-allocating (as parallelism is limited), but per-flow allocation fixes this.
Of course, real clusters are heterogeneous (mixed shuffle and broadcast workloads). When they scale up both a shuffle and a broadcast job, shuffle will get 1/3 of bandwidth (before: 1/2), broadcast 2/3 (before: 1/2). This means that, as contention increases, both jobs' completion times degrade uniformly. This is preliminary work, they are working on a mathematical formulation to predict runtimes and experimental evaluation (the previous is all theoretical so far).
Session 4: Programming models
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica, University of California, Berkeley
Lots of people want to do stream processing (e.g. user activity stats, spam detection, road traffic estimation, network intrusion detection). Want to do this at large scale, with O(1 sec) latency. Challenges: need to supply fault-tolerance and cost efficiency. Traditional stream processing systems fail at one or the other. They often use a "one-record-a-time" model with message passing between nodes. Hard to make this fault-tolerant: either need to replicate entire nodes (wasting resources). Alternative: buffer downstream messages, and re-route them on failure, but this increases latency. Neither of these deal with stragglers (important in public cloud!). Their approach is to discretize the stream: buffer time windows of data and process it using batch jobs ("mini-MR jobs"), then produce output. This can deal with failure using the familiar replay approaches. They implement this on top of Spark, and achieve up to 2 GB/s throughput at sub-second latency (workloads: grep/wordcount). Also can recover from failures within one second.Â A discretized stream (D-stream) is basically a sequence of RDDs. For durability, they checkpoint periodically. The API is a LINQ-like language with transformations and operators. They also have some other benefits: atomic processing of each record gives consistency, and the ability to combine batch computations with streaming computations.
Using R for Iterative and Incremental Processing
Shivaram Venkataraman, UC Berkeley; Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber, HP Labs
People want to use complex algorithms on "big data". Â Common algorithms (machine learning, graph algorithms) share the property that they can be expressed as iterative linear algebra operations. Matrix operations are nicely concise, and there are existing techniques to implement them. The have structure and are coarse grained -- however, they need global state (sometimes). Existing platforms to do this on "big data" are data-parallel (MR, Dryad), or graph-centric (Pregel, GraphLab).Â First challenge: dealing with the unbalancedness in sparse matrices. Second: dealing with incremental updates. Their system, Presto, extends R (an array-based language) to operating distributedly on big data. They introduce a new abstraction: a distributed array. Constructs like "foreach" can just run in parallel on all partitions of the array. Others, like "onchange" and "update", selectively compute on parts of the array only (how do they deal with changes that trigger a recomputation larger than the partition?). They can register event handlers by putting code inside an "onchange" block.Â For sparse matrices, they use "dynamic partitioning" (really just splitting long-running partitions into subpartitions). They learn the correct partitioning structure over multiple runs of the job. This gives them up to a 2x improvement "depending on the algorithm and data" (after how many re-runs?! Rigor!).Â For incremental updates, they use consistent snapshots based on version numbers (sounds similar to a bunch of the things that were in EuroSys).Overall, they achieve a 20x speedup over an in-memory data-parallel Hadoop implementation.
The Resource-as-a-Service (RaaS) Cloud
Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, and Dan Tsafrir, Technionâ€”Israel Institute of Technology
Trend in cloud computing: move from IaaS to RaaS; shorter duration of rental periods, more fine-grained resources offered. Claim: we should go to a micro-charging model that allows for low-latency dynamic addition and release of resources. Apparently there are companies that do this kind of fine-grained accounting, and also give SLA guarantees. Classic QoS argument: can differentiate pricing depending on guarantees and demands. But need to consider the invasiveness of fine-grained resource profiling. Also need a strategic agent on part of the cloud provider, trying to maximize profits. All kinds of economic implications (trade resources on "futures" market etc.).