Various people from Cambridge are currently in Bern für EuroSys 2012, and will be reporting live from the conference here, as well as summarizing the trends and highlights afterwards.
So here goes -- we're kicking off. Read more below the fold!
- ~300 attendees, most from Switzerland, otherwise US second, DE third, UK 5th (better than before!)
- exactly 50/50 students/non-students
- 3/4 of attendees from Europe
- 27 accepted papers (up from last year)
We're now getting a spiel from the vice-rector of the University of Bern about the place, but the technical program will kick off shortly.
- Some points from the PC chairs -- heavy/light PC this year (varying levels of commitment); they tried to ensure that only experienced people were on the "heavy" PC and attended the PC meeting.
- PC-authored papers were held to higher standards
- Successful shadow PC
- Novelty: "review summaries"; rejected papers got short summaries why their papers were not accepted, accepted papers have longer reports on strengths and weaknesses, which will be published openly online!
- All submissions (total 179) got three (first round) or five (second round) reviews; 69 discussed in PC meeting, 27 accepted.
- 11% of papers have "cloud" in the title, 50% have it in title/abstract/topic.
- Just over 50% of authors from U.S.; next largest are China and Germany.
- 17 out of 27 accepted papers from the U.S.; none from the U.K again :-(
- 19/27 papers are student papers.
Now we're actually getting into the technical program.
Session 1: Software Transactional Memory
STM in the small: trading generality for performance in software transactional memory
Aleksandar Dragojevic (EPFL) and Tim Harris (Microsoft Research)
They have developed SpecTM, which is a specialized STM thing designed for concurrent data structures, following the usual adage that, despite general-case STM being slow, one can make something fast for domain-specific cases. SpecTM has performance that is almost identical to a hand-crafted CAS implementation for skip lists. The main things making STM slow are software book-keeping and expensive memory operations (STM calls, transforming one instruction into ~10). They speed things up by using special “short” and “single-access” transactions that have very limited scope (in terms of locations accessed, and number of accesses). In general, the theme seems to be that they introduce assumptions that allow them to have static state representations, and get rid of logic by making limiting assumptions. They also prefer to use lots of short transactions (e.g. to update different parts of a data structure) and “glue” them together using lock-free algorithm techniques, and only fall back to “normal” transactions for (rare) corner cases. One neat implementation trick is that they co-locate ownership records and data in memory (rather than having them in separate management data structures), so they end up in the same cache line, although this does break standard library constructs like memcopy!In the skip list optimization, they make up 50% of the performance difference between CAS and naive STM by using short transactions, and further optimizations (including the data/meta-data colocating) bring them within 2% of CAS performance. They also confirmed these results on a hash table implementation.
Improving Server Applications with System Transactions
Sangman Kim, Michael Lee, Alan Dunn, and Owen S. Hofmann (The University of Texas at Austin), Xuan Wang (Stony Brook University), Emmett Witchel (The University of Texas at Austin), and Donald E. Porter (Stony Brook University)
Starts off with the usual 2D taxonomy of parallelism vs. maintainability/effort, using coarse vs. fine-grained locking as an example. They observe that server applications are often single-threaded and use coarse-grained locking. So, clearly, there is a low-hanging fruit for parallelization here. Sounds like there is a previous system called TxOS, which provides system transactions (as part of the OS, and without special hardware), and puts a JVM between the application and TxOS (didn’t quite catch why?). Their latest thing is TxOS+, adding pause/resume support and commit ordering, results in ~88% speedup. Server apps often synchronize on OS services (e.g. file system), or are inherently dependent on determinism, which makes it hard to parallelize them well. They look at the examples of state-machine replication and an IMAP server. The former can be parallelized by using ordered transactions; the latter is a poster-child application for system transactions because lockless designs result in anomalies (e.g. messages disappearing from maildir listing), and lock-based designs do not parallelize well.When doing system transactions, there are some “middleware” (e.g. libc) operations that cannot be rolled back without problems. They introduced the pause/resume operations to expose state changes by middleware (libc, JVM GC) to other, parallel threads. They also made 17k LOC of kernel changes (whoa, matey!) -- presumably to TxOS -- to enable parallelization of many OS operations.Evaluation on a “BFT graph server” shows that their throughput can improve up to 88% over Linux (I think -- didn’t quite get the details here). In the Dovecot mail server, they changed 40 LOC (out of ~138k) to parallelize mail directory access. For 4 clients, they see increased throughput as the write ratio increases, apparently due to better block scheduling. I am not sufficiently familiar with this area to be allowed a qualified opinion, but it seems to me that their argument of requiring few application code changes is only persuasive as long as the massive (!) library and OS changes are one-off, which I think is what they are arguing. There’s a bit of a discussion in Q&A about how we can know that middleware is well-behaved, and the answer is that the person inserting the syscalls in the middleware will just have to know what’s going on.
Session 2: Everything green -- Energy matters
Where is the energy spent inside my app? Fine Grained Energy Accounting on Smartphones with Eprof
Abhinav Pathak and Y. Charlie Hu (Purdue University) and Ming Zhang (Microsoft Research)
Mobile is booming. Application performance matters. Back in the old days, we invented profiling tools to optimize compute performance. But in the mobile world, energy, not compute speed, is the key issue. So, the world needs an energy profiler! There exist various power models already, but they are either not applicable to mobile devices (traditional models), or coarse (EuroSys ‘11 FSM model using syscalls as power triggers). Mobile apps are very complex and composed of fairly independent components. Their eprof tool supports various granularities (thread/process/routine), uses per-component state machines, and considers specific mobile issues such as lingering energy consumption (e.g. because of tail energy consumption due to network/storage access). This is hard, because the tails can overlap (e.g. consider sending two packets within a short time -- they will share a a tail!). Another problem specific to mobile devices is that they rely on programmer-managed wake-management to allow freezing the OS when the screen is turned off, while still giving a programming model that allows arbitrary communication. eprof instruments application source code with energy profiling calls and generates traces when running on an eprof-enabled mobile phone OS. They get around needing the source code on Android by simply instrumenting every application in library code.A case study on the Android web browser found that it spends 35% of its energy budget on something called “TCP conditioning”. Angry Birds spends 45% of energy on user tracking! They also found a bug in the Facebook app that erroneously acquired a wakelock. Overall, by far the most energy is spent on I/O (>60% for all apps tested) -- not really a big surprise here, though. Most of it is spent in bursts, which they group into “bundles”. They have few of those per app, and a key question is what the app does in the “tail states” between the bursts inside a bundle. In some cases (e.g. some chess app), there is nothing happening in those states (in this case, after fetching an ad). Clearly, batching/aggregating I/O would have helped here!In the Q&A, someone asked how they deal with buffering (e.g. network “read” where packet is already there, buffer cache). I didn’t hear all of the answer, but I think it was that they don’t currently, but there are mechanisms that could be used. Someone also asked about validation -- the answer is that there is no ground truth, although they did some basic experimental sanity checks by stopping and resuming apps.
Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
Yanpei Chen and Sara Alspaugh (UC Berkeley), Dhruba Borthakur (Facebook), and Randy Katz (UC Berkeley)
Power consumption in data centres is a big deal in terms of OpEx. They use “efficiency” as a key metric, which is just work divided by energy input, so the obvious choices are increasing utilization or decreasing energy consumption. Idea: run non-time critical batch jobs in low utilization periods. Other option: power machines down in order to reduce energy consumption, but this results in data unavailability and workload spikes.Their key contribution is the differentiation between interactive MapReduces and non-interactives, and then optimize energy based on this knowledge. They analyzed a month-long trace from a 3k machine Hadoop cluster and found the usual things: most jobs are small, have small inputs, and run for a short time. They conclude that these must be “interactive” jobs.The proposed technique is to isolate the interactive jobs in one part of the cluster (i.e. partition it). The interactive zone is always-on, whereas the “batch zone” alternates between low-power and on states depending on job arrivals. This makes more sense, because they assume that the non-interactive batch jobs are long-running, so the overheads of bringing machines up and down are acceptable. They also treat interruptible jobs separately, and checkpoint them. To decide what the size of the interactive zone should be, they use simple heuristics: use the same percentage of the cluster as used by the storage for the interactive jobs (e.g. 10% of storage is due to interactive jobs => use 10% of cluster). All of their evaluation is simulation -- they simulated assignment based on the trace, and used an on-off power model to work out the energy savings. They find up to 50% energy saving. This does not come for free -- the delay for interactive (!) jobs goes up -- about 40% experience a higher latency (although only a few seconds), and there is a long tail.
GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks
Inigo Goiri, Kien Le, and Thu D. Nguyen (Rutgers University), Jordi Guitart and Jordi Torres (UPC), and Ricardo Bianchini (Rutgers University)
They assume a data center with both a green and a traditional “brown” energy source. The green energy source only provides energy for some of the time, and ideally we would like to avoid wasting any of that free energy, so they try to shape the workload such that it fits the energy availability curve. They focus on MapReduce workloads because they are simple, and delay jobs in low-energy phases, subject to a maximum completion timeout. They also turn off machines to save energy in low-supply phases. They also take the weather forecast into account (!) to make predictions for the future. The energy budget for jobs is computed based on historical information, and dynamically updated. With the brown energy component, they also consider peak vs. off-peak electricity pricing. They then assign energy in order of price, so as to minimize the amount of high-price energy used. Prior work on sending servers to sleep had an always-on set of machines that held all the data; their approach is to have only the “required” data available, and thus they require fewer active servers. The policy for this is the obvious one -- they look at the waiting job queue and determine the data needed, then migrate data and shut down machines accordingly. Evaluation on a 16-machine cluster (!), using Hadoop, EAHadoop (prior work) and GreenHadoop; using an energy profile from New Jersey and a Facebook workload (jobs with “up to” 37 GB of input data?!). The energy prediction is reasonably accurate, apart from unpredictable events like clouds, rain and thunderstorms. Predicting 6 hours in advance, they have ~20% inaccuracy, which goes up to 40% for longer time periods. Unsurprisingly, they find that they save energy by re-shaping the workload (comparing vanilla Hadoop not shutting down any nodes, vs. GreenHadopp), seeing 31% more green energy used, 39% cost savings. Spikes in energy profile are due to data migration prior to machine shutdown.
Session 3: Cloud I
Frugal Storage for Cloud File Systems
Krishna Puttaswamy, Thyaga Nandagopal, and Murali Kodialam (Bell Labs, Alcatel-Lucent)
Data is moving to the cloud. Claim that all file accesses are served from the cloud (dubious). There are different options and cloud service providers, with different performance and cost tradeoffs. They have a system that dynamically adapts depending on changes in FS size, access patterns, and provider policy changes. They consider different setups: S3, and S3+EBS (with EBS cache size 10% of total data size). However, for different FS traces, the ideal EBS cache size varies between 0.25% and 30% of the data size! So they built FCFS, which can dynamically resize the cache, and move data between “cache levels” depending on access patterns. When doing the cost optimization, they consider both the memory (durable storage) cost, and the I/O access cost (per-op cost). Logically, the latter implies that migrating data incurs a cost that must be amortized away afterwards. The time for which a block should remain in cache before being evicted is 18 hours for reads, and 1.8 hours for writes (EBS cache over S3), if we know future accesses (OPT strategy; impossible in reality). Possible strategies approximating OPT: DET and PROB, deterministic wait for until timeout before eviction, and probabilistic eviction in time interval [0, T]. The “performance ratio” (some metrics which I did not get the definition of) is 2 for DET and ~1.58 PROB. For different traces, they find that the optimal cost is $46 for each $100 spent by “today’s systems” (unclear to me what they used as a baseline here), and they get close to that (~$54). There was some stuff about CloudCache/ElastiCache, which seems to be a high-performance, high-cost system (1000x more expensive than S3), and they found that having a 64 MB cache of this kind is a good approach. Both algorithms are again very close to optimal (although they showed no numbers to back this up).
Kineograph: Taking the Pulse of a Fast-Changing and Connected World
Raymond Cheng (University of Washington), Ji Hong (Fudan University), Aapo Kyrola (Carnegie Mellon University), Youshan Miao (University of Science and Technology of China), Xuetian Weng (Peking University), Ming Wu, Fan Yang, and Lidong Zhou (Microsoft Research Asia), Feng Zhao (Microsoft Research Asia), and Enhong Chen (University of Science and Technology of China)
Real-time data is upon us. We need to process lots of data at near real-time. This sounds a lot like Pregel and friends, but without the BSP. Example: maintaining a mention graph for Twitter and compute influence ranking over it. This work appears to be restricted to dealing with graphs, but provides a consistent graph view to computation, despite high update rate. Executing static graph mining algorithms over this should provide timely results, and be fault-tolerant, too. It also takes care of a lot of stuff -- kinda like MapReduce, but for graphs. They use an epoch commit protocol to separate graph construction from graph computation. Graph updates are transactional, and regular snapshots are manufactured, so we get a series of consistent snapshots. “Timeliness” is defined as the time between the beginning of the delta window that the snapshot is constructed from, and the end of the computation on the consistent snapshot. The distributed system consist of ingest nodes, a master and graph nodes (holding partitions). When the snapshooter decides that the end of an epoch is happening, it takes the contents of the master’s progress table and uses this to put a snapshot barrier in the graph node logs. The incremental computation interface is based on computing per-vertex values (à la BSP), then each vertex decides if it made a “significant” change, and can trigger a graph-scale aggregation. They also batch messages between partitions (I think). Influence rank for graph with 8M vertices, 29M edges in 2.5 minutes. They get throughputs up to 180k tweets/second at 16 ingest nodes. Scaling is sub-linearly, but 2PL does not scale at all; they could not scale beyond 40 nodes (total; although it remained unclear if they just didn’t have more machines available). The timeliness of incremental computation, unsurprisingly, is much better than that of non-incremental computation. Data ingress is replicated, so node failures do not affect ingress throughput. Computation, however, is not replicated, so we need to back-track and re-compute on failure (though results are replicated). Master fault-tolerance could be done using Paxos/Zookeeper etc., but I don’t think they actually did it.
Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew D. Ferguson (Brown University), Peter Bodik and Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), and Rodrigo Fonseca (Brown University)
Cluster jobs have variable computation latency, but deadlines (SLAs) are important for various reasons. We would like to schedule jobs such that they are guaranteed to complete by a deadline. Reasons for variance: noisy environment, pipeline complexity. Dryad clusters often have pipelines of multiple jobs, where the overall workflow deadline depends on lots of per-job deadlines. Priorities are considered harmful here, since they results in preemptions, which can delay entire pipelines of jobs, and destroy predictability. Weights (an alternative) are difficult for users to set, so they take utility curves as an input, capturing deadlines and penalties. Jockey uses historical information to make predictions for jobs (40% of Cosmos jobs are repeated) -- basically, f(progress, allocation) = remaining time. They considered and evaluated six different progress indicators. While running, the Jockey control loop keeps making predictions for the remaining job run time, and adjusts the resources committed to the job in order to meet the deadline. The predictions are made using a simulator based on historic task run time information, choosing the maximum utility outcome of the simulation. They also use the largest input ever seen as a conservative estimate of input size.Evaluation: consider a job running on a cluster with ~80% CPU utilization, with initial deadline of 140 minutes, but 10 minutes into the job, they submit a new utility curve setting the deadline to 70 min.; over time, Jockey reduces the resource allocation as its initial prediction was excessively pessimistic. They use an “oracle” value, which is the amount of resources that would have been required as a constant allocation, assuming complete 100% use of those resources. The excess resource use above the oracle is a useful metric to compare allocation schemes. On a larger-scale experiment, Jockey only missed one out of 94 deadlines, and provides predictable performance without excessive resource commitments (about 25% above oracle, as opposed to 76% for a max resource commitment).
Session 4: Virtualization
The Xen-Blanket: Virtualize Once, Run Everywhere
Dan Williams (Cornell/IBM), Hani Jamjoom (IBM T. J. Watson Research Lab), and Hakim Weatherspoon (Cornell University)
Many cloud providers have ways of running virtual machines; but unfortunately, there are no standards, so it is impossible to have VMs that are portable across multiple clouds (due to difference in image format, virtual devices etc.). Provider-centric homogeneisation would be great, but is unlikely to happen any time soon, so instead, they look into user-centric homogeneisation. Basically, they leverage nested virtualization to have portability. Unlike the Turtles people, who assumed support on the bottom (provider) layer and none from the nested hypervisor, they assume no support from the provider, but do assume support from the nested hypervisor. The (reasonably straightforward) result here is that, as long as someone has made standard Xen-blanket images for a cloud provider, one can move VMs transparently across different clouds. Since bottom-level hypervisors do not expose virtualization features to the guest, they need to use paravirtualization. For device drivers, they need to create individual “blanket” (translation) drivers. There are some details about the implementation, but they do not go detail in the talk -- the guest VM inside the Xen-blanket system can be unaware of running on top of Xen-blanket devices.Evaluation is comparative between native, HVM, PV and Xen-blanket. Same network throughput for 1 Gbps link, but somewhat higher CPU utilization. However, they see up to 68% overhead on kernbench (some kernel benchmark). However, they can now do a trick and game Amazon’s pricing regime, since CPU provision is super-linear compared to other resources (memory, disk). In other words, it is possible to trade their CPU overhead for efficient use of other resources. In fact, they can oversubscribe a 4XL cluster VM with 40 nested VMs and get approximately the same kernbench performance as an m1.small instance (albeit with a huge error bar!). Flexibility win: live migration between Xen-blanket instance, across different providers. They live-migrated from their local cloud to EC2, although after migration, the disk becomes a remote disk. They do note that there is a bunch of existing related work on multi-cloud deployments, but they argue that they are less powerful (e.g. they don’t do live migration).
Isolating Commodity Hosted Hypervisors with HyperLock
Zhi Wang, Chiachih Wu, and Michael Grace (North Carolina State University) and Xuxian Jiang (North Carolina State Univeristy)
Hosted hypervisors are type II hypervisors running inside a host OS and borrow host OS features. Guest VMs essentially have their resources multiplexed by a kernel module, which runs at the same privilege level as the host OS kernel. Hypervisors, however, are complex and have vulnerabilities, which are a big issue. The threat is especially acute with hosted hypervisors, where a compromise of the hypervisor means we now own the entire machine (that said, surely the same is true of a non-hosted hypervisor vulnerability?). Their system, HyperLock, puts an isolation runtime around the hypervisor, and provides each VM with its own shadow copy of the hypervisor, so that compromises are localized to a single per-VM hypervisor. In HyperLock, the KVM module in the kernel is replaced by a HyperLock proxy, which is used by the HyperLock isolation runtime to complete privileged actions. They also confine memory access using a separate paging-based address space, and do now give the individual KVM instances access to the host memory. There is also some stuff to do with constrained execution, to avoid variable-length instruction sequences being misinterpreted and doing dangerous things (I don’t fully understand why this is relevant, but I missed a bit of the talk being distracted). They support a particular version combination of KVM and QEMU (220.127.116.11 and 0.14.0), and ended up reducing the privileged code size significantly (33.6k to 4.1k). They evaluated vulnerabilities using case studies of how their system copes with them. Due to the partitioning of the hypervisor, leakage of other VMs data is impossible, and isolates crash-DoS attacks to the single VM-KVM combination. Their system call interposition “could” also mitigate certain attacks. Performance-wise, they have a worst-case overhead of 5% over normal KVM.
Delusional Boot: Securing Cloud Hypervisors without Massive Re-engineering
Anh M. Nguyen (UIUC), Himanshu Raj (Microsoft Research), Shravan Rayanchu (U. of Wisconsin), and Stefan Saroiu and Alec Wolman (Microsoft Research)
Motivation: virtualized environments have very large TCBs (sounding familiar to anyone?). Typically hypervisor and dom0. According to a survey they did, 70% of vulnerabilities are somehow related to virtual devices. They observe that most virtual devices are not needed for cloud VMs -- basically, networking and storage only; drop floppy, keyboard, mouse, monitor, serial port, etc. They also narrow the interface for the remaining devices. This way, they managed to remove 30 out of 39 devices from Hyper-V (a 60% reduction in LOC), at no runtime performance penalty, and without massive re-engineering. First had to work out the dependencies between virtual devices, which is actually non-trivial. Using three techniques: (1) incrementally disable virtual devices (pairwise combinations; this approach doesn’t scale well), (2) static analysis of Hyper-V object files, and (3) code inspection (which is impractical at scale). Using a combination of these techniques, they found 68 dependencies (55 using technique (2), 11 using (1) and 2 using code inspection). After removal, ~15 devices left, but some of those only needed at boot time! Idea: remove them after boot (by suspending, not hotplug). Furthermore, would like to avoid intrusions via the devices that they remove after boot, so they boot in an isolated sandbox, and live-migrate into production system afterwards.To evaluate, need to measure Min-V (their thing)’s security benefits, and ensure the security of the delusional boot. Extraneous device removal amounts of 38% of TCB LOC, and delusional boot gives an another 22% benefit. This is asserted to be a good thing, and result in better security. For the second concern, they make sure that the boot server is in a clean, known-good state before each VM boot (attested to firewall using TPM attestation, which will fail if the clean configuration code was modified). They also measured the boot-time overhead (imposed by the necessary boot server restart and VM migration), which is between 2.6x and 2.9x. Guest OSes remained stable despite removed devices on various benchmarks (PassMark BurnInTest, various user applications). It remains unclear if this prevents any real-world attacks the approach prevents (since Hyper-V is not open source). 44 (out of 52) vulnerabilities in open source hypervisors are in their equivalents of the removed devices, though.
Here endeth today's live-blog as everyone heads off to the EuroSys General Assembly and/or dinner. We'll be back tomorrow morning!