Various people from Cambridge are currently in Bern for EuroSys 2012, and will be reporting live from the conference here, as well as summarizing the trends and highlights afterwards.
The second day has kicked off, and we will be providing the usual live update service below the fold -- click "continue reading" to get there!
Session 5: Databases
A Critique of Snapshot Isolation
Daniel Gómez Ferro and Maysam Yabandeh (Yahoo! Research)
[Missed the first 10-15 minutes]
This appears to be criticizing snapshot isolation, saying that both read-write and write-write conflict checking are helpful, but not strictly necessary. They have some kind of optimistically concurrent scheme that only checks read-write conflicts, and they get full serializability at almost no overhead (unclear over what? I must have missed something) using a centralized state oracle. They implemented this in Hadoop, but also talk about an open source project called OMID. For a synthetic workload, they achieve something on the order of 120k TPS. Apparently, though, this only works for HBase-like databases, and not for systems supporting SQL queries.
LazyBase: Trading freshness for performance in a scalable database
Jim Cipar and Greg Ganger (CMU) and Kimberly Keeton, Charles B. Morrey III, Craig Soules, and Alistair Veitch (HP Labs)
Often, we would like to run queries over continuously produced data, ideally with near real-time freshness. Unlike “big data” analytics, which focus on high throughput, this work focuses on latency. They have a bit of a tick-cross taxonomy showing that classical DBMSs do not have the performance for processing data sets of the size they need, data warehousing solutions do not have the freshness, and NoSQL things do not have the consistency. However, freshness requirements are highly dependent on the application -- so it is logical to build a system that allows for flexible scaling of freshness. Furthermore, they separate consistency and freshness, and always expose a consistent view, which may be stale to different degrees. Limitations of their system: observational data only (read-only and write-only transactions, no updates), and it scales to 10s of servers, so not really large scale. LazyBase uses uniform, general-purpose workers, but arranges them in a pipeline (simplified: client -> ingest -> sort -> merge). High-throughput ingest is batched for performance (throughput) and atomicity (atomic multi-row ops). Varying the batch size from 250 to ~2k rows increases throughput significantly. However, obviously, the problem with batching is that it hurts latency. They have ways of trading freshness for latency: if they read from the start of the pipeline, where slow-to-query raw data exists, they get fresh results, while if they query from the end of the pipeline, they get good query performance (due to pre-processing already completed on the data), but more stale results. LazyBase is queried using a subset of SQL, which has been amended with a FRESHNESS keyword. The client library then deals with the “dirty work” of converting this into a query for the right stage of the pipeline. In an experiment, they varied the freshness requirement, and observed a ~7x difference in latency (0.1 vs. 0.7s).
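The FRESHNESS-to-pipeline-stage routing can be sketched roughly like this (a toy reconstruction: the stage names follow the talk, but the staleness numbers and the routing function are my assumptions, not LazyBase's actual client library):

```python
# Hypothetical sketch of LazyBase-style freshness routing. Pipeline stages
# are ordered from freshest/slowest-to-query to stalest/fastest-to-query;
# the second tuple element is the assumed age (seconds) of each stage's
# newest data.
STAGES = [
    ("ingest", 1),    # raw, unsorted batches: fresh but slow to scan
    ("sort",   30),   # sorted runs
    ("merge",  120),  # fully merged, indexed tables: fast range queries
]

def pick_stage(freshness_bound_s):
    """Return the most pre-processed stage whose data still satisfies
    the client's FRESHNESS bound (in seconds)."""
    best = STAGES[0][0]  # ingest always satisfies any bound
    for name, staleness in STAGES:
        if staleness <= freshness_bound_s:
            best = name  # staler stages answer faster, so keep upgrading
    return best

# e.g. a query with "FRESHNESS 60" would be answered from the sorted runs:
print(pick_stage(60))   # -> sort
print(pick_stage(5))    # -> ingest
print(pick_stage(600))  # -> merge
```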
VMs in an OpenCirrus cluster with 6 dedicated cores and 12 GB RAM (per machine?); 38 GB Twitter dataset. They find that they can ingest data via 20 machines, and compare against Cassandra, finding that LazyBase is 4x faster. For stale query experiments, they used both point and range queries (~0.1% range size). Their point query throughput is not as good as Cassandra’s (50% performance) due to the on-disk format (compressed time-series, so need to unpack). However, for range queries, LazyBase creams Cassandra with 4x higher throughput (16 vs. 4 qps). On consistency experiments, they find that LazyBase has strictly higher consistency than Cassandra, for which a value oscillates between -1, 0 and 1 for a simple increment/decrement workload. For freshness, they find that Cassandra usually does get higher freshness, but at the expense of consistency (again, oscillations of timestamps, non-monotonicity), while LazyBase’s timestamps are monotonically increasing and always consistent.
Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler (Harvard), and Robert Morris (MIT)
Let’s build a new fast KV-store! It should perform well on hard workloads: range queries, skewed key popularity, small K-V pairs, many puts, and arbitrary keys. Their first attempt was a fast binary tree, achieving 3.7M qps, given sufficiently high-bandwidth network and disk hardware. The bottleneck turns out to be DRAM, so they apply “cache craftiness” optimizations and get 1.5x better performance. Their optimized system is called “Masstree”. They evaluated it on a 16-node cluster with a bunch of SSDs, 64 GB RAM and a 10 Gbps NIC per machine. In fact, when looking at local performance (without network/disk bottlenecks), cache craftiness results in a 1.7x improvement (I thought they said that they’d already evaded those bottlenecks?!).
To reduce DRAM latency, they constructed a lock-free, unbalanced 4-way tree with the same concurrency properties as a binary tree, but only half as deep for the same amount of data -- plus each node fits into a cache line. However, it is pessimal (O(N)) for sequential inserts, so they look at a balanced B+tree instead. Their particular implementation uses optimistic concurrency with versioning. It turns out, however, that the B+tree is 11% slower than the 4-way tree (because of the cache line optimization, and since all of the 4-way tree’s nodes are full, while for their B+tree, only ~75% are)! Now they realize that they can use software prefetch to read multiple cache lines at a time with the B+tree, and things get 9% better than the 4-way tree. However, there are consistency issues with concurrent inserts in the B+tree (not with the 4-way tree, as keys don’t move), so they prepend a “permuter” (index list array) to each node, so that keys can be swapped around atomically. But there is an issue with long keys -- they require multiple memory accesses, and throughput drops rapidly as keys get longer. So they use a trie of B+trees, with each level (B+tree) responsible for 8 bytes of key length. Now throughput scales much better. “Masstree” is the union of all these optimizations (I think), and despite having a trie of B+trees in it, it is 8% more efficient than a single B+tree with all the other optimizations.
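The trie-of-B+trees layout is easiest to see in code. A toy sketch (my reconstruction, not the authors' code): each trie level indexes one 8-byte slice of the key, so comparisons inside a level are on fixed-size chunks rather than arbitrarily long strings. Plain dicts stand in for the per-level B+trees, and Masstree's concurrency machinery is omitted entirely:

```python
SLICE = 8  # bytes of key handled per trie level, as in the talk

def key_slices(key: bytes):
    """Split a key into 8-byte chunks, one per trie level."""
    return [key[i:i + SLICE] for i in range(0, len(key), SLICE)]

class Node:
    def __init__(self):
        self.value = None    # value, if some key terminates at this node
        self.children = {}   # 8-byte slice -> Node (a B+tree in Masstree)

class TrieOfTrees:
    def __init__(self):
        self.root = Node()

    def put(self, key: bytes, value):
        node = self.root
        for s in key_slices(key):
            node = node.children.setdefault(s, Node())
        node.value = value

    def get(self, key: bytes):
        node = self.root
        for s in key_slices(key):
            node = node.children.get(s)
            if node is None:
                return None  # key absent
        return node.value
```

The point is that a 20-byte key costs three fixed-width lookups instead of one lookup with repeated long-string comparisons, which is why throughput degrades much more gracefully as keys grow.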
Evaluation finds that they are much faster than common NoSQL databases (MongoDB, VoltDB), faster than Redis, and competitive with memcached.
Now, how do we scale this to multi-core? Other systems have an instance per core, and partition the key space. Masstree instead uses a single, shared tree that is accessed by all cores (presumably using the OCC+versioning techniques they alluded to). They find that their performance is basically constant as they scale load when running on 16 cores, while a statically partitioned Masstree performs better under low load, but loses out subsequently (it didn’t become entirely clear how they varied the load). They also find that they scale fairly well to 16 cores (about a ⅕ drop in performance relative to perfect scalability).
Session 6: Cloud II
MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud
Zhengping Qian and Xiuwei Chen (Microsoft Research Asia), Nanxi Kang and Mingcheng Chen (Shanghai Jiaotong University), Yuan Yu (Microsoft Research Silicon Valley), and Thomas Moscibroda and Zheng Zhang (Microsoft Research Asia)
Lots of applications for matrix operations, but hard to do at massive scale -- slowing the speed of innovation in many fields. “State of the art”: ScaLAPACK, developed 20 years ago. Main limitation: all data must fit into memory -- so it is limited by memory size. ScaLAPACK does parallelism by traditional barrier sync, which limits performance and fault tolerance. Using something like MapReduce is not an option for writing matrix algorithms directly, since the programming model is too restricted. Instead, MadLINQ is a layer above MR/Dryad in the spirit of DryadLINQ. MadLINQ integrates into the language (e.g. C#), allowing it to be mixed with DryadLINQ queries (although transformation methods need to be called to convert the meta-data between them). As one would expect, MadLINQ compiles into a global execution DAG, with each vertex corresponding to a matrix operation on a tile. Tiles can actually be partitioned into smaller blocks, and high-performance math libraries can then be used on those blocks. This enables them to send partial results downstream earlier (i.e. streaming before the entire task/vertex result is available); they argue that this strategy avoids network bursts and disk I/O by spreading the communication over time. The naive solution is problematic, though -- if a vertex fails after having sent output downstream, the whole vertex must be re-executed, redoing work redundantly. So to fix this, they track dependencies at the block level (but in a central scheduler), and then re-compute only what’s necessary. They formulated a set of invariants that must always be true, and the system will work to satisfy them again after failures that lead to violation.
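To make the tile decomposition concrete, here is a minimal pure-Python illustration (my sketch, not MadLINQ code): a matrix multiply split into tiles, where each output tile depends only on one row of A-tiles and one column of B-tiles. It is exactly this per-block dependency structure that lets the scheduler stream partial products early and recompute only what a failed vertex owed:

```python
T = 2  # tile edge length (an arbitrary choice for the sketch)

def tile(m, ti, tj):
    """Extract the T x T tile at tile coordinates (ti, tj)."""
    return [row[tj * T:(tj + 1) * T] for row in m[ti * T:(ti + 1) * T]]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tiled_multiply(A, B):
    """Multiply square matrices tile by tile: C[i,j] = sum_k A[i,k]*B[k,j]."""
    n = len(A) // T
    C = {}
    for i in range(n):
        for j in range(n):
            acc = [[0] * T for _ in range(T)]
            for k in range(n):
                # in MadLINQ, each of these partial products could be
                # streamed downstream as soon as its two input blocks
                # arrive, instead of waiting for the whole vertex
                acc = mat_add(acc, mat_mul(tile(A, i, k), tile(B, k, j)))
            C[(i, j)] = acc
    return C
```

In the real system each `mat_mul` on a block would be a call into a high-performance math library, and the (i, k)/(k, j) input sets are what the central scheduler tracks for selective recomputation.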
Evaluation shows that their pipelined approach finishes significantly faster: 31% faster on 512 cores vs. non-pipelined (barrier sync). They outperform ScaLAPACK by 14% on average on a commodity cluster; notably, they do not outperform it with small numbers of cores, but have a greater advantage with more cores (as one would expect, due to idleness caused by waiting for a barrier). And they also have fault tolerance, although they do not demonstrate how it compares to ScaLAPACK’s superstep-level fault tolerance. They also outperform SCOPE/Hadoop implementations by 3x/80x. However, for a Markov clustering job, they are “only” 1.2x faster than a non-pipelined approach. There are lots more benchmarks in the paper.
Jettison: Efficient Idle Desktop Consolidation with Partial VM Migration
Nilton Bila and Eyal de Lara (University of Toronto), Kaustubh Joshi, H. Andres Lagar-Cavilla, and Matti Hiltunen (AT&T Labs Research), and Mahadev Satyanarayanan (Carnegie Mellon University)
Idle desktops waste lots of energy, but people want always-on semantics for network presence. Wake-on-LAN is insufficient for things like AJAX webapps, messengers etc., so use VM consolidation and migrate to a server. But desktops have lots of memory, so you get a low consolidation ratio, and migration takes a long time. Plus, there are correlated resumes, which further impact performance. However, the idle desktop actually has a very small working set -- so how about we only migrate that? They migrate the skeleton VM, then bring in pages from the desktop on faults. Then the desktop is shut down, and we assume faults are rare (they will lead to wake-ups). On resume, consolidation ends by migrating back. In fact, they migrate for “micro-sleep” periods (around 80s; the avg. time between page requests and disk accesses is 83s). Apparently, there is a “break-even” time of 17s in order to save power (they don’t say how they arrived at this figure?). Implementation: Xen 3.4 w/Linux 2.6.18. Pages are brought in by trapping into a modified hypervisor that fetches pages from the desktop. Network migration is implemented by bridging. They also implemented some pre-fetching strategies, which further extend their sleep periods (to around 120s): hoarding of MRU pages is useful. For re-integration, they need to track dirty state. They have the intuitive policies for making consolidation/reintegration decisions. Research questions: (1) When to microsleep? Especially given that sleep/wake consumes extra power. They solve this by estimating the page request inter-arrival times in order to catch bursts. (2) How much state is consolidated/reintegrated? Case study with four university employees. Consolidation traffic is ~250 MB of memory (6%) and 0.5 MB of disk. Reintegration is ~110 MB memory and 6 MB disk. This only takes a few seconds. (3) What is the consolidation efficiency? On a 16 GB RAM server, they could consolidate 98 VMs (based on above assumptions). (4) How much energy is saved?
78% in one hour, 91% in five hours. They also looked at scalability, and tracked 22 users of desktops and laptops. Unsurprisingly, partial VM migration is orders of magnitude faster than migration; even with many users, the latency does not go up much (although log scale -- it probably is like 2x), while full VM migration quickly takes forever due to load. For some reason, they compare energy savings of partial and full migration, and find that they save as much, if not more, energy (I would have expected a lot more, since full migration takes very long?!).
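The microsleep decision they describe (sleep only when the predicted idle gap pays back the 17s break-even cost) can be sketched like this. The EWMA estimator and all names here are my assumptions; the paper's actual predictor for page-request inter-arrival times may differ:

```python
BREAK_EVEN_S = 17.0  # break-even sleep time from the talk

class MicrosleepPolicy:
    """Estimate page/disk request inter-arrival times and sleep only when
    the expected gap exceeds the break-even time."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # EWMA weight for the most recent gap
        self.est_gap = None  # running estimate of inter-arrival time (s)
        self.last_t = None

    def on_request(self, t):
        """Record a page/disk request arriving at time t (seconds)."""
        if self.last_t is not None:
            gap = t - self.last_t
            if self.est_gap is None:
                self.est_gap = gap
            else:
                # bursts pull the estimate down quickly, deferring sleep
                self.est_gap = self.alpha * gap + (1 - self.alpha) * self.est_gap
        self.last_t = t

    def should_sleep(self):
        """Sleep only if the predicted idle gap pays back sleep/wake cost."""
        return self.est_gap is not None and self.est_gap > BREAK_EVEN_S
```

With the ~83s average gap they measured, this policy would sleep between most requests; a burst of closely spaced requests drives the estimate below 17s and keeps the machine awake.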
Practical TDMA for Datacenter Ethernet
Bhanu C. Vattikonda, George Porter, Amin Vahdat, and Alex C. Snoeren (University of California, San Diego)
Data centre applications have different network requirements: high throughput (MR and friends), or low latency (memcached, webapps). Buffers hurt latency-sensitive traffic if the queues are full of high-throughput flows’ packets. Current solutions (Facebook’s custom UDP, DCTCP, Infiniband) are either not holistic or very expensive. Their observation: demand on a data centre network can be anticipated, so we should be able to coordinate ahead of time! Basically, they are arguing against the statistical multiplexing commonly used, and in favour of TDMA/coordinated approaches (wonder if there’s anything in the ATM world that did this?). However, using TDMA is tricky as host clocks drift, and hence we cannot rely on clients keeping time -- so we need special support from the network. Existing TDMA solutions use special hardware for this, and still require specialist real-time OSes on the hosts. They want to do it on commodity Ethernet instead. They leverage pause frames, which are flow control packets that are helpfully processed in hardware. They basically send pause packets to the other end of a flow and measure the reaction time until it stops sending packets. Observed values: 2-6µs at low variance. They have a fabric manager, which (I think) sends control packets to hosts to tell them to start/stop sending. Experimentally, they found that control packets can arrive up to 15µs out of sync. This means that transfers can overlap, and they fix it by inserting 15µs “guard times” between transmissions, so that there can never be any overlap (assuming 15µs worst-case delay). On the end hosts, traffic is managed using some kind of IEEE 802.1p priority class mechanism, and control frames that selectively start/stop transmissions.
Evaluation is on the Hadoop shuffle phase and memcached, on a network with a mix of electrical and optical switches. They use some high-spec HP servers with 10G NICs, and Cisco Nexus 5000 10G switches. The TDMA window is 300µs, with a 15µs guard time. Hosts are transferring data all-to-all. In a traditional 2-tier multi-hop topology, TDMA gets close to the ideal transfer time (about 5% beyond), much better than simple TCP. For latency, TDMA should really help to get around background flows. And it does -- 100-200µs vs. 200-600µs. With kernel bypass in the OS, TDMA gets down to around 50µs.
Also looked at electrical vs. optical switches. The latter have higher bandwidth, but longer switching times. Their system can adapt to that. Experiment: artificially vary the link capacity between the hosts between 1G and 10G periodically. We should get a nice step function, but with normal TCP, performance sucks, and only around 4G max. can be utilized; TDMA is much better and gets close to the expected step function. Headline figures: 15% lower finish time for all transfers, 3x lower latency on memcached transfers, 2.5x higher throughput with varying link capacity, and they leverage existing Ethernet standards.
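The guard-time arithmetic is simple enough to sketch (slot and guard values are from the talk; the schedule format and function names are my assumptions):

```python
GUARD_US = 15   # worst-case control-packet skew measured in the talk
SLOT_US = 300   # TDMA window from the evaluation

def build_schedule(hosts, start_us=0):
    """Assign each host a (start, end) transmit window in microseconds,
    separated by guard times so that even a host whose control packet
    arrived a full GUARD_US late cannot overlap the next slot."""
    schedule = {}
    t = start_us
    for h in hosts:
        schedule[h] = (t, t + SLOT_US)
        t += SLOT_US + GUARD_US  # next host begins only after the guard
    return schedule

sched = build_schedule(["h1", "h2", "h3"])
# h1: (0, 300), h2: (315, 615), h3: (630, 930)
```

Note that 15µs of guard per 300µs slot costs just under 5% of link time, which lines up with the “about 5% beyond ideal” transfer times reported above.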
Session 7: Storage
Scalable Testing of File System Checkers
Joao Carreira and Rodrigo Rodrigues (MPI-SWS), George Candea (EPFL), and Rupak Majumdar (MPI-SWS)
Silent data corruption is bad. Sometimes we get the wrong data back from disks. Maybe there are bugs in the storage stack. File system checkers are part of that stack, so we should be able to verify that they are correct; otherwise they may do more harm than good when fixing errors. Ideally, without having the developer write a manual spec! Testing should also be portable across different file systems, scalable, fast, and other nice things. They use a corruption model to synthesize errors on healthy disks, then run fsck on the disk and check the recovered disks against a specification. Writing such a specification is hard, so they split it into two aspects: consistency and completeness. The former means that the spec output must agree with the FS checker output; the latter means that no opportunities for recovery are missed. They assert that the disk consistency checker code is usually better and more mature than the error recovery code, so they use it as a proxy for consistency by simply running the checker again (and expecting no errors)! They check completeness by running different file system checkers on the same faulty disk and comparing their output (which should agree). The corruption models rely on symbolic execution of checker code, i.e. systematic corruption of fields read from the FS (as part of the symbolic execution), and test suites from file system checkers (which they generalize, to get more permutations).
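The two oracles can be sketched as follows (function names and the `run_fsck` wrapper are my assumptions, not SWIFT's actual interface; `run_fsck(image)` is assumed to repair the image in place, e.g. via something like `e2fsck -fy`, and return True iff the image was reported clean):

```python
def consistency_bug(run_fsck, image):
    """After one repair pass, a second pass should find nothing to fix;
    a dirty second pass betrays a recovery bug."""
    run_fsck(image)             # first pass: let the checker repair
    return not run_fsck(image)  # second pass must report 'clean'

def completeness_bug(checkers, image_copies):
    """Different checkers repairing copies of the same corrupted image
    should agree on whether it was recoverable; disagreement hints that
    one of them missed a recovery opportunity."""
    verdicts = {fsck(img) for fsck, img in zip(checkers, image_copies)}
    return len(verdicts) > 1
```

The neat trick is that neither oracle needs a hand-written specification of correct fsck behaviour: the checkers are used to judge each other and themselves.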
Evaluation was on three different FS checkers: e2fsck, reiserfsck, fsck.minix. Three metrics: performance, scalability, bugs found. SWIFT (their tool) found buggy code paths in all of the checkers. Scalability is tested by looking at how much of the code is covered within 1000 minutes; it gets about 50% on ext, as opposed to >60% for the ext test suite, but without any knowledge of the code. Overall, they found 12 bugs, some of which were new, some known. Both completeness and consistency bugs.
Delta FTL: Improving SSD Lifetime via Exploiting Content Locality
Guanying Wu and Xubin He (Virginia Commonwealth University)
[I don’t think I understood this well; SSDs are not my area. YMMV.]
SSDs have issues with their limited write count. Existing solutions are wear leveling and write caching, but they are inappropriate. Wear leveling is usually done in the FTL (Flash Translation Layer; address translation inside the SSD), which remaps pages to spread writes uniformly, and can perform deduplication. However, dedup only works for exactly identical content, operates on the entire drive, and is “complicated”. DeltaFTL (their system) works for similar content, on a page level, and is “easy”. Flash drives also cache writes to improve sequentiality and avoid random writes. (It remains unclear why that is inappropriate.) What DeltaFTL does is store “deltas” for similar content, which improves utilization (cf. dedup) and write count. This kind of stuff has already been explored in the block storage world. So the approach is not “do clever stuff to balance writes”, but more of an “avoid erases by writing less often”. The XOR deltas are also compressed to use even fewer writes, and buffered before being written to flash (so that many deltas can be aggregated). In addition to the normal page mapping table, they now also need a delta mapping table. The key question is, of course, when to delta-encode a page. Obviously, “every page” is not an answer, due to the overheads of delta storage (space and read latency). So, the pages should be write-hot and read-cold (i.e. written a lot and read rarely). Presumably, ultimately there’s going to be some kind of merge process to consolidate the deltas for a particular page. The compression algorithm they use is LZF1X-1. For different microcontroller frequencies, they find that the required compression ratio ranges from 0.55 to 0.8. They cache frequent mappings from page addresses to pages/deltas in some kind of RAM buffer. And there is a “garbage collection” process, which will pick the block with the most invalid pages and merge them with their delta(s).
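A toy version of the delta path: XOR the new page against the old one, compress the (mostly-zero) delta, and store the delta only if it compresses below the required ratio; otherwise fall back to a normal page write. This is my sketch, with zlib standing in for the LZF1X-1 compressor they mention:

```python
import zlib

PAGE = 4096       # assumed page size
MAX_RATIO = 0.55  # lower end of the required compression ratios above

def delta_encode(old: bytes, new: bytes, max_ratio=MAX_RATIO):
    """Return a compressed XOR delta if it is small enough, else None."""
    xor = bytes(a ^ b for a, b in zip(old, new))  # zero where pages agree
    packed = zlib.compress(xor)
    if len(packed) <= max_ratio * PAGE:
        return packed  # store the delta, saving a full-page write
    return None        # too dissimilar: write the page normally

def delta_decode(old: bytes, packed: bytes) -> bytes:
    """Reconstruct the new page from the old page and a stored delta."""
    xor = zlib.decompress(packed)
    return bytes(a ^ b for a, b in zip(old, xor))
```

This also shows why write-hot/read-cold pages are the right candidates: every read of a delta-encoded page pays the extra decompress-and-XOR step, while every write saves most of a page.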
Evaluation: for some values of R_c they picked (0.2, 0.35 and 0.5), they see a significant reduction in flash garbage collection, write latency and only slightly (~10%) increased read latency. Also some reduction in space use that I didn’t catch. Question: does this actually impact lifetime? (which they imply). Answer: fewer writes/flash GCs = necessarily longer lifetime.
FlashTier: A Lightweight, Consistent and Durable Storage Cache
Mohit Saxena, Michael M. Swift, and Yiying Zhang (University of Wisconsin-Madison)
[I don’t think I understood this well; SSDs are not my area. YMMV.]
Flash is here. People use it as a cache, and there exists some OS support. Various block caching approaches (write-through, write-back, read-miss). But this is inefficient due to multiple layers of indirection (extra memory overhead due to address space management), the cost of ensuring cache consistency, and the extra overhead of free space management (cache GC). The address space management requires a mapping from OS disk locations to SSD locations to Flash locations. Consistency issue: the cache manager must save the mappings durably, otherwise the cache needs to be warmed up again, which can take a long time due to its size. Finally, GC to remove stale cache elements leads to additional writes, and hence reduced lifetime. Not entirely clear to me why you need to GC the cache and can’t just selectively evict. Their system, FlashTier, provides SSCs (Solid State Caches), which have a unified address space, cache-aware GC, and other good things. They are built directly from flash, i.e. they are NOT SSDs.
They map disk locations directly to flash locations, using a sparse hash map with a hybrid address mapping (both for 256K erase blocks and for 4K pages). Their GC mechanism does what I suspected above, i.e. it “silently” evicts clean blocks (i.e. those not written to). There’s also a differentiation between “log” and “data” blocks, which have different write characteristics and are treated differently for GC purposes. In the SSC policy, there is a fixed log area (7% in their experiments); in SSC-V, it is variable (0-20% in their experiments). (I didn’t really understand the details of this, since it was going at some speed.) They do have a nice crash consistency guarantee, which is that a cache can always be used after a crash, since they never lose dirty data, or retrieve stale data (by enforcing some invariants in the system design, see paper). FlashTier can support both write-back and read-miss policies by using their low-level operations in different ways. For write-intensive workloads, they perform up to 200% better than the “SSD” model (naive use of SSDs as block caches), while on read-intensive workloads, they get 98% of its performance. In write-intensive cases, they improve device lifetime (simulated) from 2.1 years to up to 4.7 years for one of the workloads. Memory consumption on the host is also reduced.
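The “silent eviction” idea is worth a sketch: because a clean cached block still exists on disk, the cache-aware GC can simply drop it instead of copying it forward, so only dirty blocks cost GC writes. The structures below are my assumptions, standing in very loosely for FlashTier's sparse hash map:

```python
class SSC:
    """Toy solid-state cache mapping disk LBAs to cached data, with
    silent eviction of clean blocks when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}  # disk LBA -> (data, dirty)

    def write_dirty(self, lba, data):
        self._make_room()
        self.map[lba] = (data, True)   # write-back: cache holds newest copy

    def write_clean(self, lba, data):
        self._make_room()
        self.map[lba] = (data, False)  # populate on a read miss

    def read(self, lba):
        entry = self.map.get(lba)
        return entry[0] if entry else None  # None = miss, go to disk

    def _make_room(self):
        if len(self.map) < self.capacity:
            return
        # GC: silently drop any clean block; it needs no copy-out,
        # because the disk still holds an identical copy
        for lba, (_, dirty) in list(self.map.items()):
            if not dirty:
                del self.map[lba]
                return
        raise RuntimeError("all blocks dirty: must write back to disk first")
```

A subsequent read of the evicted LBA just misses and goes to disk, which is exactly why this is safe for clean data and never for dirty data.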
Session 8: Dependability and Security
[I am not attending Session 8 due to other commitments; maybe someone else will blog it. Otherwise, normal service will resume tomorrow morning for the last two sessions.]