syslog
9Oct/120

Live blog from OSDI 2012 — Day 2

Here we are, reporting back from OSDI 2012 in Hollywood today.

Today's live-blog coverage continues below the fold. Note that some of the coverage is a little spotty due to our blog machine being overwhelmed by the number of requests.

Session 4: Distributed Systems and Networking

Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE
Zhenyu Guo, Microsoft Research Asia; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Rishan Chen, Microsoft Research Asia and Peking University; Jiaxing Zhang, Hucheng Zhou, and Sean McDirmid, Microsoft Research Asia; Chang Liu, Microsoft Research Asia and Shanghai Jiao Tong University; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia

This looks at typical MR-like computations, with multiple phases and network-intensive data motion in between procedural compute phases. However, the computation phases often discard a lot of data; if this is only done in the reduce phase, we end up moving a lot of useless data. Similarly, things can be moved down to a later phase if they cause inflation in intermediate data. The challenge for such optimizations is obviously to maintain correctness. Current systems like SCOPE, DryadLINQ, Pig Latin and Hive compile bits of code into job phase binaries, after running a query optimizer over them. When starting with procedural code, it is hard to make whole-program (job) optimizations. One option is to map bits of the procedural code to relational constructs and then use a query optimizer, but this is hard in the general case.

PeriSCOPE builds an inter-procedural flow graph (across the shuffle phase), adds safety constraints and then applies optimizations. These include reducing the number of columns (removing unneeded ones), reducing the number of rows, and reducing the size of each row. This is done by permuting the data-flow graph (the added safety information adds dependencies to prevent unsafe optimization). Row size reduction is done by labeling dependencies with type and field name and size information (I think?) and then doing graph cuts (cut such that no safety-critical dependency is broken).

A simple coverage study on a trace of 28k jobs in 2010/2011 shows that their optimizations can affect up to 22% of jobs, with column reduction being the most important one (~14%). Evaluation on a set of eight jobs shows that they can massively reduce the I/O volumne (often by more than 50%), and also reduce job runtime (latency) in many cases. The effectiveness of the individual optimizations is highly job-dependent, though -- sometimes a particular optimization (early filtering) has almost no effect, sometimes it reduces I/O by 99%.
The design of PeriSCOPE is such that it should be generally applicable. However, the programming model has a major influence on the options and effectiveness of optimizations, but this can also mean a trade-off against easy-of-use. Ideally, a more informative interface than the MapReduce-style computations considered in this work would be used.

Q: SCOPE also does query plan optimization. How does that interact with PeriSCOPE?
A: PeriSCOPE takes the output of the query optimization phase as input, so they are independent. In future work, may share information between the two.

Q: These optimizations are all based on static analysis. Could you do better if you combined this with dynamic profiling, and what would be better?
A: Yes! For example, do not know the size of stream variables at compile time.

 

MegaPipe: A New Programming Interface for Scalable Network I/O
Sangjin Han and Scott Marshall, University of California, Berkeley; Byung-Gon Chun, Yahoo! Research; Sylvia Ratnasamy, University of California, Berkeley

MegaPipe is a new network programming API for message-oriented workloads, avoiding many of the shortfalls of the BSD sockets API. Let's consider two types of workload: (1) one-directional bulk transfer: half a CPU can easily saturate a 10G link, (2) message-oriented: smaller messages, bi-directional, higher CPU load. The second type does not play well with the BSD socket API: we need to make a system call for every I/O operation, and everyone shares a listening socket. Finally, the socket API is based on file descriptors, which means it inherits some of the overhead from that abstraction. Motivating experiment: RPC-like test on an 8-core server with epoll, performing simple hand-shake transacrions, with 768 clients. As they scale up the message size, throughput increases and CPU utilization drops. However, tiny messages are pessinmal, as the throughput barely exceeds 500 MBit/s, while using almost 100% CPU. When they scaled the number of transactions per connection, they found that throughput is 19x lower with a single TX per connection as opposed to 128 TX/connection. Does exploiting multiple cores help? No, not really: diminishing returns from adding more cores.

MegaPipe addresses these issues. Its design goals are three-fold: concurrency is a first-class citizen, various I/O types share the same unified interface (network connections, disk I/O, pipes, signals), low overhead and scalability to many cores. The talk focuses on the third goal. Previous work shows that limiting factors are in syscall overhead (per-core performance), VFS overhead and shared resources (multi-core scalability). Key primitives of MegaPipe are a "handle" (like an FD, but only valid within a channel), and "channels", which are per-core point-to-point connections. Handles are automatically batched into using a single channel if they communicate with the same remote endpoint. Finally, unlike globally-visible FDs, handles in MegaPipe are lightweight sockets. Note that MP semantics have different semantics: MP requires the programmer to explicitly "dispatch" aggregated I/O requests on a channel, while the asynchronous BSD API requires setting up wait primitives and then making requests (without any aggegration). MP can also batch together multiple requests of different types and pass them down to the MP kernel module together. The multi-core listen/accept optimizations look a lot like the stuff MIT people presented at EuroSys (per-core accept queues). Lightweight sockets are ephemeral and only converted into a FD when necessary; MP handles are based on such lwsockets.

Evaluation: some micro-benchmarks, and adapted two popular applications (memcached, and nginx). On the same micro-benchmark as discussed in the motivation, MP manages to improve throughput by up to 100% for small (< 1KB) messages. MP, unlike BSD sockets, scales linearly and independently of connection length, to multiple cores. In the marco-benchmarks, memcached has limited scalability from the outset, as there is a global lock on the object store, while nginx is already designed as a scalable shared-nothing application. Accordingly, MP only benefits out-of-the-box memcached for short connections with few requests. Moving memcached to fine-grained locking, however, leads to a major improvement, with MP adding 15% extra throughput over BSD sockets. With nginx, they see similar results: 75% increase in throughput when using MP.

Q (someone from Stanford): People who complained about sockets overhead in the past gave up on it and bypass the API. Are your lwsockets good enough to help these people?
A: lwsockets are an opportunistic optimization, avoiding most of the overhead most of the time, but still giving the full sockets API when needed.
Q: What happens if the user process having a MP handle/channel forks?
A: Not duplicated when forking.

Q: Batching affects short messages. This will delay them, and that may be an issue with delay-sensitive systems (commonly using small messages). Do you somehow allow users to control the batching?
A: Network cards already have deep queues, so latency is already there. Have some results that show MP latency on memcached is actually same or lower than baseline.

Q: How do you schedule requests to per-core handler threads, e.g. accepts?
A: Just normal user-level application, no special OS scheduling.

DJoin: Differentially Private Join Queries over Distributed Databases
Arjun Narayan and Andreas Haeberlen, University of Pennsylvania

[to be added]

 

Session 5: Security

Improving Integer Security for Systems with KINT
Xi Wang and Haogang Chen, MIT CSAIL; Zhihao Jia, Tsinghua University IIIS; Nickolai Zeldovich and M. Frans Kaashoek, MIT CSAIL

Integer overflows can have disastrous consequences, such as buffer overflows or other logical bugs (e.g. trick OOM killer into killing innocent processes). Indeed, integer errors account for Linux the #2 OS vendor advisory topic (according to CVE). Some options to avoid interger overflow: use arbitrary precision integers (performance not good enough), trap on every overflow (also bad performance, and some code relies on overflows).

Found 114 bugs in Linux kernel, 9 of which were independently found by others. Two thirds of these led to logical errors or buffer overflows, so were quite serious. More importantly, two thirds of them also had checks which were incorrect, and multiple fix attempts as a result of reporting the bug were, too!

KINT uses LLVM IR, and combines results from per-function analysis, range analysis and taint analysis into a list of potential bugs. In the function analysis, they simply infer constraints from the code (control flow paths and overflow conditions) and then use a constraint solver to find if any integer value will satisfy the constraints that lead to an overflow. Taint analysis is optional, as it relies on user annotations. For example, a programmer may annotate code processing untrusted user input, and the taint analysis will then propagate this uncertainty and highlight potential overflows that result (and which are not protected against).
Evaluation is in terms of effectiveness, and false positives/negatives. In addition to the 114 bugs in the Linux kernel, they found five bugs in OpenSSH and one in lighttpd. To work out false negatives, they looked at 37 known integer overflow bugs from recent years. KINT found 36 of them. To look at false positives, they look at the patched code for these 37 bugs. KINT reports one false positive, and found two incorrect fixes! Run on the whole kernel, KINT finds about ~125k potential bugs, of which 724 are classified as "critical". Running only takes a few hours, even for a large code base like the kernel. They did not have the resources to inspect all of the potential bugs, but skimming found a few hundred.

One contribution as a result of this work is kmalloc_array(), which is a helper function to avoid the frequently-used, dangerous malloc(n * size) paradigm in the kernel. As a generalized approach, they propose a "NaN" special integer value, for which they have added support to Clang ("nan" keyword and "is_nan" call). Overflows will result in such a special NaN value, which can be contained. The advantage of this is that bounds-checking code can largely be elided, as the checks are automatic.

Q (someone from Harvey Mudd): Have you checked KINT's source code using KINT?
A: Nope, it's C++, and KINT only supports C.

Q (someone from UCSD): How many annotations did you use when checking the Linux kernel?
A: Details in paper, about 40 for untrusted input and 20 for annotation sizes.

Q (someone from NICTA): You are changing the C semantics with KINT. Why no simply change the semantics of addition and multiplication?
A: Some programs, e.g. crypto code, rely on overflow and modulo semantics. KINT will report many false positives for this kind of code.

Q (someone from UCSD): What do the annotations for untrusted user input look like, what extra code is required?
A: Specify untrusted function parameters, a little more difficult with macros.

Q (someone from UCSD): Do you also cover signed/unsigned mismatches? Infrastructure seems to work for this.
A: Yes.

Q: Nasty code exists. What happens if your solver is faced with something that it cannot solve, or which would take very long? Will you err towards false positives or false negatives?
A: Details in paper; solver has issues with divisions. Implemented a bunch of re-write rules that do not change semantics, but make the solver's work easier.

 

Dissent in Numbers: Making Strong Anonymity Scale
David Isaac Wolinsky, Henry Corrigan-Gibbs, and Bryan Ford, Yale University; Aaron Johnson, U.S. Naval Research Laboratory

This work allows dissemination of information without fear of reprisal from authorities or peers. The core challenge is the trade-off between anonymity and scale (resistant to timing analysis, thousands of participans and churn tolerance). Existing work on weak anonymity (e.g. Tor) scales well, but is not resistant to timing analysis: if someone can measure the timing of messages going into Tor, and coming out again, they can statistically de-anonymize the originator. Another alternative is DC-nets, but they do not scale to strong anonymity at large scale (since everyone is talking to everyone else). Dissent (their work) uses a mix-net topology with the anonymity semantics of DC-nets. Imagine M servers with N clients, where N >> M. Servers have N shared secrets, which they distribute to clients, which each have M secrets that they can combine with the server secrets (this reduces computational complexity for generating and distributing the secrets). For a practical example case, the number of messages to generate and distribute goes from ~10k to around 215 in Dissent. Then, the servers collaborate by generating the ciphertexts from their connected users ciphertexts, exchanging them, and then performing XOR on the M ciphertexts. However, DC-nets are not churn tolerant, since all participants' ciphertexts are needed to decrypt the message (by XOR'ing). In the system proposed here, since there are servers involved, they can simply time out participants that drop out before computing the ciphertext.
There is also a fairly complicated routine that deals with disruptors in the system, and can identify them. They can maintain anonymity as long as at least one honest server exists, and the clients need not to know which of the servers they talk to it is.

Evaluation: unlike previous systems, which usually scaled to around 40 clients, they can scale to 5,000 clients, while not exceeding a message latency of 10 seconds. In a trace-based evaluation using a Twitter trace, Dissent can keep up with the disemmination rate of real-world Twitter, while other systems do not. For churn resistance evaluation, they used PlanetLab, and found that there are some decent heuristics they can use to time out dropped-off participants. Disruption detection is bottle-necked on key shuffle and blame shuffle, taking on the order of hours, so there is room for improvement there.

Q: Are O(thousands) of people really a large-enough anonymity set? It is very clear that someone is a member of Dissent.
A: No ground truth on this, really, but thousands of people are certainly harder to reprimand.

Q: Can you compose what you have done with another mechanism that allows users to hide their participation?
A: For example, Tor is making progress on masking traffic as innocent web traffic. Could use that kind of thing, but that would still degenerate into an arms race.

 

Efficient Patch-based Auditing for Web Application Vulnerabilities
Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich, MIT CSAIL

This is an auditing system. Consider the example of Github authentication using public keys. There was a vulnerability that led to an attacker being able to modify peoples' public keys. As a response, Github asked users to audit their own public keys. It would have been better if they had been themselves able to find out what keys had been attacked, but the scale of Github logs is too large to make that pratical. In the particular example, the vulnerability was a result of using a user ID provided as part of the request, rather than the current user's ID.
Their auditing system is based on replaying historic requests, running the code once with and once without the patch fixing a vulnerability applied, and then watching for different results. This is a known methodology, but their contribution is that they can do this much faster, replying a month of traffic in a few hours. During normal execution, they record intial, non-deterministic and external request input in an audit log. The naive auditing approach then involves replaying this and comparing the results. Optimization opportunities: patches may not affect every request, the two replay instances will share a lot of code, and requests are similar. They address all three of these points. For the first one, they track control flow and identify the basic blocks that diverge as a result of patch application. During normal execution, they record the control flow trace (CFT) for each request. At replay time, they will use the CFTs and information about which basic blocks were affected by the patch to filter out requests not affected at all. For the second point (shared code between instances), they use function-level auditing. The two instances (patched and unpatched) start running as a single instance, and fork immediately before calling the patched function. While said function is running, any side-effects must be intercepted (global variables, output, database queries). If the side effects are the same, there was no exploitation, so skip this request. Finally, they memoize a lot of execution detail, meaning that similar requests (identified as same control flow, modulo different input) will be able to re-use the bits of the CFT that are independent of the patch and template variables affected by the input.

Their system is called POIROT, and based on a modified PHP runtime. It does not require any changes to application code. In evaluation, POIROT successfully detected five different types of attacks on MediaWiki in real-world Wikipedia traces, and various information leak vulnerabilities in HotCRP using synthetic traces. For examples of real CVE vulnerabilities on MediaWiki, the naive replay strategy for 100k Wikipedia requests (~= 3.4h) would have taken on the order of several hours, but is down to minutes with POIROT (12-51x faster than original execution). Their use of templates helps cutting the amount of code to run very significantly. For the logging in normal operation, POIROT adds ~5KB of logging data and ~15% increase in latency and throughput to each request.

Q (someone from Princeton): You seem to record a lot of information for each request. How much?
A: See results in eval; 5KB are all input required for a Wikipedia request, including cookie.

Q (someone from NICTA): How does your overhead compare to existing work? How could you reduce it?
A: Record a lot of non-deterministic input (e.g. random numbers generated), which may not be necessary for replay, as attacker often cannot exploit it.

 

Session 6: Potpurri

[N.B.: I was dealing with fixing syslog during this session, since the interest in this live-blog essentially DDoS'ed it; hence the coverage is a little sparse.]

Experiences from a Decade of TinyOS Development
Philip Levis, Stanford University

TinyOS started in 1999, as an OS for very small embedded micro-controllers. This talk is about design principles from embedded software, technical results found during the project, and things they should have done differently. First lession: minimize resource usage. Micro-controllers have very little resources: single or double digit numbers of RAM and ROM. Why not use low-power embedded ARM processors? Battery lifetime! With embedded micro-controllers, the system can run for years off a battery, while with the lowest-power ARMs, it is a matter of days. Debugging this stuff is very, very hard, especially in the wild, since it is not possible to simply attach a debugger to these tiny systems. A technique that helps is static virtualization. This is basically about compiling application and OS together, and merging at build time, making as much as possible static (even memory allocation and function calls). This enables whole-program analysis and optimization, as well as dead code elimination.
A non-technical lesson is that the researchers focused a lot on making increasingly complex applications possible, but at the same time, made it more difficult to implement simple, basic applications. island syndrome: increased barrier to entry

Q (someone from Harvard): How difficult would it be to re-architect the interface to the form that you would prefer, and that might make it more accessible for novice users?
A: Probably not that hard, but at the same time, Contiki has filled that role. It would definitely be possible, though.

Q (someone from NICTA): Your analysis seems to make this assumption that having lots of users is a good thing. Is that really the right metric?
A: I probably would go the same way and focus on research if I could go back. The point was more that we never thought about it, and I wanted to make people aware of it.

Q (Eric Sedlar, Oracle): Any insights on how you could publish something that says "this is more usable"?
A: Tricky, since the typical HCI/usability venues have much, much higher bars to what counts as "usable". There is some scope for publishing easier-to-use programming models, though.

Q: Attribute high learning curve mainly to the lack of tools on top of it. Maybe this would be different if there were better tools, like e.g. in the Java ecosystem?
A: Indeed, tools might help, but then again, TinyOS makes the fundamental assumption that a "novice" knows C. Also, developing tools is beyond the scope of a research community and does not gain research credit.

Automated Concurrency-Bug Fixing
Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu, University of Wisconsin—Madison

Bugs are important, and it would be nice if we could fix them automatically. But this is hard in the general case, as we need ground truth on what the correct behaviour is, and what counts as incorrect behaviour. If we restrict ourselves to the class of concurrency bugs, though, this becomes a little more tractable, since the fix often "just" amounts to inserting the correct synchronization primitives into the program. Their system, CFix, fixes concurrency bugs in six steps. First, they feed bug reports, buggy binary and input data in, then develop a fix strategy, determining if this is an atomicity or an odering problem. Synchronization enforcement, patch testing and selection, patch merging, run-time support.Two major contributions: OFix, a new technique that enforces order relationships, and a framework that ties together various existing tools for analysis and bug fixing. They leverage a bunch of existing bug detectors, and develop a set of template "fix strategies", which largely seem to be different ways of interleaving thread executions in order to avoid bugs (at least for the atomicity violation class of bugs). For order enforcement, OFix provides "allA-B" and "firstA-B" strategies, which seem to be about detecting when all work of type "A" is completed, then synthesizing signals to other threads, and having all threads running work of type "B" wait for these signals. This is somewhat more tricky when threads can spawn child threads that also run work of type "A", but they have some counter-based strategy that keeps track. Of course, OFix could introduce deadlock by synthesizing waits. As far as possible, they try to predict this and give up if it happens, or use timed waits to avoid it. Some cleverness exists to avoid signals if they are unnecessary (because e.g. B does not ever execute on this control flow path). The "firstA-B" strategy is similar, but signals after the first execution of work of type A. At later stages, they prune incorrect patches (fix strategies that do not fix the root cause, e.g. because they make the bug occur deterministically, rather than non-deterministically), and perform some optimizations.

In evaluation, they find that using a combination of four different bug detectors, they can find a large number of known bugs (standard set used for evaluating bug detectors -- not really surprising that they find them!).

Q (Florentina Popovici, Google): [missed this]
A: CFix can work with any bug detector.

Q (someone from MSR): You need to synthesize condition variables in order to perform your wait. Do you add them statically or dynamically?
A: There is one mutex/condition variable per bug report, and they are allocated statically.

 

All about Eve: Execute-Verify Replication for Multi-Core Servers
Manos Kapritsos and Yang Wang, University of Texas at Austin; Vivien Quema, Grenoble INP; Allen Clement, MPI-SWS; Lorenzo Alvisi and Mike Dahlin, University of Texas at Austin

[I sadly missed most of this talk due to working on fixing the blog. It seems that this is essentially about running parallel state machines (for dependability/fault tolerance) on multiple cores with clever synchronization. Key insight: as long as the result of a non-deterministic execution order is the same, we do not care. A lot of  the talk was spent talking about how divergence can be detected at low overhead; as a result, independent transactions can be executed in parallel. If divergence is detected, Eve rolls back and executes serially as a fallback. This has the benefit of "masking" concurrency bugs by replacing them with serial execution. In evaluation, they find that Eve is 6-7x faster than traditional state-machine replication, and only a little slower than an unreplicated parallel execution. A higher fraction of false positive conflict (divergence) events leads to performance asymptotically approaching traditional state-machine replication.]

 

Session 7: Replication

Spanner: Google’s Globally-Distributed Database (Best Paper Award)
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford, Google, Inc.

Spanner is a project that has been going on for 4-5 years at Google, and which now runs in production, serving Google's ad database. It's a key-value store and a full database, with general purpose transactions, SQL-like query language, etc., but also fully geo-replicated across data-centres on different continents. Data is also replicated and sharded in various ways. One of the key features is the ability to run lock-free distributed read transactions at global scale. Necessary for this is the property of global external consistency of distributed transactions, and Spanner is the first system to support this. A major enabling technology for this property is the TrueTime API, which provides tight global clock synchronization.

At the simplest possible granularity, we would like to generate output from a consistent snapshot of the database. However, if it is sharded, this becomes non-trivial, as we need to take the snapshots at the exact same time. As a consequence, data should be versioned using timestamps. For a fully consistent snapshot, we want not just serializability, but also external consistency, i.e. agree on global commit order. This requires a notion of global wall-clock time if we use timestamps as our ordering primitive. This can be achieved using strict two-phase locking (i.e. all locks must be acquired before any writes happen).

The TrueTime API provides a notion of global wall clock time with explicit uncertainty. A timestamp becomes an interval: the actual wall clock time must be within this uncertainty interval (this is a bit like the notion of interval arithmetic). For transactions with 2PL, we now set our commit time stamp to TT.latest (the last possibly timestamp) and cannot release them again before TT.latest has passed. Wait time is "waited out" by a logical spin called "commit wait". If distributed consensus between multiple participants in a transaction must be achieved, the overall timestamp chosen is the maximum of the participants decided commit timestamps. There are a lot more details in the paper about different read modes and atomic schema changes.

So how does the TrueTime API work? In different data centres, they have GPS receivers and atomic clocks attached to a set of machines. Machines frequently synchronize against several of these, and in the mean time model time uncertainty by assuming linear worst-case clock drift (200µs/sec). Of course, there is some network-induced uncertainty in the time synchronization, but this remains in the low single-digit millisecond ranges, meaning that the worst case uncertainty is ~10ms (4 network induced + 6 linear worst-case). In future work, they hope to get this down to <1 ms.

Q: Could you comment on the difference between external consistency and strict serializability?
A: The two are equivalent.

Q: Could you, instead of fixing a commit timestamp as at the "safe time", represent commit time as an interval?
A: Yes, but our data representation necessitated a single time.

Q: Is every epsilon based on the global GPS or atomic clock time, or do you have some kind of local reference epsilon?
A: Some data centres may have higher values than others; in that case, spanner will slow down when running in a data centre with higher uncertainty.

Q: The chance of the clock going rogue is slim, but what is the worst case scenario if this happens?
A: This is equivalent to assuming that the computer has stopped working. In that case, we need to eject it from the system, or otherwise timestamps will be chosen incorrectly.

Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary
Cheng Li, Max Planck Institute for Software Systems; Daniel Porto, CITI/Universidade Nova de Lisboa and Max Planck Institute for Software Systems; Allen Clement, Max Planck Institute for Software Systems; Johannes Gehrke, Cornell University; Nuno Preguiça and Rodrigo Rodrigues, CITI/Universidade Nova de Lisboa

Why does latency matter? Experiments from ad systems at Bing show that, as latency increases, revenue per user goes down significantly. At the same time, geo-replication is necessary to keep latency down. Replication traditionally necessitates a decision on strong consistency (with limited performance), or eventual consistency (higher performance). This talk is about how one can build a system that has both properties.

In their model, there are some operations that require total ordering (strong consistency), and some that are happy with eventual consistency. They call the combination "RedBlue consistency", which is partially ordered, maintaining strong consistency when necessary, but goes for eventual otherwise. A local site can always accept a "blue" (eventually consistent) operation without any coordination with other sites, while red operations require coordination as they must be serialized. Their "Gemini" coordination system is based on a special token (the "red flag") that must be held in order to execute strongly consistent operations, and can only ever be held by one site in the distributed system (although it can be passed around). Challenge is now to make sure that all sites converge in this scenario. One key insight that allows "blue" operations to be used in systems that otherwise require strong consistency is that operations can often be split into several sub-operations (e.g. deciding on a value and actually applying the change) that need not all be strongly consistent in order to produce the same end result (deciding the value is "red", but applying the change can be "blue"). A blue "shadow operation" must commute with all other operations AND break no invariants; otherwise, it must be a red operation.

Evaluation: three key questions -- (1) How many blue operations can we extract from workloads? (2) Does RedBlue consistency improve user-observed latency? (3) How does throughput scale with an increased number of sites? For (1), looked at TPC-W, RUBiS and Quoddy (some kind of social networking app). While none of the existing operations in these could be made blue, there were 4-17 extractable shadow operations that could, and these accounted for >90% of workload runtime in all cases. To see the latency improvements, they ran some TCP-W experiments on EC2, and found that latency goes from thousands of ms for a remote site access when using only red consistency to <100 ms at all sites when using RedBlue. Peak throughput was also improved in a multi-site setup, with more sites adding more throughput (due to parallelism in blue operations).

Q: Can shadow operations be identified automatically?
A: Future work; currently doing it manually.

Q: How much effort to transform existing code into using RedBlue consistency?
A: About a week to familiarize with code base.