The second day of the 24th ACM Symposium on Operating Systems Principles has just started. Several of us are here and will be live-blogging the talks and Q&A sessions.
Asynchronous Intrusion Recovery for Interconnected Web Services
Ramesh Chandra, Taesoo Kim, Nickolai Zeldovich (MIT CSAIL)
- Many internet services are inter-connected. For example, when building a website one may be using Facebook Login, Bill On services and so on. The problem is that if one of the services has a bug or is compromised then the attacker can access the other services as well.
- Use rollback-and-replay for recovering integrity in single machine. However, there are some challenges such as a global coordinator is required and all services must be available during recovery.
- Contribution: define a repair protocol that doesn't have a central coordinator.
- The system has 2 types of execution: normal (log enough for rollback-and-replay), repair (identify attacker and initiate repair).
- Missed the rest of the talk...
Optimistic Crash Consistency
Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau (University of Wisconsin-Madison)
- Single file-system operation updates multiple on-disk data structures leaving some structures non-updated in case of crash.
- Crash consistency is built upon ordered writes. However, file systems conflate ordering and durability. Ineffcient when only ordering is required.
- Optimistic File System (OptFS) - provides strong consistency and performance by decoupling ordering and durability.
- OptFS - has a new primited osync() - that provides ordering.
- Journaling FS - before update FS, the system writes a note describing the update.
- 1) Data write (F), 2) logging metadata (JM), 3) logging commit (JC), 4) checkpointing (M)
- flushes are introduced to make sure that data is moved from buffer cache.
- without flushes, blocks may be reordered.
- Probabilistic Crash Consistency - re-ordering leads to windows of vulnerability. (e.g. JC is written before the others).
- P-inconsistency = time in windows of vulnerability / total IO time
- p-inconsistency - database workloads have high p-inconsistency.
- OptFS - freshness = indicates how up-to-date state is after a crash.
- eliminates flushes in the common case.
- some re-orderings are detected after the crash and for other re-orderings they prevent them to happen.
- Async Durability Notifications (ADN) - notifications sent when blocks are made durable => blocks can be destaged at any time in any order.
1) when the flush after JM is removed then the re-ordering is detected using checksums that are computed over data and metadata.
2) when the second flush is removed (after JC), M could be re-ordered before D or JM or JC. Re-ordering is prevented using delayed writes.
- new FS primitives:
- osync() - provides only ordering of writes. Changes semantics from ACID to ACI-(Eventual Durability).
- dsync() - provides durability.
- run SQLite on top of OptFS. Replaced fsync() with osync().
Q: You seem to treat every crash as truncating? Can you talk a little bit about how physical disk handle failure?
A: We assume that the disk cache losses power as well in case of a crash.
Q: You're saying that you're modelling the worst case, but that doesn't give a good sense of what's happening in practice.
A: Yes, that's true.
Q: If you're modifying the disk controller. Why not just a ordering info to each block?
A: Our way puts let's stress on the disks. We do it in the file system.
Do Not Blame Users for Misconfigurations
Tianyin Xu, Jiaqi Zhang, Peng Huang, Jing Zheng, Tianwei Sheng (UC San Diego), Ding Yuan (University of Toronto), YuanyuanZhou (UC San Diego), Shankar Pasupathy (NetApp Inc)
- 31% of severe customer issues are due to missconfigurations.
- Goals: 1) react gracefully to misconfiguration. In some systems up to 20% configurations lead to complete unavailability.
2) make configuration intuitive & less error-prone.
- Spex: - automatically infer configuration constraints by statically analysing source code.
- outputs a set of configuration constraints.
- has two components: mapping engine and inference engine.
- mapping is not trivial because of different conventions (e.g. structure based, comparison-based, container-based).
- structure based: data structures used to maintain config information.
- comparison based: the system uses string comparison in the parsing function.
- container based: the software get the values using getter functions.
- with the help of annotation Spex is able to infer 4 types of constraints: data type, data range, control dependency (X depends Y), value relationship (e.g. X < Y).
- Use case of constraints: to detect error-prone config. design (e.g. case sensitivity, unit, silent overruling).
- Inference accuracy is not 100%. On average they achieve 90.6%.
Q: There are config issues that lead to performance issues. Have looked into that?
A: We do not catch that. We catch only the miss-config that clearly violate the requirements.
Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, Armando Solar-Lezama (MIT CSAIL)
- compiler = faithful translator. This is not true if the code contains undefined behaviour.
- e.g. C spec: pointer overflow is undefined behaviour = > C can remove buf overflow checks.
- Unstable code: compilers discard code due to undefined behaviour.
- unstable examples: pointer overflow, signed integer overflow, oversized, shift...
- tested 12 compiler against 5 unstable examples. Also checked the history of clang and gcc to see how this bugs creep in as the compilers become more aggressive over time.
- STACK - tool to identify unstable code
- C/C++ source -> LLVM IR -> STACK -> warnings
- mimic a compiler that can selectively decide if user write or not unstable code. On the first run, STACK works with the assumption that users don't write code with undefined behaviour. The second time it runs without the assumption. Report unstable code from the difference of two runs.
- limitations: if phase 2 not powerful enough then some unstable code may be missed. If phase 1 is not powerful enough then we may get some false positives.
- evaluation: - found 160 new bugs.
- for Postgres STACK produced 68 warnings out of which 4 were false warnings.
Q: When you run the solver you say in the paper that you timeout and you say that you choose a timeout. What the distribution of the solver time vs finding more bugs?
A: We tried a bunch of timeouts. We went up to 30 s and we didn't see much change.
Q: How hard is for the compiler to implement similar functionality?
A: Our tool doesn't care that much about performance. Compilers want to compile as fast as possible, hence they want something simpler. There's a trade-off between speed vs undefined behaviour found.
Q: Did you find any unstable behaviour in gcc?
A: Yes, we do.
Q: Can you integrate how to integrate this better into the developer flow?
A: Large companies have some discipline. Every night you run the tools.
Transaction Chains: Achieving Serializability with Low Latency in Geo-Distributed Storage Systems
Yang Zhang, Russell Power, Siyuan Zhou, Yair Sovran (NYU), Marcos K. Aguilera (Microsoft Research), Jinyang Li (NYU)
- We are tradeoffing consistency vs latency. Difficult to find a point.
- Contributions: proposes a new primitive callend transaction chain, built a geo-replicated system callend Lynx.
- Transaction chains: split transactions
- Problems: what if chain fails after executing first-hop?
what if the user aborts the chain in the middle?
- They provide all-or-nothing atomicity implemented by durably logging chains at first-hop and by allowing user-aborts only after first-hop.
- Non-serializable interleaving solved by detecting non-serializable interleaving via static analysis. (i.e. statically analyzing all the chains to be executed).
- Limitations of Lynx: - chains are not strictly serializable.
- programmers can abort only at first hop.
- Four tables system: one storing tweets, follow-graph, secondary follow-graph and a table containing the join of tweets and follow-graph.
- Replicate across 3 data centers that have a latency of 153ms among them.
- Follow-user has a latency of 174ms and post-tweet has a latency of 252ms.
- Systems achieves 184k/s for follow-user.
Q: Lynx statically analyzes before hand. How does it have access to all the chains?
A: The chains are know and we don't believe that we're introducing high overhead.
Q: You said that after the first segment of your chain the app can abort. What do I do if in the later steps of the chain my app reach a bogus state. Is the developer required to write some\
code to recover from that state?
A: It's a limitation of our system. We don't have a good solution to that.
SPANStore: Cost-Effective Geo-Replicated Storage Spanning Multiple Cloud Services
Zhe Wu, Michael Butkiewicz, Dorian Perkins (UC Riverside), Ethan Katz-Bassett (USC), Harsha V. Madhyastha (UC Riverside)
- Data uploaded by a user may be viewed/edited by users in other locations.
- Minimizing cost is also important in designing geo-replicated storage systems.
- SPANStore - Key/Value storage spanning cloud storage services
- objective: minimize cost
- placement manager: takes as input pricing policies, inter-DC latencies, latency, consistency and FT requirements.
- placement manager decides on the replication policy.
- Cost includes storage cost, request cost, data transfer cost. Storage service cost combines the 3 previously mentioned costs.
- Techniques for reducing cost:
- 1) use multiple cloud providers. Advantages: more locations for DCs and more pricing options.
- 2) replicate everywhere to decrease latency
- 3) aggregate workload prediction per access set (e.g. diurnal and weekly patterns). Classify objects by access set.
- Optimal replication policy depends on the application and on the workload.
- Placement manager uses Integer Linear Programming to decide on placement.
- simulated SPANStore agains: single replica, single cloud and replicate everywhere strategy.
Q: What if the compute workload is distributed on multiple cloud providers?
A: We assumed that the application is always deployed on the same provider.
Consistency-Based Service Level Agreements for Cloud Storage
Douglas B. Terry, Vijayan Prabhakaran, Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera (Microsoft Research), Hussam Abu-Libdeh (Cornell University)
- Cloud storage providers replicate data widely. The tradeoffs are consistency, availability and performance.
- What consistency do they provide to their customers?
- Problem: - app developers have to choose consistency at the time they're developing their applications.
- no single choice is best for all clients and situations.
- Measurement strong vs eventual consistency: the latency of eventual and strong consistency varies in each DCs.
- Pileus - replicated, partitioned key-value store that provides choice of consistency.
- primary core that uses strong consistency.
- secondary nodes that lazily receive updates.
- all put operations go to the primary core. Get can go to any node as long as the nodes are sufficiently up to date to provide the desired level of consistency.
- Choices of consistency supported by Pileus: Strong, causal, bounded staleness, read my writes, monotonic reads, eventual consistency.
- Tested and found that intermediate do really provide intermediate performance.
- Consistency-based SLA: app declare desired consistency/latency.
- The SLA consistency of 3 parts: consistency requirement, latency requirement and utility.
- It's achieved in the client library. The library has to node which nodes store the data and how expensive is to communicate with those nodes.
- SLA enforcement: Compute for each SubSLA and node compute the probability of meeting the latency and consistency requirements. Following, the expected utility is computed P(latency) * P(consistency) * utility.
Q: How about tail distribution of the delays in the network?
A: The variance in the network delays was not very high.
Q: How I as a developer reason about what I get? Do I have to distinguished 6 different consistency cases I may get.
A: You have to develop your app for the consistency levels you support.
Q: Once you've returned a weaker consistency, shouldn't you just drop for all the following requests?
A: No, not necessarily...
Tango: Distributed Data Structures over a Shared Log
Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran (Microsoft Research), Michael Wei (UCSD), John D. Davis (Microsoft Research), Sriram Rao (Microsoft), Tao Zou (Cornell University), Aviad Zuck (Tel-Aviv University)
- Common design patter: distributed data, centralized metadata.
- Centralized metadata services are usually built using in-memory data structures.
- Adding high availability. Currently, there are 3 options:
1) Move state to external service like ZooKeeper
2) restructure code to use state machine replication.
3) implement custom replication protocols.
- Tango provides an in-memory view + history ordered updates in shared log.
- The shared log is the source of persistence, availability, elasticity, atomicity and isolation.
- Tango objects - implement standard interfaces (Java/C# Collections).
- linearizability for single operations.
- serializable transactions. Under the hood
- Performance - fast shared log based on CORFU
- introduces a streaming abstraction over fast shared log. readnext(streamid) and append(value, streamid). Each client only plays the events it is interested in.
Q: You say that when you recover you go back and you replay the history? How far do you have to go back?
A: There's no magic. If you want to do it fast you have to checkpoint.
Q: Can you tell something about the sync issues that arise?
A: Whoever gets to the shared log first wins. The other one backs-off.
Verifying Computations with State
Benjamin Braun (UT Austin), Ariel J. Feldman (University of Pennsylvania), Zuocheng Ren, Srinath Setty, Andrew J. Blumberg, Michael Walfish (UT Austin)
- Suppose there's a client executing a MapReduce job. However, the server may return an incorrect solution.
- Problems with previous work: the computations have to be stateless, the client incurs a large setup cost and the server's overheads are large.
- Pantry is based on Zaatar.
- The input is a program in a subset of C -> constraints on execution -> theoretical tools (PCPs) -> client executable & server executable.
- Missed the rest of the talk...
There Is More Consensus In Egalitarian Parliaments
Iulian Moraru, David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs)
- Paxos - no external failure required.
- In DCs Paxos is used for synchronization and resource discovery.
- Paxos tolerates F failtures with 2F + 1 replicas, replicas can fail by crashing.
- At least 2 RTT to committing a command.
- Multi-Paxos - there's a designated leader which is an owner of all the slots.
- 1 RTT to commit.
- potentiall bottleneck for performance and availability.
- Can we have - the high throughput of Multi-Paxos.
- constant availability of Paxos, distribute load evenly across all replicas.
- EPaxos - split the linear space into n linear spaces (where n is the number of machines). Each machine gets complete control over a linear space.
- can choose any quorum for each command.
- 1 RTT for non-concurrent commands and 2 RTT for concurrent commands.
- Execution - find maximum strongly connected components over the ordering graph. Following the components become nodes in a DAG. This gives linearlizability and fast-path quorum.
- EPaxos can do around 27k operations/sec with 5 replicas.
- When one replica is slow the performance EPaxos degrades slower compared to other implementations.
Q: Can interference continue..and then you end up having 3..4 RTTs?
A: No, that's not the case..missed the rest.