Several of us are attending the LADIS workshop in advance of SOSP in Nemacolin Woodlands in colourful Pennsylvania today and tomorrow. We will cover some of the papers and keynotes in our live blog below.
- 7th edition of the workshop, moving between conferences (theory and systems)
- Chairs alternate between the US and Europe
- 36 submissions, 12 papers accepted
icg: Ionel Gog
ms: Malte Schwarzkopf
Keynote I: SDN for the Public Cloud: Windows Azure
Albert Greenberg, Microsoft
- Work of 100 engineers of several years
- Greenberg joined Windows Azure (WA) in 2010, and they worked on ‘this’ (SDN deployment, presumably) since then
- WA is Microsoft’s SaaS offering; they are also moving their own sites and properties to it now
- Fairly rapid growth, fully geographically distributed
- IaaS offering: simply run your own Hyper-V VM (à la EC2)
- What does IaaS require from the network?
- Would like to be able to bring private network to “the cloud”
- Need VPNs to extend into WA, which requires multiple secure overlay networks inside WA
- Microsoft’s technology for this is called NVGRE (Network Virtualization …)
- Challenge: scale; key insight: keep hardware oblivious of policy and implementation, and do this in software (… just the usual SDN sales pitch)
- Q: What are the customers’ requirements on ingress and egress latency in such a virtualized environment?
- A: Predictability is much more important to enterprise customers than ultimate performance.
- Goal: effect mapping of customer IP addresses (CAs) to provider addresses (PAs) of hosts running the VMs
- ‘Seperate the interface to specify a VNet from the interface to plumb mappings to programmable software switches in the host machines using the network controller’
- Q: Do you just supply connectivity or do you also offer a notion of QoS here?
- A: VMs are offered as parts of profiles, which include CPU, memory, network speed allocations. But currently only NIC speed is customer-selectable.
- Secure connectivity for intra-tennant VMs is done via packet encapsulation.
- However, to bridge a customer network in the WA data center with the rest of the customer’s enterprise network, it needs a VPN gateway.
- Key insight is that re-programmable L3 forwarding policies permit setting this up easily
- VNet is a very successful product: >40% of WA VMs use it now
- Q: How do you scale the controller to support many VNets?
- A: Refactored it a couple of times. Relies on directory service, which is already scalable. But at the bottom of this, they always have something like Paxos running.
- Q: <missed>
- Q: What state do the VMSwitches (host software switches) hold?
- A: Maintaining state is key to making all of this work, but cannot have state for all VMs in every VMSwitch. So have a learning/caching approach backed by a directory service.
- When running most cloud services, load-balancing is another feature in demand
- Again, can solve this using SDN with an all-software approach: simply expose a virtual IP and remap the requests coming in to dedicated IPs
- The load balancer runs in several VMs, and establishes stateless tunnels to the actual request handling VMs
- Responses are sent directly, rather than tunnelled through the LB VM
- SDN is also useful for many other purposes: ACLs, billing, metering, etc.
- Obviously, the VMSwitch is kinda important in this whole scheme. It’s a bit/lot like an OpenFlow switch.
- Big bet on SDN seems to pay off for Microsoft.
- SDN is a great abstraction, but the implementation is not easy!
- ‘Typed tables’ in VMSwitch help — they allow interactions between multiple tables for any flow (e.g. to get a load-balanced VNet, using multiple features)
- 40G Ethernet makes it challenging to keep up with line rate, and the sheer scale of systems makes it necessary to build distributed/partitioned controllers.
- Need to make VMSwitches request policies, rather than pushing a globally synchronous view; “eventing” supports this (looks like some kind of synch-RPC-callback scheme to me)
Session 1: Caching
Cooperative client caching strategies for social and web applications.
Stavros Nikolaou, Robbert van Renesse and Nicolas Schiper (Cornell University).
There are some advantages to trying to use client devices to cache popular web content. In this work, they look at what benefits leveraging social link information can have for such client caching. The clients are connected by a peer-to-peer network. The centralized backend service interacts with requesting clients, but only provides meta-data: it tells the client where to fetch an object from a client cache. The setting considered here is a social network one, where each content item is owned (uploaded) by an individual user.
Different caching strategies are conceivable. In an opportunistic approach (widely used in CDNs and previous work, e.g. Maygh at EuroSys 2013), clients cache the objects that they request and maintains a cache managed under LRU policy. Another possible strategy is a proactive one, which aims to maximize client hit rate. This approach is based on the client relationship graph, where there is a higher probability of accessing content owned by neighbours than of random content. In this situation, it is beneficial if any first client in a community to obtain an object proceeds to then push it out to all of the clients in the same community. Yet, a third approach is a minimalistic one, where the goal is to minimize load on the back end service and thus to cache as many diverse objects as possible. This works like opportunistic caching, but allows a client only to cache an object if it is not already cached somewhere else.
In their evaluation, they show that the local (with-community) client hit ratio for the proactive strategy is consistently the highest (60-70%), as one would expect. However, the opportunistic approach also achiaves around 50%, while the minimal approach achieves little over 10%. For the global hit ratio, there is no difference as soon as the number of clients is large. In terms of balancing the per-client bandwidth, the proactive approach is worst and the opportunistic one is best.
Q: Evaluation is a little bit like cheating. You’re assuming that the social graph is a good model for the client graph. However, this is only a good model if the users are Twitter/Facebook users. Have you looked at other traffic patterns?
A: No we haven’t considered that yet. We don’t have the data.
Allen Clement: You said that the overheads were about 5KB/s, but how much useful information is the client getting?
A: It’s interesting but we haven’t studied that.
Malte Schwarzkopf: You make the assumption that improving hit rate is always a good thing. However, that’s not necesarily the case. I have a lot of friends that I am connected and they live in the US. That’s not good because the system will end up caching data in the US and the transatlantic link is expensive. Have you considered if the social network is a good approximation?
A: We would like to measure to what extent social-proximity is related to geo-location proximity.
Cache Provisioning for Interactive NLP Services.
Jaimie Kelley, Christopher Stewart (Ohio State), Yuxiong He and Sameh Elnikety (Microsoft Research).
- Data is growing too fast to keep in main memory.
- Quality-Aware Cache management – for NLP evit data that will cause the least quality
- Quality loss – ideal vs real: the differences in the replies given by the NLP system.
- challenges: synonyms.
- Quality loss in NLP services – used Lucene with Redis and DiskSim.
- For the ideal NLP services the same processing without any latency bound.
- Query trace obtained from Google Trends.
- Tested quality loss on LRU.
- Tested quality with limit content. (i.e. remove some of the data) – quality loss rises
sharply when documents are removed from the data set.
- NLP services can remove some redundant content with little quality loss.
Q: Synonym seemed to save you. Is there a relationship between the length of the queries and the reply?
A: Yes, there’s a strong connection.
Q: Do you the temporal meta-data about the objects to evict them?
A: You can do that with limited content methos. The method we used is very similar to that. You can also eliminate whatever happens to X after a timestamp.
GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores.
Conglong Li and Alan L. Cox (Rice University)
- Key-value stores support two basic operations: get and set.
- When full, KV use LRU for eviction.
- RUBIS: latency 240ms (for 4% of the request), 60-95ms (for 70% of the request).
- GD-Wheel – taking the cost into consideration in order to improve read access latency.
- Uses YCSB benchmark to evaluate.
- GD-Wheel – implementation of the GreedyDual algorithm
- Hierarchical cost wheel – a data structure used to reduce the time complexity of GreedyDual.
- Key-value pairs with different costs are stored in different queues.
- Implemented GD-Wheel in Memcached
- Cost input added to set requests. The cost is an abstract concept that doesn’t have a unit.
It could be time duration.
- FB workfloads: – most keys are no more than 32 bytes.
- values: median size 135 bytes, mean size = 954 bytes.
Q: It may be worthwile to do a comparison to M3C? from CMU.
A: MemC3 is a variation of Memcached. Our work is focused on reducing latency.
Q: You’ve mentioned GreedyDual. Have you compared against it?
A: GreedyRedual is a cost aware published <missed rest>
Towards a storage system for connected homes.
Trinabh Gupta, Amar Phanishayee, Jaeyeon Jung and Ratul Mahajan (Microsoft Research).
- What are the right storage abstractions for home applications?
- Deployed/existing applications (e.g. PreHeat, DigiSwitch, Energy Data Analytics,
Diginal Neighborhood Watch).
- Example of requirements: filtering based on tags, range queries over time,
support app specific location policies, confidentiality of data.
- BOLT – supports time-series data and provides privacy
- Design of Bolt
- stream based abstraction. (e.g. create new stream of data, read/write to/from stream of
data). Specifies the location of the stream, where the stream should be located.
- append(tag, value). Followed by get(tag) to retrieve data.
- Approach: 1) specialized index on top of data
2) separate index from data
- Privacy of data stored on untrusted servers: 1) hash, encrypt the data
2) decentralize access
- Support app specific storage policies: 1) segment stream, location per segment
- Inefficiens: – overhead of storing a hash for each datavalue => group many datavalues into one check, hash and encrypt each chunking.
- reliance on metadata server: use quorom on metadata reads.
- Found that: – prefetching data (chunks) improves read throughput up to 3x for temporal range queries.
- bol segments are scalable. Around 3.6% overhead.
Q: Broken devices may create bad data. Is there any way of retracting the data? The opposite of an append.
A: We don’t have any method of going back to delete some value.
Q: Did you look at he IO behaviour of the streams you’re generating? You had different ways at looking at the data. The index points to various places in your stream. If you’re using the stream then your IO will be bad.
A: It depends on how many tag you have.
Q: Did you look at it?
Smart home technology is becoming increasingly common, but there is no good storage solution for these sorts of distributed and connected applications yet. Various sensors will store current and historic data, and applications may need to run various types of queries and analyses over this. Indeed, multiple smart homes may collaborate by sharing sensor data, and data should be available even when sensors are offline.
Cloud storage would be an obvious candidate here, but has trust issues. Various platforms exist (e.g. HomeOS), but do not support some of the desired features (range queries or tags); existing database systems are “inefficient to run over untrusted storage servers” (?); stream processing systems also do not fit the bill in terms of data retention. Bolt (their system) supports all of the needed features including secure and efficient sharing.
Bolt’s API is centred around streams with certain declarative properties (e.g. location, security, data size) and has operations corresponding to certain events (e.g. append()), and analysis features (get(), range queries, tag-based filtering etc.). The data storage layout is an append-only log, augmented with an index. Index entries are timestamped, and time-based range queries act on the index before retrieving the data from the log (assume that bulk data is much larger than the index). Integrity of the stored data is ensured by storing hashes of the data with each record entry in the index, and data values are encrypted. The index can be stored on untrusted service, too. Finally, Bolt offers declarative control for applications over the data storage locations.
The problem of the index growing large due to the hashes is solved by batching multiple data values before hashing. There is also a metadata server (unclear to me where that lives), but Bolt tries to relax its reliance on its availability.
Evaluation: consider storing data locally, on Azure (while trusted) or on Azure while untrusted. Prefetching improves throughput by 3x; the encryption overhead is negligible as the remote I/O operations dominate. The segmentation of Bolt streams is beneficial and queries scale well.
Q: Seems like you have a partitioned, distributed temporal database. How do you deal with clock skews between different homes, for example?
A: Currently assume synchronized clocks.
Q: How portable is this to other cloud providers?
A: Very portable, only needs put/get API.
Q: Can the system support retraction of data?
A: Yep, a delete() operation exists, although this is only at the granularity of a complete stream.
Tachyon: Memory Throughput I/O for Cluster Computing Frameworks.
Haoyuan Li, Ali Ghodsi, Matei Zaharia, Eric Baldeschwieler, Scott Shenker and Ion Stoica (UC Berkeley).
- RAM throughput increasing exponentially.
- Disk throughput slowly increasing.
- Problems: Data shared across jobs must be written on disk and disk scans are slow for read.
- One copy of data in memory (Fast).
- Upon failure, re-compute data using lineage (Fault tolerance).
- Challenges: lineage API, recomputation cost, recomputation resource allocation.
- Lineage – Binary program, configuration, input files, output files.
In-memory computing is really important. RAM throughput is increasing exponentially (capacity doesn’t, though), while disks don’t scale at the same rate. Hence, memory locality is important (?). Many existing big data frameworks (e.g. Spark, Shark) already use in-memory data to accelerate computations, but they have to store the data to disk in between jobs. Tachyon fixes this problem by offering a caching distributed file system, and uses lineage information to recompute missing data and offer fault tolerance.
Of course, this only works with some caveats: it requires immutable data, deterministic computation, the working set must fit into memory. Challenges: lineage API, ensuring that recomputation cost does not become ridiculous, balancing recomputation with other work going on in the cluster. Tachyon workers simply have a RAM-disk that they expose to jobs, so these can read at memory speed directly. For the lineage tracking to work, they need to keep track of binary, configuration, inputs/outputs and dependencies (this is a lot like CIEL’s data flow mechanisms).
What is the re-computation cost? They reduce it by asynchronously checkpointing to the durable storage system every now and then. Some tweaks to checkpointing (latest result first, etc.) can improve performance.
Evaluation: as one would expect, memory read/write is much faster than disk.
Q: Tachyon using RAM obviously takes this RAM away from working sets of running jobs. Have you thought about this trade-off?
A: Not really. But RAM capacity is scaling exponentially, while core count is not, so memory will be plentiful in the future.
Q: Is there anything that HDFS is still better on than Tachyon?
A: It’s more stable!
Q: Here’s an example: if you have a JOIN between two large relations that produces a tiny output, it’s much better to pay the cost of replicating that than to re-run the jobs based on lineage information.
Q: Have you tried increasing the number of dirty pages in the buffer cache to get a fairer comparison against HDFS?
A: No, not tuned that parameter.