For the next couple of days, I am attending OSDI in Hollywood. However, due to various scheduling constraints on both sides of the Atlantic, I only arrived there at lunch time on Monday, and missed the first session. Fortunately, in addition to my delay-tolerant “live blog” from the plane, where I read the first session’s papers, Derek Murray was kind enough to take some notes on the actual talks. Normal live-blogging service of the talks will be provided for the other days! :)
Session 1: Big Data
Flat Datacenter Storage
Edmund B. Nightingale, Jeremy Elson, and Jinliang Fan, Microsoft Research; Owen Hofmann, University of Texas at Austin; Jon Howell and Yutaka Suzue, Microsoft Research
“Flat storage”, a.k.a. a simple network file server, is simple and neat. In conventional data centres, however, we see more complex approaches that try to move the computation to the data, since data motion is expensive as a result of the tree topology of the DC. Many common jobs however inherently require data motion, and MR et al. do not cut it for those. However, we can now build full bisection bandwidth DCs, so the locality constraint can go: in FDS, “all compute nodes can access all storage with equal throughput”. It is a clean-slate re-think of DC storage, and the guiding principle is statistical multiplexing of I/O across all disks and network links in the cluster. Data is structured in blobs and “tracts”: a blob is a sequence of mutable tracts, which are small units of data (~8MB) named by a 128-bit GUID. Tracts are stored directly via raw block device access (i.e. there is no file system), and all meta-data is kept in memory (!). Simple non-blocking API with atomic writes, but no guarantee on write ordering (i.e. weak consistency in the presence of failures). FDS does not use explicit meta-data; instead, it has a Tract Locator Table (TLT) that deterministically maps GUID + tract ID tuples to tract servers (but not data on disk). It also has no durable state, and its state can be entirely reconstructed on the fly from others. Only interaction with the TLT is on process launch, then long-term caching of tract servers. Per-blob meta-data (e.g. size) is in special “tract -1″ and accessed just like data. Blobs also have an atomic “extend” operation (cf. GFS’s “append”).
Tracts are replicated, and a mechanism similar to RAMCloud is used to recover data on lost disks or machines. Writes are sent to all replicas by the application library (after TLT lookup, if necessary), and only complete when all replicas have acknowledged. Meta-data changes are sent to a “primary” replica, which executes 2PC to update everyone else. Replication level is a per-blob setting. Fault-tolerance is built around the key ingredient of version numbers on TLT rows, which are used to reject stale requests referring to the pre-failure conditions. Story on network partitions is a bit weak — they rely on only a single meta-data server running, and manually configure it (though looking into Paxos). TLTs are generall O(n^2) for n disks, representing all pairs, plus some random extra replicas. This can get quite big, but they have some optimization that decrease size.
Network is using a lot of shiny 10G hardware, with 5.5Tbps bisection bandwidth at cost of $250k. They found it hard to saturate 10G with a single TCP flow, since a single core cannot keep up with the interrupt load. Spreading them, using multiple short flows (design characteristic of FDS) and zero-copy architecture all help with this. They use RTS/CTS notification to avoid incast and receiver-side collisions. Evaluation test cluster is heterogeneous; they say that dynamic work allocation was key to utilizing it efficiently. They find ~1GB/s write and read throughput per client at nearly linear scalability. Random and sequential read and write performance at tract-level is identical; with triple-replication, maximum write performance goes down to ~20GB/s (from ~60GB/s). Max throughput they could reach was ~2GB/s per client (using 20Gbit network connections). Failure recovery is fast: 3.3s for 47 GB on 1,000 disks; 6.2s for 92 GB; 33.7s for compensating a whole machine failure (~655GB). Using FDS, they managed to claim the Daytona and Indy sort records, sorting around 1.4TB in 60s.
- “Little-data” is a solved problem, when you have a tightly-coupled setup with some processors and a RAID array directly connected.
- Dynamic work allocation is an old idea that has been lost in the move to big data.
- FDS = blob storage that does metadata management and physical data transport, and can scale to a whole datacenter.
- FDS has no affinity — that’s why it’s called flat.
- Uses a CLOS network with distributed scheduling.
- High read/write performance (2 GB/s, single-replicated from a single process).
- Fast failure recovery, and high application performance — the example used will be disk sorting (also web index serving and stock cointegration).
- Unit of data is an 8 MB “tract” — basic unit of reading or writing.
- API: CreateBlob, WriteTract and ReadTract.
- Component: tractservers that respond to read/write requests, and a metadata server. API hides tractservers’ existence.
- Metadata management is distributed, with no centralized components on common-case paths.
- Spectrum of existing ideas: from GFS/HDFS (totally centralized, big bottleneck, too-large extents (64MB)) to DHTs (fully decentralized, but multiple trips over the network to do read/writes, and slow failure recovery).
- FDS is in between on this spectrum.
- There is a centralized metadata server, but the client has an oracle that maps Blob_GUIDs and Tract_Nums to tractserver addresses (consistent, pseudorandom mapping). Reads and writes don’t generate traffic to the central server. Oracle is a table of all disks in the system (tract locator table) that is distributed to all clients — (H(BlobGUID) + Tract_Num) % Table_Size -> locates the appropriate servers in the system.
- Special metadata tract, numbered -1 — spreads metadata pseudorandomly across the system.
- FDS supports atomic append, by doing a 2PC on the metadata tracts.
- Networking: assume an uncongested path from tractservers to clients. Traditionally datacenter networks are oversubscribed, but building a CLOS network is much smarter. FDS provisions the network sufficiently for each disk.
- Largest testbed has ~250 machines.
- Full bisection bandwidth is only stochastic, which creates a problem for long flows. FDS generates a lot of short flows, which is ideal for load balancing in a CLOS network. CLOS networks push congestion to the edges, so you still need to do some traffic shaping. Short flows are not great for TCP, so there is some virtual circuit management described in the paper.
- Can read 950 MB/s/client and write 1150 MB/s/client. 516 disks saturates with <50 clients, and 1033 disks saturates with ~200 clients. (Single replicated, 3-replicated reduces the write throughput naturally.)
- Fast recovery because all of the transfers can happen in parallel. As the cluster gets larger, recovery gets faster. The disk table is constructed so that all disk pairs appear in the table (giving it size O(n^2) in the number of disks).
- - Recovery at about 40 MB/s/disk — all the way back to stable storage. A 1TB failure in a 3kdisk cluster recovered in ~17s.
- Sorting application — 2012 world record for disk-to-disk sorting. This is based on MinuteSort, which is how much data can you sort in a minute? Using <1/5 of the previous (Yahoo!) record, sort almost 3x the data (1470GB vs 500GB) in 59s. Also beat the UCSD sorting record, which however is a bit more CPU-efficient.
- Dynamic work allocation — ignore data locality constraints to allow everyone to pull work from a global pool (mitigating stragglers). Works well for sorting on FDS.
- Q: When extending a file, you need to contact replicas before extending the file — isn’t that a bottleneck? No because we distribute this across the tractserver for block -1. Also tracts are lazily allocated, so there’s little cost to extending more than you need.
- Q: Did you compare your system to commercial products, and how do you think it might scale to orders of magnitude larger? Couldn’t afford to buy a large commercial cluster. At the scale of the network we built, you could just buy a big Arista switch with guaranteed full bisection bandwidth, so we think the CLOS network will scale to 55k or more servers. Everything we’ve seen has been strikingly linear.
- Q: How well does this work when the map phase of a job will typically reduce the amount of data five-to-one? Today, a lot of clusters are built where you can only extract good performance as a data-parallel ship-computation-to-storage job. Sort is a classic example of this as an I/O torture test. FDS gives the programmer more flexibility to express the computation in the most appropriate way for their job.
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, and Danny Bickson, Carnegie Mellon University; Carlos Guestrin, University of Washington
[to be added; please check back]
GraphChi: Large-Scale Graph Computation on Just a PC
Aapo Kyrola and Guy Blelloch, Carnegie Mellon University; Carlos Guestrin, University of Washington
[to be added; please check back]
Session 2: Privacy
Hails: Protecting Data Privacy in Untrusted Web Applications
Daniel B. Giffin, Amit Levy, Deian Stefan, David Terei, David Mazières, and John C. Mitchell, Stanford University; Alejandro Russo, Chalmers University
[to be added; please check back]
Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels
Alan M. Dunn, Michael Z. Lee, Suman Jana, Sangman Kim, Mark Silberstein, Yuanzhong Xu, Vitaly Shmatikov, and Emmett Witchel, The University of Texas at Austin
People want to run programs without leaving any traces. Claim: current “state of the art” is private browsing. But this doesn’t really work, as there is no OS support for privacy. There remain traces in the OS, and the application cannot do anything about it. Buffers remain (e.g. Pulseaudio, X server framebuffer stuff, network packets etc.). People have fixed this by zeroing memory on deallocation, but for some reason that I missed, that is not enough (no deniability?). So the goal is to make a system that has forensic deniability, and imposes low overheads and only on the “private” programs. Their system is called “Lacuna” and based on Linux+KVM. Applications are unmodified. First step: create “erasable program container”. IPC can be contained by running the program inside a VM, but that’s not enough, as the program needs access to peripherals (e.g. GPU), and the graphics driver and X server potentially have access to the data. Let’s look at storage first. We can encrypt, so that only encrypted data passes through the OS (this feels like the floor-hitting fruit to me). Graphics card a little trickier; use “ephemeral channels”. Two possible types: 1) leave no traces, using a hardware channel, giving the guest VM full control of HW, 2) ensure traces are not readable, encrypt data and have lightweight decryption proxy in the driver. They provide the first type of channel by exploiting virtualization support in hardware (e.g. NICs); this also seems to be a bit of a low-hanging fruit to me. The second type is based on something they built; e.g. for graphics this is based on an emulated graphics card and a modified driver, decrypting using CUDA. Modified VMM provides hardware channels for e.g. USB, Audio etc. Storage is also based on encryption, but need to make sure that the key is erased, and any pages in buffer cache are encrypted.
Evaluation: show that Lacuna protects privacy. To show, they inject “random tokens” instead of keyboard input, and then scan memory for those tokens. “Almost always” found without Lacuna, never with. They measure the number of LOC that handle sensitive data — it’s small (low hundreds). The overhead on switching between private and non-private mode is low (e.g. due to switching USB drivers). Runtime performance of typical desktop applications is unchanged, but CPU load is higher due to encryption overhead, although hardware AES support already helps with this.
Q: You are leaving encrypted data on the drive, right? Cannot decrypt that, but there are legal ways of enforcing decryption (cf. court cases). Any ways of dealing with that?
A: We don’t hide the fact that we used encrypted channels, but we do destroy the key, so nobody could actually get to the data (eh? Surely that means that any persistent storage is pointless?)
Q: What do you do with unencrypted data in device (not OS) buffers?
A: Quite possibly there are device-level HW buffers. But this isn’t our focus; we try to do as good a job as we can for any “publicly accessible API”.
Q: Cheating on graphics, since surely you can just reverse the encryption?
A: No, we zero out the memory afterwards. (?)
Q: How hard was it to modify the drivers?
A: Often can just modify generic subsystems; however, for graphics, we don’t currently support 3D (obviously).
CleanOS: Limiting Mobile Data Exposure with Idle Eviction
Yang Tang, Phillip Ames, Sravan Bhamidipati, Ashish Bijlani, Roxana Geambasu, and Nikhil Sarda, Columbia University
Mobile devices are ubiquitous, yadida. New challenges: security and privacy of data, since not protected by physical security or corporate firewalls. Devices can be lost, stolen, seized, or the user may connect to random unsecure wireless networks. Mobile OSes have not evolved to protect against this: for example, OS does not securely erase sensitive data or deleted files. Example: dump SQLite databases from app memory on Android for five out of 14 apps, they found the cleartext password this way, and 13 of 14 have “some kind of sensitive data” in RAM in cleartext. Protecting devices is hard! Encryption and remote wipe-out have issues; many users do not configure good passwords, so that devices are easy to unlock. Hence, these solutions are “imperfect stop-gaps”. Their claim: we need new OS abstractions to manage sensitive data rigorously, so that devices are always “clean” just-in-case. CleanOS solves the problem by pushing the data out to a “trusted cloud”. They do so by implementing sensitive data objects (SDOs), and pulling them on-demand from the cloud. Hence, a thief or attacker must then fetch the data from the cloud, where it can be more easily removed or protected using stronger passwords and encryption. Key insight: much sensitive data is in memory in cleartext, but is only used rarely (e.g. on data refresh). Applications can create SDOs in CleanOS, thereby identifying sensitive data. CleanOS then tracks these objects using taint tracking, and evicts the SDOs to the cloud “when idle” (I guess as part of GC, or something?). Actually don’t push the data there, just encrypt and put the key into the cloud, and fetch it when necessary.
Comparison of CleanOS vs standard mobile OS: the benefit of CleanOS in the time between attack and the user noticing is that it can audit accesses, or disable them based on heuristics for suspicious behaviour. Once the user notices, access can be completely cut off (arguably, this is also true for remote wipe-out, surely?).
SDOs have a unique ID (set by whom?) and a textual description for auditing. Using these is as simple as wrapping Java objects in SDO containers. The way they work is by having VM-level support (Dalvik, based on TaintDroid) and modified interpreters and garbage collectors. When an SDO is created, a random ID is generated and the SDO is registered with “the cloud” in a trusted SDO database. When SDO becomes eligible for GC, it will be encrypted and the key saved in the cloud. When the object is used again, the key must be retrieved from the cloud, which creates an audit log entry. Their new garbage collection is called “eiGC”, which is more aggressive than a tradtional GC in that it will evict objects that have not been used in “some time”, rather than just orphaned objects. Various bits of nitty-gritty stuff about carrying ciphertext in Java objects while maintaining their API. CleanOS can work without any app support, but works better with such support. They provide some sensible defaults (SSL, user input and password SDOs). All of this stuff of course has runtime overheads and energy implications. They do, however, include a bunch of optimizations that improve performance (not many details given).
Here comes the eval, or a talk-sized subset thereof. Key questions: does CleanOS limit data exposure? According to some slightly unclear metric, SDOs reduce data exposure by ~90%. “Audit precision” (?) is high, but even higher if apps support SDOs directly. Time overheads of using CleanOS are in the millisecond range on WiFi, but second range for 3G. Their optimizations help to curb this down again, though.
Q: What is the impact on power consumption?
A: Less than 9% overhead. Most energy used for screen anyway (not really a new result).
Q (Jason Flinn, UMichigan): You discovered a fundamental trade-off between the performance benefit of caching, and the granularity of privacy. Batching could really help here. How do you balance these concerns?
A: Let the user configure the eviction policy.
Q (someone from EPFL): What happens when the device is taken offline?
A: We have two types of disconnection: temporary and long-term. Different solutions: for the former, just extend the lifetime of the SDOs, for latter, user can disable the eviction.
Q (Bryan Ford, Yale U): What percentage of objects are actually sensitive, and how does the taint propagate?A: So far, not seen a massive impact of tainting. In 24h, only about 1.8% of objects are tainted.
Q (someone from EPFL): Do apps have some kind of API that ensures that SDOs are available, or will they just freeze if the SDO has been evicted?
A: Not currently, but the SDO abstraction should be transparent to apps.
Session 3: Mobility
COMET: Code Offload by Migrating Execution Transparently
Mark S. Gordon, D. Anoushe Jamshidi, Scott Mahlke, and Z. Morley Mao, University of Michigan; Xu Chen, AT&T Labs—Research
Offloading is neat, as it gives us extra resources, especially on weak mobile devices. Existing work in this area follows the “capture and migrate” paradigm, usually on a method granularity. Drawback: doesn’t work well in multi-threaded environments, or with synchronization. Goals for COMET: higher mobile computation speed, no programmer effort, generalize well with existing applications, resist network failures. Implementation: modified Dalvik VM, synchronizing two devices (mobile and server) using distributed shared memory. COMET = offloading + DSM, i.e. global address space. DSM is tradtionally used in well-connected environments, but this is using mobile data connections, which are probably quite pathological for this (many round-trips for writes). The Java memory model is important to the implementation, since it specifies the consistency semantics (accesses in single thread are totally ordered; lazy release consistency locking). Simple DSM scheme: they track dirty fields (Java MM is field-granularity); [missed the rest of slide]. They sync two Java VMs, including bytecode and thread stacks. There is a “pusher” and a “puller” (directional sync). First step on app launch is to load the bytecode (usually just one file), which may be cached or may need to be sent. Then thread stacks, and finally heap updates/changes. Locks are annotated with an ownership flag, which is used to establish “happens-before” relationships as required by Java memory model. Thread migration now becomes simple: 1) push VM sync, 2) transfer lock ownership. Native methods are a challenge, since they are performance critical and often interact with device hardware. They manually white-list methods that are safe to run on the server side. The whole thing is fail-safe in the sense that we can lose the server, since there is always enough information on the client in order to just run threads locally.
Scheduling: currently fairly simple. They monitor the execution of threads, and migrate if T = 2*migration time, i.e. if it is “worthwhile” to migrate according to a simple heuristics. They evaluate it using a 1 GHz Samsung phone and an 8-core high-performance Xeon server. Use a set of hand-picked applications from Google Play (as opposed to own applications in previous work). Get speedups of ~2.88x on WiFI and 1.28x on 3G; energy savings are 1.51x for WiFi and 0.84 for 3G (i.e. not actually a win). On LINPACK, they get ~10x speedup, and 500+x speedup on their hand-crafted multi-threaded demo application. Also looked at web browsing and whether it could be accelerated: somewhat challenging, because web browsers usually written in C. On a Java-based JS interpreter, they get 6x speedup. Unlike previous work, they have full multi-threading support, and their field-based DSM coherency is novel.
Q (someone from Simon Frazer U): How will applications like games on Android benefit from code offloading?
A: Apps with lots of user interaction will probably not benefit much (and this includes many games). Image filtering and processing apps probably better.
Q: Reminds of CloneCloud, how is this different?
A: Full multi-threading support, can migrate in the middle of a method (different granularity), no need to block on fetching data (?).
Q (someone from Purdue): Hand-picked apps — speculate on the possibility of an automated analysis that determines the benefit of offloading?
A: Consider doing a user study to figure out if this is of practical benefit.
Q (someone from MSR): Which apps would benefit most if you did this commercially, and what is the incentive for an app writer to include this functionality?
A: Compute-intensive apps, e.g. kernel-based image processing.
Q (someone from UCSD): How do you deal with disk IO?
A: Will force migration back to client. Unlike CloneCloud, they do no virtualize the file system.
Q: What are the requirements for supporting a native method for offloading?
A: As a first order approximation, cannot support it if it makes syscalls.
AppInsight: Mobile App Performance Monitoring in the Wild
Lenin Ravindranath, Jitendra Padhye, Sharad Agarwal, Ratul Mahajan, Ian Obermiller, and Shahin Shayandeh, Microsoft Research
People have hundreds of apps on their mobile devices. In total, there are >1M apps, with >300k developers. This is a lot like other software markets. Many apps, however, are too slow, causing unhappy users and bad ratings. But why do apps perform poorly in the hands of users? Diversity of ecosystem, connectivity etc. — all of this is almost impossible to reproduce comprehensively in the lab. What a developer needs in order to get useful feedback is monitoring “in the wild”, in the hand of the user, on their device. Currently, the only option is to instrument the app with bespoke monitoring infrastructure, but that is beyond what hobbyist developers can do. Their thing, AppInsight, runs with zero developer effort and instruments compiled binaries, which can then be distributed to users.
However, app instrumentation is challenging because you do not want to slow things down even further, and often only have limited resources at hand. Apps, to make matters worse, are very highly asynchronous due to frequent UI interactions etc.; most tasks are performed asynchronously. Tracking user-perceived delay across thread boundaries is really hard. Somehow, they can extract a “transaction graph” from real apps; results show that, on average apps have ~19 asynchronous calls and use ~8 threads. The bottleneck path through the transaction graph is called “critical path”; if it can be made shorter, user-perceived delay improves. Back to AppInsight: need to capture *just* enough information at low overhead to understand the transaction. They try to identify “upcalls” to do so: e.g. event handlers, function pointers etc. (they have heuristics). These calls can then be instrumented. Matching callbacks to threads is also very hard, partly because of “detour callbacks” and non-1:1 matching. They measure a whole lot of stuff and send it to the server.
Goal: optimize the critical path, and make it shorter in time. They batch and aggegate in order to reduce profiling overhead; programmers provide annotations e.g. indicating corner cases. Using similar mechanisms, they can also monitor and trace thread crashes, giving the “exception path”. Everything is aggregated in a shiny web-based interface.
Evaluation: they deployed this in 30 Windows Phone apps, which actually do have performance problems: 15% of user transactions take more than 5s. They observe a huge variability in the wild, and measure the overhead at runtime: it is <1% on compue, ~4% on network and <1% on battery. Case study: “My App”, which has an UI hog problem. AppInsight told them where to look for the issue. Another one had slow transactions, due to bad caching policies; which AppInsight highlighted. Also did a pilot study with a large enterprise that actually built its own instrumentation pipeline, which then turned out to make up the critical path, responsible for extra latency!
[apologies, for the bad coverage of the last talk -- I was very tired by this point, after having come to the conference straight off an 11-hour flight!]