{"id":1043,"date":"2012-10-09T15:59:54","date_gmt":"2012-10-09T15:59:54","guid":{"rendered":"http:\/\/www.syslog.cl.cam.ac.uk\/?p=1043"},"modified":"2013-07-12T16:46:00","modified_gmt":"2013-07-12T16:46:00","slug":"blogging-osdi-2012-day-1","status":"publish","type":"post","link":"https:\/\/www.syslog.cl.cam.ac.uk\/2012\/10\/09\/blogging-osdi-2012-day-1\/","title":{"rendered":"Blogging OSDI 2012 — Day 1"},"content":{"rendered":"

\"\"For the next couple of days, I am attending OSDI in Hollywood. However, due to various scheduling constraints on both sides of the Atlantic, I only arrived there at lunch time on Monday, and missed the first session. Fortunately, in addition to my \u00c2\u00a0delay-tolerant \"live blog\" from the plane, where I read the first session's papers, Derek Murray<\/a> was kind enough to take some notes on the actual talks. Normal live-blogging service of the talks will be provided for the other days! :)<\/p>\n


Session 1: Big Data

Flat Datacenter Storage
Edmund B. Nightingale, Jeremy Elson, and Jinliang Fan, Microsoft Research; Owen Hofmann, University of Texas at Austin; Jon Howell and Yutaka Suzue, Microsoft Research

Malte's summary:

\"Flat storage\", a.k.a. a simple network file server, is simple and neat. In conventional data centres, however, we see more complex approaches that try to move the computation to the data, since data motion is expensive as a result of the tree topology of the DC. Many common jobs however inherently require data motion, and MR et al. do not cut it for those. However, we can now build full bisection bandwidth DCs, so the locality constraint can go: in FDS, \"all compute nodes can access all storage with equal throughput\". It is a clean-slate re-think of DC storage, and the guiding principle is statistical multiplexing of I\/O across all disks and network links in the cluster. Data is structured in blobs and \"tracts\": a blob is a sequence of mutable tracts, which are small units of data (~8MB) named by a 128-bit GUID. Tracts are stored directly via raw block device access (i.e. there is no file system), and all meta-data is kept in memory (!). Simple non-blocking API with atomic writes, but no guarantee on write ordering (i.e. weak consistency in the presence of failures). FDS does not use explicit meta-data; instead, it has a Tract Locator Table (TLT) that deterministically maps GUID + tract ID tuples to tract servers (but not data on disk). It also has no durable state, and its state can be entirely reconstructed on the fly from others. Only interaction with the TLT is on process launch, then long-term caching of tract servers. Per-blob meta-data (e.g. size) is in special \"tract -1\" and accessed just like data. Blobs also have an atomic \"extend\" operation (cf. GFS's \"append\").<\/p>\n

Tracts are replicated, and a mechanism similar to RAMCloud's is used to recover data on lost disks or machines. Writes are sent to all replicas by the application library (after a TLT lookup, if necessary), and only complete when all replicas have acknowledged. Meta-data changes are sent to a "primary" replica, which runs 2PC to update everyone else. The replication level is a per-blob setting. Fault-tolerance is built around the key ingredient of version numbers on TLT rows, which are used to reject stale requests that refer to pre-failure conditions. The story on network partitions is a bit weak -- they rely on only a single meta-data server running, and configure it manually (though they are looking into Paxos). TLTs are generally O(n^2) in size for n disks, representing all pairs, plus some random extra replicas. This can get quite big, but they have some optimizations that decrease its size.
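A rough sketch of how versioned TLT rows can fence off stale clients after a recovery; the class, field and method names are invented for illustration, since the paper only specifies the idea of rejecting requests tagged with an old row version:

```python
# Sketch: a tract server rejecting requests tagged with a stale TLT row version.
# Not FDS code; names and message shapes are made up for illustration.

class StaleVersionError(Exception):
    pass

class TractServer:
    def __init__(self):
        self.row_versions = {}   # row_index -> current version of that TLT row
        self.storage = {}        # (blob_guid, tract_no) -> bytes

    def install_tlt_row(self, row_index: int, version: int):
        """Installed by the meta-data server when recovery bumps a row's version."""
        self.row_versions[row_index] = version

    def write_tract(self, row_index, row_version, blob_guid, tract_no, data):
        # A client still holding the pre-failure TLT presents an old version
        # number; it is rejected and must refresh its TLT before retrying.
        current = self.row_versions.get(row_index)
        if row_version != current:
            raise StaleVersionError(
                f"row {row_index}: expected version {current}, got {row_version}")
        self.storage[(blob_guid, tract_no)] = data
```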

The network uses a lot of shiny 10G hardware, with 5.5 Tbps of bisection bandwidth at a cost of $250k. They found it hard to saturate 10G with a single TCP flow, since a single core cannot keep up with the interrupt load. Spreading interrupts across cores, using multiple short flows (a design characteristic of FDS) and a zero-copy architecture all help with this. They use RTS/CTS notification to avoid incast and receiver-side collisions. The evaluation cluster is heterogeneous; they say that dynamic work allocation was key to utilizing it efficiently. They find ~1 GB/s write and read throughput per client with nearly linear scalability. Random and sequential read and write performance at tract granularity is identical; with triple replication, maximum write performance goes down to ~20 GB/s (from ~60 GB/s). The maximum throughput they could reach was ~2 GB/s per client (using 20 Gbit network connections). Failure recovery is fast: 3.3s for 47 GB on 1,000 disks; 6.2s for 92 GB; 33.7s to compensate for a whole machine failure (~655 GB). Using FDS, they managed to claim the Daytona and Indy sort records, sorting around 1.4 TB in 60s.
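To illustrate the RTS/CTS idea in one place (this is my own toy sketch of receiver-driven scheduling, not FDS's network stack): a sender announces a bulk transfer (RTS) and only transmits once the receiver grants it a clear-to-send (CTS), so a receiver never has more concurrent inbound bulk flows than it can absorb, which is what prevents incast collapse.

```python
from collections import deque

# Toy sketch of receiver-driven RTS/CTS flow scheduling to avoid incast.
# Names and the admission policy are invented for illustration.

class Receiver:
    def __init__(self, max_concurrent: int = 2):
        self.max_concurrent = max_concurrent
        self.active = set()       # senders currently cleared to send
        self.pending = deque()    # senders that sent an RTS and are waiting

    def request_to_send(self, sender_id: str) -> str:
        """A sender announces a bulk transfer (RTS)."""
        if len(self.active) < self.max_concurrent:
            self.active.add(sender_id)
            return "CTS"          # clear to send immediately
        self.pending.append(sender_id)
        return "WAIT"             # sender holds its data back for now

    def transfer_done(self, sender_id: str):
        """On completion, grant CTS to the next queued sender, if any."""
        self.active.discard(sender_id)
        if self.pending:
            nxt = self.pending.popleft()
            self.active.add(nxt)
            return nxt            # caller would deliver the CTS to this sender
        return None
```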

Derek's take: