NSDI 2011 Day 3 LiveBlog

1Apr/110

NSDI 2011 Day 3 LiveBlog

Morning all once more. LiveBlogging traditional style today: check back here after each talk for our summary/remarks/ranting.

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

It sucks that there's no MapReduce (or Ciel :))-style thing that's optimal for all applications. Static assignment of jobs to machines is liable to use their resources poorly; let's do statistical multiplexing. They design an OS-esque layer to sit underneath existing frameworks; it can provide isolation between framework instances as well as doing more ordinary muxing.

Goals: high resource utilisation, support existing frameworks, scale to 10,000 nodes, tolerate failures. They produce a "microkernel-like" core that requires the frameworks to self-schedule. They decided not to write a global scheduler to allow maximum flexibility and to avoid having to make invasive changes to the frameworks. Their core works by offering available resources to the frameworks which can then choose which ones to use.

The central scheduler does implement some policy: typically it tries to fair-share between the frameworks. Sounds like the frameworks need to be modified to use the resource offers -- their policy at the moment is to pause briefly to accumulate offers before picking the most desirable one (e.g. data-local). Can push a "filter" to the Mesos master to fast-path reject certain offers. They've ported Hadoop, MPI and Torque as well as designed a new framework called Spark.

Results: compared vs. a naive solution in which four frameworks are given 25% of the cluster nodes each: found 0.96 - 2.10x performance (MPI scored 0.96, suffering slightly from the introduction of statistical multiplexing). Eval 2: use simple delay scheduling to try to acquire a node with locality for Hadoop. Found much better locality than for statically partitioned example, with 1.7x improvement in job execution time as a result. Seems strange: does this imply that Hadoop core doesn't do delay scheduling?
Scalability: found 0.2-1 second delay to start a task with 100K tasks on 50K machines. Recovery from fault in 10 seconds. (cs)

Recently, there has been lots of rapid innovation in cluster computing, leading to the emergence of a variety of frameworks. Since these are useful for different purposes, and people have different personal preferences, we might want to run several frameworks in parallel on a single cluster. The naive way of doing this would be to statically partition the cluster, but Mesos tries to do better by dynamically sharing the cluster and statistically multiplexing resources.
Additional benefits gained from this are the ability to run isolated instances of the same framework and the ability to run specialist frameworks on the cluster.

High level goals: (1) High cluster utilization, (2) Support for diverse and future frameworks, (3) Scalability to 10,000s of nodes, (4) Reliability in face of failures.
Result: small micro-kernel that pushes most of the scheduling logic into frameworks. Fine-grained sharing to multiplex resources in space and time. They decided against using a global scheduler even though this would have given them the ability to make optimal decisions, because it is complex and difficult to scale and make robust. Instead, what they use is "resource offers": Mesos knows what resources are available in the cluster and tells frameworks about them. The frameworks then decide what and where to schedule independently. Nonetheless, there is an "allocation module" in the master which decides which framework to offer resources to, thereby doing global scheduling. Frameworks can accept or reject resources according to its own scheduling policies. Since this may end up with lots of resource rejections and thus suboptimal utilization, they introduced filters, which are predicates on resources required that frameworks send to the Mesos master.

Implementation: 10k lines of C++, ZooKeeper for fail-over; supported frameworks: Hadoop, MPI, Torque, Spark. They show that their dynamic resource multiplexing works, with frameworks scaling up and down and gaps being filled. Compared to static partitioning, they get a job time completion improvement of between 0.9 (MPI) and 2x (Hadoop). Data locality is also still good, and in fact better than with static partitioning, because each framework has more nodes available to it. They show good scalability using 50,000 emulated slaves. Fault tolerance is implemented by the master having only soft-state that can be re-computed on fail-over. This however appears to rely on frameworks to do their own fault-tolerance. (ms)

Sharing the Data-Centre Network

How to be abusive: use UDP, or TCP variants that are close enough. Want to allocate network resources to possibly malicious entities without requiring software rewrites or foreknowledge of exact requirements.

WFQ isn't good enough, because switches don't understand a "flow" well enough (i.e. lack application-layer knowledge). Per-host rate limiting is troublesome too as apps might be distributed, so you can be as abusive as you have hosts.

Their solution: "network weight" per VM, plus a hypervisor-embedded rate limiter. They monitor per-link and ensure that you can only get a fair share according to your weight at the bottleneck. The actual limits are imposed end-to-end rather than at the bottleneck link; the appropriate limit is communicated to the hypervisor shim. (cs)

Dominant Resource Fairness

Max-min fairness is handy because you can't out-strategise it -- you get a fair share, or more if nobody else wants it. Their problem: what's fair if two users ask for multiple resources in an asymmetric way (e.g. A wants 3 CPU cores and 1GB memory; B wants 1 core and 4GB memory). Max-min both things seperately?

They considered Asset Fairness: give users equal sums of resource usage (what weight relates 1GB of memory to 1 core?). Provide "share guarantee": each user should get at least 1/n of *some* resource. A user gets max-min fair share of their *dominant resource*: that which they demand the most relative to the available resource. New tasks are assigned to the user with the least dominant share.

Economists have a different approach: just price the resources. But what price to set? Could let the market sort that out -- divide resources equally then let the users trade. Can use cunning strategy though, e.g. lying about your requirements (but wouldn't you pay more for what you get?) DRF has lots of desirable properties some of which are missing from their economic model and from asset fairness.

Eval: compared how many of two classes of jobs finish in 15 minutes using DRF or using "slot fairness," which is apparently Hadoop's default allocation mechanism. Found that more of both classes of jobs finished than using slots (so what was Hadoop doing with the rest of the time?) Also eval'd by simulating a big Facebook Hadoop trace: found ~50-60% job completion time reduction. (cs)

Max-min fairness gives each user at least 1/n of a shared resource, but less if only less is needed (share guarantee). It also encourages users to ask for what they actually need, rather than asking for more (strategy-proof). Lots of things use max-min fair scheduling. However, when there are multiple resources being considered and demands are heterogeneous, the straightforward solution is not applicable (for example, consider dividing up CPUs and RAM between two users who ask for 1 CPU/3 GB and 3 CPUs/1 GB).

Core model: users have tasks with demand vectors. Asset fairness
Share guarantee: every user should get 1/n of at least one resource. Trivially true with max-min fairness in single resource model. Strategy-proofness: users should not be able to increase their allocation by lying in their demand vector.

DRF explanation: each user has a dominant resource. This is the resource that she has the biggest share of (as in, the largest percentage of total resources available). DRF applies max-min fairness to dominant resources. Each time a task is scheduled, pick the user with the least dominant share.

They compare a model from economics (CEEI) to DRF, and find that DRF is fairer, but CEEI achieves a better resource utilization. However, CEEI is not strategy-proof, and users can increase their dominant resource share by lying about their other demands.

They also looked various other properties supported by asset fairness, DRF and CEEI. Nobody achieves resource monotonicity, but DRF ticks all other boxes, while the other two each fail one.

In summary, DRF provides multi-resource fairness in the presence of heterogeneous demand, and it is is the first generalisation of max-min fairness to multiple resource. They also conjecture (and are working on a proof that) DRF is the only "reasonable" policy that satisfies both strategy-proofness and share guarantee. (ms)

PIE in the Sky: Online Passive Interference Estimation for Enterprise WLANs

Only saw the last 5 minutes of this one due to an extended coffee trip. Let's see what I can infer.

They're trying to estimate something called LIR. Linear Induction of Rollerball (the 1984 movie). Their estimates impact WLAN applications. It can boost user throughput from 11Mbps using a "distributed" scheme compared to PIE's centralised scheduler. I wonder what they're scheduling -- replacing MACA etc with something more like GSM's centralised TDMA scheme? They have a competitor centralised scheduler called "BW", which does well for static nodes but performs like the Distributed scheme for mobile nodes. For that mobile situation they improve dist/BW's 10Mbps to 14Mbps.

They profiled two real WLANs in action monitoring for hidden terminals and "rate anomalies," which I'd guess would be a pathalogical case of the rate negotiation protocol which yields a demonstrably wrong answer. Found a hidden terminal to be present ~10% of the time. Limitations: can only do cunning scheduling when all interference is due to other WiFi devices, not some pesky ingrate microwaving a burrito. It also sounds like the clients need to be actively involved.

Who am I kidding? I've no idea what this was about. Read the paper.

SpecNet: Spectrum Sensing Sans FrontiÃ¨res

Wireless spectrum utilization is low (cite various studies in US and China, plus their own in India). Bottom line: spectra are underutilized. FCC has approved use of "white spots", i.e. incidentally available bits of frequency spectra. They however say that studies have only been sparse local analyses, and blame this on the lack of a framework for spectrum analysis. What they've done with SpecNet is "PlanetLab for spectrum analysis". It allows researchers to conduct spectrum studies at remote locations, and helps maintaining spatio-temporal frequence spectrum maps.

A challenge is that the spectrum analyzer devices are very expensive and hence their availability is limited. Furthermore, they need to support user demands and allow quick response as some applications might require that.

Components are, as one would expect, a local slave machine with a spectrum analyzer connected to it, and a master server who manages all those slaves. It also aggregates the data collected in a single database. They consider three different classes of users: (1) sophisticated users who want low-level access, (2) policy users who want to look at historic information and (B) others (incl. network operators), who are not interested in low-level access to devices but want to obtain concrete answsers to specific questions.

A spectrum analyzer measures the spectral compodition of waveforms. It has two key parameters: frequency span (Q) and resolution bandwidth (RBW, omega). It is important to have these set correctly, else you may end up measuring noise only. Linear time dependency on frequency span: scanning a larger frequency span takes longer. In theory scan time should also be inversely proportional to RBW, but they didn't find this to be true -- instead they found it to be piecewise linear. They can cut down scan time by dividing the frequency span between two analyzer devices in the same region (however, what's the chance of that happening given that those devices are so expensive?! Maybe they can each cover huge areas, thus overlapping). They further extend this to partition geographically on top of this to further reduce scan time, and combine both to geo-spectral scanning.

They give a demo of remotely scanning the current spectrum utilization in Bangalore; a real-world application of this might be to locate spectrum violators.

It sounds as though what they really need is a framework to schedule scan tasks and aggregate the results. Surely something like the standard task-farming frameworks could do that (or even CIEL! :-) ), given that they already have local slave machines? (ms)

A study by the NSF found 5.2% spectrum occupancy beween 50Mhz and 3Ghz; later studies found 17%. They did a study in India; found VHF, UHF, CDMA-wireless, GSM being used but little else up to 1Ghz. The FCC has approved use of "white space": incidental free space in the spectrum. Only the US has licensed this stuff so far. Problem: not many people have measured occupancy; they want to analyse the spectrum across vast areas in real time. Their thing is a platform for coordination of spectrum analysers.

Challenges: spectrum analysers cost $10-40k, and can't be multi-tasked (?). OSA (Opportunistic Spectrum Access) requires more rapid scanning than they can deliver. They essentially supply a web service that permits users to see a global real-time map but which functions by programming analysers in appropriate places.

Analyser devices are configured by the range you wish to monitor and the granularity of measurement. Poor granularity can completely hide narrowband transmissions like digital content over FM. They're also limited by the noise levels at their location. Scan time is proportional to range scanned, apparently regardless of resolution bandwidth, which is surprising. Indeed they expected it to be proportional to 1/granualarity; actually they find that it's piece-wise linear presumably because DSP techniques are being used to mock up some granularities?

They partition both frequency space and actual physical space in a fairly cunning way that I didn't understand, but taking into account the scan time that will be required. They use a simple heuristic to prioritise division by frequency or physical space. Eval of this: 3 analysers, two colocated and one far away, trying to find a single source. Found that takes 1118s to find it partitioning solely by frequency, 1205s if partitioning by space (presumably reducing your scan time using the maximum range actually required after partitioning), and about half that if taking both factors into account. Geo-only performs badly because the colocated scanners can't actually do anything useful; they both need to extend to full range anyway.

Application: finding spectrum violators quickly. They boast a 15-line python program using their API is enough to achieve this.Â (cs)

Towards Street-Level Client-Independent IP Geolocation

Challenge: find an IP's location without cooperation from the user. The server wants to know because it's trying to do security, or analytics, or targeted ads.

Previous work: Constraint Based Geolocation sounds like it does active ping triangulation. Topology Based Geolocation used network topology knowledge, i.e. tracert. Octant used geographic information too (i.e. they're probably in a populous place). That last achieved median error 35.2km.

Two main insights for their system: websites often declare their actual location (might be hosted remotely, but they hope at least some of their servers are in their building), and relative network delays correlate strongly with distances.

Method: step 1: get a coarse-grained location by pinging from a few locations. Step 2: populate that region with passive landmarks (passive in that they can't send probes). Presumably these are the ID'd websites. Then send some traceroutes to the target (not clear on how these involve the landmarks; perhaps they just hope that the landmarks are in the way of the traceroute / share intermediate routers with the target?). Step 3: find landmark with minimum delay-to-target and associate the user with that.
They briefly mention a technique for filtering landmarks that aren't useful (entirely remote hosted, bad address...), and that they have a proof that wildly erroneous landmarks don't affect the results.

Eval'd against three datasets: PlanetLab, some manually collected data, and Bing Maps (probably). Found median error 0.69, 2.25, 2.11km respectively. Landmark density helps them of course; median error goes up to 12km at low pop density (50 people/sq mile, rural). Found that the provider's access network had a fairly large affect on quality; ComCast caused trouble because it's a cable provider, and EPON introduces considerably more latency variance than DSL technologies' private lines from the DSLAM to the home. (cs)

Problem: How to we accurately locate IPs on the internet. Side-question: do we even need this? There's GPS, WiFi location etc., but they are host-dependent (client needs to tell the server about its location). We might want to do this with no client cooperation at all, which brings us back to good old 2000's geo-IP.

Prior work is in Constraint-based Geolocation, spiced up with network topology information, router and demographics (!) information. Best one still has median error distance of ~35km though. Their stuff is much better. Based on two insights: websites often have definite information on their geolocation, and relative network latencies are much more useful than absolute ones.

Method: Three tiers -- (1) get coarse-grained location using CBG method (probing from multiple vantage points), (2) get finer-grained region by doing traceroutes from various landmarks in the coarse-grained region; this uses absolute network latencies and yields a smaller region, (3) geo-locate the target using passive landmarks; abandon absolute latencies and use probing against known locations in the target area. Didn't fully understand the methodology (and particularly the large-number-of-kms-against-small-distance graph), but somehow they get very accurate results from this. Of course, their landmark locations may be erroneous, but they can show that this only has a limited effect on the geolocation results.

Evalution on three datasets: PlanetLab, residential crowd-sourced data, online maps. They cover both rural and urban areas (which is presumably important because the density of landmarks is much lower in rural areas); especially the online maps data set covers these. The managed to get a median accuracy of 0.69km for the Planetlab trace, with the tail being at 5.2km. For the residential dataset, the values are 2.25km/8.1km and for online maps 2.11/13.2, which is orders of magnitude better than previous results. As expected, they explain this by the landmark density being highest for the PlanetLab data. Extreme case is a user living in the middle of nowhere, with the next web server being 13km away... They also found that cable networks are doing worse than DSL networks, because they have a higher latency variance. (ms)

This is it -- the end of NSDI 2011. Thanks to everyone for tuning into our syslog live coverage, and see you at the next conference!

Chris Smowton and Malte Schwarzkopf