HotOS 2011 LiveBlog Day 2

Disk-Locality in Datacenter Computing Considered Irrelevant

Hadoop et al. try to put tasks where the disk containing their input is. But network bandwidth now exceeds disk bandwidth, network bandwidth is alleged to be a plentiful resource, and disks aren't getting faster at anything like the rate of the network (what about SSDs?). Memory locality is still worthwhile, but even modern datacentres can't just keep everything in memory (Facebook example: mem_bytes = data_bytes/200). Sadly this isn't much use by itself: datacentre jobs typically have very large working sets, so in-memory caching is all-or-nothing. That said, 96% of FB jobs could fit their inputs entirely in memory. Cache replacement algorithms need to be adapted to fit this sort of use case. (cs)

The key idea in datacenter computing is co-locating computation with its input, and much work has gone into this. Looking at the Facebook production cluster, 85% of jobs see tasks reading from the network run just as fast as disk-local tasks; also, disk-local reads are no faster than rack-local reads. Networks are getting faster and over-subscription is falling; disk speeds, however, are plateauing. Memory locality matters a lot, because memory is two orders of magnitude faster than disk. The problem for caching is that the working set is close to the size of the whole input, and you need to cache all of it, because a single task that still reads from disk will hold back the whole job. Apparently, 96% of jobs at Facebook could fit their inputs in memory. - dgm36
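Given the all-or-nothing property, one plausible replacement policy (my sketch, not necessarily the speakers') is to cache whole job inputs rather than individual blocks, favouring small jobs so as to maximise the number of jobs that run entirely from memory:

```python
def choose_cached_jobs(job_input_sizes, mem_budget):
    """Pick which jobs' inputs to cache wholly in memory.

    Because a job only speeds up when *all* of its input is resident
    (all-or-nothing), caching the smallest inputs first maximises the
    number of jobs that run entirely from memory within the budget.
    """
    cached, used = [], 0
    for job, size in sorted(job_input_sizes.items(), key=lambda kv: kv[1]):
        if used + size <= mem_budget:
            cached.append(job)
            used += size
    return cached
```

A block-level LRU would spread the budget thinly over many jobs and speed up none of them; this whole-input greedy policy trades that for fewer, fully-cached jobs.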

Optimising Data Partitioning for Data-Parallel Computing

Data partitioning controls the degree of parallelism. We have many possible choices of partition function and degrees of partitioning, which can have a big effect on performance and cost. There are problems of data skew and computation skew. Also, a balanced workload does not necessarily give optimal performance, because more partitions can mean more cross-node traffic. The problem is harder than for databases because we need to understand user-defined code, and data with less structure than simple tables. Analyzing user-defined code may be tricky if the amount of processing is data-dependent. The system performs cost modeling and optimization to pick a partitioning with low estimated cost. Two approaches: white box (analytical formulation) and black box (based on down-sampling). - dgm36

Suppose we've got a boatload of data to process: how many chunks should we split it into to achieve data parallelism, and how should we perform the split? We want to avoid skew in data size, and also in the compute cost applied to that data (i.e., one record is not necessarily like another). Perfect balance might not be the best idea either -- we have to consider the volume of cross-node traffic that will result. Challenge: analyse the user's code to figure out how long it will take to process a given record, in advance of execution. Then do a global cost analysis using the predicted runtimes for each partition and the supposed properties of the network, etc. (cs)
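The black-box branch of this could be sketched roughly as follows (hypothetical helper names, and a deliberately crude network model; the real system's cost model is surely richer):

```python
import time

def estimate_record_cost(udf, sample):
    """Black-box approach: time the user-defined function on a down-sample
    to get an average per-record processing cost in seconds."""
    start = time.perf_counter()
    for rec in sample:
        udf(rec)
    return (time.perf_counter() - start) / len(sample)

def estimate_partition_cost(per_record_cost, partition_sizes, net_cost_per_rec):
    """Estimated job cost = compute time of the slowest (largest) partition,
    plus cross-node traffic, which grows with the number of partitions.
    The (p-1)/p factor crudely models random (hash) partitioning across p nodes."""
    p = len(partition_sizes)
    compute = per_record_cost * max(partition_sizes)
    network = net_cost_per_rec * sum(partition_sizes) * (p - 1) / p
    return compute + network
```

An optimizer would evaluate `estimate_partition_cost` over candidate partition counts and functions and keep the minimum: this captures the talk's point that perfectly balanced partitions can still lose to a coarser split with less cross-node traffic.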

Disks Are Like Snowflakes: No Two Are Alike

Historically, disks were fairly precisely spec'd and performed the same as others of their type, notwithstanding partial failures. Now they're much more variable, because disks are built to use "adaptive zoning". Disks have always used zoning: concentric rings with differing numbers of sectors per track. This means that ordinary disks show a bulk throughput rate that varies in the manner of a staircase with logical block number, as we move between zones. More modern drives vary much more unpredictably with logical offset because they've been adaptively zoned -- presumably fitting as many sectors as possible without exceeding some intolerable error rate. Apparently the variance stems mostly from the head -- how rapidly it responds to changing magnetic properties of the underlying disk surface. Different platters have different heads, so you can see differing rotation rates / zoning density per platter, as well. Therefore: we should expose that sort of knowledge to RAID drivers and the like, as the disks are not as alike as you might suppose, either within a disk or between them. (cs)

In the past, buying a batch of disks, you could measure one and assume that the rest would perform the same. Now, disks are unique by design. In 2002-era disks, the performance would be the same. Disk zoning techniques provide bandwidth proportional to the number of bits along the track, so they store more data on the outer tracks. In 2006-era disks, there is much more variability, and this trend has increased in 2008-era disks. Adaptive zoning figures out the density capability adaptively based on how good the disk head is. This means there is no longer such a thing as a homogeneous set of disks, which has implications for striping, equal-work partitioning and other low-level disk techniques (about placement, etc.). - dgm36
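A rough way to see the zoning staircase for yourself -- a sketch assuming a readable device or file path, and ignoring page-cache effects, which a real measurement would need O_DIRECT (and privileges, e.g. on /dev/sdb) to avoid:

```python
import os, time

def bandwidth_profile(path, offsets, chunk=8 * 1024 * 1024):
    """Measure sequential read bandwidth (bytes/s) at several logical offsets.

    On a classically-zoned drive the values step down as offsets move toward
    inner tracks; an adaptively-zoned drive shows a much noisier curve.
    NOTE: the OS page cache will distort results on a file; use a raw
    device with O_DIRECT for an honest measurement.
    """
    profile = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for off in offsets:
            os.lseek(fd, off, os.SEEK_SET)
            start = time.perf_counter()
            data = os.read(fd, chunk)
            elapsed = time.perf_counter() - start
            profile[off] = len(data) / elapsed if elapsed > 0 else float("inf")
    finally:
        os.close(fd)
    return profile
```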

The Case for Power-Agile Computing

Mobile devices have a variety of applications with differing power needs. We want to have the capabilities of a desktop, laptop, supercomputer and mote all in one mobile device. The idea is to have an ensemble of different storage and processing elements. There are challenges in measuring efficiency, predicting different ensembles' performance, selecting the appropriate ensemble, and transitioning between ensembles. - dgm36

Enter the genie. Actually. Man dressed as a genie. Pretty good sketch comedy for two ensues, the point of which is that it'd be neat if your mobile phone could scale between the characteristics of tiny mote-like devices all the way up to a desktop. Proposed solution: slam a bunch of diverse components into the same box, and power up the appropriate one at the right time. Like sellotaping my good ol' fashioned Nokia 6310 (for most of the time) to an iPhone (for the odd occasion I really need it). (cs)
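The ensemble-selection step might look roughly like this (component names and numbers entirely made up; real selection would also weigh transition costs):

```python
def pick_ensemble(components, required_perf):
    """Power up the least power-hungry component that meets the predicted
    performance demand; returns None if nothing in the ensemble suffices."""
    feasible = [c for c in components if c["perf"] >= required_perf]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["power"])["name"]

# hypothetical ensemble: mote-class, phone-class and desktop-class elements
ENSEMBLE = [
    {"name": "mote",    "perf": 1,   "power": 0.01},
    {"name": "phone",   "perf": 10,  "power": 1.0},
    {"name": "desktop", "perf": 100, "power": 20.0},
]
```

This is the "Nokia 6310 sellotaped to an iPhone" idea in code: idle workloads run on the mote-class element, and the big cores only wake when the demanded performance requires them.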

Mobile Apps: It's Time to Move Up to CondOS

Phones have a bunch of sensors. Can figure out the user's current activity, location, "are they interruptible" (?). Can't do it with an app, because mobile apps are typically not allowed to run all the time. So let's do it in the OS. Or, just permit a background app, since that's functionally the same. Supply this sort of stuff to subscribers. Could use this sort of info to preload apps when we think they'll need them. Or adjust ringer volume based on background noise. Or customise the way the WLAN card operates depending on whether you're moving (done at NSDI). Apparently you could selectively apply password protection if you think they're away from home (!) Or, predict time to recharge and do power conservation based on that. (cs)

Devices are very programmable these days, and they have a bunch of different sensors on board. The question is how new apps might use new sensors. Sensors can provide contextual information that applications might use. Since there are some risks to this, we can't let applications have unfettered access, and cloud offload might be expensive. CondOS will do it in the OS. Sensors export "context data units" rather than raw sensor data, and these can be composed into a graph (like e.g. the Click modular router). If context is OS-managed, the system services can use context (e.g. for memory management to make applications more responsive, security, energy management, etc.). This abstraction can also aid sensor privacy (by adding policy decisions to how sensor data is exposed). - dgm36
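The context-data-unit composition might be sketched like this (a toy, not the CondOS API): raw sensor stubs feed derived-context nodes, Click-style, and a policy consumer sits at the end of the graph.

```python
class CDU:
    """A context data unit: a node that derives higher-level context
    from the outputs of its upstream CDUs (Click-style composition)."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def value(self):
        return self.fn(*(i.value() for i in self.inputs))

# hypothetical pipeline: accelerometer magnitude -> "is moving?" -> ringer policy
accel = CDU(lambda: 2.3)                               # raw sensor stub
moving = CDU(lambda a: a > 1.0, accel)                 # derived context
ringer = CDU(lambda m: "loud" if m else "quiet", moving)
```

An OS-resident graph like this is what lets *system* services (ringer volume, WLAN tuning, preloading) subscribe to context without each app polling raw sensors.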

Free Lunch: Exploiting Renewable Energy for Computing

Part of a grander scheme from Andy Hopper about Computing for the Future of the Planet. Computing uses a lot of energy, but some companies have responded by installing renewable energy technology. Wind and solar renewable energy at a given location is intermittent. The idea is to migrate workloads and data instead of energy. Data centers must be connected with high speed links. Challenges include predicting VM migration time and storage synchronization (solved in MASCOTS papers), and scheduling and placement. Migration has a 0.5s downtime overhead due to memory and disk over a 10 Gbps link. With 615 migrations a year, the downtime is 415 seconds. This fits within a 99.95% SLA. The energy consumption of migrating one VM is 57.5 kilojoules, assuming 7.5GB of RAM and 20GB of modified disk. - dgm36

Renewable energy == good. Renewable availability varies with time (sun, wind). So migrate your VMs to where the power is. Challenges: predict migration time, virtualise net+storage well enough that you can do this without causing service outages. They estimate that downtime will be okay if you've got a 10Gbps link between the data centres. (cs)
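The SLA claim is easy to sanity-check -- a quick back-of-the-envelope in code, using the figures from the talk:

```python
def sla_downtime_budget(availability, seconds_per_year=365 * 24 * 3600):
    """Seconds of downtime per year permitted by an availability SLA."""
    return (1 - availability) * seconds_per_year

# figures from the talk: 615 migrations/year cost 415 s of downtime in total
budget = sla_downtime_budget(0.9995)   # roughly 15,768 s/year
assert 415 <= budget                   # migration downtime fits comfortably
```

So chasing the sun and wind costs well under 3% of the downtime a 99.95% SLA already allows, which is the paper's point.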

Debug Determinism: The Sweet Spot for Replay-Based Debugging

Logging for replay debugging is expensive, especially if you need to reproduce data races. Existing systems use "relaxed determinism": they produce a replay that's not quite the same but good enough. The most relaxed so far is "ESD", which guarantees only that the same failure will occur. That can be too relaxed, though -- it might reproduce the same output by back-inferring reasonable inputs, rather than the true cause of unreasonable inputs plus a data race. It'll only really work if the output observed can *only* occur due to a data race. We need higher fidelity than that. They propose a system which has high debugging fidelity (indicating the right root cause for a problem) and low overhead. They propose to do this by recording the root cause and the failure with high fidelity -- but we don't know what's going to be a root cause in advance, so we need a heuristic for where root causes are likely to lurk. (cs)

Non-Deterministic Parallelism Considered Useful

Distributed execution engines benefit from nondeterministic parallelism! They have some useful properties, like taking care of synchronisation and scheduling. Most distributed execution engines provide guaranteed termination; Ciel sacrificed this in the name of flexibility and expressiveness. Should we sacrifice our support for fault tolerance along similar lines? Yes! Because fault tolerance as we know it (Jim) requires deterministic tasks, but the real world tends to perturb your programs and introduce nondeterminism (e.g. network delays, competing processes)... Requiring us to be deterministic in the face of such things requires us to sacrifice neat things like wait-for-multiple-handles, async interprocess signals... But what would we do when failures occur? We could fail the whole job, or at the other end we could record all nondeterministic events for replay; there are also a few intermediate points on this spectrum. Non-determinism is useful for other things too: handling interactivity, adapting the job to cluster conditions... (cs)

Finding Concurrency Errors in Sequential Code—OS-level, In-vivo Model Checking of Process Races

Process race: when multiple processes race for shared system resources like the filesystem. Neat example: ps aux | grep XYZ | wc might return 0 or 1 depending on whether grep is already running when ps runs, and, if ps is started first, who wins out of the shell spawning grep and ps getting through /proc's directory listing. They did a survey of bug reports and found process races to be at least as common as thread races, and often to have serious consequences. Challenge: detect process races. Syscall interactions are complex... Their solution: do in-kernel record/replay of process/syscall events, and do offline analysis of the log, looking for racey interactions. They try flipping the order of apparently racey interactions and test the program (failure detection?). They have a prototype that's found some real races. (cs)
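The check-then-act pattern underlying many such filesystem races, and its atomic fix, in miniature (my illustration, not the authors' detector):

```python
import os

def racy_create(path):
    """Check-then-act: another process can create `path` between the
    exists() check and the open(), so both processes may think they 'won'."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write("winner")
        return True
    return False

def atomic_create(path):
    """O_CREAT | O_EXCL makes creation atomic in the kernel:
    exactly one racing process succeeds, the rest get FileExistsError."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False
```

The point of the paper is that such races span *processes* and kernel resources, so a thread-race detector watching a single address space can never see them; hence the in-kernel record/replay.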

Privacy Revelations for Web and Mobile Apps

What OS support is needed to track personal information exposure? The research community has wasted time on coming up with clever (e.g. side-channel) attacks. The real problem is things that people in fact authorize without knowing it. The OS should give us transparency about what information about you is being sent over the network when we use mobile and web apps. The paper talks more about the idea of "privacy revelations" which are OS-supported, high-level descriptions of application exposure. They need to be relevant to the user, and actionable by the user, which current tracing frameworks don't give you. - dgm36

It's the TaintDroid people! Question: what OS support do you need to be able to control information release? Assertion: the OS community has done privacy control badly. It's been identifying clever attacks rather than helping to plug the simple vulnerabilities, or helping users understand their privacy. The OS should make people aware when something they value is becoming known; then it's the user's job to decide whether they're happy or not. (cs)

Do You Know Where Your Data Are? Secure Data Capsules for Deployable Data Protection

We need to share data in ways which we might not always be aware of, e.g. medical records. How would we manage propagation of sensitive health data without tying up your doctors? Need to support existing apps, as there's a huge glut of legacy software out there in the medical community, and perform well enough / be reliable enough that we never leave doctors without the information they need. Solution: wrap your app in a taint tracker to detect when stuff escapes, and wrap network-transmitted data with unforgeable provenance information. This is essentially a way to implement the goals of the NHS National Programme for IT (a disaster, incidentally), with the OS providing enforcement that I suspect the NPfIT does by trusting its applications. They're working on formally proving the intermediate components -- the network communication wrapper and the application sandboxing kit. (cs)

Critical data (e.g. healthcare records) often need to be shared between different stakeholders, often out of your custody. Today we have a choice between allowing anyone to do anything undetected (with your records) or die. Data Use Controls are the owner-set policy. We need to support current OSs and applications without limiting choice. Getting this to work with legacy is the killer app. Take the OS out of the TCB with Overshadow. Take the application out of the TCB with system call interposition and taint tracking. Need to store provenance and policy with the data. The DUC engine is a module in the OS that releases access to data based on policy. Validating and verifying formally the TCB is one of the other things being looked at. - dgm36
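A toy sketch of a capsule carrying provenance and an owner-set policy (a hypothetical class to illustrate the idea, not the paper's actual DUC engine):

```python
class DataCapsule:
    """Sensitive data wrapped with provenance and an owner-set policy.
    In the real system the policy check sits in a trusted OS module and the
    payload is protected from the (untrusted) application and OS; here the
    mechanism is reduced to a policy gate plus a provenance log."""
    def __init__(self, payload, owner, allowed_principals):
        self.payload = payload
        self.provenance = [owner]            # travels with the data
        self.allowed = set(allowed_principals)

    def access(self, principal):
        """Release the payload only if the owner's policy permits,
        recording the access in the capsule's provenance."""
        if principal not in self.allowed:
            raise PermissionError(f"{principal} denied by owner policy")
        self.provenance.append(principal)
        return self.payload
```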

Making Programs Forget: Enforcing Lifetime for Sensitive Data

Lingering data is a key obstacle to privacy: it might be swapped out to disk, or lying around in a cache. Closing the application doesn't work; neither does rebooting. Can we rely on the application? Probably not. The aim of this talk is to come up with a system that can guarantee that all sensitive data and data derived from it will disappear after a given time limit. No application support is expected. Observe that there are equivalent states of the program that don't depend on sensitive data. The approach is "state reincarnation". Equivalent states are derived by deterministic replay with perturbed input. Picking the replacements for sensitive data is tricky---it can't just be arbitrary garbage. So is knowing that you've got everything completely scrubbed. Could use random garbage for replacing a sensitive email, but there might be some constraints on what we can use. Implemented recording at user-level and it gave a 1.2x slowdown on bash. - dgm36

Data you might want erased often gets copied all about the place. They want to allow the user to put a constraint on when your data will be destroyed, without requiring application support. They have a prototype: it tries to replace "sensitive state" with equivalent non-sensitive state. They use deterministic replay gubbins to replay your session without the sensitive input (presumably network traffic in most cases). Challenges: how do you track all the sensitive stuff? How do you keep the costs of using replay down? Problems: "equivalent state" is often more complicated than just removing the sensitive packets, which might e.g. affect later sequence numbers. (cs)
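A toy version of state reincarnation over a trivially deterministic "program" (real replay must respect the constraints noted above, e.g. sequence numbers that depend on the removed input):

```python
def run(inputs):
    """Toy deterministic program: its state is just the processed input log."""
    state = []
    for msg in inputs:
        state.append(msg.upper())
    return state

def reincarnate(input_log, sensitive_index, replacement):
    """Replay the recorded inputs with the sensitive one substituted,
    yielding an equivalent state that carries no trace of the secret."""
    perturbed = list(input_log)
    perturbed[sensitive_index] = replacement
    return run(perturbed)
```

Because `run` is deterministic, the replayed state is guaranteed consistent; the hard parts the talk mentions are choosing a `replacement` the program accepts, and finding every copy of the old state to overwrite.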

More Intervention Now!

We want to predict what will happen if we intervene in a system to change some of its hardware. Observation only shows correlation; for causation we need intervention. We need some math, some theory and some engineering. We want to see if we can discover some causality from observational data when we don't have a full functional model. A Bayesian network is the formalism we need. The do-calculus gives us a set of rewriting rules we can use. A Dryad cluster with a few hundred machines gives us enough observational data for statistical significance. - dgm36
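In the simplest confounded case, this machinery reduces to the backdoor adjustment, P(y | do(x)) = sum over z of P(z) * P(y | x, z), which is computable from observational samples alone (my illustration; variable names are hypothetical):

```python
from collections import Counter

def p_y_given_do_x(samples, x, y):
    """Backdoor adjustment: estimate P(y | do(x)) from observational
    (z, x, y) triples, where z is an observed confounder of x and y."""
    n = len(samples)
    pz = Counter(z for z, _, _ in samples)          # marginal P(z)
    total = 0.0
    for z, count_z in pz.items():
        matching = [s for s in samples if s[0] == z and s[1] == x]
        if not matching:
            continue                                 # no data for this stratum
        p_y_given_xz = sum(1 for _, _, yy in matching if yy == y) / len(matching)
        total += (count_z / n) * p_y_given_xz
    return total
```

The point of the talk is that, with enough observational data from a big Dryad cluster, an identification like this can substitute for an expensive hardware intervention.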

make world

Programs are rubbish, and you should improve them by using partial evaluation: "programs" meaning things like Eclipse, Firefox and OpenOffice, which repeat a lot of work that they've done before (either in the same session or in a previous session). The example is a spelling checker, which takes the global and a local dictionary, converts them into a machine-readable form and then uses that for checking. It wastes time by doing the conversion every time I use it. We could amortize this by converting only once each session, but it might be hard to identify what a "session" is. Or we could rebuild the dictionary using a process like `make` each time. What if we could do this automatically? Partial evaluation does this for us. We can treat the dictionary as a quoted constant, then do constant propagation through the rest of the program. Whole-program optimization provides the mechanism for doing this. A prototype exists; it can, for example, completely evaluate away a program that counts words in a fixed file. Loop peeling optimizations evaluate the code against the known input and completely unwind the loop. There are challenges in adaptive optimization (with respect to the program's environment), handling IPC between multiple processes (reads from a pipe or socket), and making it efficient. - dgm36
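The dictionary example can be approximated at the source level by closing over the one-time conversion -- a hand-written version of the residual program a partial evaluator would derive automatically, with the dictionary playing the role of the quoted constant:

```python
def make_checker(dictionary_text):
    """'Partially evaluate' the spell checker with respect to its dictionary:
    the expensive conversion to a machine-readable form runs exactly once,
    and the returned residual function closes over the precomputed structure."""
    words = frozenset(w.lower() for w in dictionary_text.split())  # one-time conversion
    def check(word):
        return word.lower() in words
    return check
```

Usage: `check = make_checker(open("dictionary.txt").read())` once per dictionary change, then `check("word")` as often as you like. The talk's point is that the compiler should discover and perform this staging itself, `make`-style, instead of the programmer restructuring the code.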

Comments
  1. Interesting to see the CondOS thing as a research paper — there used to be an Android app which did this (by running in the background), called “Locale”. This would, despite the name hinting it was about location only, allow fairly generic “if-condition-then-action” kind of rules to be set. I used to use a “set phone to silent if within some radius around the CL” rule, for example.

    I guess you couldn’t do selective password protection if you don’t have OS integration. But then again, that’s not an uber-compelling use case that convinces me that we need OS-level support for this. Or maybe they just all had iPhones?
