Place-Friends: designing a link prediction system for location-based services

Posted by Salvatore Scellato

Online social networks often deploy friend recommending systems, so that new users can be discovered and new social connections can be created. Since these service have easily millions of users, recommending friends potentially involves searching a huge prediction space: this is why platforms such as Facebook, LinkedIn and Twitter merely focus their prediction efforts on friends-of-friends, that is, on users that are only 2 hops away in the social network, sharing at least a common friend. Extending prediction efforts beyond this social circle is simply not worth it.

017/365 areyoucheckedin?

Nonetheless, in location-based social networks there is an unprecedented source of potential promising candidates for recommending new friends: the places where user check-in at.  In a recent paper which will appear at the upcoming ACM SIGKDD 2011 conference we address the problem of designing a link prediction system which exploits the properties of the places that users visit.


Opportunistic Networking, Altruism and Disasters+secure p2p socnets

Posted by Jon Crowcroft

The Social Nets project just got a very good final review in Brussels (excellent and beyond expectations) - one of its outputs is a prototype p2p decentralised social net tool called Safebook.

This was partly a successor project to Haggle... ... ...

One of the premise behind Haggle and opportunistic networking was that people are altruistic more often than not - while this may not be true when too much money (relative to personal wealth) is involved, a lot fo studies show it is true particularly in disasters - I am reading this popular journalism book on the topic called "A Paradise built in hell", by Rebecca Solnit, which is heavily researched and cites many many studies to show that contrary to government fears and many hollywood disaster movies,the vast majority of people in major catastrophes "do the right thing" and do not panic and loot - from 1906 SF earthquake to the 2005 New Orleans Katrina floods , the biggest cause of unnecessary death was autoritarian policing over-reacting (to the tune of 500 deaths in SF from shootings by national guard when there was almost no looting at all and evidence that much panic was _caused_ by people being told there might be panic...)

In contrast, people carried out many activities of mutual aid - one assumes that a comms network that supported this  would work really rather well...

On another note, this UCSD study on spam finds choke point - this from Stefan Savage et al., so probably very thorough.

This is a result of a long running project with UC Berkeley which was heavily funded, so sometimes these big projects make some sense(maybe)?




IMDEA Workshop on Internet Science

Posted by Jon Crowcroft

See here for titles/speakers and slides...

So far Pablo's talk had some v. interesting stuff about scaling the twitter service - clever work on solving hotspots andoverloads in memcache/mysql setup - reminded me of previous work on trying to get the IMDB system to scale - seems like these inverted databases are a pain in general, so a fundamental solution would be welcome...for those of you working on social net analysis, worry about (particularly un-self-declared) spambots in twitter - see Mowbray's talk - plus looking at Vattay'sstats talk is worthwhile
anyhow, I was reading this Future Internet Roadmap and decided that Private Green Clouds is defintely the way to go (andwe are there yet, so that is good:-)
here's the barking bit: why not put a data center in every car?

rationale:1. future cars will be electric.2. its proposed that future electricity generation will incorporate a lot of micro-generation(certainly solar here in spain, and wind in uk etc etc)3. the power distribution net is not fit for "uplink" electricity in large amount, so...4. micro-generation is largely intermittent (esp. wind, but obviously solar is at least on/off day/nite)5. hence we need to do local distribution of micro-generated power6. or else we need to store micro-generated electricty
power solution=> use electric cars as storage; to get an idea of scale, (see Mackay's book) cars could store about 30% of UK generated power -when we get to 100% of the carpool of the UK being electric...
so then where we plug cars in, why not also have a dataport too then instead of using meagre compute resource in someone's house, have a big-fat data center in the car(s) in a street - they can run off stored power when local production exceeds demand (or predicted (say nighttime) production/stored exceeds local and car demand...

the numbers should work very can easily smooth day/night variation, but also short term wind variation...
before you all shout, one problem is that the batteries are really designed for a relatively small number of discharge cycles - however, some technologies (hydrogen fuel cell etc) would fixthat
so this needs 2 things.1 a smaller unit for data center2 a plan to do fiber-to-the-charging-point....


HotOS 2011 LiveBlog Day 3

Posted by Chris Smowton

What If You Could Actually Trust Your Kernel?

seL4 is our new toy: it's formally verified. It's guaranteed not to crash or suffer from security holes. (As long as you can trust the compiler.) Hypervisors are really very large and full of bugs. We could use this for isolating web browser processes as in IBOS (OSDI 2010). We can make TPMs useful (they currently aren't, because PCRs get updated for any app loaded; DRTM/TXT/SVM isn't much better). seL4 provides a trusted environment with a trusted loader that can be verified. It can also improve performance for things like TPC-C by providing a crash-proof virtual storage device that can be less aggressive in writing out to disk. - dgm36

What can you do with a verified kernel? Do a VMM right, do browser sandboxing right. More interesting: use a TPM for real. Existing remote attestation is problematic because it doesn't guard against the measured code being itself buggy, and it only works if you have a complete trusted chain. To use this for e.g. home banking you'd need to make the user boot a complete bootloader+VMM+kernel+app chain all of which measures their successor. Better: DRTM (?) which suspends the present system and does a parallel boot of an untainted kernel which can be measured -- but, suspends the original install which will do interesting things to hardware. Much better: run seL4 as a VMM and construct a trusted, measured chain through a mini OS and banking app. Another idea: can discard the sync-writes in a DBMS that are intended to ensure persistence past an OS crash. Of course this also buys you vulnerability to power failure! (but they assert you can use the capacitors in your PSU as a tiny UPS!) (cs)

Provable Security: How feasible is it?

Let's prove useful properties of real systems! Example of such a property: prove that software provides confidentiality. Verifications should be done against real C. Prove that the C is an implementation of a formal model. Aiming to prove properties of >MLOC systems with difficult code intended for performance. They've proved that seL4 can provide integrity guarantees. They anticipate building systems using chunks the size of seL4 in which each is proven functionally correct, making whole-system proofs easier. Problem: they want to prove confidentiality properties, but they can't make guarantees about timing channels without hardware that deliberately works to eliminate them. (cs)

It's very feasible, but it might not be easy. Real proofs are machine-checked and tell you something that you didn't already know. If you're looking for buffer overflows, you're not trying hard enough. Proofs should cover high-level security goals like integrity and confidentiality. It has to apply to real code (C or assembler), which has been written to be run, not to be proved. You need to structure your large system so that you don't need to prove the correctness of every one of millions of lines of code. 10kloc is a feasible size of component to verify on the present horizon. Proving things about timing effects is still too hard (we'd need a formal model of the hardware for that). - dgm36

Toward Practical and Unconditional Verification of Remote Computations

Want to verify that the server has executed a function correctly given the result, and without re-executing the code. This might be useful for cloud and volunteer computing. PCPs are probabilistically checkable proofs. These are very expensive to perform (hundreds of thousands of years), and the proposal is to make them more efficient. The impracticality comes from using Boolean circuits to represent computations, which makes the proof huge. The idea is to do PCPs interactively, whereby the client sends the checking locations to the server, and only generates those parts of the proof; the server uses cryptographic commitment so that it cannot change the function. Also, use arithmetic circuits instead of Boolean circuits. The refinements for polynomial evaluation went from 130000 years to 11.5 hours. - dgm36

Want to check that an untrusted remote party correctly calculated some function of its input without just plain old recomputing. There exists somthing called Probabilistically Checkable Proofs which allege to do this, but the cost of checking is huge! Basically they work by turning the desired function into a boolean circuit, then checking that a few randomly picked input values map to what the client gave us. More test values --> less probability of error. They use "arithmetic circuits" (dataflow graphs, I guess?) instead of boolean ones to save time computing results. Apparently you can also make major gains by having your remote worker do a bunch of parallel work and batch-verifying the results. The time savings are about 10 orders of magnitude. (cs)

MOMMIE Knows Best: Systematic Optimizations for Verifiable Distributed Algorithms

It sucks that often you have to mangle your ideal algorithm in the name of efficiency. Similarly layering can imply suckitude -- e.g. you might happen to be running a protocol with integrity properties on top of another one; could elide one of those checks. If only we had reasonable devices for composition! They define MOMMIE, which is a way to specify your desired algorithm which is compiled down to TLA+ (a logic for verification) or C++ (for execution). It's a sort of imperative specification... ish. It includes useful primitives for gathering messages etc. (cs)

Abstractions are good. In theory, P2 was a great abstraction; in practice, we wrote the equivalent of assembly code in a relational database. BFT is a great abstraction, but you end up having to cram a lot of implementation details into your proofs because you want it to be efficient. People need a way to do composition, so that proofs carry over into the implementation. MOMMIE is a high-level language for writing your system; it can be turned into a TLA+ specification and a C++ program, and composed with a bunch of implementation modules. A MOMMIE statement has an issuer, a verifier and an auditor, followed by immutable content in c-structs. Program looks like event-condition-actions. Actions are imperative code with loops and variables, etc. - dgm36

The Case for VOS: The Vector Operating System

A web server handling two simultaneous requests will make the same system calls multiple times. The similar system calls should be made only once, with vector arguments, and the vectorisation can be pushed down into the kernel implementation. This will cut down on syscall overhead, by batching the calls and using SIMD parallelism in execution. Memory protection with mprotect: some NVRAM work required doing this very often, getting up to 375k page protections per second. A vectorised mprotect was much better by amortizing context switches, optimizing data structures used and eliminating TLB flushes: 3x better. A challenge is handling convergence and divergence (where different elements require different work): we need some kind of fork-join parallelism or message passing. There is also a tradeoff in deciding when to join or to do replicated work. It will be hard to get the full benefits of this in Linux. The OS should be a staged event system instead, with message passing as an underlying primitive. - dgm36

Parallel webservers do lots of similar syscall sequences (accept, open, read, sendfile)... how about making an accept_vector thing? Could vectorise the rest of the thing too, like doing userland work to compress for Transfer-Encoding, or encrypt for HTTPS. Could ameliorate syscall costs, and could open opportunities to use SSE etc. Particular useful: vector-lookup-directory would make one walk not many. Another example: vector-mprotect using an async shared queue for U->K data movement. Things pushed into the queue are sorted so that again we make a single walk through the VMA-list. They think that lots of the gains were made by avoiding a TLB flush per syscall. Better still: pass flowgraphs down into the kernel which are used to route incoming packets or other hardware events. Kind of like sys_splice, or Vino-style SFI-in-kernel. (cs)

Operating Systems Must Support GPU Abstractions

GPUs are currently tricky to use. The OS conceives of them as an I/O device; we need a better abstraction to make them usable. There are user-mode libraries providing interesting interfaces, but the OS doesn't expose much of an interface (ioctl is all we get). We need a better interface because the kernel isn't providing fairness or isolation, and there's no interface usable by the kernel (without going via a userland surrogate). The schedulers for the CPU and GPU aren't integrated. Similarly there's no concept of a "background" GPU task, e.g. F@H vs. the user interface. Another problem: we can't easily pipeline GPU applications, because there's no OS-level "pipe" that's implemented by passing refs to GPU memory. Their proposed design does the logical thing of generalising pipes and also providing some support for event routing to facilitate a path from say a USB device straight to the graphics card without an indirection via system memory. (cs)

We need support for dataflow in the OS. The problem is that GPUs are underutilized, because they are difficult to program (SIMD execution model, no access to main memory and treated as an I/O device). Programmer-visible interface is things like processes, pipes and files, which map directly onto OS abstractions. For GPU programming, all you get is an ioctl, and this isn't enough for things like fairness, isolation, etc. There is no kernel-facing interface which inhibits the OS from using the GPU and managing it. It also leads to suboptimal data movement. A high CPU load will inhibit GPU performance because the CPU scheduler is not integrated with the GPU scheduler. This might be a problem when the GPU is used to do things like rendering the mouse pointer. Why can't we compose a graphics processing pipeline with pipes? Because of user/kernel mode crossings and the amount of data copying. The ideal programming interface is dataflow between tasks. - dgm36

Multicore OSes: Looking Forward from 1991, er, 2011

We have been saying that parallel programming is hard since 1991. High-level lessons are things like shared-memory with locks doesn't scale; shared-nothing hardware will prevail; and programming will involve message passing. The proposed model is lightweight messages and channels, like Erlang, CSP, etc. Message sending will be like procedure calling (or at least as efficient) and there will be a non-deterministic choice operator. In a channel-based OS, system calls become messages, like in a microkernel. There are some foreseeable issues including the implementation of choice, waiting for channels, and implementing protection. There is an assumption that there is hardware-level messaging. Microkernels use "middleweight messages" whereas this will use "lightweight messages". - dgm36

Hardware isn't getting faster; parallelism is the only way forwards. Hardware is heading towards shared-nothing; programming is likely to need explicit message passing. They advocate erlang-style programming with communicating lightweight threads, which can be assigned to OS threads as we desire. We need a complete system built along these lines, including the kernel (so, Barrelfish). Questions: what's the right model for virtual memory (shared-something, like Corey?); how should we deal with partial failure (i.e. one bit of your multikernel explodes). (cs)

Filed under: Uncategorized No Comments

HotOS 2011 LiveBlog Day 2

Posted by Chris Smowton

Disk-Locality in Datacenter Computing Considered Irrelevant

Hadoop et al try to put tasks where the disk containing their source is. But network bandwidth > disk bandwidth! And net bandwidth is alleged to be a plentiful resource! And disks aren't getting faster at anything like the rate of the network! (SSDs?) Memory locality is still worthwhile, though, but even modern datacentres can't just keep everything in memory (Facebook example: mem_bytes = data_bytes/200. Sadly this isn't much use: datacentre jobs typically have very large working sets, so it's all-or-nothing when it comes to in-memory caching. 96% of FB jobs can fit entirely in memory. Cache replacement algorithms need to be adapted to fit this sort of use case. (cs)

The key idea in datacenter computing is co-locating the computation with their input, and much work has gone into this. Looking at the Facebook production cluster, 85% of jobs see tasks reading from the network running just as fast as disk-local tasks; also, disk-local reads are not faster than rack-local reads. Networks are getting faster and over-subscription is reducing; however, disk speeds are plateauing. Locality matters a lot, because memory is two order of magnitude faster than disk. Problem for caching is that the working set is close to the size of the whole input, and you need to cache everything because a single from-disk job will be much slower. Apparently, 96% of jobs at Facebook can fit in memory. - dgm36

Optimising Data Partitioning for Data-Parallel Computing

Data partitioning controls the degree of parallelism. We have many possible choices of partition function, and degrees of partitioning which can have a big effect on performance and cost. There are problems of data and computation skew. Also, a balanced workload does not necessarily give optimal performance because we can get more cross-node traffic with more partitions. The problem is harder than for databases because we need to understand user-defined code, and data with less structure than simple tables. Analyzing user-defined code may be tricky if the amount of processing is data-dependent. Perform cost modeling and optimization to get a low estimated cost. Two approaches: white box (analytical formulation) and black box (based on down-sampling). - dgm36

Suppose we've got a boatload of data to process: how many chunks should we split it into to achieve data parallelism, and how should we perform the split? We want to avoid skew in data size, and also in compute costs applied to that data (i.e., one record is not necessarily like another). Perfect balance might not be the best idea either -- we have to consider the volume of cross-node traffic that will result. Challenge: analyse the user's code to figure out how long it will take to process a given record in advance of execution. Then apply a do global cost analysis using the predicted runtimes for each partition and the supposed properties of the network, etc. (cs)

Disks Are Like Snowflakes: No Two Are Alike

Historically, disks were fairly precisely spec'd and performed the same as others of their type notwithstanding partial failures. Now they're much more variable because disks are built to use "adaptive zoning". Disks have always used zoning: concentric rings with differing numbers of sectors per track. This means that ordinary disks show bulk throughput rate that varies in the manner of a staircase with logical block number, as we move between zones. More modern drivers vary much more unpredictably with logical offset because they've been adapatively zoned -- presumably fitting as many sectors as possible without exceeding some intolerable error rate. Apparently the variance stems mostly from the head -- how rapidly it responds to changing magnetic properties of the underlying disk surface. Different platters have different heads, so you can see differing rotation rates / zoning density per platter, as well. Therefore: we should expose that sort of knowledge to RAID drivers and the like, as the disks are not as alike as you might suppose, either within the disk or between them. (cs)

In the past, buying a batch of disks, you could measure one and assume that the rest would perform the same. Now, disks are unique by design. In 2002-era disks, the performance would be the same. Disk zoning techniques provide bandwidth proportional to the number of bits along the track, so store more data on the outer tracks. In 2006-era disks, there is much more variability, and this trend has increased in 2008-era disks. Adaptive zoning figures out the density capability adaptively based on how good the disk head is. This means there is no longer such a thing as a homogeneous set of disks, which has implications for striping, equal-work partitioning and other low-level disk techniques (about placement, etc.). - dgm36

The Case for Power-Agile Computing

Mobile devices have a variety of applications with differing power needs. We want to have the capability of all of a desktop/laptop/supercomputer/mote all in a mobile device. The idea is to have an ensemble of different storage and processing elements. There are challenges in measuring efficiency, predicting different ensemble's performance, selecting the appropriate ensemble, and transitioning between ensembles. - dgm36

Enter the genie. Actually. Man dressed as a genie. Pretty good sketch comedy for two ensues, the point of which is that it'd be neat if your mobile phone could scale between the characteristics of tiny mote-like devices all the way up to a desktop. Proposed solution: slam a bunch of diverse components into the same box, and power up the appropriate one at the right time. Like sellotaping my good ol' fashioned Nokia 6310 (for most of the time) to an iPhone (for the odd occasion I really need it). (cs)

Mobile Apps: It's Time to Move Up to CondOS

Phones have a bunch of sensors. Can figure out the user's current activity, location, "are they interruptible" (?). Can't do it with an app, because mobile apps are typically not allowed to run all the time. So let's do it in the OS. Or, just permit a background app, since that's functionally the same. Supply this sort of stuff to subscribers. Could use this sort of info to preload apps when we think they'll need them. Or adjust ringer volume based on background noise. Or customise the way the WLAN card operates depending on whether you're moving (done at NSDI). Apparently you could selectively apply password protection if you think they're away from home (!) Or, predict time to recharge and do power conservation based on that. (cs)

Devices are very programmable these days, and they have a bunch of different sensors on board. The question is how new apps might use new sensors. Sensors can provide contextual information that applications might use. Since there are some risks to this, we can't let applications have unfettered access, and cloud offload might be expensive. ConDOS will do it in the OS. Sensors export "context data units" rather than raw sensor data, and these can be composed into a graph (like e.g. Click modular router). If context is OS-managed, the system services can use context (e.g. for memory management to make applications more responsive, security, energy management, etc.). This abstraction can also aid sensor privacy (by adding policy decisions to how sensor data is exposed). - dgm36

Free Lunch: Exploiting Renewable Energy for Computing

Part of a grander scheme from Andy Hopper about Computing for the Future of the Planet. Computing uses a lot of energy, but some companies have responded by installing renewable energy technology. Wind and solar renewable energy at a given location is intermittent. The idea is to migrate workloads and data instead of energy. Data centers must be connected with high speed links. Challenges include predicting VM migration time and storage synchronization (solved in MASCOTS papers), and scheduling and placement. Migration has a 0.5s downtime overhead due to memory and disk over a 10 Gbps link. With 615 migrations a year, the downtime is 415 seconds. This fits within a 99.95% SLA. The energy consumption of migrating  one VM is 57.5 kilojoules, assuming 7.5GB of RAM and 20GB of modified disk. - dgm36

Renewable energy == good. Renewable availability varies with time (sun, wind). So migrate your VMs to where the power is. Challenges: predict migration time, virtualise net+storage well enough that you can do this without causing service outages. They estimate that downtime will be okay if you've got a 10Gbps link between the data centres. (cs)

Debug Determinism: The Sweet Spot for Replay-Based Debugging

Logging for replay debugging is expensive, especially if it's a data race. Existing systems use "relaxed determinism": produce a replay that's not quite the same but good enough. The best so far is "ESD", which guarantees only that the same failure will occur. Can be too relaxed, though -- they might produce the same output by back-inferring reasonable inputs, rather than unreasonable inputs plus a data race. It'll only really work if the output observed can *only* occur due to a data race. We need higher fidelity than that. They propose a system which has high debugging fidelity (indicating the right root cause for a problem) and low overhead. They propose to do this by recording with high fidelity the root cause and the failure, but we don't know what's going to be a root cause in advance. Need a heuristic for where the root cause is likely to lurk. (cs)

Non-Deterministic Parallelism Considered Useful

Distributed execution engines benefit from nondeterministic parallelism! They have some useful properties, like taking care of synchronisation and scheduling. Most distributed execution engines provide guaranteed termination; Ciel sacrificed this in the name of flexibility and expressiveness. Should we sacrifice our support for fault tolerance along similar lines? Yes! Because fault tolerance as we know it (Jim) requires determinisitic tasks, but the real world tends to perturb your programs and introduce nondeterminism (e.g. network delays, competing processes)... Requiring us to be deterministic in the face of such things requires us to sacrifice neat things like wait-for-multiple-handles, async interprocess signals... But what would we do when failures occur? We could fail the whole job, or at the other end we could record all nondeterministic events for replay; there are also a few intermediate points on this spectrum. Non-determinism is useful for other things too: handling interactivity, adapting the job to cluster conditions... (cs)

Finding Concurrency Errors in Sequential Code—OS-level, In-vivo Model Checking of Process Races

Process race: when multiple processes race for shared system resources like the filesystem. Neat example: ps aux | grep XYZ | wc might return 0 or 1 depending on whether grep is already running when ps runs, and if ps is started first, who wins out of the shell spawning grep and ps getting through /proc's directory listing. They did a survey for bug reports and found process races to be at least as common as thread races, and to often have serious consequence. Challenge: detect process races. Syscall interactions are complex... Their solution: do in-kernel record/replay of process/syscall events, and do offline analysis of the log, looking for racey interactions. They try flipping the order of apparently racey interactions and test the program (failure detection?) They have a prototype that's found some real races. (cs)

Privacy Revelations for Web and Mobile Apps

What OS support is needed to track personal information exposure? The research community has wasted time on coming up with clever (e.g. side-channel) attacks. The real problem is things that people in fact authorize without knowing it. The OS should give us transparency about what information about you is being sent over the network when we use mobile and web apps. The paper talks more about the idea of "privacy revelations" which are OS-supported, high-level descriptions of application exposure. They need to be relevant to the user, and actionable by the user, which current tracing frameworks don't give you. - dgm36

It's the TaintDroid people! Question: what OS support do you need to be able to control information release? Assertion: OS community has done privacy control badly. It's been identifying clever attacks rather than helping to plug the simple vulnerabilities, or help the users understand their privacy. The OS should make people aware when something they value is becoming known; then the it's the user's job to decide whether they're happy or not. (cs)

Do You Know Where Your Data Are? Secure Data Capsules for Deployable Data Protection

We need to share data in ways which we might not always be aware of, e.g. medical records. How would we manage propagation of sensitive health data without tying up your doctors? Need to support existing apps as there's a huge glut of it out there in the medical community, and perform well enough / be reliable enough that we never leave them without the information they need. Solution: wrap your app in a taint tracker to detect when stuff escapes, and wrap network-transmitted data with unforgable provenance information. This is essentially a way to implement the goals of the NHS National Program for IT (a disaster, incidentally) with the OS providing enforcement that I suspect the NPfIT does by trusting its applications. They're working on formally proving the intermediate components -- the network communication wrapper and the application sandboxing kit. (cs)

Critical data (e.g. healthcare records) often need to be shared between different stakeholders, often out of your custody. Today we have a choice between allowing anyone to do anything undetected (with your records) or die. Data Use Controls are the owner-set policy. We need to support current OSs and applications without limiting choice. Getting this to work with legacy is the killer app. Take the OS out of the TCB with Overshadow. Take the application out of the TCB with system call interposition and taint tracking. Need to store provenance and policy with the data. The DUC engine is a module in the OS that releases access to data based on policy. Validating and verifying formally the TCB is one of the other things being looked at. - dgm36

Making Programs Forget: Enforcing Lifetime for Sensitive Data

Lingering data is a key obstacle to privacy: it might be swapped out to disk, or lying around in a cache. Closing the application doesn't work; neither does rebooting. Can we rely on the application? Probably not. The aim of this talk is to come up with a system that can guarantee that all sensitive data and data derived from it will disappear after a given time limit. No application support is expected. Observe that there are equivalent states of the program that don't depend on sensitive data. The approach is "state reincarnation". Equivalent states are derived by deterministic replay with perturbed input. Picking the replacements for sensitive data is tricky---it can't just be arbitrary garbage. So is knowing that you've got everything completely scrubbed. Could use random garbage for replacing a sensitive email, but there might be some constraints on what we can use. Implemented recording at user-level and it gave a 1.2x slowdown on bash. - dgm36

Data you might want erased often gets copied all about the place. They want to allow the user to put a constraint on when your data will be destroyed, without requiring application support. They have a prototype: It tries to replace "sensitive state" with equivalent non-sensitive state. They use deterministic replay gubbins to replay your session without the sensitive input (presumably network traffic in most cases). Challenges: how do you track all the sensitive stuff? How do you keep the costs of using replay down? Problems: "equivalent state" often more complicated than removing the sensitive packets, e.g. might effect later sequence numbers. (cs)

More Intervention Now!

We want to predict what will happen if we intervene in a system to change some of its hardware. Observation only shows correlation; for causation we need intervention. We need some math, some theory and some engineering. We want to see if we can discover some causality from observational data if we don't have a full functional model. A Bayesian network is the formalism we need. The dual calculus gives us a set of rewriting rules we can use. A Dryad cluster with a few hundred machines gives us enough observational data for statistical significance. - dgm36

make world

Programs are rubbish, and you should improve them by using partial evaluation: meaning programs like Eclipse, Firefox and OpenOffice, which repeat a lot of work that they've done before (either in the same session or in a previous session). Example is a spelling checker, which is based on the global and a local dictionary, converts it into a machine-readable form and then uses it for checking. It wastes time by doing the conversion every time I use it. We could amortize this we converting it only once each session, but it might be hard to identify what a "session" is. Or we could rebuild the dictionary using a process like `make` each time. What if we could do this automatically? Partial evaluation does this for us. We can somehow treat the dictionary as a quoted constant, then do constant propagation through the rest of the program. Whole program optimization provides us with the mechanism for doing this. A prototype exists, and it can eliminate a program that does word counting on a file. Loop peeling optimizations evaluate the code on the program, and completely unwind the loop. There are challenges in adaptive optimization (with respect to the program's environment), using IPC between multiple processes (read from pipe or socket), and making it efficient. - dgm36

Filed under: Uncategorized No Comments