<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>syslog</title>
	<atom:link href="http://www.syslog.cl.cam.ac.uk/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.syslog.cl.cam.ac.uk</link>
	<description>The Cambridge Systems Research Blog</description>
	<lastBuildDate>Mon, 13 May 2013 16:39:07 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Liveblog from Eurosys 2013 &#8211; Day 3</title>
		<link>http://www.syslog.cl.cam.ac.uk/2013/04/17/live-blog-from-eurosys-2013-day-3/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2013/04/17/live-blog-from-eurosys-2013-day-3/#comments</comments>
		<pubDate>Wed, 17 Apr 2013 09:56:12 +0000</pubDate>
		<dc:creator>Natacha Crooks</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Research Agenda]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1171</guid>
		<description><![CDATA[Hi again from all of us here in Prague -- this is day 3 of Eurosys, the last day and we'll be running the live blog as usual! Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms). Session 1: Scheduling and Performance Isolation [No blog coverage available] hClock: Hierarchical QoS for Packet Scheduling [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft" style="margin: 5px;" alt="EuroSys Logo" src="http://eurosys2013.tudos.org/wp-content/themes/eurosys/images/supporters/Eurosys_logo.png" width="160" height="45" />Hi again from all of us here in Prague -- this is day 3 of Eurosys, the last day and we'll be running the live blog as usual!</p>
<p>Your friendly bloggers are <a href="http://www.cl.cam.ac.uk/~nscc2/">Natacha Crooks</a> (nscc), <a href="http://www.cl.cam.ac.uk/~icg27">Ionel Gog</a> (icg), Valentin Dalibard (vd) and <a href="http://www.cl.cam.ac.uk/~ms705/">Malte Schwarzkopf</a> (ms).</p>
<p><span id="more-1171"></span></p>
<h2>Session 1: Scheduling and Performance Isolation</h2>
<p>[No blog coverage available]</p>
<p><strong>hClock: Hierarchical QoS for Packet Scheduling in a Hypervisor </strong></p>
<p><em> Jean-Pascal Billaud and Ajay Gulati (VMware, Inc.) </em></p>
<p><strong>RapiLog: Reducing System Complexity Through Verification </strong></p>
<p><em> Gernot Heiser, Etienne Le Sueur, Adrian Danis, and Aleksander Budzynowski (NICTA and UNSW) and Tudor-Ioan Salomie and Gustavo Alonso (ETH Zurich) </em></p>
<p><strong>Application Level Ballooning for Efficient Server Consolidation </strong></p>
<p><em> Tudor-Ioan Salomie, Gustavo Alonso, and Timothy Roscoe (ETH Zurich) and Kevin Elphinstone (UNSW and NICTA)</em></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h2>Session 2: Scheduling and Performance Isolation</h2>
<p><strong>Omega: flexible, scalable schedulers for large compute clusters -- BEST STUDENT PAPER!</strong></p>
<p><em>Malte Schwarzkopf (University of Cambridge Computer Laboratory), Andy Konwinski (University of California Berkeley), and Michael Abd-el-Malek and John Wilkes (Google Inc.)</em></p>
<p>Omega is Google's next-generation cluster scheduling system. Scheduling in the cluster means taking the tasks that are part of a job and mapping them to machines. Google has observed a number of trends in recent years: workloads are becoming increasingly diverse, the size of the clusters keeps increasing, and so does the rate at which jobs arrive at the scheduler. The scheduling logic could be the same for all workloads / tasks, but over time they keep adding scheduling tweaks and special cases. The clusters are very big, with a huge number of machines. This complexity is hard to manage because the scheduler is one huge monolithic piece of code. The idea is therefore to break the scheduler up into independent schedulers. But in order to do that, you need to arbitrate resources between them in some way. There are various existing techniques: a monolithic scheduler, or static partitioning. There is also the two-level approach, where a resource manager component partitions the cluster according to use (e.g. Mesos).</p>
<p>Omega does something different: shared state. Try, hope for the best, and resolve any conflicts afterwards. There are n schedulers, each with a replica of the cluster state; each generates a "delta" which it sends to the global state. The deltas are applied to the shared cluster state, where they may succeed or conflict. On a conflict, the first one succeeds; the other one fails and may retry.</p>
<p>Workload characterisation: break workload into batch and service jobs. Observe that most jobs are batch, but most resources are consumed by service jobs. Batch jobs run for a shorter time but arrive much more frequently. Service jobs are less frequent (larger scheduling budget) but run for longer (so worth it).</p>
<p>Simulation workload:<br />
- Model the scheduler decision time. The more tasks in a job, the more time it takes to schedule it. There's also a constant baseline of work that you have to do for each job, independent of its size.</p>
<p>Vary the decision time for all jobs vs. the per-task decision time for each job vs. CPU load:<br />
1) Monolithic scheduler. Observation is that it doesn't scale for long jobs.<br />
2) Monolithic scheduler with a fast path for batch decisions. The problem is head-of-line blocking: even with two kinds of scheduling logic, the scheduler is not parallelised.<br />
3) Mesos. Failed to schedule jobs in every case. Mesos works based on offers, but it is greedy, so one scheduler could receive an offer of all available resources while the next one receives a tiny offer; if the first one is long-running, the second has to retry many times.<br />
4) Omega, no optimisations. Conflicts are a problem. Some transactions clash and schedulers have to redo work, so schedulers are busier.<br />
5) Omega with optimisations. The figure is comparable to the monolithic one; slightly worse because conflicts still occur.<br />
The Omega shared-state model performs as well as a complex monolithic scheduler with a fast batch path.</p>
<p>Scaling to many schedulers: scale up to 32 schedulers (with load balancing). Utilisation goes down by about 8 (rather than 32); that's because of conflicts. But still, quite a large number.</p>
<p>Trace-based simulation:<br />
How much interference is observed? There's a large overhead due to conflicts. It turns out this is because of oversubscription and placement constraints. Interference is higher in real-world settings.<br />
Optimisations: 1) Fine-grained conflict detection (avoiding false conflicts). Can slap a sequence number on each machine and increment it whenever the machine is touched. But if there's still headroom to run another task, allow scheduling anyway. Reduces the conflict rate significantly.<br />
2) Incremental commits: if one task out of a whole job conflicted, previously the entire job failed; now only the tasks that failed are retried, not the whole job.<br />
2x difference observed: a significant improvement in performance.</p>
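<p>A minimal Python sketch of the shared-state commit with both optimisations, under stated assumptions: the names <code>Machine</code>, <code>CellState</code>, <code>snapshot</code> and <code>commit</code> are hypothetical (they do not come from the talk), and resources are reduced to a single CPU count per machine. With <code>fine_grained=False</code> any concurrent change to a machine counts as a conflict; with <code>fine_grained=True</code> only a real overcommit does. Each task is accepted or rejected on its own, which is the incremental part.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    capacity: int   # total CPU units on this machine (simplifying assumption)
    used: int = 0   # units currently claimed
    seq: int = 0    # sequence number, bumped on every successful placement


@dataclass
class CellState:
    """Hypothetical shared cluster state that all schedulers commit to."""
    machines: dict = field(default_factory=dict)

    def snapshot(self):
        # A scheduler's private view: machine name -> sequence number it saw.
        return {name: m.seq for name, m in self.machines.items()}

    def commit(self, seen, claims, fine_grained=True):
        """Apply claims [(machine, cpu), ...] that were planned against `seen`.

        Every task is checked individually (incremental commit).
        Returns (accepted, rejected) lists of claims.
        """
        accepted, rejected = [], []
        for name, cpu in claims:
            m = self.machines[name]
            fits = m.capacity - m.used >= cpu
            if fine_grained:
                ok = fits  # conflict only if the machine would be overcommitted
            else:
                unchanged = m.seq == seen[name]
                ok = fits and unchanged  # any concurrent change is a conflict
            if ok:
                m.used += cpu
                m.seq += 1
                accepted.append((name, cpu))
            else:
                rejected.append((name, cpu))
        return accepted, rejected
```

<p>A scheduler would call <code>snapshot()</code>, plan its placements, call <code>commit()</code>, and retry only the rejected claims against a fresh snapshot.</p>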
<p>Case study: a MapReduce scheduler with opportunistic extra resources. The takeaway is to give flexibility so that custom policies are easy to support.</p>
<p>Conclusion: flexibility and scale require parallelism. Parallel scheduling works if you do it right, and using shared state + optimistic concurrency is the way to do it right.   <em>(nscc)</em></p>
<p>&nbsp;</p>
<p>Scheduling many jobs on many machines: because jobs are very different, many different schedulers have been designed. But they are all scheduling over the same set of resources. Previous solutions:<br />
-Static scheduler: split the machines into fixed-size groups, one per scheduler. Doesn't work if a scheduler is overloaded and doesn't have enough machines<br />
-Dynamic scheduler: like Mesos, each scheduler is assigned a set of machines which changes dynamically<br />
-Omega: each scheduler can schedule on any machine. If there is a conflict, one of the schedulers wins and schedules its job whilst the other loses</p>
<p>Workload characterisation: 2 types of jobs<br />
-Batch jobs: vast majority and very frequent, return in a short time (20 min)<br />
-Service jobs: very rare but run for a very long time (months); turn out to be the majority of the computation done because they run for so long<br />
They try to guess which type a job belongs to.</p>
<p>They do some experiments and observe something surprising about Mesos: a big proportion of jobs do not get scheduled by the scheduler. The reason is that when a single scheduler is given the right to schedule, all others are pretty much given no resources to play with if they try to schedule something concurrently.</p>
<p>They do the evaluation through simulation. Google turns out to have a high-fidelity simulator that uses real workloads, so they can evaluate it fully this way. The outcome is that there are a lot of conflicts (which result in a high time to schedule tasks) when scheduling real workloads, so they do 2 optimizations:<br />
-Fine grained conflict detection<br />
-Incremental commits: when jobs have many tasks, and some tasks fail let the other ones go through.<br />
They show that with these optimizations they are very close to the optimal scheduling time that they would get without conflicts  <em>(vd)</em></p>
<p>&nbsp;</p>
<p><strong>Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints </strong></p>
<p><em>Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica (UC Berkeley) </em></p>
<p>Large datacenters are heterogeneous. Why? Because it's very hard to keep machine configurations identical. There are two types of diversity. Hardware is more and more specialised, e.g. lots of GPUs, high-memory machines, hardware accelerators. Specialisation also happens on the software side.<br />
More and more jobs have placement constraints. Two types of constraints: hard constraints, which must be satisfied for the job to run on a particular machine, and soft constraints, which express a preference only.</p>
<p>How to fairly allocate machines when users have hard job placement constraints? The objective is to generalise max-min fairness to handle constraints. What policy to use? How to optimally achieve the policy offline? How to approximate the policy online? How well does it work? How to generalise scheduling with constraints?</p>
<p>Policy:<br />
Traditionally, there is a sharing incentive where each user gets 1/n of the resources. They develop a constrained sharing incentive (CSI): assume each user i provides k machines; then user i should be able to get at least k machines in any allocation. Shapley value: give 1/k of a machine to each of the k users who want it. But this violates CSI, unfortunately.</p>
<p>Proposed policy: constrained max-min fairness (CMMF) recursively maximises the allocation of the user that has the fewest machines. An allocation is CMMF iff it is not possible to increase the minimum allocation for any subset of users by reallocating machines within the subset. It is the only policy that satisfies CSI. CMMF is strategy-proof: lying about demands can only hurt your application.</p>
<p>The intuition is that it is basically doing water filling: initially mark all users as non-frozen. Increase the allocation of all the non-frozen users at the same rate until a bottleneck is hit. Freeze the share of users that cannot get more without hurting others. Then repeat the process until all users are frozen. This only freezes shares, not the actual assignments (of machines), which might shift. To determine the share, solve a linear program with machine and fairness constraints. Once done, to determine the frozen users, add another constraint requiring that frozen users receive exactly m shares (m being the value maximised in the previous program). This determines which users are frozen.</p>
<p>To determine the frozen users in an allocation, fix everyone's current shares except user i's. Freeze user i iff its allocation cannot be increased when everyone else is frozen.</p>
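<p>To make the raise-and-freeze loop concrete, here is a small Python sketch of textbook water filling for plain max-min fairness. It is deliberately a simplification and not the algorithm from the talk: there are no machine constraints and no linear program, only one divisible pool of capacity and a demand cap per user. The function name <code>water_fill</code> and its arguments are hypothetical.</p>

```python
def water_fill(capacity, demands):
    """Max-min fair shares of `capacity` among users with demand caps.

    demands: user -> maximum amount that user can use.
    Unconstrained textbook version; the constrained variant described
    above would solve a linear program in each round instead.
    """
    shares = {}
    active = set(demands)          # users that are not frozen yet
    remaining = float(capacity)
    while active:
        # Equal share each non-frozen user would get from what is left.
        level = remaining / len(active)
        # Users whose demand is below that level hit their bottleneck first.
        bottlenecked = {u for u in active if demands[u] <= level}
        if not bottlenecked:
            # Nobody is capped: everyone left gets the same level and we stop.
            for u in active:
                shares[u] = level
            break
        for u in bottlenecked:
            shares[u] = float(demands[u])   # freeze this user's share
            remaining -= demands[u]
        active -= bottlenecked
    return shares
```

<p>For example, with 10 units and demands of 2, 4 and 10, the first two users are frozen at their demands and the third gets the remaining 4.</p>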
<p>They try to approximate this more efficiently, because cloud schedulers make decisions online and jobs may consist of many thousands of tasks, so tasks need to be scheduled quickly on the fly. Approximate an offline scheduler that cannot preempt or migrate running tasks.</p>
<p>Evaluation: how different is Choosy from the optimal scheduler previously mentioned? How different are the allocation vectors and the job completion times? 90% of the time, the difference in allocation (root mean square error) is less than 0.2% (a tiny bit worse if you allow the scheduler to migrate and preempt). For job completion times: Choosy has almost optimal job completion time.</p>
<p>Why does Choosy work? Ramp-up time is fast for Choosy: users quickly get their fair share. Ramp-up time depends on the pickiness of the user (nb of machines the user needs / nb of machines the user can run on). What's happening is that, if a user is not picky, it doesn't have to wait for outliers, which is why it is fast.</p>
<p>How can this be generalised?<br />
-&gt; Soft constraints like data locality. Existing techniques like the Delay scheduler and Mantri can be combined with Choosy.<br />
-&gt; Multi-resource fairness. Choosy is very similar to DRF: schedule the user with the minimum dominant share satisfying the constraints.<br />
-&gt; Hierarchical scheduling: most hierarchical schedulers support composition.</p>
<p>Conclusion: constrained max-min fairness is the only policy providing the sharing incentive; the optimal offline calculation is based on iterative linear programming. Choosy is an online system close to the offline version.  <em>(nscc)</em></p>
<p>&nbsp;</p>
<p>Large datacenters are heterogeneous. 2 types of differences:<br />
-Different specialised hardware (e.g. GPUs)<br />
-Different software (e.g. kernel version)</p>
<p>2 types of constraint:<br />
-Hard constraints: I need to have a public address (they focus on this)<br />
-Soft constraints: data locality</p>
<p>What policy to use: They generalise max min fairness with Constrained max-min fairness =&gt; recursively maximise the allocation of the user that has the fewest machines. They seem to have some cool proofs on this being the only way to satisfy some max-min like properties, but it's only in the paper. The CMMF works in the basic water filling algorithm way for min max: keep on maximising the resources allocated to the more unlucky ones. This can be turned into a linear programming algorithm that needs to be computed iteratively.</p>
<p>But doing this is expensive so they approximate it in the following way: approximate an offline scheduler that cannot preempt tasks.</p>
<p>What Choosy does: whenever a resource frees up, allocate it to the more unlucky ones.<br />
They compare Choosy with optimal schedulers, with different definitions of optimal. Choosy seems to perform very well compared to optimal. I am confused as to what they call optimal. Do they mean the optimal max-min fair allocation? If one job is better at parallelising than another, the max-min allocation should be different from the optimal (i.e. utility sum maximization) allocation.</p>
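<p>The online rule (when a resource frees up, give it to the unluckiest user) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and the <code>allocation</code> / <code>eligible</code> dictionaries are hypothetical, every user is assumed to have pending tasks, and ties are broken by user name.</p>

```python
def assign_freed_machine(machine, allocation, eligible):
    """Give a freed machine to the eligible user holding the fewest machines.

    allocation: user -> number of machines currently held (updated in place)
    eligible:   user -> set of machines that satisfy the user's hard constraints
    Returns the chosen user, or None if no user can run on this machine.
    """
    candidates = [u for u in allocation if machine in eligible[u]]
    if not candidates:
        return None
    # Fewest machines first; the user name is only a deterministic tie-break.
    user = min(candidates, key=lambda u: (allocation[u], u))
    allocation[user] += 1
    return user
```

<p>Note that a user with fewer machines overall can still be skipped if its hard constraints exclude the freed machine, which is exactly where the constrained case differs from plain max-min.</p>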
<p>After questions: turns out they were using the optimal CMMF allocation.</p>
<p>Personal point of view: it is a shame they didn't have a look at the optimal allocations, and even further, at the tradeoff between alpha-fairness (optimal: alpha=0, min-max: alpha = infinity) and performance, even if just from a theoretical point of view without considering the complexity of the scheduling algorithm. That would have given great insight into the heterogeneity (or not) of the workload: how different are the jobs' performance-under-parallelisation characteristics? Instead they made the assumption that min-max is best, with no backup and no insight into whether it is.  <em>(vd)</em></p>
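<p>For reference, the standard alpha-fair utility family that this remark alludes to is usually written as below. This is the textbook definition, not something shown in the talk; here x_i stands for the allocation (or resulting performance) of job i, which is an assumption about how one would apply it to this setting.</p>

```latex
U_\alpha(x) =
\begin{cases}
  \dfrac{x^{1-\alpha}}{1-\alpha} & \alpha \ge 0,\ \alpha \ne 1 \\[6pt]
  \log x & \alpha = 1
\end{cases}
\qquad\text{and the allocation solves}\qquad
\max_{x} \sum_i U_\alpha(x_i)
```

<p>With alpha = 0 the objective is the plain sum of the x_i (pure efficiency), alpha = 1 gives proportional fairness, and as alpha grows towards infinity the solution approaches the max-min fair allocation.</p>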
<p>&nbsp;</p>
<p><strong>CPI2: CPU performance isolation for shared compute clusters </strong></p>
<p><em>Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes (Google, Inc.)</em></p>
<p>[no blog coverage]</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h2>Awards</h2>
<p>PC members nominated papers based on merit. PC members then listened to the talks and to attendees to refine the list. The PC chairs made the final decision.</p>
<p><strong>Best Paper Award:   </strong>BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data<br />
<strong>Best Student Paper:   </strong>Omega: flexible, scalable schedulers for large compute clusters</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2013/04/17/live-blog-from-eurosys-2013-day-3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Liveblog from EuroSys 2013 &#8211; Day 2</title>
		<link>http://www.syslog.cl.cam.ac.uk/2013/04/16/live-blog-from-eurosys-2013-day-2/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2013/04/16/live-blog-from-eurosys-2013-day-2/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 07:02:11 +0000</pubDate>
		<dc:creator>Ionel Gog</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research Agenda]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1144</guid>
		<description><![CDATA[Hi from all of us here in Prague -- this is day 2 of Eurosys and we'll be running the live blog as usual! Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms). Session 1: Large scale distributed computation II Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat, [...]]]></description>
				<content:encoded><![CDATA[<p><img alt="EuroSys Logo" src="http://eurosys2013.tudos.org/wp-content/themes/eurosys/images/supporters/Eurosys_logo.png" width="160" height="45" />Hi from all of us here in Prague -- this is day 2 of Eurosys and we'll be running the live blog as usual!</p>
<p>Your friendly bloggers are <a href="http://www.cl.cam.ac.uk/~nscc2/">Natacha Crooks</a> (nscc), <a href="http://www.cl.cam.ac.uk/~icg27">Ionel Gog</a> (icg), Valentin Dalibard (vd) and <a href="http://www.cl.cam.ac.uk/~ms705/">Malte Schwarzkopf</a> (ms).</p>
<p><span id="more-1144"></span></p>
<h2>Session 1: Large scale distributed computation II</h2>
<p><strong> Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing</strong></p>
<p><em> Zuhair Khayyat, Karim Awara, and Amani Alonazi (King Abdullah University of Science and Technology), Hani Jamjoom and Dan Williams (IBM T. J. Watson Research Center, Yorktown Heights), and Panos Kalnis (King Abdullah University of Science and Technology)</em></p>
<p>&nbsp;</p>
<p>Natacha Crooks - Mizan: a system for dynamic load balancing in large-scale graph processing</p>
<p>Researchers use graphs to abstract application-specific algorithms into generic problems represented as interactions using vertices and edges. Although they all have similar computation behaviour, each application has its own computation requirements. The key system is Pregel, which is based on vertex-centric computation and is BSP. The main ideas of Pregel are to be in-memory and to use message passing. Pregel has a set of supersteps and a barrier which marks the end of each superstep. Balanced computation and communication is fundamental to Pregel's efficiency, as performance depends on the slowest worker. Existing work focuses on optimising for the graph structure. The claim there is that users know what their data looks like, so should know how to partition it well. But the authors believe that none of this work considers the algorithm's behaviour, and that looking at static properties (like graph structure) is not enough: dynamic optimisations are needed. There are two categories of algorithm based on their behaviour: stationary and non-stationary. PageRank is an example of a stationary algorithm (the same set of vertices remains active) whereas DMST is non-stationary. The computation imbalance in non-stationary algorithms is caused by the fact that in a superstep, some workers will be doing the bulk of the computation work and receive most of the messages, and this will change over time (given that the set of active vertices changes).</p>
<p>Mizan is a BSP-based graph processing framework which uses runtime fine-grained vertex migrations to balance computation and communication. The key objectives of Mizan are to be decentralised, simple, and transparent: no need to change Pregel's API, and no a priori knowledge of the graph structure or algorithm is assumed. Mizan also consists of supersteps. They add an extra layer: the migration layer. Mizan does all of its planning/migration in this layer: compute statistics, do the new planning, and then do the migrations.</p>
<p>To plan the migrations, they look for the source of the imbalance by comparing the workers' execution times against a normal distribution and flagging outliers. Mizan monitors statistics for vertices: remote outgoing messages, all incoming messages, and response times. These statistics are broadcast to each worker. Mizan then tries to find the strongest cause of workload imbalance by comparing the statistics for outgoing and incoming messages of all workers with the workers' execution times. They then pair each over-utilised worker with a single under-utilised worker. After pairing, they look at which vertices to migrate. Then do the migration; key questions: 1) how to know the new locations, 2) how to do fast migration, 3) how to recompute state, 4) how to broadcast the info. They use a DHT to implement a distributed lookup service, where a vertex can execute at any worker, but there is a notion of a home worker. Workers ask the home worker of V for its current location, and the home worker is notified of changes. For migrating vertices with large message sizes, they introduce something called delayed migration. They move the ownership of the vertex one superstep before actually moving the vertex.</p>
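<p>A rough Python sketch of the first two planning steps (flag outlier workers, then pair each with one under-utilised worker). Everything concrete here is an assumption made for illustration: the function names, the z-score threshold of 1.5, and the slowest-with-fastest pairing order are not taken from the talk.</p>

```python
from statistics import mean, pstdev


def find_overutilised(times, z_threshold=1.5):
    """Workers whose superstep time is an outlier on the slow side.

    times: worker -> execution time of the last superstep.
    Returns the outliers, slowest first.
    """
    mu = mean(times.values())
    sigma = pstdev(times.values())
    if sigma == 0:
        return []  # perfectly balanced, nothing to flag
    slow = [w for w, t in times.items() if (t - mu) / sigma > z_threshold]
    return sorted(slow, key=times.get, reverse=True)


def pair_workers(times, z_threshold=1.5):
    """Pair each over-utilised worker with one distinct under-utilised worker."""
    over = find_overutilised(times, z_threshold)
    fastest_first = sorted(times, key=times.get)
    under = [w for w in fastest_first if w not in over]
    # Slowest outlier gets the fastest worker, and so on down the lists.
    return list(zip(over, under))
```

<p>In the system described above each worker would run this on the broadcast statistics, so no central coordinator is needed; the sketch just shows the computation on one copy of the data.</p>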
<p>Q: did you try the single machine approach?<br />
Yes, we ran some of the algorithms on a single machine but didn't really look at the overhead.<br />
Q: but this fits in memory (300 million nodes)<br />
Yes, you can run it on a single machine, but that's not interesting for us because we wanted to see how it would work in a distributed setting, and we think that it would scale up.</p>
<p>Valentin:</p>
<p>The talk starts by describing the BSP model and its Pregel implementation. The discussion then moves on to the partitioning method used for splitting the vertices among machines.<br />
They split graph algorithms into two classes: stationary and non-stationary. This refers to which vertex programs need to be executed. Algorithms that are non-stationary execute different vertices at different BSP iterations, like minimum spanning tree. Stationary algorithms execute the same ones every time, like PageRank.<br />
=&gt;Conclude that graph structure is not all (I think he is referring to PowerGraph optimizing for the power law distribution) but that the way the run time works is also important<br />
Mizan uses a BSP model, exactly the same API as Pregel.</p>
<p>Mizan adds to the execution model a Migration Barrier after the BSP barrier. The aim is to balance the partitioning then. What they do at this barrier:<br />
-Identify the source of imbalance (by looking at runtime)<br />
-Select the migration objective<br />
-Pair over-utilised workers with under-utilised ones in a decentralised way<br />
-Migrate vertices</p>
<p>There is the problem that now we don't know where vertices are located to send messages. They solve this by using a DHT mapping vertex IDs to workers. They move large vertices in multiple goes. Basically some of the messages are sent to the old worker, which forwards them to the worker that now holds the vertices.</p>
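<p>The home-worker lookup can be illustrated with a toy, single-process Python sketch. It assumes integer vertex IDs and a fixed number of workers, and the class and method names (<code>LocationService</code>, <code>home_of</code>, <code>locate</code>, <code>migrate</code>) are hypothetical; a real DHT would spread the per-worker directories across machines and exchange messages instead of sharing a list.</p>

```python
class LocationService:
    """Toy stand-in for a DHT that maps vertex IDs to their current worker."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        # One directory per worker, holding only the vertices it is home to.
        self.directories = [dict() for _ in range(num_workers)]

    def home_of(self, vertex_id):
        # The home worker never changes, so anyone can compute it locally.
        return vertex_id % self.num_workers

    def locate(self, vertex_id):
        # Ask the home worker; a vertex that never moved still lives at home.
        home = self.home_of(vertex_id)
        return self.directories[home].get(vertex_id, home)

    def migrate(self, vertex_id, new_worker):
        # Notify the home worker that the vertex now runs elsewhere.
        home = self.home_of(vertex_id)
        self.directories[home][vertex_id] = new_worker
```

<p>The point of the design is that only the home worker has to learn about a migration; everyone else can find the vertex with one extra lookup.</p>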
<p>My personal point of view: they evaluate on graphs no bigger than 2.5GB, which fits into the memory of a single computer. They showed no scaling for the sort of graphs that you would hope to process on what they called "1000s of machines". And yet they didn't look at the overhead when compared to a single-machine implementation.</p>
<p>A nice question was also about the fact that when they repartition, they are one step behind in terms of workload, and it is not obvious that successive BSP iterations correlate in workload.</p>
<p><strong>MeT: Workload aware elasticity for NoSQL</strong></p>
<p><em> Francisco Cruz, Francisco Maia, Miguel Matos, Rui Oliveira, Joao Paulo, Jose Pereira, and Ricardo Vilaca (HASLab / INESC TEC and U. Minho)</em></p>
<p>&nbsp;</p>
<p>Natacha Crooks - Elasticity is specific to the cloud computing paradigm. Elasticity means growing computational resources according to demand. NoSQL databases manage the bulk of the data from modern web applications; they are scalable and dependable systems, with data partitioned across several computing nodes. An external system is required to know when to scale out/up and add/remove nodes. Correctly configuring a NoSQL database is a difficult task because there are many configuration parameters. For example, in HBase, block cache size and memstore size are the parameters that most affect cluster performance. Block cache size favours read requests, memstore size favours write requests. But reconfiguring these parameters implies having to restart the system.</p>
<p>There is heterogeneity in how data is accessed. Different applications have different access patterns, which may change over time, and within an application there can be data access hot spots. But they claim that locality of accesses is no longer relevant. To validate their hypothesis, they compare random data placement with a homogeneous configuration, manual placement with a homogeneous configuration, and manual placement with a heterogeneous configuration, where they classify partitions per access pattern and use a manual data load balancer.</p>
<p>MeT is a cloud management framework which can do automatic management of NoSQL database clusters and reconfigure them automatically based on data access patterns. The decision maker is twofold: first the decision algorithm, which decides based on resource usage metrics whether to add or remove nodes, and second the distribution algorithm, which first classifies regions, then groups them together, then assigns them to the correct configuration.</p>
<p>Claims that by configuring databases heterogeneously, throughput can be improved by 35%, both in multi-tenant and single-tenant scenarios. Data partitions can be placed on specifically configured nodes according to their access patterns.</p>
<p>Q: The decision algorithm decides whether the system is in a suboptimal state. But the distribution algorithm is separate from it, so it doesn't depend on the first one. How does this work? In the distribution algorithm, can you have oscillations where the migration cost is expensive? And if yes, what do you do to prevent this?<br />
A: In the first iteration, we have no information about the cluster. We just look at what the monitor says, which is CPU utilisation etc., and decide that adding new nodes can lower that. Second question: currently working on a cost function for whether we should or should not alternate between two states.</p>
<p>Q: How would MeT do for real application workloads (highly skewed workloads)?<br />
MeT would look at that load and decide that a particular configuration should be assigned to an entire region server.</p>
<p>&nbsp;</p>
<p><strong>Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices</strong></p>
<p><em> Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (University of Chicago), and Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber (HP Labs)</em></p>
<p>&nbsp;</p>
<p>Natacha Crooks -<br />
Trends in large-scale processing frameworks: first, data-parallel frameworks, where you process each record in parallel. Graph-centric frameworks address limitations of the former. The latest is array-based frameworks (MadLINQ), which process blocks of arrays in parallel. For example, compute PageRank using matrices. There are a number of algorithms that can be written as linear algebra operations on sparse matrices, for which array-based frameworks are very well suited.<br />
Presto enables large-scale machine learning and graph processing on sparse matrices. Their approach is to extend R and make it scalable / distributed. One problem is that there is a tremendous imbalance in the amount of data located in each block, which leads to computation imbalance. The second challenge is how to share data efficiently. Sparse matrices suffer significantly from communication overhead. Can't just share data through pipes/networks because it is both time-inefficient (sending copies) and space-inefficient (multiple copies).</p>
<p>They add two primitives to R. darray is one. If you have large data, darray gives you a handle on how it is distributed (row/column-based partitions). "foreach" lets the user specify a function which is executed on particular parts of the array. foreach then executes the function on the cluster. The programmer has the flexibility to specify which partitions are going to be accessed and to compute this function on the cluster.</p>
<p>The Presto architecture is a master-slave architecture. The master is linked with the user shell. Each worker consists of a number of R instances, which are managed by a worker process. All the darrays are stored in DRAM on each machine.</p>
<p>They provide an online repartitioning system. They profile execution based on the amount of time it takes to compute on a particular partition. They can then check whether there is an imbalance between partitions. This is an iterative process: they do the online partitioning in an iterative manner as well. One of the big challenges is how to deal with multiple distributed arrays, and how to maintain size invariants between them.</p>
<p>In order to share distributed arrays efficiently, the objective was to do zero-copy sharing across cores. The intuition is that if partitions are immutable then they are safe/easy to share. They create versioned distributed arrays, so any change creates a new version of a partition, and these can then be assembled to create the full version. This makes it a lot easier to share. Other challenges include garbage collection and header conflicts (linked to having multiple instances of R sharing objects). So they override R's allocator. They allocate process-local headers, and map data in shared memory.</p>
<p>Q: have you talked with people who write R?<br />
They are interested in things of the sort, but they are interested in doing high-level things like "do linear regression", so what's left to do is build these libraries on top.</p>
<p>Q: what about matrix/matrix multiplication (as opposed to matrix/vector multiplication)?<br />
Yes, it also applies (see Netflix Collaborative Filtering slide)</p>
<p>Valentin:</p>
<p>With big data, a trend of machine learning + graph algorithms. He studies the trend in what is computed:<br />
MapReduce 2004<br />
Pregel 2010 to do graph computation<br />
MadLINQ 2012 to process blocks of arrays in parallel</p>
<p>Presto extends R to be scalable and distributed for sparse matrices. The first issue comes from how to split the matrix. If you just do it at random, the power law means some partitions are much denser than average. The second issue is how to send messages efficiently.</p>
<p>They add 2 primitives to R:<br />
-darray: distributed arrays<br />
-foreach(): function that can be executed in parallel on the darray.</p>
<p>Presto architecture: Master/worker<br />
Master controls what goes where, but workers talk to each other to pass messages.</p>
<p>Dealing with partitioning. They repartition at runtime by profiling the time it takes to do iterations on each machine. They repartition if max_time/median_time &gt; delta.</p>
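<p>A minimal sketch of the repartitioning trigger described above, assuming the profile is simply a list of per-partition compute times. The function name and the choice to skip the check when the median is zero are illustrative assumptions, not taken from Presto.</p>

```python
from statistics import median


def needs_repartition(partition_times, delta):
    """Return True when the slowest partition takes more than `delta`
    times as long as the median partition (max_time / median_time > delta)."""
    if not partition_times:
        return False
    med = median(partition_times)
    if med == 0:
        # Assumption: nothing has been measured yet, so do not repartition.
        return False
    return max(partition_times) / med > delta


# Hypothetical profile: one partition is five times slower than the rest.
print(needs_repartition([1.0, 1.0, 1.0, 5.0], delta=2.0))
```

<p>Comparing against the median rather than the mean keeps a single very slow partition from hiding itself by inflating the baseline.</p>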
<p>Dealing with sharing efficiently. They override R's allocator. Not too sure what went on there.</p>
<p>Evaluation:<br />
The first thing they show is how easy it is to use: it has all the advantages of R (e.g. plotting). It looks very neat. They then show the efficiency of the dynamic repartitioning, and it does seem like they get very good results.</p>
<p>A good question: isn't linear algebra restrictive? A: Yes it is, but linear algebra is used for a lot of things, so it may as well be efficient.</p>
<p>&nbsp;</p>
<h2>Session 2: Operating System Implementation</h2>
<p>&nbsp;</p>
<p><strong>RadixVM: Scalable address spaces for multithreaded applications</strong></p>
<p><em> Austin T. Clements, Frans Kaashoek, and Nickolai Zeldovich (MIT CSAIL)</em></p>
<p>&nbsp;</p>
<p>Natacha Crooks - Parallel applications use VM intensively, which puts stress on the memory system because it has to serialise all memory operations, leading to huge scalability problems. Independent VM operations operate on non-overlapping regions, but the memory system can't get this to scale. The goal is to have perfectly scalable mmap, munmap and page fault operations on non-overlapping address space regions.</p>
<p>The problem is that most OSes have a big lock around their page table. The OS doesn't know which CPUs have a given page mapping cached, so it has to broadcast TLB shootdowns. There's also heavy cache contention on TLB misses. The big issue here is cross-core communication. RadixVM addresses this by eliminating communication between cores, using a concurrent memory map representation and targeted TLB shootdowns.</p>
<p>Need to store OS-level metadata about all memory mappings in memory. Most OSes use a balanced tree of region objects, which introduces unnecessary communication even if it is memory efficient: there is still a need to transfer ownership of tree nodes just for reads. Could use an array-based memory map instead: operations on non-overlapping regions are concurrent and induce no communication, so it avoids transferring ownership. But the space use is gigantic. The other problem is that operations take time proportional to the size of the region they're operating on.</p>
<p>The proposed solution is a range-oriented radix tree, which enables good compression: fold constant-valued chunks into the parent, recursively. It's only 2 to 3 times the size of a normal tree whilst having better performance.</p>
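<p>To illustrate the folding idea only, here is a toy sketch in which a node is either a leaf value or a list of children, and a subtree whose leaves all hold the same value collapses into that single value. The representation and the <code>fold</code> name are assumptions for illustration; the real kernel data structure is certainly more involved.</p>

```python
def fold(node):
    """Collapse any subtree whose children all hold the same leaf value.

    A node is either a leaf value (anything that is not a list) or a
    non-empty list of child nodes.
    """
    if not isinstance(node, list):
        return node
    children = [fold(child) for child in node]
    first = children[0]
    # Fold only when every child is a leaf and all leaves are equal.
    if all(not isinstance(c, list) and c == first for c in children):
        return first
    return children


# A fully uniform region folds to one value; a mixed one keeps its detail.
print(fold([["rw", "rw"], ["rw", "rw"]]))
print(fold([["rw", "ro"], [None, None]]))
```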
<p>The TLB shootdown: in the common case, there is little or no sharing. A software-managed TLB would make this easy, because one could implement a trap-and-track mechanism. They simulate this by having per-core page tables and interposing on page faults. When a CPU misses, the fault handler records the fact that this CPU now knows about this particular mapping, so TLB shootdowns can be targeted.</p>
<p>The last scalability issue is reference counting for physical pages and radix nodes. The reference counter has to be scalable: there is a need to limit communication between increment and decrement operations (if they occur on the same radix node). One option is a distributed counter that gives each CPU its own slot, but it is very expensive to detect when the counter has gone down to 0. Another idea is to start with a distributed counter and build a tree on top of it, so that one only needs to look at the root of the tree to know whether it is 0 or not. Their solution, Refcache, gives up immediate zero detection: they use a shared counter with a per-core cache of changes to that counter. Each CPU keeps a cache of changes (deltas), rather than values. When a CPU performs an operation on a refcount, it goes to its local cache. This means that the true value of the refcount is the sum of its global count and the deltas stored in the per-core caches. So the question is: when is the true count 0? They make one assumption: when the true count is zero, it will stay zero. So what they do is divide time into epochs. Each epoch, all CPUs flush their delta caches. If an object's global count stays zero for a whole epoch, then its true count is zero. So the claim is that Refcache enables time- and space-efficient scalable reference counting with minimal latency.</p>
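<p>A toy, single-threaded model of the epoch idea as I understood it from the talk: a shared count plus per-core deltas, where the object only becomes freeable once the global count has been zero for an entire epoch. The class and method names are made up, and the real Refcache surely differs in detail; this only shows why a transient zero during a flush is not enough.</p>

```python
class RefcacheModel:
    """Simplified model: shared counter with per-core delta caches."""

    def __init__(self, num_cores, initial=0):
        self.global_count = initial
        self.deltas = [0] * num_cores
        # True while the global count has been zero since the epoch began.
        self.zero_all_epoch = (initial == 0)

    def inc(self, core):
        self.deltas[core] += 1  # touches only core-local state

    def dec(self, core):
        self.deltas[core] -= 1

    def true_count(self):
        return self.global_count + sum(self.deltas)

    def flush(self, core):
        """Apply one core's cached delta to the shared count."""
        self.global_count += self.deltas[core]
        self.deltas[core] = 0
        if self.global_count != 0:
            self.zero_all_epoch = False

    def end_epoch(self):
        """Call after every core has flushed; True means safe to free."""
        can_free = self.zero_all_epoch
        self.zero_all_epoch = (self.global_count == 0)
        return can_free


rc = RefcacheModel(num_cores=2, initial=1)
rc.dec(1)
rc.inc(0)
rc.flush(1)            # global count is transiently 0 here
rc.flush(0)            # ...but core 0 still held a reference
print(rc.end_epoch())  # not freeable: the count did not stay zero
```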
<p>So the big picture: the radix tree memory map, the per-core page tables, and the reference-counted physical pages are the three structures that they use. Page faults lock the faulting page, record the faulting CPU, allocate the backing page, and increment the reference count of the physical page, for example. munmap sends targeted shootdowns, decrements the count in the local cache, and then removes the backing pages.</p>
<p>For the Metis multicore MapReduce benchmark, Linux fails to scale because of page fault lock contention. RadixVM performs significantly better, and fails to scale only because of pairwise sharing. Refcache avoids cache line sharing: they repeatedly map/unmap a shared physical page, and demonstrate that RadixVM scales almost linearly.</p>
<p>Claimed contributions: radix trees, per-core page tables, and Refcache for scalable, space-efficient reference counting.</p>
<p>Q: Looks very similar to guarded page tables?<br />
A: Not familiar with those, but will take a look.<br />
Q: With 80 cores, do you have 80x overhead for page tables?<br />
A: In the simple approach, the worst case is 80x, but they find that most applications do not actually hit the worst case.</p>
<p>Ionel:</p>
<p>- Stress on the kernel Virtual Memory system.<br />
- Every popular OS serializes mmap =&gt; malloc/free have scalability problems.<br />
- Goal: Perfectly scalable mmap, page fault, munmap on non-overlapping address space regions.<br />
- Why doesn't it scale? 1) TLB shootdown broadcasts, 2) locking and 3) cache contention =&gt; all three involve cross-core communication.<br />
- Metadata management: popular OS uses a balanced tree of region objects. However, this involves unnecessary communication.<br />
- How about array-based memory map? Good: Operations on non-overlapping regions are concurrent. Bad: Space usage, time is proportional to region size.<br />
- Range-oriented radix tree in which we fold constant-valued chunks into parent, recursively. In practice the tree ends up only being 2x-3x the size of the balanced region tree.<br />
- TLB shootdown - Which CPUs have a mapping cached? The OS doesn't really know that.<br />
- A software-managed TLB would make this easy. Trap &amp; track.<br />
- Can simulate software-managed TLB via per core page tables<br />
- Reference counting for physical pages and radix nodes<br />
- Limit communication between inc/dec counters.<br />
- Solution: Refcache - gives up immediate zero-detection to achieve O(1) space and O(1) zero-detection cost.<br />
- Shared counter with per-core delta cache.<br />
- Divide time into epochs. After each epoch each core will flush its delta cache.</p>
<p>Q: Looks similar to the guarded page tables from L4? What's the difference?<br />
A: I am not familiar with that work. I'll look it up.</p>
<p>&nbsp;</p>
<p><strong>Failure-Atomic msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data</strong></p>
<p>Stan Park (University of Rochester), Terence Kelly (HP Labs), and Kai Shen (University of Rochester)</p>
<p>&nbsp;</p>
<p>Key question is how to maintain consistency over application failures. What they built allows the programmer to evolve durable state failure-atomically: all or nothing, always consistent despite power outages, process crashes, and fail-stop kernel panics. Failure-atomic msync plays well with POSIX: MS_INVALIDATE gives rollback functionality for failed transactions.</p>
<p>To implement this, they need to keep state consistent between msyncs and keep state consistent during an msync. They leverage a journaling-based approach. The journal is a redo log, and each entry is checksummed. File updates are written to the journal; this out-of-place write keeps the file consistent until the full update transaction is durable, and once it is durable, it is applied to the file system. Two options: eager vs. async journaled writeback. Eager writeback will flush all file-system-layer dirty pages, including previously journaled pages. Async writeback distinguishes between unjournaled dirty and journaled dirty pages, and can defer non-critical work as a result.</p>
<p>They extend the CFS interface so that it's possible to have multiple non-contiguous pages in a given range. They can support richer journaling in the file system, where all the work for a failure-atomic operation (multiple non-contiguous block updates) can be encapsulated in a single transaction, so it's written as a single journal entry. There's an issue with the size of an msync (2 MB with the default journal, at least 16 MB with a 3 GB journal). There are also issues with isolation in multi-threaded code, and with memory pressure.</p>
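<p>The redo-log idea can be sketched as a toy in-memory model: updates go into a checksummed journal entry first, and recovery replays only entries whose checksum still matches, so a torn entry leaves the file untouched. This is not the authors' file system code; the entry layout, the use of CRC32 and all function names are assumptions for illustration.</p>

```python
import zlib


def _checksum(updates):
    # Checksum over a canonical rendering of the page updates.
    return zlib.crc32(repr(sorted(updates.items())).encode())


def make_entry(updates):
    """Build one journal entry holding all page updates of a transaction."""
    return {"updates": dict(updates), "checksum": _checksum(updates)}


def entry_is_valid(entry):
    return _checksum(entry["updates"]) == entry["checksum"]


def recover(file_pages, journal):
    """Replay complete entries in order; stop at the first torn entry."""
    pages = dict(file_pages)
    for entry in journal:
        if not entry_is_valid(entry):
            break  # transaction never became durable: all or nothing
        pages.update(entry["updates"])
    return pages


file_pages = {0: b"old0", 1: b"old1"}
good = make_entry({0: b"new0", 1: b"new1"})
torn = make_entry({0: b"x", 1: b"y"})
torn["updates"][1] = b"partial"  # simulate a crash in the middle of the write
print(recover(file_pages, [good, torn]))
```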
<p>&nbsp;</p>
<p><strong>Composing OS extensions safely and efficiently with Bascule </strong></p>
<p>Andrew Baumann (Microsoft Research), Dongyoon Lee (University of Michigan), Pedro Fonseca (MPI Software Systems), and Jacob R. Lorch, Barry Bond, Reuben Olinsky, and Galen C. Hunt (Microsoft Research)</p>
<p>Ionel:</p>
<p>- Extensions:<br />
- change the runtime behaviour of an app/OS.<br />
- must be safe to admit in the system even if buggy/insecure<br />
- composable at runtime<br />
- In today's software stack there are limited opportunities/options for adding extensions. No "thin waist" in the stack =&gt; Goal: to introduce one.<br />
- Bascule: libos - full os personality as user-mode library.<br />
- Narrow binary interface of primitive OS abstractions sits between LibOS and the host OS.<br />
- Extensions loaded in-process interpose on the Bascule ABI.<br />
- Drawbridge: provides secure isolation of existing apps via picoprocesses and the Windows LibOS.<br />
- Why not support extensibility by modifying the LibOS?<br />
- may not have the source code<br />
- may not be amenable to customization. OSes are not static. They constantly receive updates and patches.<br />
- ABI - stateless and with fixed semantics.<br />
- Two Guest LibOS implementations: Derived from Drawbridge (Windows 8) and Linux 3.2 proof of concept. Demonstrates OS-independent ABI.<br />
- Host implementations: Windows 8 and on Barrelfish.<br />
- Checkpointer extension<br />
- adds migration/ft to unmodified apps and LibOSes.<br />
- track state at runtime (writable/modified VM allocations), outstanding I/O, open streams.<br />
- at checkpoint: cancels pending I/O and ABI calls, opens a file and serializes all the state to it.<br />
- Evaluation:<br />
- Runtime overhead: the base cost is 86 cycles (negligible vs syscall)<br />
- Memory footprint: few KB per thread</p>
<p>Q: What's the configuration story? How do you compose?<br />
A: When you want to start an app, you say I want to run this app and then the package dependency tells you on which packages your app depends.</p>
<p>Q: OSes don't usually start with 1000 calls. What will stop your ABI from growing to that?<br />
A: Two things: careful design and the set of things exposed in the ABI is a set of general things. One solution can be to provide an extension that builds on the current ABI.</p>
<p>Natacha Crooks - Extensions: change the runtime behaviour of an application/OS, developed by a third party, but applied by the end user or system integrator. An extension should be safe to admit, even if buggy/insecure. This is difficult to achieve because in today's software stack there are limited places where one could interpose extensions. The syscall ABI, for example, has hundreds to thousands of calls, and very tight coupling with the OS kernel implementation. A popular approach today is to run on top of a virtual machine, so there is a virtual hardware interface. There are many levels of indirection between top and bottom. There's no useful thin waist in this stack. The goal is to introduce one.</p>
<p>Bascule uses a library OS: a user-mode library that implements the full OS personality. The interface between the LibOS and the rest is an in-process interface. It's a narrow binary interface of primitive OS abstractions. Because the interface is in-process, it can support extensions that are loaded in-process and interpose on the ABI; they are safe and efficient (because the interposition happens in the same process). One way of viewing this is to add an extension mechanism to Drawbridge.</p>
<p>The LibOS is user-mode code, so why not simply modify it? One may not have the source, or may need to apply security updates and patches from OS vendors (OSes are not static), so if extensions are made by modifying the LibOS, it becomes difficult to apply patches.</p>
<p>The Bascule ABI is a nestable in-process ABI of common OS primitives. The host provides a table of function entry points and a data structure of startup parameters. The guest code, on the other hand, provides a table of upcall entry points for upcalls that come up from the host.</p>
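<p>A rough sketch of what "nestable" interposition on a table of entry points could look like, using plain Python dictionaries of functions. The entry point name <code>stream_write</code>, the tracing extension and everything else here are invented for illustration and are not the actual Bascule ABI; upcalls and startup parameters are left out.</p>

```python
def make_host_abi(log):
    """Hypothetical host: exposes a table of function entry points."""
    def stream_write(name, data):
        log.append(("host_write", name, data))
        return len(data)
    return {"stream_write": stream_write}


def tracing_extension(lower_abi, trace):
    """An extension presents the same table shape to the layer above it
    and forwards to the layer below, so extensions can be stacked."""
    upper = dict(lower_abi)  # pass through every call it does not interpose on

    def stream_write(name, data):
        trace.append((name, len(data)))
        return lower_abi["stream_write"](name, data)

    upper["stream_write"] = stream_write
    return upper


def guest_main(abi):
    # The guest (LibOS plus app) only ever sees "an ABI table".
    return abi["stream_write"]("file:out.txt", b"hello")


log, trace = [], []
abi = tracing_extension(tracing_extension(make_host_abi(log), trace), trace)
print(guest_main(abi), trace, log)
```

<p>Because each layer consumes and produces the same table shape, the guest cannot tell whether it talks to the host directly or through a stack of extensions.</p>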
<p>The ABI provides threads and synchronisation, virtual memory management, an I/O stream abstraction and exception handling. Tricky aspects include the shared address space, given there's no protection. Their solution is to have extension locations fixed at startup, and extensions must allocate within a region. There are also challenges with stack use across ABI calls, nested exception handling, and thread-local storage on x86.</p>
<p>There are two guest LibOS implementations: Windows 8, derived from Drawbridge, and Linux 3.2.</p>
<p>What can extensions do? The Bascule ABI is suitable for extensions that monitor or modify execution, interpose on file/network I/O, or require control over application state. It's less suitable for applications that require tight coupling with the host or guest. The example given is a checkpointer extension, which adds migration and fault tolerance to unmodified apps and LibOSes. Another example is an architecture adaptation extension.</p>
<p>Claimed contributions: Bascule, a new thin-waist OS ABI, where extensions are loaded at runtime by the end user, and which guarantees safety. It avoids modifications to the LibOS while enabling the extension store model.</p>
<p>&nbsp;</p>
<h2>Session 3: Miscellaneous</h2>
<p><strong><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Mohan.pdf">Prefetching Mobile Ads: Can advertising systems afford it?</a></strong></p>
<p>Prashanth Mohan (UC Berkeley) and Suman Nath and Oriana Riva (Microsoft Research)</p>
<p>Valentin:</p>
<p>Most apps are free, and most free apps use ads. Ads consume 23% of the total app energy, 95% of which is from communication (downloading the ad). This is due to the tail energy problem: the radio is woken up and stays up after downloading the ad.</p>
<p>Ad prefetching challenge:<br />
-Every time an app requests an ad, there is an online auction at the ad exchange; this cannot be run in advance, so the infrastructure has to change.<br />
-Ads have deadlines (e.g. bid price changes)<br />
-Not all downloaded ads may be shown, which violates the SLA</p>
<p>Solutions:<br />
=&gt; Assume minimal change in the infrastructure<br />
=&gt; Look for a reasonable deadline: 30 min -&gt; 0.5% of ads change price<br />
=&gt; Explore the tradeoff in the 3D space: energy, SLA violation, revenue loss (discussed for the rest of the talk)</p>
<p>This depends on the predictability of ad demand. They measure this as the entropy, which turns out to be high (not very predictable). But an hour by hour model of users turns out to work well.</p>
<p>Actual system:<br />
Mobile client talks to a proxy (asks for a prefetch). The proxy guesses the number of ads that should be prefetched and requests that number of slots from the ad network. The result is sent back to the mobile.</p>
<p>They use an overbooking algorithm to evaluate SLA violation. I didn't really understand how that worked (or really what it did), but as far as I understand it turns the measure of SLA violation into a single parameter (the overbooking penalty, called O). The SLA violation is proportional to 1/O.</p>
<p>In the end, they reduce the 3D space above to two dimensions: O and a prefetching aggressiveness k. O spans the SLA violation &amp; revenue loss space, and k spans the energy &amp; revenue loss space (I think).</p>
<p>&nbsp;</p>
<p><strong>Maygh: Building a CDN from client web browsers</strong></p>
<p>Liang Zhang, Fangfei Zhou, Alan Mislove, and Ravi Sundaram (Northeastern University)</p>
<p>&nbsp;</p>
<p>Valentin:</p>
<p>Current solutions to distribute content on the web today:<br />
-Subscription<br />
-Ads</p>
<p>Properties of large websites:<br />
-Many users<br />
-Same content viewed by many users<br />
-Content is very static</p>
<p>=&gt; Recruit web clients to distribute the content<br />
Current solution to do this:<br />
-Browser plugin<br />
-Client side software<br />
But users have no interest in doing this, so their question is: can we build a system that does not require the user to install extra software?</p>
<p>They present Maygh to solve this.<br />
-Serves as a distributed cache<br />
-Content (image, CSS, javascript...) is named by content hash<br />
-The users run a JavaScript program that caches content</p>
<p>To make this work, they use a Maygh coordinator: the coordinator looks for who online has the content and points to an online user that is close and has it. Clients use the RTMFP or WebRTC protocols to communicate with the coordinator.</p>
<p>The coordinator:<br />
-Serves as a directory for content<br />
-Allows browsers to communicate together</p>
<p>Security issues:<br />
-Can users serve forged content? No, as we use content-hash names.<br />
-Can users violate the protocol (claim to have content, DDoS)? Use the same techniques as are used today for these issues, e.g. block IPs.</p>
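<p>A small sketch of why content-hash naming blocks forged content: the requester recomputes the hash of whatever a peer sends and only accepts it if it matches the name it asked for. SHA-256 and the function names are assumptions here; the talk did not say which hash function Maygh uses.</p>

```python
import hashlib


def content_name(data):
    """Name a piece of static content by the hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


def accept_from_peer(expected_name, data):
    """Return the bytes if they match the requested name, else None
    (in which case the client would fall back to the origin site)."""
    return data if content_name(data) == expected_name else None


name = content_name(b"logo bytes")
print(accept_from_peer(name, b"logo bytes") is not None)
print(accept_from_peer(name, b"forged bytes") is None)
```

<p>The same property answers the "what if the content has changed" question: changed bytes hash to a different name, so a stale copy is never served under the new name.</p>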
<p>The one issue the authors don't quite solve (and are very open about it) is the privacy question: can I know what other people have been watching? Their answer is that users can disable Maygh for certain content. This is weird, because from the questions at the end it seemed the user didn't really have a say: it was up to the website. So not sure about this.</p>
<p>Evaluation: Additional latency in the very worst case is 1.6s; it is usually around 500ms. WebRTC had significantly better performance than RTMFP (but only works on Chrome and Firefox). To estimate the bandwidth saved they use simulation, as Maygh is hard to actually deploy. They show pretty impressive results: a 75% reduction in bandwidth.</p>
<p>Q: What if the content has changed?<br />
A: It will have a different name since it is a content hash.</p>
<p>Q: Since browsers are not always online, do you need to use replication between browsers to guarantee good performance? What is a good replication number?<br />
A: We only cache, we do not replicate by prefetching. Geographical locality of interest also means that if people around don't have it, it's probably best to download it from the content provider.</p>
<p>Q: How much storage does Maygh use?<br />
A: Maygh uses some space on the Web browser which is usually capped at 5MB</p>
<p>Q: What if clients want to opt out?<br />
A: Up to the website. For example can accept at the cost of ads.</p>
<p>Q: Scaling?<br />
A: Can have multiple coordinators which communicate in clever ways. It scales.</p>
<p>Natacha Crooks -<br />
New way to distribute content on the web. The web today is fundamentally client-server and distributes content in that way. Three options for content distribution: serve your own, pay content distribution networks, or rent cloud services. But this imposes a significant monetary burden on the web site operator. Current options to support this financial burden are user subscriptions or advertising.</p>
<p>The idea is for the clients to help distribute the content. Why? Because typical properties of websites are many users, the same content viewed by many users, and content that is largely static. So the idea is to recruit those web clients who see the same data to help serve content. Their motivation is to build a system which requires no additional software.</p>
<p>The goal is to build a CDN that serves as a cache for static content and works with today's web browsers. It does not require any additional changes on the client side. They do that by using recent HTML5 browser features. So it can reduce the bandwidth cost for the site operator.</p>
<p>Maygh serves as a distributed cache and always assumes content will be available from the origin. Content must be named by content hash. The key challenge is that browsers are not designed to communicate directly, so they need to find a way to get clients to communicate with each other. They use two protocols, RTMFP or WebRTC, which are two peer-to-peer protocols for Web browsers. Maygh introduces a coordinator, which is run by the site operator. First the user requests the root HTML from the site, but when the user fetches content, it will first send a request to the coordinator, which will look for any other online user who has the content; the client will then request the content from that user rather than from the site.</p>
<p>The coordinator serves two purposes: it serves as a directory for content, keeping track of the content in users' browsers, and it allows browsers to establish direct connections (via RTMFP / WebRTC). On the client side, browsers use RTMFP / WebRTC to communicate with the coordinator. This allows bidirectional communication. An online client is always connected to the coordinator. Web site operators need to include the Maygh JavaScript and make a small change to how content is loaded in their code.</p>
<p>Security aspects: it is possible to detect forged content using the content hash, which prevents users from serving false content. Attacks remain a risk, such as claiming to have content, or DoS. Current security techniques work as usual. With regards to privacy, content is secured by its hash: naming content implies access.</p>
<p>Claimed contributions: there is a substantial monetary burden to hosting a popular Web site, and site operators resort to advertising to pay the bills. Instead, one can recruit web clients to help distribute content. Shows that using Maygh can give a significant bandwidth reduction.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2013/04/16/live-blog-from-eurosys-2013-day-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Liveblog from Eurosys 2013 &#8211; Day 1</title>
		<link>http://www.syslog.cl.cam.ac.uk/2013/04/15/liveblog-from-eurosys-2013-day-1/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2013/04/15/liveblog-from-eurosys-2013-day-1/#comments</comments>
		<pubDate>Mon, 15 Apr 2013 06:38:28 +0000</pubDate>
		<dc:creator>Natacha Crooks</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Research Agenda]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1101</guid>
		<description><![CDATA[Hi from all of us here in Prague -- this is day 1 of Eurosys and we'll be running the live blog as usual! Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms). Introduction 143 papers submitted (after removing ones that violated submission guidelines), accepted 28, in line with the [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft" style="padding: 5px;" alt="EuroSys Logo" src="http://eurosys2013.tudos.org/wp-content/themes/eurosys/images/supporters/Eurosys_logo.png" width="160" height="45" />Hi from all of us here in Prague -- this is day 1 of Eurosys and we'll be running the live blog as usual!</p>
<p>Your friendly bloggers are <a href="http://www.cl.cam.ac.uk/~nscc2/">Natacha Crooks</a> (nscc), <a href="http://www.cl.cam.ac.uk/~icg27">Ionel Gog</a> (icg), Valentin Dalibard (vd) and <a href="http://www.cl.cam.ac.uk/~ms705/">Malte Schwarzkopf</a> (ms).</p>
<p><span id="more-1101"></span></p>
<h2>Introduction</h2>
<p>143 papers submitted (after removing ones that violated submission guidelines), accepted 28, in line with the ~15% acceptance rate that EuroSys has had for a few years. Work published at EuroSys continues to be well-cited, competitive with SOSP/OSDI. Heavy/light PC this year (26/14 members), with an emphasis on diversity in topics, gender and geography. 3 reviews in first round;  76 papers got to second round (ca. 50%), and got another 4 reviews, then rebuttal phase. All second-round papers discussed in PC meeting, with discussion lead being someone who had not read the paper, and at least 5 attendees had read it.</p>
<p>Geographic diversity of accepted papers: 21 from North America, 4 from Europe, 2 from Asia and 1 from Australia (poorer balance than previous years). Best paper awards will be done at the <strong>end</strong> this time, allowing the PC to take into account reactions at the conference. <em>(ms)</em></p>
<h2>Session 1: Large scale distributed computation I</h2>
<p><strong><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/1-Qian.pdf">TimeStream: Reliable Stream Computation in the Cloud</a></strong></p>
<p><em>Zhengping Qian (Microsoft Research Asia), Yong He (South China University of Technology), Chunzhi Su, Zhuojie Wu, and Hongyu Zhu (Shanghai Jiaotong University), </em><em>Taizhi Zhang (Peking University), Lidong Zhou (Microsoft Research Asia), Yuan Yu (Microsoft Research Silicon Valley), and Zheng Zhang (Microsoft Research Asia)</em></p>
<p>Natacha Crooks - This is a system designed for reliable computation on big streaming data in a cloud environment. Key motivating examples are: 1) network infrastructure monitoring in a datacentre. Software agents in a datacentre collect various performance counters, and need to support queries in real time, such as a real-time heat map of latency between various racks. So you need a system which can do near-real-time computing, which must also be scalable and easy to program.</p>
<p>The second scenario relates to online search ads. If you input a query in a search engine, a number of ads will be displayed. There's a complex model that matches the query to the ads. They are trying to improve the ad matching model using the history of past user queries/clicks. This has to be done in real time (or near real time). The system must also be reliable, as failure causes loss of money. It must be highly resilient to load variance and scale dynamically. There's also the requirement that it should be easy to program. As a summary, the key challenges for scalable near-real-time computing are scalability and reliability.<br />
Traditional systems stem from the database community:<br />
- Streaming database (StreamInsight), but it is implemented at too small a scale. E.g. it is not possible to use simple node replication<br />
- TimeStream: declarative, has scalability, fault tolerance, elasticity, preserves the single-node model but transparently ships computation to the cloud.</p>
<p>Motivating example: continuous word count over an infinite stream of tweets. Important to note that grouping on an infinite stream is different from MapReduce: one needs to introduce window operations to enable aggregation. TimeStream introduces hash partitions to express how the data can be partitioned, so that it knows how to run this in parallel. TimeStream allows you to dynamically change the number of partitions to respond to load changes in the future.<br />
TimeStream tries to automatically ship the sequential query written by the client to the cloud to run reliably.</p>
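<p>To make the word count example concrete, here is a sketch in plain Python rather than TimeStream's actual declarative interface, which I did not capture. It assumes tumbling (non-overlapping) windows and a CRC32-based hash partition; both are my assumptions. The point it shows is that, because a word always lands in the same partition, the merged result does not depend on the number of partitions.</p>

```python
from collections import Counter
import zlib


def hash_partition(word, num_partitions):
    # Deterministic hash so the same word always lands in the same partition.
    return zlib.crc32(word.encode()) % num_partitions


def windowed_word_count(tweets, window_size, num_partitions):
    """tweets: iterable of (timestamp, text) pairs.
    Returns {window_start: Counter of words}, using tumbling windows."""
    partitions = [dict() for _ in range(num_partitions)]
    for ts, text in tweets:
        window = ts - ts % window_size
        for word in text.split():
            part = partitions[hash_partition(word, num_partitions)]
            part.setdefault(window, Counter())[word] += 1
    # Merge the per-partition results, as a final aggregation stage would.
    merged = {}
    for part in partitions:
        for window, counts in part.items():
            merged.setdefault(window, Counter()).update(counts)
    return merged


tweets = [(0, "a b a"), (5, "b"), (12, "a")]
print(windowed_word_count(tweets, window_size=10, num_partitions=4))
```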
<p>In the distributed runtime, they model computation using a dataflow graph. Each vertex is a deterministic computation, which could be stateful. Each edge is an ordered data channel. The data are just opaque values to the runtime. When the data arrives at the vertex it triggers some computation, and may produce output. The whole execution is deterministic.</p>
<p>Failures are handled by substituting a vertex with an identical computation. If a channel is overloaded, they replace the whole subgraph with an equivalent computation, which may include more partitions (substitution for dynamic partitioning). Substitution is used as a mechanism for both failure recovery and dynamic partitioning. This may cause loss of state, but TimeStream does resilient substitution, which means applying equivalent substitutions to sub-DAGs at runtime. As long as the input is the same, the system will guarantee that the output is not affected (again, by relying on the fact that the system guarantees determinism).<br />
After substitution they do stabilisation, which guarantees no loss and no duplication.</p>
<p>TimeStream does dependency tracking in order to do efficient substitution. For each output, they track the portion of the input that the output depends on. Output-input data dependency is often bounded. They ensure that dependency tracking is lightweight and fine-grained. The process of dependency tracking is hidden from users in the operator library, so users need not worry about it.</p>
<p>Evaluation: Abacus counting for Ads Click Prediction</p>
<p>Requires multiple nodes to compute an expensive filter computation in parallel, so they use hash partitions for the filter operation. This is to get rid of fraudulent queries (bot detection). They group a set of events as a single event in the computation, so they first try to understand how group size affects computation performance. With larger batch sizes, the latency goes up. With smaller batch sizes, the latency is high as well (reveals dependency checking overhead). For throughput, larger batch sizes are better.<br />
They show comparable performance to Storm but with stronger guarantees. When a failure occurs, Storm results are incorrect until the lost state has completely moved out of the aggregation operators. Not so in TimeStream. They estimate dependency tracking to represent approximately 10 percent overhead.</p>
<p>As conclusion, the key contribution is resilient substitution, which they argue is a unified mechanism to support fault tolerance and dynamic partitioning.</p>
<p>Questions<br />
Q: Since computations are represented as a graph, have you thought of using Trinity (Microsoft) instead of your own?</p>
<p>A: Main goal is reliable stream processing. Trinity is mostly focussed on performance, but doesn't do anything to improve fault tolerance.</p>
<p>Q Peter Pietzuch: Did you consider streaming algorithms for which the output depends on the entire state of the stream?<br />
A: Dependencies are guidance for when to start recomputing. When there is a need to recompute because such an operator fails, then the checkpointing mechanism will</p>
<p>Q (?): You showed scalability up to 16 nodes; at what scale do you run the system in production?</p>
<p>A: Currently push up to 32 nodes because this is what our particular scenario requires. Haven't tried to run with more nodes. Currently working on setting</p>
<p>Q Malte Schwarzkopf: What's the bottleneck that requires you to scale to multiple nodes rather than 32 cores on a single node?<br />
A: No particular reason. Computation needs to hold a large window, so the state doesn't fit in the memory of one node.</p>
<p>Ionel Gog-</p>
<p>- Motivation:<br />
- 1st scenario: Software agents in the DC that store performance counters. Hence, they are interested in a system that can query this data in real time.<br />
- 2nd scenario: Microsoft's advertising team: When a user enters a query we also want to display some ads. They want to improve the ads based on the previous user behaviour. Hence, looking for a system that can do near real-time queries.<br />
- Challenges: near real-time computing, scalability, reliability (fault tolerance, resiliency to load changes).<br />
- Extends Microsoft's StreamInsight by introducing a new HashPartition.<br />
- Model the computation using a dataflow graph. Each vertex is a deterministic computation that can be stateful. The whole execution is deterministic.<br />
- When a failure occurs they substitute the vertex with an identical vertex. If there is a channel overload then they replace the channel with another graph that partitions more.<br />
- Resilient substitution = change vertices to equivalent subdags. Equivalent means that they're conducting the same computation.<br />
- Observation: in streaming computation the output-input data dependency is often bounded. (dependency tracking).<br />
- Lightweight, fine-grained dependency tracking is done at runtime to obtain minimized substitution-induced recomputation.<br />
- Groups single events into batches, which are then fed into the system.<br />
- Claims that it scales linearly even though evaluation was only conducted on at most 16 nodes.</p>
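<p>To illustrate the bounded output-input dependency observation, here is a minimal Python sketch. It is not TimeStream's implementation: the sliding-window sum and the names <code>dependencies</code>, <code>windowed_sums</code> and <code>recompute</code> are assumptions chosen for illustration. Because each output depends only on the last <code>window</code> input batches, a lost output can be rebuilt by replaying just those batches.</p>

```python
def dependencies(output_index, window):
    """Indices of the input batches that output `output_index` depends on."""
    first = max(0, output_index - window + 1)
    return list(range(first, output_index + 1))


def windowed_sums(batches, window):
    """Deterministic sliding-window sum: one output per input batch."""
    return [sum(sum(batches[i]) for i in dependencies(k, window))
            for k in range(len(batches))]


def recompute(batches, output_index, window):
    """Rebuild a single lost output by replaying only its dependencies."""
    return sum(sum(batches[i]) for i in dependencies(output_index, window))


# Hypothetical input: four batches of events, window of two batches.
batches = [[1, 2], [3], [4, 5], [6]]
outputs = windowed_sums(batches, window=2)
```

<p>With a window of two batches, recovering the last output only touches batches 2 and 3, whatever the length of the stream so far.</p>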
<p>Q: Have you thought of using Microsoft's Trinity?<br />
A: I don't know how it handles failures. I think they just focus on performance.</p>
<p>Q(Peter Pietzuch): Did you consider streaming algorithms where the computation depends on the entire history of input?<br />
A: Yes. The dependency tracking is just a guidance. However, checkpointing would be used in your case.</p>
<p>Q(Malte Schwarzkopf): What is the bottleneck that requires you to scale to multiple machines as opposed to just using a powerful machine?<br />
A: Sometimes the computation needs to hold a large window.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><strong><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/15-Ke.pdf">Optimus: A Dynamic Rewriting Framework for Execution Plans of Data-Parallel Computation</a></strong></p>
<p><em>Qifa Ke, Michael Isard, and Yuan Yu (Microsoft Research Silicon Valley)</em></p>
<p>Natacha Crooks -<br />
Distributed execution plans generated by a query compiler are used. A distributed execution plan is represented as a DAG capturing the computation and dataflow of a data-parallel program. Each box is a vertex representing a computation. The data flow is represented by edges.<br />
First key problem (MapReduce): data partitioning, which is a basic operation to achieve data parallelism. But this has two issues: data skew (e.g. popular keys) and the number of partitions, where there's a tradeoff between better load balancing and more I/O overhead. Striking a good balance requires statistics of the reducers. But these are not available at compile time, which suggests the need for some form of dynamic partitioning at runtime.<br />
Second key problem: matrix computation. The key challenge is that sparse matrices may produce dense intermediate matrices. But this is only known at runtime, not at compile time, so there's a need to change/adapt algorithms at runtime.<br />
Third key problem: iterative computation. The key problem is that the stopping condition isn't known.<br />
Fourth key problem: fault tolerance. Data is usually replicated, but intermediate data isn't because replication is very expensive, so re-execution is chosen instead. But there are two issues: 1) if a vertex is compute-intensive, it is very expensive to re-execute; 2) critical chains: a long chain of vertices resides on the same machine due to data locality, so when trying to re-execute a vertex, one finds that many of its inputs are also lost because they were on the same computer, which creates a long chain of re-executions. The key challenge here is how to identify and protect important intermediate results at runtime, and replicate those.<br />
Fifth problem: currently query optimisation is done at compile time using data statistics available at compile time, but they want to be able to do this at runtime as well. Thought: aren't there already many systems which do some version of this (FlumeJava, Nectar amongst others)?<br />
They present Optimus: it dynamically rewrites the EPG based on data statistics collected at runtime and the compute resources available at runtime. The key goal is for it to be extensible. It collects statistics at the data plane, and sends a message to the graph rewriter, which rewrites queries dynamically.<br />
The system is built on top of DryadLINQ. They add a rewriter module to the Dryad Job Manager. They provide user-defined rewrite logic and user-defined statistics. The query compiler was extended to be able to generate code for user-defined rewrite logic and user-defined statistics. This is shipped to the rewrite logic in the Job Manager, which will rewrite the graph accordingly.<br />
They minimise the overhead of collecting data statistics by piggybacking collection onto existing vertices. This is extensible, as the statistics estimator is defined at the language layer via user-defined functions.<br />
This is all done at the data plane, which avoids saturating the control plane.</p>
<p>The graph rewriting module is specified as a set of primitives to query the EPG and modify the graph. The rewriting operation depends on the state of the vertex. If the vertex is inactive, then everything can be rewritten. If it is running, then it is killed and its partial results are lost. If the vertex has already been executed, then Optimus only allows redirection of its I/O (inputs/outputs).<br />
Dynamic data (co-)partitioning. Co-partitioning means using a common parameter set to partition multiple data sets; it is used by operators which take multiple streams as input.<br />
Co-range partitioning can be used to prepare the data for joins. But skew may only be detected at runtime, and the graph rewriter module makes it possible to recover from that.<br />
For matrix multiplication, there are multiple ways to achieve the result. There are different ways to partition the matrices and, for each, different algorithms to compute the result. For Optimus, the extensibility allows integrating matrix computation by using a library. The library makes several runtime decisions: data partitioning (how to subdivide the matrices), data model (sparse or dense), and finally how to choose the right algorithm for the matrix operator.<br />
They add a reliability enhancer for fault tolerance. They use a replication graph to protect important data generated by a given node.</p>
<p>Evaluation:<br />
Optimus has significantly lower computation time because it has excellent cluster utilisation. Because of data skew, the others don't have nearly as high a utilisation.<br />
For matrix multiplication, they use movie recommendation by collaborative filtering on the Netflix Challenge data. Optimus can choose the best way to compute the matrices and can change it dynamically at runtime. It outperforms MetaLynq.</p>
<p>Key contributions argued: flexible/extensible framework to modify EPG at runtime. Enable runtime optimisations that are difficult to achieve in other systems through rich set of graph rewriters.</p>
<p>Questions:<br />
Q (?): In the previous talk we heard about a mechanism that also did substitution. What's the link between the two?<br />
A: These are two different systems. DryadLINQ is a batch system, whereas the other does streaming computation.<br />
Q (follow up): But are they not similar?<br />
A: Main contribution is extensibility. The user specifies the rewrite logic.</p>
<p>Ionel Gog -</p>
<p>- optimize Execution Plan Graphs (EPG) at runtime.<br />
- Problem 1: Data partitioning: it's difficult to partition without getting data skew =&gt; we need dynamic data partitioning.<br />
- Problem 2: Matrix computation: at compile time we don't know the density of the intermediate matrices =&gt; dynamically choose data model and alternative algorithms.<br />
- Problem 3: Iterative computation: We don't know the stopping condition.<br />
- Problem 4: Fault tolerance: Intermediate results are expensive to regenerate when lost =&gt; How to identify and protect important intermediate results at runtime?<br />
- Problem 5: EPG Optimization is usually done statically =&gt; Can we do it at runtime?<br />
- Optimus - dynamically rewrite EPG based on data statistics collected at runtime.<br />
- Modules of Optimus:<br />
- Data Statistics Collector:<br />
- piggy-back into existing vertices<br />
- statistics collector defined by the user<br />
- the aggregation of statistics is done in the data plane<br />
- Graph Rewriting Module:<br />
- 3 states (Inactive, running, completed). If inactive all rewriting can be applied, if running then rewriting has to consider intermediate data loss.<br />
- Example of graph rewriting:<br />
1) Dynamic Data Partitioning<br />
- compute histogram at each partition, then merge histograms ...<br />
2) Hybrid Join<br />
- avoid data skew at runtime. Detect a partition that is bigger than the others and divide it into smaller partitions.<br />
3) Matrix Computation<br />
- there are multiple ways of multiplying two matrices. Different ways of partitioning each matrix. Which one we use depends on the statistics of the input matrices.<br />
Reliability enhancer for fault tolerance:<br />
- Add an extra node (on a different machine) to the graph to which the output of a vertex is sent as well.</p>
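<p>The "compute a histogram at each partition, then merge" step for dynamic data partitioning can be sketched as below. This is a minimal Python illustration under my own assumptions, not Optimus code: the function names, the greedy boundary selection and the sample histograms are all hypothetical.</p>

```python
import bisect
from collections import Counter


def merge_histograms(histograms):
    """Merge per-partition key histograms into one global histogram."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


def range_boundaries(histogram, num_partitions):
    """Pick upper-bound keys so each range holds roughly equal record counts."""
    target = sum(histogram.values()) / num_partitions
    boundaries, running = [], 0
    for key in sorted(histogram):
        running += histogram[key]
        full = len(boundaries) + 1 == num_partitions
        if not full and running >= target * (len(boundaries) + 1):
            boundaries.append(key)
    return boundaries


def partition_of(key, boundaries):
    """Index of the range a key falls into (a boundary key stays in the left range)."""
    return bisect.bisect_left(boundaries, key)


# Hypothetical histograms collected at two upstream vertices.
local = [Counter({1: 3, 2: 1, 5: 2}), Counter({2: 2, 3: 1, 8: 3})]
merged = merge_histograms(local)
bounds = range_boundaries(merged, num_partitions=3)
```

<p>Because the boundaries come from observed counts rather than compile-time guesses, a range containing a popular key ends up narrower than the others. A single hot key still cannot be split by range partitioning alone, which is where splitting an oversized partition (as in the hybrid join above) comes in.</p>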
<p><strong><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/29-agarwal.pdf">BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data</a></strong></p>
<p><em>Sameer Agarwal (University of California, Berkeley), Barzan Mozafari (Massachusetts Institute of Technology), Aurojit Panda (University of California, Berkeley), Henry Milner (University of California, Berkeley), Samuel Madden (Massachusetts Institute of Technology), and Ion Stoica (University of California, Berkeley)</em></p>
<p>Natacha Crooks - We are presented with lots of data that must be processed in a timely fashion. Online media websites are a good example: they must make decisions based on that data. Other examples include log processing for finding anomalies in a system, etc. The key problem that they try to address is the need to compute aggregate statistics over massive sets of data. The goal is to support interactive analysis on these data sets.</p>
<p>BlinkDB would like the user to provide a query and a time bound, and the output would be a result with an error bound. If the user isn't happy, they can retry with a larger time bound.<br />
Target workload: exploration is ad hoc. The only assumption they make is that the set of columns that appear in queries is stable over time (i.e. the query schema doesn't change). Traces from Facebook show that this assumption holds: there's a very small set of query templates that most queries map to.<br />
The goal is systems where real-time latency is valued over high accuracy. There is no need for query parameters to be known in advance.</p>
<p>Overview of sampling in databases: one can uniformly sample the data, but this doesn't support certain types of query due to anomalies introduced by the sample rate. One can instead use stratified samples. This adds the requirement that all unique keys are represented in the sample.</p>
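<p>A minimal Python sketch of the stratified sampling idea follows. It is an illustration, not BlinkDB's code: the <code>stratified_sample</code> function, the per-group <code>cap</code> and the example rows are assumptions. A real system would also need to record each group's sampling rate in order to scale aggregates correctly.</p>

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key`, so every value
    appears in the sample even when it is rare."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        if len(members) > cap:
            members = rng.sample(members, cap)
        sample.extend(members)
    return sample


# Hypothetical skewed table: one very popular key and one rare key.
rows = [{"city": "NYC"}] * 1000 + [{"city": "Prague"}] * 3
sample = stratified_sample(rows, "city", cap=10)
```

<p>A 1% uniform sample of the same table would contain about ten rows in total and would most likely miss the rare key entirely, which is the kind of anomaly described above.</p>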
<p>In BlinkDB, an offline sampling module creates an optimal set of samples, which are either cached in memory or placed on disk. When a query comes in, a query plan gets created and the sample selection module then tries to figure out the best possible sample to answer the query. A new query plan gets created accordingly, and this is executed on a parallel cluster computing framework. They currently support Hadoop and Spark. The result (with an error bound) is then returned. There are two key sets of challenges: 1) what should be the optimal set of samples to maintain in order to support ad hoc exploratory queries, and 2) given a query, what should be the optimal sample type and size that can be processed to meet the constraints. They use the concept of query coverage: for each attribute being filtered on, there is a probability that it will be in the stratified sample, and the query coverage is then the sum of these probabilities. For multiple predicates, a query which has one of them in the sample is more likely to include the result than one which has none. So they define a query coverage probability.<br />
Not all stratified samples that are possible in theory have the same cost. The cost of sampling depends on the number of distinct values of the attribute being stratified on. They formulate a MILP to represent this, where they try to maximise coverage for a given cost (the cost of all samples). They also include a notion of a sparsity function to model the density and uniformity of a column (i.e. if it is highly uniform then a uniform sample might be good, but if it has a long tail, then the stratified model is needed). In order to determine the optimal sample size, they define an "error latency profile", which is a relationship between the relative error and the time taken to execute the query, as a function of the sample size. Most functions used in the system have closed forms, which means that the error can be computed from an equation. So they can use linear cost models to determine the "optimal" point (via extrapolation: run queries on a small sample, and use that to extrapolate for larger sample sizes). For each possible sample, they generate an error latency profile, see which one gives the highest accuracy for a particular time bound, and use the sample with the lowest error.</p>
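<p>As an example of what a closed-form error looks like, here is a Python sketch for the simplest case, an average. It assumes the standard confidence interval for a sample mean (central limit theorem, 95% confidence via z = 1.96); the function names are hypothetical and this is not claimed to be the exact formula BlinkDB uses.</p>

```python
import math


def mean_error(stddev, n, z=1.96):
    """Half-width of the confidence interval for an average over n sampled rows."""
    return z * stddev / math.sqrt(n)


def rows_needed(stddev, target_error, z=1.96):
    """Smallest sample size whose estimated error is within target_error."""
    return math.ceil((z * stddev / target_error) ** 2)
```

<p>The standard deviation can be estimated from a small pilot run, after which the error for any larger sample size follows without running the query again: quadrupling the sample size halves the error. Combined with a runtime that grows roughly linearly with sample size, this gives one point per sample size on an error latency profile.</p>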
<p>Key contributions: they argue that approximate queries are an important means to achieve interactivity when processing large datasets. Ad hoc exploratory queries on an optimal set of multi-dimensional stratified samples converge to low error bounds significantly faster than without sampling.</p>
<p>Q (Jakob Eriksson, UIC): You mentioned that you run small queries first; how can you estimate the error for larger samples?<br />
A: We use the closed-form property to estimate the corresponding error (which is a function of the sample size).</p>
<p>Q (?): When processing data, what is the effect that BlinkDB gives you, and how do you tolerate data changes?<br />
A: We don't support data that is updated repeatedly. Sampling is offline, so we would need to refresh our samples very often. We do support bulk loading.</p>
<p>Ionel Gog -</p>
<p>- Need to compute aggregate statistics over large data sets<br />
- it takes 1-2h to process 100TB on 1k machines if data stored on HDDs. 25-30 minutes if data stored in memory.<br />
- propose to have small samples.<br />
- users specify for how long the query should run, and then the system returns a result with an error interval<br />
- 90% queries map to only 20% templates (Facebook)<br />
- 17437 queries map to 108 query templates. 90% map to only 10% of the templates (Conviva)<br />
- Offline-sampling: set of samples across few dimensions<br />
- samples stored in memory or striped over HDDs<br />
- What is the optimal set of samples?<br />
- Create samples for each tuple of columns of a table.<br />
- Query coverage: the guarantee that a sample contains the columns on which a query is trying to filter on.<br />
- Cost of the stratification: not all samples have the same cost.<br />
- Tries to make sure that the cost of all samples is smaller than a user-defined value.<br />
- What is the optimal sample type and size to meet accuracy/response time requirements?<br />
- Decide based on Runtime vs Sample size values computed.</p>
<p>Q: You mentioned you're running these little queries. How can you approximate the error without first running on the entire data?<br />
A: There is a statistical formula that will estimate the error of the specific query we're going to run. (Looks like the speaker expected this question. He had a slide prepared to answer it).</p>
<p>Q: How do you react to dynamic data?<br />
A: We do not support data that is updated frequently because we would have to update our samples frequently.</p>
<p>&nbsp;</p>
<h2>Session 2: Security and Privacy</h2>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schultz.pdf">IFDB: Decentralized Information Flow Control for Databases</a></p>
<p>David Schultz and Barbara Liskov (MIT CSAIL)<br />
Valentin Dalibard - Decentralised information flow control for databases<br />
Information leaks are a problem. So they propose to tackle it.<br />
Information flow control:<br />
-track information as it flows<br />
-control what can be released<br />
IFDB: tracks information in the database (information flow control had been done before, but not in the context of a database)<br />
3 key concepts: principals (the people), tags (the security accesses you have) and labels (sets of tags)<br />
Information can flow from A to B if the label of A is a subset of the label of B.<br />
Declassification removes a tag from the process =&gt; unsafe but necessary in some cases. In IFDB, declassification needs to be explicit.</p>
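<p>The flow rule and declassification can be written down in a few lines. This is a Python sketch with hypothetical names (<code>can_flow</code>, <code>declassify</code>, the <code>alice_location</code> tag), not IFDB's API. Labels are modelled as sets of tags.</p>

```python
def can_flow(source_label, dest_label):
    """Information may flow only to a destination whose label contains every source tag."""
    return frozenset(source_label).issubset(dest_label)


def declassify(label, tag):
    """Explicitly remove one tag from a label.

    A real system would first check that the caller has authority over the tag.
    """
    return frozenset(label) - {tag}


# Hypothetical example: a tuple tagged with Alice's location tag.
tuple_label = {"alice_location"}
public = set()
```

<p>Here <code>can_flow(tuple_label, public)</code> is false, so a process that has read the tuple cannot write to a public output until it explicitly declassifies the tag. Untagged data can flow anywhere.</p>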
<p>Writes: Tuples are written with exactly the label of the process</p>
<p>IFDB features:<br />
-Declassifying views: add WITH DECLASSIFYING to query<br />
-Constraints: uniqueness and foreign keys<br />
-Transactions: can abort, processes can label transaction</p>
<p>Implementation<br />
Based on PostgreSQL and a web server in PHP</p>
<p>Evaluation:<br />
They show that the system is useful by applying it to a location recording database (e.g. where is Alice now? Where did she drive last week?) It does seem neat<br />
They managed to use it to find bugs in existing systems e.g. HotCRP, anyone can download anyone's personal details.</p>
<p>Eval:<br />
The database is not the bottleneck, the web server is. But they were more concerned about the database, so they evaluated it separately.<br />
The database performs really well: a 1% drop in performance with 10 labels (which is more than is realistic) in the worst case.</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Vijayakumar.pdf">Process Firewalls: Protecting Processes During Resource Access</a></p>
<p>Valentin Dalibard - A malicious process can create symbolic links and give them misleading names to fool an application into giving access to information. Basically, a malicious process creates a symbolic link to a secure file (e.g. passwords), then asks a process with higher security (e.g. a web server) to access that file. The web server doesn't notice it is a symbolic link and returns the secure file. The issue is that the access control system gave the web server access to that file, despite the fact that it was actually accessed on behalf of an application that didn't have the right level of security.</p>
<p>What is needed: to know which processes are actually accessing resource (to avoid resources accessible by malicious processes)</p>
<p>Their solution: adding an additional check after access control to check whether this particular system call has access to the resource or not. They call it the process firewall.<br />
To do this, they do process introspection in which they figure out what the process is actually trying to access.<br />
They identify different system calls by using the process context, like its stack, entry point, system call history....</p>
<p>They need to gather context efficiently because it can be a lot of overhead. So they designed some optimisations to:<br />
I)Gather the context (using caches)<br />
II)Check the rules that were designed (by grouping them)</p>
<p>Where do they get their rules? 3 ways:<br />
-Can be manually specified<br />
-Automatically generate rules from known vulnerabilities<br />
-Automatically generate rules from runtime traces (which will have false positives). They mostly use the entry point.</p>
<p>Hayawardh Vijayakumar (The Pennsylvania State University), Joshua Schiffman (Advanced Micro Devices), and Trent Jaeger (The Pennsylvania State University)</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Setty.pdf">Resolving the conflict between generality and plausibility in verified computation</a></p>
<p>Srinath Setty, Benjamin Braun, Victor Vu, and Andrew J. Blumberg (UT Austin), Bryan Parno (Microsoft Research Redmond), and Michael Walfish (UT Austin)</p>
<p>Valentin Dalibard - The authors present Zaatar, a system that implements Probabilistically Checkable Proofs (PCPs). This allows one to check that the computation performed by a server is the one it was asked to do. This comes from recent cryptographic work. The system is not at all usable yet (up to 2 000 000X overhead) but is orders of magnitude faster than the previously proposed solutions. One of the main cryptographic contributions is the ability to batch together many jobs with the same computation but different inputs.</p>
<h2>Session 3: Replication</h2>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Almeida.pdf">ChainReaction: A Causal+ Consistent Datastore based on Chain Replication</a></p>
<p>Sergio Almeida, Joao Leitao, and Luís Rodrigues (INESC-ID, Instituto Superior Tecnico, Universidade Tecnica de Lisboa)</p>
<p>Natacha Crooks -</p>
<p>Large-scale applications are deployed worldwide. These applications have a huge number of users creating a large amount of data, which results in a high system load. They rely on geo-replicated data storage systems. There are conflicting goals: high performance vs consistency. There exist several consistency models and multiple replication mechanisms. They are trying to provide causal+ consistency, which is causality plus convergence. They focus on chain replication. It's a simple replication mechanism that provides linearisability for a single object. When a client issues a put request, the request is forwarded to the head of the chain, which forwards it down the rest of the chain. Once it reaches the tail, the put operation returns to the client. Get requests should always be executed at the tail.</p>
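<p>Classic chain replication as described here can be sketched in a few lines of Python. This is a single-process toy with hypothetical names (<code>Chain</code>, <code>put</code>, <code>get</code>); real chains forward writes over the network and handle node failures, which is omitted.</p>

```python
class Chain:
    """A chain of replicas, each a dict; index 0 is the head, -1 the tail."""

    def __init__(self, length):
        self.nodes = [{} for _ in range(length)]

    def put(self, key, value):
        # The write enters at the head and is forwarded node by node.
        for node in self.nodes:
            node[key] = value
        # Only after the tail has applied it is the client acknowledged.
        return "ack"

    def get(self, key):
        # Reads are always served by the tail, which only holds
        # writes that every node in the chain has already applied.
        return self.nodes[-1].get(key)
```

<p>Since every read goes to the tail, a read can never observe a write that some replica has not yet applied. The price is that the tail serves all reads.</p>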
<p>In their work, they provide a new system that offers causal+ consistency (in a single site and in a geo-replicated scenario). It's a specialised version of chain replication which removes the tail bottleneck by distributing reads, and can relax the consistency requirements (replication factor) if necessary.</p>
<p>The client is a local library that manages metadata. The data servers are organised as one hop DHTs. The client library communicates via an application proxy to the correct dc in the DHT.</p>
<p>There are three key principles: allow writes to return before reaching the tail; support reads on all nodes of the chain; and trade write efficiency for metadata efficiency.</p>
<p>They relax the number of nodes that have to process a write operation, controlled by a parameter K (how far you go down the chain before finishing and replying to the client). This reply is tagged with the id of the last node in the chain, and stored in the client metadata. They also allow read operations to be distributed across all nodes in the chain. But a client can no longer read from just any node in the chain, otherwise it might violate causality. It uses the metadata to determine which node indices contain the most up-to-date data. If the write that was only replicated to K nodes reaches the tail (in between operations), they indicate that the write has become stable and the client can forget all metadata associated with this object. (Own question: is this information broadcast to all clients ever? What are the overheads for high write workloads and low values of K?) They trade metadata efficiency for write efficiency via a stabilisation procedure.</p>
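<p>My reading of the K-node write and metadata-guided read, as a Python sketch. This is a single-client toy with hypothetical names (<code>RelaxedChain</code>, <code>stabilise</code>, the shape of <code>metadata</code>); it ignores versions, multiple clients and failures, and is not ChainReaction's actual protocol.</p>

```python
import random


class RelaxedChain:
    def __init__(self, length, k):
        self.nodes = [{} for _ in range(length)]
        self.k = k
        self.pending = []  # writes applied to the first k nodes only

    def put(self, key, value, metadata):
        # Apply the write to the first k nodes, then return to the client.
        for node in self.nodes[:self.k]:
            node[key] = value
        self.pending.append((key, value))
        # The client remembers the last chain index known to hold its write.
        metadata[key] = self.k - 1

    def stabilise(self, metadata):
        # Background propagation to the tail; the write is then stable
        # and the client can drop the metadata for that key.
        for key, value in self.pending:
            for node in self.nodes[self.k:]:
                node[key] = value
            metadata.pop(key, None)
        self.pending.clear()

    def get(self, key, metadata, rng=random):
        # Any node up to the remembered index is safe to read from;
        # with no metadata for the key, any node in the chain is.
        last_safe = metadata.get(key, len(self.nodes) - 1)
        return self.nodes[rng.randint(0, last_safe)].get(key)
```

<p>Reads spread over a prefix of the chain right after a write and over the whole chain once the write is stable, which is how the tail bottleneck goes away.</p>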
<p>When extending to a geo-replicated solution, they replicate the same architecture in all data centres, so you have as many one-hop DHTs as datacentres. They allow read operations to be processed in a single datacenter. Write operations return when applied to K nodes in a single datacenter and are replicated asynchronously. Conflicting versions are merged using a last-writer-wins conflict resolution strategy.</p>
<p>The system is built by modifying FAWN-KV.</p>
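<p>Last-writer-wins merging can be sketched as follows. The talk only named the strategy, so the timestamps, the datacenter-id tie-break and the <code>lww_merge</code> function are my assumptions, not details of ChainReaction.</p>

```python
def lww_merge(local, remote):
    """Merge two replicas' key -> (timestamp, datacenter_id, value) maps.

    The version with the larger (timestamp, datacenter_id) pair wins, so
    both datacenters converge on the same value whatever the merge order.
    """
    merged = dict(local)
    for key, version in remote.items():
        if key not in merged or version[:2] > merged[key][:2]:
            merged[key] = version
    return merged


# Hypothetical concurrent updates seen by two datacenters.
dc1 = {"x": (5, "dc1", "a"), "y": (1, "dc1", "old")}
dc2 = {"x": (3, "dc2", "b"), "y": (2, "dc2", "new")}
```

<p>The deterministic tie-break matters: without it, two writes carrying the same timestamp could be resolved differently in each datacenter and the replicas would not converge.</p>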
<p>Evaluation:<br />
1) Evaluating the throughput in a single datacentre: for a 50/50 read/write workload, they don't get much of an improvement because they are trying to optimise reads. But as the percentage of reads increases, the throughput improves a lot. For the geo-replicated scenario, throughput (for read/write 50/50) is not as good as other systems, but as soon as the read percentage increases, it outperforms other systems.</p>
<p>Q (?): Is the difference between your system and eventual consistency the new variant of chain replication? What are the differences between you and Dynamo?<br />
A: Dynamo doesn't use chain replication. Chain replication was initially designed to provide linearisability. We leverage the topology of the replicas to do operations in a more efficient way.</p>
<p>Ionel Gog -</p>
<p>- Applications have huge number of users =&gt; high system load.<br />
- There are conflicting goals: high performance vs consistency.<br />
- Write operation - send put request, the request goes through the chain until the tail.<br />
- Read operation - always processed by the tail of the chain =&gt; bottleneck at the tail.<br />
- ChainReaction - provides Causal+ consistency.<br />
- removes the tail-bottleneck by distributing reads.<br />
- relaxes consistency.<br />
- Architecture: - Client library that manages metadata.<br />
- Data servers organized in a one-hop DHT.<br />
- Allows writes to return before reaching the tail of the chain.<br />
- Supports reads on all nodes of the chain. Use information enclosed in the metadata s.t. App Proxy knows where to send the request.<br />
- Writes metadata efficiently.<br />
- To make everything Geo-replicated, we have to extend the metadata. Operations return when applied to K nodes in a DC and in the background the operations are copied to other DCs (async).<br />
- Evaluation: Implemented Optimized version of FAWN-KV and then compare Chain reaction with FAWN-KV and Apache Cassandra<br />
- Runs over 10^6 objects of 1KB size (This looks small to me, it's only 1GB).</p>
<p>Q: I find it very difficult to see the advantages of your system compared with Dynamo. They have something very similar to what you have just described. What are the main differences?<br />
A: Dynamo doesn't do chain replication at its core. Chain replication uses the topology of the replicas to be more efficient.</p>
<p>Q(Roxana Geambasu): Traditional models have a well defined definition based on history and so on. Can you provide a similar model for Causal+?<br />
A: Causal+ is very similar to causal consistency.</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Padilha.pdf">Augustus: Scalable and Robust Storage for Cloud Applications</a></p>
<p><em>Ricardo Padilha and Fernando Pedone (University of Lugano, Switzerland)</em></p>
<p>Natacha Crooks - There is a fundamental conflict between availability and consistency. There is a tension between ACID vs BASE, or NoSQL vs SQL. BASE states that one should have scalability and availability at the cost of strong consistency. ACID: strong consistency simplifies system design. The problem is that the cloud offers unreliable commodity hardware. This means that there's a very broad failure spectrum. Augustus offers a key-value store with short transactions, Byzantine fault tolerance, and strong consistency.</p>
<p>Clients can submit short transactions to partitioned storage. Operations include compare, read, read_range, write, insert and delete. Transactions execute a batch of operations at once (one round of communication). The criteria for execution are: 1) all the locks are acquired, and 2) the compares are successful. Partitions are fault-tolerant replicated state machines, which execute the transactions in the same order. They support both local and global transactions (which affect multiple partitions). Strong serializability is based on no-wait locks. There are key locks (writes are exclusive and reads are shared) and partition locks (inserts and deletes are shared, read_ranges are shared). Lock acquisition is non-blocking.</p>
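<p>The key-lock part of no-wait locking can be sketched like this. The Python below uses hypothetical names (<code>NoWaitLocks</code>, <code>try_acquire</code>) and covers only key locks, not partition locks or the Byzantine protocol around them. A transaction either gets all its locks immediately or is refused; it never waits.</p>

```python
class NoWaitLocks:
    def __init__(self):
        self.readers = {}  # key -> set of transaction ids holding a shared lock
        self.writer = {}   # key -> transaction id holding the exclusive lock

    def try_acquire(self, txn, reads, writes):
        """Grant all locks or none, without blocking. True means txn may execute."""
        # Any key exclusively locked by another transaction is a conflict.
        for key in set(reads) | set(writes):
            holder = self.writer.get(key)
            if holder is not None and holder != txn:
                return False
        # A write also conflicts with other transactions' shared locks.
        for key in writes:
            if self.readers.get(key, set()) - {txn}:
                return False
        for key in reads:
            self.readers.setdefault(key, set()).add(txn)
        for key in writes:
            self.writer[key] = txn
        return True

    def release(self, txn):
        """Drop every lock held by txn (on commit or abort)."""
        for holders in self.readers.values():
            holders.discard(txn)
        for key in [k for k, t in self.writer.items() if t == txn]:
            del self.writer[key]
```

<p>Because a refused transaction holds nothing, there is no waiting and hence no deadlock. The cost is that transactions abort under contention, which fits the contention issue raised in the questions.</p>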
<p>They assume that both clients and replicas can be Byzantine, and that Byzantine nodes cannot subvert the cryptographic primitives.</p>
<p>Clients assemble transactions and submit them to all the partitions that should be involved in the transaction. The protocol relies on the non-Byzantine behaviour of clients. Honest clients mediate recovery, and only when a conflict arises. When an honest client conflicts with an unfinished transaction, this leads to the unfinished transaction being terminated. Local transactions commit immediately. Global read-only transactions commit without signatures, so Byzantine clients can forge read-only commit certificates, and such forgery may cause non-serializable read-only Byzantine transactions. These transactions cannot make honest transactions non-serializable.</p>
<p>They achieve linear scalability with the number of partitions. Read-only multi-partition transactions perform better than updates. They also have an evaluation based on a Retwis port, and on Apache Derby with an Augustus-based SQL storage engine.</p>
<p>They claim that it achieves full scalability for all transaction types. They are currently working on the performance of global update transactions and contention.</p>
<p>Q (Allen Clement): You claim it achieves scalability for all types. How do you mix contention and global read transactions?<br />
A: There is a problem of contention. Need to keep them apart. If you add more partitions to increase the throughput. Compares guarantee that we have a consistent snapshot.</p>
<p>Q (?): Any transaction, especially the compare, can only execute inside of a partition?<br />
A: Yes. A transaction only commits if all the partitions vote to commit. The compares act as a way to guarantee a certain state.</p>
<p>Q (Cheng Li): Do you have any principles to partition data? Do you add redundant data?<br />
A: When the data was generated, there was a 50/50 chance of the person that you were following being on the same partition as you. They choose arbitrary placements and make no assumption about the structure of the social graph.</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Kraska.pdf">MDCC: Multi-Data Center Consistency</a></p>
<p><em> Tim Kraska, Gene Pang, and Michael Franklin (UC Berkeley), Samuel Madden (MIT), and Alan Fekete (University of Sydney) </em></p>
<p>Natacha Crooks -<br />
Big data in the data center is distributed over many machines. This brings you reliability, availability, etc. Even though data is distributed over different machines, entire datacenters can also fail. So we have unreliable datacenters. The solution for that is to geo-replicate. But there is also a high network latency, so it is expensive to communicate.<br />
The objective of the project is to provide strong data consistency across datacenters, with low latency and no data loss. MDCC tries to do read committed without lost updates, multi-record transactions and only one RTT.</p>
<p>MDCC optimises for two key observations on workloads. Either conflicts are rare, where everyone updates their own data, or conflicts commute, where the order of concurrent updates isn't important.</p>
<p>Two-phase commit is a standard method for distributed transactions, which has a prepare phase and a commit phase. The coordinator makes the final decision as to whether to commit or abort, and can commit only when all votes are yes. This means that all nodes must respond for a commit, so any node failure may limit progress. The coordinator still has the power to change its mind, and the decision is stored at the coordinator. So the coordinator is a single point of failure.</p>
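<p>For reference, the two phases look like this in a Python sketch. It is a failure-free, in-process toy with hypothetical names (<code>Participant</code>, <code>two_phase_commit</code>); timeouts, logging and recovery, which are where the problems above arise, are left out.</p>

```python
class Participant:
    def __init__(self, vote):
        self.vote = vote
        self.state = "init"

    def prepare(self):
        self.state = "prepared"
        return self.vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return True if the transaction committed."""
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # The coordinator alone decides: commit only if all votes are yes.
    decision = all(votes)
    # Phase 2 (commit/abort): broadcast the decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

<p>Note that the decision lives only in the coordinator's local variable between the two phases. If the coordinator stops there, every participant is left in the "prepared" state with no way to learn the outcome, which is the single point of failure described above.</p>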
<p>MDCC transactions. MDCC uses a fine-grained Paxos instance per record. The coordinator proposes options (potential updates) to the appropriate Paxos instances. They check with all permutations of updates. They also check for write-write conflicts and violations of integrity constraints. Nodes tag the options as accepted or rejected. The coordinator cannot change the transaction outcome. Once learned, options are never unlearned. The state is stored at the nodes, no longer on the coordinator. It would even be possible for the nodes to talk to each other and bypass the coordinator. Options enable read committed isolation. So MDCC makes a distributed commit decision, stores the distributed commit decision, and can tolerate node failures and delays. By contrast, in two-phase commit node failures and delays can block the protocol.</p>
<p>MDCC optimises for when conflicts are rare. It uses Fast Paxos for transactions (bypassing the master to save a round trip). Clients can propose directly to nodes and bypass the leader. Fast Paxos is sometimes slow because concurrent updates can prevent consensus, so the leader must step in to resolve conflicts, and consensus may take 2 additional message rounds to resolve (taking 3 rounds rather than 2). When conflicts are rare, transactions can bypass the leader, which means that transactions can commit in 1 round trip.</p>
<p>The second optimisation relies on commutative updates. Commutative updates do not need to be totally ordered. In this case, they use Generalised Paxos. Instances agree on command structures, which are sequences of commands, not just a single command (and derive appropriate partial orderings for them). So they can use Fast Paxos (fast commits) without worrying about the order. However, order matters when integrity constraints play a part. MDCC relies on quorum demarcation for this, adjusting the constraints to keep replicas consistent without coordination and quorums.</p>
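<p>A small sketch of the two ideas above, under loud assumptions: updates are modelled as integer deltas on a counter, and the constraint is handled by a plain even split of the slack between replicas (a simplified escrow-style stand-in, not MDCC's actual quorum demarcation rule). All function names are hypothetical.</p>

```python
from itertools import permutations


def apply_all(start, deltas):
    """Apply a sequence of commutative updates (integer deltas)."""
    value = start
    for d in deltas:
        value += d
    return value


def same_result_in_any_order(start, deltas):
    """True if every ordering of the deltas yields the same final value."""
    results = {apply_all(start, order) for order in permutations(deltas)}
    return len(results) == 1


def local_budgets(stock, minimum, replicas):
    """Split the slack above `minimum` evenly across replicas.

    Assumed, simplified rule: each replica may accept decrements up to its
    own budget without asking anyone else, and the global constraint
    stock >= minimum still holds however the replicas interleave.
    """
    slack = stock - minimum
    share = slack // replicas
    return [share] * replicas
```

<p>With a stock of 10, a minimum of 0 and 3 replicas, each replica gets a budget of 3, so even if all three spend their whole budget independently the stock never drops below 1.</p>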
<p>Evaluation.</p>
<p>The write response time CDF shows MDCC performing similarly to quorum writes (K=4) and significantly outperforming Megastore* (because there all transactions in a partition have to be serialized). They provide three different versions of MDCC: with Fast Paxos, with Generalised Paxos, and with commutative updates. At lower levels of conflict, MDCC still performs well. But at high conflict rates, MDCC performs badly.</p>
<p>In conclusion, they claim a new commit protocol based on the observation that conflicts are rare or commute, providing durable commits in 1 round trip.</p>
<h2>Session 4: Concurrency and Parallelism</h2>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Merryfield.pdf"><strong>Conversion: Multi-Version Concurrency Control for Main Memory Segments </strong></a><br />
<em> Timothy Merrifield and Jakob Eriksson (University of Illinois at Chicago) </em></p>
<p>Natacha Crooks - It used to be that performance would scale with each generation of processor. But now you can't necessarily expect faster cores; instead you get more of them. The problem is that multithreaded programming is hard, so you get race conditions, deadlocks and atomicity violations. Some of the proposed solutions include more disciplined shared memory programming models with deterministic concurrency. Most approaches force threads to execute in isolation for a limited period of time, limiting interaction.</p>
<p>Conversion is an implementation of version-controlled memory. Conversion provides kernel support for multi-versioning of shared memory segments and only needed a few small changes to the kernel.</p>
<p>Each segment is backed by a repository. Conversion uses processes as threads, and each thread operates on its own local copy. There is a checkout, which creates a new Conversion segment or maps in an existing one (anonymous or file backed); a commit, which commits your local copy's changes to the repository; and finally an update, which brings the local copy up to date with the repository.</p>
<p>Applications that can work with a slightly out-of-date local copy would get performance benefits from doing that. Concurrent data structures could be read even while they are constantly being updated, e.g. snapshot isolation for long-running reads. And finally, deterministic concurrency systems want isolation between threads, and the isolation is created by having a local copy.</p>
<p>5 steps to build Conversion:<br />
1) Bulk copying.<br />
Have a complete copy of the repository, and then commit the entire segment to memory. This works but is too slow and uses too much memory.<br />
2) Use virtual memory.<br />
The local copy is just a page table, which means that physical memory can be shared between local copies. The repository is now basically a copy of the page table. This is better because memory is shared between threads, but there is still the problem of copying an entire page table.<br />
3) Use dirty lists.<br />
On a copy-on-write page fault, record the page in a dirty list, so that a commit only has to commit the modified pages to the repository. The problem is that when T2 wants to perform an update, it has no way to know what's been modified, so it needs to pull the entire page table.<br />
4) Version list.<br />
The repository becomes a list of dirty lists. On a copy-on-write page fault, update the dirty list; on commit, commit the dirty list into the repository. When T2 performs an update, the update traverses the version list to check what the modifications are since it last called update. But the problem is that this contains redundant entries.<br />
5) Faster updates.<br />
Only keep one version of the page every time, so no more redundant entries.</p>
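<p>To make the version-list design (step 4) concrete, here is a toy user-space model. It is only a sketch under assumptions: pages are dictionary entries rather than real memory, there is no merging or conflict handling, and the class and method names (<code>Repository</code>, <code>LocalCopy</code>, <code>write</code>) are invented for illustration rather than taken from Conversion's kernel interface.</p>

```python
class Repository:
    """Toy repository: a list of versions, each a dict {page number: data}."""

    def __init__(self, pages):
        # Version 0 holds the initial contents of every page.
        self.versions = [dict(pages)]


class LocalCopy:
    """One thread's private view of the segment."""

    def __init__(self, repo):
        # checkout: start from the newest committed state of every page.
        self.repo = repo
        self.pages = {}
        for version in repo.versions:
            self.pages.update(version)
        self.seen = len(repo.versions)
        self.dirty = set()

    def write(self, page, data):
        # Stands in for a copy-on-write fault: remember the page is dirty.
        self.pages[page] = data
        self.dirty.add(page)

    def commit(self):
        # Publish only the dirty pages, as one new entry in the version list.
        self.repo.versions.append({p: self.pages[p] for p in self.dirty})
        self.dirty.clear()

    def update(self):
        # Walk only the versions committed since our last update.
        for version in self.repo.versions[self.seen:]:
            for page, data in version.items():
                if page not in self.dirty:
                    self.pages[page] = data
        self.seen = len(self.repo.versions)
```

<p>If T1 writes page 0 and commits, T2 keeps seeing the old contents of page 0 until it calls <code>update()</code>, which only has to look at the one new version rather than the whole page table.</p>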
<p>The key requirement is that they want to do these operations concurrently. Updates do not acquire any locks. Commits can mostly be done concurrently, as long as they update disjoint sets of pages.</p>
<p>Conversion adds a nanosecond of overhead to a copy-on-write page fault.</p>
<p>Case Study: Dthreads w/ Conversion</p>
<p>Dthreads is a deterministic threading library which avoids race conditions and non-deterministic deadlocks, and guarantees race-free execution. Dthreads processes communicate using shared memory. In the Dthreads memory model, there's no operation analogous to update(). Dthreads' memory model uses fences, which ensure that everyone is "done" with the parallel phase. The token determines the order in which threads can communicate with one another. A lot of time is spent waiting, because you don't want to commit anything to shared memory early: it becomes immediately visible. With Conversion, threads can write to shared memory in parallel with thread execution. If a thread owns the token, it just commits its changes, and they rely on the token to maintain determinism. They argue that this is a simpler model to program with. The key difference is that with Conversion, when you commit to memory, others can't yet see the changes, so there is no need for others to wait; they only see them when the token comes back and they call lock. It has the additional advantage that Conversion is in the kernel. One point to notice is that Conversion spends a lot of time waiting for the token, and more time on commit (because you have this update, and potentially have to merge, which can be slow).</p>
<p>Valentin Dalibard - Instead of getting faster cores, we get more of them. But multithreading is hard.</p>
<p>Conversion is an implementation of version-controlled memory (svn-like), which they built as a Linux kernel module. Each thread operates on its own local copy with functions like: checkout, commit or update.<br />
When doing a commit on an out-of-date copy, an implicit update or merge is done.<br />
Who needs Conversion: applications that can work with a slightly out-of-date local copy.</p>
<p>Design: (5 incremental improvements)<br />
Bulk Copying =&gt; not efficient enough<br />
Paging (finer granularity) =&gt; still not efficient enough<br />
Dirty list =&gt; still not efficient enough<br />
Version list, to only get the changes since the last update instead of copying everything on update =&gt; still not efficient enough<br />
Final optimization to avoid identical versions (I think)</p>
<p>They then do a case study: Dthreads<br />
Dthreads are deterministic threads, e.g. a race condition will happen the same way every time<br />
For the very parallel stuff, Conversion performs just as well.<br />
For the not so parallel stuff with locks, Dthreads on top of conversion are about 25% faster than normal Dthreads</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Nanavati.pdf" > <strong> Whose Cache Line Is It Anyway? Operating System Support for Live Detection and Repair of False Sharing </strong> </a><br />
<em> Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield (University of British Columbia) </em> </p>
<p>Natacha Crooks - Page tables are great because they provide a transparent translation layer between the physical and virtual address spaces. But pages are really large, and the moment you want to do translation at subpage granularity, there's no more hardware support. The objective is to build a byte-granularity, software-only remapping and isolation mechanism; the key point is that they do it in software only. The motivating example used is false sharing.</p>
<p>The architecture looks like a target system with a control VM on top of Xen, but there is no specific reliance on virtualisation; it could be built straight into the hardware.</p>
<p>The objective is dynamic detection and mitigation of false sharing. The problem is that it is difficult to avoid threads writing to the same cache lines. One of the reasons is that with C structures, malloc (the allocator) may add metadata, so objects end up straddling cache lines unintentionally. Currently there are two techniques to limit false sharing: modify access locations and modify access frequency.</p>
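<p>As a rough illustration of what "writing to the same cache line" means, the sketch below takes a memory layout and reports fields that are written by different threads yet land on a common cache line. The 64-byte line size, the tuple layout and the function names are assumptions for the example; this is a static check on a made-up layout, not the runtime detection described in the talk.</p>

```python
CACHE_LINE = 64  # bytes; an assumed, typical line size


def cache_lines(offset, size):
    """Set of cache-line indices touched by bytes [offset, offset + size)."""
    first = offset // CACHE_LINE
    last = (offset + size - 1) // CACHE_LINE
    return set(range(first, last + 1))


def falsely_shared(fields):
    """Pairs of fields written by different threads that share a cache line.

    `fields` is a list of (name, offset, size, writer_thread) tuples with
    non-overlapping byte ranges.
    """
    pairs = []
    for i, (n1, o1, s1, t1) in enumerate(fields):
        for n2, o2, s2, t2 in fields[i + 1:]:
            shared = cache_lines(o1, s1).intersection(cache_lines(o2, s2))
            if t1 != t2 and shared:
                pairs.append((n1, n2))
    return pairs
```

<p>Two 8-byte counters at offsets 0 and 8, written by different threads, are reported as a falsely shared pair, while a third counter at offset 64 is not. An 8-byte object at offset 60 straddles lines 0 and 1, which is the kind of accident allocator metadata can cause.</p>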
<p>The idea is to split the page up into an isolated page (which contains the contended regions). And the remainder of the page is mapped to an underlay page. How do you know which regions are contended?</p>
<p>Unless false sharing happens at a really high frequency, it tends not to affect performance. This allows them to use a low-frequency, sampling-based approach: start off with something which is really fast but imprecise, then narrow the focus as they go along. The first stage is performance counters. They say whether there is any contention in the system, no more detail than that. Once that is known, they need to infer which regions/pages of memory are contended, so they log page reads for that. Then finally, using instruction emulation, they find which byte ranges are being accessed. Once this is done, they analyse the log and determine whether there is any contention.</p>
<p>Use fault driven redirection. Use data faults as a trigger for everything that they do. Catch all access via data paths. Avoid code trampolines, and amortise page fault cost.</p>
<p>Can optimistically apply remapping when possible. Copy the isolated pages and underlay pages back to the original page.</p>
<p>Evaluation demonstrates that overhead is low for systems which have little false sharing.</p>
<p>Claimed contributions: low overhead runtime detection, and byte level file sharing.</p>
<p>Believe that this system is good for performance optimisations, but still need to do security enhancements as future work.</p>
<p>Questions<br />
Q: What happens in the case where the access is within a loop function?<br />
Can detect that it is a function; there are techniques to find call points, but they haven't implemented that.</p>
<p>Q: What security enhancements are you talking about?<br />
Protect subpage regions. (rest not heard)</p>
<p>Q: How much easier would it be with VMware, where you could do binary rewriting all the time?<br />
Not sure would make a significant difference. (rest not heard).</p>
<p><a href="http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Jeon.pdf"> <strong>Adaptive Parallelism for Web Search  </strong> </a><br />
<em> Myeongjae Jeon (Rice University), Yuxiong He (Microsoft Research), Sameh Elnikety (Microsoft Research), Alan L. Cox and Scott Rixner (Rice University) </em> </p>
<p>Valentin Dalibard - Performance of search queries is dependent on: latency and quality.<br />
To improve latency:<br />
Early termination: only do the query on the highest rank pages since the low rank ones are unlikely to be useful anyway. They also show how queries can be parallelised.<br />
Change execution depending on system load: parallelise under light load, and execute sequentially under heavy load. Bing doesn't do so well with parallelisation; they have a formula to check whether a query should be parallelised or not and by how much (number of parallel threads). That's the adaptive bit. Basically minimise single query execution time + latency impact on waiting queries. They have a nice graph showing how the optimal number of threads decreases with the load.<br />
Overall not very novel, but nice engineering.</p>
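<p>The "minimise single query execution time + latency impact on waiting queries" idea can be sketched as below. The cost model is entirely an assumption made for illustration (an Amdahl-style run time and a penalty proportional to the extra cores held); it is not the formula from the paper, and the function names and default parameters are hypothetical.</p>

```python
def run_time(work_ms, threads, parallel_fraction=0.8):
    """Estimated time of one query; an assumed Amdahl-style model."""
    serial = 1 - parallel_fraction
    return work_ms * (serial + parallel_fraction / threads)


def choose_threads(work_ms, waiting, cores, max_threads):
    """Thread count minimising own run time plus delay to waiting queries."""

    def cost(threads):
        own = run_time(work_ms, threads)
        # Every extra core we hold is a core the queued queries cannot use.
        delay_to_others = waiting * own * (threads - 1) / cores
        return own + delay_to_others

    # min() returns the smallest thread count among equally cheap options.
    return min(range(1, max_threads + 1), key=cost)
```

<p>Under this toy model, a 100 ms query on 8 cores gets 4 threads when nothing is waiting, 3 threads with 3 queries waiting, 2 with 5 waiting, and runs sequentially with 10 waiting, which mirrors the shape of the graph mentioned above.</p>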
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2013/04/15/liveblog-from-eurosys-2013-day-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eurosys 2013 &#8211; Doctoral Workshop</title>
		<link>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-2013-doctoral-workshop/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-2013-doctoral-workshop/#comments</comments>
		<pubDate>Sun, 14 Apr 2013 22:26:26 +0000</pubDate>
		<dc:creator>Ionel Gog</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1094</guid>
		<description><![CDATA[I am at the Doctoral Workshop at EuroSys in Prague today, ahead of the main conference. Below are some notes on the 5-minute presentations in the workshop.  There are no notes on the first quarter of the workshop because I talked about my work as well. Ludi Akue - Online Configuration Checking for Network and [...]]]></description>
				<content:encoded><![CDATA[<p>I am at the Doctoral Workshop at EuroSys in Prague today, ahead of the main conference. Below are some notes on the 5-minute presentations in the workshop.  There are no notes on the first quarter of the workshop because I talked about my work as well.</p>
<p><span id="more-1094"></span></p>
<p>Ludi Akue - Online Configuration Checking for Network and Service Management<br />
- Network management as a sequence of operations (control loop): observe -&gt; decide -&gt; adjust.<br />
- The changes must be autonomous and validated.<br />
- Add runtime validation to the control loop: observe -&gt; decide &lt;-&gt; validation -&gt; adjust (with an arc from observe to validation as well).<br />
- Lots of generic configurations =&gt; validation must be generic, enabled at runtime and flexible (modifications at runtime).<br />
- Designed a higher level specification language (MeCSV metamodel) to design a reference model.<br />
- Reference model - the administrator defines the configuration management of the system (e.g. config structure, state parameters, constraints).<br />
- The reference model is used at runtime by the validator (also known as online checker).<br />
- MeCSV implemented in UML profile and tested in Eclipse MDT.<br />
Q1(Dushyanth): What is the domain? Network config (e.g. switches)?<br />
A: Network &amp; middleware configuration. We are focusing on middleware. Tested the system on Common Information Model (CIM) standard.</p>
<p>Q2(Kim Keeton): What's the form of the constraints? Could you please give some specific examples?<br />
A: We want to avoid bad configurations. We're not working on generating good configurations. Just want to check that the system is reliable.</p>
<p>Q3(Allen Clement): Comments on the presentation. The big thing that was really missing is: Why was this technical work necessary? What's the problem that necessitated this technique? Framing the specific technical problem is important.</p>
<p>--------------------------------------------------------------------------------------<br />
Petr Hosek - MX: Safe Software Updates Via Multi-Version Execution<br />
- Updates are hard. Sometimes one step forward and two steps backward. Easy to introduce bugs.<br />
- Users are very often reluctant to update their software to a new version.<br />
- Lighttpd - fix in 04/2009 introduces a bug that was only identified 11 months later.<br />
- Goal: provide the benefits of the newer version while keeping the stability of the old version.<br />
- Solution: run both versions side by side. Pick the correct solution at runtime.<br />
- System completely transparent to users. At the moment running two versions with small differences.<br />
- Lighttpd 1.4.22 vs Lighttpd 1.4.23. Check every syscall and its arguments.<br />
- Have sync points. Once at a sync point, the new version will fail, but it can continue executing using the result (for the section) obtained running the old version.<br />
- Future work: improving performance overhead, support for more complex code changes, support for non-crashing type of divergences.</p>
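<p>The "check every syscall and its arguments" step can be pictured as comparing two traces and finding the first point where they disagree. The sketch below assumes traces are already available as lists of tuples; the function name and trace format are invented for illustration and say nothing about how MX actually intercepts syscalls.</p>

```python
def first_divergence(old_trace, new_trace):
    """Index of the first syscall where the two versions disagree, or None.

    Each trace is a list of (syscall name, arguments) tuples.
    """
    for i, (old, new) in enumerate(zip(old_trace, new_trace)):
        if old != new:
            return i
    # One version stopped early (e.g. it crashed): diverge at the cut-off.
    if len(old_trace) != len(new_trace):
        return min(len(old_trace), len(new_trace))
    return None
```

<p>A result of <code>None</code> means the two versions looked identical from the outside; any other value is the index of the first syscall at which they differ.</p>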
<p>Q1: How do you know when to make snapshots?<br />
A: We make them at almost every syscall.</p>
<p>Q2(Dushyanth): On a crash, you run one version with the other version's state. Does that work?<br />
A: It's fragile, but it works. We're looking at reducing its fragility.</p>
<p>Q3(Allen Clement): How do you do state comparisons without coming up with a nasty bottleneck?<br />
A: Using syscalls.<br />
Allen Clement: So you're saying that as long as the functions are the same and the arguments are the same, then it's fine?<br />
A: Yes.<br />
Allen Clement: If you treat syscalls as black-box then you can still have your state diverging in a subtle way.<br />
--------------------------------------------------------------------------------------<br />
Stefan Wigert - Advanced Persistent Threat (APT) Detection<br />
- Goal: detect stealthy, target-oriented, internet-based attacks by just looking at Internet communication logs<br />
- Why is it hard to detect? Social component (employees).<br />
- If you know for each company all the subnets the company owns, then you can construct its IP-space and determine top-k communication peers<br />
- How can you find the IP-spread of a company automatically? Using community detection algorithms...<br />
- Came up with an iterative approach, start with a seed-set and crawl around it.</p>
<p>Q(Allen Clement): What community detection are you using?<br />
A: Random-walks. We now use one that combines the seed-set with label propagation.<br />
Allen Clement: How connected do you see this work with Sybil identities?</p>
<p>Q(Valentin Dalibard): How much better are you performing than "cheap" techniques? E.g. consider nodes with many edges.<br />
A: I think the algorithm we're using now is not that difficult. The algorithm is good because it allows us to process iteratively. In the paper, we took 10% and analyzed it, then we took another random 10% and analyzed it.<br />
--------------------------------------------------------------------------------------<br />
Thomas Hruby - NewtOS - Reliable and efficient system for multicores<br />
- High performance fork of MINIX 3.<br />
- Smaller components with less state.<br />
- Performs better because exploiting multicores.<br />
- Developed on top of a microkernel.<br />
- No data sharing, no synchronization between components. Only uses message passing to interact among components.<br />
- A component crash does not kill the system.<br />
- The system is extremely slow: too many context-switches =&gt; bad cache usage.<br />
- NewtOS - switched from kernel communication to user level async point-to-point communication channels (e.g. shared memory queues).<br />
- Proof of concept: developed a new network stack. Chopped single block stack into multiple components/pieces. Increased Minix 3 net performance from 200Mbps to 10 Gbps.<br />
- On multicore the components can be migrated depending on the load.<br />
- We're looking at how to use the cores efficiently. We're focusing on the scheduler and on placing these components.</p>
<p>Q1(Dushyanth): There are many ideas here (e.g. microkernel, Barrelfish). Are they all tied together like that? Where do you see your contribution? Do I have to use Minix to buy into your contribution?<br />
A: We need a reliable system. We showed that chopping the system into smaller pieces does that.<br />
Dushyanth: Sorry, but do you have to closely relate your project to Minix? Nobody really uses Minix.<br />
A: Well, nobody really uses Barrelfish either.</p>
<p>Allen Clement: The vehicle you're using is not a driving factor for the problem.</p>
<p>--------------------------------------------------------------------------------------<br />
Thomas Knauth - Web service consolidation<br />
- Greenpeace published a report (2009). If cloud computing were rated as a country, it would come 6th. It uses more energy than the whole of Germany? (Comment: This looks extremely sketchy to me).<br />
- There are lots of websites that don't get lots of traffic, so there's potential for switching them off.<br />
- Modified Apache to record request inter-arrival times. If the request rate is low then the VM is shut down.<br />
- How fast can we resume virtual machines? Depending on the state (1GB to 4GB of memory), it can take from 2 secs to 10 secs.<br />
- Looking at techniques to do lazy resume. For some queries you can just initially resume part of the state.</p>
<p>Q1(Allen Clement): I feel I've heard this story before (DCs are inefficient). I also think I've heard that powering off machines and powering them back on is less efficient than keeping them on at low efficiency.<br />
A: For Google it doesn't really work because their idle times are low. We say that there are other kinds of workloads where this may work.</p>
<p>Q2(Dushyanth): There are many operational reasons why people running DCs don't switch machines off (e.g. machines are not coming back properly). Be careful about the Greenpeace statement, they actually use some different metric with which you can get those results.</p>
<p>Q3(Allen Clement): What was missing is: what's the problem/challenge that this mechanism solves? Even if I don't remember how the mechanism works, I can still remember the challenge.</p>
<p>--------------------------------------------------------------------------------------<br />
Marius Vlad - Detecting and analysing multi-stage payload attacks<br />
Missed this one. Didn't really get what it was about.</p>
<p>--------------------------------------------------------------------------------------<br />
Martin Nowack - Reducing runtime overhead of software fault checks using symbolic execution<br />
- Software typically has errors (integer overflow, off-by-one accesses) =&gt; security risks, downtime.<br />
- How to analyze? Add instrumentation or analyze beforehand.<br />
- There are code paths which are executed most often. The idea is to use symbolic execution to go through those paths of execution and try to remove unneeded checks.<br />
- Implementation based on KLEE.</p>
<p>Q1: Do you do it offline?<br />
A: I'm doing it offline.</p>
<p>Q2(Allen Clement): Short description of what you're proposing: it's expensive to do all the checks, to remove checks. Hence, just do a heuristic.<br />
A: Yes.</p>
<p>Q: KLEE has lots of heuristics. How can you discard a check statically when even at runtime you can't reason about it?<br />
A: ...</p>
<p>Q: Would it work for other optimizations?<br />
A: Yes, sure.</p>
<p>Q: Your optimization can only work for very specific workflows? It looks like your optimization can only optimize for specific paths (e.g. only the second request of a web server and not the fourth one).<br />
A: Time out.</p>
<p>--------------------------------------------------------------------------------------<br />
Natacha Crooks<br />
Didn't take notes on this one.</p>
<p>--------------------------------------------------------------------------------------<br />
Qian Ge - Eliminating Timing Channels from OS Kernel<br />
- Covert channels - allow barriers enforced by system protection mechanisms to be surpassed.<br />
- Timing channel - ...<br />
- Eliminating timing channels from seL4 (formally verified microkernel).<br />
- Evaluate the bandwidth of timing channels in seL4 and of the solutions to be proposed.</p>
<p>Q1: Are you going to enumerate all the timing channels? Will seL4 somehow help you do that?<br />
A: ...</p>
<p>Q2: How far is seL4 from being a general purpose OS, on general purpose hardware? seL4 seems more appropriate for use in constrained environments. It seems that in these environments you should have more control over what programs are doing. I'm wondering if the environment can help you.<br />
A: It's a good suggestion to think about it.</p>
<p>--------------------------------------------------------------------------------------<br />
Jens Kehne - BeeHive: A distributed operating system for cloud platforms<br />
- Heavily virtualized, coarse-grained allocation, duplicated state, difficult to split/merge VMs and distributed applications. All of these waste resources.<br />
- Processes as basic unit of management instead of VMs<br />
- Assign resource on a much finer grain, faster migration, less state duplication<br />
- BeeHive model: microkernel on multiple machines, cross-node communication, transparent process migration, continuous reassignment of resources<br />
- Challenges: How to make IPC fast enough? How to get the IPC and migration to be transparent? How to make good decisions about migration? How to make all this compatible with existing applications?</p>
<p>Q1(Dushyanth): Processes were around before VMs. There must be some reasons why people choose to use VMs instead of processes?<br />
A: One large problem used to be kernel state. It's difficult to extract the state of a process and move it to a different machine.</p>
<p>Q2(Allen Clement): VMs were introduced to get easy isolation between processes on different VMs. What are we giving up if we drop VMs?<br />
A: Using a microkernel you can still provide fairly strong isolation between processes. Somewhat weaker than VMs, but still good enough.</p>
<p>Q3(Steve Hand): What's your killer experiment to show that the system is good?<br />
A: I want to make reconfiguration as fast as possible (i.e. move one process to a different machine when the ex-location gets busy).<br />
Steve/Dushyanth: You can do that today very fast with VMs.</p>
<p>--------------------------------------------------------------------------------------<br />
Simon Gerber - Memory management for heterogeneous multicores<br />
- Memory management on a single core is well-understood, but we don't know what to do on heterogeneous systems.<br />
- Want to find a way of building a fast MM for homogeneous systems, extend it to simple heterogeneous systems and finally complex heterogeneous systems (e.g. x86 and ARM).</p>
<p>Q(Nickolai Zeldovich): What is memory management?<br />
A: The most basic way of isolating processes. =&gt; (Zeldovich) It basically is virtual memory.</p>
<p>Q(Steve Hand): What is your target hardware?<br />
A: Probably an x86 that talks to an ARM core.</p>
<p>--------------------------------------------------------------------------------------<br />
James Snee - Cross-layer instrumentation for deep layered software stacks<br />
- Energy consumption for phones.<br />
- Looking at outliers of measurements. Are our measurements ok?<br />
- Must identify the source of the outliers.<br />
- How do we trace application behaviour through the stack and detect the source/reason of the outlier.<br />
- Detect anomalies with runtime tracking.<br />
- Call graph built from a sliding window of event characteristics. With this graph we can say where the divergence happens.<br />
- Working on how fine-grained to go with the trace. Iterative process: start coarse and in each iteration go more fine-grained.<br />
- Similar to ftrace but for Android. ftrace can be turned on at runtime.</p>
<p>Q: So you haven't figured what the outliers are?<br />
A: No, not yet. People usually say it's JVM GC, but something else is going on.</p>
<p>Q(Steve Hand): The screen backlight and the right primary sources of energy drains. Are you looking at that?<br />
A: Sure, but we can also look at other concepts (e.g. if 3G is an aggressive battery user).<br />
--------------------------------------------------------------------------------------<br />
Stanko Novakovic - Scale-Out NUMA Systems<br />
- In large-scale graphs work done per node is small and there's a lot of communication.<br />
- Scale-up: single machine. Low-latency, but does not scale.<br />
- Scale-out: multiple machines. Scalability, but high remote access latencies.<br />
- We want Scale-out that can provide Scale-up performance.<br />
- Scale-out system based on NUMA, where nodes can directly r/w access each other's local memory.<br />
- Working on a prototype based on Xen and ccNUMA.</p>
<p>Q(Steve Hand): Why has nobody done this yet? Why doesn't RackScale do this?<br />
A: There's something similar, microserver cluster. However, they're different because they rely on special connection and TCP/IP.<br />
--------------------------------------------------------------------------------------<br />
Stefan Kaestle - Distributed Computing on Hybrid Multicore Machines<br />
- Multicores will have more and more hybrid interconnects.<br />
- To overcome problems with NUMA awareness and architecture we need to program as a distributed system.<br />
- Must think how distributed algorithms change on multicore.<br />
- Ideas: model multicore machines, find relevant characteristics of an algorithm and map them on multicore architectures.</p>
<p>Q(Allen Clement): Which of the classical problem statements matter? Which of the traditional problems still make sense in this multicore system?<br />
A: It depends on how much you want to scale. If you only have one machine then it doesn't matter (assume no failures). However, we have the same problems if we scale on multi-machines.<br />
--------------------------------------------------------------------------------------<br />
Tomasz Kuchta - Document Recovery<br />
- Imagine the case when a file gets corrupted.<br />
- Why is the application crashing? It may be that the program is using an unusual execution path.<br />
- Symbolic execution to obtain constraints on the input then create a new document that will not crash the application.<br />
- How to obtain? Modify and solve the constraints and choose the document which is closest to the initial file.<br />
- Challenges: What heuristics should be used? (i.e. which constraints do we want to negate/remove, focus on the bytes and constraints that trigger the bug)<br />
Q(Allen Clement): What's the difference between a buggy file and corrupted text?<br />
A: My focus is on the files that cause crashes/abnormal program termination.<br />
Allen Clement: Do you see any possibilities to tackle corrupted data, when you don't realize that something is wrong and just accept that file?<br />
A: Difficult because we base our work on constraints.<br />
--------------------------------------------------------------------------------------<br />
Valter Balegas - Improving Data Consistency in Cloud Infrastructure<br />
- Problem is that the service has to provide low latency and serve many requests at the same time.<br />
- Many consistency levels have been proposed. Mostly force execution respecting a total order.<br />
- Proposes: Add a reservation manager. Nothing can happen unless you request a reservation from the manager.<br />
- A consistency model based on data semantics.<br />
- Causal+ consistency with invariants.<br />
--------------------------------------------------------------------------------------<br />
Valentin Dalibard - Optimising Graph Computation: Trading Computation for Communication<br />
- Assume you have a graph, if small enough then you just put it in the memory of a machine and run the computation.<br />
- If graph too big then you probably use distributed BSP.<br />
- The problem is that in BSP there are the synchronization steps which take a lot of time.<br />
- PowerGraph spends about 90% on the synchronization.<br />
- Lesson: CPU time is very cheap, but communication time is very expensive.<br />
- New approach: Don't need to go through BSP, just want to converge as fast as possible using all the resources you have.<br />
- Have constant computation and constant communication.<br />
- Changes in the computation model of BSP: - allow stale inputs on vertex functions =&gt; iterate more often.<br />
- assign different priorities to messages =&gt; send data from more important vertices.<br />
- allow to iterate over vertices at different frequencies.<br />
- Challenges: Doesn't work for all computations. Check which fixed-point computations fit this model.</p>
<p>Q: Is that going to be more difficult for the person writing these jobs?<br />
A: Yes, it's going to be more difficult. It's part of the problem of finding the computations that work.<br />
Q(Dushyanth): Are you talking about bandwidth and latency? It's a good idea to run some back-of-the-envelope numbers.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-2013-doctoral-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EuroSys workshops: Systems for Future Multi-core Architectures</title>
		<link>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-workshops-systems-for-future-multi-core-architectures-sfma/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-workshops-systems-for-future-multi-core-architectures-sfma/#comments</comments>
		<pubDate>Sun, 14 Apr 2013 14:45:34 +0000</pubDate>
		<dc:creator>Malte Schwarzkopf</dc:creator>
				<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Workshop]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1086</guid>
		<description><![CDATA[I am at the SFMA workshop at EuroSys in Prague today, ahead of the main conference. Below are some notes on the keynotes and papers in the workshop, including the keynote by our own Steve Hand! Keynote: Challenges for M*-core Systems Steven Hand, University of Cambridge Transistor counts double every 18 months, as we all [...]]]></description>
				<content:encoded><![CDATA[<p>I am at the <a href="http://sfma13.cs.washington.edu/workshop-program/">SFMA workshop</a> at EuroSys in Prague today, ahead of the main conference. Below are some notes on the keynotes and papers in the workshop, including the keynote by our own <a href="http://www.cl.cam.ac.uk/~smh22/">Steve Hand</a>!</p>
<p><span id="more-1086"></span></p>
<h2><strong>Keynote: Challenges for M*-core Systems</strong></h2>
<p><i>Steven Hand, University of Cambridge</i></p>
<p>Transistor counts double every 18 months, as we all know. But we are no longer building single-core chips out of these, but multi-cores. We've got dozens of cores now, but this is likely to increase. The reason for going multi-core in the first place was hitting the "power-wall": couldn't clock CPUs fast enough, but also "walls" in memory access and ILP. Multi-core ~= 2-16 cores, many-core ~= &gt;16 cores; can be homogeneous or heterogeneous. In theory, should get ideal parallel speedup for parallelizable sections (cf. Amdahl's Law), but in practice, there are parallelization overheads that cause diminishing returns or even degradation as we go more parallel.</p>
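<p>As a rough illustration of the diminishing returns mentioned above, here is a minimal Python sketch of Amdahl's Law with an added per-core overhead term. The linear overhead model, the function name and the parameter values are assumptions made for illustration; none of them come from the talk.</p>

```python
def amdahl_speedup(parallel_fraction, cores, overhead_per_core=0.0):
    """Speedup over one core; overhead_per_core is a hypothetical linear cost."""
    serial = 1.0 - parallel_fraction
    parallel = parallel_fraction / cores
    # Assumed model: every additional core adds a fixed coordination cost.
    overhead = overhead_per_core * (cores - 1)
    return 1.0 / (serial + parallel + overhead)


if __name__ == "__main__":
    for n in (1, 2, 16, 64, 256):
        ideal = amdahl_speedup(0.95, n)
        real = amdahl_speedup(0.95, n, overhead_per_core=0.001)
        print(f"{n:4d} cores: ideal {ideal:5.2f}x, with overhead {real:5.2f}x")
```

<p>With a non-zero overhead term, the modelled speedup peaks at a moderate core count and then falls again, which is the "degradation" case described above.</p>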
<p>So how can we go faster when parallelizing? Maybe make serial portion faster (dynamic overclocking etc.); improve synchronization and communication primitives to reduce parallelization overhead; reduce straggling in parallel portions. The latter sounds suspiciously like MR-style stragglers -- but typical solution a bit different: use gang-scheduling for parallel threads to minimize imbalance. But this only works if the threads make equal progress given the same opportunity! This isn't always true -- many data-parallel applications aren't too regular (e.g. graph computations). Might be a good idea to partition into more numerous tasks and use work-stealing to balance; but problem with this is that the per-task overheads come back!</p>
<p>Memory access is also a big challenge: with NUMA, remote memory and deep cache hierarchies, getting the placement of memory objects right is crucial for balanced performance. Furthermore, having multiple parallel applications leads to a bin-packing problem for placement, while the parallel width of programs is also a variable. Starting to get a bit complicated, as we'd ideally like to optimize system-wide utility. BOPM experiment shows that, given sufficient work available, can get parallel speedup up to 60 threads (overcommitting the machine a little), but we get diminishing returns much earlier, so it makes sense for overall utility maximization to give some space to other applications. Scheduling this becomes tricky -- gang-scheduling goes some way, but still has trouble dealing with small gangs, localization and cost of scheduling, as well as microarchitectural interference.</p>
<p>Some indicative interference results, e.g. SPECCPU 2006: degradation in performance happens when sharing caches (as expected), but for some benchmarks it even happens if they are running on separate sockets and thus far away from each other. This also holds for higher-level macro benchmarks, such as typical data centre applications (again, 1.6x to 2x degradation between pairs on otherwise idle machine). Co-locating cache-sensitive and memory-intensive (streaming) workloads might be a promising approach (since the streaming benchmark does not benefit from the cache). Of course, if we're not doing anything clever, the cache-sensitive benchmark gets screwed over as its data is evicted from the cache. Solution: use page colouring to partition the cache and contain the streaming workload. Surprising result, however: the cache-sensitive benchmark improves, but the streaming benchmark runs with terrible performance. This turned out to be a result of the implicit page colouring, as this partitions the address space! Can be fixed with a hack (use EPTs to remap memory dynamically), at which point both workloads work well (with some caveats, preliminary work).</p>
<p>But what about new applications? Many attempts and concepts to avoid the programmer having to think much about parallelization. Might use threading libraries, or task-parallel data flow frameworks, which maybe also let us deal with heterogeneity. This is great, but running all this scaffolding on top of a standard OS might be a little inefficient, as it makes system-wide optimization hard to impossible. Answer: Mirage-style unikernel OS on top of Xen, highly specialized, but single-process/single-CPU. This means boot and live migration become very quick -- nice! No paging needed inside the unikernel, as all memory is pre-allocated at bootup time. Good performance on DNS server, and high-level language features in OCaml enable improvement on legacy systems (bind, NSD).</p>
<p>Now, clearly there won't be a mass-migration to Mirage-style unikernels any time soon. So can we do something interesting for existing programs that would like to exploit many-core? One interesting approach is trying to use speculation to extract parallelism from single-threaded applications. Initially looked at specialization in order to avoid waiting for I/O, which can lead to much idleness especially on many-core machines. But much of I/O, especially for desktop applications, actually just reads the same data every time. We could run a specializer on the binary that includes e.g. configuration settings in the binary, and thus avoid said I/O entirely. However, it is not always possible to specialize ahead of time, which gives us opportunities for speculation. For example, at a point of control flow depending on an unclear value, we could run ahead with threads assuming plausible values and then continue on the strand that turned out to be the correct one.</p>
<p>&nbsp;</p>
<h2>Asymmetry-aware execution placement on manycore chips</h2>
<p><em>Alexey Tumanov, Joshua Wise, Onur Mutlu, Gregory R. Ganger</em></p>
<p>With many-core chips, we tend to have fewer memory controllers than cores. Since they have non-uniform distances to different cores, placing execution threads close to their memory controller becomes an important problem (ANUMA = asymmetrical NUMA). The main difference here is that ANUMA systems have gradually changing memory access latencies to different memory controllers (much more fine-grained than previously). Micro-benchmarks show a worst-case latency differential of 14% on a Tile64 ANUMA chip. Does this matter to real-world workloads? Yes, get the same 14% difference on a single-core GCC benchmark. This is likely to get much worse once other applications use the other cores, and the NoC becomes contended. Classical static NUMA partitioning is not a great answer, as it is not contention-aware. Possible solutions: move the data to the computation (or at least allocate physical frames cleverly), or move the execution to the data, or hybrid.</p>
<p>What they did is to instrument the VM subsystem to collect information about page accesses, and then place threads appropriately. The placement algorithm is simple: threads are ordered by memory intensity and placed in decreasing proximity to the best memory controller. Ties broken by gradient descent to second choices. Their instrumentation is fairly heavyweight, so even the optimized version does not outperform the non-instrumented baseline (but does better than baseline with instrumentation turned on). In future work, they're planning to look at heterogeneous workloads, and extend the same ideas to caches ("ANUMCA"), and look at application-level goals.</p>
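<p>The greedy ordering described above could look roughly like the following Python sketch. The function name, the intensity metric and the hop-count distance are hypothetical, and the gradient-descent tie-breaking from the talk is not modelled (ties simply keep their input order).</p>

```python
def place_threads(intensity, core_distance):
    """Map the most memory-intensive threads to the cores nearest the controller.

    intensity:     {thread_name: memory accesses per second} (hypothetical metric)
    core_distance: {core_id: hops to the preferred memory controller}
    """
    if len(intensity) > len(core_distance):
        raise ValueError("more threads than cores")
    # Most memory-hungry thread first, closest core first, then pair them up.
    threads = sorted(intensity, key=lambda t: intensity[t], reverse=True)
    cores = sorted(core_distance, key=lambda c: core_distance[c])
    return dict(zip(threads, cores))
```

<p>For example, with intensities a=10, b=500, c=90 and cores at distances 3, 1, 2 and 5 hops (core IDs 0 to 3), thread b lands on core 1, c on core 2 and a on core 0.</p>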
<p><strong>Q (Simon Peter):</strong> Do you get the contention only to the memory controller, or also due to IPC on the NoC?<br />
<strong>A:</strong> No-ish. Looked at interconnect throughput volume, which is shared between IPC and MC traffic.</p>
<p><strong>Q (Malte Schwarzkopf):</strong> Why do you need per-process HW memory access counters?<br />
<strong>A:</strong> Don't need to; was mistake, should be per-core.</p>
<p><strong>Q (?):</strong> How do you know that your placement stuff does not hurt cache locality?<br />
<strong>A:</strong> The benchmarks we looked at have a poor cache locality, so that the total number of memory accesses remained the same.</p>
<p>&nbsp;</p>
<h2>Supporting Iteration in a Heterogeneous Dataflow Engine</h2>
<p><em>Jon Currey, Simon Baker, and Christopher J. Rossbach</em></p>
<p>There is increasingly much heterogeneity in systems, due to the prevalence of various accelerator chips (GPGPU, SoCs, FPGAs). Data-flow is a good model to program these things, as it expresses the minimal data movement, implies parallelism and leaves crucial decisions to the runtime engine. But data-flow has expressivity limitations, especially with iterative and data-dependent workloads, conditional routing and stateful computations (e.g. accelerator buffers too small to contain all data). The classical data-flow ISA solution of distributors and selectors does not work, as they are designed for fine-grained data flow (I'm not sure why? allegedly scheduling complexity?).</p>
<p>Their IDEA engine extends PTask, and enables iteration without requiring any additional nodes to be added to the data flow graph. A lot like CIEL, but unlike it, they do not extend the graph, but add predicates to the channels between data flow nodes. These predicates indicate specifically when an iteration should finish. Control signals piggy-back onto data blocks, and these signals are used for conditional routing. Iteration needs some special support: there is a special iterator function (can be user-defined) that decides what control signal to issue. The iterator has some kind of scope binding, which seems to be necessary for some distributed consistency properties.</p>
<p>Evaluation using optical flow workload, three-fold: CPU, GPU-with-driver-program, and GPU-with-IDEA. As expected, there is a huge speedup as a result of doing work on the GPU, and as the data size increases, the benefit of IDEA (which reduces the launch and load overhead) decreases. System works well and has good generality, but is a bit hard to program by hand. No support for dynamic data flow graph extension or generation.</p>
<p><em>[missed questions as I was asking one myself]</em></p>
<p>&nbsp;</p>
<h2>Supporting efficient aggregation in a task-based STM</h2>
<p><em>Jean-Philippe Martin, Christopher J. Rossbach, Derek G. Murray, Michael Isard</em></p>
<p>TM is easier than locks, threads and classic parallel programming. Assertion: under low contention, TM has better performance/contention ratio (especially write sharing). Focus on one special kind of write-sharing: aggregation. Insight: we do not always have to serialize aggregations! Their system, Aggro, replaces threads with tasks, which can be spawned dynamically and equate to transactions. Reads and writes are on objects, which can be RW-shared, provided serialization guarantees hold.</p>
<p>The guarantees are: (1) for any run, there exists a serial execution consistent with the task graph, yielding the same result; (2) non-opacity [missed description]. Internally, Aggro has a bunch of expected data structures (TX contexts, read sets, write sets, object contexts etc.). Consider the example of a shared aggregation counter: modifying it is commutative, so TXs may run in any order. It is only on read that the various writes that committed (in any order) in the meantime are "collapsed" (for which exclusive locking is required). Aggregation operators must be associative, commutative and side-effect-free.</p>
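<p>To make the shared-counter example concrete, here is a minimal Python sketch of a counter whose writes are recorded without ordering and only collapsed on read. This is not Aggro's implementation; the class name and structure are invented for illustration, and the lock-free append relies on CPython's list operations being atomic.</p>

```python
import threading


class AggregatedCounter:
    """Counter whose increments commute and are only merged when read."""

    def __init__(self):
        self._value = 0
        self._pending = []             # committed but not yet collapsed deltas
        self._lock = threading.Lock()  # taken only to collapse on read

    def add(self, delta):
        # list.append is atomic in CPython, so writers take no shared lock here.
        self._pending.append(delta)

    def read(self):
        with self._lock:
            # Collapse every delta committed so far; order is irrelevant
            # because integer addition is associative and commutative.
            while self._pending:
                self._value += self._pending.pop()
            return self._value
```

<p>Writers never wait for each other; only a reader pays for exclusive access, mirroring the "collapse on read" behaviour described above.</p>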
<p>Evaluation using k-means, wordcount, triangulation. Aggro scales best (up to 24-ish cores) on kmeans; more results in the paper.</p>
<p><strong>Q (?):</strong> What happens in the read-after-write case in Aggro?<br />
<strong>A:</strong> [missed details]</p>
<p>&nbsp;</p>
<h2><strong>Keynote: The multicore evolution and operating systems: scalability by design</strong></h2>
<p><em>Frans Kaashoek</em></p>
<p>Reducing serial sections is crucial to good multicore performance. Typically, we fix OS scalability issues by profiling a target application, fixing the bottleneck, and repeating. But this is a little inelegant, as we cannot really tell if the bottlenecks we're fixing are fundamental, or workload-dependent phenomena. Instead, consider an interface-driven approach. Basic idea is the commutativity rule: if two operations commute, they can be implemented scalably -- this helps reasoning about interfaces. Their COMMUTER tool finds opportunities for commutativity and auto-generates test cases. Adopt a scalability definition that assumes that operations scale if they access disjoint memory, OR merely read. The idea of commutativity came out of looking for an implementation-independent principle of determining if operations *can* work with disjoint memory. Might be able to change interfaces if they do not commute (as commutativity is desirable).</p>
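<p>A much simplified way to picture the commutativity check is the Python sketch below. It treats two operations as commuting if both orders give each operation the same result and leave the same final state. This is a toy model, not COMMUTER's definition; the descriptor-allocation functions are assumptions chosen to contrast the POSIX "lowest free descriptor" rule with a hypothetical relaxed interface.</p>

```python
def commutes(state, op_a, op_b):
    """True if both orders give each op the same result and the same final state."""
    s1, ra1 = op_a(state)
    s1, rb1 = op_b(s1)
    s2, rb2 = op_b(state)
    s2, ra2 = op_a(s2)
    return (ra1, rb1, s1) == (ra2, rb2, s2)


def open_lowest_fd(fds):
    # POSIX-style rule: return the lowest unused descriptor number.
    fd = next(n for n in range(len(fds) + 1) if n not in fds)
    return fds | {fd}, fd


def make_open_private_fd(owner):
    # Hypothetical relaxed interface: each caller draws from its own range.
    def op(fds):
        fd = next(n for n in range(owner * 1000, owner * 1000 + 1000) if n not in fds)
        return fds | {fd}, fd
    return op
```

<p>Starting from <code>frozenset({0, 1, 2})</code>, two <code>open_lowest_fd</code> calls do not commute, because which call receives descriptor 3 depends on the order. Two private-range opens from different owners do commute, which is the kind of interface change the talk hints at.</p>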
<p>Turns out in practice that only very few operations commute unconditionally. Adopt a notion of "legal histories", which is essentially a way to derive if interleavings are legal under commutativity. Some theory about how this works; net result is that we can tell if sequences of operations are commutative and where the boundaries are. As a result, we can either change operation orderings (e.g. delay non-commutative ops) to get better scalability, or change the implementation/interface semantics.</p>
<p>Three classes of non-commutative operations in POSIX: complex return values, unnecessary orderings, complex operations. They built a tool that takes a Python model of syscall behaviour, generates test cases and runs them on top of Linux in modified QEMU, which reports scalability violations (shared writes). Tried this with 10 POSIX FS calls, and found that only ~500 combinations scale; toy ScaleFS gets to ~1500. These gains scale to real-world micro-benchmarks, and also to higher-level workloads (mmap-heavy Metis MapReduce on 80-core machine).</p>
<p>Conclusion: the commutativity rule is quite useful in practice, and extends beyond OS APIs.</p>
<p>&nbsp;</p>
<h2>Heterogeneous Multicores: When Slower is Faster</h2>
<p><em id="__mceDel">Tomas Hruby, Herbert Bos, Andrew S. Tanenbaum</em></p>
<p>Breaking an OS into many components (classic µkernel spin) is great for dependability, but typically slow. NewtOS (their thing) is a high-performance version of MINIX 3, which avoids context switching and kernel boundary crossing for IPC (kernel does setup only). Proof-of-concept: disaggregated network stack in NewtOS, which goes from 200 Mbps in MINIX 3 to 10G for TCP in NewtOS.</p>
<p><em>[missed the rest of the talk; something about heterogeneous cores and resource efficiency]</em></p>
<p><em>[I missed most of the final session; sorry]</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2013/04/14/eurosys-workshops-systems-for-future-multi-core-architectures-sfma/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Live blog from OSDI 2012 &#8212; Day 3</title>
		<link>http://www.syslog.cl.cam.ac.uk/2012/10/10/live-blog-from-osdi-2012-day-3/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2012/10/10/live-blog-from-osdi-2012-day-3/#comments</comments>
		<pubDate>Wed, 10 Oct 2012 16:13:13 +0000</pubDate>
		<dc:creator>Malte Schwarzkopf</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1065</guid>
		<description><![CDATA[This year's OSDI in Hollywood is entering its final day; as usual, we will be covering the sessions live on syslog. Continue reading below the fold for talk-by-talk coverage. &#160; Session 8: Debugging and Testing SymDrive: Testing Drivers without Devices Matthew J. Renzelmann, Asim Kadav, and Michael M. Swift, University of Wisconsin—Madison Driver stability is [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft" title="OSDI logo" src="https://www.usenix.org/sites/default/files/osdi12_going.png" alt="" width="162" height="67" />This year's OSDI in Hollywood is entering its final day; as usual, we will be covering the sessions live on syslog.</p>
<p>Continue reading below the fold for talk-by-talk coverage.</p>
<p>&nbsp;</p>
<p><span id="more-1065"></span></p>
<h1>Session 8: Debugging and Testing</h1>
<p><strong>SymDrive: Testing Drivers without Devices</strong><br />
<em>Matthew J. Renzelmann, Asim Kadav, and Michael M. Swift, University of Wisconsin—Madison</em></p>
<p>Driver stability is critical. Many approaches to improving it have been proposed, but few widely deployed. The prevalent methodology is simply testing the driver and performing code reviews. Testing does, however, require access to the actual physical hardware, and many drivers cover dozens of devices. Kernel evolution also occasionally necessitates changes to many drivers, all of which then must be tested again. Existing approaches all fall foul: formal specifications are too large an effort, though finding many bugs; static analysis scales well and requires little effort, but only catches some bugs; testing and code reviews are somewhere in between on both effort and effectiveness dimensions. Their system, SymDrive, aims to achieve high effectiveness at low effort. This is achieved using symbolic execution of driver code.</p>
<p>Three main goals: (1) find deep, meaningful bugs, including those spanning multiple entry points and involving pointer/object queues; (2) do this without significant developer effort; (3) enable broader patch testing and apply to many classes of driver. To achieve this, they require a model of the device behaviour, so that access to the real hardware is not required. SymDrive builds on S²E, a symbolic execution engine that provides a symbolic device, and kernel modules for symbolic buses etc. The challenge with symbolic execution, as always, is avoiding path explosion. They use "special, invalid x86 opcodes" generated statically that instruct the execution engine to favour certain paths over others. For example, success paths are prioritized. Loops are instrumented with opcodes informing the engine of the fact that it is entering an iterating loop, which can then be elided. They also have a special "high coverage mode", which will try to explore all paths, and forks on control flow branches. All of this magic is inserted into the driver source by a static analysis tool called SymGen.<br />
Another challenge is to define what the correct driver behaviour is. For this purpose, SymDrive provides "checkers", which are essentially sophisticated assertions. Somehow, these are chosen automatically by something called the "test framework" (I did not fully understand how this works). Checkers, however, are limited to verifying properties at the kernel-driver interface, and cannot check if the device works as expected.</p>
<p>Evaluation: 39 bugs found across 26 Linux/FreeBSD drivers on five buses; all of these verified, some patched. Of the 39, 22 were found by checkers, and 17 by symbolic execution in the kernel. Code coverage is high, with the median &gt;80%, and only 1 annotation on average (max: 7) was required. Median runtime for SymDrive is ~25 minutes, although massive outliers (&gt;8h) exist.</p>
<p><strong>Q (Philip Levis, Stanford U):</strong> How do you handle the path explosion problem for interrupts, which can happen at any point in the execution stream?<br />
<strong>A:</strong> This approach will miss some bugs, since we do not model interrupts at arbitrary points, but only fixed time intervals.</p>
<p><strong>Q (someone from EPFL):</strong> What if you have multiple loops, and want to go through some of them multiple times, rather than taking the shortest exit? How do you deal with this?<br />
<strong>A:</strong> The loops that SymDrive is most concerned with are those that iterate repeatedly and generate new paths. Most of the loops we found that generate states could be broken out of early. Worst case example here is a checksum loop, which prevents the driver from executing correctly if it does not fully execute, but this will cause a warning asking for annotation to be printed.</p>
<p><strong>Q:</strong> Why do you favour success paths? It seems that bugs are more likely to linger on poorly tested error handling paths...<br />
<strong>A:</strong> True. For this, we provide high coverage mode, and the LED bug we showed actually used that. The Linux driver development model is one of continuous patching and patch testing, and this matches SymDrive's behaviour well.</p>
<p><strong>Q:</strong> False positives?<br />
<strong>A:</strong> Could occur if you have incorrect checkers, or hardware-dependent bugs (i.e. depending on unexpected behaviour). We strive to minimize false positives at all costs, and we have not really found this to be an issue.</p>
<p>&nbsp;</p>
<p><strong>Be Conservative: Enhancing Failure Diagnosis with Proactive Logging</strong><br />
<em>Ding Yuan, University of Illinois at Urbana-Champaign and University of California, San Diego; Soyeon Park, Peng Huang, Yang Liu, Michael M. Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage, University of California, San Diego</em></p>
<p>Errors in production settings are notoriously hard to diagnose, since they can only be reconstructed if execution environment and inputs are exactly replicated. However, customers are often reluctant to give this information to vendors. Logs, however, are considered less sensitive and often more readily available. That said, a real-world problem today is that software is not producing enough diagnostic log output, and log messages are often only added reactively (in response to hard-to-debug error reports). Hypothesis: there are many missed opportunities for developers to add logging output. Indeed, these opportunities map to a small set of classes, and code can automatically be instrumented with appropriate log messages.</p>
<p>They gathered 250 bug reports from various open source software repositories. Found that only 43% of them have error log messages associated with them -- i.e. more than half fail silently. However, 77% have "easy-to-log" opportunities. Example: Apache developers fail to log errors returned by open() syscall as it occurs in wrapper function. However, good practice ways of dealing with this exist (e.g. SVN's SVN_ERR macro). There is something called the Fault-Error-Failure model, which formalizes how errors should be detected, handled and communicated. They claim that conservatively "over-logging" is the right action, and present manual analysis that shows that developers in many of the 250 cases considered missed such opportunities.</p>
<p>Errlog (their system) automatically detects exception patterns (e.g. syscall returns) in source code and inserts log messages where they do not exist. They can also learn "application specific" error conditions, and instrument those that do not have log messages. But there can be false positives: e.g. error return from stat() when used to verify that a file does not exist. They use a heuristic to deal with these: only log each 2^nth occurrence of the error to avoid logspam. Errlog can successfully add log messages covering 65% of the identified failures; the remaining 35% still fail silently. To do so, they add 60% more log messages. When compared to existing manual log messages, Errlog covers 83% of these cases. Of course, there is some overhead: a few noisy error messages during normal execution are introduced, but this is only 1.4% on average, and under 5% in the worst case (OpenSSH). Also did a user study with 20 programmers to evaluate Errlog. Gave them some bugs, partly reproducible, partly not. Found that Errlog reduces diagnosis time by 61% on average (though with large error bars). Got positive feedback from users.</p>
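<p>One plausible reading of the "each 2^nth occurrence" heuristic is sketched below in Python: a given error site logs on its 1st, 2nd, 4th, 8th, ... occurrence and stays silent otherwise. The class and method names are hypothetical and this is not Errlog's actual code.</p>

```python
class ExponentialLogLimiter:
    """Emit an error site's message only on occurrences 1, 2, 4, 8, ..."""

    def __init__(self):
        self._counts = {}  # per-site occurrence counters

    def should_log(self, site):
        count = self._counts.get(site, 0) + 1
        self._counts[site] = count
        # A positive integer is a power of two iff it has exactly one bit set.
        return bin(count).count("1") == 1
```

<p>A frequently failing but harmless call such as the stat() existence check then produces only a logarithmic number of messages, while a rare error is still logged the first time it happens.</p>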
<p>Limitations: maybe not representative, since only looked at five projects, all written in C/C++. Still have 35% silent failures. Semantics of auto-generated log messages not as good as manually written ones.</p>
<p><strong>Q:</strong> How many of the silent failures would disappear if you turned on the "verbose" option (which is usually turned off in production)?<br />
<strong>A:</strong> Indeed. Logging overhead with verbosity on is 90%, but would help. Undesirable, though, since developers have to ask users to turn it on and this involves an extra round of user interaction.</p>
<p><strong>Q:</strong> How far could you get without access to source code? (e.g. LD_PRELOAD tricks etc.)<br />
<strong>A:</strong> Probably could cover some bits, such as syscall error conditions, but others require more sophisticated static analyses, so must have source code.</p>
<p><strong>Q (Margo Seltzer, Harvard U):</strong> Shouldn't we be teaching people to write better log messages?<br />
<strong>A:</strong> Yes!</p>
<p><strong>Q:</strong> Can your tool add too many log messages and overwhelm the developer?<br />
<strong>A:</strong> This is not about debug messages, but about error messages, which are by definition rare, as they only occur if error conditions are present. Still, could happen -- but could use frequency-based ranking or similar techniques.</p>
<p><strong>Q:</strong> Any suggestions on better tools to help developers logging?<br />
<strong>A:</strong> Ideally, would use something like Errlog as an IDE plugin, and automatically generate template error messages.</p>
<p>&nbsp;</p>
<p><strong>X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software  (Best Student Paper)</strong><br />
<em>Mona Attariyan, University of Michigan and Google; Michael Chow and Jason Flinn, University of Michigan</em></p>
<p>Performance problems are hard to debug, especially by end users. Tools for end users are different from tools for developers -- particularly in that end users do not want to (or are not able to) look at source code. Often, configuration files have a great influence, but are complex to understand. What is missing from current tools is an explanation of "what" went wrong, and what the root cause of a problem is. Profilers are a brute-force approach to this, by attributing costs to everything. X-ray improves on this by automatically attributing costs to root causes, and presenting a list of these. To do so, it uses deterministic replay combined with offline analysis. Recording overhead at runtime is low, so could have this on all the time. When a performance problem occurs, the user can take the recorded log and send it to an offline analysis machine, which will then return a prioritized list of root causes.</p>
<p>They found existing deterministic replay systems to be unsuitable for their purpose, as the recorded and the replayed executions differ (why? surely the whole point of deterministic replay is for it not to do that?). They solved this by modifying the replay tool to be "instrumentation aware" (it did not become entirely clear what this means). X-ray can work at different granularities: entire execution, timeslice, or individual requests. For the last option, especially, a challenge is to identify what code relates to which request. They have various methods of doing so, some of which are based on taint-tracking for data. Once they have established a mapping from basic blocks to requests, they attribute costs to them, and finally map costs to root causes.</p>
<p>To find out why two requests differed in performance, they use differential performance analysis. X-ray then extracts the control flow graphs for each request, merges them and figures out where the execution paths differ. This is at conditionals, and they quantify the extra cost of taking a different path by comparing it to the cost of the shortest path to exit. Various bits of cleverness deal with comparing many requests; ultimately, the costs are attributed to root causes and a list of root causes is presented.</p>
<p>Note that X-ray is limited to identifying root causes which are configuration settings or inputs. Evaluation: looked at 4 applications (Apache, Postfix, PostgreSQL, lighttpd). For 17 selected performance bugs, X-ray ranked the correct root cause first or tied first-second in 16 out of 17 cases. These results are better than they actually expected themselves. Runtime for all of the applications is on the order of &lt;10 minutes.</p>
<p><strong>Q (someone from Harvard): </strong>Experienced any situations in which certain performance bugs mask other bugs?<br />
<strong>A:</strong> They will all turn up in the performance summarization list, ordered by their impact. Differential analysis helps excluding issues that the user is not interested in.</p>
<p><strong>Q:</strong> Is this work also applicable to multi-component and distributed systems?<br />
<strong>A:</strong> Yes, can run on multiple machines, but X-ray itself does not communicate.</p>
<p>&nbsp;</p>
<h1>Session 9: Isolation</h1>
<p><strong>Pasture: Secure Offline Data Access Using Commodity Trusted Hardware</strong><br />
<em>Ramakrishna Kotla and Tom Rodeheffer, Microsoft Research; Indrajit Roy, HP Labs; Patrick Stuedi, IBM Research; Benjamin Wester, Facebook</em></p>
<p>[I missed the first half of this talk due to having to check out of the hotel; this is something about using crypto and TPMs to ensure that data remains private and accessible even when offline. Goal appears to be to be able to give data to people, and be able to audit if they accessed it, even if they did so while off-line. Fairly heavy on crypto; they also support access revocation. Application appears to be something like secure movie rental, with logged access. The Pasture library can be linked into any application, and they show an example of integration into Outlook. Common operations are fast, uncommon ones take on the order of seconds, key generation is most expensive with ~5s.]</p>
<p>&nbsp;</p>
<p><strong>Dune: Safe User-level Access to Privileged CPU Features</strong><br />
<em>Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis, Stanford University</em></p>
<p>Privileged CPU features can actually be quite useful to applications. For example, garbage collection, intra-process privilege separation and safe native code execution in browsers could all benefit from having access to privileged instructions. One way to expose them is to patch the kernel -- but this does not scale, as the patches are application-specific. How about using an extensible OS, like Exokernel? Works, but need to rewrite entire OS. What about virtualization with custom kernels? Strict partitioning and isolation of VMs means that inter-application communication is inhibited. What Dune does is to provide safe user-level access to such privileged features, while maintaining the standard POSIX process semantics and the same kernel-process interface. This is achieved by using existing virtualization hardware, and existing kernel features to access it. It is used in a very different way to how it's used in virtualization, though.</p>
<p>Let's consider the example of GC. The application will have its own page table, implemented using the guest PT hardware support for virtualization, while the kernel continues to use the separate host PTs. They leverage VT-x, which gives them access to four types of privileged CPU features: privilege modes, virtual memory, exceptions and segmentation. Note that this enables user processes to receive e.g. exception notifications and traps in a very efficient way (delivered by hardware). The kernel runs in host mode (VMX root mode on Intel; need access to the VT-x instructions), while processes run in guest mode (which does give access to privileged instructions as it would for VMs). 2.5k LOC kernel module for Dune, which manages the virtualization hardware and provides a process abstraction and forwards syscalls/page faults etc. from guest mode to host mode kernel. Processes link libDune, which is 10k LOC, but untrusted code. It's essentially a utility library that helps leveraging the privileged instructions.</p>
<p>Let's look into the kernel module. First, as part of the process abstraction, it needs to provide memory management. Address translation works as guest virtual -&gt; guest physical and then uses the EPT to ensure safety of the guest physical memory (restricting access; fairly standard virtualization stuff). Unlike virtualization, they configure the EPT to reflect the entire process address space. For syscalls, the Dune processes will trap back into themselves (using libDune's syscall handler), and then invoke a VMCALL to the kernel system call code. This means that we can run untrusted code inside the process, and have it use syscalls to interact with the outside world (and leverage things like supervisor mode PT bits to protect parts of the process's memory from the untrusted code). Signal handling injects interrupts into the process, effecting a switch to ring 0 for delivery.</p>
<p>There were many challenges when implementing this -- reducing overheads, dealing with insufficient EPT space, reconciling POSIX process semantics and VM semantics, etc. -- details in paper. Overhead is caused by VMX transitions and EPT translations -- for example, a getpid syscall now takes 895 cycles instead of 138. Page faults are also twice as expensive. However, this is partly cancelled out by the opportunities for optimization in application code that can now use privileged features: ptrace from user mode is now ~27x faster than on normal Linux (due to fewer context switches). Exception/trap delivery takes 587 cycles instead of ~2.8k, and virtual memory manipulation is ~7x faster. Macro-eval using three example applications: app sandbox, GC and privilege separation system (multiple protection domains within a single process). Sandbox overheads are low on SPEC CPU2000 (&lt;10%), except for outliers due to TLB misses, which can be fixed by adding large page backing of large memory allocations. This actually transforms the slowdown into a speedup! For lighttpd, the slowdown is around 2% in connections/second; this is far better than e.g. running inside VMware Player, where two network stacks are used, whereas Dune just uses a single kernel network stack. 40% improvement in GC performance, 3x faster privilege separation (as compared to using multiple Linux processes).</p>
<p><strong>Q:</strong> Thought about making IO devices that support virtualization and exposing them through Dune?<br />
<strong>A:</strong> Yes, not done for this paper, but it could be done. Could safely expose network devices, storage controllers etc. to user programs, and even partition them safely between different Dune processes.</p>
<p><strong>Performance Isolation and Fairness for Multi-Tenant Cloud Storage</strong><br />
<em>David Shue and Michael J. Freedman, Princeton University; Anees Shaikh, IBM T.J. Watson Research Center</em></p>
<p>Predictable performance in cloud storage is hard. Co-located tenants (sharing the same K-V store in this thought experiment) lead to resource contention, and heterogeneity in tenant workloads can stress different parts of the system. Their system, Pisces, basically provides weighted fair sharing of a distributed K-V store. Related work is Amazon's DynamoDB, which assumes per-tenant provisioned rates, uniform object popularity and a single uniform resource request format. Pisces makes none of these assumptions, and yet provides max-min-fairness (and is work-conserving). As their guarantees are system-wide, they need a mechanism to translate global weights into local (per-machine) weights.</p>
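<p>To make the fairness goal concrete, here is a minimal Python sketch of weighted max-min fair sharing on a single resource, using progressive filling. It is not Pisces's mechanism (which is distributed across storage nodes); the function name, tenant names and numbers are assumptions made up for illustration.</p>

```python
def weighted_max_min(capacity, demands, weights):
    """Weighted max-min fair split of one resource (illustrative only).

    Tenants whose demand fits inside their weighted share are fully served;
    the capacity they leave unused is re-split among the remaining tenants.
    """
    alloc = {}
    active = set(demands)
    remaining = float(capacity)
    while active:
        total_weight = sum(weights[t] for t in active)
        # Weighted share of the remaining capacity for each active tenant.
        share = {t: remaining * weights[t] / total_weight for t in active}
        satisfied = {t for t in active if demands[t] <= share[t]}
        if not satisfied:
            # Everyone wants more than their share: hand out the shares.
            alloc.update(share)
            break
        for t in satisfied:
            alloc[t] = float(demands[t])
            remaining -= demands[t]
        active -= satisfied
    return alloc
```

<p>With a capacity of 100, equal weights and demands of 10, 60 and 80, the small tenant gets its 10 and the other two split the remaining 90 evenly; no capacity is left idle while a tenant still has unmet demand.</p>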
<p>Rather than randomly placing data partitions (of the key space) on different machines, they place them according to fairness constraints and bin-pack them onto machines (to avoid overloading any machine; how do they know this in advance? or is this reactive?). Allocating equal local weights is not ideal -- instead, they make reciprocal swaps between nodes so that global weights still match, but local weights may differ. Replica routing is also integrated with the weighting system, aiming to saturate allocations. Since resources are multi-dimensional, they employ dominant resource fair queuing/sharing. [I missed some detail here]</p>
<p>Evaluation: does Pisces provide the fairness it's designed to provide? Even system-wide fairness, weighted system-wide fairness, local dominant resource fairness? Experiment: 8 tenants, 8 clients, 8 storage nodes, membase. Fairness in the unmodified system is pretty poor; with Pisces, almost everyone gets their fair share almost all the time. If the tenant request rates are unbalanced, fair queuing alone does not reach fairness, but the combination of all of their features (+partition placement, weight allocation, replica selection) does. The overhead added by Pisces is small -- &lt;5% for 1kB requests, but ~19% for 10B requests (CPU-bound). For different tenant weights, fairness is mostly good, although low-weight tenants get somewhat poor fairness. Fixing this is WIP, workaround is to set their weights higher (?). Local dominant resource fairness is also provided; finally, Pisces adapts well to different demand shapes (constant, diurnal, bursty).</p>
<p>[missed two questions here]</p>
<p><strong>Q (someone from Google):</strong> How do you deal with caches on the server; do you partition them?<br />
<strong>A:</strong> Don't explicitly deal with this kind of resource.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2012/10/10/live-blog-from-osdi-2012-day-3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Live blog from OSDI 2012 &#8212; Day 2</title>
		<link>http://www.syslog.cl.cam.ac.uk/2012/10/09/live-blog-from-osdi-2012-day-2/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2012/10/09/live-blog-from-osdi-2012-day-2/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 17:00:32 +0000</pubDate>
		<dc:creator>Malte Schwarzkopf</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1047</guid>
		<description><![CDATA[Here we are, reporting back from OSDI 2012 in Hollywood today. Today's live-blog coverage continues below the fold. Note that some of the coverage is a little spotty due to our blog machine being overwhelmed by the number of requests. Session 4: Distributed Systems and Networking Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE Zhenyu [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft" title="OSDI logo" src="https://www.usenix.org/sites/default/files/osdi12_going.png" alt="" width="162" height="67" />Here we are, reporting back from OSDI 2012 in Hollywood today.</p>
<p>Today's live-blog coverage continues below the fold. Note that some of the coverage is a little spotty due to our blog machine being overwhelmed by the number of requests.</p>
<p><span id="more-1047"></span></p>
<h1>Session 4: Distributed Systems and Networking</h1>
<p><strong>Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE<br />
</strong><em>Zhenyu Guo, Microsoft Research Asia; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Rishan Chen, Microsoft Research Asia and Peking University; Jiaxing Zhang, Hucheng Zhou, and Sean McDirmid, Microsoft Research Asia; Chang Liu, Microsoft Research Asia and Shanghai Jiao Tong University; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia</em></p>
<p><em> </em>This looks at typical MR-like computations, with multiple phases and network-intensive data motion in between procedural compute phases. However, the computation phases often discard a lot of data; if this is only done in the reduce phase, we end up moving a lot of useless data. Similarly, things can be moved down to a later phase if they cause inflation in intermediate data. The challenge for such optimizations is obviously to maintain correctness. Current systems like SCOPE, DryadLINQ, Pig Latin and Hive compile bits of code into job phase binaries, after running a query optimizer over them. When starting with procedural code, it is hard to make whole-program (job) optimizations. One option is to map bits of the procedural code to relational constructs and then use a query optimizer, but this is hard in the general case.</p>
<p>PeriSCOPE builds an inter-procedural flow graph (across the shuffle phase), adds safety constraints and then applies optimizations. These include reducing the number of columns (removing unneeded ones), reducing the number of rows, and reducing the size of each row. This is done by permuting the data-flow graph (the added safety information adds dependencies to prevent unsafe optimization). Row size reduction is done by labeling dependencies with type and field name and size information (I think?) and then doing graph cuts (cut such that no safety-critical dependency is broken).</p>
<p>A simple coverage study on a trace of 28k jobs in 2010/2011 shows that their optimizations can affect up to 22% of jobs, with column reduction being the most important one (~14%). Evaluation on a set of eight jobs shows that they can massively reduce the I/O volume (often by more than 50%), and also reduce job runtime (latency) in many cases. The effectiveness of the individual optimizations is highly job-dependent, though -- sometimes a particular optimization (early filtering) has almost no effect, sometimes it reduces I/O by 99%.<br />
The design of PeriSCOPE is such that it should be generally applicable. However, the programming model has a major influence on the options and effectiveness of optimizations, and this can also mean a trade-off against ease-of-use. Ideally, a more informative interface than the MapReduce-style computations considered in this work would be used.</p>
<p><strong>Q:</strong> SCOPE also does query plan optimization. How does that interact with PeriSCOPE?<br />
<strong>A:</strong> PeriSCOPE takes the output of the query optimization phase as input, so they are independent. In future work, may share information between the two.</p>
<p><strong>Q:</strong> These optimizations are all based on static analysis. Could you do better if you combined this with dynamic profiling, and what would be better?<br />
<strong>A:</strong> Yes! For example, do not know the size of stream variables at compile time.</p>
<p>&nbsp;</p>
<p><strong>MegaPipe: A New Programming Interface for Scalable Network I/O</strong><br />
<em>Sangjin Han and Scott Marshall, University of California, Berkeley; Byung-Gon Chun, Yahoo! Research; Sylvia Ratnasamy, University of California, Berkeley</em></p>
<p><em> </em>MegaPipe is a new network programming API for message-oriented workloads, avoiding many of the shortfalls of the BSD sockets API. Let's consider two types of workload: (1) one-directional bulk transfer: half a CPU can easily saturate a 10G link, (2) message-oriented: smaller messages, bi-directional, higher CPU load. The second type does not play well with the BSD socket API: we need to make a system call for every I/O operation, and everyone shares a listening socket. Finally, the socket API is based on file descriptors, which means it inherits some of the overhead from that abstraction. Motivating experiment: RPC-like test on an 8-core server with epoll, performing simple hand-shake transactions, with 768 clients. As they scale up the message size, throughput increases and CPU utilization drops. However, tiny messages are pessimal, as the throughput barely exceeds 500 MBit/s, while using almost 100% CPU. When they scaled the number of transactions per connection, they found that throughput is 19x lower with a single TX per connection as opposed to 128 TX/connection. Does exploiting multiple cores help? No, not really: diminishing returns from adding more cores.</p>
<p>MegaPipe addresses these issues. Its design goals are three-fold: concurrency is a first-class citizen, various I/O types share the same unified interface (network connections, disk I/O, pipes, signals), low overhead and scalability to many cores. The talk focuses on the third goal. Previous work shows that limiting factors are in syscall overhead (per-core performance), VFS overhead and shared resources (multi-core scalability). Key primitives of MegaPipe are a "handle" (like an FD, but only valid within a channel), and "channels", which are per-core point-to-point connections. Handles are automatically batched into using a single channel if they communicate with the same remote endpoint. Finally, unlike globally-visible FDs, handles in MegaPipe are lightweight sockets. Note that MP has different semantics from the BSD API: MP requires the programmer to explicitly "dispatch" aggregated I/O requests on a channel, while the asynchronous BSD API requires setting up wait primitives and then making requests (without any aggregation). MP can also batch together multiple requests of different types and pass them down to the MP kernel module together. The multi-core listen/accept optimizations look a lot like the stuff MIT people presented at EuroSys (per-core accept queues). Lightweight sockets are ephemeral and only converted into an FD when necessary; MP handles are based on such lwsockets.</p>
<p>Evaluation: some micro-benchmarks, and adapted two popular applications (memcached and nginx). On the same micro-benchmark as discussed in the motivation, MP manages to improve throughput by up to 100% for small (&lt; 1KB) messages. MP, unlike BSD sockets, scales linearly and independently of connection length, to multiple cores. In the macro-benchmarks, memcached has limited scalability from the outset, as there is a global lock on the object store, while nginx is already designed as a scalable shared-nothing application. Accordingly, MP only benefits out-of-the-box memcached for short connections with few requests. Moving memcached to fine-grained locking, however, leads to a major improvement, with MP adding 15% extra throughput over BSD sockets. With nginx, they see similar results: 75% increase in throughput when using MP.</p>
<p><strong>Q (someone from Stanford):</strong> People who complained about sockets overhead in the past gave up on it and bypass the API. Are your lwsockets good enough to help these people?<br />
<strong>A:</strong> lwsockets are an opportunistic optimization, avoiding most of the overhead most of the time, but still giving the full sockets API when needed.<br />
<strong>Q:</strong> What happens if the user process having a MP handle/channel forks?<br />
<strong>A:</strong> Not duplicated when forking.</p>
<p><strong>Q:</strong> Batching affects short messages. This will delay them, and that may be an issue with delay-sensitive systems (commonly using small messages). Do you somehow allow users to control the batching?<br />
<strong>A:</strong> Network cards already have deep queues, so latency is already there. Have some results that show MP latency on memcached is actually same or lower than baseline.</p>
<p><strong>Q:</strong> How do you schedule requests to per-core handler threads, e.g. accepts?<br />
<strong>A:</strong> Just normal user-level application, no special OS scheduling.</p>
<p><strong>DJoin: Differentially Private Join Queries over Distributed Databases</strong><br />
<em>Arjun Narayan and Andreas Haeberlen, University of Pennsylvania</em></p>
<p>[to be added]</p>
<p>&nbsp;</p>
<h1>Session 5: Security</h1>
<p><strong>Improving Integer Security for Systems with KINT<br />
</strong><em>Xi Wang and Haogang Chen, MIT CSAIL; Zhihao Jia, Tsinghua University IIIS; Nickolai Zeldovich and M. Frans Kaashoek, MIT CSAIL</em></p>
<p><em> </em>Integer overflows can have disastrous consequences, such as buffer overflows or other logical bugs (e.g. tricking the OOM killer into killing innocent processes). Indeed, integer errors are the #2 OS vendor advisory topic (according to CVE). Some options to avoid integer overflow: use arbitrary precision integers (performance not good enough), trap on every overflow (also bad performance, and some code relies on overflows).</p>
<p>Found 114 bugs in the Linux kernel, 9 of which were independently found by others. Two thirds of these led to logical errors or buffer overflows, so were quite serious. More importantly, two thirds of them also had checks which were incorrect, and multiple fix attempts made in response to the bug reports were incorrect, too!</p>
<p>KINT uses LLVM IR, and combines results from per-function analysis, range analysis and taint analysis into a list of potential bugs. In the function analysis, they simply infer constraints from the code (control flow paths and overflow conditions) and then use a constraint solver to find if any integer value will satisfy the constraints that lead to an overflow. Taint analysis is optional, as it relies on user annotations. For example, a programmer may annotate code processing untrusted user input, and the taint analysis will then propagate this uncertainty and highlight potential overflows that result (and which are not protected against).<br />
Evaluation is in terms of effectiveness, and false positives/negatives. In addition to the 114 bugs in the Linux kernel, they found five bugs in OpenSSH and one in lighttpd. To work out false negatives, they looked at 37 known integer overflow bugs from recent years. KINT found 36 of them. To look at false positives, they look at the patched code for these 37 bugs. KINT reports one false positive, and found two incorrect fixes! Run on the whole kernel, KINT finds ~125k potential bugs, of which 724 are classified as "critical". Running only takes a few hours, even for a large code base like the kernel. They did not have the resources to inspect all of the potential bugs, but skimming found a few hundred.</p>
<p>One contribution as a result of this work is kmalloc_array(), which is a helper function to avoid the frequently-used, dangerous malloc(n * size) paradigm in the kernel. As a generalized approach, they propose a "NaN" special integer value, for which they have added support to Clang ("nan" keyword and "is_nan" call). Overflows will result in such a special NaN value, which can be contained. The advantage of this is that bounds-checking code can largely be elided, as the checks are automatic.</p>
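<p>The danger of the malloc(n * size) pattern can be sketched in a few lines of Python that emulate fixed-width unsigned arithmetic. This is an illustration of the check-before-multiply idea behind a helper like kmalloc_array(), not the kernel's code; the 32-bit width and both function names are assumptions for the example.</p>

```python
SIZE_MAX = 2**32 - 1  # assumption: a 32-bit size_t, chosen for illustration


def wrapped_mul(n, size):
    """What C computes for n * size on unsigned 32-bit values (mod 2**32)."""
    return (n * size) % (SIZE_MAX + 1)


def checked_array_size(n, size):
    """Return n * size, or None if the product does not fit in size_t.

    The test divides instead of multiplying, so the check itself
    cannot overflow.
    """
    if size != 0 and n > SIZE_MAX // size:
        return None
    return n * size
```

<p>With n = 2**30 and size = 8, wrapped_mul() yields 0, so an unchecked allocation would be far smaller than the caller expects; checked_array_size() returns None for the same inputs.</p>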
<p><strong>Q (someone from Harvey Mudd):</strong> Have you checked KINT's source code using KINT?<br />
<strong>A:</strong> Nope, it's C++, and KINT only supports C.</p>
<p><strong>Q (someone from UCSD):</strong> How many annotations did you use when checking the Linux kernel?<br />
<strong>A:</strong> Details in paper, about 40 for untrusted input and 20 for annotation sizes.</p>
<p><strong>Q (someone from NICTA):</strong> You are changing the C semantics with KINT. Why not simply change the semantics of addition and multiplication?<br />
<strong>A:</strong> Some programs, e.g. crypto code, rely on overflow and modulo semantics. KINT will report many false positives for this kind of code.</p>
<p><strong>Q (someone from UCSD):</strong> What do the annotations for untrusted user input look like, what extra code is required?<br />
<strong>A:</strong> Specify untrusted function parameters, a little more difficult with macros.</p>
<p><strong>Q (someone from UCSD):</strong> Do you also cover signed/unsigned mismatches? Infrastructure seems to work for this.<br />
<strong>A:</strong> Yes.</p>
<p><strong>Q:</strong> Nasty code exists. What happens if your solver is faced with something that it cannot solve, or which would take very long? Will you err towards false positives or false negatives?<br />
<strong>A:</strong> Details in paper; solver has issues with divisions. Implemented a bunch of re-write rules that do not change semantics, but make the solver's work easier.</p>
<p>&nbsp;</p>
<p><strong>Dissent in Numbers: Making Strong Anonymity Scale</strong><br />
<em>David Isaac Wolinsky, Henry Corrigan-Gibbs, and Bryan Ford, Yale University; Aaron Johnson, U.S. Naval Research Laboratory</em></p>
<p><em> </em>This work allows dissemination of information without fear of reprisal from authorities or peers. The core challenge is the trade-off between anonymity and scale (resistance to timing analysis, thousands of participants and churn tolerance). Existing work on weak anonymity (e.g. Tor) scales well, but is not resistant to timing analysis: if someone can measure the timing of messages going into Tor, and coming out again, they can statistically de-anonymize the originator. Another alternative is DC-nets, which provide strong anonymity but do not scale to large numbers of participants (since everyone is talking to everyone else). Dissent (their work) uses a mix-net topology with the anonymity semantics of DC-nets. Imagine M servers with N clients, where N &gt;&gt; M. Servers have N shared secrets, which they distribute to clients, which each have M secrets that they can combine with the server secrets (this reduces computational complexity for generating and distributing the secrets). For a practical example case, the number of messages to generate and distribute goes from ~10k to around 215 in Dissent. Then, the servers collaborate by generating the ciphertexts from their connected users' ciphertexts, exchanging them, and then performing XOR on the M ciphertexts. However, DC-nets are not churn tolerant, since all participants' ciphertexts are needed to decrypt the message (by XOR'ing). In the system proposed here, since there are servers involved, they can simply time out participants that drop out before computing the ciphertext.<br />
There is also a fairly complicated routine that deals with disruptors in the system, and can identify them. They can maintain anonymity as long as at least one honest server exists, and the clients need not know which of the servers is the honest one.</p>
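<p>The XOR cancellation at the heart of a client/server DC-net can be sketched as follows. This is a toy, single-process model written for this post, not Dissent's protocol: there is no network, no shuffle, no disruptor handling, and the names (run_round, MSG_LEN) are made up. Each client/server pair shares one random pad; every pad is XORed in exactly twice, so only the sender's message survives.</p>

```python
import secrets

MSG_LEN = 16  # bytes per round; arbitrary for this sketch


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def combine(parts, start):
    """XOR all byte strings in parts into start."""
    out = start
    for p in parts:
        out = xor_bytes(out, p)
    return out


def run_round(num_clients, num_servers, sender, message):
    """Simulate one round; message must be exactly MSG_LEN bytes."""
    zero = bytes(MSG_LEN)
    # One shared secret (pad) per client/server pair: N * M in total.
    pads = {(c, s): secrets.token_bytes(MSG_LEN)
            for c in range(num_clients) for s in range(num_servers)}
    # Each client XORs its M pads; only the sender also XORs in the message.
    client_cts = [combine([pads[(c, s)] for s in range(num_servers)],
                          message if c == sender else zero)
                  for c in range(num_clients)]
    # Each server XORs the N pads it shares with the clients.
    server_cts = [combine([pads[(c, s)] for c in range(num_clients)], zero)
                  for s in range(num_servers)]
    # Every pad now appears twice and cancels out, leaving the message.
    return combine(client_cts + server_cts, zero)
```

<p>An observer who sees a single client ciphertext learns nothing about who sent the message, since each one looks uniformly random unless all of that client's pads are known.</p>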
<p>Evaluation: unlike previous systems, which usually scaled to around 40 clients, they can scale to 5,000 clients, while not exceeding a message latency of 10 seconds. In a trace-based evaluation using a Twitter trace, Dissent can keep up with the dissemination rate of real-world Twitter, while other systems do not. For churn resistance evaluation, they used PlanetLab, and found that there are some decent heuristics they can use to time out dropped-off participants. Disruption detection is bottle-necked on key shuffle and blame shuffle, taking on the order of hours, so there is room for improvement there.</p>
<p><strong>Q:</strong> Are O(thousands) of people really a large-enough anonymity set? It is very clear that someone is a member of Dissent.<br />
<strong>A:</strong> No ground truth on this, really, but thousands of people are certainly harder to reprimand.</p>
<p><strong>Q:</strong> Can you compose what you have done with another mechanism that allows users to hide their participation?<br />
<strong>A:</strong> For example, Tor is making progress on masking traffic as innocent web traffic. Could use that kind of thing, but that would still degenerate into an arms race.</p>
<p>&nbsp;</p>
<p><strong>Efficient Patch-based Auditing for Web Application Vulnerabilities</strong><br />
<em>Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich, MIT CSAIL</em></p>
<p><em> </em>This is an auditing system. Consider the example of Github authentication using public keys. There was a vulnerability that led to an attacker being able to modify people's public keys. As a response, Github asked users to audit their own public keys. It would have been better if Github had itself been able to find out which keys had been attacked, but the scale of Github logs is too large to make that practical. In the particular example, the vulnerability was a result of using a user ID provided as part of the request, rather than the current user's ID.<br />
Their auditing system is based on replaying historic requests, running the code once with and once without the patch fixing a vulnerability applied, and then watching for different results. This is a known methodology, but their contribution is that they can do this much faster, replaying a month of traffic in a few hours. During normal execution, they record initial, non-deterministic and external request input in an audit log. The naive auditing approach then involves replaying this and comparing the results. Optimization opportunities: patches may not affect every request, the two replay instances will share a lot of code, and requests are similar. They address all three of these points. For the first one, they track control flow and identify the basic blocks that diverge as a result of patch application. During normal execution, they record the control flow trace (CFT) for each request. At replay time, they will use the CFTs and information about which basic blocks were affected by the patch to filter out requests not affected at all. For the second point (shared code between instances), they use function-level auditing. The two instances (patched and unpatched) start running as a single instance, and fork immediately before calling the patched function. While said function is running, any side-effects must be intercepted (global variables, output, database queries). If the side effects are the same, there was no exploitation, so skip this request. Finally, they memoize a lot of execution detail, meaning that similar requests (identified as same control flow, modulo different input) will be able to re-use the bits of the CFT that are independent of the patch and template variables affected by the input.</p>
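<p>The first optimization (skipping requests the patch cannot affect) boils down to a set intersection. A minimal sketch, assuming each recorded control flow trace is just a list of basic-block identifiers; the function name and the data layout are hypothetical, not taken from the system described in the talk.</p>

```python
def requests_to_replay(cfts, patched_blocks):
    """Select the requests that need to be replayed for an audit.

    cfts maps a request id to the basic blocks it executed during normal
    operation. A request whose trace never touches a block changed by the
    patch behaves identically with and without it, so it is skipped.
    """
    patched = set(patched_blocks)
    return [req for req, trace in cfts.items() if patched.intersection(trace)]
```

<p>For example, with traces {"r1": ["a", "b"], "r2": ["a", "c", "d"], "r3": ["e"]} and a patch touching block "c", only "r2" needs to be replayed.</p>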
<p>Their system is called POIROT, and based on a modified PHP runtime. It does not require any changes to application code. In evaluation, POIROT successfully detected five different types of attacks on MediaWiki in real-world Wikipedia traces, and various information leak vulnerabilities in HotCRP using synthetic traces. For examples of real CVE vulnerabilities on MediaWiki, the naive replay strategy for 100k Wikipedia requests (~= 3.4h) would have taken on the order of several hours, but is down to minutes with POIROT (12-51x faster than original execution). Their use of templates helps to cut the amount of code to run very significantly. For the logging in normal operation, POIROT adds ~5KB of logging data per request, and ~15% overhead in latency and throughput.</p>
<p><strong>Q (someone from Princeton):</strong> You seem to record a lot of information for each request. How much?<br />
<strong>A:</strong> See results in eval; 5KB are all input required for a Wikipedia request, including cookie.</p>
<p><strong>Q (someone from NICTA):</strong> How does your overhead compare to existing work? How could you reduce it?<br />
<strong>A:</strong> Record a lot of non-deterministic input (e.g. random numbers generated), which may not be necessary for replay, as attacker often cannot exploit it.</p>
<p>&nbsp;</p>
<h1>Session 6: Potpourri</h1>
<p><em>[N.B.: I was dealing with fixing syslog during this session, since the interest in this live-blog essentially DDoS'ed it; hence the coverage is a little sparse.]</em></p>
<p><strong>Experiences from a Decade of TinyOS Development</strong><br />
<em>Philip Levis, Stanford University</em></p>
<p><em> </em>TinyOS started in 1999, as an OS for very small embedded micro-controllers. This talk is about design principles from embedded software, technical results found during the project, and things they should have done differently. First lesson: minimize resource usage. Micro-controllers have very few resources: single- or double-digit numbers of kilobytes of RAM and ROM. Why not use low-power embedded ARM processors? Battery lifetime! With embedded micro-controllers, the system can run for years off a battery, while with the lowest-power ARMs, it is a matter of days. Debugging this stuff is very, very hard, especially in the wild, since it is not possible to simply attach a debugger to these tiny systems. A technique that helps is static virtualization. This is basically about compiling application and OS together, and merging at build time, making as much as possible static (even memory allocation and function calls). This enables whole-program analysis and optimization, as well as dead code elimination.<br />
A non-technical lesson is that the researchers focused a lot on making increasingly complex applications possible, but at the same time, made it more difficult to implement simple, basic applications. The result is an "island syndrome": an increased barrier to entry.</p>
<p><strong>Q (someone from Harvard):</strong> How difficult would it be to re-architect the interface to the form that you would prefer, and that might make it more accessible for novice users?<br />
<strong>A:</strong> Probably not that hard, but at the same time, Contiki has filled that role. It would definitely be possible, though.</p>
<p><strong>Q (someone from NICTA):</strong> Your analysis seems to make this assumption that having lots of users is a good thing. Is that really the right metric?<br />
<strong>A:</strong> I probably would go the same way and focus on research if I could go back. The point was more that we never thought about it, and I wanted to make people aware of it.</p>
<p><strong>Q (Eric Sedlar, Oracle):</strong> Any insights on how you could publish something that says "this is more usable"?<br />
<strong>A:</strong> Tricky, since the typical HCI/usability venues have much, much higher bars to what counts as "usable". There is some scope for publishing easier-to-use programming models, though.</p>
<p><strong>Q:</strong> Attribute high learning curve mainly to the lack of tools on top of it. Maybe this would be different if there were better tools, like e.g. in the Java ecosystem?<br />
<strong>A:</strong> Indeed, tools might help, but then again, TinyOS makes the fundamental assumption that a "novice" knows C. Also, developing tools is beyond the scope of a research community and does not gain research credit.</p>
<p><strong>Automated Concurrency-Bug Fixing</strong><br />
<em>Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu, University of Wisconsin—Madison</em></p>
<p><em> </em>Bugs are important, and it would be nice if we could fix them automatically. But this is hard in the general case, as we need ground truth on what the correct behaviour is, and what counts as incorrect behaviour. If we restrict ourselves to the class of concurrency bugs, though, this becomes a little more tractable, since the fix often "just" amounts to inserting the correct synchronization primitives into the program. Their system, CFix, fixes concurrency bugs in six steps. First, they feed bug reports, buggy binary and input data in, then develop a fix strategy, determining if this is an atomicity or an ordering problem. The remaining steps are synchronization enforcement, patch testing and selection, patch merging, and run-time support. Two major contributions: OFix, a new technique that enforces order relationships, and a framework that ties together various existing tools for analysis and bug fixing. They leverage a bunch of existing bug detectors, and develop a set of template "fix strategies", which largely seem to be different ways of interleaving thread executions in order to avoid bugs (at least for the atomicity violation class of bugs). For order enforcement, OFix provides "allA-B" and "firstA-B" strategies, which seem to be about detecting when all work of type "A" is completed, then synthesizing signals to other threads, and having all threads running work of type "B" wait for these signals. This is somewhat more tricky when threads can spawn child threads that also run work of type "A", but they have some counter-based strategy that keeps track. Of course, OFix could introduce deadlock by synthesizing waits. As far as possible, they try to predict this and give up if it happens, or use timed waits to avoid it. Some cleverness exists to avoid signals if they are unnecessary (because e.g. B does not ever execute on this control flow path). The "firstA-B" strategy is similar, but signals after the first execution of work of type A.
At later stages, they prune incorrect patches (fix strategies that do not fix the root cause, e.g. because they make the bug occur deterministically, rather than non-deterministically), and perform some optimizations.</p>
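<p>To make the "allA-B" idea concrete, here is a minimal sketch of how I understood it, not CFix's actual generated code: a counter tracks outstanding A-work, the last A-worker signals, and B-work uses a timed wait so that a missing signal cannot deadlock it. The class <code>AllABOrder</code> and its method names are made up for illustration.</p>

```python
import threading


class AllABOrder:
    """Hypothetical helper: B-work waits until every A-worker has finished."""

    def __init__(self, a_workers):
        self._cond = threading.Condition()
        self._remaining = a_workers  # counter of A-workers still running

    def spawn_a(self):
        # A thread that spawns a child also running A-work bumps the counter first.
        with self._cond:
            self._remaining += 1

    def a_done(self):
        with self._cond:
            self._remaining -= 1
            if self._remaining == 0:
                self._cond.notify_all()  # signal every waiting B-thread

    def wait_before_b(self, timeout=1.0):
        # Timed wait: give up rather than deadlock if the signal never arrives.
        with self._cond:
            return self._cond.wait_for(lambda: self._remaining == 0, timeout)


order = AllABOrder(a_workers=2)
log = []


def do_a(name):
    log.append("A:" + name)
    order.a_done()


def do_b():
    if order.wait_before_b():
        log.append("B")


threads = [threading.Thread(target=do_b)]
threads += [threading.Thread(target=do_a, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

<p>Whatever order the threads are scheduled in, "B" is only logged after both A entries. A "firstA-B" variant would signal after the first <code>a_done()</code> instead of waiting for the counter to reach zero.</p>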
<p>In evaluation, they find that using a combination of four different bug detectors, they can find a large number of known bugs (standard set used for evaluating bug detectors -- not really surprising that they find them!).</p>
<p><strong>Q (Florentina Popovici, Google):</strong> [missed this]<br />
<strong>A:</strong> CFix can work with any bug detector.</p>
<p><strong>Q (someone from MSR):</strong> You need to synthesize condition variables in order to perform your wait. Do you add them statically or dynamically?<br />
<strong>A:</strong> There is one mutex/condition variable per bug report, and they are allocated statically.</p>
<p>&nbsp;</p>
<p><strong>All about Eve: Execute-Verify Replication for Multi-Core Servers</strong><br />
<em>Manos Kapritsos and Yang Wang, University of Texas at Austin; Vivien Quema, Grenoble INP; Allen Clement, MPI-SWS; Lorenzo Alvisi and Mike Dahlin, University of Texas at Austin</em></p>
<p>[I sadly missed most of this talk due to working on fixing the blog. It seems that this is essentially about running parallel state machines (for dependability/fault tolerance) on multiple cores with clever synchronization. Key insight: as long as the result of a non-deterministic execution order is the same, we do not care. A lot of  the talk was spent talking about how divergence can be detected at low overhead; as a result, independent transactions can be executed in parallel. If divergence is detected, Eve rolls back and executes serially as a fallback. This has the benefit of "masking" concurrency bugs by replacing them with serial execution. In evaluation, they find that Eve is 6-7x faster than traditional state-machine replication, and only a little slower than an unreplicated parallel execution. A higher fraction of false positive conflict (divergence) events leads to performance asymptotically approaching traditional state-machine replication.]</p>
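<p>As a rough illustration of the execute-then-verify idea (my own toy model, not Eve's implementation): each replica runs a batch in whatever order it likes, replicas compare a digest of the resulting state, and on divergence the speculative results are thrown away and the batch is re-run serially. All function names here are hypothetical.</p>

```python
import copy
import hashlib


def state_digest(state):
    # Short token summarising a replica's state after a batch.
    return hashlib.sha256(repr(sorted(state.items())).encode()).hexdigest()


def run(state, batch, order):
    for i in order:
        key, fn = batch[i]
        state[key] = fn(state.get(key, 0))
    return state


def execute_verify(state, batch, replica_orders):
    """Each replica applies the batch in its own (non-deterministic) order."""
    results = [run(copy.deepcopy(state), batch, o) for o in replica_orders]
    if len({state_digest(r) for r in results}) == 1:
        return results[0], "committed"
    # Divergence: discard speculative results, re-execute serially in batch order.
    serial = run(copy.deepcopy(state), batch, range(len(batch)))
    return serial, "rolled back"


independent = [("x", lambda v: v + 1), ("y", lambda v: v + 2)]
conflicting = [("x", lambda v: v + 1), ("x", lambda v: v * 2)]
orders = [[0, 1], [1, 0]]

ok_state, ok_status = execute_verify({"x": 1, "y": 1}, independent, orders)
bad_state, bad_status = execute_verify({"x": 1}, conflicting, orders)
```

<p>The independent batch commits regardless of order, while the conflicting one diverges (4 vs. 3) and falls back to serial execution.</p>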
<p>&nbsp;</p>
<h1>Session 7: Replication</h1>
<p><strong>Spanner: Google’s Globally-Distributed Database (Best Paper Award)</strong><br />
<em>James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford, Google, Inc.</em></p>
<p>Spanner is a project that has been going on for 4-5 years at Google, and which now runs in production, serving Google's ad database. It's a key-value store and a full database, with general purpose transactions, SQL-like query language, etc., but also fully geo-replicated across data-centres on different continents. Data is also replicated and sharded in various ways. One of the key features is the ability to run lock-free distributed read transactions at global scale. Necessary for this is the property of global external consistency of distributed transactions, and Spanner is the first system to support this. A major enabling technology for this property is the TrueTime API, which provides tight global clock synchronization.</p>
<p>At the simplest possible granularity, we would like to generate output from a consistent snapshot of the database. However, if it is sharded, this becomes non-trivial, as we need to take the snapshots at the exact same time. As a consequence, data should be versioned using timestamps. For a fully consistent snapshot, we want not just serializability, but also external consistency, i.e. agree on global commit order. This requires a notion of global wall-clock time if we use timestamps as our ordering primitive. This can be achieved using strict two-phase locking (i.e. all locks must be acquired before any writes happen).</p>
<p>The TrueTime API provides a notion of global wall clock time with explicit uncertainty. A timestamp becomes an interval: the actual wall clock time must be within this uncertainty interval (this is a bit like the notion of interval arithmetic). For transactions with 2PL, we now set our commit timestamp to TT.latest (the latest possible timestamp) and cannot release the locks before TT.latest has passed. This wait time is "waited out" by a logical spin called "commit wait". If distributed consensus between multiple participants in a transaction must be achieved, the overall timestamp chosen is the maximum of the participants' decided commit timestamps. There are a lot more details in the paper about different read modes and atomic schema changes.</p>
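<p>A minimal sketch of commit wait as described in the talk. <code>tt_now()</code> is a hypothetical stand-in for TrueTime built on the local clock, and the 5 ms uncertainty is an arbitrary value picked for the example.</p>

```python
import time
from dataclasses import dataclass

EPSILON = 0.005  # assumed clock uncertainty in seconds (illustrative value)


@dataclass
class TTInterval:
    earliest: float
    latest: float


def tt_now():
    # Hypothetical stand-in for TrueTime: true time lies in [earliest, latest].
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)


def commit(apply_writes, release_locks):
    commit_ts = tt_now().latest  # latest possible "now"
    apply_writes(commit_ts)
    # Commit wait: hold the locks until commit_ts is definitely in the past.
    while commit_ts >= tt_now().earliest:
        time.sleep(EPSILON / 10)
    release_locks()
    return commit_ts


events = []
ts = commit(lambda t: events.append("write"), lambda: events.append("release"))
```

<p>With these numbers the wait is roughly twice the uncertainty, which is why shrinking the uncertainty matters so much for write latency.</p>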
<p>So how does the TrueTime API work? In different data centres, they have GPS receivers and atomic clocks attached to a set of machines. Machines frequently synchronize against several of these, and in the meantime model time uncertainty by assuming linear worst-case clock drift (200µs/sec). Of course, there is some network-induced uncertainty in the time synchronization, but this remains in the low single-digit millisecond ranges, meaning that the worst case uncertainty is ~10ms (4 network induced + 6 linear worst-case). In future work, they hope to get this down to &lt;1 ms.</p>
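<p>The uncertainty arithmetic can be written down directly. This is just a sketch using the figures quoted in the talk; the function name is mine. Note that 6 ms of drift at 200µs/sec corresponds to 30 seconds since the last synchronization, which is my own back-calculation from those figures.</p>

```python
DRIFT_RATE = 200e-6      # worst-case drift quoted in the talk: 200 µs per second
NETWORK_EPSILON = 0.004  # roughly 4 ms of network-induced uncertainty


def epsilon(seconds_since_sync):
    """Uncertainty grows linearly with the time since the last clock sync."""
    return NETWORK_EPSILON + DRIFT_RATE * seconds_since_sync


just_synced = epsilon(0.0)
worst_case = epsilon(30.0)  # 4 ms + 6 ms, i.e. about 10 ms
```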
<p><strong>Q:</strong> Could you comment on the difference between external consistency and strict serializability?<br />
<strong>A:</strong> The two are equivalent.</p>
<p><strong>Q:</strong> Could you, instead of fixing a commit timestamp as at the "safe time", represent commit time as an interval?<br />
<strong>A:</strong> Yes, but our data representation necessitated a single time.</p>
<p><strong>Q:</strong> Is every epsilon based on the global GPS or atomic clock time, or do you have some kind of local reference epsilon?<br />
<strong>A:</strong> Some data centres may have higher values than others; in that case, Spanner will slow down when running in a data centre with higher uncertainty.</p>
<p><strong>Q:</strong> The chance of the clock going rogue is slim, but what is the worst case scenario if this happens?<br />
<strong>A:</strong> This is equivalent to assuming that the computer has stopped working. In that case, we need to eject it from the system, or otherwise timestamps will be chosen incorrectly.</p>
<p><strong>Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary</strong><br />
<em>Cheng Li, Max Planck Institute for Software Systems; Daniel Porto, CITI/Universidade Nova de Lisboa and Max Planck Institute for Software Systems; Allen Clement, Max Planck Institute for Software Systems; Johannes Gehrke, Cornell University; Nuno Preguiça and Rodrigo Rodrigues, CITI/Universidade Nova de Lisboa</em></p>
<p>Why does latency matter? Experiments from ad systems at Bing show that, as latency increases, revenue per user goes down significantly. At the same time, geo-replication is necessary to keep latency down. Replication traditionally necessitates a decision on strong consistency (with limited performance), or eventual consistency (higher performance). This talk is about how one can build a system that has both properties.</p>
<p>In their model, there are some operations that require total ordering (strong consistency), and some that are happy with eventual consistency. They call the combination "RedBlue consistency", which is partially ordered, maintaining strong consistency when necessary, but goes for eventual otherwise. A local site can always accept a "blue" (eventually consistent) operation without any coordination with other sites, while red operations require coordination as they must be serialized. Their "Gemini" coordination system is based on a special token (the "red flag") that must be held in order to execute strongly consistent operations, and can only ever be held by one site in the distributed system (although it can be passed around). Challenge is now to make sure that all sites converge in this scenario. One key insight that allows "blue" operations to be used in systems that otherwise require strong consistency is that operations can often be split into several sub-operations (e.g. deciding on a value and actually applying the change) that need not all be strongly consistent in order to produce the same end result (deciding the value is "red", but applying the change can be "blue"). A blue "shadow operation" must commute with all other operations AND break no invariants; otherwise, it must be a red operation.</p>
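<p>Here is a small sketch of the shadow-operation idea using a bank-balance example of my own choosing (amounts in cents, all names hypothetical). The "decide" step computes the interest amount once; the resulting shadow operation only adds a fixed delta, so it commutes with deposits. The original, unsplit operation does not.</p>

```python
def accrue_interest_generator(balance, rate_percent):
    # Decide the value once, against one site's current state ...
    delta = balance * rate_percent // 100
    # ... and emit a shadow operation that only adds that fixed amount.
    return lambda b: b + delta


def deposit_shadow(amount):
    return lambda b: b + amount


def apply_all(balance, shadow_ops):
    for op in shadow_ops:
        balance = op(balance)
    return balance


start = 10000  # cents
ops = [accrue_interest_generator(start, 5), deposit_shadow(2000)]
site_a = apply_all(start, ops)
site_b = apply_all(start, list(reversed(ops)))

# The original operation multiplies the current balance, so order matters:
original_a = start * 105 // 100 + 2000
original_b = (start + 2000) * 105 // 100
```

<p>Both sites converge on the same balance with the shadow operations, whereas the unsplit version ends up at 12500 or 12600 depending on order. A withdrawal would still have to be red, since two concurrent withdrawals could break a "balance is never negative" invariant even though subtraction commutes.</p>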
<p>Evaluation: three key questions -- (1) How many blue operations can we extract from workloads? (2) Does RedBlue consistency improve user-observed latency? (3) How does throughput scale with an increased number of sites? For (1), they looked at TPC-W, RUBiS and Quoddy (some kind of social networking app). While none of the existing operations in these could be made blue, there were 4-17 extractable shadow operations that could, and these accounted for &gt;90% of workload runtime in all cases. To see the latency improvements, they ran some TPC-W experiments on EC2, and found that latency goes from thousands of ms for a remote site access when using only red consistency to &lt;100 ms at all sites when using RedBlue. Peak throughput was also improved in a multi-site setup, with more sites adding more throughput (due to parallelism in blue operations).</p>
<p><strong>Q:</strong> Can shadow operations be identified automatically?<br />
<strong>A:</strong> Future work; currently doing it manually.</p>
<p><strong>Q:</strong> How much effort to transform existing code into using RedBlue consistency?<br />
<strong>A:</strong> About a week to familiarize with code base.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2012/10/09/live-blog-from-osdi-2012-day-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Blogging OSDI 2012 &#8212; Day 1</title>
		<link>http://www.syslog.cl.cam.ac.uk/2012/10/09/blogging-osdi-2012-day-1/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2012/10/09/blogging-osdi-2012-day-1/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 15:59:54 +0000</pubDate>
		<dc:creator>Malte Schwarzkopf</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Networks]]></category>
		<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1043</guid>
		<description><![CDATA[For the next couple of days, I am attending OSDI in Hollywood. However, due to various scheduling constraints on both sides of the Atlantic, I only arrived there at lunch time on Monday, and missed the first session. Fortunately, in addition to my  delay-tolerant "live blog" from the plane, where I read the first session's [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft" title="OSDI logo" src="https://www.usenix.org/sites/default/files/osdi12_going.png" alt="" width="162" height="67" />For the next couple of days, I am attending OSDI in Hollywood. However, due to various scheduling constraints on both sides of the Atlantic, I only arrived there at lunch time on Monday, and missed the first session. Fortunately, in addition to my  delay-tolerant "live blog" from the plane, where I read the first session's papers, <a href="http://research.microsoft.com/en-us/people/derekmur/">Derek Murray</a> was kind enough to take some notes on the actual talks. Normal live-blogging service of the talks will be provided for the other days! :)</p>
<p><span id="more-1043"></span></p>
<h1>Session 1: Big Data</h1>
<p><strong>Flat Datacenter Storage</strong><br />
<em>Edmund B. Nightingale, Jeremy Elson, and Jinliang Fan, Microsoft Research;  Owen Hofmann, University of Texas at Austin;  Jon Howell and Yutaka Suzue, Microsoft Research</em></p>
<p><strong>Malte's summary:</strong></p>
<p>"Flat storage", a.k.a. a simple network file server, is simple and neat. In conventional data centres, however, we see more complex approaches that try to move the computation to the data, since data motion is expensive as a result of the tree topology of the DC. Many common jobs however inherently require data motion, and MR et al. do not cut it for those. However, we can now build full bisection bandwidth DCs, so the locality constraint can go: in FDS, "all compute nodes can access all storage with equal throughput". It is a clean-slate re-think of DC storage, and the guiding principle is statistical multiplexing of I/O across all disks and network links in the cluster. Data is structured in blobs and "tracts": a blob is a sequence of mutable tracts, which are small units of data (~8MB) named by a 128-bit GUID. Tracts are stored directly via raw block device access (i.e. there is no file system), and all meta-data is kept in memory (!). Simple non-blocking API with atomic writes, but no guarantee on write ordering (i.e. weak consistency in the presence of failures). FDS does not use explicit meta-data; instead, it has a Tract Locator Table (TLT) that deterministically maps GUID + tract ID tuples to tract servers (but not data on disk). It also has no durable state, and its state can be entirely reconstructed on the fly from others. Only interaction with the TLT is on process launch, then long-term caching of tract servers. Per-blob meta-data (e.g. size) is in special "tract -1" and accessed just like data. Blobs also have an atomic "extend" operation (cf. GFS's "append").</p>
<p>Tracts are replicated, and a mechanism similar to RAMCloud is used to recover data on lost disks or machines. Writes are sent to all replicas by the application library (after TLT lookup, if necessary), and only complete when all replicas have acknowledged. Meta-data changes are sent to a "primary" replica, which executes 2PC to update everyone else. Replication level is a per-blob setting. Fault-tolerance is built around the key ingredient of version numbers on TLT rows, which are used to reject stale requests referring to the pre-failure conditions. Story on network partitions is a bit weak -- they rely on only a single meta-data server running, and manually configure it (though looking into Paxos). TLTs are generally O(n^2) for n disks, representing all pairs, plus some random extra replicas. This can get quite big, but they have some optimizations that decrease the size.</p>
<p>Network is using a lot of shiny 10G hardware, with 5.5Tbps bisection bandwidth at cost of $250k. They found it hard to saturate 10G with a single TCP flow, since a single core cannot keep up with the interrupt load. Spreading them, using multiple short flows (design characteristic of FDS) and zero-copy architecture all help with this. They use RTS/CTS notification to avoid incast and receiver-side collisions. Evaluation test cluster is heterogeneous; they say that dynamic work allocation was key to utilizing it efficiently. They find ~1GB/s write and read throughput per client at nearly linear scalability. Random and sequential read and write performance at tract-level is identical; with triple-replication, maximum write performance goes down to ~20GB/s (from ~60GB/s). Max throughput they could reach was ~2GB/s per client (using 20Gbit network connections). Failure recovery is fast: 3.3s for 47 GB on 1,000 disks; 6.2s for 92 GB; 33.7s for compensating a whole machine failure (~655GB). Using FDS, they managed to claim the Daytona and Indy sort records, sorting around 1.4TB in 60s.</p>
<p><strong>Derek's take:</strong></p>
<ul>
<li>"Little-data" is a solved problem, when you have a tightly-coupled setup with some processors and a RAID array directly connected.</li>
<li>Dynamic work allocation is an old idea that has been lost in the move to big data.</li>
<li>FDS = blob storage that does metadata management and physical data transport, and can scale to a whole datacenter.</li>
<li>FDS has no affinity -- that's why it's called flat.</li>
<li>Uses a CLOS network with distributed scheduling.</li>
<li>High read/write performance (2 GB/s, single-replicated from a single process).</li>
<li>Fast failure recovery, and high application performance -- the example used will be disk sorting (also web index serving and stock cointegration).</li>
<li>Unit of data is an 8 MB "tract" -- basic unit of reading or writing.</li>
<li>API: CreateBlob, WriteTract and ReadTract.</li>
<li>Component: tractservers that respond to read/write requests, and a metadata server. API hides tractservers' existence.</li>
<li>Metadata management is distributed, with no centralized components on common-case paths.
<ul>
<li>Spectrum of existing ideas: from GFS/HDFS (totally centralized, big bottleneck, too-large extents (64MB)) to DHTs (fully decentralized, but multiple trips over the network to do read/writes, and slow failure recovery).</li>
<li>FDS is in between on this spectrum.</li>
<li>There is a centralized metadata server, but the client has an oracle that maps Blob_GUIDs and Tract_Nums to tractserver addresses (consistent, pseudorandom mapping). Reads and writes don't generate traffic to the central server. Oracle is a table of all disks in the system (tract locator table) that is distributed to all clients -- (H(BlobGUID) + Tract_Num) % Table_Size -&gt; locates the appropriate servers in the system.</li>
<li>Special metadata tract, numbered -1 -- spreads metadata pseudorandomly across the system.</li>
<li>FDS supports atomic append, by doing a 2PC on the metadata tracts.</li>
</ul>
</li>
<li>Networking: assume an uncongested path from tractservers to clients. Traditionally datacenter networks are oversubscribed, but building a CLOS network is much smarter. FDS provisions the network sufficiently for each disk.
<ul>
<li>Largest testbed has ~250 machines.</li>
<li>Full bisection bandwidth is only stochastic, which creates a problem for long flows. FDS generates a lot of short flows, which is ideal for load balancing in a CLOS network. CLOS networks push congestion to the edges, so you still need to do some traffic shaping. Short flows are not great for TCP, so there is some virtual circuit management described in the paper.</li>
<li>Can read 950 MB/s/client and write 1150 MB/s/client. 516 disks saturates with &lt;50 clients, and 1033 disks saturates with ~200 clients. (Single replicated, 3-replicated reduces the write throughput naturally.)</li>
</ul>
</li>
<li>Fast recovery because all of the transfers can happen in parallel. As the cluster gets larger, recovery gets faster. The disk table is constructed so that all disk pairs appear in the table (giving it size O(n^2) in the number of disks).
<ul>
<li>Recovery at about 40 MB/s/disk -- all the way back to stable storage. A 1TB failure in a 3k-disk cluster recovered in ~17s.</li>
</ul>
</li>
<li>Sorting application -- 2012 world record for disk-to-disk sorting. This is based on MinuteSort, which asks how much data you can sort in a minute. Using &lt;1/5 of the previous (Yahoo!) record, sort almost 3x the data (1470GB vs 500GB) in 59s. Also beat the UCSD sorting record, which however is a bit more CPU-efficient.</li>
<li>Dynamic work allocation -- ignore data locality constraints to allow everyone to pull work from a global pool (mitigating stragglers). Works well for sorting on FDS.</li>
<li>Q: When extending a file, you need to contact replicas before extending the file -- isn't that a bottleneck? No because we distribute this across the tractserver for block -1. Also tracts are lazily allocated, so there's little cost to extending more than you need.</li>
<li>Q: Did you compare your system to commercial products, and how do you think it might scale to orders of magnitude larger? Couldn't afford to buy a large commercial cluster. At the scale of the network we built, you could just buy a big Arista switch with guaranteed full bisection bandwidth, so we think the CLOS network will scale to 55k or more servers. Everything we've seen has been strikingly linear.</li>
<li>Q: How well does this work when the map phase of a job will typically reduce the amount of data five-to-one? Today, a lot of clusters are built where you can only extract good performance as a data-parallel ship-computation-to-storage job. Sort is a classic example of this as an I/O torture test. FDS gives the programmer more flexibility to express the computation in the most appropriate way for their job.</li>
</ul>
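<p>The tract locator lookup from the notes above, (H(BlobGUID) + Tract_Num) % Table_Size, is simple enough to sketch. The choice of SHA-1 as the hash, the table contents and the function name are assumptions made for this example.</p>

```python
import hashlib


def locate(blob_guid: bytes, tract_num: int, table):
    """Return the TLT row (a list of tractservers) holding one tract."""
    h = int.from_bytes(hashlib.sha1(blob_guid).digest(), "big")
    return table[(h + tract_num) % len(table)]


# Toy table: each row lists the replicas for the tracts that map to it.
tlt = [["ts0", "ts1"], ["ts1", "ts2"], ["ts2", "ts3"], ["ts3", "ts0"]]
guid = bytes(16)  # placeholder 128-bit GUID

rows = [locate(guid, n, tlt) for n in range(4)]
metadata_row = locate(guid, -1, tlt)  # the special metadata tract
```

<p>Because the tract number is added after hashing, consecutive tracts of one blob walk through consecutive table rows, which spreads a sequential read over many disks. Clients can compute this locally from their cached copy of the table, with no round trip to the metadata server.</p>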
<p><strong>PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs</strong><br />
<em>Joseph E. Gonzalez, Yucheng Low, Haijie Gu, and Danny Bickson, Carnegie Mellon University; Carlos Guestrin, University of Washington</em></p>
<p>[to be added; please check back]</p>
<p>&nbsp;</p>
<p><strong>GraphChi: Large-Scale Graph Computation on Just a PC<br />
</strong><em>Aapo Kyrola and Guy Blelloch, Carnegie Mellon University; Carlos Guestrin, University of Washington</em></p>
<p>[to be added; please check back]</p>
<p>&nbsp;</p>
<h1>Session 2: Privacy</h1>
<p><strong>Hails: Protecting Data Privacy in Untrusted Web Applications<br />
</strong><em>Daniel B. Giffin, Amit Levy, Deian Stefan, David Terei, David Mazières, and John C. Mitchell, Stanford University; Alejandro Russo, Chalmers University </em></p>
<p>[to be added; please check back]</p>
<p>&nbsp;</p>
<p><strong>Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels<br />
</strong><em>Alan M. Dunn, Michael Z. Lee, Suman Jana, Sangman Kim, Mark Silberstein, Yuanzhong Xu, Vitaly Shmatikov, and Emmett Witchel, The University of Texas at Austin</em></p>
<p>People want to run programs without leaving any traces. Claim: current "state of the art" is private browsing. But this doesn't really work, as there is no OS support for privacy. There remain traces in the OS, and the application cannot do anything about it. Buffers remain (e.g. Pulseaudio, X server framebuffer stuff, network packets etc.). People have fixed this by zeroing memory on deallocation, but for some reason that I missed, that is not enough (no deniability?). So the goal is to make a system that has forensic deniability, and imposes low overheads and only on the "private" programs. Their system is called "Lacuna" and based on Linux+KVM. Applications are unmodified. First step: create "erasable program container". IPC can be contained by running the program inside a VM, but that's not enough, as the program needs access to peripherals (e.g. GPU), and the graphics driver and X server potentially have access to the data. Let's look at storage first. We can encrypt, so that only encrypted data passes through the OS (this feels like the floor-hitting fruit to me). Graphics card a little trickier; use "ephemeral channels". Two possible types: 1) leave no traces, using a hardware channel, giving the guest VM full control of HW, 2) ensure traces are not readable, encrypt data and have lightweight decryption proxy in the driver. They provide the first type of channel by exploiting virtualization support in hardware (e.g. NICs); this also seems to be a bit of a low-hanging fruit to me. The second type is based on something they built; e.g. for graphics this is based on an emulated graphics card and a modified driver, decrypting using CUDA. Modified VMM provides hardware channels for e.g. USB, Audio etc. Storage is also based on encryption, but need to make sure that the key is erased, and any pages in buffer cache are encrypted.</p>
<p>Evaluation: show that Lacuna protects privacy. To show, they inject "random tokens" instead of keyboard input, and then scan memory for those tokens. "Almost always" found without Lacuna, never with. They measure the number of LOC that handle sensitive data -- it's small (low hundreds). The overhead on switching between private and non-private mode is low (e.g. due to switching USB drivers). Runtime performance of typical desktop applications is unchanged, but CPU load is higher due to encryption overhead, although hardware AES support already helps with this.</p>
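<p>The token-injection methodology is easy to picture in code. This is only a sketch of the idea with made-up buffers standing in for a real memory dump; the helper names are hypothetical.</p>

```python
import secrets


def make_tokens(n, size=16):
    # Random byte strings injected in place of real keyboard input.
    return [secrets.token_bytes(size) for _ in range(n)]


def leaked_tokens(memory_dump: bytes, tokens):
    # Any token still findable in the dump counts as a leftover trace.
    return [t for t in tokens if t in memory_dump]


tokens = make_tokens(3)
dirty_dump = bytes(64) + tokens[1] + bytes(64)  # a buffer that kept a copy
clean_dump = bytes(256)                         # nothing but zeroed buffers

found_dirty = leaked_tokens(dirty_dump, tokens)
found_clean = leaked_tokens(clean_dump, tokens)
```

<p>Random tokens make false positives vanishingly unlikely, so any hit is strong evidence that input data survived somewhere in memory.</p>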
<p><strong>Q:</strong> You are leaving encrypted data on the drive, right? Cannot decrypt that, but there are legal ways of enforcing decryption (cf. court cases). Any ways of dealing with that?<br />
<strong>A:</strong> We don't hide the fact that we used encrypted channels, but we do destroy the key, so nobody could actually get to the data (eh? Surely that means that any persistent storage is pointless?)</p>
<p><strong>Q:</strong> What do you do with unencrypted data in device (not OS) buffers?<br />
<strong>A:</strong> Quite possibly there are device-level HW buffers. But this isn't our focus; we try to do as good a job as we can for any "publicly accessible API".</p>
<p><strong>Q:</strong> Cheating on graphics, since surely you can just reverse the encryption?<br />
<strong>A:</strong> No, we zero out the memory afterwards. (?)</p>
<p><strong>Q:</strong> How hard was it to modify the drivers?<br />
<strong>A:</strong> Often can just modify generic subsystems; however, for graphics, we don't currently support 3D (obviously).</p>
<p>&nbsp;</p>
<p><strong>CleanOS: Limiting Mobile Data Exposure with Idle Eviction<br />
</strong><em>Yang Tang, Phillip Ames, Sravan Bhamidipati, Ashish Bijlani, Roxana Geambasu, and Nikhil Sarda, Columbia University</em></p>
<p>Mobile devices are ubiquitous, yadida. New challenges: security and privacy of data, since not protected by physical security or corporate firewalls. Devices can be lost, stolen, seized, or the user may connect to random insecure wireless networks. Mobile OSes have not evolved to protect against this: for example, the OS does not securely erase sensitive data or deleted files. Example: they dumped SQLite databases from app memory on Android; for five out of 14 apps, they found the cleartext password this way, and 13 of 14 had "some kind of sensitive data" in RAM in cleartext. Protecting devices is hard! Encryption and remote wipe-out have issues; many users do not configure good passwords, so that devices are easy to unlock. Hence, these solutions are "imperfect stop-gaps". Their claim: we need new OS abstractions to manage sensitive data rigorously, so that devices are always "clean" just-in-case. CleanOS solves the problem by pushing the data out to a "trusted cloud". They do so by implementing sensitive data objects (SDOs), and pulling them on-demand from the cloud. Hence, a thief or attacker must then fetch the data from the cloud, where it can be more easily removed or protected using stronger passwords and encryption. Key insight: much sensitive data is in memory in cleartext, but is only used rarely (e.g. on data refresh). Applications can create SDOs in CleanOS, thereby identifying sensitive data. CleanOS then tracks these objects using taint tracking, and evicts the SDOs to the cloud "when idle" (I guess as part of GC, or something?). Actually, they don't push the data there; they just encrypt it, put the key into the cloud, and fetch it when necessary.<br />
Comparison of CleanOS vs standard mobile OS: the benefit of CleanOS in the time between attack and the user noticing is that it can audit accesses, or disable them based on heuristics for suspicious behaviour. Once the user notices, access can be completely cut off (arguably, this is also true for remote wipe-out, surely?).</p>
<p>SDOs have a unique ID (set by whom?) and a textual description for auditing. Using these is as simple as wrapping Java objects in SDO containers. The way they work is by having VM-level support (Dalvik, based on TaintDroid) and modified interpreters and garbage collectors. When an SDO is created, a random ID is generated and the SDO is registered with "the cloud" in a trusted SDO database. When an SDO becomes eligible for GC, it will be encrypted and the key saved in the cloud. When the object is used again, the key must be retrieved from the cloud, which creates an audit log entry. Their new garbage collection is called "eiGC", which is more aggressive than a traditional GC in that it will evict objects that have not been used in "some time", rather than just orphaned objects. Various bits of nitty-gritty stuff about carrying ciphertext in Java objects while maintaining their API. CleanOS can work without any app support, but works better with such support. They provide some sensible defaults (SSL, user input and password SDOs). All of this stuff of course has runtime overheads and energy implications. They do, however, include a bunch of optimizations that improve performance (not many details given).</p>
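<p>A toy model of the SDO life cycle as I understood it, in Python rather than Java for brevity. The <code>Cloud</code> and <code>SDO</code> classes are hypothetical, and the XOR with a fresh random pad is only a stand-in for whatever cipher CleanOS really uses.</p>

```python
import secrets


class Cloud:
    """Hypothetical trusted key store that logs every key fetch."""

    def __init__(self):
        self.keys, self.audit_log = {}, []

    def put_key(self, sdo_id, key):
        self.keys[sdo_id] = key

    def get_key(self, sdo_id):
        self.audit_log.append(sdo_id)  # every fetch leaves an audit entry
        return self.keys[sdo_id]


def xor(data, key):
    return bytes(a ^ b for a, b in zip(data, key))


class SDO:
    def __init__(self, cloud, description, value: bytes):
        self.id = secrets.token_hex(8)  # random ID chosen at creation
        self.description = description
        self._cloud, self._plain, self._cipher = cloud, value, None

    def evict(self):
        # Idle eviction: keep only ciphertext locally, send the key to the cloud.
        if self._plain is None:
            return
        key = secrets.token_bytes(len(self._plain))
        self._cipher, self._plain = xor(self._plain, key), None
        self._cloud.put_key(self.id, key)

    def get(self):
        if self._plain is None:
            key = self._cloud.get_key(self.id)  # audited fetch
            self._plain, self._cipher = xor(self._cipher, key), None
        return self._plain


cloud = Cloud()
password = SDO(cloud, "mail password", b"hunter2")
password.evict()
evicted_plain = password._plain
recovered = password.get()
```

<p>After eviction the device holds only ciphertext, and the first access afterwards shows up in the audit log, which is what lets the cloud side notice or cut off a thief.</p>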
<p>Here comes the eval, or a talk-sized subset thereof. Key questions: does CleanOS limit data exposure? According to some slightly unclear metric, SDOs reduce data exposure by ~90%. "Audit precision" (?) is high, but even higher if apps support SDOs directly. Time overheads of using CleanOS are in the millisecond range on WiFi, but second range for 3G. Their optimizations help to curb this down again, though.</p>
<p><strong>Q:</strong> What is the impact on power consumption?<br />
<strong>A:</strong> Less than 9% overhead. Most energy used for screen anyway (not really a new result).</p>
<p><strong>Q (Jason Flinn, UMichigan):</strong> You discovered a fundamental trade-off between the performance benefit of caching, and the granularity of privacy. Batching could really help here. How do you balance these concerns?<br />
<strong>A:</strong> Let the user configure the eviction policy.</p>
<p><strong>Q (someone from EPFL):</strong> What happens when the device is taken offline?<br />
<strong>A:</strong> We have two types of disconnection: temporary and long-term. Different solutions: for the former, just extend the lifetime of the SDOs, for latter, user can disable the eviction.</p>
<p><strong>Q (Bryan Ford, Yale U):</strong> What percentage of objects are actually sensitive, and how does the taint propagate?<br />
<strong>A:</strong> So far, not seen a massive impact of tainting. In 24h, only about 1.8% of objects are tainted.</p>
<p><strong>Q (someone from EPFL):</strong> Do apps have some kind of API that ensures that SDOs are available, or will they just freeze if the SDO has been evicted?<br />
<strong>A:</strong> Not currently, but the SDO abstraction should be transparent to apps.</p>
<p>&nbsp;</p>
<h1>Session 3: Mobility</h1>
<p><strong>COMET: Code Offload by Migrating Execution Transparently</strong><br />
<em>Mark S. Gordon, D. Anoushe Jamshidi, Scott Mahlke, and Z. Morley Mao, University of Michigan; Xu Chen, AT&amp;T Labs—Research</em></p>
<p>Offloading is neat, as it gives us extra resources, especially on weak mobile devices. Existing work in this area follows the "capture and migrate" paradigm, usually on a method granularity. Drawback: doesn't work well in multi-threaded environments, or with synchronization. Goals for COMET: higher mobile computation speed, no programmer effort, generalize well with existing applications, resist network failures. Implementation: modified Dalvik VM, synchronizing two devices (mobile and server) using distributed shared memory. COMET = offloading + DSM, i.e. global address space. DSM is traditionally used in well-connected environments, but this is using mobile data connections, which are probably quite pathological for this (many round-trips for writes). The Java memory model is important to the implementation, since it specifies the consistency semantics (accesses in single thread are totally ordered; lazy release consistency locking). Simple DSM scheme: they track dirty fields (Java MM is field-granularity); [missed the rest of slide]. They sync two Java VMs, including bytecode and thread stacks. There is a "pusher" and a "puller" (directional sync). First step on app launch is to load the bytecode (usually just one file), which may be cached or may need to be sent. Then thread stacks, and finally heap updates/changes. Locks are annotated with an ownership flag, which is used to establish "happens-before" relationships as required by Java memory model. Thread migration now becomes simple: 1) push VM sync, 2) transfer lock ownership. Native methods are a challenge, since they are performance critical and often interact with device hardware. They manually white-list methods that are safe to run on the server side. The whole thing is fail-safe in the sense that we can lose the server, since there is always enough information on the client in order to just run threads locally.</p>
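<p>Field-granularity dirty tracking can be sketched in a few lines. This is a toy model in Python, not COMET's Dalvik implementation; <code>Tracked</code>, <code>push</code> and <code>pull</code> are hypothetical names.</p>

```python
class Tracked:
    """Object whose field writes are recorded for the next sync."""

    def __init__(self, **fields):
        object.__setattr__(self, "_dirty", set())
        for name, value in fields.items():
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        self._dirty.add(name)  # remember which field changed


def push(obj):
    """Pusher side: collect only the fields written since the last sync."""
    update = {name: getattr(obj, name) for name in sorted(obj._dirty)}
    obj._dirty.clear()
    return update


def pull(obj, update):
    """Puller side: apply an update without marking the fields dirty."""
    for name, value in update.items():
        object.__setattr__(obj, name, value)


phone = Tracked(x=1, y=2, label="img")
server = Tracked(x=1, y=2, label="img")
phone.x = 10
update = push(phone)
pull(server, update)
```

<p>Only the one modified field crosses the (slow) network, which is the point of tracking at field rather than object or page granularity.</p>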
<p>Scheduling: currently fairly simple. They monitor the execution of threads, and migrate if T = 2*migration time, i.e. if it is "worthwhile" to migrate according to a simple heuristic. They evaluate it using a 1 GHz Samsung phone and an 8-core high-performance Xeon server. Use a set of hand-picked applications from Google Play (as opposed to own applications in previous work). Get speedups of ~2.88x on WiFi and 1.28x on 3G; energy savings are 1.51x for WiFi and 0.84x for 3G (i.e. not actually a win). On LINPACK, they get ~10x speedup, and 500+x speedup on their hand-crafted multi-threaded demo application. Also looked at web browsing and whether it could be accelerated: somewhat challenging, because web browsers are usually written in C. On a Java-based JS interpreter, they get 6x speedup. Unlike previous work, they have full multi-threading support, and their field-based DSM coherency is novel.</p>
<p><strong>Q (someone from Simon Fraser U):</strong> How will applications like games on Android benefit from code offloading?<br />
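<p>A rough sketch of that heuristic in Python. This is an interpretation of the notes, not the authors' scheduler: it assumes "T = 2*migration time" means a thread is migrated once it has run locally for at least twice the estimated cost of migrating it. Both function names and the cost model (one round trip plus transfer time) are hypothetical.</p>
<pre>
```python
def estimate_migration_time(dirty_bytes, bandwidth_bytes_per_s, rtt_s):
    """Hypothetical cost model: one round trip plus the time to ship dirty state."""
    return rtt_s + dirty_bytes / bandwidth_bytes_per_s


def should_migrate(local_runtime_s, migration_time_s):
    """Assumed reading of the rule: migrate once local runtime reaches 2x migration cost."""
    return local_runtime_s >= 2 * migration_time_s


# 1 MB of dirty state over a 500 kB/s link with a 100 ms round trip.
cost = estimate_migration_time(1_000_000, 500_000, 0.1)
decision_early = should_migrate(1.0, cost)   # thread has only run for 1 s
decision_late = should_migrate(5.0, cost)    # thread has run for 5 s
```
</pre>
<p>With these made-up numbers the thread stays local after 1 s and is migrated after 5 s. A slower link raises the cost estimate and so delays migration, which fits the smaller gains reported on 3G.</p>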
<strong>A:</strong> Apps with lots of user interaction will probably not benefit much (and this includes many games). Image filtering and processing apps probably better.</p>
<p><strong>Q:</strong> Reminds me of CloneCloud, how is this different?<br />
<strong>A: </strong>Full multi-threading support, can migrate in the middle of a method (different granularity), no need to block on fetching data (?).</p>
<p><strong>Q (someone from Purdue):</strong> Hand-picked apps -- speculate on the possibility of an automated analysis that determines the benefit of offloading?<br />
<strong>A:</strong> Consider doing a user study to figure out if this is of practical benefit.</p>
<p><strong>Q (someone from MSR):</strong> Which apps would benefit most if you did this commercially, and what is the incentive for an app writer to include this functionality?<br />
<strong>A:</strong> Compute-intensive apps, e.g. kernel-based image processing.</p>
<p><strong>Q (someone from UCSD):</strong> How do you deal with disk IO?<br />
<strong>A:</strong> Will force migration back to client. Unlike CloneCloud, they do not virtualize the file system.</p>
<p><strong>Q:</strong> What are the requirements for supporting a native method for offloading?<br />
<strong>A:</strong> As a first order approximation, cannot support it if it makes syscalls.</p>
<p>&nbsp;</p>
<p><strong>AppInsight: Mobile App Performance Monitoring in the Wild<br />
</strong><em>Lenin Ravindranath, Jitendra Padhye, Sharad Agarwal, Ratul Mahajan, Ian Obermiller, and Shahin Shayandeh, Microsoft Research</em></p>
<p>People have hundreds of apps on their mobile devices. In total, there are &gt;1M apps, with &gt;300k developers. This is a lot like other software markets. Many apps, however, are too slow, causing unhappy users and bad ratings. But why do apps perform poorly in the hands of users? Diversity of ecosystem, connectivity etc. -- all of this is almost impossible to reproduce comprehensively in the lab. What a developer needs in order to get useful feedback is monitoring "in the wild", in the hands of users, on their devices. Currently, the only option is to instrument the app with bespoke monitoring infrastructure, but that is beyond what hobbyist developers can do. Their thing, AppInsight, runs with zero developer effort and instruments compiled binaries, which can then be distributed to users.</p>
<p>However, app instrumentation is challenging because you do not want to slow things down even further, and often only have limited resources at hand. Apps, to make matters worse, are highly asynchronous due to frequent UI interactions etc.; most tasks are performed asynchronously. Tracking user-perceived delay across thread boundaries is really hard. Somehow, they can extract a "transaction graph" from real apps; results show that, on average, apps have ~19 asynchronous calls and use ~8 threads. The bottleneck path through the transaction graph is called the "critical path"; if it can be made shorter, user-perceived delay improves. Back to AppInsight: need to capture *just* enough information at low overhead to understand the transaction. They try to identify "upcalls" to do so: e.g. event handlers, function pointers etc. (they have heuristics). These calls can then be instrumented. Matching callbacks to threads is also very hard, partly because of "detour callbacks" and non-1:1 matching. They measure a whole lot of stuff and send it to the server.</p>
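<p>For intuition, a small Python sketch of finding a critical path. It assumes the transaction graph is a DAG with a known duration per node, and it takes the critical path to be the longest-duration path through it. That is the textbook definition and may differ in detail from AppInsight's; the node names and timings are invented.</p>
<pre>
```python
from functools import lru_cache


def critical_path(durations, edges):
    """Return (total_ms, path) for the longest-duration path through a DAG.

    durations: dict mapping node name to time spent in that node (ms)
    edges: dict mapping node name to the list of nodes it triggers
    """

    @lru_cache(maxsize=None)
    def best_from(node):
        # Longest path starting here: own duration plus the slowest continuation.
        tails = [best_from(nxt) for nxt in edges.get(node, [])]
        tail_ms, tail_path = max(tails, default=(0, ()))
        return durations[node] + tail_ms, (node,) + tail_path

    return max(best_from(node) for node in durations)


# Invented transaction: a tap starts a web request and a local DB read in parallel.
durations = {"tap": 5, "http": 300, "db": 40, "parse": 20, "render": 15}
edges = {"tap": ["http", "db"], "http": ["parse"], "parse": ["render"], "db": ["render"]}
total_ms, path = critical_path(durations, edges)
```
</pre>
<p>Here the path through the web request (tap, http, parse, render; 340 ms) dominates the DB branch (60 ms), so speeding up the DB read would not change what the user perceives.</p>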
<p>Goal: optimize the critical path, and make it shorter in time. They batch and aggregate in order to reduce profiling overhead; programmers provide annotations e.g. indicating corner cases. Using similar mechanisms, they can also monitor and trace thread crashes, giving the "exception path". Everything is aggregated in a shiny web-based interface.</p>
<p>Evaluation: they deployed this in 30 Windows Phone apps, which actually do have performance problems: 15% of user transactions take more than 5s. They observe a huge variability in the wild, and measure the overhead at runtime: it is &lt;1% on compute, ~4% on network and &lt;1% on battery. Case study: "My App", which has a UI hog problem. AppInsight told them where to look for the issue. Another one had slow transactions due to bad caching policies, which AppInsight highlighted. Also did a pilot study with a large enterprise that actually built its own instrumentation pipeline, which then turned out to make up the critical path, responsible for extra latency!</p>
<p><em>[apologies for the bad coverage of the last talk -- I was very tired by this point, after having come to the conference straight off an 11-hour flight!]</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2012/10/09/blogging-osdi-2012-day-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RecSys 2012: few things i remember</title>
		<link>http://www.syslog.cl.cam.ac.uk/2012/09/14/recsys-2012-few-things-i-remember/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2012/09/14/recsys-2012-few-things-i-remember/#comments</comments>
		<pubDate>Fri, 14 Sep 2012 12:47:19 +0000</pubDate>
		<dc:creator>Daniele Quercia</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1038</guid>
		<description><![CDATA[random notes &#38; thoughts Workshops From Sunday's workshops, I remember this paper "Dating Sites and the Split-complex Numbers". It uses split-complex numbers to represent dating preferences in an elegant way. It seems promising. It'd be great to connect this work to previous papers on trust and distrust and on structural balance theories... I also heard [...]]]></description>
				<content:encoded><![CDATA[<p>random notes &amp; thoughts</p>
<p><strong>Workshops</strong></p>
<p>From Sunday's workshops, I remember this paper "<a href="http://networkscience.wordpress.com/2011/08/09/dating-sites-and-the-split-complex-numbers/">Dating Sites and the Split-complex Numbers</a>". It uses split-complex numbers to represent dating preferences in an elegant way. It seems promising. It'd be great to connect this work to previous papers on trust and distrust and on structural balance theories... I also heard that two presentations were quite good: 1) <a href="http://lnkd.in/4Zv7Ui">Content, Connections, and Context</a> 2) Joseph Konstan's talk about the different decision strategies people have in different contexts.</p>
<p>On Thursday, we ran a workshop on mobile recommender systems. Francesco Calabrese of IBM Smart Cities gave an interesting invited talk about current projects on transportation systems. Then, we had a set of really good talks &amp; one outdoor activity. What did I learn? Well, most of the existing mobile systems assume that the recommendation process unfolds in one single step - get restaurant recommendations &amp; choose one of them. In reality, recommendations in the built environment should go beyond that. For example,</p>
<ul>
<li>To mimic humans, the task of recommending restaurants should at least return 3 different recommendations (or facets): closest restaurant, best restaurant, trade-off between the two.</li>
<li>One should understand WHY people visit certain places. How did they make those decisions? Which criteria did they employ?</li>
<li>Recommender systems need to tap into established findings in the area of urban studies. For example, in our RecSys paper "<a href="http://www.cl.cam.ac.uk/~dq209/publications/trumper12ads.pdf" target="_blank">Ads &amp; the City</a>", we exploited the fact that people are boring - they generally do not travel very far - unless what they are looking for is not readily available where they are.</li>
<li>Temporal patterns in recommender systems have not been widely studied. They have been studied on Web platforms only recently (and Neal Lathia has done <a href="http://www0.cs.ucl.ac.uk/staff/l.capra/publications/lathia_sigir10.pdf">great work</a> on that!) and have been neglected in mobile platforms. That is why we had another paper in the conference titled "<a href="http://www.cl.cam.ac.uk/~dq209/publications/sha12spotting.pdf" target="_blank">Spotting Trends: The Wisdom of the few</a>"</li>
<li>Finally, and more importantly, we need far more user studies of how these systems are ACTUALLY used! Recommendations do not matter much - the experience counts ;)</li>
</ul>
<p>And this is just scratching the surface ;)</p>
<p><strong>Conference</strong></p>
<p>I remember only a few things from the conference (the industry track was pretty good):</p>
<ul>
<li>Multiple Objective Optimization in Recommendation Systems (LinkedIn). Nice example of A/B testing.</li>
<li>Towards Personality-Based Personalization (Thore Graepel of Microsoft Research). Nice talk about how easy it is to predict personal attributes of Facebook users based on their likes. If you are interested in personality and social media, you should check out our work on <a href="http://www.cl.cam.ac.uk/~dq209/publications/quercia12personality.pdf" target="_blank">Facebook</a> and <a href="http://www.cl.cam.ac.uk/~dq209/publications/quercia11twitter.pdf" target="_blank">Twitter</a> (we can predict personality traits of Twitter users based only on their number of followers, following, and listed counts).</li>
<li>Building Industrial-scale Real-world Recommender Systems (Xavier Amatriain of Netflix). Brilliant (&amp; <a href="http://instagram.com/p/Pbfg3ng2Mk/" target="_blank">fully packed</a>) tutorial. Check <a href="http://recsys.acm.org/2012/tutorials.html#building" target="_blank">this</a> out for a summary.</li>
<li>Controlled experiments at Microsoft Bing (very good work): I encourage you to read the 2009 guide [<a href="http://t.co/UcArxo6L" target="_blank">pdf</a>]; the <a href="http://t.co/blErYzJW" target="_blank">2012 KDD </a>paper; and the <a href="http://www.exp-platform.com/Pages/2012RecSys.aspx" target="_blank">slides</a> of the talk.</li>
<li>Pareto-efficient hybridization for multi-objective recommender systems (UFMG). Here the question is how to combine different types of algorithms (hybridization).</li>
<li>User Effort vs. Accuracy in Rating-based Elicitation (PoliMI). What's the optimal number of user ratings for movie recommendations? It seems to be between 5 and 20.</li>
<li> TasteWeights: A Visual Interactive Hybrid Recommender System (UCSB). Visualization platform for your social media stream</li>
<li>Learning to rank optimizing MRR for recommendations. Very cool <a href="http://t.co/6zBF9Ds3" target="_blank">work</a>.  It taps into the <a href="http://dl.acm.org/citation.cfm?id=1148245" target="_blank">less is more</a> concept, which I'm a big fan of</li>
<li>Thumbs up to real-world stuff: Beyond Lists: Studying the Effect of Different Recommendation Visualizations;  Yokie - Explorations in Curated Real-Time Search &amp; Discovery Using Twitter; A System for Twitter User List Curation; The Demonstration of the Reviewer’s Assistant; CubeThat: News Article Recommender (browser extension for Chrome displays recommended additional news stories related to the same topic as the current news story)</li>
<li>Challenges in music recommendation (@plamere from @echonest). A couple of interesting insights: "Understanding the specifics of your domain is critical to building a good recommender"; and recommending down-tail is OK, while recommending up-tail (Britney to someone who likes Tom Waits) is risky. Might be offensive to one's music identity. So make your recommendations <strong>Hipster-Friendly</strong> ;)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2012/09/14/recsys-2012-few-things-i-remember/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Communications and Multimedia Security Workshop</title>
		<link>http://www.syslog.cl.cam.ac.uk/2012/09/05/1034/</link>
		<comments>http://www.syslog.cl.cam.ac.uk/2012/09/05/1034/#comments</comments>
		<pubDate>Wed, 05 Sep 2012 08:08:42 +0000</pubDate>
		<dc:creator>Jon Crowcroft</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Workshop]]></category>

		<guid isPermaLink="false">http://www.syslog.cl.cam.ac.uk/?p=1034</guid>
		<description><![CDATA[Communications and Multimedia Security University of Kent, Canterbury Sponsor IFIP Sep 3-4, 2012 Proceedings are LNCS - will give to CL library if people want to look up any paper there Basic conference is fairly good - lots of low level detailed work...mainly security, but some systems stuff 9.30 - 10.30 Keynote Talk Privacy Management in [...]]]></description>
				<content:encoded><![CDATA[<p>Communications and Multimedia Security<br />
University of Kent, Canterbury<br />
Sponsor IFIP<br />
Sep 3-4, 2012</p>
<p>Proceedings are LNCS - will give to CL library if people want to look up any<br />
paper there</p>
<p>Basic conference is fairly good - lots of low level detailed work...mainly security, but some systems stuff</p>
<p><span id="more-1034"></span></p>
<p>9.30 - 10.30 Keynote Talk</p>
<p>Privacy Management in Global Organisations<br />
Siani Pearson, HP Labs</p>
<p>This was a high level definitional view of the problem space - key takehomes were<br />
1. there are different jurisdictional areas with different legal and social definitions of privacy - if you are doing business across these, then you need to consider how the different privacy policy and legal systems interact and basically work out the cross product of permissible (and non-permissible) things you can do with PII</p>
<p>2. HP have a pretty nice toolset for walking people through privacy policy - instead of asking their employees to read (and comprehend) their corporate policy doc (of 300 pp), they ask them to run the wizard...</p>
<p>11.00 - 12.30 Research Papers 1 - Image and Handwriting Analysis</p>
<p>Robust Resampling Detection in Digital Images<br />
Hieu Cuong Nguyen, Stefan Katzenbeisser</p>
<p>this is what it says on the tin</p>
<p>Feature Selection on Handwriting Biometrics: Security Aspects of Artificial Forgeries<br />
Karl Kummel, Tobias Scheidat, Claus Vielhauer</p>
<p>ditto - lots of machine learning</p>
<p>Security Analysis of Image-based PUFs for Anti-Counterfeiting<br />
Saloomeh Shariati, Francois Koeune, Francois-Xavier Standaert</p>
<p>PUFs are physical differences in things like printers - you get to know which particular printer made something - you can seed watermarks from those differences - this was a formal framework for understanding the security properties of PUFs just like other security entities (MACs, Marks, Identifiers etc) and was a nice talk...paper looks quite good</p>
<p>12.30 - 13.30 Lunch</p>
<p>13.30 - 15.00 Work in Progress 1 - Biometrics, Forensics and Watermarking</p>
<p>Computer-aided contact-less localization of latent fingerprints in low-resolution CWL scans<br />
Andrey Makrushin, Tobias Kiertscher, Robert Fischer, Stefan Gruhn, Claus Vielhauer, Jana Dittmann</p>
<p>neat way of using low cost cameras to get 2.5-3D images of prints from a crime scene - looks like the group has worked with a company that has patents &amp; prototypes that could work...device right now is the size of a car boot, but they hope to get it down to a frying pan :)</p>
<p>A Method for Reducing the Risk of Errors in Digital Forensic Investigations<br />
Graeme Horsman, Christopher Laing, Paul Vickers</p>
<p>Nice talk by a CS/Forensics guy now retraining as a barrister (follow the money) :)</p>
<p>Short Term Template Aging Effects on Biometric Dynamic Handwriting Authentication Performance<br />
Tobias Scheidat, Karl Kummel, Claus Vielhauer<br />
Not about aging as in decrepit - just about how even over short periods (e.g. a term in college) some biometrics (handwriting, esp.) can alter enough that the reference version starts to give more false negatives/positives quite quickly...</p>
<p>A New Approach to Commutative Watermarking-Encryption<br />
Roland Schmitz, Shujun Li, Christos Grecos, Xinpeng Zhang</p>
<p>This is a formal paper on how to design codes that can be used to make the order of crypt &amp; mark irrelevant - think generalized homomorphic crypto/watermark...</p>
<p>15.00 - 15.15 Extended Abstracts 1</p>
<p>OOXML File Analysis of the July 22nd Terrorist Manual<br />
Hanno Langweg</p>
<p>This was real work on Anders Breivik's document that was sent out 2 hours before he bombed and shot dead 70 people. The study was to determine that the document was largely or completely by one person, kept almost as a diary over 4 years, and not likely to have had other contributors (obviously this matters in the police follow-up, to help determine that the criminal acted alone and wasn't (as he claimed in court and before) part of a movement).</p>
<p>15.15 - 15.45 Tea/Coffee Break</p>
<p>15.45 - 19.00 Tour of Canterbury</p>
<p>We went, of course, to the Crypt in the Cathedral and I also visited a Bazaar (the Shed by Canterbury West station:)</p>
<p>19.00 - 21.00 Welcome Reception with Poster Display</p>
<p>Tuesday 4th September 2012</p>
<p>9.00 - 9.30 Registration</p>
<p>9.30 - 10.30 Keynote Talk</p>
<p>From Panopticon to Fresnel, dispelling a False Sense of Security<br />
Jon Crowcroft, University of Cambridge<br />
You have the slides</p>
<p>http://www.cl.cam.ac.uk/~jac22/talks/</p>
<p>10.30 - 11.00 Tea/Coffee Break</p>
<p>11.00 - 12.30 Research Papers 2 - Authentication and Performance</p>
<p>Document authentication using 2D codes: Maximizing the decoding performance using statistical inference.<br />
Mouhamadou Diong, Patrick Bas, Wahih Sawaya, Chloe Pelle</p>
<p>what it says on tin</p>
<p>Data-minimizing Authentication goes Mobile<br />
Patrick Bichsel, Jan Camenisch, Bart De Decker, Jorn Lapon, Vincent Naessens, Dieter Sommer</p>
<p>Password free access - basically attribute based login - e.g. are you allowed in this bar (are you over 18/21) doesn't require proof of id, just proof of attribute - later talk presented work on revocation (not sure how you revoke being 18:)</p>
<p>No Tradeoff Between Confidentiality and Performance: An Analysis on H.264/SVC Partial Encryption<br />
Zhuo Wei, Xuhua Ding, Robert Huijie Deng, Yongdong Wu</p>
<p>looked at scalable video coding and how you can crypt base codes but not enhancement layers and still get ok privacy/integrity - the paper quantifies temporal and spatial leakage in different codecs...</p>
<p>12.30 - 13.30 Lunch</p>
<p>13.30 - 15.00 Work in Progress 2 - Communications Security</p>
<p>Systematic Engineering of Control Protocols for Covert Channels<br />
Steffen Wendzel, Jorg Keller</p>
<p>Nice solid work on design &amp; capacity of different covert channels in TCP/IP and the like (think low order bits in TTL, etc)</p>
<p>Efficiency of Secure Network Coding Schemes<br />
Elke Franz, Stefan Pfennig, Andre Fischer</p>
<p>showed that you can secure network coded transmission without incurring too much overhead - actually I think for wireless net coding, it's easier than they think...but they concentrated on classic multicast coding...</p>
<p>A new Approach for Private Searches on Public-Key Encrypted Data<br />
Amar Siad<br />
no show</p>
<p>Multi-Level Authentication Based Single Sign-On for IMS Services<br />
Mohamed Maachaoui, Anas Abou El Kalam, Christian Fraboul, Abdellah Ait Ouahman</p>
<p>what it says on the tin!</p>
<p>15.00 - 15.30 Tea/Coffee Break</p>
<p>15.30 - 17.15 Extended Abstracts 2</p>
<p>Cuteforce Analyzer: Implementing a Heterogeneous Bruteforce Cluster with Specialized Coprocessors<br />
Jurgen Fuß, Wolfgang Kastl, Robert Kolmhofer, Georg Schonberger, Florian Wex</p>
<p>crazy brute force cryptanalysis toolkit</p>
<p>A framework for enforcing user-based authorization policies on packet filter firewalls<br />
Andre Zuquete, Pedro Correia, Miguel Rocha</p>
<p>Uses national ID cards to set up an IP option to make packets accountable (i.e. linkable to a specific person)...arghhh!!</p>
<p>From Biometrics to Forensics: A Feature Collection and first Feature Fusion Approaches for latent Fingerprint Detection using a Chromatic White Light (CWL) Sensor<br />
Robert Fischer, Tobias Kiertscher, Stefan Gruhn, Tobias Scheidat, Claus Vielhauer<br />
2nd talk about some detail of the low cost fingerprint tech talked about earlier</p>
<p>Practical Revocable Anonymous Credentials<br />
Jan Hajny, Lukas Malina</p>
<p>Another connected talk, which showed how to revoke credentials (club membership etc)</p>
<p>Are 128 bits long keys possible in Watermarking?<br />
Patrick Bas, Teddy Furon</p>
<p>Answer yes - see paper for nice math on why</p>
<p>Predicate-Tree based Pretty Good Privacy of Data<br />
William Perrizo, Arjun G. Roy</p>
<p>no show</p>
<p>Privacy-Preserving Scheduling Mechanism for eHealth Systems<br />
Milica Milutinovic, Vincent Naessens, Bart De Decker</p>
<p>solid work to do cover/crowd timing stuff so health sensor/monitor reports are kept reasonably private and not subject to timing analysis attacks...</p>
]]></content:encoded>
			<wfw:commentRss>http://www.syslog.cl.cam.ac.uk/2012/09/05/1034/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
