syslog
17 Apr 2013

Liveblog from Eurosys 2013 – Day 3

This is day 3 of EuroSys, the last day, and we'll be running the live blog as usual!

Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms).

Session 1: Scheduling and Performance Isolation

[No blog coverage available]

hClock: Hierarchical QoS for Packet Scheduling in a Hypervisor

Jean-Pascal Billaud and Ajay Gulati (VMware, Inc.)

RapiLog: Reducing System Complexity Through Verification

Gernot Heiser, Etienne Le Sueur, Adrian Danis, and Aleksander Budzynowski (NICTA and UNSW) and Tudor-Ioan Salomie and Gustavo Alonso (ETH Zurich)

Application Level Ballooning for Efficient Server Consolidation

Tudor-Ioan Salomie, Gustavo Alonso, and Timothy Roscoe (ETH Zurich) and Kevin Elphinstone (UNSW and NICTA)

 

 

Session 2: Scheduling and Performance Isolation

Omega: flexible, scalable schedulers for large compute clusters -- BEST STUDENT PAPER!

Malte Schwarzkopf (University of Cambridge Computer Laboratory), Andy Konwinski (University of California Berkeley), and Michael Abd-el-Malek and John Wilkes (Google Inc.)

Omega is Google's next-generation cluster scheduling system. Cluster scheduling means taking tasks that are part of jobs and mapping them to machines. At Google, they have observed a number of trends in recent years: workloads are becoming increasingly diverse, clusters keep growing, and so does the rate at which jobs arrive at the scheduler. The scheduling logic could in principle be the same for all workloads and tasks, but in practice scheduling tweaks and specialisations keep being added over time. The result is a huge, complex, monolithic piece of code. The idea is therefore to break the scheduler up into independent schedulers, but in order to do that, resources need to be arbitrated in some way. There are various existing techniques: a monolithic scheduler, static partitioning, and the two-level approach, in which a resource manager component partitions the cluster according to use (e.g. Mesos).

Omega does something different: shared state — try, hope for the best, and resolve any conflicts afterwards. There are n schedulers, each with a replica of the cluster state, which generate "deltas" and send them to the global state. In the shared cluster state, deltas are applied, which may succeed or may conflict. If they conflict, the first one succeeds and the other one fails and may retry.
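A minimal sketch of that shared-state, optimistic-concurrency idea as described in the talk (all names, the version counter, and the conflict rule are illustrative, not Omega's actual code):

```python
class SharedClusterState:
    """Shared cell state. Schedulers work on private replicas and
    commit deltas optimistically; later committers conflict and retry.
    This is an illustrative sketch, not Omega's implementation."""

    def __init__(self, machines):
        # machine name -> free capacity (e.g. CPU cores)
        self.free = dict(machines)
        self.version = 0

    def snapshot(self):
        """Return a scheduler's private replica of the cluster state."""
        return dict(self.free), self.version

    def commit(self, delta, base_version):
        """Apply a delta = {machine: cores_wanted}. The first committer
        since `base_version` wins; later ones fail and must re-snapshot."""
        if base_version != self.version:
            return False  # state changed under us: conflict, retry
        if any(self.free.get(m, 0) < c for m, c in delta.items()):
            return False  # not enough free capacity
        for m, c in delta.items():
            self.free[m] -= c
        self.version += 1
        return True
```

Two schedulers that snapshot the same state and both commit will see the first commit succeed and the second fail with a stale version, at which point the loser re-snapshots and retries.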

Workload characterisation: break workload into batch and service jobs. Observe that most jobs are batch, but most resources are consumed by service jobs. Batch jobs run for a shorter time but arrive much more frequently. Service jobs are less frequent (larger scheduling budget) but run for longer (so worth it).

Simulation workload:
- Model scheduler decision time: the more tasks in a job, the longer it takes to schedule. There is also a constant baseline of work that has to be done for each job, independent of its size.
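The decision-time model above is just a constant per-job overhead plus a per-task cost; a one-line sketch (parameter values are illustrative, not taken from the paper):

```python
def decision_time(num_tasks, t_job=0.1, t_task=0.005):
    """Per-job scheduler decision time: a fixed per-job overhead
    (t_job) plus a per-task cost (t_task). The default values are
    illustrative, not taken from the Omega paper."""
    return t_job + num_tasks * t_task
```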

They vary the per-job decision time vs. the per-task decision time vs. CPU load:
1) Monolithic scheduler. Observation: it doesn't scale for long jobs.
2) Monolithic scheduler with a fast path for batch decisions. Problem: head-of-line blocking, because even with two scheduling logics the scheduler is not parallelised.
3) Mesos. Failed to schedule jobs in every case. Mesos works based on offers, but is greedy, so one scheduler can receive an offer of all available resources while the next one receives a tiny offer; if the first one is long-running, the other has to retry many times.
4) Omega, no optimisations. Conflicts are a problem: some transactions clash and schedulers have to redo work, so schedulers are busier.
5) Omega, with optimisations. The figure is comparable to the monolithic scheduler, slightly worse because conflicts still occur.
The Omega shared-state model performs as well as a complex monolithic scheduler with a fast batch path.

Scaling to many schedulers: they scale up to 32 schedulers (with load balancing). The effective gain is about 8x rather than 32x because of conflicts, but that is still quite a large number.

Trace-based simulation:
How much interference is observed? There is a large overhead due to conflicts. It turns out this is because of oversubscription and placement constraints; interference is higher in real-world settings.
Optimisations: 1) Fine-grained conflict detection (avoiding false conflicts). Slap a sequence number on each machine and increment it whenever the machine is touched; but if there is still headroom to run another task, allow the scheduling anyway. This reduces the conflict rate significantly.
2) Incremental transactions: if one task out of a job fails, previously the entire job's transaction failed; now only the tasks that failed are retried, not the whole job.
A 2x difference was observed — a significant improvement in performance.
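The two optimisations can be sketched together (per-machine sequence numbers plus per-task retries; all names and structure are illustrative, not Omega's actual code):

```python
class Machine:
    """Per-machine state with a sequence number for fine-grained
    conflict detection (illustrative sketch, not Omega's code)."""
    def __init__(self, cores):
        self.free = cores
        self.seq = 0  # incremented on every change to this machine

def try_place(machines, placements, coarse=False):
    """Attempt each (machine_name, cores, seen_seq) placement
    independently (an incremental transaction). Under coarse detection,
    any seq mismatch is a conflict; under fine-grained detection a
    changed machine is still fine if it has headroom left. Returns only
    the placements that need retrying, not the whole job."""
    failed = []
    for name, cores, seen_seq in placements:
        m = machines[name]
        changed = m.seq != seen_seq
        if (coarse and changed) or m.free < cores:
            failed.append((name, cores, seen_seq))  # retry just this task
        else:
            m.free -= cores
            m.seq += 1
    return failed
```

With coarse detection, a machine touched by another scheduler since the snapshot always triggers a retry; the fine-grained check only retries when the machine genuinely lacks headroom, which is what cuts the conflict rate.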

Case study: a MapReduce scheduler with opportunistic extra resources. The takeaway is that this gives the flexibility to easily support custom policies.

Conclusion: flexibility and scale require parallelism. Parallel scheduling works if you do it right, and using shared state + optimistic concurrency is the way to do it right.   (nscc)

 

Scheduling many jobs on many machines: because jobs are very different, many different schedulers have been designed, but they are all scheduling over the same set of resources. Previous solutions:
- Static partitioning: split the machines into fixed-size groups, one associated with each scheduler. Doesn't work if a scheduler is overloaded and doesn't have enough machines.
- Dynamic partitioning: like Mesos, each scheduler is associated with a set of machines which changes dynamically.
- Omega: each scheduler can schedule on any machine. If there is a conflict, one of the schedulers wins and schedules its job whilst the other loses.

Workload characterisation: 2 types of jobs.
- Batch jobs: the vast majority and very frequent; return in a short time (~20 min).
- Service jobs: very rare but run for a very long time (months); they turn out to account for the majority of the computation done because they run for so long.
They try to guess which kind a job belongs to.

They run some experiments and observe something surprising about Mesos: a big proportion of jobs do not get scheduled by the scheduler. The reason is that when a single scheduler is given the right to schedule, all others are pretty much given no resources to play with if they try to schedule something concurrently.

They evaluate through simulation. Google turns out to have a high-fidelity simulator that uses real workloads, so they can evaluate the system fully this way. The outcome is that there are a lot of conflicts (which result in a high time to schedule tasks) when scheduling real workloads, so they make 2 optimizations:
- Fine-grained conflict detection
- Incremental commits: when jobs have many tasks and some tasks fail, let the other ones go through.
They show that with these optimizations they get very close to the optimal scheduling time they would get without conflicts.  (vd)

 

Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints

Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica (UC Berkeley)

Large datacenters are heterogeneous. Why? Because it is very hard to keep machine configurations identical. There are two types of diversity: more and more specialised hardware (e.g. lots of GPUs, high-memory machines, hardware accelerators), and specialisation also happens on the software side.
More and more jobs have placement constraints. Two types of constraints: hard constraints, which must be satisfied for the job to run on a particular machine, and soft constraints, which express a preference only.

How to fairly allocate machines when users have hard job placement constraints? The objective is to generalise max-min fairness to constraints. What policy to use? How to optimally achieve the policy offline? How to approximate the policy online? How well does it work? How to generalise scheduling with constraints?

Policy:
Traditionally, there is a sharing incentive where each user gets 1/n of the resources. They develop a constrained sharing incentive (CSI): assume each user i provides k machines; then user i should be able to get at least k machines in any allocation. Shapley value: give 1/k of a machine to each of the k users who want it. But this violates CSI, unfortunately.

Proposed policy: constrained max-min fairness (CMMF) recursively maximises the allocation of the user that has the fewest machines. An allocation is CMMF iff it is not possible to increase the minimum allocation of any subset of users by reallocating machines within the subset. It is the only policy that satisfies CSI. CMMF is also strategy-proof: lying about demands can only hurt your application.

The intuition is that it is basically water filling: 1) initially mark all users as non-frozen; 2) increase the allocation of all non-frozen users at the same rate until a bottleneck is hit; 3) freeze the shares of users that cannot get more without hurting others; 4) repeat until all users are frozen. This only freezes shares, not the actual assignments of machines, which might still shift. To determine the shares, solve a linear program with machine and fairness constraints. Once done, to determine which users are frozen, add another constraint: frozen users must receive exactly m shares (the value maximised in the previous program).

To determine the frozen users in an allocation, fix all but user i's current shares. Freeze user i iff its allocation cannot be increased when everyone else is frozen.
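The freezing idea behind water filling can be sketched for the simple, unconstrained case (the paper's constrained version replaces this loop with iterative linear programs over machine constraints; all names here are illustrative):

```python
def max_min_shares(capacity, demands):
    """Water filling for unconstrained max-min fair shares: raise all
    non-frozen users' shares at the same rate, freezing a user once its
    demand is met, until capacity runs out. Illustrative sketch only;
    Choosy's CMMF solves linear programs with machine constraints."""
    shares = {u: 0.0 for u in demands}
    frozen = set()
    remaining = capacity
    while remaining > 1e-9 and len(frozen) < len(demands):
        active = [u for u in demands if u not in frozen]
        # raise all active shares until the next bottleneck (smallest
        # unmet demand) or until remaining capacity is exhausted
        step = min(min(demands[u] - shares[u] for u in active),
                   remaining / len(active))
        for u in active:
            shares[u] += step
        remaining -= step * len(active)
        for u in active:
            if shares[u] >= demands[u] - 1e-9:
                frozen.add(u)  # cannot usefully get more: freeze it
    return shares
```

For example, with 10 machines and demands of 2, 5 and 8, the small user is frozen at its demand of 2 and the other two split the remaining 8 machines evenly, getting 4 each.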

They try to approximate this more efficiently, because cloud schedulers make decisions online and jobs may consist of many thousands of tasks, so tasks need to be scheduled quickly, on the fly. They approximate an offline scheduler that cannot preempt or migrate running tasks.

Evaluation: how different is Choosy from the optimal scheduler mentioned previously? How different are the allocation vectors and the job completion times? 90% of the time, the root-mean-square error of the allocation difference is less than 0.2% (a tiny bit worse if you allow the scheduler to migrate and preempt). For job completion times: Choosy has almost optimal job completion time.

Why does Choosy work? Ramp-up time is fast for Choosy: users quickly get their fair share. Ramp-up time depends on the pickiness of a user (number of machines the user needs / number of machines the user can run on). What is happening is that, if a user is not picky, it does not have to wait for outliers, hence the fast ramp-up.

How does this generalise?
-> Soft constraints like data locality: existing techniques like delay scheduling and Mantri can be combined with Choosy.
-> Multi-resource fairness: Choosy is very similar to DRF; schedule the user with the minimum dominant share whose constraints can be satisfied.
-> Hierarchical scheduling: most hierarchical schedulers support composition.

Conclusion: constrained max-min fairness is the only policy providing a sharing incentive; the optimal offline calculation is based on iterative linear programming; Choosy is an online system that comes close to the offline version.  (nscc)

 

Large datacenters are heterogeneous. 2 types of differences:
-Different specialised hardware (e.g. GPUs)
-Different software (e.g. kernel version)

2 types of constraint:
-Hard constraints: I need to have a public address (they focus on this)
-Soft constraints: data locality

What policy to use: they generalise max-min fairness with constrained max-min fairness (CMMF) => recursively maximise the allocation of the user that has the fewest machines. They seem to have some cool proofs that this is the only way to satisfy some max-min-like properties, but those are only in the paper. CMMF works like the basic water-filling algorithm for max-min: keep maximising the resources allocated to the unluckiest users. This can be turned into a linear program that is computed iteratively.

But doing this is expensive, so they approximate it in the following way: approximate an offline scheduler that cannot preempt tasks.

What Choosy does: whenever a resource frees up, allocate it to the unluckiest user.
They compare Choosy with optimal schedulers, under different definitions of optimal. Choosy seems to perform very well compared to optimal. I am confused about what they call optimal: do they mean the optimal max-min fair allocation? If one job parallelises better than another, the max-min allocation should differ from the optimal (i.e. utility-sum-maximising) allocation.

After questions: turns out they were using the optimal CMMF allocation.

Personal point of view: it is a shame they didn't look at the optimal allocations, and further, at the tradeoff between alpha-fairness (optimal: alpha=0, max-min: alpha=infinity) and performance, even if just from a theoretical point of view, without considering the complexity of the scheduling algorithm. That would have given great insight into the heterogeneity (or not) of the workload: how different are the jobs' performance-under-parallelisation characteristics? Instead they made the assumption that max-min is best, with no backup and no insight into whether it is.  (vd)

 

CPI2: CPU performance isolation for shared compute clusters

Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes (Google, Inc.)

[no blog coverage]

 

 

Awards

PC members nominated papers based on merit, then listened to the talks and to attendees to refine the list. The PC chairs made the final decision.

Best Paper Award:   BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
Best Student Paper:   Omega: flexible, scalable schedulers for large compute clusters

  • http://twitter.com/mas90 Malcolm

    Congratulations :-)