Keynote: Challenges for M*-core Systems
Steven Hand, University of Cambridge
Transistor counts double every 18 months, as we all know. But we are no longer building single-core chips out of these, but multi-cores. We've got dozes of cores now, but this is likely to increase. The reason for going multi-core in the first place was hitting the "power-wall": couldn't clock CPUs fast enough, but also "walls" in memory access and ILP. Multi-core ~= 2-16 cores, many-core ~= >16 cores; can be homogeneous or heterogeneous. In theory, should get ideal parallel speedup for parallelizable sections (cf. Amdahl's Law), but in practice, there are parallelization overheads that cause diminishing returns or even degradation as we go more parallel.
So how can we go faster when parallelizing? Maybe make serial portion faster (dynamic overclocking etc.); improve synchronization and communication primitives to reduce parallelization overhead; reduce straggling in parallel portions. The latter sounds suspiciously like MR-style stragglers -- but typical solution a bit different: use gang-scheduling for parallel threads to minimize imbalance. But this only works if the threads make equal progress given the same opportunity! This isn't always true -- many data-parallel applications aren't too regular (e.g. graph computations). Might be a good idea to partition into more numerous tasks and use work-stealing to balance; but problem with this is that the per-task overheads come back!
Memory access is also a big challenge: with NUMA, remote memory and deep cache hierarchies, getting the placement of memory objects right is crucial for balanced performance. Furthermore, having multiple parallel applications leads to a bin-packing problem for placement, while the parallel width of programs is also a variable. Starting to get a bit complicated, as we'd ideally like to optimize system-wide utility. BOPM experiment shows that, given sufficient work available, can get parallel speedup up to 60 threads (overcommitting the machine a little), but we get diminishing returns much earlier, so it makes sense for overall utility maximization to give some space to other applications. Scheduling this becomes tricky -- gang-scheduling goes some way, but still has trouble dealing with small gangs, localization and cost of scheduling, as well as microarchitectural interference.
Some indicative interference results, e.g. SPECCPU 2006: degradation in performance happens when sharing caches (as expected), but for some benchmarks it even happens if they are running on separate sockets and thus far away from each other. This also holds for higher-level macro benchmarks, such as typical data centre applications (again, 1.6x to 2x degradation between pairs on otherwise idle machine). Co-locating cache-sensitive and memory-intensive (streaming) workloads might be a promising approach (since the streaming benchmark does not benefit from the cache). Of course, if we're not doing anything clever, the cache-sensitive benchmark gets screwed over as its data is evicted from the cache. Solution: use page colouring to partition the cache and contain the streaming workload. Surprising result, however: the cache-sensitive benchmark improves, but the streaming benchmark runs with terrible performance. This turned out to be a result of the implicit page colouring, as this partitions the address space! Can be fixed with a hack (use EPTs to remap memory dynamically), at which point both workloads work well (with some caveats, preliminary work).
But what about new applications? Many attempts and concepts to avoid the programmer having to think much about parallelization. Might use threading libraries, or task-parallel data flow frameworks, which maybe also let us deal with heterogeneity. This is great, but running all this scaffolding on top of a standard OS might be a little inefficient, as it makes system-wide optimization hard to impossible. Answer: Mirage-style unikernel OS on top of Xen, highly specialized, but single-process/single-CPU. This means boot and live migration become very quick -- nice! No paging needed inside the unikernel, as all memory is pre-allocated at bootup time. Good performance on DNS server, and high-level language features in OCaml enable improvement on legacy systems (bind, NSD).
Now, clearly there won't be a mass-migration to Mirage-style unikernels any time soon. So can we do something interesting for existing programs that would like to exploit many-core? One interesting approach is trying to use speculation to extract parallelism from single-threaded applications. Initially looked at specialization in order to avoid waiting for I/O, which can lead to much idleness especially on many-core machines. But much of I/O, especially for desktop applications, actually just reads the same data every time. We could run a specializer on the binary that includes e.g. configuration settings in the binary, and thus avoid said I/O entirely. However, it is not always possible to specialize ahead of time, which gives us opportunities for speculation. For example, at a point of control flow depending on an unclear value, we could run ahead with threads assuming plausible values and then continue on the strand that turned out to be the correct one.
Asymmetry-aware execution placement on manycore chips
Alexey Tumanov, Joshua Wise, Onur Mutlu, Gregory R. Ganger
With many-core chips, we tend to have fewer memory controllers than cores. Since they have non-uniform distances to different cores, placing execution threads close to their memory controller becomes an important problem (ANUMA, = asymmetrical NUMA). The main difference here is that ANUMA systems have gradually changing memory access latencies to different memory controllers (much more fine-grained than previously). Micro-benchmarks show a worst-case latency differential of 14% on a Tile64 ANUMA chip. Does this matter to real-world workloads? Yes, get the same 14% difference on a single-core GCC benchmark. This is likely to get much worse once other applications use the other cores, and the NoC becomes contended. Classical static NUMA partitioning is not a great answer, as it is not contention-aware. Possible solutions: move the data to the computation (or at least allocate physical frames cleverly), or move the execution to the data, or hybrid.
What they did is to instrument the VM subsystem to collected information about page access, and then places threads appropriately. The placement algorithm is simple: threads are ordered by memory intensitivity and placed in decreasing proximity to the best memory controller. Ties broken by gradient descent to second choices. Their instrumentation is fairly heavyweight, so even the optimized version does not outperform the non-instrumented baseline (but does better than baseline with instrumentation turned on). In future work, they're planning to look at heterogeneous workloads, and extend the same ideas to caches ("ANUMCA"), and look at application-level goals.
Q (Simon Peter): Do you get the contention only to the memory controller, or also due to IPC on the NoC?
A: No-ish. Looked at interconnect throughput volume, which is shared between IPC and MC traffic.
Q (Malte Schwarzkopf): Why do you need per-process HW memory access counters?
A: Don't need to; was mistake, should be per-core.
Q (?): How do you know that your placement stuff does not hurt cache locality?
A: The benchmarks we looked at have a poor cache locality, so that the total number of memory accesses remained the same.
Supporting Iteration in a Heterogeneous Dataflow Engine
Jon Currey, Simon Baker, and Christopher J. Rossbach
There is increasingly much heterogeneity in systems, due to the prevalence of various accelerator chips (GPGPU, SoCs, FPGAs). Data-flow is a good model to program these things, as it expresses the minimal data movement, implies parallelism and leaves crucial decisions to the runtime engine. But data has expressivity limitations, especially with iterative and data-dependent workloads, conditional routing and stateful computations (e.g. accelerator buffers too small to contain all data). The classical data-flow ISA solutions of distributor and selectors does not work, as they are designed for fine-grained data flow (I'm not sure why? allegedly scheduling complexity?).
Their IDEA engine extends PTask, and enables iteration without requiring any additional nodes to be added to the data flow graph. A lot like CIEL, but unlike it, they do not extend the graph, but add predicates to the channels between data flow nodes. These predicates indicate specifically when an iteration should finish. Control signals piggy-back onto data blocks, and these signals are used for conditional routing. Iteration needs some special support: there is a special iterator function (can be user-defined) that decides what control signal to issue. The iterator has some kind of scope binding, which seems to be necessary for some distributed consistency properties.
Evaluation using optical flow workload, three-fold: CPU, GPU-with-driver-program, and GPU-with-IDEA. As expected, there is a huge speedup as a result of doing work on the GPU, and as the data size increases, the benefit of IDEA (which reduces the launch and load overhead) decreases. System works well and has good generality, but is a bit hard to program by hand. No support for dynamic data flow graph extension or generation.
[missed questions as I was asking one myself]
Supporting efficient aggregation in a task-based STM
Jean-Philippe Martin, Christopher J. Rossbach, Derek G. Murray, Michael Isard
TM is easier than locks, threads and classic parallel programming. Assertion: under low contention, TM has better performance/contention ratio (especially write sharing). Focus on one special kind of write-sharing: aggregation. Insight: we do not always have to serialize aggregations! Their system, Aggro, replaces tasks with threads, which can be spawned dynamically and equate to transactions. Reads and writes are on objects, which can be RW-shared, provided serialization guarantees hold.
The guarantees are: (1) for any run, there exists a serial execution consistent with the task graph, yielding the same result; (2) non-opacity [missed description]. Internally, Aggro has a bunch of expected data structures (TX contexts, read sets, write sets, object contexts etc.). Consider the example of a shared aggregation counter: modifying it is commutative, so TX may run in any order. It is only on read that the various writes that commited (in any order) in the mean time are "collapsed" (for which exclusive locking is required). Aggregation operators must be associative, commutative and side-effect-free.
Evaluation using k-means, wordcount, triangulation. Aggro scales best (up to 24-ish cores) on kmeans; more results in the paper.
Q (?): What happens in the read-after-write case in Aggro?
A: [missed details]
Keynote: The multicore evolution and operating systems: scalability by design
Reducing serial sections is crucial to good multicore performance. Typically, we fix OS scalability issues by profiling a target application, fixing bottleneck, repeat. But this is a little inelegant, as we cannot really tell if the bottlenecks we're fixing are fundamental, or workload-dependent phenomena. Instead, consider an interface-driven approach. Basic idea is the commutativity rule: if two operations commute, they can be implemented scalably -- this helps reasoning about interfaces. Their COMMUTER tool finds opportunities for commutativity and auto-generates test cases. Adopt a scalability definition that assumes that operations scale if they access disjont memory, OR merely read. The idea of commutativity came out of looking for an implementation-independent principle of determining if operations *can* work with disjoint memory. Might be able to change interfaces if they do not commute (as commutativity is desirable).
Turns out in practice that only very few operations commute unconditionally. Adopt a notion of "legal histories", which is essentially a way to derive if interleavings are legal under commutativity. Some theory about how this works; net result is that we can tell if sequences of operations are commutative and where the boundaries are. As a result, we can either change operation orderings (e.g. delay non-commutative ops) to get better scalability, or change the implementation/interface semantics.
Three classes of non-commutative operations ins POSIX: complex return values, unnecessary orderings, complex operations. They built a tool that takes a Python model of syscall behaviour, generates test cases and runs them on top of Linux in modified QEMU, which reports scalability violations (shared writes). Tried this with 10 POSIX FS calls, and found that only ~500 combinations scale; toy ScaleFS gets to ~1500. These gains scale to real-world micro-benchmarks, and also to higher-level workloads (mmap-heavy Metis MapReduce on 80-core machine).
Conclusion: the commutativity rule is quite useful in practice, and extends beyond OS APIs.
Heterogeneous Multicores: When Slower is Faster
Tomas Hruby, Herbert Bos, Andrew S. Tanenbaum
Breaking an OS into many components (classic µkernel spin) is great for dependability, but typically slow. NewtOS (their thing) is a high-performance version of MINIX 3, which avoids context switching and kernel boundary crossing for IPC (kernel does setup only). Proof-of-concept: disaggregated network stack in NewtOS, which goes from 200 Mbps in MINIX 3 to 10G for TCP in NewtOS.
[missed the rest of the talk; something about heterogeneous cores and resource efficiency]
[I missed most of the final session; sorry]