
Liveblog: EuroSys 2012 — Day 3

Various people from Cambridge are currently in Bern for EuroSys 2012, and will be reporting live from the conference here, as well as summarizing the trends and highlights afterwards.

The final day of the conference starts with a keynote from Steve Furber, followed by two more sessions. Click the link to read the details!

Keynote

Biologically-Inspired Massively-Parallel Computing
Steve Furber (Manchester University)

Bit of history: the BBC Micro had more than 100 individual chips on the mainboard, mostly designed by third parties, but some by Acorn. Because early 16-bit microprocessors were insufficient for Acorn's purposes, they decided to design their own. ARM = Acorn RISC Machine! But Acorn was too small to really support the processor. The spin-off ARM Ltd. was formed because Apple wanted to use the ARM processor in the Newton, but did not want to work directly with a competitor. Timely, because SoCs took off just then, and the ARM could easily be integrated into an SoC due to its simplicity and small size -- leading to ARM's eventual success. Its low-power nature is a bit of an accident, though: they had to keep it under 1W due to the plastic casing they were using, but they ridiculously over-engineered it because simulation tools were bad at estimating power dissipation. ARM is now the most successful chip design in the world by volume (>30 billion in 2011).
Manchester: the Baby (1948) -- the first working stored-program computer; followed by the Ferranti Mark 1 and Atlas. The latter was the first machine to have virtual memory and hardware floating point. In the 1980s, Manchester was heavily involved in data-flow machines; then came Amulet in the 1990s and the "Brain box" torus machine in the 2000s, using the SpiNNaker CPU. Obviously, the evolution in performance is staggering, but what's even more amazing is the energy efficiency: 5 J per instruction for the Baby, 2*10^-10 J for SpiNNaker -- a 25-billion-fold improvement! That said, Moore's Law, while still working, is increasingly challenged at atomic scales. Furthermore, we are likely to have processors with a significant fraction of non-working transistors. One great unsolved problem is the programmability of multi-core processors for general-purpose desktop computing -- most applications hardly use more than one core!
However, let’s imagine that problem has gone away. Imagine an unlimited supply of free, or almost free, processors. This makes load-balancing irrelevant, and what matters is to minimize energy consumption, avoiding expensive synchronization and embracing non-determinism. But, thinking about it, these are properties that human brains have -- massive parallelism, massive connectivity, high energy efficiency, adaptivity in the face of failure and autonomous learning, despite being built from low-speed “components” (Hz, metres/second). So what might happen if we deliberately make our computers slower to improve energy efficiency?
So, let’s be inspired by biology. The tricky thing is, we still don’t really know how the human brain works. Maybe we should use our current computer infrastructure to work it out? Here are the basics: neurons are multi-input, single-output logic-gate-like structures, arranged in a regular 6-layer cortical “microarchitecture”. The SpiNNaker project is trying to build a machine design that amounts to about 1% of the human brain -- a generic platform for modelling the brain and neural networks. Physical and virtual topology are independent, and the machine is boundedly asynchronous. The interconnect is a 2D mesh, forming a torus. Communication is the hard problem; the innovation is packet-switched on-chip communication. The programming model is fundamentally event-driven: packets model neuron spikes, and processors react to them. They have a prototype board with 3 (?) chips on it, and a demo of a robot controlled by a neural network. Currently, they are working on scaling up to 48 cores. Their interconnect is not terribly high bandwidth, but can handle a very large number of tiny packets (~10B/s for a 1M core system), because their neural network applications typically pass lots of tiny impulses.
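To make the event-driven programming model a bit more concrete, here is a minimal sketch in C of how a core might react to incoming spike packets. Everything here (the packet structure, weight matrix, threshold and on_spike handler) is invented for illustration; it is not SpiNNaker's actual API.

    /* Minimal sketch (not SpiNNaker's real API) of the event-driven model:
     * tiny spike packets arrive, and each core reacts by updating the neurons
     * it models, possibly emitting new spikes of its own. */
    #include <stdio.h>

    #define NUM_NEURONS 4
    #define THRESHOLD   3

    /* A spike packet just names the source neuron; payloads are tiny. */
    typedef struct { int source_neuron; } spike_packet;

    static int potential[NUM_NEURONS];              /* membrane potential per neuron */
    static int weight[NUM_NEURONS][NUM_NEURONS] = { /* hypothetical connectivity */
        {0, 1, 0, 2},
        {1, 0, 2, 0},
        {0, 2, 0, 1},
        {2, 0, 1, 0},
    };

    /* Event handler: called whenever a spike packet is delivered to this core. */
    static void on_spike(spike_packet pkt)
    {
        for (int n = 0; n < NUM_NEURONS; n++) {
            potential[n] += weight[pkt.source_neuron][n];
            if (potential[n] >= THRESHOLD) {
                potential[n] = 0;                   /* reset after firing */
                printf("neuron %d fires (triggered by %d)\n", n, pkt.source_neuron);
                /* The real system would inject a new packet into the
                 * packet-switched interconnect rather than recurse locally. */
            }
        }
    }

    int main(void)
    {
        spike_packet inputs[] = { {0}, {3}, {0}, {1} };
        for (unsigned i = 0; i < sizeof inputs / sizeof inputs[0]; i++)
            on_spike(inputs[i]);
        return 0;
    }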

 

Session 9: Towards better Kernels

Improving Interrupt Response Time in a Verifiable Protected Microkernel
Bernard Blackham, Yao Shi, and Gernot Heiser (NICTA and University of New South Wales)

People want to build hard real-time systems in a trustworthy way and with mixed criticality (consolidation of hard RT and best-effort functionality). This work is based on seL4, which has a small TCB, a rigid spec and formal proofs of conformance to the spec. The goal here is to run hard RT applications and best-effort applications on the same system with no interference to the hard RT ones. They are particularly concerned about interrupts. It’s been shown that interrupt latency can be computed on a fully formally verified kernel, but only without pre-emption. They would like to have pre-emption, in order to be able to satisfy hard RT requirements. However, this somewhat conflicts with the verifiability of the kernel (didn’t quite understand why). seL4 is ~8.7k LOC of kernel code in C, and 200k lines of proofs in Isabelle, corresponding to 25 man-years of effort. The proofs work by translating between states, showing that the C code takes the system from one state to another in a way that parallels the abstract model. Kernels can be event-based or process-based; the former has less state that needs to be considered for verification (a process-based kernel makes the proofs much harder), so seL4 is event-based. The proofs also show that global invariants always hold. In fact, the majority of the proof work goes into maintaining invariants -- so avoid breaking them to avoid having to re-do lots of proving work. They leverage the concept of incremental consistency, where composite data structures are built from individual components that maintain invariants at intermediate steps. Examples: aborting IPC and lazy scheduling. The latter is to do with re-ordering of the scheduler run queues in the face of frequent IPC. Lazy scheduling avoids re-ordering and leaves blocked threads in the queue. This is bad for worst-case interrupt latency, so they instead use “Benno scheduling”, which keeps only runnable threads on the run queue, and manipulates external structures if things change. Unfortunately, this introduced two additional invariants to do with the semantics of Benno scheduling, and they are still working on this (3 person-months already, 4 more to come!). They also have an example about badged IPC and aborts; the details are intricate, but they basically introduce pre-emption points between actions, while remaining in a consistent system state. Final example: object creation (a common op in micro-kernels). For verification, they forced object batches to be power-of-two in size, and to finish on power-of-two boundaries. So on creation, they need to allocate memory for 2^n objects and initialize them, which takes a long time (more than the 100ms hard RT requirement), and it’s unclear where to integrate a pre-emption point here. So what they do is to allocate the region for each object, initialize the data and the meta-data, and then check for interrupts (I’m not entirely sure how that differs from a pre-emption point?). Turns out that breaks an invariant that took them 9 person-months to fix, and also uncovered bugs in 5 lines of simple code. They have a single set of graphs, which shows that the worst-case latency of operations (e.g. syscalls) that might delay an interrupt has been reduced to below 100ms.
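As a rough illustration of the pre-emption point idea -- this is my own sketch in C, not seL4 code; interrupt_pending(), handle_interrupt() and the kobject type are made up -- the object-creation loop only polls for interrupts once the current object is fully initialized, i.e. at a point where all kernel invariants hold:

    /* Hedged sketch of a pre-emption point in a long-running kernel operation:
     * check for pending interrupts between objects, but only when no object is
     * left half-initialised (so invariants are maintained at every check). */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int id; bool initialised; } kobject;

    /* Stand-in for the kernel's "is an interrupt pending?" check. */
    static bool interrupt_pending(void) { static int n; return (++n % 4) == 0; }

    /* Stand-in for taking the interrupt; a real kernel would suspend the
     * current operation and restart it from this consistent point. */
    static void handle_interrupt(void) { puts("  pre-emption point: handling interrupt"); }

    static void create_objects(kobject *objs, unsigned count /* power of two */)
    {
        for (unsigned i = 0; i < count; i++) {
            objs[i].id = (int)i;
            objs[i].initialised = true;    /* object i is now fully consistent */

            /* Only now, with no half-initialised object, is it safe to pre-empt. */
            if (interrupt_pending())
                handle_interrupt();
        }
    }

    int main(void)
    {
        kobject objs[8];
        create_objects(objs, 8);
        printf("created %zu objects\n", sizeof objs / sizeof objs[0]);
        return 0;
    }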

 

Improving network connection locality on multicore systems
Aleksey Pesterev (MIT), Jacob Strauss (Quanta Research Cambridge), and Nickolai Zeldovich and Robert T. Morris (MIT)

Multi-core is upon us, and applications should scale. Apache doesn't: per-core throughput drops as more cores get involved. This is on "stock Linux", which has some bottlenecks in the kernel (coarse-grained locking, state sharing between cores). If you remove the bottlenecks, things get much better and much more scalable. They developed a thing called an "affinity-accept listen socket", which supports fine-grained locking and does not share connection state between cores. Modern multi-queue network cards distribute connections across cores; the assignment is via a software-modifiable routing table. To alleviate the data-sharing problem, simply co-locate threads serving the same connection! But moving a user thread is expensive and limits you to a thread-per-connection mode, and re-routing packet deliveries is impossible because the NIC does not have enough memory for the necessary state. Instead, they observe that no moving of anything is necessary: they have per-core accept queues, and at accept time, they pull a connection off the core-local queue, which means that threads end up co-located. This also reduces locking granularity, as locks are now per-core. But life is not this simple -- if one core is overloaded, it will start dropping incoming connections (i.e. fail to accept()). So, in that case, we need to somehow re-route to idle cores. What they've done is basically work-stealing for network connections -- they could modify the NIC's mapping to send the connection elsewhere (it remains somewhat unclear who does this; there is some kind of single load balancer), but this is too heavyweight. So, in the short term, they simply pull connections out of other cores' queues if the local queue is empty, and accept the penalty due to broken locality for the new connection (and then later re-balance). Of course, the migration business is somewhat challenging with regard to atomicity and consistency. They address this by moving to a fine-grainedly locked (per-row) hash table for requests, which is only 2% slower than the coarse-grained version in stock Linux. Evaluation: they use an Apache/lighttpd workload. First, they look at round-robin accept across all NIC queues -- this load-balances nicely, but has no locality. This variant is called "fine-accept" (it only has the fine-grained locking), and it gives a 180% improvement in scalability over stock Linux. "Affinity-accept" (not using round-robin) gives another 24% on top. They also looked into the case when few new connections arrive (since their optimizations really address the accept stage). Turns out fine-accept vs. stock makes little difference, but affinity-accept is still better, due to locality (NUMA effects, presumably). Throughput with many active connections is also massively improved by both fine-accept and affinity-accept. I'm not sure I understand why fine-accept should make a difference to that, though.
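Here is a rough sketch, in C, of the per-core accept queue with stealing as I understood it. The queue layout, affinity_accept() and the constants are my own invention (with per-core locking omitted), not the authors' kernel code:

    /* Sketch of affinity-accept's core idea: accept() first drains the local
     * core's queue; only if that is empty does it steal from another core,
     * accepting the temporary loss of locality for that connection. */
    #include <stdio.h>

    #define NUM_CORES 4
    #define QUEUE_LEN 8

    typedef struct {
        int conns[QUEUE_LEN];   /* pending connection ids */
        int head, tail;
    } accept_queue;

    static accept_queue queues[NUM_CORES];  /* one accept queue per core */

    static int queue_pop(accept_queue *q)
    {
        if (q->head == q->tail) return -1;              /* empty */
        return q->conns[q->head++ % QUEUE_LEN];
    }

    static void queue_push(accept_queue *q, int conn)
    {
        q->conns[q->tail++ % QUEUE_LEN] = conn;
    }

    /* accept() on `core`: prefer local connections, steal if starved. */
    static int affinity_accept(int core)
    {
        int conn = queue_pop(&queues[core]);
        if (conn >= 0) return conn;                     /* fast, local path */

        for (int other = 0; other < NUM_CORES; other++) {
            if (other == core) continue;
            conn = queue_pop(&queues[other]);           /* slow path: steal */
            if (conn >= 0) return conn;
        }
        return -1;                                      /* nothing pending anywhere */
    }

    int main(void)
    {
        queue_push(&queues[1], 42);   /* NIC steered connection 42 to core 1 */
        queue_push(&queues[2], 43);   /* ... and connection 43 to core 2 */
        printf("core 1 accepts %d\n", affinity_accept(1));  /* local fast path */
        printf("core 0 accepts %d\n", affinity_accept(0));  /* steals from core 2 */
        return 0;
    }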

 

Session 10: Cores galore

TM2C: a Software Transactional Memory for Many-Cores
Vincent Gramoli, Rachid Guerraoui, and Vasileios Trigonakis (EPFL)

Hypothesis: TM is the way to exploit many-core architectures in a simple way. Cache coherence (CC) is going away, and message passing (MP) is the future. But programming for MP is tricky and cumbersome, and we would like to leverage our existing shared-memory algorithms. So, enter TM to save the day. On top of CC, TM is called STM; on top of MP, it is DTM (distributed TM). Their system, TM2C, is a DTM, and its novel feature is that it guarantees termination (i.e. non-starvation). They implemented this and evaluated it on the SCC. Claim: first TM for many-cores, first starvation-free TM. It hides all message passing, and is fully decentralized. There is an application runtime, and a DTM service. In particular, the DTM service contains a distributed lock service. Upon conflict, this service calls the “contention manager”, which resolves the conflict and guarantees liveness. The application-level API is the usual one; the library eagerly acquires read locks, but lazily acquires write locks. How does this map to many-core? You can either time-share between DTM and application on all cores, or dedicate some fraction of the cores to each. They took the latter approach, because it performs better. FairCM is their conflict resolution algorithm (I think), and it favours those transactions that have used less transactional time. It is starvation-free, because there is always a core with the lowest time budget used (which gets preference). I wonder if there is some interesting circular scenario that can make this untrue... Evaluation: on the SCC, using the default/slow performance mode, with N/2 application cores and N/2 DTM runtime cores. So how does it compare to a sequential implementation? Use case: a hash table, varying the number of buckets and the update rate. At a 50% update rate, they get 20x speedup (optimal: 24, due to 24 cores). Lower update rates result in lower speedups (~10x for 20% on 3072 buckets). As the number of buckets is decreased, the speedup decreases, obviously, because more conflicts occur. So how does it compare to a parallel implementation using locks? They look at a global lock, TM2C with FairCM and TM2C with Offset-Greedy, for a simulated “banking” workload. Offset-Greedy sucks (worse than the global lock); FairCM beats both and scales best. So, how does this work on other platforms? TM2C only needs message passing, so it works on CC machines with an MP library. They choose the usual 48-core Opteron. Interestingly, the SCC800 version achieves the same or better performance than the Opteron (which does MP in software). However, the Opteron scales better.
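Here is an illustrative sketch in C of the FairCM decision as I understood it from the talk (the tx_context structure and faircm_resolve() are made up, not TM2C source): on a conflict, the transaction whose core has consumed less cumulative transactional time wins, so the core with the smallest budget used always makes progress.

    /* Hedged sketch of a FairCM-style contention manager decision. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int      core_id;
        uint64_t tx_time_used;   /* cumulative time spent in transactions */
    } tx_context;

    /* Returns the transaction allowed to proceed; the loser aborts and retries. */
    static const tx_context *faircm_resolve(const tx_context *a, const tx_context *b)
    {
        if (a->tx_time_used != b->tx_time_used)
            return (a->tx_time_used < b->tx_time_used) ? a : b;
        return (a->core_id < b->core_id) ? a : b;   /* deterministic tie-break */
    }

    int main(void)
    {
        tx_context t1 = { .core_id = 3, .tx_time_used = 1200 };
        tx_context t2 = { .core_id = 7, .tx_time_used =  400 };
        printf("core %d wins the conflict\n", faircm_resolve(&t1, &t2)->core_id);
        return 0;
    }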

 

BWS: Balanced Work Stealing for Time-Sharing Multicores
Xiaoning Ding (Intel ISTC for Cloud Computing), Kaibo Wang (The Ohio State University), Phillip B. Gibbons (Intel ISTC for Cloud Computing), and Xiaodong Zhang (The Ohio State University)

Work stealing is a common technique for load balancing concurrent applications. They look at this in the context of multiple threads, with each thread having an individual work queue. Time-sharing is widely used in OSes, but work stealing has mainly been used in dedicated environments (not sure I buy that claim). To get work stealing in time-shared environments, idle worker threads should yield, falling back into the scheduler so that threads with work to do can run. Problems with this approach: unfairness, because OSes penalize applications that yield frequently by lowering their priority; low performance, because the yield may not lead to a successful re-scheduling (why wouldn’t it, if there is work waiting?! [ed] apparently if the yielding thread still has the highest priority, or if idle threads yield to each other); and more performance variance, as task wait times are not uniform. The main problem they seem to run up against is that OS scheduling is unaware of application scheduling requirements (sounds like what they want is an exokernel-style externalized scheduler!). Apparently, for some applications they benchmarked, performance degraded by 15-18% when running in parallel with an unbalanced number of threads. Their algorithm, Balanced Work Stealing, includes a balancing mechanism that adjusts the number of workers as a function of current parallelism, and a targeted yielding mechanism. The balancing is done simply by putting threads to sleep and waking them up to scale down/up, while targeted yielding requires OS support. When workers sleep, other applications can get in. There’s some stuff about “off-loading” the wakeup to other threads to decentralize -- not sure I understood. For the targeted yielding, they modified ~100 LOC in the Linux kernel (presumably to add the syscall). Evaluation: this is based on Cilk++, on a 32-core (?) Xeon, with various applications at different levels of parallelism. They define a bunch of metrics for slowdown, unfairness and weighted throughput. Average unfairness with default Cilk++ is 151% (of what?), which goes down to 13% for BWS. Looks like it’s highly application-dependent, though. Throughput goes up in all cases (+33%).
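A toy sketch of the sleep/wakeup side of the idea, using plain pthreads -- this is my interpretation rather than the BWS implementation, and the real system additionally needs the targeted-yield kernel change mentioned above. An idle worker sleeps instead of spinning or blindly yielding, and is woken when new parallelism appears:

    /* Sketch: a worker that finds no work (locally or by stealing) sleeps so
     * other time-shared applications can run; a thread with surplus work
     * wakes a sleeper back up when parallelism increases. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wakeup = PTHREAD_COND_INITIALIZER;
    static int  pending_tasks = 3;      /* toy stand-in for the work queues */
    static bool done = false;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done || pending_tasks > 0) {
            if (pending_tasks > 0) {
                pending_tasks--;                    /* "execute" one task */
                printf("worker ran a task, %d left\n", pending_tasks);
            } else {
                /* No work anywhere: sleep instead of spinning or yielding blindly. */
                pthread_cond_wait(&wakeup, &lock);
            }
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);

        /* Later, new parallelism appears: add work and wake a sleeping worker. */
        pthread_mutex_lock(&lock);
        pending_tasks += 2;
        pthread_cond_signal(&wakeup);
        pthread_mutex_unlock(&lock);

        /* Shut down. */
        pthread_mutex_lock(&lock);
        done = true;
        pthread_cond_broadcast(&wakeup);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
    }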

 

This is it, the conference ends -- thanks for tuning in, see you again in Prague next year! :)
