Liveblog from APSYS 2013

Matt and myself are at the Asia-Pacific Systems Workshop today, presenting our paper on distributed operating systems in data centers, and we'll be live-blogging the workshop for you.



Registration stats: only 60+ registrations this time (down from last time), 31% Singapore, 43% from rest of Asia, 17% North-America, 6% Europe and 3% Australia. 11 student travel grants awarded (with SIGOPS/MSR support), up from 9 last time.

Program stats:

  • 73 total, 28% up from previous record; especially Asia-Pacific submissions increased (by 73%); 53 total from Asia-Pacific region.
  • Double reviewing load on PC as more than expected submissions.
  • 36 papers selected for discussion, 5 accepted outright, 15 discussed and re-ranked, ties broken by "likelihood to generated discussion"; face-to-face PC meeting planned, but not necessary as consensus was reached.
  • 20 total accepted papers, 11 from Asia-Pacific, 9 from elsewhere.
  • 27% acceptance ratio overall.
  • PC 50-50 split between Asia-Pacific and U.S./rest-of-world.
  • First time ever to give best paper/best student paper awards -- to be announced at banquet.


Keynote: “Can truly dependable systems be affordable?”

Gernot Heiser, UNSW & NICTA

  • Running a sizeable project on system verification and trustworthyness. Why?
  • Present day systems aren't trust worthy.
  • Lot's of things have been hacked. ATMs, cars, pacemakers, drones.
  • High assurance systems. Every line of code costs about $1000.
  • eg: Pacemaker - RTOS + drivers + network stack + monitoring = 10,000's of lines of code.
  • In high assurance domain, isolation provided by VMare/Xen/KVM etc. Isolation provided by 10,000's of lines of code.
  • Best practice - separation kernels - small enough for a single programmer to understand.
  • seL4 - mathematical proof of isolation - nice but .. what about cost sensitive environments?
  • There is a refinement proof for seL4 from high level goals all the way down to the binary. Exclusions: Init, privileged state and caches, multicore.
  • But what did it cost? 20.5 person years for refinement proof.  Integrity: about 1py. Confidentiality: 4.5py Availablility: 0py. Binary and WC exec time about 2pys each.  - Around $400 PLOC.
  • In comparison to state of the art. High assurance $1000 loc - still no guarantees, low assurance $100-200loc, no guarantees.
  • Estimate for a repeat is about half the cost -- about the same cost as doing L4::Pistachio.
  • Note: seL4 is aggressively optimised -- easy because proofs can be rerun over night.
  • What did we learn? - seL4 is probably about as secure as as existing kernel. BUT, there is a a foundation on which proof can now be founded allowing end-to-end trust, roughly the cost of traditional software.
  • How? Combine theorem proving with synthesis and DSLs.
  • Next step: High assurance drones. Modified arducopter.
  • Inside: Mission board - seL4 + linux.  Control board with a new verified RTOS generated by synthesis. Components connected via CAN bus.  Distributed system is nice because it is real.
  • Architecture: component architecture, using well known patters, glued together but auto generated  from a DSLs and synthesis.
  • More about synthesis - can synth drivers from a spec of the OS interface and the device interface.
  • More about DSLs: glue code is trivial, file systems and network stacks. Should be able to generate these from DSLs.
  • Files systems code has lots of error handling and marshinling code so should be able to generate this and reduce code size by 5x-10x.
  • Using a DSL means you can use a generator to produce execution code and verification code. Much less work for the verifier, none of the boring bits.
  • Using both DLS and synthesis means you can reduce the amount of effort to do full system verification.
  • Vision: Verified high-level runtime - If you can verify the runtime, you only have to verify the the programs at the high level semantics -- which again gives a productivity boost.



  • Formal methods aren't that much more expensive than SOA for about 10K LOC
  • We think we can scale bigger and cheaper if we componentise, synthesise and abstract with DSLs.
  • Big challenge is proof composition.


Q: What about proof maintenance? A: It's about the same effort as regular software.

Q: Know how transfer. Very high level (expensive) skills required to do the proofs: A: True. Using the right techniques (DSLs/synth etc), we can push this deeper into industry.

Q: Can you techniques be adapted for Linux/Windows? A: Yes. This is part of it. Linux synth'd drivers/filesystems work already.

Q: ARM trust zone? Comments? A: Trust zone is a hack. Limited in what it can do.

Q: What's the limit? How far can you go? A: Intentionally only talking about 1 order of magnitude. Components make verification easier, but make the performance harder.



Session 1: OS and Performance I


Schedule processes, not VCPUs!

Xiang Song, Jicheng Shi, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

Many data centers use virtualization these days, and run multi-core VMs on top of multi-core CPUs. This leads to the classical problem of two-level scheduling: the hypervisor tries to schedule vCPUs over physical cores, and the guest OS tries to schedule processes over its vCPUs. The dilemma is that there is no information shared between these, and a semantic gap. Two key examples: vCPU preemption during a critical section (other vCPUs waiting for it e.g. in spinlock waste CPU time) and stacking (same issue, except with a vCPU blocked on another (I think?).

Some indicative numbers on the streamcluster benchmark shows that overheads of 2.5-5x (KVM) and 2-4x (Xen) are not uncommon when using 2-4 VMs on a 12-core Intel machine. Somewhat counter-intuitively, having a 12-vCPU VM run atop a 12-core machine may end up performing worse than a 6-vCPU one (on wordcount/histogram benchmark).

The solution proposed is VCPU ballooning: exclude the hypervisor from dynamic vCPU scheduling (basically seems to be pinning vCPUs to pCPUs), and decide the number of vCPUs for each VM according to some weight. Might need to balloon vCPUs, i.e. take them offline e.g. when VMs are migrated/deleted/created. There are existing solutions to this that have some problems. One counter-argument to vCPU ballooning is that applications might try to optimize the number of threads independently, and end up with a thread-vCPU mismatch, but they argue that this isn't a big deal: long-running server applications can typically adapt to CPU hotplug, and short-running applications will only be unbalanced for a short time.

Evaluation: some benchmarks from PARSEC and Phoenix MapReduce; vCPU ballooning achieves 3-35% speedup over Xen credit scheduler, while the affinity scheduler sometimes even degrades performance. Same picture with KVM, and less time spent in kernel in general. >100% speedup on Xen and KVM for Phoenix histogram/wordcount for their vCPU ballooning scheduler, while affinity scheduler degrades by a few percent.

Q (Nickolai Zeldovich): Why isn't using the hotplug support in the OS a good idea?
A: Xen supports hotplug insertion, but not removal. KVM does not support it in the simulator (?).



Exploring Microcontrollers in GPUs

Yusuke Fujii, Takuya Azumi, Nobuhiko Nishio, & Shinpei Kato
Ritsumeikan University & Nagoya University

GPGPUs are jolly good, and their performance keeps increasing very quickly. Now looking at 1500+ cores on nVidia Kepler GPUs. GFLOPS performance scales linearly, and importantly much faster than CPU performance (at much better performance per Watt). Great, if you have the right applications!

Much current research looks at OS support for GPUGPUs, e.g. Gdev (USENIX ATC '12). But GPUs still aren't sufficiently "general-purpose". The mean reason for this is that they do not do multitasking well, as they can't manage resource arbitration. So instead, resources are managed by the CPU in kernel-space. So in this work, they focus on putting a resource manager into GPU hardware, in order to avoid having to cross the boundary for resource management all the time. But modifying GPU hardware is hard due to closed-source "black-box" nature. Fortunately, the nouveau driver people are doing a pretty decent job at reverse-engineering the hardware and specs.

One thing that it looks like the found is that GPUs have a bunch of microcontrollers inside them, which have programmable firmware that can be replaced. But it's hard to do so -- must write assembly code for an undocumented black box. In this work, they provide a development environment for GPU firmware (compiler suite), build some firmware and add new features that leverage the microcontrollers.

They are porting the GUC framework (this seems to be an existing nVidia assembly thing from nouveau) to the LLVM infrastructure, so that one can write in any front-end and get GUC assembly from the backend (which they added). Obvious benefits of easier development, higher productivity etc. fall out.

Using this suite, they developed a "baseline" firmware that has the same features as the default, but exposes a bunch of microcontroller details (not clear how). In the extended version, they add microcontroller-based data transfer functionality for DMA from host memory to GPU memory.

Evaluation: compare basic firmware performance, find that theirs is as good as nVidia default and nouveau firmware. When comparing the data transfer performance between various existing mechanisms and their newly implemented method. They seem to to mostly lose out against conventional methods in serial data transfer (apart from a narrow transfer size range), but win by a small margin in the concurrent data transfer case.

Future work: resource management using microcontrollers!


Q: Why is there such large error bars in the evaluations?
A: PCIe and GPU memory variation.

Q (Malte Schwarzkopf, Cambridge): Most of the time conventional is better, why do you only win sometimes?
A: We're not sure why. But our methods can be used concurrently with current methods.

Q (Stefan Bucur, EPFL): How much of the black box reverse engineering work can be applied to other work? E.g. Radeons?
A: No, although they have microcontrollers too, so some principles might be transferrable.



Opportunities and pitfalls of multi-core scaling using Hardware Transaction Memory

Zhaoguo Wang, Hao Qian, Haibo Chen & Jinyang Li
Fudan University, Shanghai Jiao Tong University, New York University

Many people use coarse-grained locks for mutual exclusion. Can make fine-grained, but this is hard work. TM to the rescue! This has been hailed for years, and hardware TM has been around in high-end or prototype hardware for a while. But Intel Haswell now makes TM instructions available in general-purpose CPUs. Basic idea: use cache to track read/write set, and then use cache coherence to detect conflicts. Limitations: working set is limited (plus something I didn't catch).

They did a comparative study of RTM on real hardware, all the way from programming effort over compiler effects to comparison with traditional sync methods. Some lessons learnt: avoid memory alloc/dealloc in a TX region, RTM prefers reading to writing, compiler optimizations for removing memory accesses are beneficial inside RTM regions, fallback handler locks should occupy a cache line, different abort events should be handled differently and this should be tuned to the workload. Experiments using skip list from LevelDB (K-V store) using RTM emulator and real Haswell hardware.

Consider the case of insert into a skip list. Naive approach is to wrap the entire insert function into an RTM transaction. However, this failed to make progress at all, due to including memory allocation (for new node) in the critical region. So move the allocation out of the RTM region, compare on emulator and hardware. Vary number of parallel threads, measure TX abort rate. At 16 threads, get ~0.4 abort rate in emulator, but up to 1.6 on 32 threads. On hardware, aborts are much more likely however! Abort rate of 5 at only 4 threads. For another workload ("1M nodes"), the emulator failed to work due to cache eviction, but it worked on the hardware! Reason is that L1 on real hardware only tracks write set, while emulator tracks read and write set. So the hardware can tolerate read set cache line eviction.

Compiler study: workload is to insert 100k nodes concurrently to skip list, with different compiler optimizations turned on. Interestingly, things vary pretty wildly here (3x difference in abort rate, and no clear total ordering between optimizations!). Emulator and real hardware show almost opposite behaviour... When investigating reasons, they found that -O1 generates the fewest number of memory access instructions (and can hence outperform -O3).

Combining RTM with a traditional lock that is acquired on the abort path massively increases TX failure rate, and this execution time versus the simple retry approach. Can smarten this up a bit by checking which type of abort occurred (conflict or cache capacity induced), but find that this increases abort rate to ~3.5 at four threads. This is because there's a falsely shared cache line, so add some padding. Now looking much better: 0.17 abort rate instead of 3.5 for four threads!

Finally, compare against traditional sync methods. RTM has about the same performance as fine grained locks or lock-free skip lists.

Limitations by own admission: study limited to data structure behaviour (skip list), only simple micro-benchmarks and limited to 4 threads on real hardware available.

Q: The lock-free version in your comparison appears to have about the same performance as the fine-grained lock version. Are you sure you implemented it right?
A: The lock implementation is optimized (something about having to lock predecessors, didn't catch details).

Q: What about overlapping transactions? Need to take care with boundaries? Does TM really make writing the skip list easier?
A: Provable correctness is easier to attain when using hardware TM (?).

Q (Matthew Grosvenor, Cambridge): Did you do a bottleneck analysis? Specifically, what is the bottleneck of the lock-free version?
A: We have more results showing that lock-free and fine-grained locked version are equivalent in performance up to 40 threads.



The nonkernel: A Kernel Designed for the Cloud

Muli Ben-Yehuda, Omer Peleg, Orna Agmon Ben-Yehuda, Igor Smolyar, Dan Tsafrir

Where's the money in OS work? In "the clouds". Specifically, in very-short term rentals, i.e. resource micro-charging. Some providers sell "customizable resource packages", where prices are set based on supply and demand. Assume this all works out, we still have to deal with aligning economic incentives of cloud providers, clients and the OS (who also needs resources), and somehow manage to share resources reliably and with isolation at fine granularity. How do you build an OS for this kind of environment?

In fact, they claim that there is a single missing piece in historical OS development: architectural support for machine virtualization throughout the system. With this in mind, an OS kernel should not just optimize for performance, but also for cost! Second, the kernel should expose physical resources and get out of the way, giving applications means to manage their own resources while maintaining isolation.

The nonkernel is a hybrid kernel/hypervisor, and can run on bare metal (being a hypervisor) or on top of a legacy hypervisor (as a guest kernel). Effectively, it does as little as possible, and merely provides little more than hardware-assisted virtualization. It boots the machine, arbitrates contended resources (but only gets involved in the contention case, as I understand it), isolates applications and provides efficient user-space IPC (managing setup and tearndown of channels, it seems). Could build on top of existing OS, or from scratch. Not quite decided on which one yet...

Pros: good performance, zero-overhead virtualization, reduced driver complexity, a more secure system and higher system efficiency due to the integrated economic model.

Cons: clean back, no backwards compatibility, no legacy hardware (need architectural virtualization support), no legacy software.

Preempt question "isn't this just another Exokernel?" -- answer: this is Exokernels done right, as the hardware is now in a place where it can better support this and avoid drawbacks (e.g. downloading user code to kernel). Working on prototype, "nom".

Q: Relegating lots of functionality to applications. How do you ensure that a malicious application does not compromise the machine?
A: Not trivial to answer this. For DMA, use IOMMU, for other things use architectural virtualization support which we know how to use it safely.

Q: Do I have to rewrite my application to run on nom?
A: You should, but you might not have to. We provide defaults for IO stack, network stack etc. Not for everyone, this is for people who can take advantage of this.

Q (Gernot Heiser, NICTA): How is this different from a minimal hypervisor like NOVA?
A: The big difference is that we allow applications to manage their resources.
Q: This should be easy using just capabilities.
A: No, isn't easy. Applications can dynamically tweak resource requests based on economic model.



Session 2: Mobile

The Systems Hacker’s Guide to the Galaxy: Energy Usage in a Modern Smartphone

Aaron Carroll & Gernot Heiser 

  • Back in time we published this paper on intrumenting a mobilephone to figure out the power usage. 
  • This was great work and we had some nice publications, but the cricicism was that we didn't use a "real" device.
  • The device we used was a "open moko" which is a "free" phone, kind of designed exactly for this.
  • But can we do it on a "real" phone?
  • I said "no" but I was actually wrong.
  • We got hold of  a Galaxy S3, totally new at the time, and started to see if we could instrument it.
  • When you look at power distribution in a phone, there's a battery connected to a power management resistor and then there's a bulk inductor.
  • It turns out that inductors are quite large physically. So we can attach a sense resistor inline by "tombstoning" and "tee-peeing" the indocutor with a sense resistor.
  • So we managed instrument a whole bunch of cool stuff.

Things we learned?

  • This display is super expensive. 
  • And looking at video decoding is also reasonably expensive. But what if we did the decoding in software? Very very expensive! Hardware offload is very important.
  • Android "power used" meter is totally inaccurate.
  • We also looked at measuring the power usage of sensors. The interesting lesson is that sensors in general aren't that expensive. But the software loop used to read the sensors really is.
  • RAM power isn't that important. Always used in combination with other things. And the CPU/GPU is much more expensive.

What changed?

  • We have a couple of devices. The Galaxy S3, Open Moko and [?..].
  • In general things haven't changed that much. Idle, cpu without the screen, phonecalls, all roughly the same cost.
  • BUT - Screen density is increasing rapidly, and this is the most draining resource.



The Case for Mobile Forensics of Private Data Leaks:Towards Large-Scale User-Oriented Privacy Protection

Joseph Chan Joo Keng, Tan Kiat Wee, Lingxiao Jiang, & Rajesh Krishna Balan, Singapore Management University (SMU)


Why mobile privacy protection?

  • Mobile phones collect tons of data.
  • This makes this a treasure trove for malicious people.

What about current tools?

  • They don't really work. 
  • And people don't really understand the implications of the the permissions in the app store.

We're different:

  • Because we analyse logs forensically so that we can figure out if a privacy leak has happened. 

How useful is it? Methodology.

  • Used TantDroid as a leak detector. 
  • Sampled the 10 top apps on the Google play store.
  • Nexus One using android 2.4
  • Don't check weather the leaks are valid or invalid, just take them at face value.


  • 226 apps, 105 leaky 48%
  • Leaks are start up, user triggered
  • 3 main types of data leaked. IMEI number, course grain cellphone positioning, and IP based positioning.
  • 60% of apps leak data thanks to user actions.

Leak Cause Analysis

  • Instrumented android to log user events. 
  • Used machine learning to classify leaks using WEKA rule miner.

Leak prevention mechanism

  • Display visual notification over leaky aspects of applications visually.

Q (Matthew Grosvenor): You mentioned that you picked the top 10 apps, but your analysis has over 200?
A: As a result of our testing methodology, we ended up with a bunch more applications.

Q: (Aaron, UNSW) It's nice to be able to notify people about leaks, but can you bin applications to stop them leaking?  Some leaks are more serious than others. But disabling leaks  also disables functionality. We want users to have informed consent.



Feasibility Study of Mobile Phone WiFi Detection in Aerial Search and Rescue Operations

Wei Wang, Raj Joshi, Aditya Kulkarni, Wai Kay Leong & Ben Leong National University of Singapore

  • Lots of search and rescue operations all the time. 12 per day in the US And lots of people die.
  • Using drones are expensive and require expert operators.
  • Cheap hobby drones are available.
  • Previous work has used visual image recognition to find people, but it requires (heavy) good quality cameras, and good lighting to work.
  • Our insight? What about using wifi?
  • Phones regularly scan for WiFi signals, but, scan frequencies vary depending on the mode. This also varies depending on phone manufacturer.
  • We developed an app to increase scan freq. but to keep the screen off. We can keep the phone running for about 51 hours, which is pretty good. We could push the scan frequency up to 11x with battery life still up to 21hours.
  • Passive probing: Phones issue small numbers of these frames.
  • Active probing: If we send lots of RTS (request to send) frames, we can get a CTS (clear to send) response from the phones.
  • At 15m altitude, our range is about 200m.
  • Phone orientation matters. Veritcal orientation has better range than horizontal.
  • Conclusion: Wifi is available, battery life is ok, range is ok.

Q: (Gernot) what does it matter if your phone life is 21 hours the quad is only flying for 45mins? A: Use lots of quads.

Q: Why use wifi (1mw) use cellular its 1W! : A: This is a good idea, but the limitation is that cellular harder is heavier (WFT??)



Mobile Applications Need Targeted Micro-Updates

Alvin Cheung, Lenin Ravindranath, Eugene Wu, Samuel Madden, & Hari Balakrishnan MIT CSAIL


  • We'd like to be able to update our advertising API. But we don't want to wait for people to update their software to use new APIs. Or maybe add new functionality. 
  • We want't to be able to make small, parameterized changes to apps. "Micro-updates"
  • You could update the app in the app-store. But there's no push mechanism.
  • You could have an RPC sever, but the performance cost is high.
  • You could have an updater built in, but it's expensive.
  • Parameterised conditions are a pain. We don't only want updates, but we want targeted updates.
  • Satasuma: Micro-update framework + targeting framework.


  • Code annotations @allowUpdate. 
  • Compiler that understands the annotation and embeds the runtime.
  • Push notifications are sent to awake the runtime to download updates.

This is dangerous. How to restrict the class of changes?

  • Annotations have permissions and scoped pairs.
  • Updates are only allowed on specific keys.
  • "scope" may be "all" - do anything you want, or "sandboxed", side effect free statements + whitelisted only.

Need to to specify the which devices to apply this to:

  • Need to decouple the update from the location. 
  • We have a table of locations and quires that can be sent back. Devices push locations into the table, then we can push our updates back with SQL like queries.

How to apply the updates?

  • Android automatic updates are nice a but not everyone has it. 
  • Android allows dynamic class reloading.
  • iOS need to use function stubs and an interpreter.


  • We'd like to give users the possibility of constraints.  (power usage, privacy etc)
  • Optimisations - Too many round trips.
  • Code revisions? - How to you do dependencies, manage versions etc.
  • Permissions? - How to they compose etc. How do we resolve conflicts.

Desktop/Web might be interesting places to put this to.

Satsuma - Separates targets from names.


Q: Is a target, just a location? A: Location is useful, but, maybe you want more, is the wifi on, is the application installed.

Q: What about GCM, how it this different? A: GCM requires every developer host their server. Which is expensive and we'd like to generalise this.

Q: How do you know where to annotate? A: Maybe the compiler can set everything to annotated and  then optimise later.



Session 3: OS and Performance II

New wine in old skins: the case for distributed operating systems in the data center

Malte Schwarzkopf, Matthew P. Grosvenor, & Steven Hand University of Cambridge Computer Laboratory & Microsoft Research Silicon Valley 

We make the case that distributed operating systems, an idea from the 1980's are worth a revisiting in the case of the modern operating system.


Challenges in the Emulation of Large Scale Software Defined Networks

Arjun Roy, Ken Yocum & Alex C. Snoeren, University of California, San Diego

  • Networks impact overall system performance.
  • Testing in production doesn't really work for networks.
  • Test beds are ok, but need to be the same size as the real network.
  • Simulation is ok, but can't run real code.
  • Emulation allows us to do more accurate models but how?
  • SDN intro.
  • Emulation: Mininet-HiFi - Tried to reproduce real results with scaled down networks. 1G -> 10mb/s. Seems to work for small scale networks? But does it work if you scale up?
  • But there are issues. Example scenario - 3 tiers of machines, front middle and back ends, connected by a switch.
  • The clogged drain problem: Any given link can have 1000's of flows, but if you shrink the links, you need to shrink the flows as well.
  • Flow initiation rate doesn't decrease as quickly as the flows complete as you shrink down the network rate.
  • Why is this bad? Cache crashing happens as link bandwidth is shrunk. This is contention in the switch.
  • Flow completion times are a result of the switch contention. Big increase in flow completion times as a result of link shrinking.
  • It seems like the emulator is lying to us.In this case, it's lying, but it's conservative. But not always.
  • Straggling herd problem. Emulation spreads out completion times resulting in reduced contention.
  • Might be tempted to fix the flow table churn. But fixing the clogged drain problem makes the straggling herd problem worse.

Shrinking networks can yeild incorrect conclusions, conclusions can be either unrealistically optimistic or pessimistic.



Storage-class memory needs flexible interfaces

Haris Volos, Sankaralingam Panneerselvam, Sanketh Nalli, Michael M. Swift

SCM technologies: phase-change memory (PCM), spin-transfer torque DRAM, memristors etc. -- all non-volatile RAM technologies. SCMs are as fast as DRAM, but unlike DRAM, they're persistent. Been heralded as the next awesome thing for a while, but might actually be around the corner now.

With SCMs, some things, such as disk request scheduling, become obsolete. As they expose load/store instruction interfaces, can also get rid of the device driver and a bunch of other layers. Nevertheless, many designs rely on a filesystem atop the SCM devices. This is because FSes provide convenient abstractions for global naming, protected sharing and crash consistency. However, this additional layering can inflate the quick access time to SCMs. For this reason, people have added a fast-path for data, bypassing the FS for bulk operations, and only using the FS for meta-data. However, even meta-data operations matter: 20-50% of time for various workloads (file server, web proxy, web server) is spent doing such operations! For example, a meta-data operation takes about 800ns on RAMFS, which is 10x the access latency of an SCM. Indeed, hierarchical FS traversal can be even more expensive.

So wouldn't it be nice if we could bypass the POSIX abstractions for applications that want to do high-performance SCM access? This is what their Aerie library FS provides. The key here is that all meta-data lookup and hierarchy traversal (and thus many meta-data operations) can be done in user-space, without involving the kernel. The only thing that crosses the kernel boundary is the calls into the (specialized) SCM manager. But can't trust user-mode libraries with metadata integrity and sharing leases, so they also have a trusted user-space service that implements this.

Using this libFS structure, they implemented two file systems: a POSIX-like FS, PXFS, for legacy compatibility, and a key-value interface FS, KVFS. Compared these against ext3, and find ~30% latency reduction compared to ext3 on web proxy and file server, but an 18% increase in the web server (likely because of the cost associated with the open() call, which is common in this environment). RAMFS does pretty well too, but KVFS reduces the latency by 66% over ext3 and 16% over RAMFS for the web proxy workload, while giving better guarantees than RAMFS.

Q: Why do you compare with ext3, not ext4? All performance optimization was done on ext4 in recent years, so should look at that to be fair!
A: Will look at in future work.

Q: Presumably application developers will rewrite their applications for SCM. Why do you need to support some of the more arcane legacy APIs (as mentioned in future work)?
A: I guess people will have to rewrite applications if they want to reap the benefits.



Optimizing RAM-latency Dominated Applications

Yandong Mao, Cody Cutler, & Robert Morris

RAM latency can dominate application performance, for example if you need to follow long pointer chains or the working set exceeds cache size. Lots of cache misses lead to many memory accesses and load on the memory bus. Their solution is to treat RAM more like a disk (but with different latency magnitude), and apply the same optimization techniques: batching, sorting, pipelining etc.

Experimental environment: six-core Xeon X5690, one memory controller with three channels that each have a row buffer cache inside them. It also supports a bunch of prefetching features (in hardware and software), parallelism across channels and executes instructions out-of-order. Note that row buffer cache hits are 2-5x faster than misses, and sequential access has 3.5x higher throughput than random access.

Consider a garbage collector example: long pointer chains, but with no predictability of next locality. So each access ends up generating a cache miss and stalls for RAM access. Linearisation of requests could help here, so on each GC cycle, arrange objects in tracing order, and make use of prefetcher. For 1.8 GB working set of HSQLDB, found that sequential order tracing is 1.3x faster than random order. In future work, would like to use a better linearization algorithm than copy collection.

Second example is using Masstree: for this key-value store, it is not possible to linearize memory accesses, as the storage abstraction is a shared B-tree. Despite careful design, Masstree is still RAM latency-dominated, as each key lookup follows a random path. To improve this, use batching and interleaving of tree lookups. Specifically, they use software prefetching to interleave tree lookup and RAM fetch. On a single core, this interleaving strategy can improve the performance by 30% with a batch size of five.

Parallelization also helps, as the RAM can actually fetch stuff in parallel, so using more cores helps. Up to 12 hyperthreads, the total number of RAM loads scales linearly with the Masstree throughput! Can also combine parallelization and interleaving, and still get 12-30% improvement over just parallelization, although this tails off with more threads.

Conclusions: interleaving seems more general than parallelization, but is also more difficult, especially if we try to do it automatically and without programmer input (might be impossible without some help). Found that the interleaving technique can also be applied to hash tables, which receive a 1.3x throughput boost (for memcached). Would also be nice to have tools to identify RAM stalls -- they had to use an indirect technique based on function costs and function operations.

Q: You propose two techniques that can improve RAM access throughput. But they're highly tailored towards the application and data structures. How do you think you can generalize your techniques or make them application-independent?
A: Don't claim that all of them can be applied to all applications. But categories of techniques might work for categories of applications, so something of a pattern-based approach might work.



Optimizing the File System with Variable-Length I/O for Fast Storage Devices

Yongseok Son, Jae Woo Choi, Hyeonsang Eom, & Heon Young Yeom, Seoul National University



  • Lots of fast SSD storage devices are around.
  • Median file size on the desktop is about 4 kB; in cloud and HPC workloads, 25%-40% of files are <4 kB.
  • Fast storage has poor performance at small request sizes.
  • HDD vs SSD:
    • HDD -- mechanical overhead dominates, so request size has no effect on overall latency.
    • SSD -- access time between 0.08 and 0.16 ms, so request size does affect overall latency.
  • Usually a syscall for a 1 kB read results in a 4 kB read (a block read, as for HDDs).
  • The Linux VFS masks byte-granularity request sizes with page-size granularity.
  • This results in fetching unnecessary data.


  • Extend the I/O subsystem to include byte-granularity information.
  • Use a sub-page management module.
  • Sub-pages are managed using a bitmap and a worker thread to sync.
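A rough model of the sub-page bitmap idea (our own sketch, with illustrative names, assuming 4 kB pages split into 1 kB sub-pages as in the talk):

```python
SUBPAGE_SIZE = 1024       # 1 kB sub-pages (illustrative)
SUBPAGES_PER_PAGE = 4     # 4 kB page / 1 kB sub-page

class SubPageMap:
    """Track which 1 kB sub-pages of each 4 kB page hold valid data,
    so a small read need not fetch the whole page from the device."""

    def __init__(self):
        self.bitmaps = {}  # page number -> int bitmap of valid sub-pages

    def mark_valid(self, offset, length):
        """Record the sub-pages that a byte-granularity write touched."""
        first = offset // SUBPAGE_SIZE
        last = (offset + length - 1) // SUBPAGE_SIZE
        for sub in range(first, last + 1):
            page, idx = divmod(sub, SUBPAGES_PER_PAGE)
            self.bitmaps[page] = self.bitmaps.get(page, 0) | (1 << idx)

    def subpages_to_fetch(self, offset, length):
        """Sub-pages a read must actually fetch (i.e. not yet valid)."""
        first = offset // SUBPAGE_SIZE
        last = (offset + length - 1) // SUBPAGE_SIZE
        return [s for s in range(first, last + 1) if not self._is_valid(s)]

    def _is_valid(self, sub):
        page, idx = divmod(sub, SUBPAGES_PER_PAGE)
        return bool(self.bitmaps.get(page, 0) & (1 << idx))

m = SubPageMap()
m.mark_valid(0, 1024)  # a 1 kB write leaves sub-page 0 valid
```

A background worker thread (as in the bullet above) would then be responsible for syncing these partially valid pages back to the device.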


  • Read operations <4 kB: 1.3-2.6x faster.
  • Write operations: 3.7-6.3x faster.
  • Write operations ≥4 kB are also faster.


Q: Have you sent your patch to the Linux kernel? A: No.

Q: Is your dataset random or real? A: Random. C: This is unfair, as Linux is optimistically caching.


RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups

Chun-Ho Ng & Patrick P. C. Lee, The Chinese University of Hong Kong



  • VM image checkpointing is common, but data is exploding -- growing faster than disk cost is shrinking.
  • Thus, use deduplication.
  • Most people use inline instead of out-of-line dedup.
  • But inline dedup causes fragmentation, and hence lots of seek operations.
  • Others have looked at this: (1) selective rewrites, and (2) container capping and prefetching.
  • All previous work was done in simulation.

Our work:

  • Goal -- mitigate fragmentation for reads on the latest backups.
  • Contributions -- a hybrid of inline and out-of-line dedup.
  • Two-phase deduplication is used: coarse-grained dedup first, then fine-grained reverse dedup.

Global dedup.

  • Segments are several MB in size.
  • Tries to apply dedup within a version, across different versions of the same VM, and across versions of different VMs.
  • Small memory usage for indexing.
  • Mitigates fragmentation by using large reads to amortize the cost.
  • Reverse dedup removes duplicates from earlier versions of the backups, which then reference the shared segments.

[Sorry, fell asleep, not because I wanted to. Jetlag + all night hacking the CamIO demo]


TinyFTL: An FTL Architecture for Flash Memory Cards with Scarce Resources

Jongmin Lee, Yongseok Oh, Hunki Kwon, Jongmoo Choi, Donghee Lee, & Sam H. Noh, Dankook University, University of Seoul & Hongik University


  • Flash memory is key to lots of things.
  • This paper looks at flash memory cards.
  • The main difference is the resources: on-card controllers are small microcontrollers with very little RAM.
  • Flash memory is unique because it can't be overwritten in place; data can be relocated, but erasure must be done on entire blocks at a time.
  • The mapping between segments and data is stored in a table, also on flash.
  • Short version -- flash storage requires caching and garbage collection.
  • This is done by the FTL (Flash Translation Layer).

We propose:

  • TinyFTL - Target is mobile storage. 
  • Need to minimise resource usage.


  • The FTL uses a summary page located in the last page of each block.
  • The overhead of the summary page is negligible.
  • In comparison to DFTL (currently popular), TinyFTL uses summary pages as well as checkpoints.
  • TinyFTL does efficient garbage collection. Three optimisation points:
    • Computation for victim blocks: DFTL compares whole blocks, whereas TinyFTL uses a window with a small valid-count field from the summary page, resulting in much lower computation.
    • Memory usage: DFTL's valid counts grow linearly with the size of the device; using windows means TinyFTL scales better.
    • ?
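Our rough reconstruction of the window-based victim selection (not the authors' code): rather than scanning valid counts for every block on the device, consider only a small window of candidate blocks and pick the one with the fewest valid pages, as read from each block's summary page.

```python
def pick_victim(valid_counts, window_start, window_size):
    """Greedy GC victim selection over a sliding window.

    valid_counts[i] = number of still-valid pages in block i, as
    recorded in that block's summary page. Scanning only a window
    bounds both computation and the memory needed for counts."""
    window = range(window_start,
                   min(window_start + window_size, len(valid_counts)))
    return min(window, key=lambda blk: valid_counts[blk])

counts = [12, 3, 7, 9, 1, 8]
victim = pick_victim(counts, 0, 4)  # block 1: fewest valid pages in window
```

The trade-off is that the window may miss the globally best victim, but the selection cost no longer grows with device size.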


  • Done in simulation from ESMC.
  • Three benchmarks -- Financial, MSN and something.
  • TinyFTL does better on all of the benchmarks.
  • TinyFTL also does better on boot-up.



Security bugs in embedded interpreters

Haogang Chen, Cody Cutler, Taesoo Kim, Yandong Mao, Xi Wang, Nickolai Zeldovich, Frans Kaashoek

Interpreters are everywhere these days, from the Linux kernel to Type 1 font renderers. Vulnerabilities in interpreters are also ubiquitous. Consider the example of a packet filter, e.g. for tcpdump. Could run the whole filter in user space, which is safe but slow. Could also run it in the kernel, but then we have untrusted user space code in the kernel. Traditional solution: Berkeley Packet Filter (BPF) + interpreter. This seems sound: the interpreter runs in the kernel, and byte code comes from user space. But the byte code and input data are still untrusted, and if a malicious user were to inject malformed code or data, bugs in the interpreter can compromise the kernel. There are real examples of this: e.g. the INET_DIAG infinite loop vulnerability -- a DoS resulting from passing in a zero "oplen" parameter via BPF. Second example: the ClamAV anti-virus signed division vulnerability, where a parameter check was implemented incorrectly and ended up with the interpreter trapping as a result of a division by zero. Finally, the third example is an arbitrary code execution vulnerability in the FreeType font interpreter: by passing a negative argument count, one can manipulate the interpreter stack pointer and thus break into the system.

Conclusion: writing secure interpreters is hard! Rest of talk will discuss some security guidelines for writing interpreters.

  • Best option is to not have an interpreter at all, but let's assume we do need one.
  • One option is to make the interpreter non-embedded and run it in a separate process (exceptions when this won't work: high performance required, or interpreter in kernel).
  • Keep the byte code as simple and as inexpressive as possible, to avoid as many pitfalls as possible. Don't overdesign!
  • Limit resource use (e.g. interpreter runtime, memory etc.) to avoid DoS of the host system.
  • Don't allow calling into external host system functions from byte code, as this can easily lead to arbitrary code execution vulnerabilities (as e.g. in Python's pickle interpreter).
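Two of these guidelines -- bounding interpreter runtime and validating untrusted byte code -- can be illustrated with a toy stack-machine interpreter (our own hypothetical byte code format, not any real one):

```python
class InterpreterError(Exception):
    pass

def run(code, max_steps=10_000):
    """Execute a list of (op, arg) instructions with a step budget and
    bounds checks, so malformed byte code raises an error instead of
    looping forever or corrupting the host."""
    stack, pc, steps = [], 0, 0
    while pc < len(code):
        steps += 1
        if steps > max_steps:             # resource limit: no infinite loops
            raise InterpreterError("step budget exceeded")
        op, arg = code[pc]
        if op == "push":
            stack.append(arg)
        elif op == "add":
            if len(stack) < 2:            # validate stack depth
                raise InterpreterError("stack underflow")
            stack.append(stack.pop() + stack.pop())
        elif op == "jmp":
            if not 0 <= arg < len(code):  # validate jump target
                raise InterpreterError("bad jump target")
            pc = arg
            continue
        else:
            raise InterpreterError("unknown opcode")
        pc += 1
    return stack[-1] if stack else None

# A self-loop like [("jmp", 0)] is caught by the step budget rather
# than hanging the host -- the same class of bug as the INET_DIAG DoS.
```

A kernel interpreter would enforce the same invariants in C, but the checks are the point: every field of untrusted byte code is validated before use.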

What research opportunities are there? Automated testing of embedded interpreters. Could do static analysis, but the invariants are too dynamic and complicated. Could do symbolic testing, but control flow and complexity highly depends on the byte code. Or maybe we can build a reusable general embedded interpreter, e.g. based on Java byte code or something.

Q: Did you find *any* good interpreter that didn't have any of these problems? Is there any good citizen?
A: Of course, but this doesn't mean it can't have these problems in the future.

Q: You know that the JVM has a theorem prover that detects bad byte code before executing it?
A: Our goal is not to analyze general-purpose byte code machine interpreters.

Q: You mention constraining resources. How would you actually do that?
A: [missed the answer]

Q: For OS kernel code, it's hard to enforce security as there's one big address space. What could one do to OS design in order to make embedded interpreters in the kernel safe?
A: Could do micro-kernel style interpreter-in-userspace. But if need performance, nothing you can do beyond making sure the interpreter is correct.



Making Automated Testing of Cloud Applications an Integral Component of PaaS

Stefan Bucur, Johannes Kinder, George Candea

PaaS (AppEngine and friends) is jolly good, as it's easy to use. But how does one test these applications systematically? Consider the example of a simple trivia Q&A service. Could do unit tests or integration tests. But constructing test inputs is a bit tedious, as we need to wrap them into tons of boilerplate (XML etc.). Even harder is generating all the inputs, and getting full coverage is almost impossible. Integration tests focus on corner cases, and one can come up with lots of these and yet not cover all of them.

Wouldn't it be nice if there was an automated way of generating tests, especially if this was offered as a service with PaaS application deployment? The developer just submits code and an input description, and then "the cloud" performs test case generation, testing etc. Could use symbolic execution for the test case generation (using a constraint solver). But real web apps are more complex, and do not just take primitive types as input. This makes the parser and the symbolic execution tree a lot more complex (or at least bigger). What's worse, the fuzz test case generator may end up exercising all kinds of exceptions in e.g. the JSON parser, but won't trigger the cases exercising the bits of our app that we're interested in. Their solution to this is something called "layered symbolic execution", which essentially replaces the output of the parsing with a "fresh" variable. Unfortunately, the fresh variable doesn't contain a valid JSON string. They use an "onion object", which is the input description submitted by the user, to then synthesize reasonable values.
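A much-simplified illustration of why structured input generation matters (our own toy, nothing like the S2E/onion-object implementation): raw fuzz inputs mostly die in the JSON parser, while inputs synthesized from a schema-like description all parse and therefore exercise the application logic behind it.

```python
import json
import random

def handle_request(raw):
    """A hypothetical trivia-service handler: parser layer first,
    application logic second."""
    req = json.loads(raw)                 # parser layer
    if req["answer"] == req["expected"]:  # app logic we want to test
        return "correct"
    return "wrong"

# A schema-like description of valid inputs (standing in for the
# "onion object" input description a developer would submit).
schema = {"answer": ["A", "B", "C"], "expected": ["A", "B", "C"]}

def synthesize(schema):
    """Build a structurally valid input by choosing a value per field."""
    return json.dumps({k: random.choice(v) for k, v in schema.items()})

# Every synthesized input parses, so every test reaches the app logic;
# a random byte string like "not json" would fail in json.loads instead.
results = {handle_request(synthesize(schema)) for _ in range(50)}
```

Layered symbolic execution goes further by treating the post-parse value symbolically, but the coverage argument is the same.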

Prototype with GAE dev server and S2E symbolic virtual machine. Currently WIP, can generate one test case per second.

Q (Matthew Grosvenor): What in this work makes it specific to cloud applications? Could this be used in general applications as well?
A: What's specific is the onion-style layered inputs, but could in principle work for other applications too.



Optimizing unit test execution in large software programs using dependency analysis

Taesoo Kim, Ramesh Chandra, Nickolai Zeldovich

Unit tests are important, but running them takes too long (e.g. 10 minutes in Django). This work makes them fast (2s).

Research direction here is to do regression test selection (RTS), which is to run only the necessary tests instead of everything. This requires syntactical analysis of test cases and code changes, and resolving the appropriate tests to run. But RTS techniques are never adopted in practice. Hypothesis: "soundness" (no false negatives) ends up killing their benefits. For example, if you change a global variable, this could affect (and thus re-run) everything, in which case we may as well never have used RTS in the first place. The goal of this work is to make RTS practical, by prioritising performance over soundness and better integrating it with the development cycle.

To track which tests execute which functions, first need to figure out dependencies between functions. Also incrementally update this dependency information as unit tests run, and these increments can be pushed to the repo server alongside code changes (so that other people only need to run the affected tests when they sync these changes?).

Still have the false negative problem, so need to do a better job at dependency identification. They introduce an asynchronous test server that actually runs all tests continuously, and monitors them for any missed false negatives (example: non-deterministic control flow leading to function only being called sometimes).

Evaluation: using Django and Twisted frameworks, and looked at last 100 commits to trunk. Find that sometimes, small modifications to "hot" functions lead to tons of unit tests being re-run, and sometimes large-scale changes end up re-running few cases (didn't quite understand why this was). In terms of test runtimes, they go from minutes to seconds as a result of introducing TAO (their stuff). Downside (ish): TAO adds extra stuff to code base, ranging from 60 to 117% of original repo size! But could reduce this maybe.

[no questions]



Lazy Tree Mapping: Generalizing and Scaling Deterministic Parallelism

Yu Zhang & Bryan Ford
University of Science and Technology of China & Yale University

Many-core hardware is now abundant, but parallelism makes life hard due to non-determinism. Data races and non-determinism are everywhere! Can just accept this and live with it, or use race detectors, or use deterministic schedulers. Ideally, we'd like to develop a parallel programming model in which races do not exist in the first place. Determinator tries to do this, but isn't very general. Their previous work, SPMC, enabled producer-consumer page sharing under deterministic parallelism. However, this was very wasteful in terms of space. Also introduced DetMP, which is a deterministic message-passing API with explicit SPMC (single producer, multiple consumer) channels.

In this work, they introduce Lazy Page Mapping and Tree-style Mapping, which are both optimizations to make SPMC sharing better. Available as an extension to Determinator (xDet) and on top of Linux (DLinux). Performance is much better than previous work for various MPI workloads.



That's it, folks!
