{"id":1222,"date":"2013-07-29T00:51:00","date_gmt":"2013-07-29T00:51:00","guid":{"rendered":"http:\/\/www.syslog.cl.cam.ac.uk\/?p=1222"},"modified":"2013-07-30T09:27:50","modified_gmt":"2013-07-30T09:27:50","slug":"liveblog-from-apsys-2013","status":"publish","type":"post","link":"https:\/\/www.syslog.cl.cam.ac.uk\/2013\/07\/29\/liveblog-from-apsys-2013\/","title":{"rendered":"Liveblog from APSYS 2013"},"content":{"rendered":"
Matt<\/a> and I are at the Asia-Pacific Systems Workshop today, presenting our paper<\/a> on distributed operating systems in data centers, and we'll be live-blogging the workshop for you.<\/p>\n <\/p>\n <\/p>\n Registration stats:<\/strong> only 60+ registrations this time (down from last time), 31% Singapore, 43% from the rest of Asia, 17% North America, 6% Europe and 3% Australia. 11 student travel grants awarded (with SIGOPS\/MSR support), up from 9 last time.<\/p>\n Program stats:<\/strong><\/p>\n <\/p>\n Gernot Heiser, UNSW & NICTA<\/em><\/p>\n <\/p>\n Conclusions:<\/p>\n Questions:<\/p>\n Q: What about proof maintenance? A: It's about the same effort as regular software.<\/p>\n Q: Know-how transfer? Very high-level (expensive) skills are required to do the proofs. A: True. Using the right techniques (DSLs\/synthesis etc.), we can push this deeper into industry.<\/p>\n Q: Can your techniques be adapted for Linux\/Windows? A: Yes. This is part of it. Linux synthesized drivers\/filesystems work already.<\/p>\n Q: ARM TrustZone? Comments? A: TrustZone is a hack. Limited in what it can do.<\/p>\n Q: What's the limit? How far can you go? A: Intentionally only talking about 1 order of magnitude. Components make verification easier, but make performance harder.<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n <\/p>\n Xiang Song, Jicheng Shi, Haibo Chen, Binyu Zang Many data centers use virtualization these days, and run multi-core VMs on top of multi-core CPUs. This leads to the classical problem of two-level scheduling: the hypervisor tries to schedule vCPUs over physical cores, and the guest OS tries to schedule processes over its vCPUs. The dilemma is that no information is shared between the two levels, so there is a semantic gap. Two key examples: vCPU preemption during a critical section (other vCPUs waiting for it e.g.
in a spinlock waste CPU time) and stacking (same issue, except with a vCPU blocked on another, I think).<\/p>\n Some indicative numbers on the streamcluster benchmark show that overheads of 2.5-5x (KVM) and 2-4x (Xen) are not uncommon when using 2-4 VMs on a 12-core Intel machine. Somewhat counter-intuitively, having a 12-vCPU VM run atop a 12-core machine may end up performing worse than a 6-vCPU one (on the wordcount\/histogram benchmark).<\/p>\n The solution proposed is VCPU ballooning<\/em>: exclude the hypervisor from dynamic vCPU scheduling (basically seems to be pinning vCPUs to pCPUs), and decide the number of vCPUs for each VM according to some weight. Might need to balloon vCPUs, i.e. take them offline, e.g. when VMs are migrated\/deleted\/created. There are existing solutions to this that have some problems. One counter-argument to vCPU ballooning is that applications might try to optimize the number of threads independently, and end up with a thread-vCPU mismatch, but they argue that this isn't a big deal: long-running server applications can typically adapt to CPU hotplug, and short-running applications will only be unbalanced for a short time.<\/p>\n Evaluation: some benchmarks from PARSEC and Phoenix MapReduce; vCPU ballooning achieves a 3-35% speedup over the Xen credit scheduler, while the affinity scheduler sometimes even degrades performance. Same picture with KVM, and less time spent in the kernel in general. >100% speedup on Xen and KVM for Phoenix histogram\/wordcount for their vCPU ballooning scheduler, while the affinity scheduler degrades by a few percent.<\/p>\n Q (Nickolai Zeldovich): Why isn't using the hotplug support in the OS a good idea? (ms705)<\/em><\/p>\n <\/p>\n Yusuke Fujii, Takuya Azumi, Nobuhiko Nishio, & Shinpei Kato GPGPUs are jolly good, and their performance keeps increasing very quickly. Now looking at 1500+ cores on nVidia Kepler GPUs.
GFLOPS performance scales linearly, and importantly much faster than CPU performance (at much better performance per Watt). Great, if you have the right applications!<\/p>\n Much current research looks at OS support for GPGPUs, e.g. Gdev (USENIX ATC '12). But GPUs still aren't sufficiently \"general-purpose\". The main reason for this is that they do not do multitasking well, as they can't manage resource arbitration. So instead, resources are managed by the CPU in kernel-space. So in this work, they focus on putting a resource manager into GPU hardware, in order to avoid having to cross the boundary for resource management all the time. But modifying GPU hardware is hard due to its closed-source \"black-box\" nature. Fortunately, the nouveau driver people are doing a pretty decent job at reverse-engineering the hardware and specs.<\/p>\n One thing they found is that GPUs have a bunch of microcontrollers inside them, which have programmable firmware that can be replaced. But it's hard to do so -- one must write assembly code for an undocumented black box. In this work, they provide a development environment for GPU firmware (a compiler suite), build some firmware and add new features that leverage the microcontrollers.<\/p>\n They are porting the GUC framework (this seems to be an existing nVidia assembly thing from nouveau) to the LLVM infrastructure, so that one can write in any front-end and get GUC assembly from the backend (which they added). Obvious benefits of easier development, higher productivity etc. fall out.<\/p>\n Using this suite, they developed a \"baseline\" firmware that has the same features as the default, but exposes a bunch of microcontroller details (not clear how). In the extended version, they add microcontroller-based data transfer functionality for DMA from host memory to GPU memory.<\/p>\n Evaluation: compare basic firmware performance, find that theirs is as good as the nVidia default and nouveau firmware.
Comparing the data transfer performance of various existing mechanisms and their newly implemented method, they mostly lose out against conventional methods in serial data transfer (apart from a narrow transfer-size range), but win by a small margin in the concurrent data transfer case.<\/p>\n Future work: resource management using microcontrollers!<\/p>\n <\/p>\n Q: Why are there such large error bars in the evaluations? Q (Malte Schwarzkopf, Cambridge): Most of the time conventional is better, why do you only win sometimes? Q (Stefan Bucur, EPFL): How much of the black-box reverse engineering work can be applied to other work? E.g. Radeons? (ms705)<\/em><\/p>\n <\/p>\n Zhaoguo Wang, Hao Qian, Haibo Chen & Jinyang Li Many people use coarse-grained locks for mutual exclusion. These can be made fine-grained, but this is hard work. TM to the rescue! This has been hailed for years, and hardware TM has been around in high-end or prototype hardware for a while. But Intel Haswell now makes TM instructions available in general-purpose CPUs. Basic idea: use the cache to track the read\/write set, and then use cache coherence to detect conflicts. Limitations: working set size is limited (plus something I didn't catch).<\/p>\n They did a comparative study of RTM on real hardware, all the way from programming effort over compiler effects to comparison with traditional sync methods. Some lessons learnt: avoid memory alloc\/dealloc in a TX region, RTM prefers reading to writing, compiler optimizations for removing memory accesses are beneficial inside RTM regions, fallback handler locks should occupy a cache line, and different abort events should be handled differently, tuned to the workload. Experiments use the skip list from LevelDB (a K-V store) on an RTM emulator and real Haswell hardware.<\/p>\n Consider the case of insert into a skip list. The naive approach is to wrap the entire insert function into an RTM transaction.
However, this failed to make progress at all, due to including memory allocation (for the new node) in the critical region. So move the allocation out of the RTM region, and compare on emulator and hardware. Vary the number of parallel threads, measure the TX abort rate. At 16 threads, get a ~0.4 abort rate in the emulator, rising to 1.6 at 32 threads. On hardware, aborts are much more likely, however! Abort rate of 5 at only 4 threads. For another workload (\"1M nodes\"), the emulator failed to work due to cache eviction, but it worked on the hardware! The reason is that the L1 on real hardware only tracks the write set, while the emulator tracks read and write sets. So the hardware can tolerate read-set cache line eviction.<\/p>\n Compiler study: the workload is to insert 100k nodes concurrently into the skip list, with different compiler optimizations turned on. Interestingly, things vary pretty wildly here (3x difference in abort rate, and no clear total ordering between optimizations!). Emulator and real hardware show almost opposite behaviour... When investigating reasons, they found that -O1 generates the fewest memory access instructions (and can hence outperform -O3).<\/p>\n Combining RTM with a traditional lock that is acquired on the abort path massively increases TX failure rate, and thus execution time, versus the simple retry approach. Can smarten this up a bit by checking which type of abort occurred (conflict or cache-capacity induced), but find that this increases the abort rate to ~3.5 at four threads. This is because there's a falsely shared cache line, so add some padding. Now looking much better: 0.17 abort rate instead of 3.5 for four threads!<\/p>\n Finally, compare against traditional sync methods.
RTM has about the same performance as fine-grained locks or lock-free skip lists.<\/p>\n Limitations by their own admission: the study is limited to data structure behaviour (skip list), only simple micro-benchmarks, and limited to the 4 threads available on real hardware.<\/p>\n Q: The lock-free version in your comparison appears to have about the same performance as the fine-grained lock version. Are you sure you implemented it right? Q: What about overlapping transactions? Need to take care with boundaries? Does TM really make writing the skip list easier? Q (Matthew Grosvenor, Cambridge): Did you do a bottleneck analysis? Specifically, what is the bottleneck of the lock-free version? (ms705)<\/em><\/p>\n <\/p>\n Muli Ben-Yehuda, Omer Peleg, Orna Agmon Ben-Yehuda, Igor Smolyar, Dan Tsafrir Where's the money in OS work? In \"the clouds\". Specifically, in very short-term rentals, i.e. resource micro-charging. Some providers sell \"customizable resource packages\", where prices are set based on supply and demand. Assuming this all works out, we still have to deal with aligning the economic incentives of cloud providers, clients and the OS (who also needs resources), and somehow manage to share resources reliably and with isolation at fine granularity. How do you build an OS for this kind of environment?<\/p>\n In fact, they claim that there is a single missing piece in historical OS development: architectural support for machine virtualization throughout the system. With this in mind, an OS kernel should not just optimize for performance, but also for cost! Second, the kernel should expose physical resources and get out of the way, giving applications the means to manage their own resources while maintaining isolation.<\/p>\n The nonkernel is a hybrid kernel\/hypervisor, and can run on bare metal (being a hypervisor) or on top of a legacy hypervisor (as a guest kernel). Effectively, it does as little as possible, and merely provides little more than hardware-assisted virtualization.
It boots the machine, arbitrates contended resources (but only gets involved in the contention case, as I understand it), isolates applications and provides efficient user-space IPC (managing setup and teardown of channels, it seems). Could build on top of an existing OS, or from scratch. Not quite decided on which one yet...<\/p>\n Pros: good performance, zero-overhead virtualization, reduced driver complexity, a more secure system and higher system efficiency due to the integrated economic model.<\/p>\n Cons: clean break, no backwards compatibility, no legacy hardware (need architectural virtualization support), no legacy software.<\/p>\n Preempting the question \"isn't this just another Exokernel?\" -- answer: this is Exokernels done right, as the hardware is now in a place where it can better support this and avoid drawbacks (e.g. downloading user code into the kernel). Working on a prototype, \"nom\".<\/p>\n Q: You relegate lots of functionality to applications. How do you ensure that a malicious application does not compromise the machine? Q: Do I have to rewrite my application to run on nom? Q (Gernot Heiser, NICTA): How is this different from a minimal hypervisor like NOVA? (ms705)<\/em><\/p>\n <\/p>\n Aaron Carroll & Gernot Heiser<\/em><\/p>\n Things we learned?<\/p>\n What changed?<\/span><\/p>\n (mpg39)<\/p>\n <\/p>\n Joseph Chan Joo Keng, Tan Kiat Wee, Lingxiao Jiang, & Rajesh Krishna Balan, Singapore Management University (SMU)<\/em><\/p>\n <\/p>\n Why mobile privacy protection?<\/span><\/p>\n What about current tools?<\/p>\n We're different:<\/p>\n How useful is it? Methodology.<\/p>\n Results:<\/p>\n Leak Cause Analysis<\/p>\n Leak prevention mechanism<\/p>\n Q (Matthew Grosvenor): You mentioned that you picked the top 10 apps, but your analysis has over 200? Q: (Aaron, UNSW) It's nice to be able to notify people about leaks, but can you bin applications to stop them leaking? Some leaks are more serious than others.
But disabling leaks also disables functionality. We want users to have informed consent.<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n Wei Wang, Raj Joshi, Aditya Kulkarni, Wai Kay Leong & Ben Leong National University of Singapore<\/em><\/em><\/p>\n Q: (Gernot) What does it matter if your phone battery life is 21 hours when the quad is only flying for 45 mins? A: Use lots of quads.<\/p>\n Q: Why use WiFi (1 mW)? Use cellular, it's 1 W! A: This is a good idea, but the limitation is that cellular hardware is heavier (WTF??)<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n Alvin Cheung<\/i><\/em>, Lenin Ravindranath, Eugene Wu, Samuel Madden, & Hari Balakrishnan, MIT CSAIL<\/i><\/em><\/p>\n Why?<\/p>\n What?<\/p>\n This is dangerous. How to restrict the class of changes?<\/p>\n Need to specify which devices to apply this to:<\/p>\n How to apply the updates?<\/p>\n Problems:<\/p>\n Desktop\/web might be interesting places to apply this.<\/p>\n Satsuma - separates targets from names.<\/p>\n <\/p>\n Q: Is a target just a location? A: Location is useful, but maybe you want more: is the WiFi on, is the application installed?<\/p>\n Q: What about GCM, how is this different? A: GCM requires every developer to host their own server, which is expensive, and we'd like to generalise this.<\/p>\n Q: How do you know where to annotate? A: Maybe the compiler can set everything to annotated and then optimise later.<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n Malte Schwarzkopf, Matthew P. Grosvenor, & Steven Hand, University of Cambridge Computer Laboratory & Microsoft Research Silicon Valley<\/i><\/em><\/p>\n We make the case that distributed operating systems, an idea from the 1980s, are worth revisiting in the context of the modern data center.<\/p>\n <\/p>\n Arjun Roy, Ken Yocum & Alex C.
Snoeren, University of California, San Diego<\/em><\/p>\n Shrinking networks can yield incorrect conclusions, which can be either unrealistically optimistic or pessimistic.<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n Haris Volos, Sankaralingam Panneerselvam, Sanketh Nalli, Michael M. Swift<\/i><\/p>\n SCM technologies: phase-change memory (PCM), spin-transfer torque RAM, memristors etc. -- all non-volatile RAM technologies. SCMs are as fast as DRAM, but unlike DRAM, they're persistent. They have been heralded as the next awesome thing for a while, but might actually be around the corner now.<\/p>\n With SCMs, some things, such as disk request scheduling, become obsolete. As they expose load\/store instruction interfaces, one can also get rid of the device driver and a bunch of other layers. Nevertheless, many designs rely on a filesystem atop the SCM devices. This is because FSes provide convenient abstractions for global naming, protected sharing and crash consistency. However, this additional layering can inflate the otherwise quick access time to SCMs. For this reason, people have added a fast path for data, bypassing the FS for bulk operations, and only using the FS for meta-data. However, even meta-data operations matter: 20-50% of time for various workloads (file server, web proxy, web server) is spent doing such operations! For example, a meta-data operation takes about 800ns on RAMFS, which is 10x the access latency of an SCM. Indeed, hierarchical FS traversal can be even more expensive.<\/p>\n So wouldn't it be nice if we could bypass the POSIX abstractions for applications that want to do high-performance SCM access? This is what their Aerie<\/em> library FS provides. The key here is that all meta-data lookup and hierarchy traversal (and thus many meta-data operations) can be done in user-space, without involving the kernel. The only things that cross the kernel boundary are the calls into the (specialized) SCM manager.
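As a toy illustration of that idea -- resolving names entirely in user space as a library data structure, with no system call needed to walk the hierarchy -- consider this sketch. The `Namespace` class and its methods are hypothetical, purely for illustration; this is not Aerie's actual API or code.

```python
# Toy model: file-system name resolution done entirely in user space.
# Walking the hierarchy never enters the kernel; only actual data access
# would go through a privileged SCM manager in the real design.
# (Hypothetical sketch; not Aerie's code.)

class Namespace:
    def __init__(self):
        # Each directory is a dict mapping a name to either another
        # directory (dict) or an inode number (int) for a file.
        self.root = {}

    def create(self, path, inode):
        parts = path.strip("/").split("/")
        d = self.root
        for name in parts[:-1]:
            d = d.setdefault(name, {})   # create intermediate directories
        d[parts[-1]] = inode

    def lookup(self, path):
        """Resolve a path to an inode number without a kernel crossing."""
        d = self.root
        for name in path.strip("/").split("/"):
            d = d[name]                  # raises KeyError if missing
        return d

ns = Namespace()
ns.create("/var/www/index.html", 42)
print(ns.lookup("/var/www/index.html"))  # -> 42
```

The point of the sketch is only structural: because the namespace lives in the library's address space, a lookup costs a few pointer chases rather than a syscall per path component.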
But can't trust user-mode libraries with metadata integrity and sharing leases, so they also have a trusted user-space service that implements this.<\/p>\n Using this libFS structure, they implemented two file systems: a POSIX-like FS, PXFS, for legacy compatibility, and a key-value interface FS, KVFS. Compared these against ext3, and find ~30% latency reduction compared to ext3 on web proxy and file server, but an 18% increase in the web server (likely because of the cost associated with the open() call, which is common in this environment). RAMFS does pretty well too, but KVFS reduces the latency by 66% over ext3 and 16% over RAMFS for the web proxy workload, while giving better guarantees than RAMFS.<\/p>\n Q: Why do you compare with ext3, not ext4? All performance optimization was done on ext4 in recent years, so should look at that to be fair! Q: Presumably application developers will rewrite their applications for SCM. Why do you need to support some of the more arcane legacy APIs (as mentioned in future work)? (ms705)<\/em><\/p>\n <\/p>\n Yandong Mao, Cody Cutler, & Robert Morris RAM latency can dominate application performance, for example if you need to follow long pointer chains or the working set exceeds cache size. Lots of cache misses lead to many memory accesses and load on the memory bus. Their solution is to treat RAM more like a disk (but with different latency magnitude), and apply the same optimization techniques: batching, sorting, pipelining etc.<\/p>\n Experimental environment: six-core Xeon X5690, one memory controller with three channels that each have a row buffer cache inside them. It also supports a bunch of prefetching features (in hardware and software), parallelism across channels and executes instructions out-of-order. 
Note that row buffer cache hits are 2-5x faster than misses, and sequential access has 3.5x higher throughput than random access.<\/p>\n Consider a garbage collector example: long pointer chains, but with no predictability of the next access's locality. So each access ends up generating a cache miss and stalls for RAM access. Linearisation of requests could help here, so on each GC cycle, arrange objects in tracing order, and make use of the prefetcher. For a 1.8 GB working set of HSQLDB, they found that sequential-order tracing is 1.3x faster than random order. In future work, they would like to use a better linearization algorithm than copy collection.<\/p>\n The second example is Masstree: for this key-value store, it is not possible to linearize memory accesses, as the storage abstraction is a shared B-tree. Despite careful design, Masstree is still RAM latency-dominated, as each key lookup follows a random path. To improve this, use batching and interleaving of tree lookups. Specifically, they use software prefetching to interleave tree lookup and RAM fetch. On a single core, this interleaving strategy can improve performance by 30% with a batch size of five.<\/p>\n Parallelization also helps, as the RAM can actually fetch stuff in parallel, so using more cores helps. Up to 12 hyperthreads, the total number of RAM loads scales linearly with the Masstree throughput! Can also combine parallelization and interleaving, and still get a 12-30% improvement over just parallelization, although this tails off with more threads.<\/p>\n Conclusions: interleaving seems more general than parallelization, but is more difficult, especially if we try to do it automatically and without programmer input (might be impossible without some help). Found that the interleaving technique can also be applied to a hashtable, which receives a 1.3x throughput boost (for memcached).
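The batched-interleaving idea can be mimicked conceptually with coroutines: each lookup yields at the point where the real implementation would issue a software prefetch for the next tree node and switch to another lookup in the batch. This is only a conceptual sketch (the actual technique is C with hardware prefetch instructions; all names here are made up):

```python
# Conceptual model of batched, interleaved tree lookups: each lookup is
# a generator that yields where the real code would prefetch the next
# node and let another lookup in the batch make progress.
# (Illustrative sketch only; not the paper's implementation.)

def tree_lookup(tree, key):
    node = tree
    while isinstance(node, dict):
        yield                      # real code: prefetch child, switch lookup
        node = node["left"] if key < node["key"] else node["right"]
    yield
    return node                    # leaf value, delivered via StopIteration

def interleaved_batch(tree, keys):
    """Round-robin a batch of lookups, advancing each one step at a time."""
    pending = {k: tree_lookup(tree, k) for k in keys}
    results = {}
    while pending:
        for k, gen in list(pending.items()):
            try:
                next(gen)
            except StopIteration as done:
                results[k] = done.value
                del pending[k]
    return results

# Tiny binary search tree: internal nodes are dicts, leaves are values.
tree = {"key": 10,
        "left":  {"key": 5,  "left": "a", "right": "b"},
        "right": {"key": 20, "left": "c", "right": "d"}}
print(interleaved_batch(tree, [3, 7, 15, 25]))
# -> {3: 'a', 7: 'b', 15: 'c', 25: 'd'}
```

In the real C version, the time another lookup spends computing covers the RAM latency of the prefetched node; here the round-robin scheduler only models that overlap.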
Would also be nice to have tools to identify RAM stalls -- they had to use an indirect technique based on function costs and function operations.<\/p>\n Q: You propose two techniques that can improve RAM access throughput. But they're highly tailored towards the application and data structures. How do you think you can generalize your techniques or make them application-independent? (ms705)<\/em><\/p>\n <\/p>\n Yongseok Son, Jae Woo Choi, Hyeonsang Eom, & Heon Young Yeom, Seoul National University<\/i><\/p>\n <\/p>\n Motivation:<\/p>\n Approach:<\/p>\n Results:<\/p>\n <\/p>\n Q: Have you sent your patch to the Linux kernel? A: No.<\/p>\n Q: Is your dataset random or real? A: Random. C: This is unfair, as Linux is optimistically caching.<\/p>\n <\/p>\n Chun-Ho Ng & Patrick P. C. Lee, <\/i>The Chinese University of Hong Kong<\/i><\/p>\n <\/p>\n Motivation:<\/p>\n Our work:<\/p>\n Global dedup.<\/p>\n [Sorry, fell asleep, not because I wanted to. Jetlag + all-night hacking on the CamIO demo]<\/p>\n (mpg39)<\/p>\n Jongmin Lee, Yongseok Oh, Hunki Kwon, Jongmoo Choi, Donghee Lee, & Sam H. Noh, Dankook University, University of Seoul & Hongik University<\/i><\/p>\n <\/p>\n We propose:<\/p>\n Description:<\/p>\n Results:<\/p>\n (mpg39)<\/em><\/p>\n <\/p>\n Haogang Chen, Cody Cutler, Taesoo Kim, Yandong Mao, Xi Wang, Nickolai Zeldovich, Frans Kaashoek Interpreters are everywhere these days, from the Linux kernel to Type 1 font renderers. Vulnerabilities in interpreters are also ubiquitous. Consider the example of a packet filter, e.g. for tcpdump. One could run the whole filter in user space, which is safe but slow. One could also run it in the kernel, but then we have untrusted user-space code in the kernel. Traditional solution: Berkeley Packet Filter (BPF) + interpreter. This seems sound: the interpreter runs in the kernel, and byte code comes from user space.
But byte code and input data are still untrusted, and if a malicious user were to inject malformed code or data, bugs in the interpreter can compromise the kernel. There are real examples of this: e.g. the INET_DIAG infinite loop vulnerability -- a DoS as a result of passing in a zero \"oplen\" parameter via BPF. Second example: the ClamAV anti-virus signed division vulnerability, where a parameter check was implemented incorrectly and ended up with the interpreter trapping as a result of division by zero. Finally, the third example is an arbitrary code execution vulnerability in the FreeType font interpreter: by passing a negative argument count, one can manipulate the interpreter stack pointer and thus break into the system.<\/p>\n Conclusion: writing secure interpreters is hard! The rest of the talk discusses some security guidelines for writing interpreters.<\/p>\n What research opportunities are there? Automated testing of embedded interpreters. Could do static analysis, but the invariants are too dynamic and complicated. Could do symbolic testing, but control flow and complexity highly depend on the byte code. Or maybe we can build a reusable general embedded interpreter, e.g. based on Java byte code or something.<\/p>\n Q: Did you find *any* good interpreter that didn't have any of these problems? Is there any good citizen? Q: You know that the JVM has a theorem prover that detects bad byte code before executing it? Q: You mention constraining resources. How would you actually do that? Q: For OS kernel code, it's hard to enforce security as there's one big address space. What could one do to OS design in order to make embedded interpreters in the kernel safe? (ms705)<\/em><\/p>\n <\/p>\n Stefan Bucur, Johannes Kinder, George Candea PaaS (AppEngine and friends) is jolly good, as it's easy to use. But how does one test these applications systematically? Consider the example of a simple trivia Q&A service. Could do unit tests or integration tests.
But constructing test inputs is a bit tedious, as we need to wrap them into tons of boilerplate (XML etc.). Even harder is generating all the inputs, and getting full coverage is almost impossible. Integration tests focus on corner cases, and one can come up with lots of these and yet not cover all of them.<\/p>\n Wouldn't it be nice if there was an automated way of generating tests, especially if this was offered as a service with PaaS application deployment? The developer just submits code and an input description, and then \"the cloud\" performs test case generation, testing etc. Could somehow use symbolic execution for the test case generation (using a constraint solver). But real web apps are more complex, and do not just take primitive types as input. This makes the parser and the symbolic execution tree a lot more complex (or at least bigger). What's worse, the fuzz test case generator may end up exercising all kinds of exceptions in e.g. the JSON parser, but won't trigger the cases exercising the bits of our app that we're interested in. Their solution to this is something called \"layered symbolic execution\", which essentially replaces the output of the parsing with a \"fresh\" variable. Unfortunately, the variable then doesn't contain a valid JSON string. They use an \"onion object\", which is the input description submitted by the user, to then synthesize reasonable values.<\/p>\n Prototype with the GAE dev server and the S2E symbolic virtual machine. Currently WIP, can generate one test case per second.<\/p>\n Q (Matthew Grosvenor): What in this work makes it specific to cloud applications? Could this be used in general applications as well? (ms705)<\/p>\n <\/p>\n Taesoo Kim, Ramesh Chandra, Nickolai Zeldovich Unit tests are important, but running them takes too long (e.g. 10 minutes in Django).
This work makes them fast (2s).<\/p>\n Research direction here is to do regression test selection (RTS), which is to run only the necessary tests instead of everything. This requires syntactical analysis of test cases and code changes, and resolving the appropriate tests to run. But RTS techniques are never adopted in practice. Hypothesis: \"soundness\" (no false negatives) ends up killing their benefits. For example, if you change a global variable, this could affect (and thus re-run) everything, in which case we may as well never have used RTS in the first place. The goal of this work is to make RTS practical, by prioritising performance over soundness and better integrating it with the development cycle.<\/p>\n To track which tests execute which functions, first need to figure out dependencies between functions. Also incrementally update this dependency information as unit tests run, and these increments can be pushed to the repo server alongside code changes (so that other people only need to run the affected tests when they sync these changes?).<\/p>\n Still have the false negative problem, so need to do a better job at dependency identification. They introduce an asynchronous test server that actually runs all tests continuously, and monitors them for any missed false negatives (example: non-deterministic control flow leading to function only being called sometimes).<\/p>\n Evaluation: using Django and Twisted frameworks, and looked at last 100 commits to trunk. Find that sometimes, small modifications to \"hot\" functions lead to tons of unit tests being re-run, and sometimes large-scale changes end up re-running few cases (didn't quite understand why this was). In terms of test runtimes, they go from minutes to seconds as a result of introducing TAO (their stuff). Downside (ish): TAO adds extra stuff to code base, ranging from 60 to 117% of original repo size! 
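The selection step at the heart of this approach can be sketched in a few lines: given a recorded map from each test to the functions it executed, a commit's changed functions pick out the tests to re-run. This is illustrative only -- TAO's real dependency tracking is richer and is updated incrementally as tests run, and all names below are made up:

```python
# Sketch of regression test selection: re-run only the tests whose
# recorded function dependencies intersect the functions a commit
# changed. (Illustrative; not TAO's actual implementation.)

def select_tests(test_deps, changed_functions):
    """test_deps maps test name -> set of functions that test executed."""
    changed = set(changed_functions)
    return sorted(t for t, funcs in test_deps.items() if funcs & changed)

deps = {
    "test_login":  {"auth.check", "db.query"},
    "test_render": {"tmpl.render"},
    "test_search": {"db.query", "index.scan"},
}
print(select_tests(deps, ["db.query"]))  # -> ['test_login', 'test_search']
```

The hard part, as the talk makes clear, is not this intersection but keeping the dependency map accurate and cheap to maintain, which is where the soundness trade-off comes in.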
But could reduce this maybe.<\/p>\n [no questions]<\/p>\n (ms705)<\/em><\/p>\n <\/p>\n Yu Zhang & Bryan Ford Many-core hardware is now abundant, but parallelism makes life hard due to non-determinism. Data races and non-determinism are everywhere! One can just accept this and live with it, or use race detectors, or use deterministic schedulers. Ideally, we'd like to develop a parallel programming model in which races do not exist in the first place. Determinator tries to do this, but isn't very general. Their previous work, SPMC, enabled producer-consumer page sharing under deterministic parallelism. However, this was very wasteful in terms of space. They also introduced DetMP, which is a deterministic message-passing API with explicit SPMC (single producer, multiple consumer) channels.<\/p>\n In this work, they introduce Lazy Page Mapping and Tree-style Mapping, which are both optimizations to make SPMC sharing better. Available as an extension to Determinator (xDet) and on top of Linux (DLinux).
Performance is much better than previous work for various MPI workloads.<\/p>\n (ms705)<\/em><\/p>\n <\/p>\n That's it, folks!<\/p>\n","protected":false},"excerpt":{"rendered":" Matt and myself are at the Asia-Pacific Systems Workshop today, presenting our paper on distributed operating systems in data centers, and we’ll be live-blogging the workshop for you.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[85,79,35],"_links":{"self":[{"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/posts\/1222"}],"collection":[{"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/comments?post=1222"}],"version-history":[{"count":28,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/posts\/1222\/revisions"}],"predecessor-version":[{"id":1224,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/posts\/1222\/revisions\/1224"}],"wp:attachment":[{"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/media?parent=1222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/categories?post=1222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.syslog.cl.cam.ac.uk\/wp-json\/wp\/v2\/tags?post=1222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}Introduction<\/h2>\n
\n
Keynote: \u201cCan truly dependable systems be affordable?\u201d<\/h2>\n
\n
\n
Session 1: OS and Performance I<\/h2>\n
Schedule processes, not VCPUs!<\/h3>\n
\nInstitute of Parallel and Distributed Systems, Shanghai Jiao Tong University<\/i><\/p>\n
\nA: Xen supports hotplug insertion, but not removal. KVM does not support it in the simulator (?).<\/p>\nExploring Microcontrollers in GPUs<\/h3>\n
\nRitsumeikan University & Nagoya University<\/i><\/p>\n
\nA: PCIe and GPU memory variation.<\/p>\n
\nA: We're not sure why. But our methods can be used concurrently with current methods.<\/p>\n
\nA: No, although they have microcontrollers too, so some principles might be transferrable.<\/p>\nOpportunities and pitfalls of multi-core scaling using Hardware Transaction Memory<\/h3>\n
\nFudan University, Shanghai Jiao Tong University, New York University<\/i><\/p>\n
\nA: The lock implementation is optimized (something about having to lock predecessors, didn't catch details).<\/p>\n
\nA: Provable correctness is easier to attain when using hardware TM (?).<\/p>\n
\nA: We have more results showing that lock-free and fine-grained locked version are equivalent in performance up to 40 threads.<\/p>\nThe nonkernel: A Kernel Designed for the Cloud<\/h3>\n
\nTechnion<\/i><\/p>\n
\nA: Not trivial to answer this. For DMA, use the IOMMU; for other things, use architectural virtualization support, which we know how to use safely.<\/p>\n
\nA: You should, but you might not have to. We provide defaults for IO stack, network stack etc. Not for everyone, this is for people who can take advantage of this.<\/p>\n
\nA: The big difference is that we allow applications to manage their resources.
\nQ: This should be easy using just capabilities.
\nA: No, isn't easy. Applications can dynamically tweak resource requests based on economic model.<\/p>\nSession 2: Mobile<\/h2>\n
The Systems Hacker\u2019s Guide to the Galaxy: Energy Usage in a Modern Smartphone<\/h3>\n
\n
\n
\n
The Case for Mobile Forensics of Private Data Leaks: Towards Large-Scale User-Oriented Privacy Protection<\/h3>\n
\n
\n
\n
\n
\n
\n
\n
\nA: As a result of our testing methodology, we ended up with a bunch more applications.<\/p>\nFeasibility Study of Mobile Phone WiFi Detection in Aerial Search and Rescue Operations<\/h3>\n
\n
Mobile Applications Need Targeted Micro-Updates<\/h3>\n
\n
\n
\n
\n
\n
\n
Session 3: OS and Performance II<\/h2>\n
New wine in old skins: the case for distributed operating systems in the data center<\/h3>\n
Challenges in the Emulation of Large Scale Software Defined Networks<\/h3>\n
\n
Storage-class memory needs flexible interfaces<\/h3>\n
\nA: Will look at in future work.<\/p>\n
\nA: I guess people will have to rewrite applications if they want to reap the benefits.<\/p>\nOptimizing RAM-latency Dominated Applications<\/h3>\n
\nMIT CSAIL<\/i><\/p>\n
\nA: Don't claim that all of them can be applied to all applications. But categories of techniques might work for categories of applications, so something of a pattern-based approach might work.<\/p>\nOptimizing the File System with Variable-Length I\/O for Fast Storage Devices<\/h2>\n
\n
\n
\n
\n
RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups<\/h2>\n
\n
\n
\n
<\/h2>\n
TinyFTL: An FTL Architecture for Flash Memory Cards with Scarce Resources<\/h2>\n
\n
\n
\n
\n
\n
Security bugs in embedded interpreters<\/h2>\n
\nMIT CSAIL<\/i><\/p>\n\n
\nA: Of course, but this doesn't mean it can't have these problems in the future.<\/p>\n
\nA: Our goal is not to analyze general-purpose byte code machine interpreters.<\/p>\n
\nA: [missed the answer]<\/p>\n
\nA: Could do micro-kernel style interpreter-in-userspace. But if need performance, nothing you can do beyond making sure the interpreter is correct.<\/p>\nMaking Automated Testing of Cloud Applications an Integral Component of PaaS<\/h2>\n
\nEPFL<\/i><\/p>\n
\nA: What's specific is the onion-style layered inputs, but could in principle work for other applications too.<\/p>\nOptimizing unit test execution in large software programs using dependency analysis<\/h2>\n
\nMIT CSAIL<\/i><\/p>\nLazy Tree Mapping: Generalizing and Scaling Deterministic Parallelism<\/h2>\n
\nUniversity of Science and Technology of China & Yale University<\/i><\/p>\n