Posted by & filed under Uncategorized.

The PLOS crowd!I’m here in the Nemacolin Woodlands at SOSP 2013 with a huge gaggle of the SRG (both old and new), in a room packed to capacity for the PLOS 2013 workshop!  I’m co-chairing it with Tim Harris, so I’ll be liveblogging talks when I’m not coordinating sessions.  The keynote is from Russ Cox from Google, and you can find all the papers on the ACM digital library.

Russ Cox: Using Go to build distributed systems

Go is an open-source programming language that makes it easy to build simple, reliable and efficient software. Started in 2007, open sourced in Nov 2009, and has been open source development since then.  Started as an answer to software problems at Google: multicore, networked systems, massive clusters, and working at scale: 10^7 LoC, 10^3 programmers, 10^6(+) machines.

Go is a simple but fun language. Start with C and remove the complex parts — add interfaces, concurrency and GC/closures/reflection/strings.

“Less is exponentially more”.

Engineering: fast compilation

The joke is that Go started when waiting for a large C++ program to compile.


package main
  import "fmt"
  func main () {
    fmt.Printf("")
  }

import “fmt” is guaranteed to read exactly one file.  Metadata in that package describes what else may be needed in the future.

Q from audience: what doesnt scale with 1000 includes instead of one?
A: the compilation time doesnt scale — not runtime issues.


$ go get github.com/golang/glog

package main
  import (
    "flag"
    "github.com/golang/glog"
  }

func main () {
  flag.Set("logtostderr", "true")
  glog.InfoF("hello world")
  }

Still guaranteed to read exact one file, but the import namespace is decentralized.

Engineering: Program rewrites

There’s a mechanical mandated format that go requires for syntax.


$ gofmt -r 'glog.InfoF -> glog.Errorf' hello1.go

Prints the same program but with the Info line from before into Error.  Very useful for refactoring large code bases.

Engineering: garbage collection

In C/C++, too much programming and API design is about mem management. Go has garbage collection only.  Fundamental for interfaces.

Go lets you limit allocation by controlling memory layout — for example, a ring buffer.

The GC is still an area of active research. Design decision: cannot reuse Java GC directly due to an early design decision to permit interior pointers, as are foreign pointers.  It is implemented as a mark and sweep language.  It’s idiomatic in Go to plan your memory layout upfront, unlike (for example) in Java where you are encouraged to treat garbage collection as “free”.

Interfaces

An interface defines a set of methods.


package io
type Writer interface {
Write (data []byte) (n int, err error)
}

A type implements the interface by implementing the methods (structural typing).

An implementation of an interface can be assigned to a variable of that interface type.

Reader is an obvious counterpart, and this lets you write a function Copy that operates purely over Reader and Writer interfaces.

A more complex adapter is Adapters.


func MultiWriter(writers ...Writer) Writer

creates a writer that duplicates it writes to all the provided writers, like Unix tee.

Interfaces tend to be fairly small, but one bigger example is in the Networking library.  Errors are explicitly returned.


func NewClient(conn net.Conn, host string) (*Client, error)

You can build other things such as SSL on top of this interface.

Interface lessons

There’s no dependence between the interface and implementation, unlike in Java where interfacing is very explicit.  Expressive composition, easy testing and avoids overdesign and the rigid hierarchy of inheritance-based OO.

Interfaces are the source of all generality the Go.

Concurrency vs Parallelism

Concurrency is about dealing with a lot of things at once. Parallelism is about doing a lot of things at once.  Concurrency is about structure, and parallelism is about execution.

Examples of concurrent: keyboard, display, disk drivers in an OS.

Parallel: vector dot product, matrix multiply which can often be sped up almost linearly with cores.

Concurrency enables parallelism, but is useful on its own since modern programs must deal with many things at once.

Go provides two important concepts: a goroutine is a thread of control within the program, with its own local variables and stack.  They are cheap and easy to create.

Channels provide a mechanism for goroutines to communicate via typed messages between goroutines.


package main
import "fmt"
  func main() {
  c := make(chan string)
  go func() {
    c <- "Hello"
    c <- "World"
  } ()
  fmt.Println(<-c, <-c)
}

The func() runs concurrently with the main() which consumes those messages in the Println.

The model is taken from Tony Hoare’s CSP and is orthogonal to the rest of the language and can keep a familiar model for computation, and focus on composition instead.  You can switch to CSP primitives only when you want to worry about computation.  On the other hand, something like Erlang requires you to learn a new method of working (actors) and is less easy to move from C/C++.

Caveat: they’re not purely memory safe and sharing is legal.  Passing a pointer over a channel is idiomatic, and its been working well in practise.

Sequential network address resolution given a work list.


for _, w := range worklist {
w.addrs, w.err = LookupHost(w.host)
}

This is a simple one that does it step by step.  To do it in parallel, lets use channels


done := make(chan bool, len(worklist))

for _, w := range worklist {
go func (w *Work) {
w.addrs, w.err = LookupHost(w.host)
done <- true
} (w)
}

for i:= ; i<len(worklist)l;i++) {
<- done
}

Channels are being used here to build a pattern, which can be put in the stdlib.  This is actually a sync.WaitGroup from the standard library. Parallelism can also be bounded by building a channel to keep track of channel limits.

If you are using mutexes, you are certainly creating bugs.  Channels seem easier for people to get right in practise.

Example: replicated storage with read/write quorums.  (Eventually consistent)


const (
F = 2
N = 5 // >= 2F + 1
ReadQuorum = F + 1
WriteQuorum = N - F
)

(Shows an example of this using channels and writes)

Q: How are slow messages cancellable?

A: the channel itself will be GCed since the writer stuffs the request into the channel. However cancellation is a common pattern in distributed systems and so a more advanced version can return a channel to be used for interruptions (Which can in turn by watched by a network RPC engine and invoked if stuff takes too long).

Select allows choosing between multiple channel ops.  For example a chat program


for {
  select {
  case event := <- ui:
  //process user interface event
  case msg := <- server:
  // process server message
 case t := <- tick:
  // time has elapsed
  }
}

Concurrency lessons

This is a *key* feature for building distributed systems, and supported by closures and garbage collection.  Message passing inside the program, and also outside the program.  This helps you to not change your thinking when moving from within a single program to a distributed system (which has to be message passing).

Most importantly, do not communicate by sharing memory and instead, share memory by communicating.

Q: are closures deep copy?
A: no, they are capturing references to variables.

Q: what about cache coherence across domains?
A: being worked on, but not there yet.

Type systems

What the type system chooses not to express is as important as what it does.

Interfaces are most powerful when they are the least restrictive.  A Write for example, doesnt have a flush call, just a write.  Restrictions that could bifurcate interfaces that Go doesnt have are:

  • qualifiers like const
  • memory management details
  • concurrency details: does this function block?  If it is a library, then your interfaces are less powerful.  Go just takes care of this in the runtime.

Concurrency details restrict implementations:

  • Function f does not block.
  • Now function f needs to send a log message as an RPC.
  • F’s signature much change.
  • Upstream functions must block.

In Go, blocking is assumed, and the runtime can interleave goroutine execution. For concurrency, use a goroutine!  This makes stacks much more useful.

Is go used in production?  Yes, with a couple of examples within Google that are public.

  • vitess/vtocc: a MySQL query balancer: serves all YouTube MySQL queries and has months of crash-free and leak-free operation.
  • groupcache: distributed in-memory immutable k/v cache. Used by parts of dl.google.com, Blogger, Google Code, Google Fiber, and various production monitoring systems.

This is all available at http://golang.org

Q: in Haskell, the pain in the IO monad is much felt.  The type system is a tool for finding technical debt.  Haskell prevents shortcuts and forces you to do the right thing.
A: Loves Haskell, but is unable to write programs. There’s a real tradeoff between the type systems complexity and expressivity.  Go hits a spot in the middle that doesnt force you to be honest all the time (like Haskell).

Q: are there any fundamental things you can do in Go vs (e.g.) Erlang
A: in Erlang, there’s no concept of a channel and you pass around process ids.  In Go, it’s just a nice solid model based on CSP that is easy to pick up.

Q: in Singularity, it had channels that were safe, but in that case channels are statically guaranteed.
A: they dont make this static guarantee, but there is a static analyser called Race Detector which records all reads and writes and observes bad backtraces.

Q: there’s a certain complexity to Java now with generics and so on.  How do you stay simple?
A: not claiming we still stay simple forever.  Go offers a balance: Java doesnt trust you with anything and they make things clear.

Stephen Kell: An operating system, should there be one?

Oracle Labs, University of Cambridge
[paper link]

IMG_0810
Recursive search of a pttern.  What is grep doing?


grep - r 'pattern' $d

what kind of thing is $d, and what does grep do? $d is a tree node (directory).

grep -r will recursively explore (readdir), work on leaf nodes, read characters (bytes) and if matched, output a reference. What stops it generalising further?

It turns out that if you grep the grep docs, they are gzipped and we hit a semantic wall for composability.  Hitting a limit of Unix (semantics free) byte streams.

In a Smalltalk world, Dan Ingles suggests that an “operating system is a collection of things that dont fit into a language. There shouldn’t be one”. Specific grumbles: “depart from an otherwise consistent framework”, “very primitive’”.  This never quite took off for some reason though…

A key part of OS’s value is communicating abstractions. Unix is tending towards Smalltalk — but its meta abstractions are lurking. Smalltalk is all about a programming abstraction with pervasive late binding, with uniformity and big, well-designed libraries.

What is a “meta system”?  A “system” is a collection of behaviors modelling a domain.  A “metasystem” models the domain of systems.  For this talk, a machine-readable specification is of interest.  Metasystems enable value added orthogonal functionality such as persistence, REPLs, debugging and so forth.

In the land of UNIX, “since we are programmers, we naturally designed the system to make it easy to write, test and run programs”.  It has multiple programming abstractions, primitive metaabstraction from I/O and some late binding.  But it’s missing a semantic meta-abstraction (e.g. classes) and in modern UNIX, uniformity…

In Plan9, “we build a UNIX out of little systems, not a system out of little Unixes”

Files and objects are some subdivision of a computation, implementing a basic protocol, consisting of messages that are network transparent — onto which additional meaning can be layered.

It’s now missing documented semantics for files (e.g control files) and so in this world, what can I grep or cp -r ?

There is a lurking Smalltalk in UNIX, but also metasystems:

  • /proc filesystem, vmmap
  • synthetic file systems (/sys and sometimes /dev)
  • DWARF debugging
  • /etc/services
  • filename extensions, MIME
  • HTTP content encoding

What Stephen wants to do is to build a Smalltalk out of lots of fragments, not a big fragmented system out of little [insert PL here]s”

Smalltalk is one view among many!  Languages appear to be more ephemereal than OSes, and language implementations should be “a view on the wider world”.

What are the benefits if we join the dots on metasystems?  Via minor expansions to UNIX like interfaces to unify the metasystem and not just the system.

Concrete Use:
New tool called Pmirror : knows Linux /proc, ELF, Dwarf (more wanted) Explains any byte in a system to more semantic introspection.

How can an IPC metasystem have a provided/require metasystem? Can we have someting similar to Dwarf to interpret files and directories with more semantic information in the style of DWARF?

Gabrielle Keller: verifying filesystems

[paper link]

IMG_0813Causes of filesystem bugs can be broken down into semantic confusion (58%), memory handling (14%) and error handling (10%) and finally concurrency.  For now, putting concurrency aside and worrying about the rest.

Starts with an Isabelle functional specification and a domain specific language (DDSL) which contains the data layout spec

The DDSL compiler compiles serialisation code which is then passed through to a lower level Isabelle spec.  Finally all this is extracted to C code along with proofs that they are equivalent to the specification.

The current implementation contains a CDSL core design and has static and dynamic semantics formally defined.  A case study involves  a low level filesystem.

The DDSL is a key part of the system.  There are a lot of existing data description languages (how to specify a data structure and layout, and how to express constrains on the values and how to check them statically).  A lot of work, but mostly engineering.

The control CDSL was more interesting from a research perspective.  It involved a tussle between systems people who demand efficiency, destructive updates and expressiveness. The theory people want to have controlled side effects, memory safety and static reasoning.

Therefore the design of CDSL ended up using two key things: a linear type system allows interpretation via value as well as update semantics.  The main idea is that the values of a linear type have to be used exactly once.   This can be statically checked.


createObj :: Int -> Object
updateObj :: (.Object, Int) -> .Object

(where the dots represent linear types).

The main problem with linear types is how it makes programming more difficult.  Therefore CDSL has syntactic sugar to make it easier to break apart values and deal with them separately.

Error handling is dealt with by encoding it in the type.  There is only a single error code type and results are propagated to be linear-friendly (i.e. on success the same object is returned as a fresh one)

CDSL ends up having sound value semantics and update semantics. Given a well typed CDSL program, it can be interpreted with functional semantics (i.e. equational) and also sound update semantics for efficient implementation. The results of evaluation are the same!  In addition to the final value, every single evaluation step has a correspondence in the value and update sematnics.

Linear types have been around for quite a while, but they aren’t used widely despite their nice properties. The type of programs that can be written are quite restrictive, but this is still a good tradeoff to achieve the correctness goals for this project.

The results of this project are very promising, and its really a question of how cheap it’ll be.  How much reuse of proofs will be possible, and how much can the abstraction level of CDSL be pushed towards proof automation?

Q: (Butler Lampson) you already built a verified kernel. What’s the differnece here?
A: Heavy emphasis on proof automation.  Gernot Heiser observes that it took 22 person years

Towards a verified component platform (Matthew Fernandez)

[paper link]

What’s a trustworthy system? Its about DO-178C — they check development process, testing, config management, certification and eventually formal verification.IMG_0814

The issue with this right now is the expense of scaling up formal systems as LoC grows.  A component-based development system offers an opportunity to build layers on top of a verified seL4 kernel and eventually grow it into a large scale system at much cheaper cost than is currently possible (with monolithic verification).

The end to end guarantees involve starting from an architecture description, generating glue code that combines component code into a userspace image.  The arch description also outputs an architecture semantics, which lets us show the strong separation of multiple components.  The user provides an architecture spec, a trusted component specification and component code.

It’s all done in Isabelle/HOL and they have a lot of code built using it now (including previous filesystem talk), so its heading into being a “real” system.

CAmkES is a component platform which has abstractions for procedures, events and shared memory. (shows the architecture semantics)

The component process language is a concurrent imperative language. It consists of a UserStep to change a local value, a SKIP to do nothing and a composition double-semicolon to make a request, and a similar if/then/else to handle non-deterministic responses.

Q: safety-critical but not real time systems?
A: not modelling time at the moment.

Modelling NICs with Unicorn

[paper link]

IMG_0815
Builds on a paper at HotOS — increasing complexity, diversity and functionality in network cards (checksum, segmentation offload, lots of hardware queues, hw filtering, virtualization, direct cache acces). Cores are not getting faster, but networks are. Current operating systems fail to utilize NIC hardware resources though.
The Dragonet stacks breaks up the problem into the NIC and net stackb being dataflow graphs (c.f. x-kernel or click).

There is a Physical Resource Graphs (the PRG) that contains hw functions and configuration into a Logical Protocol Graph (the LPG) that has protocol state and packet processing. The PRG is combined with the LPG to embed into a real system.

Unicorn is a programming language to talk about NICs. A concrete syntax for writing PRGs and LPGs. The building blocks are function nodes: a single computation with a single input and ports with with potentially multiple output edges. When computation is done, one port is activated and subsequently, nodes connected to that port are activated. Operator nodes provide more useful
functionality such as supporting multiple inputs.

In the LPG (shows a packet forwarding example), there is a very loose partial ordering to enable more flexibility when embedding. This allows sw/hw resources to be used interchangeably.

Modelling NIC configuaration is quite difficult — there are lots of options that drastically change the behaviour of the NIC. Thus configuration has to be first-class in the embedding to get high performance.

The implementation is in Haskell — a edsl that embeds algorithms into PRG/LPGs and then into Unicorn (language+model). At the very bottom is the Intel DPDK which executes it on an Intel NIC (i82599). The stack responds to ping (so ARP/ICMP/etc) and they are continuing to develop it.

Q: Does this system have any modularity?
A: Yes! (I missed how though, sorry)

Annotation for automation: Rapid prototyping

[paper link]

IMG_0816File system aware applications include a file system checker (fsck) to ensure that it’s consistent. For example, two files with the same name and directory will result in fsck renaming one of them so that the unique naming invariant is preserved.

Filesystem tools are strongly coupled with the fs-specific details – imagine n file systems and k file system tools.

Goal is to decouple a file system format from file system tools. It’s analogous to compilers decoupling the frontend and backend formats.

The challenge is that we need a specification of the filesystem format to traverse the fs graph. However, nodes are under-specified and so we dont have enough information to make a tool.


struct baz {
short name_len;
/* name goes here */
};

The ondisk format of this has a variable length array (which can’t easily be declared in C).

The second problem is that edges are implement. For instance, does struct foo point to struct bar? The ondisk format doesnt reflect this strongly.

Their approach is to create an annotation language to identify on-disk data structures which arent currently specified in the C code. This involves specifying the pointer information, the type of structure references, and the conversion to disk addresses. Next, conditional interpretation of fields (tagged unions) increases flexibility.

They are building a simple ext2-style file system to evaluate their language.


FSSTRUCT inode {
u32 i_type;
u32 i_mod_time;
u32 i_size;

POINTER (type=dir_block, when = self.i_type == I_DIR)
u32 i_block_nr[NR_DIRECT_BLOCKS];
... (see paper)

These annotations are detailed in the paper, but essentially map ondisk data to the C struct. Their tool can now use these to generate code that (e.g.) forms iterators. They can annotate real file systems (e.g. ext3 btrfs etc) and improve code generation.

Q: (Russ Cox) this seems to be focussing on serialization, so why not just provide libraries?
A: the libs are provided by the filesystem authors and only spottily available. The goal is to support all styles of applications (both online and offline).

Q: can you handle corrupted datastructures?
A: this would be considered part of the application logic. They currently generate code which would to be embedded in the template.

Q: the idea is to write one bit of code in fscheck, and then write code against all file systems. What would this look like?
A: You need to write the fschecker, but also the filesystem specific consistency rules. In addition to annotation, there is also the fs-specific policies and consistency invariants which also need to be specified.

Q: could you write a database like this? cf the internal contents of files which are quite complicated now (e.g. a sqlite database stored as a file)
A: yes, you would treat the database file as a filesystem and specify the metadata that’s part of the db definition.

Marius Eriksen: your server as a function, Twitter Inc

[paper link]

IMG_0817Futures is a placeholder for a result that may not exist. A Future[Int] future is an integer-valued future for example. Futures are used for any sort of async operation, for example pending RPCs, disk I/O, or pending timeouts. Generally they’re ignostic to the source of the operation, and consumer and producers are coupled. A promise is a writeable future to fill in its value.

It is possible to view futures as an at-most once element container. You can transform them by constructing a map function.


Future[T].map[U](f: T => U): Future[U]

The most important way to use futures is to compose futures as a function of other futures.


Future[T].flatMap[U]( f: T => Future[U]): Future[U]

Futures recover from errors by another form of dependent composition.


val f = auth(id, pw) rescue {
case Timeout => auth(id,pw)
}
(which is similar to stack based catching of exceptions)

It’s possible to build up a collection of futures and wait for all the futures to succeed, returning the sequence of returned values. The returned future fails if any of the constituent futures fail.


def fetch(url:String): Future[Buf]
def parse(buf: Buf): Seq[String]

def crawl(url:String): Future[Seq[Buf]] =
fetch (url) flatMap { buf =>
val ls = Future.collect(parse(buf) map crawl)
ls map(_.flatten)
}

He shows a simple webcrawler (for DAGs only!) that use all these abstractions.
Then shows how services are just functions (e.g. memcached, redis and so on). A service can have filters which transforms a set of requests over a service and constructs the pipeline of changes. Simple example is a timeout filter which can either timeout a request or map through the result. Another filter is an authentication one that ensures that a given request has been authenticated.


val timeout = new TimeoutFilter
val auth = new AuthFilter
val authAndTimeuot = auth andThen timeout

The observation is that futures, services and filters form a good orthogonal basis on which to build server software of the sort found at production at Twitter. The style of programming encourages good modularity and separation of concerns. Most of their systems are just future transformers.

As presented though, futures are purely read-only constructs which form a DAG where consumers are separated from consumers. For example, a producer should know when a value is no longer needed to avoid needless work and avoid queuing. They therefore added an Interrupt model to cancel a future.


val p = new Promise[T]
p setInterruptHandler {
case exc => if (p.updateIfEmpty(Throw(exc))) tearDown ()
}

Make futures a little less clean, for example caching futures is a problem and its not fully composable anymore. But in practise it’s been fine.

Futures, services and filters form an orthogonal basis for writing server software. The strategy has paid off: picking a few composable abstractions has led to the emergence of small, reusable components. This is the software tooling approach for building server software now. Separating semantics from mechanics has been hugely beneficial too.

Q: when it breaks how do you expose a useful explanation to the user?
A: have a virtual stack trace, exposing the graph. When evaluating the graph they record the stack trace and show it to the user. A separate approach is a bytecode rewriting plugin for the JVM which rewrites all the spots in the code where promises are included to include provenance information about what they’re waiting for.

Q: Mothy: in the past, people have talked about futures as a lot :) How are these futures different?
A: The traditional notion of a future in Java/C++ have precluded the type of composition shown here. For example a java.util.Future can’t be composed; you can only wait for the result (same is true for i-structured variables). OCaml’s Lwt is most similar.

Weir: a streaming language for performance analysis, Anton Burtsev

[paper link]

IMG_0818Motivated by the need to keep track of what’s going on in a complex Xen stack. There are 3-4 layers of scheduling and an I/O match and generally a complex distributed system. Xentrace provides a large execution trace from a Xen setup.

Weir is a new language that takes an event trace and combines them together into a pipeline for processing. The analysis algorithm is a composition of stages, and the logic of the algorithm consists of the topology of the stream.
Weir constructs these pipelines — it is an imperative streaming language with arbitrary topologies and dynamic scheduling.

He then shows off the basic pipeline operators such as a pipe, sequence and if operator.

vm(0) | {
match(HYPERCALL);
match(EXCEPTION);
} | count ()


foreach(vm_id()) {
match(EXCEPTION) | count ();
}

A join operator lets you count the time between the pair of events for example.

An example is to build a histogram in the Xentrace trace, which consists of a foreach and join that computes that there is a high number of a particular event (exits). Then they figure out why the exit happens (a rtdsc timestamp event). The next Weir trace figures out the total time in Xen spent due to rdtsc events, which is built using the gate constructs (to open and close portions of the pipeline).

MBrace: Cloud Computing with Monads

[paper link]

IMG_0819
Most distributed frameworks are tied to specific programming models: map reduce, dataflow or actors for example. They tend to restrict specific distribution patterns, and arent expressive enough for certain classes of algorithms. It is difficult to influence task granularity and is time consuming to deploy, manage and debug code to and from the cloud.

MBrace is a new programming model for the cloud that provides an elastic fault-tolerant programming framework for the cloud. It uses a monad for composing distribution workflows. It’s essentially a continuation monad that admits distribution through the use of primitive combinators. It is based on F#s computation expression (which is the F# equivalent of monads but including useful keywords such as while loops). The whole thing is inspired by F# asynchronous workflows.


let download url : string = cloud {
let client = new System.Net.WebClient ()
let content = client.DownloadString(url)
return content
} : Cloud

The cloud keyword shows that its a monad, and the return value is a future (in the parlance of Marius’ talk).


let downloadSequential () = cloud {
let! c1 = download "http://foo"
let! c2 = download "http://bar"
let c = c1 + c2
return c
}

let downloadParallel () = cloud {
let! c1,c2 =
download "http://foo"
<||>
download "http://bar"
let c = c1 + c2
return c

The basic operators are a binary parallel operator, a variadic parallel combinator and a non-deterministic parallel combinator for selecting one. F# provides useful support for these monads — monadic for, while and exception handling loops.

Execution happens with a scheduler/worker organization and a symbolic execution stack (free monad and trampolines) and the scheduler interprets the “monadic skeleton”. The native leaf expressions are dispatched to workers. The symbolic stack winds across multiple machines, and so if there is a distributed computation with an exception, it can be wound across multiple machines.

There’s no storage service built into MBrace, and instead it relies on third party storage services. Out of the box support for SQL, Azure and filesystems and planned support for Amazon S3. Storage services are interfaced through data primitives which act as references to distributed resources that can only be accessed through the monad. A “cloud ref” is a data reference, and is conceptually similar to ML ref cells.

Shows benchmarks for a distributed grep to benchmark it (a straightforward mapreduce implementation) and found that mBrace is slightly faster than Hadoop with roughly equivalent performance. Moving onto iterative mapreduce (kmeans) they compared against Apache Mahout. mBrace pulls ahead an order of magnitude faster.

Semi-Automated Debugging via Binary Search through a Process Lifetime. Kapil Arya

[paper link]

kapil1Motivational example is a repeated insertion and deletion on a bounded linked list. The desire is for the total length of the linked list to be less than a million. While debugging it, they discover the list is too big. In the debug history they have a list of its length. How do we find the fault in the lifetime of the program? Ideally, they want to do a binary search of the desired invariant over the process lifetime.

First issue is determining how to find a halfway point while debugging, which is essential for a binary search. By looking at the debug history, they can attempt a binary search over it and determine which of the debugging commands caused the invariant to be violated. Each debug command though might either be a breakpoint (next, continue) but the continue consists of lots of commands. They want to decompose the continue in a series of smaller nexts to home in on the problem quickly.

The other issue is the slow speed of repeated execution of debug history. There is a lot of repeated computation when falling back through a series of debug commands to find the bug. hey use process checkpointing to speed up the binary search by storing a common checkpoint and restarting from there in a future binary search through the history.

The final main issue is how to deal with non-determinstic programs (such as gettimeofday). They need to record interesting events and ensure they replay those events regularly. In multithreaded programs, there are locks involved for synchronization and on restart, locks are acquired/released in a different order. They keep an event log and use it to deterministically reexecute the program after a restart. The assumption is that a well structured program only communicates across threads via locks.

What if a fault is triggered in a secondary thread? Threads are not kept synchronous, so background threads continue to execute. Kapil describes the algorithm to do the scheduler locking tricks to find the problem (details in paper, my fingers hurt now!).

Did the debugging on MySQL and found a couple of bugs (atomicity violation, data race, MySQL bug 12228 and 32319 if anyone wants to look it up). The checkpoint image size for MySQL was around 60MB with 1.2 million entries in the event log, and the average log entry size was around 4 bytes, so not a huge amount of storage required.

Q: Tim Harris: binary search is over an increasing value. In the linked list example, its not binary as it goes over and under the desired list length size.
A: the goal is to find at least one bug and then go back from there.

Understanding the genetic makeup of device drivers

[paper link]

Device driver developmentis a pain as you need to write and port/backport and maintain in a loop that is complex, expensive, slow and error-prone as kernel versions develop. There have been lots of attempts to help with this from research, but it doesnt help with all drivers. Some of the problems include: unsupported classes of devices, complex and firmware dependent devices, intellectual property restrictions and generally community acceptance of the solutions.

There are already some good drivers out there… billions of instances of drivers work and have been developed for many years and are used in mission critical environment.

How do we exploit the knowledge from these good devices drivers?

gene1The urban legend is that device drivers are implemented by copy-and-pasting, but simple clone detection doesnt work well in real-world drivers. They came up with the idea that drivers are made up of genes. For example, there are some features such as NAPI, checksumming, DMA support and offloading engines that have some features common across network drivers. They need to express the behaviour of this feature in drivers, given that the implementation will not be contiguous (but spread out across the subsystem).

They then show how the genetic abstraction works well in a modern network driver stack (see papers for details).

Q: what is the intrinsic difference between a gene and a clone?
A: clone is a piece of clone identified as a result of cut and paste (with variable changes) . But if you need to understand the relationship between lines of code that aren’t in sequence.