SSLShader: Cheap SSL Acceleration with Commodity Processors
Sending things over plain HTTP is bad -- there may be eavesdropping or surveillance going on. Nonetheless, most sites these days don't use HTTPS. Performance overhead HTTPS vs. HTTP: 22x setup time, 50x data transfer time. They made an RSA-on-GPU implementation that's faster than "high-end" hardware accelerators. Parallelism achieved between seperate connections, between records within a connection, and between encryption units within a record. The big exponentiation in RSA is optimised fairly obviously. They also accelerated AES and SHA-1 to complete the vanilla SSL soup. Without batching their data they take a 10x performance loss compared to working on the CPU due to coordination / copying overhead. With batching they get 8-20x. They find they can extract bulk crypto performance similar to 9-28 parallel CPU cores, and that their AES and SHA-1 implementations are limited by copy bandwidth on a PCIe 2.0 system with an NV GTX580. For interactive use they need adapative batching -- the first couple of records are processed on the CPU, with GPU being used as the queue of pending operations exceeds the critical batch size. So interactive SSH will use almost entirely the CPU. In fact many CPU cores can share a single GPU used in this way. There's also some stuff for scaling to NUMA.
Eval: lighty vs. dummy clients, using vanilla OpenSSL or accelerated flavour at the server (clients remain the same). Tested connection-handling rate for small entities (stresses RSA): 2.5x improvement for 1024-bit key, 6x for 2048-bit. Found that 60% of the time in the 1024-bit case was spent inside the kernel TCP/IP implementation. Tested performance with varying content size: found 2.1x improvement at 4KB body, 0.87x at 64MB. Not sure how that could be -- AES implementation gets worse with large blocks?? (cs/ms)
ServerSwitch: A Programmable and High Performance Platform for Data Center Networks
Aiming to support all manner of unusual protocols, e.g. switch ARP interception, layer-2 ECN. Existing work: OpenFlow, but they don't like it because it focuses on a programmable control plane, whereas they want to customise the data plane. NetFPGA: fully programmable, but difficult to program (need to know Verilog!) and not commodity devices. Goals: build a fully programmable packet forwarding engine in silicon, and use low-latency PCI-E links to defer control plane messages to the host. Architecture: use an off-the-shelf modern switching chip + multiple NIC chips on a card; asserts that modern switching chips are flexible enough (capable of interpreting several L3/4 headers, and include a TCAM for custom packet processing). They demonstrate how to do automatic source-routing using that TCAM. Looks like they really built the board: also includes 10G ethernet sockets for board-to-board bus interconnect. Compared vs. a 4-core i7 server, using NetFPGA to generate traffic in both cases. Found they could do wire-speed forwarding and 6x better latency than software, but not clear what the test workload was. Test 2: Quantised Congestion Notification.
Here they're using their board to filter out congestion notifications but do the processing in software. Looks like it's acting as an intermediate switch here. They artificially varied bandwidth and showed that QCN did its thing quickly. Declared limitations: requires use of underlying standard protocols, e.g. Ethernet. Can't do line-rate complex rewriting like e.g. NetFPGA. Chuanxiong Guo (an author, sitting next to me) clarifies that the standard protocols limitation is partial: for example, the device will try to calculate an Ethernet-style trailing CRC32 whether you ask it or not, and dropping frames which are "bad" in this sense cannot be disabled. So you don't need to use Ethernet, but it needs to resemble it closely enough that this check passes. Presumably the same applies to the IPv4 checksum, but that's easier as the card won't do it if the ether_proto doesn't match. (cs)
Goal: do TeraSort better than Hadoop by improving per-node efficiency. Figure for comparison: 3MB/s per node in Hadoop TeraSort record-breaking run. Aim to achieve that primary goal using "balanced hardware and software," which they define as levelling bottlenecks to get near-100% utilisation of all components. "Balanced software" achieves balance of the hardware. Their hardware balancing is pretty straightforward -- matching max disk-array throughput to network throughput, number of disks available per core to cores' ability to generate work, and memory available per core. Platform: 52 fairly awesome machines built from commodity parts + an enterprise-grade Cisco switch. Optimise disk utilisation by avoiding reading/writing the same disk -- partitioned disks into readable and writable. Their sort is two-phase like Hadoop's TeraSort -- bucket then sort each bucket. Sounds like they statically allocate workers in advance so that the data being written by the bucket phase is being sent to the right place.
Batching takes place in a couple of places to ensure that network and disk ops are always using healthy chunk sizes. Their network implementation is rather cheeky, as the transmitter just enters into a tight loop of non-blocking send(2) calls. They don't use select or similar, thus dedicating a core to transmission. The disk-writing code is fairly cunning too, trying to order their writes to minimise seeks. Eval'd using 100TB Gray sorting challenge. Found that writing to disk was the bottleneck stage for both sort phases, though other stages were able to remain active a reasonable proportion of the time. Hardware utilisation was similarly reasonable: CPU usage at 25%, all memory being used, network utilisation ~50%, disk achieving ~80% maximum possible write throughput. 6x per-node throughput over previous record holder (what was the previous guy's hardware like?).
Analysed a few hardware swapouts: higher-throughput disks pushed the bottleneck to CPU. Boosting RAM allowed them to chunk more writes prior to going to disk and so minimise seeks even more, but their disk utilisation was already 80% best possible, so it didn't matter much. Future work: MapReduce implementation using the same techniques. (cs)
Diagnosing Performance Changes by Comparing Request Flows
They consider two kinds of ways request flows can differ: by response time, and "structurally". They use a bag of heuristics to figure out which difference is most important in causing the difference between two run profiles. Profiles are gathered as you'd expect: trace everywhere and sellotape the traces together post-hoc. Trace overhead: <1%. Synchronised clocks required? The categoriser defines categories by having the same "shaped" profile, e.g. the same servers participate in the same way. They expect that this will generally lead to little intra-category variance of operation cost, and find this is true 90% of the time in their own system and 50% of the time in BigTable (50% of what?)
They identify the second kind of difference (response time changes) using some simple stats on request times, both overall and sub-transactions within the overall operation. The structural mutations are also detected as you'd expect -- participants change, or there's a sub-transaction that doesn't usually occur... They use some cunning heuristics to figure out which categories' requests are migrating to which other category during troubled times, but tease us by deferring to the paper at this point.
To eval, they ran this thing against one of their own systems and Google BigTable. They measure usefulness by the signal-to-noise ratio in the suggested mutations; they find that's 100% for some problems in their own system and 40%ish for the worst. Did some further localisation when it turned out that in some examples there's a structural difference but it's not clear *why* (turned out the user was asking for fault tolerance, hence extra write). The Google test is a simple qualitative study of setting it loose in figuring out why two BigTable clusters perform differently: conclusion: problem is below the level of BigTable. Unsatisfying, but probably true. (cs)
Starting off with an example: They added meta-data prefetching as feature to a distributed storage system (UrsaMinor). But it turns out it actually reduces performance! Using request flow comparison, they can work out what's going on. In the particular example, it turns out that there is a lock being acquired and released many times due to additional meta-data DB queries. RFC aims to identify changes in request distribution (N.B.: not the same as anomaly detection!). They can also use this technique to exclude the distributed system as a culprit for performance degradation -- if the request flow distribution is the same, then nothing has changed in terms of the distributed system. "Spectroscope" takes "non-problem period graphs" and "problem period graphs" as inputs, categorizes them and analyses them, producing "structual mutations" and "reponse-time mutations". For this, they require end-to-end request tracing, but this is a solved problem: already have many production systems supporting this at low run-time overhead (Magpie, X-Trace, Google Dapper). Request flow graphs are obtained by stitching traces from different places together, as one would expect. Request flow graph shows trace points as nodes and times between them as labels on edges. They need to categorize the graphs produced because it is pointless to compare individual request flows; instead, need to group similar ones. They decide that requests taking the same path through the system and having a similar cost should be grouped. Reponse-time mutations: structurally identically but have larger overall latencies in the problem period. Spectroscope uses hypothesis test to decide if the request flow distributions are actually different, or just the same modulo variation. Then just show which edges in the graph take longer in the problem period. Structural mutations: different paths through the system taken in the problem period. Assuming similar workloads across all input traces, they use three heuristics to identify precursors and structurally mutated requests. Example: lock contention due to many meta-data requests in the above example. Note that sometimes changes can yield multiple categories of structural mutations, e.g. if there are cascading behaviour changes. They show a "UI mockup" for Spectroscope, giving the developer a ranked lists of request flow categories, aiding investigation into changes in behaviour.
They did case studies with UrsaMinor and at Google; evaluation is by quantifying the proportion of "relevant" problems identified; worst case is only 40% relevance. They also used Spectroscope to diagnose two real problems, one in UrsaMinor and one in Google -- in both cases, it helped finding the issue (inadvertent configuration change with UM; cluster configuration in Google case). (ms)
Profiling Network Performance for Multi-tier Data Center Applications
They target network/application interactions: cases when the network and app are both functional, but the latter is using the former badly. Assert that packet-level tracing is expensive, and pairing it with application-level events is a pain. They pull stats out of the TCP stack instead, as they like the fact that it understands the app's behaviour and the network's state. They sample rather than take complete traces for performance sake. They categorise trouble by which end-to-end stage caused trouble: application data generation, send buffers, congestion, receive buffers, or the far end application.
Seperately they try to assign blame to a specific piece of hardware by correlating problems with the hardware involved. In real world tests they found that the vast majority of apps are performing badly because they delay an ACK waiting for more data. Normally you'd never notice, as send(2) will return once the data is copied to the local transmit buffer. However their apps were using zero-copy tricks which lends the app's memory to the TCP stack; the stack won't release the page back to the app until the ACK comes back, as it might need to use the page again for a retransmit. Stupid apps that wait for the send op to complete will be disappointed, as they've successfully achieved end-to-end tight coupling. Real-world problem 2: developers switching off slow-start-restart in favour of permitting a large congestion window to dangle forever, causing a burst loss when they suddenly try to use it. (cs)
Efficiently Measuring Bandwidth at All Time Scales
Trying to find bursts in traffic by periodic sampling: hard to know which periods will hide and which will reveal the bursts. Naive dumps suck because they're massively verbose, and SNMP counters because they're too gross. End hosts are configured to audit at the "base time scale": the highest possible resolution. Then this record will be munged to calculate stats at any granularity.
How to do this without eating storage: exponential bucketing. The summary is that you keep stats per power-of-two multiple of the base interval, then extrapolate to generate stats afterwards using the power of two closest to the desired resolution. However details deferred to paper. Algorithm 2: "dynamic bucket merge," also a prog rock band. Aims to use a fixed amount of memory to keep stats, then recover stats post-hoc. It just merges the two lowest-bandwidth time intervals every time a data point comes in. Other merge policies are possible, e.g. merge similar buckets. Depends what stat you're looking for at the end.
Eval: 1. how accurate are the stats compared to the real deal? Recorded a trace of TritonSort doing a 500GB sort; then munge it with both exponential bucketing and dynamic bucket merge. Found that EXPB produced worst-case inaccuracy at large granularities (worst 8.1%). DBM gave worst case inaccuracy 59% at high resolutions for the stdev stat. Eval part 2: does it hurt performance to trace this way? Costs 200-500nsec/packet to collect stats; worst case 2.5% performance hit (compare 8% for tcpdump). (cs)
ETTM: A Scalable Fault Tolerant Network Manager
People want a bunch of features from networks (intrusion detection, firewalling, NAT, ...) from networks, but these are currently implemented for reasonable size networks by having several expensive middleboxes. These are deployed at the edge, so e.g. with intrusion detection, you don't really see things going on internally with no outside interaction. Reliance on proprietary vendors also not good.
Their approach is different: "End-to-the-Middle" (ETTM) uses "virtual middleboxes" running on commodity network hosts.For this, they assume a trusted shim at each (!) host -- so really, this is distributing and federating the middleboxes. Further assumptions: hosts (a) have trusted computing HW, (b) are multi-core, (c) are virtualized and (d) in a single administrative domain. Also assume that switches provide access control. Assert that this is all the case with enterprise commodity PCs. Goal: have *one* platform with simple APIs for network management, implement management in software, not HW. All of this should be scalable, work pervasively, consistently and be fault-tolerant. Use Paxos between machines, and a TPM-enable hypervisor (can attest that the correct HV is running to a "verification server"). They've also extended 802.X protocol to use the TPM attestation information, and they run their network management task in a separate virtual machine on the host (rather than in the hypervisor). This VM is called "Attested Execution Environment" (AEE). Can we make a bunch of these machines make consistent decisions quickly? Paxos is used to make consensus decisions for each flow (see later for performance). Authorization between host and switch/verification server takes ~1s, which can easily be parallelized with boot. Time is dominated by TPM signing the request.
Fault tolerance: need to function even in the presence of arbitrary failures. They keep a shared table holding the state for each management application on each host (value filled in through Paxos), but they do also allow unsafe progress and state reconciliation in the presence of catastrophic failures.
As a demo, they have implemented something called µpfilter (OpenFlow/iptables backends) that can modify IP packets. The OpenFlow implementation allows header modification only, iptables implementation allows everything, but is limited to 250 MBps. On top of this, they implemented a NAT service, deep packet inspection, web cache, traffic shaper, intrusion detection and a firewall.
What might go wrong with this? Consensus speed: fault-tolerant decisions in under 1ms on wired LAN. Consensus scalability: can add between 8000 and 1700 cells to a row per second for (up to 19 Paxos machines). Performance: suffers for small flows (up to 10 KB), but fine beyond, perf. being identical to native flow performance. Their traffic shaper works, and can make per-flow, per-RTT decisions while not hurting performance, although they only give indirect evidence of this. (ms)
Finding out about evil hosts in your LAN can be hard if they restrain themselves to owning other LAN boxes and don't pass through your IDS. So: let's install a trusted shim on the end hosts to produce the effect of having an IDS running on every LAN switch. Target environment is an enterprise network so you can actually do that to your hosts. Requires a TPM and virtualisation support. Also needs smart switches that can do access control, e.g. 802.1X.
Overall goals: easy to program for, use end-hosts' computing powers and so scale, and be fault tolerant against compromised hosts. Challenge: trust the end hosts to execute the shim faithfully. Solution: use TPM to attest to hypervisor's presence, plus extend the switch's 802.1X implementation to use the attestation information to admit the host. Once the host is assured to be running your hypervisor then the rest is easy. Total time to attest and get the link up: 1.1s, run in parallel with ordinary OS boot. Legacy or inflexible hosts (read, Apple products) can be tunneled through attested hosts which take responsibility for them.
Fault tolerance (preserving IDS observations and decisions across machine failures) uses Paxos (not sure what that is). Looks like they use this not just to supply an IDS but to replace all kinds of middle-boxes, e.g. NAT. Paxos constructs a shared view of the NAT table. Sounds like it'll be expensive to add a row. The trusted VM runs the NAT code and so prevents the user from maliciously using someone else's public port.
Example apps: NAT, IDS, transparent HTTP proxy, prioritiser, scan detector, firewall. Basically consist of descriptions of their state for paxos replication purposes. Find that wired LANs are able to reach paxos consensus quickly and so agree on e.g. new NAT mapping in 1ms (20 hosts). 19 hosts can deal with 1700 ops per second, and ops are only needed for control-plane things. Find that running over NAT like this costs you a factor of 5 or so for small flows but negligable for bigger flows. Presumably that will improve if you can push the filter down from the trusted VM to hardware.
In summary sounds a bit fragile (but doesn't anything that involves keeping distributed state that must remain consistent and durable), but it's pretty neat to make your hosts be their own manager. (cs)
Design, Implementation and Evaluation of Congestion Control for Multipath TCP
Holy refried beans, batman! It's real world! All the best talks should begin "in the beginning, there was..." He characterises MP as being packet switching writ large, in that you can gang links. Question: how do you fairly share the link-gang? How do you adapt TCP to work in such a world?
Goals of MPTCP:
- Be fair when multiple subflows cohabit with a regular TCP flow.
- Use efficient paths: reach distributed agreement that if there are three flows using seperate direct paths, none of them should invade the other's connections. Conflicts with efforts towards fairness. Theoretical solution: send all your traffic on the least congested path.
- Be fair compared to MPTCP. Problem with the previous goal: adopting that principle on a mobile device with GSM and 802.11n available it should send everything over nice clean 9.6kbps GSM. That's no fun. So goal: do no worse than TCP, either to the user or the network.
- Adapt quickly to congestion changes
- Don't oscillate.
Goals 4 and 5 not discussed. Implementation: keep a window per subflow. Set the increase-on-ACK function to take account of all subflows' current windows to avoid taking more than TCP overall whilst avoiding congestion. Backoff-on-drop is the same as always.
Summary: make a bag of links act like a big link with the sum of their capacities, where some of your fellow flows might be foolishly using only one path -- that is, they are TCP. (cs)
This stuff is related to the IETF multi-path TCP working group. At the dawn of time, things were circuit-switched. Then people discovered that they can squeeze more juice out of a link by doing packet switching, especially when there are surges. Multipath is awesome magic that lets us do the same across multiple links: imagine the juice from a flow being poured down several pipes at the same time. Real-world use case: data centres (useful e.g. with a BCube topology), or wireless devices with multiple network interfaces (could get throughput of WiFi and resilience of 3G). In the latter example, fun arises from the fact that these interfaces are quite different and we don't just want to divide things up in a naive way or hack the special case -- ideally, let's come up with a general solution to this problem.
Design goals: (1) MPTCP should be fair to "regular" TCP on shared links, especially shared bottle necks. More subflows should not just equal more bandwidth! Strawman solution: just make MPTCP run TCP with a less agressive flow control. (2) MPTCP should use efficient paths. Splitting traffic evenly underutilizes the links. Bias in favour of more efficient (e.g. 1-hop) paths improves on this. Theoretical solution for arbitrary network: every MPTCP flow should sends all of its traffic on the least-congested paths. (3) MPTCP should be fair compared to TCP. Consider example of 3G and WiFi link, where the 3G link is less congested. Our answer to (2) says that we should send everything over the 3G link, but that's got crap RTT (even if less packet loss). That hurts the user, and users tend to not be sympathetic to that. So we conclude: we need to benefit the user. We need to be at least as good as normal TCP (3a). At the same time, we shouldn't hurt the network either (3b). Note that in this formulation, goal 3 subsumes goal 1. There are two more goals (adapt quickly to congestion and prevent oscillation), but skipped for time's sake here.
So how is this implemented? We maintain a congestion window for each path, and modify it for each per-path ACK. The step size is proportional to window size, so that we shift away from congestion. Decrease on packet drops by halving per-path window size.
So what high level problem is this trying to solve? Imagine a multi-homed web server with two outgoing 100 MBit/s links, 2 TCP flows on one, 4 TCP flows on the other. Add two MPTCP flows and they will end up on the 2-flow link. As we add more, bandwidth is shared fairly, which is clearly good. Ergo: MPTCP makes a collection of links behave like a single large pool of link capacity.
This stuff is really nice -- it sounds convincing, worthwhile and as if it works well. The idea of sharing capacity between links is not new (see HW switch-level port trunking), but this is making it more automatic and adaptive. Summary: I want one of those, ideally now :-) (ms)
CIEL: A universal execution engine for distributed data-flow computing
Such alacrity; such poise. All talks should be like this one. After this one is done, none other shall be worth seeing. And my, what a handsome presenter! But what of the others mentioned on the title slide? Who is this mysterious "Steven Smith"? I picture a Viking quaffing from a horn of finest mead, before riding into the sunset. (cs)
A Semantic Framework for Data Analysis in Networked Systems
A device for combining expert-generated models with raw logs to answer users' questions about the system. Their models are constructed in terms of "behaviours," which are related events (events being primitives from traces/logs, like file or network ops). The system or notation for describing them permits behaviours to be themselves composed to yield more complex behaviours. Related-ness can mean causally related, concurrent, or logically related (e.g. both a success and failure message is obviously bad).
He gives an example of a failure we could diagnose: DNS cache poisoning, characterised by lots of DNS replies with an incorrect XID and abnormally frequent races between two responders. Their behaviour annotation language/system will presumably be able to describe those relationships so that it can describe it to the user as "probable DNS poisoning attack". Sounds not a million miles away from Wireshark's stateful packet dissectors, e.g. those which can pair requests and replies and so highlight dup replies, unanswered requests, etc. The DNS example includes their composite behaviour thing, as you can specify DNS-race as a behaviour, then cache-poisoning = many DNS races during the same time as many DNS bad responses.
The system's answers appear to come back Prolog-style -- assignments to the variables in the model, where the facts will be TCP packets in this case. (cs)
Users want to analyse their networks. There are abstract models of networks, but marrying them with the precise questions asked is not always easy. There's a bunch of existing approaches that occupy different niches in a performance/abstraction space. Their semantic analysis approach lives at a high abstraction level, and accepts lower performance as a result.
Their approach extracts "facts" from data and then groups them into "behaviours", which in turn generalise into a "model". This allows to easily encode higher-level system semantics, should one want to do that. Key enabling property: "relationships" between behaviours. They have invented a modeling language that can express relationships. There are "temporal" relationsships and "concurrent" relationships, as well as "logical" ones, and finally "dependency" relationships (for example, a file open and file close event are related by the common file name). To express these relationships, they made operators such as the squiggly arrow (~>, temporal) or the "olap" operator (overlapping). The framework becomes "semantic" by indentifying relationships.
As an example, they model the Kaminsky DNS cache poisoning attack. This is tricky to analyse using traditional approaches as it requires expertise, uses random values in the data and various other complications. Now they assume that an "expert" has detailed knowledge about how the attack works (essentially a flow-chart). Different paths through the flow-chart result in different outcomes. The behavioural model just considers those outcomes. Using their spiffy modeling language, they can directly express the model in code, so the expert can write it down and then non-experts can do things with it.
What have we achieved here? Took a high-level understanding of the behaviour of a system and transformed it into a model. This model can now be used by a (non-expert) user to ask questions about behaviour: for example, the user can plug in some raw input data (e.g. a trace) and the model can tell them whether the DNS attack succeeded or failed.
Implemented in Python, O(N^2) worst case performance of algorithm. They think this is quite generally applicable. In future work, they also want to be able to model probabilistic behaviour and packet distributions, as well as improve the efficiency and performance. Finally, they describe the advantages of using semantic analysis: (1) composing models to create higher-level meaning, (2) sharing and reusing expertise (feeding a public knowledge base). (ms)
Paxos Replicated State Machines as the Basis of a High-Performance Data Store
Paxos is nice (consistent, persistent, fault tolerant...). Real systems compromise: give up on consistency after failure, require clock sync... They set out to use actual Paxos with its nice properties whilst remaining performant. They restrict themselves to considering datacentre-like situations where we don't care so much about tolerating partitions, and our operations take longer than network latency. A "non-trade-off": even though replicated state machines like those implemented by Paxos seek to agree on a global order of machine inputs, we can still execute the state transitions out-of-order in a manner analogous to OoO execution in modern processors -- we can transparently re-order instrucgtions so long as we respect the semantics of the machine.
They avoid synchronous writes for journalling, though I'm not sure how this protects against ops which appear to have happened to the client but are reverted after a crash. Perhaps it doesn't.
They implemenent their system using a custom kernel-mode disk driver sitting underneath NTFS on the client, and a userland server. Lots of engineering fun to make it efficient -- standard things like pipelining, batching...
I'm not sure I understood the rest, but I *think* the point is that you can get rid of a lot of the synchronous operations that a naive Paxos implementation would do if you're clever and so retain the ability to do e.g. seek minimisation, burst reads on disks.