Liveblog from SIGCOMM13 – Day 2

Hello again from SIGCOMM 2013, Hong Kong, day 2. I'll be live blogging as much of the event as I can, weather permitting...

Morning Sessions Canceled

The Hong Kong Observatory has issued Typhoon Signal No. 8 due to Tropical Cyclone Utor, which basically means that all government infrastructure has been suspended. This includes the university at which SIGCOMM13 is being held. We are awaiting further updates. Current weather forecasts suggest that this condition will remain until around 2pm today; more details are expected at midday.


Datacenter Networking 1

Ananta: Cloud Scale Load Balancing

  • Desire a load balancer that handles 20Tbps and has N+1 redundancy with quick failover.
  • Distributed load balancer with Paxos for keeping NAT state synced.
  • To manage latency, group pools of 8 ports and 160 VMs.

[mpg39: sorry, this is a bit of a poor summary, I was a bit late and the speaker was very quiet] 
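One idea the bullets above gesture at is that a scale-out load balancer must map every packet of a flow to the same backend/NAT instance. A minimal sketch of per-flow consistent mapping by hashing the connection 5-tuple (this is a generic illustration under my own assumptions, not Ananta's actual mechanism or API; `pick_backend` and the backend list are hypothetical names):

```python
import hashlib

def pick_backend(five_tuple, backends):
    """Hash the connection 5-tuple (src IP, src port, dst IP, dst port,
    protocol) so that every packet of the same flow is sent to the same
    backend instance, keeping per-flow NAT state local to one node."""
    digest = hashlib.sha1(repr(five_tuple).encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]
```

Note that with a plain modulo, changing the backend list remaps many flows; that is one reason real systems pair hashing with replicated flow state.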


Speeding up Distributed Request-Response Workflows

  • Latency matters ==> Strict SLAs for latencies
  • Why? Transient hotspots, network congestion etc.
  • Web services are implemented as complex workflows with a collection of input/output dependencies.
  • Up to 150 workflow stages, degree 40, 10 steps ==> stochastic latencies accumulate
  • Lots of work has been done, but we care about end-to-end latency
  • Reducing latency:
    • Application level techniques:
      1. Reissue a request after a timeout -> uses extra resources.
      2. Terminate a request early: use whatever results have arrived before the timeout -> returns incomplete results.
      3. Priority and pre-emption can be used -> introduces additional contention.
    • Results in conservative parameters for all of the above.
  • Kwiken: uses the workflow description and logs to get an idea of the latency at different stages.
  • Challenges:
    1. Different stages show different gains with different techniques.
    2. Composing costs across stages is non-trivial.
    3. Decomposing end-to-end latency into individual components is hard.
  • Kwiken: End-To-End Design
  • Goal to minimise latency subject to cost <= budget
  • Solution #1: Minimise the sum of variances in workflow latency instead.
  • Solution #2: [..]
  • Solution #3: Construct per-stage variance response curves.
  • Eval: traces from the 45 most-used workflows at Bing, simulated workloads.
  • A 150-stage workflow with a 1% budget yields a 75% latency reduction using a combination of reissues and terminations.
  • Conclusions
    • Complex workflows lead to high tail latencies
    • Kwiken is an end-to-end framework which combines many techniques.
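Of the application-level techniques listed above, reissue-after-timeout is the easiest to sketch. A hedged-request illustration in Python (a generic pattern, not Kwiken's implementation; `request_fn`, the timeout, and `max_copies` are placeholders):

```python
import concurrent.futures

def reissue_after_timeout(request_fn, timeout, max_copies=2):
    """Issue a request; if no copy has completed within `timeout`,
    launch a duplicate and return whichever copy finishes first.
    Trades extra resources (duplicate work) for lower tail latency."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_copies)
    futures = [pool.submit(request_fn)]
    while len(futures) < max_copies:
        done, _ = concurrent.futures.wait(
            futures, timeout=timeout,
            return_when=concurrent.futures.FIRST_COMPLETED)
        if done:
            break  # a copy already finished; no reissue needed
        futures.append(pool.submit(request_fn))  # reissue a duplicate
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)  # let stragglers finish in the background
    return next(iter(done)).result()
```

The conservative-parameter problem mentioned above shows up here directly: the smaller the timeout, the more duplicate load this generates.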

Leveraging Endpoint Flexibility in Data-Intensive Clusters 

  • Communication is crucial to analytics. 33% of the time is spent in comms.
  • Network usage is imbalanced
  • What are the sources of cross-rack traffic?
  • 54%/40% of it comes from DFS writes.
  • DFS - GFS, HDFS, Cosmos etc.
    • Many frameworks communicate via the DFS.
    • Files are 64MB to 1GB
    • Each block is replicated 3x for FT
    • 2 fault domains
    • Uniformly (random) placed for balanced storage
  • How to handle DFS flows? (A few seconds long)
  • Observation: We don't care where the replicas end up.
  • Sinbad: faster writes by avoiding contention during writes.
  • Assumptions:
    • Link util is stable during the write duration
    • All blocks are the same size.
  • Greedy placement minimizes average block/file write time.
  • Reality:
    • Link utilisations are reasonably stable
    • Blocks are mostly (93%) the same size.
  • Sinbad puts the first block through the lowest loaded link.
  • Master:
    • Periodically estimates network hotspots
    • Makes greedy online decisions
    • Adds hysteresis until the next measurement
  • Evaluation
    • 3000-node trace-driven simulation.
    • Simulation matched against a 100-node EC2 deployment.
  • Simulation: 1.39X faster; experiment: 1.26X; with in-memory storage, 1.6X faster.
  • What about storage balance?
    • Hotspots are uniformly distributed
  • Related work:
    • #1 Increase capacity (bisection bandwidth)
    • #2 Decrease load: data locality
    • #3 Balance usage
  • Planning to deploy Sinbad at Facebook.
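The master's greedy online decision described above can be sketched as picking the currently least-loaded link and then inflating that link's estimate until the next periodic measurement (names and structure here are illustrative under the talk's stated assumptions, not Sinbad's actual code):

```python
def place_block(block_size, link_load):
    """Greedy online placement: route the block through the currently
    least-loaded link, then bump that link's load estimate so that
    concurrent placements spread out (hysteresis until the next
    periodic hotspot measurement refreshes the estimates).
    Assumes link utilisation is stable over the few seconds a write lasts."""
    dest = min(link_load, key=link_load.get)
    link_load[dest] += block_size
    return dest
```

For example, with estimated loads {rackA: 10, rackB: 3, rackC: 7}, the first block goes to rackB; if a second placement arrives before the next measurement, the bumped estimate steers it to rackC instead.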


Q: How do you assess the read performance? A: Assume that the file will be read by many readers.

Q: What does the name mean? A: Wanted to call it bad, then nbad, but that sounded bad, so called it Sinbad.

Q: What is the network architecture assumption? A: We need to know the possible locations of bottlenecks.

Q: [can't hear] A: Used open-source GFS via a plugin.

Q: Can you apply this to task scheduling? A: The reason we didn't is that the complexity is very high if you must consider the receiving side as well.
