syslog
16Apr/132

Liveblog from EuroSys 2013 – Day 2

Hi from all of us here in Prague -- this is day 2 of EuroSys and we'll be running the live blog as usual!

Your friendly bloggers are Natacha Crooks (nscc), Ionel Gog (icg), Valentin Dalibard (vd) and Malte Schwarzkopf (ms).

Session 1: Large scale distributed computation II

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Zuhair Khayyat, Karim Awara, and Amani Alonazi (King Abdullah University of Science and Technology), Hani Jamjoom and Dan Williams (IBM T. J. Watson Research Center, Yorktown Heights), and Panos Kalnis (King Abdullah University of Science and Technology)

 

Natacha Crooks - Mizan: a system for dynamic load balancing in large-scale graph processing

Researchers use graphs to abstract application-specific algorithms into generic problems represented as interactions between vertices and edges. Although these applications all have similar computation behaviour, each has its own computation requirements. A key system is Pregel, which is based on vertex-centric computation and BSP. The main ideas of Pregel are in-memory execution and message passing. Pregel runs a sequence of supersteps, with a barrier marking the end of each superstep. Balanced computation and communication is fundamental to Pregel's efficiency, as performance depends on the slowest worker. Existing work focuses on optimising for the graph structure: the claim is that users know what their data looks like, so they should know how to partition it well. But the authors believe that none of this considers the algorithm's behaviour, and that looking at static properties (like graph structure) is not enough; dynamic optimisations are needed. They identify two algorithm categories based on behaviour: stationary and non-stationary. PageRank is an example of a stationary algorithm (the same set of vertices remains active), whereas DMST is non-stationary. The computation imbalance in non-stationary algorithms arises because, within a superstep, some workers do the bulk of the computation and receive most of the messages, and this changes over time (given the changing set of active vertices).
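The vertex-centric BSP model described above can be sketched in a few lines (a toy Python model, not Mizan's or Pregel's actual code; the damping factor and fixed superstep count are illustrative):

```python
class Vertex:
    """Minimal vertex state for a Pregel-style computation."""
    def __init__(self, vid):
        self.id, self.value = vid, 1.0

def pagerank_vertex(vertex, messages, out_edges, superstep):
    """One PageRank step for a single vertex (damping factor 0.85 assumed)."""
    if superstep > 0:
        vertex.value = 0.15 + 0.85 * sum(messages)
    if out_edges:
        share = vertex.value / len(out_edges)
        return [(dst, share) for dst in out_edges]   # messages for next superstep
    return []

def bsp_run(graph, compute, supersteps):
    """graph: {vid: [out-neighbour ids]}. Runs synchronous supersteps:
    each superstep every vertex consumes its inbox and fills an outbox;
    the barrier is modelled by swapping outbox into inbox."""
    vertices = {v: Vertex(v) for v in graph}
    inbox = {v: [] for v in graph}
    for step in range(supersteps):
        outbox = {v: [] for v in graph}
        for vid, vtx in vertices.items():
            for dst, msg in compute(vtx, inbox[vid], graph[vid], step):
                outbox[dst].append(msg)
        inbox = outbox          # barrier: messages delivered next superstep
    return {v: vertices[v].value for v in graph}
```

On a regular graph like a 3-cycle, every vertex converges to the same rank, which makes the model easy to sanity-check.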

Mizan is a BSP-based graph processing framework which uses runtime fine-grained vertex migrations to balance computation and communication. The key objectives of Mizan are to be decentralised, simple, and transparent: no changes to Pregel's API, and no a priori knowledge of the graph structure or algorithm assumed. Mizan also consists of supersteps, but adds an extra migration layer in which it does all of its planning and migration: it computes statistics, plans anew, and then performs the migrations.

To plan the migrations, they look for the source of the imbalance by comparing each worker's execution time against a normal distribution and flagging outliers. Mizan monitors per-vertex statistics: remote outgoing messages, all incoming messages, and response times. These statistics are broadcast to every worker. Mizan then tries to find the strongest cause of workload imbalance by comparing the outgoing- and incoming-message statistics of all workers against the workers' execution times. It then pairs each over-utilised worker with a single under-utilised worker. After pairing, it decides which vertices to migrate, and then does the migration. The key questions: 1) how to learn new locations, 2) how to migrate quickly, 3) how to recompute state, 4) how to broadcast the information. They use a DHT to implement a distributed lookup service: a vertex can execute at any worker, but there is a notion of a home worker. Workers ask the home worker of V for its current location, and the home worker is notified of changes. For vertices with large message sizes, they introduce delayed migration: they move the ownership of the vertex one superstep before actually moving the vertex's data.
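The outlier-flagging and pairing steps might look roughly like this (a Python sketch; the z-score threshold and the greedy matching are our guesses at the idea, not Mizan's exact policy):

```python
from statistics import mean, stdev

def plan_pairs(exec_times, z=1.0):
    """Flag workers whose superstep time is more than z standard deviations
    above the mean, then pair each flagged (over-utilised) worker with the
    currently least-loaded worker. exec_times: {worker: seconds}."""
    mu, sigma = mean(exec_times.values()), stdev(exec_times.values())
    overloaded = [w for w, t in exec_times.items() if t > mu + z * sigma]
    underloaded = sorted((w for w in exec_times if w not in overloaded),
                         key=lambda w: exec_times[w])
    # Worst offender gets the most idle partner, and so on down the lists.
    return list(zip(sorted(overloaded, key=lambda w: -exec_times[w]), underloaded))
```

Each resulting pair would then negotiate which vertices to migrate between themselves, keeping the planning decentralised.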

Q: did you try the single-machine approach?
A: Yes, we ran some of the algorithms on a single machine, but didn't really look at the overhead.
Q: but this fits in memory (300 million nodes)?
A: Yes, you can run it on a single machine, but that wasn't interesting for us because we wanted to see how it would work in a distributed setting, and we think that would scale up.

Valentin:

The talk starts by describing the BSP model and its Pregel implementation. The discussion then moves on to the partitioning method used for splitting the vertices among machines.
They split graph algorithms into two classes, stationary and non-stationary, referring to which vertex programs need to be executed. Non-stationary algorithms execute different vertices at different BSP iterations, like minimum spanning tree; stationary algorithms execute the same ones every time, like PageRank.
=> They conclude that graph structure is not everything (I think he is referring to PowerGraph optimizing for the power-law distribution), but that the way the runtime works is also important.
Mizan uses a BSP model, exactly the same API as Pregel.

Mizan adds a Migration Barrier to the execution model after the BSP barrier, with the aim of rebalancing the partitioning there. What they do at this barrier:
-Identify the source of imbalance (by looking at runtime)
-Select the migration objective
-Pair over utilized workers with underutilised one in a decentralised way
-Migrate vertices

There is now the problem that we don't know where vertices are located when sending messages. They solve this by using a DHT mapping vertex IDs to workers. They move large vertices in multiple goes: basically, some of the messages are sent to the old worker, which forwards them to the worker that now holds the vertices.
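The home-worker lookup scheme can be sketched as a small directory service (a toy Python model; the hashing and notification details are assumptions on top of what the talk described):

```python
class LocationService:
    """Sketch of the home-worker idea: vertex v's *home* worker is fixed by
    hashing its ID; the home tracks v's *current* worker and is told of moves."""
    def __init__(self, workers):
        self.workers = workers
        self.current = {}                      # vertex id -> current worker

    def home(self, vid):
        # Deterministic for integer ids; a real DHT would hash arbitrary keys.
        return self.workers[hash(vid) % len(self.workers)]

    def lookup(self, vid):
        # In the real system this is a message to home(vid); here, a dict read.
        return self.current.get(vid, self.home(vid))

    def migrate(self, vid, new_worker):
        self.current[vid] = new_worker         # notify the home worker of the move
```

Until a vertex migrates, its home worker is also its current location, so the lookup falls through to the hash.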

My personal point of view: they evaluate on graphs no bigger than 2.5GB, which fits into the memory of a single computer. They showed no scaling for the sort of graphs you would hope to run on what they called "1000s of machines". And yet they didn't look at the overhead compared to a single-machine implementation.

A nice question was also that when they repartition, they are one step behind in terms of workload, and it is not obvious that successive BSP iterations correlate in workload.

MeT: Workload aware elasticity for NoSQL

Francisco Cruz, Francisco Maia, Miguel Matos, Rui Oliveira, Joao Paulo, Jose Pereira, and Ricardo Vilaca (HASLab / INESC TEC and U. Minho)

 

Natacha Crooks - Elasticity is specific to the cloud computing paradigm: growing resources according to demand. NoSQL databases manage the bulk of data for modern web applications; they are scalable and dependable systems, partitioning data across several computing nodes. An external system is required to know when to scale out/up, i.e. add/remove nodes. Correctly configuring a NoSQL database is a difficult task because there are many configuration parameters. For example, in HBase, block cache size and memstore size are the parameters that most affect cluster performance: block cache size favours read requests, memstore size favours write requests. But reconfiguring these parameters requires restarting the system.

There is heterogeneity in how data is accessed. Different applications have different access patterns, which may change over time, and within an application there can be data access hot spots. But they claim that locality of accesses is no longer relevant. To validate their hypothesis, they compare random data placement with a homogeneous configuration, manual placement with a homogeneous configuration, and manual placement with a heterogeneous configuration, where they classify partitions per access pattern using a manual data load balancer.

MeT is a cloud management framework for automatic management of NoSQL database clusters, which reconfigures them automatically based on data access patterns. The decision maker is two-fold: a decision algorithm, which based on resource usage metrics decides whether to add or remove nodes, and a distribution algorithm, which first classifies regions, then groups them together, then assigns them to nodes with the right configuration.

They claim that by configuring databases heterogeneously, throughput can be improved by 35%, in both multi-tenant and single-tenant scenarios. Data partitions can be assigned to specifically configured nodes according to their access patterns.
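The distribution algorithm's classify-then-assign step could look something like this sketch (the thresholds and configuration names are invented for illustration, not MeT's):

```python
def distribute(regions):
    """Toy version of MeT's distribution step: classify each region by its
    read ratio, so read-heavy regions land on nodes configured with a large
    block cache and write-heavy regions on nodes with a large memstore.
    regions: {name: (reads, writes)} observed request counts."""
    assignment = {"read-optimised": [], "write-optimised": [], "mixed": []}
    for name, (reads, writes) in regions.items():
        ratio = reads / (reads + writes)
        if ratio > 0.7:                        # threshold is a placeholder
            assignment["read-optimised"].append(name)
        elif ratio < 0.3:
            assignment["write-optimised"].append(name)
        else:
            assignment["mixed"].append(name)
    return assignment
```

The point of the split is that each node can then be started once with the configuration that suits its regions, avoiding the restart cost of reconfiguring a node in place.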

Q: the decision algorithm decides whether the system is in a suboptimal state, but the distribution algorithm is separate from it, so it doesn't depend on the first one. How does this work? And in the distribution algorithm, can you get oscillations where the migration cost is expensive? If so, what do you do to prevent this?
A: In the first iteration, we have no information about the cluster. We just look at what the monitor says (CPU utilisation, etc.) and decide whether adding new nodes can lower that. On the second question: we are currently working on a cost function for deciding whether or not to alternate between two states.

Q: how would MeT do for real application workloads (highly skewed workloads)?
A: MeT would look at that load and decide that a particular configuration should be assigned to an entire region server.

 

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (University of Chicago), and Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber (HP Labs)

 

Natacha Crooks -
There is a trend in large-scale processing frameworks: data-parallel frameworks process each record in parallel; graph-centric frameworks address the limitations of the previous ones; and the latest are array-based frameworks (MadLINQ), which process blocks of an array in parallel, for example computing PageRank using matrices. A number of algorithms can be written as linear algebra operations on sparse matrices, for which array-based frameworks are very well suited.
Presto enables large-scale machine learning and graph processing on sparse matrices. Their approach is to extend R to make it scalable and distributed. One problem is that there is tremendous imbalance in the amount of data located per block, which leads to computation imbalance. The second challenge is how to share data efficiently: sparse matrices suffer significantly from communication overhead. You can't just share data through pipes/networks because that is both time-inefficient (sending copies) and space-inefficient (multiple copies).

They add two primitives to R. One is darray: given large data, darray gives you a handle onto the distribution you want (row/column-based partitioning). The other, "foreach", lets the user specify a function to be executed on particular parts of the array; the foreach construct then executes that function on the cluster. The programmer has the flexibility to specify which partitions are going to be accessed and to compute the function on the cluster.
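Presto's primitives are R extensions, but their semantics can be mimicked in a few lines of Python (the even partition sizes and serial execution are simplifications; Presto runs the function on cluster workers):

```python
class DArray:
    """Stand-in for Presto's darray: data split into nparts partitions."""
    def __init__(self, data, nparts):
        size = (len(data) + nparts - 1) // nparts
        self.parts = [data[i:i + size] for i in range(0, len(data), size)]

def foreach(indices, fn, darr):
    """Stand-in for Presto's foreach: apply fn to each named partition,
    writing the result back in place (Presto would ship fn to the workers
    holding those partitions)."""
    for i in indices:
        darr.parts[i] = fn(darr.parts[i])
```

The key property this models is that the programmer names which partitions a step touches, which is what lets the runtime schedule the work where the data lives.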

The Presto architecture is master-slave. The master is linked with the user shell. Each worker consists of a number of R instances, which are managed by a worker process. All the darrays are stored in DRAM on each machine.

They provide an online repartitioning system. They profile execution based on the amount of time it takes to compute each partition, and can then check whether there is an imbalance between partitions. They do this online repartitioning in an iterative manner. One of the big challenges is how to deal with multiple distributed arrays, and how to maintain size invariants between them.

To share distributed arrays efficiently, the objective was zero-copy sharing across cores. The intuition is that immutable partitions are safe and easy to share. They create versioned distributed arrays: as soon as a partition changes, a new version is created, and versions can be assembled into a full array. This makes sharing a lot easier. Other challenges include garbage collection and header conflicts (linked to having multiple R instances sharing objects). So they override R's allocator: they allocate process-local headers and map the data in shared memory.

Q: have you talked with people who write R?
They are interested in things of the sort, but they are interested in doing high level things like "do linear regression", so whats left to do is build these libraries on top.

Q: what about matrix/matrix multiplication? (as opposed to matrix vector partition)?
Yes, also apply (see Netflix Collaborative Filtering slide)

Valentin:

With big data, a trend of machine learning + graph algorithms. He studies the trend in what is computed:
MapReduce 2004
Pregel 2010 to do graph computation
MadLINQ 2012 process blocks arrays in parallel

Presto extends R to be scalable and distributed for sparse matrices. The first issue comes from how to split the matrix: if you just do it at random, the power law means some partitions are much denser than average. The second issue is how to send messages efficiently.

They add 2 primitives to R:
-darray: distributed arrays
-foreach(): function that can be executed in parallel on the darray.

Presto architecture: Master/worker
Master controls what goes where, but workers talk between each other to pass messages.

Dealing with partitioning. They repartition at runtime by profiling the time it takes to do iterations on each machine. They repartition if max_time/median_time > delta.
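That trigger is essentially a one-liner (delta's value is not given in the talk, so 1.5 here is a placeholder):

```python
from statistics import median

def should_repartition(times, delta=1.5):
    """Presto-style trigger as described in the talk: repartition when the
    slowest partition's time exceeds delta times the median partition time."""
    return max(times) / median(times) > delta
```

Using the median rather than the mean keeps one straggler from distorting the baseline it is compared against.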

Dealing with the sharing efficiently. They override R's allocator. Not too sure what went on there.

Evaluation:
The first thing they show is how easy it is to use: it has all the advantages of R (e.g. plotting), and it looks very neat. They then show the efficiency of the dynamic repartitioning, and they do seem to get very good results.

A good question: isn't linear algebra restrictive? A: yes it is, but a lot of linear algebra is used, so we may as well make it efficient.

 

Session 2: Operating System Implementation

 

RadixVM: Scalable address spaces for multithreaded applications

Austin T. Clements, Frans Kaashoek, and Nickolai Zeldovich (MIT CSAIL)

 

Natacha Crooks - Parallel applications use VM intensively, which puts stress on the memory system: because it has to serialise all memory operations, there are huge scalability problems. Independent VM operations operate on non-overlapping regions, but the memory system can't get this to scale. The goal is to have perfectly scalable mmap, munmap, and page-fault operations on non-overlapping address space regions.

The problem is that most OSes have a big lock around their page table. The OS doesn't know which CPUs have a page cached, so it has to broadcast shootdown requests. There's also heavy cache contention on TLB misses. The big issue here is cross-core communication. RadixVM addresses this by eliminating communication between cores, using a concurrent memory-map representation, and targeting TLB shootdowns.

The OS needs to store metadata about all memory mappings in memory. Most OSes use a balanced tree of region objects, which introduces unnecessary communication even though it is memory-efficient: there is still a need to transfer ownership of tree nodes just for reads. One could use an array-based memory map, where operations on non-overlapping regions are concurrent and induce no communication, thus avoiding ownership transfers; but the space usage is gigantic, and operations take time proportional to the size of the region they're operating on.

The solution proposed is a range-oriented radix tree, which enables good compression: fold constant-valued chunks into the parent, recursively. It's only 2x-3x the size of the balanced region tree whilst having better performance.
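The constant-folding idea can be sketched with nested lists standing in for radix nodes (a toy model, not RadixVM's in-kernel structure; a leaf value stands for a whole mapped range sharing one mapping):

```python
def fold(node):
    """Recursively collapse any radix node whose slots all hold the same
    mapping value into that single value in its parent: a large region with
    one mapping costs one slot instead of a full subtree."""
    if not isinstance(node, list):
        return node                       # already a folded constant
    node = [fold(child) for child in node]
    first = node[0]
    if not isinstance(first, list) and all(c == first for c in node):
        return first                      # fold constant-valued chunk upward
    return node
```

This is why operations on a huge uniform mapping stay cheap: the tree only expands where mappings actually differ.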

TLB shootdowns: in the common case, there is little or no sharing. A software-managed TLB would make this easy, because one could implement a trap-and-track mechanism. This can be simulated with per-core page tables, interposing on page faults: when a CPU misses, record the fact that this CPU now knows about this particular mapping, so TLB shootdowns can be targeted.

The last scalability issue is reference counting for physical pages and radix nodes. The reference counter has to be scalable: there is a need to limit communication between increment and decrement operations (if they land on the same radix node). So they use a distributed counter, giving each CPU its own slot. But it's very expensive to detect when the counter has gone down to 0. One idea is to start with a distributed counter but build a tree on top, so you only need to look at the root of the tree to know whether the count is 0. Their solution, refcache, gives up immediate zero detection: it uses a shared counter with a per-core cache of changes to that counter. Each CPU keeps a cache of changes, rather than values; when a CPU performs an operation on a refcount, it goes to its local cache. This means that the true value of the refcount is the sum of the global count and the deltas stored in the per-core caches. So the question is: when is the true count 0? They make one assumption: when the true count reaches zero, it will stay zero. They divide time into epochs; each epoch, all CPUs flush their delta caches, and if an object's global count stays zero for a whole epoch, then its true count is zero. The claim is that refcache enables time- and space-efficient scalable reference counting with minimal latency.
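A single-threaded toy model of refcache's per-core delta caches and epoch flushing (the real refcache runs concurrently and requires the global count to stay zero for a full epoch before freeing; this sketch only checks for zero at the flush):

```python
from collections import defaultdict

class Refcache:
    """Per-core delta caches over a shared count: inc/dec touch only
    core-local state, and true zero is detected at epoch boundaries."""
    def __init__(self, ncores):
        self.deltas = [defaultdict(int) for _ in range(ncores)]  # obj -> delta
        self.counts = defaultdict(int)                           # global counts

    def inc(self, core, obj):
        self.deltas[core][obj] += 1          # no cross-core communication

    def dec(self, core, obj):
        self.deltas[core][obj] -= 1

    def end_epoch(self):
        """Flush every core's deltas into the shared counts; return the
        objects whose true count is now zero (candidates for freeing)."""
        for cache in self.deltas:
            for obj, d in cache.items():
                self.counts[obj] += d
            cache.clear()
        return [obj for obj, c in self.counts.items() if c == 0]
```

Note how the two increments and one decrement on different cores never touch a shared cache line until the epoch boundary.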

So the big picture: the radix-tree memory map, the per-core page tables, and the reference-counted physical pages are the three structures they use. Page faults lock the faulting pages, record the faulting CPU, allocate the backing page, and increment the physical page's reference count. munmap sends targeted shootdowns, decrements the count in the local cache, and then removes backing pages.

For the Metis multicore MapReduce benchmark, Linux fails to scale because of page-fault lock contention; RadixVM performs significantly better, and stops scaling only because of pairwise sharing. Refcache avoids cache-line sharing: they repeatedly map/unmap a shared physical page and demonstrate that RadixVM scales almost linearly.

Claimed contributions: radix trees, per-core page tables, and refcache for scalable, space-efficient reference counting.

Q: this looks very similar to guarded page tables?
A: not familiar with those, but I will take a look.
Q: with 80 cores, do you have 80x overhead for page tables?
A: in the simple approach, the worst case is 80x, but we find that most applications don't actually hit the worst case.

Ionel:

- Stress on the kernel Virtual Memory system.
- Every popular OS serializes mmap => malloc/free have scalability problems.
- Goal: Perfectly scalable mmap, page fault, munmap on non-overlapping address space regions.
- Why doesn't scale? 1) TLB Shootdowns broadcast, 2) Locking and 3) Cache contention => All three involve cross-core communication.
- Metadata management: popular OS uses a balanced tree of region objects. However, this involves unnecessary communication.
- How about array-based memory map? Good: Operations on non-overlapping regions are concurrent. Bad: Space usage, time is proportional to region size.
- Range-oriented radix tree in which we fold constant-valued chunks into parent, recursively. In practice the tree ends up only being 2x-3x the size of the balanced region tree.
- TLB shootdown - Which CPUs have a mapping cache? OS doesn't really know that.
- A software-managed TLB would make this easy. Trap & track.
- Can simulate software-managed TLB via per core page tables
- Reference counting for physical pages and radix nodes
- Limit communication between inc/dec counters.
- Solution: Refcache - gives up immediate zero-detection to achieve O(1) space and O(1) zero-detection cost.
- Shared counter with per-core delta cache.
- Divide time into epochs. After each epoch, each core will flush its delta caches.

Q: Looks similar to the guarded page tables from L4? What's the difference?
A: I am not familiar with that work. I'll check it out.

 

Failure-Atomic msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data

Stan Park (University of Rochester), Terence Kelly (HP Labs), and Kai Shen (University of Rochester)

 

The key question is how to maintain consistency across application failures. The system is built to allow the programmer to evolve durable state failure-atomically: all or nothing, always consistent despite power outages, process crashes, and fail-stop kernel panics. With msync, this plays well with POSIX; MS_INVALIDATE provides rollback functionality for failed transactions.

To implement this, they need to keep state consistent between msyncs and during an msync. They leverage a journaling-based approach: the journal is a redo log, and each entry is checksummed. File updates are written to the journal; the out-of-place write keeps the file consistent until the full update transaction is durable, and once it is durable, it is applied to the file system. Two options: eager vs. async journaled writeback. Eager writeback flushes all file-system-layer dirty pages, including previously journaled pages. Async writeback distinguishes between unjournaled-dirty and journaled-dirty pages, and can defer non-critical work as a result.
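The redo-log idea can be sketched like this (the entry format (offset, data, crc) is invented for illustration; the real journal lives in the file system and entries are made durable before being applied):

```python
import zlib

def journal_append(journal, offset, data):
    """Write an update out of place as a checksummed redo-log entry;
    the file itself is untouched until the transaction is durable."""
    journal.append((offset, data, zlib.crc32(data)))

def replay(journal, file_bytes):
    """Recovery: apply every entry whose checksum verifies; a torn or
    corrupt trailing entry simply fails its checksum and is dropped."""
    out = bytearray(file_bytes)
    for offset, data, crc in journal:
        if zlib.crc32(data) == crc:
            out[offset:offset + len(data)] = data
    return bytes(out)
```

The checksum is what makes the scheme all-or-nothing: a crash mid-write leaves a half-written entry that recovery can detect and ignore.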

They extend the CFS interface so that it is possible to pass multiple noncontiguous pages in a given range. They can support richer journaling in the file system, where all the work for a failure-atomic operation (multiple noncontiguous block updates) is encapsulated in a single transaction, written as a single journal entry. There are issues with the size of an msync (2MB with the default journal, at least 16MB with a 3GB journal), with isolation in multi-threaded code, and with memory pressure.

 

Composing OS extensions safely and efficiently with Bascule 

Andrew Baumann (Microsoft Research), Dongyoon Lee (University of Michigan), Pedro Fonseca (MPI Software Systems), and Jacob R. Lorch, Barry Bond, Reuben Olinsky, and Galen C. Hunt (Microsoft Research)

Ionel:

- Extensions:
- change the runtime behaviour of an app/OS.
- must be safe to admit in the system even if buggy/insecure
- composable at runtime
- In today's software stack there are limited opportunities/options for adding extensions. No "thin waist" in the stack => Goal: to introduce one.
- Bascule: libos - full os personality as user-mode library.
- Narrow binary interface of primitive OS abstractions sits between LibOS and the host OS.
- Extensions loaded in-process interpose on the Bascule ABI.
- Drawbridge: provides secure isolation of existing apps via picoprocesses and the Windows LibOS.
- Why not support extensibility by modifying the LibOS?
- may not have the source code
- may not be amenable to customization. OSes are not static. They constantly receive updates and patches.
- ABI - stateless and with fixed semantics.
- Two Guest LibOS implementations: Derived from Drawbridge (Windows 8) and Linux 3.2 proof of concept. Demonstrates OS-independent ABI.
- Host implementations: Windows 8 and on Barrelfish.
- Checkpointer extension
- adds migration/ft to unmodified apps and LibOSes.
- track state at runtime (writable/modified VM allocations), outstanding I/O, open streams.
- at checkpoint cancels pending I/O and ABI calls, open file and serializes all the state to file.
- Evaluation:
- Runtime overhead: the base cost is 86 cycles (negligible vs syscall)
- Memory footprint: few KB per thread

Q: What's the configuration story? How do you compose?
A: When you want to start an app, you say I want to run this app and then the package dependency tells you on which packages your app depends.

Q: OSes don't usually start with 1000 calls. What will keep your ABI from growing to that?
A: Two things: careful design, and the fact that the set of things exposed in the ABI is a set of general things. One solution can be to provide an extension that builds on the current ABI.

Natacha Crooks - Extensions change the runtime behaviour of an application/OS; they are developed by a third party but applied by the end user or system integrator. An extension should be safe to admit, even if buggy or insecure. This is difficult to achieve because in today's software stack there are limited places where extensions could interpose. The syscall ABI, for example, has hundreds to thousands of calls and very tight coupling with the OS kernel implementation. A popular approach today is to run on top of a virtual machine, giving a virtual hardware interface; but there are many levels of indirection between top and bottom. There's no useful thin waist in this stack; the goal is to introduce one.

Bascule uses a library OS: a full OS personality implemented as user-mode libraries. The interface between the LibOS and the rest is an in-process interface: a narrow binary interface of primitive OS abstractions. Because the interface is in-process, it can support extensions that are loaded in-process and interpose on the ABI; they are safe and efficient (because interposition happens in the same process). One way of viewing this is as adding an extension mechanism to Drawbridge.

The LibOS is user-mode code, so why not simply modify it? You may not have the source, or you may need to apply security updates and patches from OS vendors (OSes are not static), so if you extend the LibOS it becomes difficult to apply patches to extensions.

The Bascule ABI is a nestable in-process ABI of common OS primitives. The host provides a table of function entry points and a data structure of startup parameters. The guest code, in turn, provides a table of entry points for upcalls that come up from the host.

It provides threads and synchronisation, virtual memory management, an I/O stream abstraction, and exception handling. Tricky aspects include the shared address space, given that there's no protection; their solution is to fix extension locations at startup and require allocation within a region. There are also challenges with stack use across ABI calls, nested exception handling, and thread-local storage on x86.
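Interposition on a table of entry points can be sketched as follows (the call names are made up; the real Bascule ABI is a binary in-process interface, not Python dicts):

```python
# An extension interposes by wrapping the downcall table the host hands to
# the guest: it sees every call without either side changing, and because
# the result is itself a table, extensions compose by stacking wrappers.

def make_host_abi():
    """Stand-in host: one hypothetical downcall that 'writes' a stream."""
    return {"stream_write": lambda data: len(data)}

def tracing_extension(abi, log):
    """Return a new table that logs each call, then forwards downward."""
    def wrap(name, fn):
        def traced(*args):
            log.append(name)
            return fn(*args)
        return traced
    return {name: wrap(name, fn) for name, fn in abi.items()}
```

Stacking the extension twice shows the composability the talk emphasises: each layer observes the call and passes it down unchanged.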

There are two guest LibOS implementations: Windows 8, derived from Drawbridge, and Linux 3.2.

What can extensions do? The Bascule ABI is suitable for extensions that monitor or modify execution, interpose on file/network I/O, and require control over application state. It is less suitable for extensions that require tight coupling with the host or guest. The example given is a checkpointer extension, which adds migration and fault tolerance to unmodified apps and LibOSes. Another example is an architecture adaptation extension.

Claimed contributions: Bascule is a new thin-waist OS ABI, where extensions are loaded at runtime by the end user, with safety guaranteed. It avoids modifications to the LibOS, but enables an extension store model.

 

Session 3: Miscellaneous

Prefetching Mobile Ads: Can advertising systems afford it?

Prashanth Mohan (UC Berkeley) and Suman Nath and Oriana Riva (Microsoft Research)

Valentin:

Most apps are free, and most free apps use ads. Ads consume 23% of total app energy, 95% of which is from communication (downloading the ad). This is due to the tail-energy problem: the radio is woken up and stays up after downloading the ad.

Ad prefetching challenge:
-Every time an app requests an ad, there is an online auction at the ad exchange; this cannot be run in advance, so the infrastructure has to change.
-Ads have deadlines (e.g. bid prices change)
-Not all downloaded ads may be shown, which violates the SLA

Solutions:
=> Assume minimal change in the infrastructure
=> Look for a reasonable deadline: at 30 min, only 0.5% of ads change price
=> Explore the tradeoff in the 3D space: energy, SLA violation, revenue loss (discussed for the rest of the talk)

This depends on the predictability of ad demand. They measure this as the entropy, which turns out to be high (not very predictable). But an hour by hour model of users turns out to work well.

Actual system:
The mobile client talks to a proxy (asking for a prefetch). The proxy guesses the number of ads that should be prefetched and requests that number of slots from the ad network. The result is sent back to the mobile.
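The proxy's guess might be modelled as expected demand over the deadline window, scaled by an aggressiveness knob (this formula is our reading of the idea from the talk, not the paper's exact policy):

```python
def ads_to_prefetch(hourly_rate, k=1.0, deadline_min=30):
    """Guess how many ad slots to request: the hour-by-hour user model gives
    an expected display rate (hourly_rate, ads/hour); scale the expected ads
    within the deadline window by prefetching aggressiveness k."""
    expected = hourly_rate * (deadline_min / 60.0)
    return max(0, round(k * expected))
```

A larger k saves more energy (fewer radio wake-ups) but risks prefetching ads that are never shown, which is exactly the SLA/revenue tradeoff the talk explores.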

They use an overbooking algorithm to evaluate SLA violation. I didn't really understand how that worked (or really what it did), but as far as I understand it turns the measure of SLA violation into a single parameter (the overbooking penalty, called O). The SLA violation is proportional to 1/O.

In the end, they reduce the 3D space above to two dimensions: O and a prefetching aggressiveness k. O spans the SLA-violation/revenue-loss space, and k spans the energy/revenue-loss space (I think).

 

Maygh: Building a CDN from client web browsers

Liang Zhang, Fangfei Zhou, Alan Mislove, and Ravi Sundaram (Northeastern University)

 

Valentin:

Current solutions to distribute content on the web today:
-Subscription
-Ads

Properties of large website:
-Many users
-Same content viewed by many users
-Content is very static

=> Recruit web clients to distribute the content
Current solution to do this:
-Browser plugin
-Client side software
But users have no incentive to do this, so their question is: can we build a system that does not require the user to install extra software?

They present Maygh to solve this.
-Serves as a distributed cache
-Content (image, CSS, javascript...) is named by content hash
-The users run a javascript program that caches content

To make this work, they use a Maygh coordinator: the coordinator looks for who online has the content and points to a nearby online user that has it. Clients use the RTMFP or WebRTC protocols to communicate with the coordinator.

The coordinator:
-Serves as a directory for content
-Allows browsers to communicate together
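The coordinator's directory role can be sketched as a small lookup service (a toy model with a made-up API; the real coordinator also brokers the RTMFP/WebRTC connection setup between browsers):

```python
class Coordinator:
    """Toy Maygh coordinator: a directory from content hash to the online
    clients caching it; lookup returns a peer, else falls back to the origin."""
    def __init__(self):
        self.directory = {}                          # content hash -> holders

    def announce(self, client, chash):
        """A connected client reports that it has cached this content."""
        self.directory.setdefault(chash, set()).add(client)

    def depart(self, client):
        """Client went offline: drop it from every holder set."""
        for holders in self.directory.values():
            holders.discard(client)

    def lookup(self, chash):
        """Return some online holder, or the origin server if none."""
        holders = self.directory.get(chash)
        return next(iter(holders)) if holders else "origin"
```

Because content is named by hash, any holder's copy is as good as the origin's, which is what makes this peer redirection safe.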

Security issue:
-Can users serve forged content? No, as we use content-hash names.
-Can users violate the protocol (claim to have content, DDoS)? Use the same techniques as are used today for these issues, e.g. block IPs.

The one issue the authors don't quite solve (and are very open about it) is the privacy question: can I know what other people have been watching? Their answer is that users can disable Maygh for certain content. This is weird, because from the questions at the end it seemed the user didn't really have a say; it was up to the website. So I am not sure about this.

Evaluation: additional latency in the very worst case is 1.6s, usually around 500ms. WebRTC had significantly better performance than RTMFP (but only works on Chrome and Firefox). To estimate bandwidth saved they use simulation, as Maygh is hard to actually deploy. They show pretty impressive results: a 75% reduction in bandwidth.

Q: What if the content has changed
A: It will have a different name since it is a content hash

Q: Since browsers are not always online, do you need replication between browsers to guarantee good performance? What is a good replication number?
A: We only cache; we do not replicate by prefetching. Geographical locality of interest also means that if people nearby don't have it, it's probably best to download it from the content provider.

Q: How much storage does Maygh use?
A: Maygh uses some space on the Web browser which is usually capped at 5MB

Q:What if  clients want to opt out?
A: Up to the website. For example can accept at the cost of ads.

Q: Scaling?
A: Can have multiple coordinators which communicate in clever ways. It scales.

Natacha Crooks -
A new way to distribute content on the web. The web today is fundamentally client-server and distributes content that way. There are three options for content distribution: serve your own, pay content distribution networks, or rent cloud services. But this imposes a significant monetary burden on the website operator. Current options to support this financial burden are user subscriptions or advertising.

The idea is for clients to help distribute the content. Why? Because typical properties of websites are: many users, the same content viewed by many users, and largely static content. So the idea is to recruit those web clients who see the same data to help serve content. Their motivation is to build a system which requires no additional software.

The goal is to build a CDN that serves as a cache for static content and works with today's web browsers, requiring no changes on the client side. They do that by using recent HTML5 browser features, and so can reduce bandwidth costs for the site operator.

Maygh serves as a distributed cache and always assumes content will be available from the origin. Content must be named by content hash. The key challenge is that browsers are not designed to communicate directly, so they need a way to get clients to talk to each other. They use two protocols, RTMFP and WebRTC, which are peer-to-peer protocols for web browsers. Maygh introduces a coordinator, run by the site operator. First the user requests the root HTML from the site, but when the user fetches content, it first sends a request to the coordinator, which looks for any other online user who has the content; the content is then requested from that user rather than from the site.

The coordinator serves two purposes: it acts as a directory for content, keeping track of content in users' browsers, and it allows browsers to establish direct connections (via RTMFP / WebRTC). On the client side, browsers use RTMFP / WebRTC to communicate with the coordinator, which allows bidirectional communication. Online clients are always connected to the coordinator. Website operators need to include the Maygh javascript and make a small change to the content-loading code.

Security aspects: it is possible to detect forged content using the content hash, to stop users serving false content. Attacks are a risk via clients claiming to have content, or DoS; current security techniques work as usual. With regards to privacy, content is secured by its hash: naming content implies access.

Claimed contributions: hosting a popular website is a substantial monetary burden, and site operators resort to advertising to pay the bills; Maygh recruits web clients to help distribute content, and they show that using Maygh yields a significant bandwidth reduction.

 

  • Zuhair Khayyat

    I appreciate your live blog; it is really helpful to review the presentations

  • Anonymous

    Any comments on A Compiler-level Intermediate Representation based Binary Analysis and Rewriting System? Thanks