This year's SIGKDD conference returned to San Diego, California after 12 years, bringing together Data Mining and Knowledge Discovery experts from around the world. The elite of heavyweight data scientists was hosted at the largest hotel on the West Coast and, together with industry experts and government technologists, numbered more than 1100 attendees, a record in the conference's history.
The gathering kicked off with tutorials, with two classics running in parallel: David Blei's topic models and Jure Leskovec's extensive work on Social Media Analytics. Blei offered a refreshing talk that stretched from the very basics of text-based learning to the most up-to-date extensions of his work, with applications in streaming data and the online version of the paradigm, which allows one to scale the model up to huge datasets and so satisfy the requirements of modern data analysis. Leskovec elaborated on a large spectrum of his past work, covering a wide range of topics including the temporal dynamics of news articles, sentiment polarisation analysis in social networks and information diffusion in graphs by modelling the influence of participating nodes. The first day's menu on the social front was completed with Lada Adamic's presentation on the relationship between structure and content in social networks. Her talk at the Mining and Learning with Graphs Workshop provided an empirical analysis of a variety of online domains, describing how the flow of novel content in those systems reflected variations in the patterns of interaction amongst individuals. The day closed with the conference's plenary opening session, which featured submission and reviewing highlights and the usual KDD award ceremonies. The awards honoured the decision trees man, Ross Quinlan, who presented a historical overview of his work, and a data mining legion of 25 students from NTU that won this year's KDD Cup on music recommendations.
After a second night of sleep and repeated jetlag-induced wake-ups, Monday rolled in and the conference opened with sessions on user classification and web user modelling. A follow-up in the afternoon, the presentation of the (student) award-winning work on the application of topic models to scientific article recommendation, attracted the interest of many. The conference's dedicated session on online social networks also reflected the Data Mining community's interest in this currently hot domain. The session opened with an interesting work on predicting semantic annotations in location-based social networks, in particular the prediction of missing labels for venues that lacked user-generated semantic information. While the machine learning part of the work was sound, its applicability to a real problem was questioned, suggesting the need to identify the essential challenges in a relatively new application area. Nonetheless, the keyword of the day was scalability: two talks focused on an ever-classic machine learning problem, clustering, introduced in the context of the trendy MapReduce model. Alina Ene from the University of Illinois introduced the basics, whereas the Brazilian Robson Cordeiro offered novel insights with a cutting-edge algorithm for clustering huge graphs. The work, driven by the guru Christos Faloutsos, combined the elegance of simplicity with the virtues of effectiveness, showing that, for some, size does not matter and petabytes of data can be crunched in minutes. A poster session brought the curtain down on another day. The crowd was not discouraged by the conference organisers' offer of only one free drink, and a vibrant set of interactions took place. Some were discussing techniques, some were looking for new datasets, while social cliques were also forming in the corners of the hotel's huge Douglas Pavilion.
Day 3 drove the conference participants to the dark technical depths of the well-established topic of matrix factorisation, which was followed by the user modelling session. Yahoo!'s Bee-Chung Chen gave an intriguing presentation on user reputation in a comment rating environment, followed by the lucid talk of Panayiotis Tsaparas on the selection of a useful subset of reviews for Amazon products that were plagued by tons of reviews. The Boston-based Greek gang of Microsoft Research also showed how Mechanical Turk can be used to assess the effectiveness of review selection in such systems. Poster session number 2 closed the day, and the group's work on link prediction in location-based social networks was up. The exhausting but fruitful three-hour interaction with location-based enthusiasts, agnostics and doubters was a good opportunity to get the vibe of the community on an up-and-coming hot topic. For application developers and online service providers, the work was an excellent example of how location-based data could be used to drive personalised and geo-temporally aware content to users. For data mining geeks, it presented an unexplored territory where existing techniques could be tested and novel ones devised. At the end of the poster session many of the participants headed out for a taste of downtown San Diego, whereas the relaxing boat trips on the bay were also popular.
The final day of the conference was marked by Kaggle's visionary entrepreneur Jeremy Howard and a panel of experts in data mining competitions. The panel aimed to analyse the problems that arose during previous competitions and the lessons learned for the creation of new, successful ones. Howard presented radical views, suggesting that the future of data mining and problem solving would be delivered in the form of competitions. Not only could competitions attract an army of approximately 10 million data analysts around the globe, but their design could promise a sustainable economic model that would bring money to all participants (even non-winners) and would perhaps put at stake a respectable number of PhD careers. His philosophy was driven by the idea that, to solve challenging problems effectively, you need to awaken the diverse pool of minds that is out there and that can constitute an infinite source of innovation.
But KDD attracted the interest not only of scientists and corporate experts, but also of politicians. Ahead of the 2012 elections, the Obama data mining team is here and hiring! Rayid Ghani, chief scientist at Obama for America, highlighted the important role of predictive analytics and optimisation problems in the battle for an electorate that traditionally decides winners by only small margins. It remains to be seen whether science will beat Tea Party style propaganda and maximise positive votes in a bumpy and complex socio-political landscape. The political world was also (quietly) represented by government data scientists and secret service analysts who were seeking to catch up with the state of the art in data mining and knowledge discovery, a vital survival requirement in a world overflowing with data and subsequent leaks...
The full proceedings of KDD 2011 can be found here.
I recently had the opportunity to attend FAST (the USENIX Conference on File and Storage Technologies) in sunny San José. Despite "only" running for two days, the program was packed with presentations of interesting research papers.