AAAI/AIES’18 Trip Report

Posted by Jianxin Zhao

I was recently honoured to have the opportunity to present our work “Privacy-preserving Machine Learning Based Data Analytics on Edge Devices” at the AIES'18 conference, which is co-located with AAAI'18, one of the top conferences in the field of AI and Machine Learning. Here is a brief review of some of the papers and trends that I found interesting at this conference.

Activity detection is clearly a hot topic, and new building blocks are being invented. “Action Prediction from Videos via Memorizing Hard-to-Predict Samples” aims to improve prediction accuracy, since one challenge is that different actions may share similar early-phase patterns; the proposed solution is a new CNN plus an augmented LSTM network block. “A Cascaded Inception of Inception Network with Attention Modulated Feature Fusion for Human Pose Estimation” proposes a new “Inception-of-Inception” block to address current limitations in preserving low-level features, adaptively adjusting the importance of features at different levels, and modelling the human perception process. Reducing computation overhead is also a research focus. “R-C3D: Region Convolutional 3D Network for Temporal Activity Detection” aims to reduce the time of activity detection by sharing convolutional features between the proposal and classification pipelines. “A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning” proposes that an agent can learn to find actions by continuously adjusting temporal bounds in a self-adaptive way, reducing the required computation.

Face identification is also a widely discussed topic. “Dual-reference Face Retrieval” proposes a mechanism for recognising a face at a specific age range: take a second reference image from the target age range, then search for similar faces of a similar age. Person re-identification associates person images, captured by different surveillance cameras, with the same person; its main challenge is the large quantity of noisy video sources. In “Video-based Person Re-identification via Self Paced Weighting”, the authors claim that not every frame in a video should be treated equally; their approach reduces noise and improves detection accuracy. In “Graph Correspondence Transfer for person re-identification”, the authors tackle the spatial misalignment caused by large variations in view angles and human poses.

To improve deep neural networks, many researchers seek to transfer learned knowledge to new environments. “Region-based Quality Estimation Network for Large-scale Person Re-identification” is another paper on person re-identification; it proposes a training method that learns the lost information from other regions and thus performs well on low-quality input. “Multispectral Transfer Network: Unsupervised Depth Estimation for all day vision” estimates a depth image from a single thermal image. “Less-forgetful learning for domain expansion in DNN” enhances a DNN so that it remembers previously learned information while learning new data from a new domain. Another line of research is to enhance training data generation. “Mix-and-Match Tuning for Self-Supervised Semantic Segmentation” reduces the dataset size required to train a segmentation network. “Hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map based Feature Extraction for Human Action Recognition” aims to solve the problem that feature extraction needs large-scale labelled data for training; its solution is to adaptively learn effective features from data without supervision.

One common theme across these works is reducing computation overhead. “Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition” achieves this by locating redundant computation in the region-proposal stage of image recognition. “Auto-Balanced Filter Pruning for Efficient Convolutional Neural Networks” compresses a network by discarding a large fraction of its filters via a proposed two-pass training approach. Another trend is to combine multiple input sources to improve accuracy. “Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion” combines multiple video streams to achieve more precise feature extraction at different granularities. “Multimodal Keyless Attention Fusion for Video Classification” combines multiple single-modal models, such as RGB, optical-flow, and sound models, to address the difficulty of combining CNN and RNN models for joint end-to-end training directly on large-scale datasets. “Hierarchical Discriminative Learning for Visible Thermal Person Re-Identification” improves person re-identification by cross-comparing visible and thermal video streams.
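The paper’s actual two-pass procedure is more involved, but the core idea behind filter pruning — rank a convolution layer’s filters by magnitude and discard the weakest — can be sketched in a few lines of NumPy (the function and keep ratio below are my own illustration, not the authors’ method):

```python
import numpy as np

def prune_filters(conv_weights, keep_ratio=0.5):
    """Keep the filters with the largest L1 norms; drop the rest.

    conv_weights: array of shape (num_filters, in_channels, kh, kw).
    Returns the pruned weight tensor and the indices of kept filters.
    """
    norms = np.abs(conv_weights).reshape(conv_weights.shape[0], -1).sum(axis=1)
    num_keep = max(1, int(conv_weights.shape[0] * keep_ratio))
    keep = np.argsort(norms)[-num_keep:]  # indices of the strongest filters
    keep.sort()                           # preserve the original filter order
    return conv_weights[keep], keep

# Example: prune half the filters of an 8-filter 3x3 conv layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(pruned.shape)  # (4, 3, 3, 3)
```

In a real network, pruning a layer’s filters also shrinks the input channels of the next layer, and the pruned model is usually fine-tuned to recover accuracy.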

It is no surprise that few system-related papers are presented at this conference. “Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design” focuses on the overhead of floating-point arithmetic when transplanting CNNs to FPGAs. “AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training” proposes a gradient compression technique to reduce the communication bottleneck in distributed training.
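For intuition, residual gradient compression in this spirit roughly works as follows: each worker transmits only its largest-magnitude gradient entries and accumulates the untransmitted remainder locally into the next step. A minimal sketch (my own simplification, not the paper’s actual algorithm):

```python
import numpy as np

def compress_gradient(grad, residual, ratio=0.01):
    """Send only the largest-magnitude entries; accumulate the rest locally.

    grad, residual: 1-D arrays of the same length.
    Returns (sparse values, their indices, updated residual).
    """
    acc = grad + residual                       # add leftover from previous steps
    k = max(1, int(acc.size * ratio))
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # top-k by magnitude
    values = acc[idx]
    new_residual = acc.copy()
    new_residual[idx] = 0.0                     # transmitted entries are cleared
    return values, idx, new_residual

rng = np.random.default_rng(1)
g = rng.normal(size=1000)
r = np.zeros_like(g)
vals, idx, r = compress_gradient(g, r, ratio=0.01)
print(vals.size)  # 10 of 1000 entries transmitted
```

The residual term is what makes this “adaptive” over time: small gradient entries are not lost, only deferred until they accumulate enough magnitude to be sent.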

Research from industry takes up a large portion of the conference. IBM presents a series of papers and demos. For example, “Dataset Evolver” is an interactive Jupyter notebook-based tool that helps data scientists perform feature engineering for classification tasks, and “DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers” proposes to automatically extract design flow diagrams and tables from existing research, translate them into abstract computational graphs, and then into Keras/Caffe implementations. A full list of IBM’s work at AAAI can be seen here. Google presents “Modeling Individual Labelers Improves Classification” and “Learning to Attack: Adversarial Transformation Networks”, and Facebook shows “Efficient Large-scale Multi-modal classification”. Both companies focus on specific application fields compared to IBM’s wide spectrum of research. Much of the industry research is closely tied to applications, such as Alibaba’s “A multi-task learning approach for improving product title compression with User search log data”. Curiously, though, financial companies were nowhere to be found at the conference.

On the other hand, the top universities tend to focus on theoretical work. “Interpreting CNN Knowledge Via An Explanatory Graph” from UCLA aims to explain a CNN model and improve its transparency. The University of Tokyo presents “Constructing Hierarchical Bayesian Network with Pooling” and “Alternating circulant random features for semigroup kernels”. CMU presents “Brute-Force Facial Landmark Analysis with A 140,000-way classifier”. Together with ETH Zurich, MIT shows “Streaming Non-monotone submodular maximization: personalized video summarization”. The work of UC Berkeley, however, seems to be absent from this conference.

Adversarial learning is a key topic across different vision-related research areas; examples include “Adversarial Discriminative Heterogeneous Face Recognition”, “Extreme Low resolution activity recognition with Multi-siamese embedding learning”, and “End-to-end united video dehazing and detection”. The tutorial “Adversarial Machine Learning” gives an excellent introduction to the state of the art on this topic. Prof. Zoubin Ghahramani from Uber gives a talk on his vision of Probabilistic AI, which is also one of the trends at this conference.
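To give a taste of what “adversarial” means here, the classic Fast Gradient Sign Method (FGSM) — not taken from any of the papers above, just the standard textbook attack — nudges an input in the direction that most increases the model’s loss. A minimal sketch against a logistic-regression model:

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps=0.1):
    """Fast Gradient Sign Method on a logistic-regression model.

    Perturbs input x by eps in the direction that increases the
    cross-entropy loss for the true label y (0 or 1).
    """
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))  # sigmoid prediction
    grad_x = (p - y) * w          # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad_x)

# A point correctly classified as class 1 (positive logit) ...
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])
x_adv = fgsm_perturb(x, w, b, y=1, eps=0.6)
print(x @ w + b, x_adv @ w + b)  # the logit drops after the attack
```

Even this toy example shows the core unease driving the field: a small, bounded perturbation of the input is enough to flip the model’s decision.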

Best paper of this year goes to “Memory-Augmented Monte Carlo Tree Search” from University of Alberta, and best student paper to “Counterfactual Multi-Agent Policy Gradients” from, ahem, the other place.

These papers only scratch the surface of the AAAI’18 programme, and mostly reflect my personal interest in computer vision. If you are interested, please refer to the full list of accepted papers.
