State-of-the-art machine learning systems rely on graph-based models, with the distributed training of these models being the norm in AI-powered production pipelines. The performance of these communication-heavy systems depends on the effective overlap of communication and computation. While the overlap challenge has been addressed in systems with simpler model representations, it remains an open problem in graph-based models. In this work, we develop a system for communication scheduling which realizes near-optimal overlap of communication and computation in graph-based models. Our system is implemented over TensorFlow and requires no changes in the model or developer inputs. Our system improves throughput by up to 82% in inference and 20% in training, while also reducing the straggler effect by up to 2.8x. Part of our implementation is already merged into the TensorFlow codebase; the rest is publicly available.
Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open sourced upon publication.
Oct 03 2017 cs.CR
We present a mechanism that puts users at the center of control and empowers them to dictate access to their collections of data. Revisiting the fundamental mechanisms in security for providing protection, our solution uses capabilities, access lists, and access rights following well-understood formal notions for reasoning about access. This contribution presents a practical, correct, auditable, transparent, distributed, and decentralized mechanism that is well-matched to current emerging environments including the Internet of Things, smart cities, precision medicine, and autonomous cars. It is based on well-tested principles and practices used in distributed authorization, cryptocurrencies, and scalable computing.
In an effort to overcome the data deluge in computational biology and bioinformatics and to facilitate bioinformatics research in the era of big data, we identify some of the most influential algorithms that have been widely used in the bioinformatics community. These top data mining and machine learning algorithms cover classification, clustering, regression, graphical model-based learning, and dimensionality reduction. The goal of this study is to guide the focus of scalable computing experts in the endeavor of applying new storage and scalable computation designs to bioinformatics algorithms that merit their attention most, following the engineering maxim of "optimize the common case".
Sep 01 2017 cs.CR
While many isolation mechanisms are available to cloud service providers, including virtual machines and containers, side-channel attacks are growing in importance as a remaining security vulnerability, particularly in the presence of shared caches and multicore processors. In this paper we present a hardware-software mechanism that improves the isolation of cloud processes in the presence of shared caches on multicore chips. Combining the Intel CAT architecture, which enables cache partitioning on the fly, with novel scheduling techniques and state cleansing mechanisms, we enable cache-side-channel-free computing for Linux-based containers and virtual machines, in particular those managed by KVM. We perform a preliminary evaluation of our system using a CPU-bound workload. Our system allows Simultaneous Multithreading (SMT) to remain enabled and does not require application-level changes.
Convolutional autoregressive models have recently demonstrated state-of-the-art performance on a number of generation tasks. While fast, parallel training methods have been crucial for their success, generation is typically implemented in a naïve fashion where redundant computations are unnecessarily repeated. This results in slow generation, making such models infeasible for production environments. In this work, we describe a method to speed up generation in convolutional autoregressive models. The key idea is to cache hidden states to avoid redundant computation. We apply our fast generation method to the Wavenet and PixelCNN++ models and achieve up to $21\times$ and $183\times$ speedups respectively.
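The caching idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: a toy stack of dilated causal convolutions with scalar channels and kernel size 2, hypothetical `DILATIONS` and `WEIGHTS`, and a deterministic "sampling" rule that feeds the last output back as the next input. The naive generator recomputes every layer over the full history at each step; the cached generator keeps one queue of past inputs per layer and does constant work per layer per step.

```python
import math
from collections import deque

# Hypothetical toy model: scalar channels, kernel size 2,
# one (w_past, w_now) weight pair per dilated causal layer.
DILATIONS = [1, 2, 4]
WEIGHTS = [(0.5, -0.3), (0.2, 0.7), (-0.4, 0.6)]


def naive_step(history):
    """Recompute every layer over the full history; return the last output."""
    layer_in = list(history)
    for d, (w_past, w_now) in zip(DILATIONS, WEIGHTS):
        layer_out = []
        for t in range(len(layer_in)):
            past = layer_in[t - d] if t >= d else 0.0  # zero causal padding
            layer_out.append(math.tanh(w_past * past + w_now * layer_in[t]))
        layer_in = layer_out
    return layer_in[-1]


def generate_naive(seed, steps):
    """Generate by re-running the whole network on the growing history."""
    samples = [seed]
    for _ in range(steps):
        samples.append(naive_step(samples))
    return samples


def generate_cached(seed, steps):
    """Generate using per-layer queues of the last d inputs to each layer."""
    queues = [deque([0.0] * d, maxlen=d) for d in DILATIONS]
    samples = [seed]
    for _ in range(steps):
        h = samples[-1]
        for q, (w_past, w_now) in zip(queues, WEIGHTS):
            past = q[0]   # this layer's input from d steps ago
            q.append(h)   # store the current input; the oldest entry is evicted
            h = math.tanh(w_past * past + w_now * h)
        samples.append(h)
    return samples
```

Both generators produce the same sequence, but the naive one does work proportional to the history length at every step, while the cached one touches only one queue entry per layer.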
Dec 05 2016 cs.DC
During the past decade, machine learning has become extremely popular and can be found in many aspects of our everyday life. Nowadays, with the explosion of data and the rapid growth of computation capacity, Distributed Deep Neural Networks (DDNNs), which can improve their performance linearly with more computation resources, have become hot and trending. However, there has not been an in-depth study of the performance of these systems and how well they scale. In this paper we analyze CNTK, one of the most commonly used DDNNs, by first building a performance model and then evaluating the system in two settings: a small cluster with all nodes in a single rack connected to a top-of-rack switch, and at large scale using Blue Waters with arbitrary placement of nodes. Our main focus was the scalability of the system with respect to adding more nodes. Based on our results, this system has an excessive initialization overhead because of poor I/O utilization, which dominates the whole execution time. Because of this, the system does not scale beyond a few nodes (4 in Blue Waters). Additionally, due to a single-server, multiple-worker design, the server becomes a bottleneck after 16 nodes, limiting the scalability of CNTK.
Neural networks are usually over-parameterized, with significant redundancy in the number of required neurons, which results in unnecessary computation and memory usage at inference time. One common approach to address this issue is to prune these big networks by removing extra neurons and parameters while maintaining the accuracy. In this paper, we propose NoiseOut, a fully automated pruning algorithm based on the correlation between activations of neurons in the hidden layers. We prove that adding output neurons with entirely random targets results in a higher correlation between neurons, which makes pruning by NoiseOut even more efficient. Finally, we test our method on various networks and datasets. These experiments exhibit high pruning rates while maintaining the accuracy of the original network.
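Correlation-based pruning of the kind described above can be sketched as follows. This is one plausible reading, not the NoiseOut algorithm as published: the abstract only states that pruning relies on correlation between hidden activations, so the merge rule here (a least-squares fit of the dropped neuron onto the kept one, folded into the outgoing weights of a single output unit) and the helper names `pearson`, `most_correlated_pair`, and `merge_neuron` are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length activation lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def most_correlated_pair(activations):
    """Return (|r|, i, j) for the most correlated pair of neurons.

    activations[k] holds neuron k's outputs over the same batch of inputs.
    """
    best = None
    for i in range(len(activations)):
        for j in range(i + 1, len(activations)):
            r = abs(pearson(activations[i], activations[j]))
            if best is None or r > best[0]:
                best = (r, i, j)
    return best


def merge_neuron(activations, out_weights, out_bias, keep, drop):
    """Remove neuron `drop`, compensating through neuron `keep`.

    Fits h_drop ~ a * h_keep + b by least squares, then folds a and b
    into the outgoing weight of `keep` and the output bias.
    """
    x, y = activations[keep], activations[drop]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((p - mx) * (q - my) for p, q in zip(x, y)) / sum((p - mx) ** 2 for p in x)
    b = my - a * mx
    new_weights = list(out_weights)
    new_weights[keep] += a * new_weights[drop]
    new_bias = out_bias + b * new_weights[drop]
    del new_weights[drop]
    return new_weights, new_bias
```

When two neurons are perfectly linearly related, the merged layer reproduces the original output exactly; with weaker correlation the merge becomes an approximation, which is why a pruning loop would typically stop once accuracy starts to degrade.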