- May 25 2017 cs.LG arXiv:1705.08504v1 Interpretability has become an important issue as machine learning is increasingly used to inform consequential decisions. We propose an approach for interpreting a black-box model by extracting a decision tree that approximates the model. Our model extraction algorithm avoids overfitting by leveraging black-box model access to actively sample new training points. We prove that as the number of samples goes to infinity, the decision tree learned using our algorithm converges to the exact greedy decision tree. In our evaluation, we use our algorithm to interpret random forests and neural nets trained on several datasets from the UCI Machine Learning Repository, as well as control policies learned for three classical reinforcement learning problems. We show that our algorithm improves over a baseline based on CART on every problem instance. Furthermore, we show how an interpretation generated by our approach can be used to understand and debug these models.
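The extraction loop above can be sketched as follows — a minimal illustration (not the paper's code), using scikit-learn and assuming a random forest stands in for the black box:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stand-in black box: a random forest trained on the original data.
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Actively sample fresh inputs and query the black box for labels;
# overfitting to the original training set is mitigated because the
# surrogate's labels come from the model, not the finite data.
X_new = rng.normal(size=(5000, 4))
y_new = black_box.predict(X_new)

# Fit a small, interpretable decision tree to the model's own behaviour.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X_new, y_new)
fidelity = (surrogate.predict(X_new) == y_new).mean()
```

Fidelity here measures agreement with the black box rather than accuracy on ground truth, which is the relevant quantity for interpretation.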
- Selective classification techniques (also known as the reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNN prediction performance by trading off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, so as to guarantee the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9% at almost 60% test coverage.
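A hedged sketch of the selection idea — rank test points by a confidence score and pick the largest coverage whose empirical risk stays under the user's target. The paper's high-probability calibration bound is omitted, and the confidence scores here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
confidence = rng.uniform(size=n)
# Simulate a classifier whose errors are rarer at high confidence.
errors = rng.uniform(size=n) > confidence

order = np.argsort(-confidence)                 # most confident first
cum_risk = np.cumsum(errors[order]) / np.arange(1, n + 1)

target_risk = 0.2
ok = np.nonzero(cum_risk <= target_risk)[0]     # coverages meeting the target
coverage = (ok[-1] + 1) / n if ok.size else 0.0 # largest admissible coverage
```

Everything below the induced confidence threshold is rejected; the trade-off is that a stricter risk target yields lower coverage.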
- May 25 2017 cs.LG arXiv:1705.08499v1 We introduce the Prediction Advantage (PA), a novel performance measure for prediction functions under any loss function (e.g., classification or regression). The PA is defined as the performance advantage relative to the Bayesian risk restricted to knowing only the distribution of the labels. We derive the PA for well-known loss functions, including 0/1 loss, cross-entropy loss, absolute loss, and squared loss. In the latter case, the PA is identical to the well-known R-squared measure, widely used in statistics. The use of the PA ensures meaningful quantification of prediction performance, which is not guaranteed, for example, when dealing with noisy imbalanced classification problems. We argue that among several known alternative performance measures, PA is the best (and only) quantity ensuring meaningfulness for all noise and imbalance levels.
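For squared loss, the claimed identity with R-squared can be checked numerically — a small illustration with synthetic data, taking the Bayes risk under label-only knowledge to be the MSE of the label mean:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=200)
y_hat = y + rng.normal(scale=0.5, size=200)    # a reasonable predictor

mse_model = np.mean((y - y_hat) ** 2)          # risk of the predictor
mse_baseline = np.mean((y - y.mean()) ** 2)    # Bayes risk knowing only labels
pa = 1.0 - mse_model / mse_baseline            # PA = R^2 for squared loss
```

A perfect predictor gives PA = 1, while a predictor no better than the label-mean baseline gives PA = 0, which is exactly the behaviour of R-squared.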
- May 25 2017 cs.LG arXiv:1705.08498v1 Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs). This task is complicated by data sources that are noisy, sparse, and heterogeneous, and by outcomes that are imbalanced. In this paper, we integrate data from all available ICU sources (vitals, labs, notes, demographics) and focus on learning rich representations of this data to predict the onset and weaning of multiple invasive interventions. In particular, we compare both long short-term memory networks (LSTMs) and convolutional neural networks (CNNs) for prediction of five intervention tasks: invasive ventilation, non-invasive ventilation, vasopressors, colloid boluses, and crystalloid boluses. Our predictions are made in a forward-facing manner to enable "real-time" performance, with a six-hour gap time to support clinically actionable planning. We achieve state-of-the-art results on our predictive tasks using deep architectures. We explore the use of feature occlusion to interpret LSTM models, and compare this to the interpretability gained from examining inputs that maximally activate CNN outputs. We show that our models significantly outperform baselines in intervention prediction and provide insight into model learning, which is crucial for the adoption of such models in practice.
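The feature-occlusion probe mentioned above can be sketched generically — zero out one input feature at a time and measure how far predictions move. The `predict` function here is a hypothetical stand-in for a trained model, not the paper's LSTM:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
w = np.array([2.0, 0.0, 1.0, 0.0, 0.0])     # ground truth: features 0 and 2 matter

def predict(X):
    """Hypothetical stand-in for a trained model's probability output."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

base = predict(X)
importance = []
for j in range(X.shape[1]):
    X_occ = X.copy()
    X_occ[:, j] = 0.0                        # occlude feature j
    # Mean absolute shift in predictions is the occlusion importance.
    importance.append(np.mean(np.abs(predict(X_occ) - base)))
```

Features whose occlusion barely moves the output (here, the zero-weight ones) receive near-zero importance.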
- We study pool-based active learning with abstention feedbacks, where a labeler can abstain from labeling a queried example. We take a Bayesian approach to the problem and propose a general framework that learns both the target classification problem and the unknown abstention pattern at the same time. As specific instances of the framework, we develop two useful greedy algorithms with theoretical guarantees: they respectively achieve a $(1-\frac{1}{e})$-factor approximation of the optimal expected or worst-case value of a useful utility function. Our experiments show that the algorithms perform well in various practical scenarios.
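The $(1-\frac{1}{e})$ guarantee invoked above is the classic one for greedily maximizing a monotone submodular utility; a generic illustration on weighted coverage (the paper's actual utility functions for abstention-aware querying are more involved):

```python
def greedy_max_coverage(sets, weights, budget):
    """Greedily pick up to `budget` sets by marginal covered weight."""
    chosen, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0.0
        for i, s in enumerate(sets):
            if i in chosen:
                continue
            gain = sum(weights[e] for e in s - covered)  # marginal gain
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:          # no set adds anything: stop early
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, sum(weights[e] for e in covered)

sets = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
weights = {i: 1.0 for i in range(6)}
picked, value = greedy_max_coverage(sets, weights, budget=2)
```

Because coverage is monotone and submodular, this greedy value is guaranteed to be at least $(1-\frac{1}{e})$ of the optimum.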
- May 25 2017 cs.LG arXiv:1705.08480v1 Recurrent neural network architectures excel at processing sequences by modelling dependencies over different timescales. The recently introduced Recurrent Weighted Average (RWA) unit captures long-term dependencies far better than an LSTM on several challenging tasks. The RWA achieves this by applying attention to each input and computing a weighted average over the full history of its computations. Unfortunately, the RWA cannot change the attention it has assigned to previous timesteps, and so struggles with carrying out consecutive tasks or tasks with changing requirements. We present the Recurrent Discounted Attention (RDA) unit, which builds on the RWA by additionally allowing the discounting of the past. We empirically compare our model to RWA, LSTM and GRU units on several challenging tasks. On tasks with a single output the RWA, RDA and GRU units learn much quicker than the LSTM and with better performance. On the multiple sequence copy task our RDA unit learns the task three times as quickly as the LSTM or GRU units, while the RWA fails to learn at all. On the Wikipedia character prediction task the LSTM performs best but is followed closely by our RDA unit. Overall our RDA unit performs well and is sample efficient on a large variety of sequence tasks.
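The discounting mechanism can be sketched as a recurrence over a running attention-weighted numerator and denominator — an illustrative reading of the unit, where the weights are random placeholders rather than trained parameters and the exact gating may differ from the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h, T = 4, 8, 20
Wz, Wa, Wg = (rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(3))

num = np.zeros(d_h)   # running attention-weighted sum of values
den = np.zeros(d_h)   # running sum of attention weights
for t in range(T):
    x = rng.normal(size=d_in)
    z = np.tanh(Wz @ x)                  # candidate value
    a = np.exp(Wa @ x)                   # unnormalised attention weight
    g = 1.0 / (1.0 + np.exp(-(Wg @ x)))  # discount gate in (0, 1)
    num = g * num + a * z                # discount the past, add the present
    den = g * den + a
    h = np.tanh(num / den)               # hidden state
```

Setting `g = 1` everywhere recovers the RWA's undiscounted average; values below 1 let the unit forget attention it assigned to earlier timesteps.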
- May 25 2017 cs.CV arXiv:1705.08479v1 This paper introduces a new architectural framework, known as input fast-forwarding, that can enhance the performance of deep networks. The main idea is to incorporate a parallel path that sends representations of input values forward to deeper network layers. This scheme is substantially different from "deep supervision" in which the loss layer is re-introduced to earlier layers. The parallel path provided by fast-forwarding enhances the training process in two ways. First, it enables the individual layers to combine higher-level information (from the standard processing path) with lower-level information (from the fast-forward path). Second, this new architecture reduces the problem of vanishing gradients substantially because the fast-forwarding path provides a shorter route for gradient backpropagation. In order to evaluate the utility of the proposed technique, a Fast-Forward Network (FFNet), with 20 convolutional layers along with parallel fast-forward paths, has been created and tested. The paper presents empirical results that demonstrate improved learning capacity of FFNet due to fast-forwarding, as compared to GoogLeNet (with deep supervision) and CaffeNet, which are 4x and 18x larger in size, respectively. All of the source code and deep learning models described in this paper will be made available to the entire research community.
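The parallel-path idea can be illustrated with dense stand-ins for convolutional layers; the shapes and fusion-by-concatenation below are illustrative assumptions, not the FFNet architecture:

```python
import numpy as np

def conv_like(x, w):
    """Stand-in for a conv layer: a linear map followed by ReLU."""
    return np.maximum(0.0, x @ w)

rng = np.random.default_rng(4)
x = rng.normal(size=(2, 16))                 # batch of raw inputs

w1, w2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 32))
w_ff = rng.normal(size=(16, 8))              # cheap fast-forward projection

deep = conv_like(conv_like(x, w1), w2)       # standard processing path
fast = x @ w_ff                              # parallel fast-forward path
fused = np.concatenate([deep, fast], axis=1) # deeper layer sees both
```

The fast path is a single linear map from the input, so gradients reaching `fused` flow back to `x` through one layer rather than the whole deep stack.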
- We study transient behaviour in the dynamics of complex systems described by a set of non-linear ODEs. We discuss the destabilizing nature of transient trajectories and their connection with the eigenvalue-based linearization procedure. The complexity is realized as a random matrix drawn from a modified May-Wigner model. Based on the initial response of the system, we identify a novel stable-transient regime. We calculate exact abundances of typical and extreme transient trajectories, finding both Gaussian and Tracy-Widom distributions known from extreme value statistics. We identify the degrees of freedom driving transient behaviour as connected to the eigenvectors and encoded in a non-orthogonality matrix $T_0$. We accordingly extend the May-Wigner model to contain a phase in which typical transient trajectories are present. An exact norm of the trajectory is obtained in the vanishing $T_0$ limit, where it describes a normal matrix.
- May 25 2017 cs.CY arXiv:1705.08514v1 Future health ecosystems demand the integration of emerging data technology with an increased focus on preventive medicine. Cybernetics extracts the full potential of data to serve the spectrum of health care, from acute to chronic problems. Building actionable cybernetic navigation tools can greatly empower optimal health decisions, especially by quantifying lifestyle and environmental data. This data-to-decisions transformation is powered by intuitive event analysis offering the best semantic abstraction of dynamic living systems. Achieving the goal of preventive health systems in the cybernetic model occurs through the flow of several components. From personalized models we can predict health status using perpetual sensing and data streams. Given these predictions, we provide precise recommendations tailored to the individual. To enact these recommendations, we use persuasive technology to deliver and execute targeted interventions.
- Asynchronous-parallel algorithms have the potential to vastly speed up computation by eliminating costly synchronization. However, our understanding of these algorithms is limited because current convergence results for asynchronous (block) coordinate descent algorithms are based on somewhat unrealistic assumptions. In particular, the age of the shared optimization variables being used to update a block is assumed to be independent of the block being updated. Also, it is assumed that the updates are applied to randomly chosen blocks. In this paper, we argue that these assumptions either fail to hold or imply less efficient implementations. We then prove the convergence of asynchronous-parallel block coordinate descent under more realistic assumptions, in particular always without the independence assumption. The analysis permits both the deterministic (essentially) cyclic and random rules for block choices. Because a bound on the asynchronous delays may or may not be available, we establish convergence for both bounded and unbounded delays. The analysis also covers nonconvex, weakly convex, and strongly convex functions. We construct Lyapunov functions that directly model both objective progress and delays, so delays are not treated as errors or noise. A continuous-time ODE is provided to explain the construction at a high level.
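A toy simulation of the delayed reads the analysis must handle — each block update reads a stale copy of the shared iterate under a bounded random delay, with an (essentially) cyclic block rule. The step size and delay model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n_blocks, block_size = 4, 2
n = n_blocks * block_size
A = rng.normal(size=(n, n))
Q = A.T @ A + np.eye(n)                  # f(x) = 0.5 x^T Q x, strongly convex
x = rng.normal(size=n)                   # shared iterate
history = [x.copy()]                     # past iterates, for stale reads
step = 1.0 / np.linalg.eigvalsh(Q).max() # 1/L step size

for k in range(400):
    b = k % n_blocks                                # (essentially) cyclic rule
    delay = rng.integers(0, min(3, len(history)))   # bounded staleness
    stale = history[-1 - delay]                     # read an old iterate
    idx = slice(b * block_size, (b + 1) * block_size)
    grad_b = (Q @ stale)[idx]                       # block gradient at stale point
    x[idx] -= step * grad_b                         # write to shared iterate
    history.append(x.copy())
```

The minimizer is the origin, so the iterate norm shrinking toward zero despite the stale gradients is the behaviour the convergence theory formalizes.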
- May 25 2017 cs.CY arXiv:1705.08502v1 I point to a deep and unjustly ignored relation between culture and computation. I first establish interpretations of Piaget's and Vygotsky's theories of child development in the language of theoretical computer science. Using these interpretations, I argue that the two different possible routes to Piagetian disequilibrium -- a tendency to overaccommodate, and a tendency to overassimilate -- are equivalent to the two distinct cultural tendencies, collectivism and individualism. I argue that this simple characterization of overaccommodation versus overassimilation provides a satisfying explanation as to why the two cultural tendencies differ in the way they empirically do. All such notions are grounded in a firm mathematical framework for those who prefer the computable, and in my personal history for those who prefer the uncomputable.
- May 25 2017 cs.CY arXiv:1705.08506v1 This paper presents a human gait data collection for analysis and activity recognition, consisting of continuous recordings of combined activities, such as walking, running, taking stairs up and down, sitting down, and so on; the recorded data are segmented and annotated. Data were collected from a body sensor network consisting of six wearable inertial sensors (accelerometer and gyroscope) located on the right and left thighs, shins, and feet. Additionally, two electromyography sensors were used on the quadriceps (front thigh) to measure muscle activity. This database can be used not only for activity recognition but also for studying how activities are performed and how the parts of the legs move relative to each other. Therefore, the data can be used (a) to perform health-care-related studies, such as in walking rehabilitation or Parkinson's disease recognition, (b) in virtual reality and gaming for simulating humanoid motion, or (c) in humanoid robotics to model humanoid walking. This dataset is the first of its kind to provide data about human gait in such detail. The database is available free of charge at https://github.com/romanchereshnev/HuGaDB.
- Security surveillance is one of the most important issues in smart cities, especially in an era of terrorism. Deploying a number of (video) cameras is a common surveillance approach. Given the ubiquity of vehicles in metropolises, exploiting vehicle traffic to design camera placement strategies could potentially facilitate security surveillance. This article constitutes the first effort toward building the linkage between vehicle traffic and security surveillance, which is a critical problem for smart cities. We expect our study to influence decision making on surveillance camera placement, and to foster more research on principled approaches to security surveillance that benefit physical-world life.