results for au:Hsu_D in:cs

- Feb 20 2018 cs.LG arXiv:1802.06749v1 The $n \times d$ design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to observe only a small number $k \ll n$ of the responses, and then produce a weight vector whose sum of square loss over all points is at most $1+\epsilon$ times the minimum. A standard approach to this problem is to use i.i.d. leverage score sampling, but this approach is known to perform poorly when $k$ is small (e.g., $k = d$); in such cases, it is dominated by volume sampling, a joint sampling method that explicitly promotes diversity. How these methods compare for larger $k$ was not previously understood. We prove that volume sampling can have poor behavior for large $k$ - indeed worse than leverage score sampling. We also show how to repair volume sampling using a new padding technique. We prove that padded volume sampling has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability. The main technical challenge is proving tail bounds for the sums of dependent random matrices that arise from volume sampling.
- Feb 20 2018 cs.AI arXiv:1802.06215v1 Planning under uncertainty is critical for robust robot performance in uncertain, dynamic environments, but it incurs high computational cost. State-of-the-art online search algorithms, such as DESPOT, have vastly improved the computational efficiency of planning under uncertainty and made it a valuable tool for robotics in practice. This work takes one step further by leveraging both CPU and GPU parallelization in order to achieve near real-time online planning performance for complex tasks with large state, action, and observation spaces. Specifically, we propose Hybrid Parallel DESPOT (HyP-DESPOT), a massively parallel online planning algorithm that integrates CPU and GPU parallelism in a multi-level scheme. It performs parallel DESPOT tree search by simultaneously traversing multiple independent paths using multi-core CPUs and performs parallel Monte-Carlo simulations at the leaf nodes of the search tree using GPUs. Experimental results show that HyP-DESPOT speeds up online planning by up to several hundred times, compared with the original DESPOT algorithm, in several challenging robotic tasks in simulation.
- Adversarial examples in machine learning have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best-effort, heuristic approaches that have all been shown to be vulnerable to sophisticated attacks. More recently, rigorous defenses that provide formal guarantees have emerged, but are hard to scale or generalize. A rigorous and general foundation for designing defenses is required to get us off this arms race trajectory. We propose leveraging differential privacy (DP) as a formal building block for robustness against adversarial examples. We observe that the semantics of DP is closely aligned with the formal definition of robustness to adversarial examples. We propose PixelDP, a strategy for learning robust deep neural networks based on formal DP guarantees. PixelDP networks give theoretical guarantees for a subset of their predictions regarding the robustness against adversarial perturbations of bounded size. Our evaluation with MNIST, CIFAR-10, and CIFAR-100 shows that PixelDP networks achieve accuracy under attack on par with the best-performing defense to date, but additionally certify robustness against meaningful-size 1-norm and 2-norm attacks for 40-60% of their predictions. Our experience points to DP as a rigorous, broadly applicable, and mechanism-rich foundation for robust machine learning.
- Trust is essential for human-robot collaboration and user adoption of autonomous systems, such as robot assistants. This paper introduces a computational model which integrates trust into robot decision-making. Specifically, we learn from data a partially observable Markov decision process (POMDP) with human trust as a latent variable. The trust-POMDP model provides a principled approach for the robot to (i) infer the trust of a human teammate through interaction, (ii) reason about the effect of its own actions on human behaviors, and (iii) choose actions that maximize team performance over the long term. We validated the model through human subject experiments on a table-clearing task in simulation (201 participants) and with a real robot (20 participants). The results show that the trust-POMDP improves human-robot team performance in this task. They further suggest that maximizing trust in itself may not improve team performance.
- Oct 17 2017 cs.AI arXiv:1710.05627v2 How can a delivery robot navigate reliably to a destination in a new office building, with minimal prior information? To tackle this challenge, this paper introduces a two-level hierarchical approach, which integrates model-free deep learning and model-based path planning. At the low level, a neural-network motion controller, called the intention-net, is trained end-to-end to provide robust local navigation. The intention-net maps images from a single monocular camera and "intentions" directly to robot controls. At the high level, a path planner uses a crude map, e.g., a 2-D floor plan, to compute a path from the robot's current location to the goal. The planned path provides intentions to the intention-net. Preliminary experiments suggest that the learned motion controller is robust against perceptual uncertainty and by integrating with a path planner, it generalizes effectively to new environments and goals.
- The spectral gap $\gamma$ of a finite, ergodic, and reversible Markov chain is an important parameter measuring the asymptotic rate of convergence. In applications, the transition matrix $P$ may be unknown, yet one sample of the chain up to a fixed time $n$ may be observed. We consider here the problem of estimating $\gamma$ from this data. Let $\pi$ be the stationary distribution of $P$, and $\pi_\star = \min_x \pi(x)$. We show that if $n = \tilde{O}\bigl(\frac{1}{\gamma \pi_\star}\bigr)$, then $\gamma$ can be estimated to within multiplicative constants with high probability. When $\pi$ is uniform on $d$ states, this matches (up to logarithmic correction) a lower bound of $\tilde{\Omega}\bigl(\frac{d}{\gamma}\bigr)$ steps required for precise estimation of $\gamma$. Moreover, we provide the first procedure for computing a fully data-dependent interval, from a single finite-length trajectory of the chain, that traps the mixing time $t_{\text{mix}}$ of the chain at a prescribed confidence level. The interval does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}} = 1/\gamma$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $1/\sqrt{n}$ rate, where $n$ is the length of the sample path.
- In this paper, we use a variational recurrent neural network to investigate the anomaly detection problem on graph time series. The temporal correlation is modeled by the combination of a recurrent neural network (RNN) and variational inference (VI), while the spatial information is captured by a graph convolutional network. In order to incorporate external factors, we use a feature extractor to augment the transition of latent variables, which can learn the influence of external factors. Because the objective is the accumulated ELBO, the model extends easily to an online method. The experimental study on traffic flow data shows the detection capability of the proposed method.
- Jul 26 2017 cs.NE arXiv:1707.07961v2 Time series account for a large proportion of the data stored in financial, medical and scientific databases. The efficient storage of time series is important in practical applications. In this paper, we propose a novel compression scheme for time series. The encoder and decoder are both composed of recurrent neural networks (RNNs) such as long short-term memory (LSTM). There is an autoencoder between encoder and decoder, which encodes the hidden state and input together and decodes them at the decoder side. Moreover, we pre-process the original time series by partitioning it into segments with various lengths which have similar total variation. The experimental study shows that the proposed algorithm can achieve a competitive compression ratio on real-world time series.
- The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is semantic and spatial grounding, which is to infer objects and their spatial relationships from images and natural language expressions. We introduce a two-stage neural-network grounding pipeline that maps natural language referring expressions directly to objects in the images. The first stage uses visual descriptions in the referring expressions to generate a candidate set of relevant objects. The second stage examines all pairwise relationships between the candidates and predicts the most likely referred object according to the spatial descriptions in the referring expressions. A key feature of our system is that by leveraging a large dataset of images labeled with text descriptions, it allows unrestricted object types and natural language referring expressions. Preliminary results indicate that our system outperforms a near state-of-the-art object comprehension system on standard benchmark datasets. We also present a robot system that follows voice commands to pick and place previously unseen objects.
- We study the problem of compressed sensing with asymmetric prior information, as motivated by networking applications such as identity tracking in an Internet-of-Things environment. We focus on the scenario in which a resource-limited encoder needs to report a small subset $S$ from a universe of $N$ objects to a more powerful decoder. The distinguishing feature of our model is asymmetry: the subset $S$ is an i.i.d. sample from a prior distribution $\mu$, and $\mu$ is only known to the decoder. This scenario implies that the encoder must use an oblivious compression scheme which can, nonetheless, achieve communication comparable to the entropy rate $|S| \cdot H(\mu)$, the standard benchmark when both encoder and decoder have access to the prior $\mu$ (as achieved by the Huffman code). We first show that in order to exploit the prior $\mu$ in a non-trivial way in such asymmetric information scenario, the compression scheme must be randomized. This stands in contrast to the symmetric case (when both the encoder and decoder know $\mu$), where the Huffman code provides a near-optimal deterministic solution. On the other hand, a rather simple argument shows that, when $|S|=k$, a random linear code achieves essentially optimal communication rate of $O(k\cdot H(\mu))$ bits, nearly-matching the communication benchmark in the symmetric case. Alas, the resulting scheme has prohibitive decoding time: about ${N\choose k} \approx (N/k)^k$. Our main result is a computationally efficient and linear coding scheme, which achieves an $O(\lg\lg N)$-competitive communication ratio compared to the optimal benchmark, and runs in $\text{poly}(N,k)$ time. Our "multi-level" coding scheme uses a combination of hashing and syndrome-decoding of Reed-Solomon codes, and relies on viewing the (unknown) prior $\mu$ as a rather small convex combination of uniform ("flat") distributions.
- Jul 11 2017 cs.RO arXiv:1707.02586v2 A robot operating in isolation needs to reason over the uncertainty in its model of the world and adapt its own actions to account for this uncertainty. Similarly, a robot interacting with people needs to reason over its uncertainty over the human internal state, as well as over how this state may change, as humans adapt to the robot. This paper summarizes our own work in this area, which depicts the different ways that probabilistic planning and game-theoretic algorithms can enable such reasoning in robotic systems that collaborate with people. We start with a general formulation of the problem as a two-player game with incomplete information. We then articulate the different assumptions within this general formulation, and we explain how these lead to exciting and diverse robot behaviors in real-time interactions with actual human subjects, in a variety of manufacturing, personal robotics and assistive care settings.
- Jul 04 2017 cs.NE arXiv:1707.00666v2 In this paper, we use a recurrent autoencoder model to predict time series one or multiple steps ahead. Previous prediction methods, such as recurrent neural network (RNN) and deep belief network (DBN) models, cannot learn long-term dependencies, and the conventional long short-term memory (LSTM) model does not remember recent inputs. Combining LSTM and autoencoder (AE), the proposed model can capture long-term dependencies across data points while using features extracted from recent observations to augment the LSTM. Based on comprehensive experiments, we show that the proposed method significantly improves on state-of-the-art performance on a chaotic time series benchmark and also performs better on real-world data. Both single-output and multiple-output predictions are investigated.
- Jun 06 2017 cs.NA arXiv:1706.01169v1 Finding the symmetric and orthogonal decomposition (SOD) of a tensor is a recurring problem in signal processing, machine learning and statistics. In this paper, we review, establish and compare the perturbation bounds for two natural types of incremental rank-one approximation approaches. Numerical experiments and open questions are also presented and discussed.
- This work studies the parameter identification problem for the Markov chain choice model of Blanchet, Gallego, and Goyal used in assortment planning. In this model, the product selected by a customer is determined by a Markov chain over the products, where the products in the offered assortment are absorbing states. The underlying parameters of the model were previously shown to be identifiable from the choice probabilities for the all-products assortment, together with choice probabilities for assortments of all-but-one products. Obtaining and estimating choice probabilities for such large assortments is not desirable in many settings. The main result of this work is that the parameters may be identified from assortments of sizes two and three, regardless of the total number of products. The result is obtained via a simple and efficient parameter recovery algorithm.
- May 31 2017 cs.NA arXiv:1705.10404v1 Many idealized problems in signal processing, machine learning and statistics can be reduced to the problem of finding the symmetric canonical decomposition of an underlying symmetric and orthogonally decomposable (SOD) tensor. Drawing inspiration from the matrix case, the successive rank-one approximations (SROA) scheme has been proposed and shown to yield this tensor decomposition exactly, and a plethora of numerical methods have thus been developed for the tensor rank-one approximation problem. In practice, however, the inevitable errors (say) from estimation, computation, and modeling necessitate that the input tensor can only be assumed to be a nearly SOD tensor---i.e., a symmetric tensor slightly perturbed from the underlying SOD tensor. This article shows that even in the presence of perturbation, SROA can still robustly recover the symmetric canonical decomposition of the underlying tensor. It is shown that when the perturbation error is small enough, the approximation errors do not accumulate with the iteration number. Numerical results are presented to support the theoretical findings.
- This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d. draws from a standard multivariate normal distribution, an efficient algorithm based on lattice basis reduction is shown to exactly recover the unknown linear function in arbitrary dimension. Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.
- This paper introduces the QMDP-net, a neural network architecture for planning under partial observability. The QMDP-net combines the strengths of model-free learning and model-based planning. It is a recurrent policy network, but it represents a policy for a parameterized set of tasks by connecting a model with a planning algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture. The QMDP-net is fully differentiable and allows for end-to-end training. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set and "transfer" to other similar tasks beyond the set. In preliminary experiments, QMDP-net showed strong performance on several robotic tasks in simulation. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.
- Jan 30 2017 cs.RO arXiv:1701.07851v1 Shared autonomy integrates user input with robot autonomy in order to control a robot and help the user to complete a task. Our work aims to improve the performance of such a human-robot team: the robot tries to guide the human towards an effective strategy, sometimes against the human's own preference, while still retaining their trust. We achieve this through a principled human-robot mutual adaptation formalism. We integrate a bounded-memory adaptation model of the human into a partially observable stochastic decision model, which enables the robot to adapt to an adaptable human. When the human is adaptable, the robot guides the human towards a good strategy, possibly unknown to the human in advance. When the human is stubborn and not adaptable, the robot complies with the human's preference in order to retain their trust. In the shared autonomy setting, unlike many other common human-robot collaboration settings, only the robot actions can change the physical state of the world, and the human and robot goals are not fully observable. We address these challenges and show in a human subject experiment that the proposed mutual adaptation formalism improves human-robot team performance, while retaining a high level of user trust in the robot, compared to the common approach of having the robot strictly follow participants' preferences.
- We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. (2013) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.
- Scarce data is a major challenge to scaling robot learning to truly complex tasks, as we need to generalize locally learned policies over different "contexts". Bayesian optimization approaches to contextual policy search (CPS) offer data-efficient policy learning that generalizes over a context space. We propose to improve data efficiency by factoring typically considered contexts into two components: target-type contexts that correspond to a desired outcome of the learned behavior, e.g. target position for throwing a ball; and environment-type contexts that correspond to some state of the environment, e.g. initial ball position or wind speed. Our key observation is that experience can be directly generalized over target-type contexts. Based on that we introduce Factored Contextual Policy Search with Bayesian Optimization for both passive and active learning settings. Preliminary results show faster policy generalization on a simulated toy problem.
- Sep 13 2016 cs.AI arXiv:1609.03250v3 The partially observable Markov decision process (POMDP) provides a principled general framework for planning under uncertainty, but solving POMDPs optimally is computationally intractable, due to the "curse of dimensionality" and the "curse of history". To overcome these challenges, we introduce the Determinized Sparse Partially Observable Tree (DESPOT), a sparse approximation of the standard belief tree, for online planning under uncertainty. A DESPOT focuses online planning on a set of randomly sampled scenarios and compactly captures the "execution" of all policies under these scenarios. We show that the best policy obtained from a DESPOT is near-optimal, with a regret bound that depends on the representation size of the optimal policy. Leveraging this result, we give an anytime online planning algorithm, which searches a DESPOT for a policy that optimizes a regularized objective function. Regularization balances the estimated value of a policy under the sampled scenarios and the policy size, thus avoiding overfitting. The algorithm demonstrates strong experimental results, compared with some of the best online POMDP algorithms available. It has also been incorporated into an autonomous driving system for real-time vehicle control. The source code for the algorithm is available online.
- Expectation Maximization (EM) is among the most popular algorithms for estimating parameters of statistical models. However, EM, which is an iterative algorithm based on the maximum likelihood principle, is generally only guaranteed to find stationary points of the likelihood objective, and these points may be far from any maximizer. This article addresses this disconnect between the statistical principles behind EM and its algorithmic properties. Specifically, it provides a global analysis of EM for specific models in which the observations comprise an i.i.d. sample from a mixture of two Gaussians. This is achieved by (i) studying the sequence of parameters from idealized execution of EM in the infinite sample limit, and fully characterizing the limit points of the sequence in terms of the initial parameters; and then (ii) based on this convergence analysis, establishing statistical consistency (or lack thereof) for the actual sequence of parameters produced by EM.
- This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases clustering cost. In the case of $k$-medians and $k$-means, the key results are as follows. $\bullet$ When the method considers all data points as candidate centers, then selecting $\mathcal{O}(k\log(1/\varepsilon))$ centers achieves cost at most $2+\varepsilon$ times the optimal cost with $k$ centers. $\bullet$ Alternatively, the same guarantees hold if each round samples $\mathcal{O}(k/\varepsilon^5)$ candidate centers proportionally to their cluster cost (as with $\texttt{kmeans++}$, but holding centers fixed). $\bullet$ In the case of $k$-means, considering an augmented set of $n^{\lceil1/\varepsilon\rceil}$ candidate centers gives $1+\varepsilon$ approximation with $\mathcal{O}(k\log(1/\varepsilon))$ centers, the entire algorithm taking $\mathcal{O}(dk\log(1/\varepsilon)n^{1+\lceil1/\varepsilon\rceil})$ time, where $n$ is the number of data points in $\mathbb{R}^d$. $\bullet$ In the case of Euclidean $k$-medians, generating a candidate set via $n^{\mathcal{O}(1/\varepsilon^2)}$ executions of stochastic gradient descent with adaptively determined constraint sets will once again give approximation $1+\varepsilon$ with $\mathcal{O}(k\log(1/\varepsilon))$ centers in $dk\log(1/\varepsilon)n^{\mathcal{O}(1/\varepsilon^2)}$ time. Ancillary results include: guarantees for cluster costs based on powers of metrics; a brief, favorable empirical evaluation against $\texttt{kmeans++}$; data-dependent bounds allowing $1+\varepsilon$ in the first two bullets above, for example with $k$-medians over finite metric spaces.
- May 03 2016 cs.RO arXiv:1605.00169v1 This work addresses the challenge of a robot using real-time feedback from contact sensors to reliably manipulate a movable object on a cluttered tabletop. We formulate contact manipulation as a partially observable Markov decision process (POMDP) in the joint space of robot configurations and object poses. The POMDP formulation enables the robot to actively gather information and reduce the uncertainty on the object pose. Further, it incorporates all major constraints for robot manipulation: kinematic reachability, self-collision, and collision with obstacles. To solve the POMDP, we apply DESPOT, a state-of-the-art online POMDP algorithm. Our approach leverages two key ideas for computational efficiency. First, it performs lazy construction of a configuration-space lattice by interleaving construction of the lattice and online POMDP planning. Second, it combines online and offline POMDP planning by solving a relaxed POMDP offline and using the solution to guide the online search algorithm. We evaluated the proposed approach on a seven degree-of-freedom robot arm in simulation environments. It significantly outperforms several existing algorithms, including some commonly used heuristics for contact manipulation under uncertainty.
- Mar 22 2016 cs.RO arXiv:1603.06390v1 Object handover is a basic, but essential capability for robots interacting with humans in many applications, e.g., caring for the elderly and assisting workers in manufacturing workshops. It appears deceptively simple, as humans perform object handover almost flawlessly. The success of humans, however, belies the complexity of object handover as collaborative physical interaction between two agents with limited communication. This paper presents a learning algorithm for dynamic object handover, for example, when a robot hands over water bottles to marathon runners passing by the water station. We formulate the problem as contextual policy search, in which the robot learns object handover by interacting with the human. A key challenge here is to learn the latent reward of the handover task under noisy human feedback. Preliminary experiments show that the robot learns to hand over a water bottle naturally and that it adapts to the dynamics of human motion. One challenge for the future is to combine the model-free learning algorithm with a model-based planning approach and enable the robot to adapt over human preferences and object characteristics, such as shape, weight, and surface texture.
- We investigate active learning with access to two distinct oracles: Label (which is standard) and Search (which is not). The Search oracle models the situation where a human searches a database to seed or counterexample an existing solution. Search is stronger than Label while being natural to implement in many situations. We show that an algorithm using both oracles can provide exponentially large problem-dependent improvements over Label alone.
- Feb 17 2016 cs.AI arXiv:1602.04875v3 The partially observable Markov decision process (POMDP) provides a principled general model for planning under uncertainty. However, solving a general POMDP is computationally intractable in the worst case. This paper introduces POMDP-lite, a subclass of POMDPs in which the hidden state variables are constant or only change deterministically. We show that a POMDP-lite is equivalent to a set of fully observable Markov decision processes indexed by a hidden parameter and is useful for modeling a variety of interesting robotic tasks. We develop a simple model-based Bayesian reinforcement learning algorithm to solve POMDP-lite models. The algorithm performs well on large-scale POMDP-lite models with up to $10^{20}$ states and outperforms the state-of-the-art general-purpose POMDP algorithms. We further show that the algorithm is near-Bayesian-optimal under suitable conditions.
- Oct 09 2015 cs.CY arXiv:1510.02377v3 In a world where traditional notions of privacy are increasingly challenged by the myriad companies that collect and analyze our data, it is important that decision-making entities are held accountable for unfair treatments arising from irresponsible data usage. Unfortunately, a lack of appropriate methodologies and tools means that even identifying unfair or discriminatory effects can be a challenge in practice. We introduce the unwarranted associations (UA) framework, a principled methodology for the discovery of unfair, discriminatory, or offensive user treatment in data-driven applications. The UA framework unifies and rationalizes a number of prior attempts at formalizing algorithmic fairness. It uniquely combines multiple investigative primitives and fairness metrics with broad applicability, granular exploration of unfair treatment in user subgroups, and incorporation of natural notions of utility that may account for observed disparities. We instantiate the UA framework in FairTest, the first comprehensive tool that helps developers check data-driven applications for unfair user treatment. It enables scalable and statistically rigorous investigation of associations between application outcomes (such as prices or premiums) and sensitive user attributes (such as race or gender). Furthermore, FairTest provides debugging capabilities that let programmers rule out potential confounders for observed unfair effects. We report on use of FairTest to investigate and in some cases address disparate impact, offensive labeling, and uneven rates of algorithmic error in four data-driven applications. As examples, our results reveal subtle biases against older populations in the distribution of error in a predictive health application and offensive racial labeling in an image tagger.
- We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It provably works for any classifier representation and classification problem including those with severe noise. 2) It is efficiently implementable with an ERM oracle. 3) It is more aggressive than all previous approaches satisfying 1 and 2. To do this we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.
- This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time $t_{\text{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $\Omega(t_{\text{relax}})$ times on the average. Finally, future directions of research are identified.
- Oct 17 2014 cs.DS arXiv:1410.4429v1 This note gives a simple analysis of the randomized approximation scheme for matrix multiplication of Drineas et al. (2006) with a particular sampling distribution over outer products. The result follows from a matrix version of Bernstein's inequality. To approximate the matrix product $AB^\top$ to spectral norm error $\varepsilon\|A\|\|B\|$, it suffices to sample on the order of $(\mathrm{sr}(A) \vee \mathrm{sr}(B)) \log(\mathrm{sr}(A) \wedge \mathrm{sr}(B)) / \varepsilon^2$ outer products, where $\mathrm{sr}(M)$ is the stable rank of a matrix $M$.
- Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.
- A basic problem in the design of privacy-preserving algorithms is the private maximization problem: the goal is to pick an item from a universe that (approximately) maximizes a data-dependent function, all under the constraint of differential privacy. This problem has been used as a sub-routine in many privacy-preserving algorithms for statistics and machine-learning. Previous algorithms for this problem are either range-dependent---i.e., their utility diminishes with the size of the universe---or only apply to very restricted function classes. This work provides the first general-purpose, range-independent algorithm for private maximization that guarantees approximate differential privacy. Its applicability is demonstrated on two fundamental tasks in data mining and machine learning.
- We present an approximate algorithm for matrix multiplication based on matrix sketching techniques. First, one of the matrices is chosen and sparsified using the online matrix sketching algorithm, and then the matrix product is calculated using the sparsified matrix. We prove that when the number of samples grows large compared to the sample dimension, the proposed algorithm achieves an accuracy bound similar to that of state-of-the-art algorithms at a smaller computational cost.
- We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{KT/\log N})$ oracle calls across all $T$ rounds, where $N$ is the number of policies in the policy class we compete against. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We further conduct a proof-of-concept experiment which demonstrates the excellent computational and prediction performance of (an online variant of) our algorithm relative to several baselines.
- Current music recommender systems typically act in a greedy fashion by recommending songs with the highest user ratings. Greedy recommendation, however, is suboptimal over the long term: it does not actively gather information on user preferences and fails to recommend novel songs that are potentially interesting. A successful recommender system must balance the needs to explore user preferences and to exploit this information for recommendation. This paper presents a new approach to music recommendation by formulating this exploration-exploitation trade-off as a reinforcement learning task called the multi-armed bandit. To learn user preferences, it uses a Bayesian model, which accounts for both audio content and the novelty of recommendations. A piecewise-linear approximation to the model and a variational inference algorithm are employed to speed up Bayesian inference. One additional benefit of our approach is a single unified model for both music recommendation and playlist generation. Both simulation results and a user study indicate strong potential for the new approach.
- Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allows for overcomplete models, and requires the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.
- This work studies applications and generalizations of a simple estimation technique that provides exponential concentration under heavy-tailed distributions, assuming only bounded low-order moments. We show that the technique can be used for approximate minimization of smooth and strongly convex losses, and specifically for least squares linear regression. For instance, our $d$-dimensional estimator requires just $\tilde{O}(d\log(1/\delta))$ random samples to obtain a constant factor approximation to the optimal least squares loss with probability $1-\delta$, without requiring the covariates or noise to be bounded or subgaussian. We provide further applications to sparse linear regression and low-rank covariance matrix estimation with similar allowances on the noise and covariate distributions. The core technique is a generalization of the median-of-means estimator to arbitrary metric spaces.
- Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed the mixed membership Dirichlet model, first introduced by Airoldi et al. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on a low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.
- We consider the problem of learning a non-negative linear classifier with a $1$-norm of at most $k$, and a fixed threshold, under the hinge-loss. This problem generalizes the problem of learning a $k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide an efficient online learning algorithm that achieves the optimal rate, and show that in the batch case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the uniform convergence rate, which grows with $k^2$.
- This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors.
- This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
- Unsupervised estimation of latent variable models is a fundamental problem central to numerous applications of machine learning and statistics. This work presents a principled approach for estimating broad classes of such models, including probabilistic topic models and latent linear Bayesian networks, using only second-order observed moments. The sufficient conditions for identifiability of these models are primarily based on weak expansion constraints on the topic-word matrix, for topic models, and on the directed acyclic graph, for Bayesian networks. Because no assumptions are made on the distribution among the latent variables, the approach can handle arbitrary correlations among the topics or latent factors. In addition, a tractable learning method via $\ell_1$ optimization is proposed and studied in numerical experiments.
- X in R^D has mean zero and finite second moments. We show that there is a precise sense in which almost all linear projections of X into R^d (for d < D) look like a scale-mixture of spherical Gaussians -- specifically, a mixture of distributions N(0, sigma^2 I_d) where the weight of the particular sigma component is P (| X |^2 = sigma^2 D). The extent of this effect depends upon the ratio of d to D, and upon a particular coefficient of eccentricity of X's distribution. We explore this result in a variety of experiments.
- Differential privacy is a cryptographically-motivated definition of privacy which has gained significant attention over the past few years. Differentially private solutions enforce privacy by adding random noise to a function computed over the data, and the challenge in designing such algorithms is to control the added noise in order to optimize the privacy-accuracy-sample size tradeoff. This work studies differentially-private statistical estimation, and shows upper and lower bounds on the convergence rates of differentially private approximations to statistical estimators. Our results reveal a formal connection between differential privacy and the notion of Gross Error Sensitivity (GES) in robust statistics, by showing that the convergence rate of any differentially private approximation to an estimator that is accurate over a large class of distributions has to grow with the GES of the estimator. We then provide an upper bound on the convergence rate of a differentially private approximation to an estimator with bounded range and bounded GES. We show that the bounded range condition is necessary if we wish to ensure a strict form of differential privacy.
- Bayesian reinforcement learning (BRL) encodes prior knowledge of the world in a model and represents uncertainty in model parameters by maintaining a probability distribution over them. This paper presents Monte Carlo BRL (MC-BRL), a simple and general approach to BRL. MC-BRL samples a priori a finite set of hypotheses for the model parameter values and forms a discrete partially observable Markov decision process (POMDP) whose state space is a cross product of the state space for the reinforcement learning task and the sampled model parameter space. The POMDP does not require conjugate distributions for belief representation, as earlier works do, and can be solved relatively easily with point-based approximation algorithms. MC-BRL naturally handles both fully and partially observable worlds. Theoretical and experimental results show that the discrete POMDP approximates the underlying BRL task well with guaranteed performance.
- Jun 27 2012 cs.AI arXiv:1206.5928v1 We apply decision theoretic techniques to construct non-player characters that are able to assist a human player in collaborative games. The method is based on solving Markov decision processes, which can be difficult when the game state is described by many variables. To scale to more complex games, the method allows decomposition of a game task into subtasks, each of which can be modelled by a Markov decision process. Intention recognition is used to infer the subtask that the human is currently performing, allowing the helper to assist the human in performing the correct task. Experiments show that the method can be effective, giving near-human level performance in helping a human in a collaborative game.
- This work provides a computationally efficient and statistically consistent moment-based estimator for mixtures of spherical Gaussians. Under the condition that component means are in general position, a simple spectral decomposition technique yields consistent parameter estimates from low-order observable moments, without additional minimum separation assumptions needed by previous computationally efficient estimation procedures. Thus computational and information-theoretic barriers to efficient estimation in mixture models are precluded when the mixture components have means in general position and spherical covariances. Some connections are made to estimation problems related to independent component analysis.
- This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models.
- The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on $k\times k$ matrices, where $k$ is the number of latent factors (e.g. the number of topics), rather than in the $d$-dimensional observed space (typically $d \gg k$).
- We study the tracking problem, namely, estimating the hidden state of an object over time, from unreliable and noisy measurements. The standard framework for the tracking problem is the generative framework, which is the basis of solutions such as the Bayesian algorithm and its approximation, the particle filters. However, these solutions can be very sensitive to model mismatches. In this paper, motivated by online learning, we introduce a new framework for tracking. We provide an efficient tracking algorithm for this framework. We provide experimental results comparing our algorithm to the Bayesian algorithm on simulated data. Our experiments show that when there are slight model mismatches, our algorithm outperforms the Bayesian algorithm.
- Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.
- We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable corresponding to the mixture components is hidden and each mixture component over the observed variables can have a potentially different Markov graph structure and parameters. We propose a novel approach for estimating the mixture components, and our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. Our method is efficient when the union graph, which is the union of the Markov graphs of the mixture components, has sparse vertex separators between any pair of observed variables. This includes tree mixtures and mixtures of bounded degree graphs. For such models, we prove that our method correctly recovers the union graph structure and the tree structures corresponding to maximum-likelihood tree approximations of the mixture components. The sample and computational complexities of our method scale as $\mathrm{poly}(p, r)$, for an $r$-component mixture of $p$-variate graphical models. We further extend our results to the case when the union graph has sparse local separators between any pair of observed variables, such as mixtures of locally tree-like graphs, and the mixture components are in the regime of correlation decay.
- We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries.
- This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $\mathcal{X}$ under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value $f(x)$ at any query point $x \in \mathcal{X}$. The quantity of interest is the regret of the algorithm, which is the sum of the function values at the algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs $\tilde{O}(\mathrm{poly}(d)\sqrt{T})$ regret. Since any algorithm has regret at least $\Omega(\sqrt{T})$ on this problem, our algorithm is optimal in terms of the scaling with $T$.
- This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.
- This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the "out-of-sample" prediction error, as opposed to the "in-sample" (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which is present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.
- We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
- We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.
- Mar 23 2011 cs.LG arXiv:1103.4204v1 In this work we study parallelization of online learning, a core primitive in machine learning. In a parallel environment all known approaches for parallel online learning lead to delayed updates, where the model is updated using out-of-date information. In the worst case, or when examples are temporally correlated, delay can have a very adverse effect on the learning algorithm. Here, we analyze and present preliminary empirical results on a set of learning architectures based on a feature sharding approach that present various tradeoffs between delay, degree of parallelism, representation power and empirical performance.
- Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix (outliers), and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions under which recovering such a decomposition is possible via a combination of $\ell_1$ norm and trace norm minimization. We are specifically interested in the question of how many outliers are allowed so that convex programming can still achieve accurate recovery, and we obtain stronger recovery guarantees than previous studies. Moreover, we do not assume that the spatial pattern of outliers is random, which stands in contrast to related analyses under such assumptions via matrix completion.
- Jun 15 2010 cs.LG arXiv:1006.2588v1 We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.
- We study the tracking problem, namely, estimating the hidden state of an object over time, from unreliable and noisy measurements. The standard framework for the tracking problem is the generative framework, which is the basis of solutions such as the Bayesian algorithm and its approximation, the particle filters. However, the problem with these solutions is that they are very sensitive to model mismatches. In this paper, motivated by online learning, we introduce a new framework -- an explanatory framework -- for tracking. We provide an efficient tracking algorithm for this framework. We provide experimental results comparing our algorithm to the Bayesian algorithm on simulated data. Our experiments show that when there are slight model mismatches, our algorithm vastly outperforms the Bayesian algorithm.
- We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret.
- Feb 10 2009 cs.LG arXiv:0902.1284v2 We consider multi-label prediction problems with large output spaces under the assumption of output sparsity -- that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting.
- Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.
- We present a new online learning algorithm for cumulative discounted gain. This learning algorithm does not use exponential weights on the experts. Instead, it uses a weighting scheme that depends on the regret of the master algorithm relative to the experts. In particular, experts whose discounted cumulative gain is smaller (worse) than that of the master algorithm receive zero weight. We also sketch how a regret-based algorithm can be used as an alternative to Bayesian averaging in the context of inferring latent random variables.
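The regret-based weighting described in the last entry can be illustrated with a small sketch. The abstract does not state the actual weighting formula, so the code below assumes a simple stand-in: each expert's weight is proportional to the positive part of its discounted regret (expert's discounted gain minus the master's). This reproduces only the one property the abstract does state, namely that experts doing worse than the master get zero weight. The function names, the uniform fallback when no expert is ahead, and the discounting recurrence are all hypothetical choices made for illustration, not the paper's method.

```python
from typing import List


def regret_weights(discounted_gains: List[float], master_gain: float) -> List[float]:
    """Weights proportional to the positive part of each expert's regret.

    Assumption (not from the source): the weight is max(gain - master, 0),
    normalized. Experts at or below the master's gain get zero weight.
    """
    regrets = [max(g - master_gain, 0.0) for g in discounted_gains]
    total = sum(regrets)
    n = len(discounted_gains)
    if total == 0.0:
        # Assumption: fall back to uniform weights when no expert is ahead.
        return [1.0 / n] * n
    return [r / total for r in regrets]


def run_master(gains_per_round: List[List[float]], discount: float) -> List[List[float]]:
    """Run the master over a gain sequence; return the weights used each round."""
    n = len(gains_per_round[0])
    expert_gain = [0.0] * n   # discounted cumulative gain per expert
    master_gain = 0.0         # discounted cumulative gain of the master
    history = []
    for gains in gains_per_round:
        weights = regret_weights(expert_gain, master_gain)
        history.append(weights)
        # The master's gain this round is the weighted average of expert gains.
        round_gain = sum(w * g for w, g in zip(weights, gains))
        # Discount old gains, then add this round's gains.
        expert_gain = [discount * eg + g for eg, g in zip(expert_gain, gains)]
        master_gain = discount * master_gain + round_gain
    return history


if __name__ == "__main__":
    # Expert 0 always gains 1, expert 1 always gains 0.
    for round_weights in run_master([[1.0, 0.0]] * 3, discount=0.5):
        print(round_weights)
```

In this toy run the master starts uniform and, from the second round on, puts all weight on the expert that is ahead of it, while the lagging expert receives zero weight.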