We present a new method to translate videos to commands for robotic manipulation using Deep Recurrent Neural Networks (RNN). Our framework first extracts deep features from the input video frames with a deep Convolutional Neural Networks (CNN). Two RNN layers with an encoder-decoder architecture are then used to encode the visual features and sequentially generate the output words as the command. We demonstrate that the translation accuracy can be improved by allowing a smooth transaction between two RNN layers and using the state-of-the-art feature extractor. The experimental results on our new challenging dataset show that our approach outperforms recent methods by a fair margin. Furthermore, we combine the proposed translation module with the vision and planning system to let a robot perform various manipulation tasks. Finally, we demonstrate the effectiveness of our framework on a full-size humanoid robot WALK-MAN.
We propose AffordanceNet, a new deep learning approach to simultaneously detect multiple objects and their affordances from RGB images. Our AffordanceNet has two branches: an object detection branch to localize and classify the object, and an affordance detection branch to assign each pixel in the object to its most probable affordance label. The proposed framework employs three key components for effectively handling the multiclass problem in the affordance mask: a sequence of deconvolutional layers, a robust resizing strategy, and a multi-task loss function. The experimental results on the public datasets show that our AffordanceNet outperforms recent state-of-the-art methods by a fair margin, while its end-to-end architecture allows the inference at the speed of 150ms per image. This makes our AffordanceNet is well suitable for real-time robotic applications. Furthermore, we demonstrate the effectiveness of AffordanceNet in different testing environments and in real robotic applications. The source code is available at https://github.com/nqanh/affordance-net.
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.
We present a new method to estimate the 6DOF pose of the event camera solely based on the event stream. Our method first creates the event image from a list of events that occurs in a very short time interval, then a Stacked Spatial LSTM Network (SP-LSTM) is used to learn and estimate the camera pose. Our SP-LSTM comprises a CNN to learn deep features from the event images and a stack of LSTM to learn spatial dependencies in the image features space. We show that the spatial dependency plays an important role in the pose estimation task and the SP-LSTM can effectively learn that information. The experimental results on the public dataset show that our approach outperforms recent methods by a substantial margin. Overall, our proposed method reduces about 6 times the position error and 3 times the orientation error over the state of the art. The source code and trained models will be released.
Jul 10 2017 cs.CV
Seizure prediction has attracted a growing attention as one of the most challenging predictive data analysis efforts in order to improve the life of patients living with drug-resistant epilepsy and tonic seizures. Many outstanding works have been reporting great results in providing a sensible indirect (warning systems) or direct (interactive neural-stimulation) control over refractory seizures, some of which achieved high performance. However, many works put heavily handcraft feature extraction and/or carefully tailored feature engineering to each patient to achieve very high sensitivity and low false prediction rate for a particular dataset. This limits the benefit of their approaches if a different dataset is used. In this paper we apply Convolutional Neural Networks (CNNs) on different intracranial and scalp electroencephalogram (EEG) datasets and proposed a generalized retrospective and patient-specific seizure prediction method. We use Short-Time Fourier Transform (STFT) on 30-second EEG windows with 50% overlapping to extract information in both frequency and time domains. A standardization step is then applied on STFT components across the whole frequency range to prevent high frequencies features being influenced by those at lower frequencies. A convolutional neural network model is used for both feature extraction and classification to separate preictal segments from interictal ones. The proposed approach achieves sensitivity of 89.8% and false prediction rate (FPR) of 0.17/h on Freiburg Hospital intracranial EEG (iEEG) dataset, and sensitivity of 89.1% and FPR of 0.09/h on Children's Hospital of Boston-MIT scalp EEG (sEEG) dataset.
Jun 02 2017 cs.HC
Three-dimensional (3D) applications have come to every corner of life. We present 3DTouch, a novel 3D wearable input device worn on the fingertip for interacting with 3D applications. 3DTouch is self-contained, and designed to universally work on various 3D platforms. The device employs touch input for the benefits of passive haptic feedback, and movement stability. Moreover, with touch interaction, 3DTouch is conceptually less fatiguing to use over many hours than 3D spatial input devices such as Kinect. Our approach relies on relative positioning technique using an optical laser sensor and a 9-DOF inertial measurement unit. We implemented a set of 3D interaction techniques including selection, translation, and rotation using 3DTouch. An evaluation also demonstrates the device's tracking accuracy of 1.10 mm and 2.33 degrees for subtle touch interaction in 3D space. With 3DTouch project, we would like to provide an input device that reduces the gap between 3D applications and users.
Having accurate, detailed, and up-to-date information about the location and behavior of animals in the wild would revolutionize our ability to study and conserve ecosystems. We investigate the ability to automatically, accurately, and inexpensively collect such data, which could transform many fields of biology, ecology, and zoology into "big data" sciences. Motion sensor "camera traps" enable collecting wildlife pictures inexpensively, unobtrusively, and frequently. However, extracting information from these pictures remains an expensive, time-consuming, manual task. We demonstrate that such information can be automatically extracted by deep learning, a cutting-edge type of artificial intelligence. We train deep convolutional neural networks to identify, count, and describe the behaviors of 48 species in the 3.2-million-image Snapshot Serengeti dataset. Our deep neural networks automatically identify animals with over 93.8% accuracy, and we expect that number to improve rapidly in years to come. More importantly, if our system classifies only images it is confident about, our system can automate animal identification for 99.3% of the data while still performing at the same 96.6% accuracy as that of crowdsourced teams of human volunteers, saving more than 8.4 years (at 40 hours per week) of human labeling effort (i.e. over 17,000 hours) on this 3.2-million-image dataset. Those efficiency gains immediately highlight the importance of using deep neural networks to automate data extraction from camera-trap images. Our results suggest that this technology could enable the inexpensive, unobtrusive, high-volume, and even real-time collection of a wealth of information about vast numbers of animals in the wild.
Dec 30 2016 cs.SI
This paper studies marijuana-related tweets in social network Twitter. We collected more than 300,000 marijuana related tweets during November 2016 in our study. Our text-mining based algorithms and data analysis unveil some interesting patterns including: (i) users' attitudes (e.g., positive or negative) can be characterized by the existence of outer links in a tweet; (ii) 67% users use their mobile phones to post their messages while many users publish their messages using third-party automatic posting services; and (3) the number of tweets during weekends is much higher than during weekdays. Our data also showed the impact of the political events such as the U.S. presidential election or state marijuana legalization votes on the marijuana-related tweeting frequencies.
Dec 30 2016 cs.NI
Reliable broadcasting data to multiple receivers over lossy wireless channels is challenging due to the heterogeneity of the wireless link conditions. Automatic Repeat-reQuest (ARQ) based retransmission schemes are bandwidth inefficient due to data duplication at receivers. Network coding (NC) has been shown to be a promising technique for improving network bandwidth efficiency by combining multiple lost data packets for retransmission. However, it is challenging to accurately determine which lost packets should be combined together due to disrupted feedback channels. This paper proposes an adaptive data encoding scheme at the transmitter by joining network coding and machine learning (NCML) for retransmission of lost packets. Our proposed NCML extracts the important features from historical feedback signals received by the transmitter to train a classifier. The constructed classifier is then used to predict states of transmitted data packets at different receivers based on their corrupted feedback signals for effective data mixing. We have conducted extensive simulations to collaborate the efficiency of our proposed approach. The simulation results show that our machine learning algorithm can be trained efficiently and accurately. The simulation results show that on average the proposed NCML can correctly classify 90% of the states of transmitted data packets at different receivers. It achieves significant bandwidth gain compared with the ARQ and NC based schemes in different transmission terrains, power levels, and the distances between the transmitter and receivers.
Recent studies have shown that information mined from Craigslist can be used for informing public health policy or monitoring risk behavior. This paper presents a text-mining method for conducting public health surveillance of marijuana use concerns in the U.S. using online classified ads in Craigslist. We collected more than 200 thousands of rental ads in the housing categories in Craigslist and devised text-mining methods for efficiently and accurately extract rental ads associated with concerns about the uses of marijuana in different states across the U.S. We linked the extracted ads to their geographic locations and computed summary statistics of the ads having marijuana use concerns. Our data is then compared with the State Marijuana Laws Map published by the U.S. government and marijuana related keywords search in Google to verify our collected data with respect to the demographics of marijuana use concerns. Our data not only indicates strong correlations between Craigslist ads, Google search and the State Marijuana Laws Map in states where marijuana uses are legal, but also reveals some hidden world of marijuana use concerns in other states where marijuana use is illegal. Our approach can be utilized as a marijuana surveillance tool for policy makers to develop public health policy and regulations.
Dec 02 2016 cs.CV
Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.
Nov 23 2016 cs.RO
While autonomous multirotor micro aerial vehicles (MAVs) are uniquely well suited for certain types of missions benefiting from stationary flight capabilities, their more widespread usage still faces many hurdles, due in particular to their limited range and the difficulty of fully automating their deployment and retrieval. In this paper we address these issues by solving the problem of the automated landing of a quadcopter on a ground vehicle moving at relatively high speed. We present our system architecture, including the structure of our Kalman filter for the estimation of the relative position and velocity between the quadcopter and the landing pad, as well as our controller design for the full rendezvous and landing maneuvers. The system is experimentally validated by successfully landing in multiple trials a commercial quadcopter on the roof of a car moving at speeds of up to 50 km/h.
Nov 22 2016 cs.IR
A recent "third wave" of Neural Network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, this new NN research is often referred to as deep learning. Stemming from this tide of NN work, a number of researchers have recently begun to investigate NN approaches to Information Retrieval (IR). While deep NNs have yet to achieve the same level of success in IR as seen in other areas, the recent surge of interest and work in NNs for IR suggest that this state of affairs may be quickly changing. In this work, we survey the current landscape of Neural IR research, paying special attention to the use of learned representations of queries and documents (i.e., neural embeddings). We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research.
Jul 12 2016 cs.CL
This study investigates the use of unsupervised word embeddings and sequence features for sample representation in an active learning framework built to extract clinical concepts from clinical free text. The objective is to further reduce the manual annotation effort while achieving higher effectiveness compared to a set of baseline features. Unsupervised features are derived from skip-gram word embeddings and a sequence representation approach. The comparative performance of unsupervised features and baseline hand-crafted features in an active learning framework are investigated using a wide range of selection criteria including least confidence, information diversity, information density and diversity, and domain knowledge informativeness. Two clinical datasets are used for evaluation: the i2b2/VA 2010 NLP challenge and the ShARe/CLEF 2013 eHealth Evaluation Lab. Our results demonstrate significant improvements in terms of effectiveness as well as annotation effort savings across both datasets. Using unsupervised features along with baseline features for sample representation lead to further savings of up to 9% and 10% of the token and concept annotation rates, respectively.
Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images).
This paper presents a strategy to guide a mobile ground robot equipped with a camera or depth sensor, in order to autonomously map the visible part of a bounded three-dimensional structure. We describe motion planning algorithms that determine appropriate successive viewpoints and attempt to fill holes automatically in a point cloud produced by the sensing and perception layer. The emphasis is on accurately reconstructing a 3D model of a structure of moderate size rather than mapping large open environments, with applications for example in architecture, construction and inspection. The proposed algorithms do not require any initialization in the form of a mesh model or a bounding box, and the paths generated are well adapted to situations where the vision sensor is used simultaneously for mapping and for localizing the robot, in the absence of additional absolute positioning system. We analyze the coverage properties of our policy, and compare its performance to the classic frontier based exploration algorithm. We illustrate its efficacy for different structure sizes, levels of localization accuracy and range of the depth sensor, and validate our design on a real-world experiment.
We can better understand deep neural networks by identifying which features each of their neurons have learned to detect. To do so, researchers have created Deep Visualization techniques including activation maximization, which synthetically generates inputs (e.g. images) that maximally activate each neuron. A limitation of current techniques is that they assume each neuron detects only one type of feature, but we know that neurons can be multifaceted, in that they fire in response to many different types of features: for example, a grocery store class neuron must activate either for rows of produce or for a storefront. Previous activation maximization techniques constructed images without regard for the multiple different facets of a neuron, creating inappropriate mixes of colors, parts of objects, scales, orientations, etc. Here, we introduce an algorithm that explicitly uncovers the multiple facets of each neuron by producing a synthetic visualization of each of the types of images that activate a neuron. We also introduce regularization methods that produce state-of-the-art results in terms of the interpretability of images obtained by activation maximization. By separately synthesizing each type of image a neuron fires in response to, the visualizations have more appropriate colors and coherent global structure. Multifaceted feature visualization thus provides a clearer and more comprehensive description of the role of each neuron.
Recent years have produced great advances in training large, deep neural networks (DNNs), including notable successes in training convolutional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what computations they perform at intermediate layers, has lagged behind. Progress in the field will be further accelerated by the development of better tools for visualizing and interpreting neural nets. We introduce two such tools here. The first is a tool that visualizes the activations produced on each layer of a trained convnet as it processes an image or video (e.g. a live webcam stream). We have found that looking at live activations that change in response to user input helps build valuable intuitions about how convnets work. The second tool enables visualizing features at each layer of a DNN via regularized optimization in image space. Because previous versions of this idea produced less recognizable images, here we introduce several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations. Both tools are open source and work on a pre-trained convnet with minimal setup.
Providing feedback, both assessing final work and giving hints to stuck students, is difficult for open-ended assignments in massive online classes which can range from thousands to millions of students. We introduce a neural network method to encode programs as a linear mapping from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm to assessments from the Code.org Hour of Code and Stanford University's CS1 course, where we propagate human comments on student assignments to orders of magnitude more submissions.
Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects, which we call "fooling images" (more generally, fooling examples). Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
Dec 01 2014 cs.NI
A reliable and scalable mechanism to provide protection against a link or node failure has additional requirements in the context of SDN and OpenFlow. Not only it has to minimize the load on the controller, but it must be able to react even when the controller is unreachable. In this paper we present a protection scheme based on precomputed backup paths and inspired by MPLS crankback routing, that guarantees instantaneous recovery times and aims at zero packet-loss after failure detection, regardless of controller reachability, even when OpenFlow's "fast-failover" feature cannot be used. The proposed mechanism is based on OpenState, an OpenFlow extension that allows a programmer to specify how forwarding rules should autonomously adapt in a stateful fashion, reducing the need to rely on remote controllers. We present the scheme as well as two different formulations for the computation of backup paths.
We present 3DTouch, a novel 3D wearable input device worn on the fingertip for 3D manipulation tasks. 3DTouch is designed to fill the missing gap of a 3D input device that is self-contained, mobile, and universally working across various 3D platforms. This paper presents a low-cost solution to designing and implementing such a device. Our approach relies on relative positioning technique using an optical laser sensor and a 9-DOF inertial measurement unit. 3DTouch is self-contained, and designed to universally work on various 3D platforms. The device employs touch input for the benefits of passive haptic feedback, and movement stability. On the other hand, with touch interaction, 3DTouch is conceptually less fatiguing to use over many hours than 3D spatial input devices. We propose a set of 3D interaction techniques including selection, translation, and rotation using 3DTouch. An evaluation also demonstrates the device's tracking accuracy of 1.10 mm and 2.33 degrees for subtle touch interaction in 3D space. Modular solutions like 3DTouch opens up a whole new design space for interaction techniques to further develop on.
With the evolution of mobile devices, and smart-phones in particular, comes the ability to create new experiences that enhance the way we see, interact, and manipulate objects, within the world that surrounds us. It is now possible to blend data from our senses and our devices in numerous ways that simply were not possible before using Augmented Reality technology. In a near future, when all of the office devices as well as your personal electronic gadgets are on a common wireless network, operating them using a universal remote controller would be possible. This paper presents an off-the-shelf, low-cost prototype that leverages the Augmented Reality technology to deliver a novel and interactive way of operating office network devices around using a mobile device. We believe this type of system may provide benefits to controlling multiple integrated devices and visualizing interconnectivity or utilizing visual elements to pass information from one device to another, or may be especially beneficial to control devices when interacting with them physically may be difficult or pose danger or harm.
Feb 05 2013 cs.CY
Online retailing (a model of B2C e-commerce) is growing world-wide, with companies in many countries showing increased sales and productivity as a result. It has great potential within the global economy. This paper looks at the current status of online retailing in Saudi Arabia, with particular focus on what inhibits or enables both the customers and retailers. It also analyses the status of Government involvement and proposes a layered model, known as the Wheel of Online Retailing which illustrates how Government intervention can benefit the e-commerce in Saudi Arabia.
This paper looks at the present standing of ecommerce in Saudi Arabia, as well as the challenges and strengths of Business to Customers (B2C) electronic commerce. Many studies have been conducted around the world in order to gain a better understanding of the demands, needs and effectiveness of online commerce. A study was undertaken to review the literature identifying the factors influencing the adoption and diffusion of B2C e-commerce. It found four distinct categories: businesses, customers, environmental and governmental support, which must all be considered when creating an e-commerce infrastructure. A concept matrix was used to provide a comparison of important factors in different parts of the world. The study found that e-commerce in Saudi Arabia was lacking in Governmental support as well as relevant involvement by both customers and retailers.
Jan 05 2013 cs.CY
The IS-Impact Measurement Model, developed by Gable, Sedera and Chan in 2008, represents the to-date and expected stream of net profits from a given information system (IS), as perceived by all major user classes. Although this model has been stringently validated in previous studies, its generalizability and verified effectiveness are enhanced through this new application in e-learning. This paper focuses on the re-validation of the findings of the IS-Impact Model in two universities in the Kingdom of Saudi Arabia (KSA). Among the users of 2 universities e-learning systems, 528 students were recruited. A formative validation measurement with SmartPLS, a graphical structural equation modeling tool was used to analyse the collected data. On the basis of the SmartPLS results, as well as with the aid of data-supported IS impact measurements and dimensions, we confirmed the validity of the IS-Impact Model for assessing the effect of e-learning systems in KSA universities. The newly constructed model is more understandable, its use was proved as robust and applicable to various circumstances.
Nov 13 2012 cs.CY
This paper presents the preliminary findings of a study researching the diffusion and the adoption of online retailing in Saudi Arabia. It reports new research that identifies and explores the key issues that positively and negatively influence the decision of Saudi customers to buy from online retailers in Saudi Arabia. Although Saudi Arabia has the largest and fastest growth of ICT marketplaces in the Arab region, e-commerce activities are not progressing at the same speed. While the overall research project involves exploratory research using mixed methods, the focus of this paper is on a quantitative analysis of responses obtained from a survey of Saudi customers, with the design of the questionnaire instrument being based on the findings of a qualitative analysis reported in a previous paper. The main findings of the current analysis include a list of key factors that affect Saudi customers' purchase from Saudi online retailers, and quantitative indications of the relative strengths of the various relationships.
Nov 13 2012 cs.CY
This paper presents some findings from a study researching the diffusion and adoption of online retailing in Saudi Arabia. Although the country has the largest and fastest growing ICT marketplace in the Arab region, e-commerce activities have not progressed at a similar speed. In general, Saudi retailers have not responded actively to the global growth of online retailing. Accordingly new research has been conducted to identify and explore key issues that positively and negatively influence Saudi retailers in deciding whether to adopt the online channel. While the overall research project uses mixed methods, the focus of this paper is on a quantitative analysis of responses obtained from a survey of retailers in Saudi Arabia, with the design of the questionnaire instrument being based on the findings of a qualitative analysis reported in a previous paper. The main findings of the current analysis include a list of key factors that affect retailers decision to adopt e-commerce, and quantitative indications of the relative strengths of the various relationships.
Aug 03 2012 cs.DS
We present a practical algorithm for the cyclic longest common subsequence (CLCS) problem that runs in O(mn) time, where m and n are the lengths of the two input strings. While this is not necessarily an asymptotic improvement over the existing record, it is far simpler to understand and to implement.