results for au:Chen_Y in:cs

- We present MoodSwipe, a soft keyboard that suggests text messages given the user-specified emotions utilizing the real dialog data. The aim of MoodSwipe is to create a convenient user interface to enjoy the technology of emotion classification and text suggestion, and at the same time to collect labeled data automatically for developing more advanced technologies. While users select the MoodSwipe keyboard, they can type as usual but sense the emotion conveyed by their text and receive suggestions for their message as a benefit. In MoodSwipe, the detected emotions serve as the medium for suggested texts, where viewing the latter is the incentive to correcting the former. We conduct several experiments to show the superiority of the emotion classification models trained on the dialog data, and further to verify good emotion cues are important context for text suggestion.
- Jul 25 2017 cs.CV arXiv:1707.07584v1Foreground segmentation in video sequences is a classic topic in computer vision. Due to the lack of semantic and prior knowledge, it is difficult for existing methods to deal with sophisticated scenes well. Therefore, in this paper, we propose an end-to-end two-stage deep convolutional neural network (CNN) framework for foreground segmentation in video sequences. In the first stage, a convolutional encoder-decoder sub-network is employed to reconstruct the background images and encode rich prior knowledge of background scenes. In the second stage, the reconstructed background and current frame are input into a multi-channel fully-convolutional sub-network (MCFCN) for accurate foreground segmentation. In the two-stage CNN, the reconstruction loss and segmentation loss are jointly optimized. The background images and foreground objects are output simultaneously in an end-to-end way. Moreover, by incorporating the prior semantic knowledge of foreground and background in the pre-training process, our method could restrain the background noise and keep the integrity of foreground objects at the same time. Experiments on CDNet 2014 show that our method outperforms the state-of-the-art by 4.9%.
- We propose a minority route choice game to investigate the effect of the network structure on traffic network performance under the assumption of drivers' bounded rationality. We investigate ring-and-hub topologies to capture the nature of traffic networks in cities, and employ a minority game-based inductive learning process to model the characteristic behavior under the route choice scenario. Through numerical experiments, we find that topological changes in traffic networks induce a phase transition from an uncongested phase to a congested phase. Understanding this phase transition is helpful in planning new traffic networks.
- In this paper, the one-sided secrecy of two-way wiretap channel with feedback is investigated, where the confidential messages of one user through multiple transmissions is guaranteed secure against an external eavesdropper. For one thing, one-sided secrecy satisfies the secure demand of many practical scenarios. For another, the secrecy is measured over many blocks since the correlation between eavesdropper's observation and the confidential messages in successive blocks, instead of secrecy measurement of one block in previous works. Thus, firstly, an achievable secrecy rate region is derived for the general two-way wiretap channel with feedback through multiple transmissions under one-sided secrecy. Secondly, outer bounds on the secrecy capacity region are also obtained. The gap between inner and outer bounds on the secrecy capacity region is explored via the binary input two-way wiretap channels. Most notably, the secrecy capacity regions are established for the XOR channel. Furthermore, the result shows that the achievable rate region with feedback is larger than that without feedback. Therefore, the benefit role of feedback is precisely characterized for two-way wiretap channel with feedback under one-sided secrecy.
- Jul 19 2017 cs.CV arXiv:1707.05495v1In this paper, we propose the joint learning attention and recurrent neural network (RNN) models for multi-label classification. While approaches based on the use of either model exist (e.g., for the task of image captioning), training such existing network architectures typically require pre-defined label sequences. For multi-label classification, it would be desirable to have a robust inference process, so that the prediction error would not propagate and thus affect the performance. Our proposed model uniquely integrates attention and Long Short Term Memory (LSTM) models, which not only addresses the above problem but also allows one to identify visual objects of interests with varying sizes without the prior knowledge of particular label ordering. More importantly, label co-occurrence information can be jointly exploited by our LSTM model. Finally, by advancing the technique of beam search, prediction of multiple labels can be efficiently achieved by our proposed network model.
- Security-critical tasks require proper isolation from untrusted software. Chip manufacturers design and include trusted execution environments (TEEs) in their processors to secure these tasks. The integrity and security of the software in the trusted environment depend on the verification process of the system. We find a form of attack that can be performed on the current implementations of the widely deployed ARM TrustZone technology. The attack exploits the fact that the trustlet (TA) or TrustZone OS loading verification procedure may use the same verification key and may lack proper rollback prevention across versions. If an exploit works on an out-of-date version, but the vulnerability is patched on the latest version, an attacker can still use the same exploit to compromise the latest system by downgrading the software to an older and exploitable version. We did experiments on popular devices on the market including those from Google, Samsung and Huawei, and found that all of them have the risk of being attacked. Also, we show a real-world example to exploit Qualcomm's QSEE. In addition, in order to find out which device images share the same verification key, pattern matching schemes for different vendors are analyzed and summarized.
- Sensor-based activity recognition seeks the profound high-level knowledge about human activity from multitudes of low-level sensor readings. Conventional pattern recognition approaches have made tremendous progress in the past years. However, most of those approaches heavily rely on heuristic hand-crafted feature extraction methods, which dramatically hinder their generalization performance. Additionally, those methods often produce unsatisfactory results for unsupervised and incremental learning tasks. Meanwhile, the recent advancement of deep learning makes it possible to perform automatic high-level feature extraction thus achieves promising performance in many areas. Since then, deep learning based methods have been widely adopted for the sensor-based activity recognition tasks. In this paper, we survey and highlight the recent advancement of deep learning approaches for sensor-based activity recognition. Specifically, we summarize existing literatures from three aspects: sensor modality, deep model and application. We also present a detailed discussion and propose grand challenges for future direction.
- Jul 07 2017 cs.CV arXiv:1707.01629v1In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new features exploration which are both important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImagNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over state-of-the-arts. In particular, on the ImagNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with more than 3 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.
- Jul 07 2017 cs.GT arXiv:1707.01590v1Recent literature on computational notions of fairness has been broadly divided into two distinct camps, supporting interventions that address either individual-based or group-based fairness. Rather than privilege a single definition, we seek to resolve both within the particular domain of employment discrimination. To this end, we construct a dual labor market model composed of a Temporary Labor Market, in which firm strategies are constrained to ensure group-level fairness, and a Permanent Labor Market, in which individual worker fairness is guaranteed. We show that such restrictions on hiring practices induces an equilibrium that Pareto-dominates those arising from strategies that employ statistical discrimination or a "group-blind" criterion. Individual worker reputations produce externalities for collective reputation, generating a feedback loop termed a "self-fulfilling prophecy." Our model produces its own feedback loop, raising the collective reputation of an initially disadvantaged group via a fairness intervention that need not be permanent. Moreover, we show that, contrary to popular assumption, the asymmetric equilibria resulting from hiring practices that disregard group-fairness may be immovable without targeted intervention. The enduring nature of such equilibria that are both inequitable and Pareto inefficient suggest that fairness interventions are of critical importance in moving the labor market to be more socially just and efficient.
- Jul 07 2017 cs.CV arXiv:1707.01691v1We present RON, an efficient and effective framework for generic object detection. Our motivation is to smartly associate the best of the region-based (e.g., Faster R-CNN) and region-free (e.g., SSD) methodologies. Under fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) negative sample mining. To address (a), we design the reverse connection, which enables the network to detect objects on multi-levels of CNNs. To deal with (b), we propose the objectness prior to significantly reduce the searching space of objects. We optimize the reverse connection, objectness prior and object detector jointly by a multi-task loss function, thus RON can directly predict final detection results from all locations of various feature maps. Extensive experiments on the challenging PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO benchmarks demonstrate the competitive performance of RON. Specifically, with VGG-16 and low resolution 384X384 input size, the network gets 81.3% mAP on PASCAL VOC 2007, 80.7% mAP on PASCAL VOC 2012 datasets. Its superiority increases when datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. With 1.5G GPU memory at test phase, the speed of the network is 15 FPS, 3X faster than the Faster R-CNN counterpart.
- We have witnessed rapid evolution of deep neural network architecture design in the past years. These latest progresses greatly facilitate the developments in various areas such as computer vision, natural language processing, etc. However, along with the extraordinary performance, these state-of-the-art models also bring in expensive computational cost. Directly deploying these models into applications with real-time requirement is still infeasible. Recently, Hinton etal. have shown that the dark knowledge within a powerful teacher model can significantly help the training of a smaller and faster student network. These knowledge are vastly beneficial to improve the generalization ability of the student model. Inspired by their work, we introduce a new type of knowledge -- cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the learning to rank technique into deep metric learning formulation. We test our proposed DarkRank on the pedestrian re-identification task. The results are quite encouraging. Our DarkRank can improve over the baseline method by a large margin. Moreover, it is fully compatible with other existing methods. When combined, the performance can be further boosted.
- Jul 04 2017 cs.GT arXiv:1707.00208v1Selfish routing is a central problem in algorithmic game theory, with one of the principal applications being that of routing in road networks. Inspired by the emergence of routing technologies and autonomous driving, we revisit selfish routing and consider three possible outcomes of it: (i) $\theta$-Positive Nash Equilibrium flow, where every path that has non-zero flow on all of its edges has cost no greater than $\theta$ times the cost of any other path, (ii) $\theta$-Used Nash Equilibrium flow, where every used path that appears in the path flow decomposition has cost no greater than $\theta$ times the cost of any other path, and (iii) $\theta$-Envy Free flow, where every path that appears in the path flow decomposition has cost no greater than $\theta$ times the cost of any other path in the path flow decomposition. We first examine the relations of these outcomes among each other and then measure their possible impact on the network's performance. Afterwards, we examine the computational complexity of finding such flows of minimum social cost and give a range for $\theta$ for which this task is easy and a range for $\theta$ for which this task is NP-hard. Finally, we propose deterministic strategies which, in a worst case approach, can be used by a central planner in order to provide good such flows, and further introduce a natural idea for randomly routing players after giving them specific guarantees about their costs in the randomized routing, as a tool for the central planner to implement a desired flow.
- Jul 04 2017 cs.CV arXiv:1707.00383v1In this paper, we propose an alternative method to estimate room layouts of cluttered indoor scenes. This method enjoys the benefits of two novel techniques. The first one is semantic transfer (ST), which is: (1) a formulation to integrate the relationship between scene clutter and room layout into convolutional neural networks; (2) an architecture that can be end-to-end trained; (3) a practical strategy to initialize weights for very deep networks under unbalanced training data distribution. ST allows us to extract highly robust features under various circumstances, and in order to address the computation redundance hidden in these features we develop a principled and efficient inference scheme named physics inspired optimization (PIO). PIO's basic idea is to formulate some phenomena observed in ST features into mechanics concepts. Evaluations on public datasets LSUN and Hedau show that the proposed method is more accurate than state-of-the-art methods.
- Jun 30 2017 cs.CL arXiv:1706.09742v1We present the data profile and the evaluation plan of the second oriental language recognition (OLR) challenge AP17-OLR. Compared to the event last year (AP16-OLR), the new challenge involves more languages and focuses more on short utterances. The data is offered by SpeechOcean and the NSFC M2ASR project. Two types of baselines are constructed to assist the participants, one is based on the i-vector model and the other is based on various neural networks. We report the baseline results evaluated with various metrics defined by the AP17-OLR evaluation plan and demonstrate that the combined database is a reasonable data resource for multilingual research. All the data is free for participants, and the Kaldi recipes for the baselines have been published online.
- Objective: Amyotrophic lateral sclerosis (ALS) is a rare disease, but is also one of the most common motor neuron diseases, and people of all races and ethnic backgrounds are affected. There is currently no cure. Brain computer interfaces (BCIs) can establish a communication channel directly between the brain and an external device by recognizing brain activities that reflect user intent. Therefore, this technology could help ALS patients in promoting functional independence through BCI-based speller systems and motor assistive devices. Methods: In this paper, two kinds of ERP-based speller systems were tested on 18 ALS patients to: (1) assess performance when they spelled 42 characters online continuously, without a break; and (2) to compare performance between a matrix-based speller paradigm (MS-P, mean visual angle 6 degree) and a new speller paradigm that used a larger visual angle called the large visual angle speller paradigm (LS-P, mean visual angle 8 degree). Results: Although results showed that there were no significant differences between the two paradigms in accuracy trend over continuous use (p>0.05), the fatigue during the LS-P condition was significantly lower than that of MS-P (p<0.05). Results also showed that continuous use slightly reduced the performance of this ERP-based BCI. Conclusion: 15 subjects obtained higher than 80% feedback accuracy (online output accuracy) and 9 subjects obtained higher than 90% feedback accuracy in one of the two paradigms, thus validating the BCI approaches in this study. Significance: Most ALS subjects in this study could spell effectively after continuous use of an ERP-based BCI. The new LS-P display may be easier for subjects to use, resulting in lower fatigue.
- We present an efficient algorithm for recent generalizations of optimal mass transport theory to matrix-valued and vector-valued densities. These generalizations lead to several applications including diffusion tensor imaging, color images processing, and multi-modality imaging. The algorithm is based on sequential quadratic programming (SQP). By approximating the Hessian of the cost and solving each iteration in an inexact manner, we are able to solve each iteration with relatively low cost while still maintaining a fast convergent rate. The core of the algorithm is solving a weighted Poisson equation, where different efficient preconditioners may be employed. We utilize incomplete Cholesky factorization, which yields an efficient and straightforward solver for our problem. Several illustrative examples are presented for both the matrix and vector-valued cases.
- This paper proposes a speaker recognition (SRE) task with trivial speech events, such as cough and laugh. These trivial events are ubiquitous in conversations and less subjected to intentional change, therefore offering valuable particularities to discover the genuine speaker from disguised speech. However, trivial events are often short and idiocratic in spectral patterns, making SRE extremely difficult. Fortunately, we found a very powerful deep feature learning structure that can extract highly speaker-sensitive features. By employing this tool, we studied the SRE performance on three types of trivial events: cough, laugh and "Wei" (a short Chinese "Hello"). The results show that there is rich speaker information within these trivial events, even for cough that is intuitively less speaker distinguishable. With the deep feature approach, the EER can reach 10%-14% with the three trivial events, despite their extremely short durations (0.2-1.0 seconds).
- Jun 22 2017 cs.CV arXiv:1706.06792v1Deep Convolutional Neural Networks (CNNs) are capable of learning unprecedentedly effective features from images. Some researchers have struggled to enhance the parameters' efficiency using grouped convolution. However, the relation between the optimal number of convolutional groups and the recognition performance remains an open problem. In this paper, we propose a series of Basic Units (BUs) and a two-level merging strategy to construct deep CNNs, referred to as a joint Grouped Merging Net (GM-Net), which can produce joint grouped and reused deep features while maintaining the feature discriminability for classification tasks. Our GM-Net architectures with the proposed BU_A (dense connection) and BU_B (straight mapping) lead to significant reduction in the number of network parameters and obtain performance improvement in image classification tasks. Extensive experiments are conducted to validate the superior performance of the GM-Net than the state-of-the-arts on the benchmark datasets, e.g., MNIST, CIFAR-10, CIFAR-100 and SVHN.
- Although the recent progress in the deep neural network has led to the development of learnable local feature descriptors, there is no explicit answer for estimation of the necessary size of a neural network. Specifically, the local feature is represented in a low dimensional space, so the neural network should have more compact structure. The small networks required for local feature descriptor learning may be sensitive to initial conditions and learning parameters and more likely to become trapped in local minima. In order to address the above problem, we introduce an adaptive pruning Siamese Architecture based on neuron activation to learn local feature descriptors, making the network more computationally efficient with an improved recognition rate over more complex networks. Our experiments demonstrate that our learned local feature descriptors outperform the state-of-art methods in patch matching.
- Jun 13 2017 cs.SY arXiv:1706.03612v1We propose a framework to engineer synthetic-inertia and droop-control parameters for distributed energy resources (DERs) so that the system frequency in a network composed of DERs and synchronous generators conforms to prescribed transient and steady-state performance specifications. Our approach is grounded in a second-order lumped-parameter model that captures the dynamics of synchronous generators and frequency-responsive DERs endowed with inertial and droop control. A key feature of this reduced-order model is that its parameters can be related to those of the originating higher-order dynamical model. This allows one to systematically design the DER inertial and droop-control coefficients leveraging classical frequency-domain response characteristics of second-order systems. Time-domain simulations validate the accuracy of the model-reduction method and demonstrate how DER controllers can be designed to meet steady-state-regulation and transient-performance specifications.
- Jun 13 2017 cs.NE arXiv:1706.03609v1We extended the work of proposed activation function, Noisy Softplus, to fit into training of layered up spiking neural networks (SNNs). Thus, any ANN employing Noisy Softplus neurons, even of deep architecture, can be trained simply by the traditional algorithm, for example Back Propagation (BP), and the trained weights can be directly used in the spiking version of the same network without any conversion. Furthermore, the training method can be generalised to other activation units, for instance Rectified Linear Units (ReLU), to train deep SNNs off-line. This research is crucial to provide an effective approach for SNN training, and to increase the classification accuracy of SNNs with biological characteristics and to close the gap between the performance of SNNs and ANNs.
- We study Markov chain models where the transition mechanism depends nonlinearly on the current state. One specific choice for such a model, where the state represents "belief," was proposed in \citejia2015opinion to model opinion dynamics and is referred to as the DeGroot-Friedkin model. Herein, we consider a general class of such nonlinear Markov chain models and develop a theory for assessing stability. Our approach relies on establishing that the differential of the nonlinear dynamics (under suitable analyticity conditions) is contractive in the $\ell_1$ metric. We apply the theory to two type of nonlinear random walks, i.e., nonlinearly adapting the transition probabilities, where the adaptation is exponential and linear, respectively. The latter includes the DeGroot-Friedkin model and generalizations. We also discuss continuous-time generalization as well as interacting (particle) models and discuss their relevance with regard to modeling social dynamics over influence networks. Finally, we view the nonlinear adaptation of the transition mechanism as feedback and quantify the effect of external bias on the stationary distribution.
- Jun 09 2017 cs.SY arXiv:1706.02695v1Stand-alone direct current (DC) microgrids may belong to different owners and adopt various control strategies. This brings great challenge to its optimal operation due to the difficulty of implementing a unified control. This paper addresses the distributed optimal control of DC microgrids, which intends to break the restriction of diversity to some extent. Firstly, we formulate the optimal power flow (OPF) problem of stand-alone DC microgrids as an exact second order cone program (SOCP) and prove the uniqueness of the optimal solution. Then a dynamic solving algorithm based on primal-dual decomposition method is proposed, the convergence of which is proved theoretically as well as the optimality of its equilibrium point. It should be stressed that the algorithm can provide control commands for the three types of microgrids: (i) power control, (ii) voltage control and (iii) droop control. This implies that each microgrid does not need to change its original control strategy in practice, which is less influenced by the diversity of microgrids. Moreover, the control commands for power controlled and voltage controlled microgrids satisfy generation limits and voltage limits in both transient process and steady state. Finally, a six-microgrid DC system based on the microgrid benchmark is adopted to validate the effectiveness and plug-n-play property of our designs.
- Jun 09 2017 cs.CV arXiv:1706.02425v1In this paper, we investigated a C-arm tomographic technique as a new three dimensional (3D) kidney imaging method for nephrolithiasis and kidney stone detection over view angle less than 180o. Our C-arm tomographic technique provides a series of two dimensional (2D) images with a single scan over 40o view angle. Experimental studies were performed with a kidney phantom that was formed from a pig kidney with two embedded kidney stones. Different reconstruction methods were developed for C-arm tomographic technique to generate 3D kidney information including: point by point back projection (BP), filtered back projection (FBP), simultaneous algebraic reconstruction technique (SART) and maximum likelihood expectation maximization (MLEM). Computer simulation study was also done with simulated 3D spherical object to evaluate the reconstruction results. Preliminary results demonstrated the capability of our C-arm tomographic technique to generate 3D kidney information for kidney stone detection with low exposure of radiation. The kidney stones are visible on reconstructed planes with identifiable shapes and sizes.
- Jun 08 2017 cs.SD arXiv:1706.02101v1For practical automatic speaker verification (ASV) systems, replay attack poses a true risk. By replaying a pre-recorded speech signal of the genuine speaker, ASV systems tend to be easily fooled. An effective replay detection method is therefore highly desirable. In this study, we investigate a major difficulty in replay detection: the over-fitting problem caused by variability factors in speech signal. An F-ratio probing tool is proposed and three variability factors are investigated using this tool: speaker identity, speech content and playback & recording device. The analysis shows that device is the most influential factor that contributes the highest over-fitting risk. A frequency warping approach is studied to alleviate the over-fitting problem, as verified on the ASV-spoof 2017 database.
- Convolutional neural networks (CNNs) with deep architectures have substantially advanced the state-of-the-art in computer vision tasks. However, deep networks are typically resource-intensive and thus difficult to be deployed on mobile devices. Recently, CNNs with binary weights have shown compelling efficiency to the community, whereas the accuracy of such models is usually unsatisfactory in practice. In this paper, we introduce network sketching as a novel technique of pursuing binary-weight CNNs, targeting at more faithful inference and better trade-off for practical applications. Our basic idea is to exploit binary structure directly in pre-trained filter banks and produce binary-weight models via tensor expansion. The whole process can be treated as a coarse-to-fine model approximation, akin to the pencil drawing steps of outlining and shading. To further speedup the generated models, namely the sketches, we also propose an associative implementation of binary tensor convolutions. Experimental results demonstrate that a proper sketch of AlexNet (or ResNet) outperforms the existing binary-weight models by large margins on the ImageNet large scale classification task, while the committed memory for network parameters only exceeds a little.
- Speech signals are complex intermingling of various informative factors, and this information blending makes decoding any of the individual factors extremely difficult. A natural idea is to factorize each speech frame into independent factors, though it turns out to be even more difficult than decoding each individual factor. A major encumbrance is that the speaker trait, a major factor in speech signals, has been suspected to be a long-term distributional pattern and so not identifiable at the frame level. In this paper, we demonstrated that the speaker factor is also a short-time spectral pattern and can be largely identified with just a few frames using a simple deep neural network (DNN). This discovery motivated a cascade deep factorization (CDF) framework that infers speech factors in a sequential way, and factors previously inferred are used as conditional variables when inferring other factors. Our experiment on an automatic emotion recognition (AER) task demonstrated that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered with high accuracy. This factorization and reconstruction approach provides a novel tool for many speech processing tasks.
- Jun 06 2017 cs.DC arXiv:1706.01022v1Traditionally, a regional dispatch center uses the equivalent method to deal with external grids, which fails to reflect the interactions among regions. This paper proposes a distributed N-1 contingency analysis (DCA) solution, where dispatch centers join a coordinated computation using their private data and computing resources. A distributed screening method is presented to determine the Critical Contingency Set (DCCS) in DCA. Then, the distributed power flow is formulated as a set of boundary equations, which is solved by a Jacobi-Free Newton-GMRES (JFNG) method. During solving the distributed power flow, only boundary conditions are exchanged. Acceleration techniques are also introduced, including reusing preconditioners and optimal resource scheduling during parallel processing of multiple contingencies. The proposed method is implemented on a real EMS platform, where tests using the Southwest Regional Grid of China are carried out to validate its feasibility.
- Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number $p$ of variables, twice the log-likelihood ratio (LLR) $2\Lambda$ is distributed as a $\chi^2_k$ variable in the limit of large sample sizes $n$; here, $k$ is the number of variables being tested. In this paper, we prove that when $p$ is not negligible compared to $n$, Wilks' theorem does not hold and that the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that $n$ and $p$ grow large in such a way that $p/n\rightarrow\kappa$ for some constant $\kappa < 1/2$. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, $2\Lambda~\stackrel{\mathrm{d}}{\rightarrow}~\alpha(\kappa)\chi_k^2$, where the scaling factor $\alpha(\kappa)$ is greater than one as soon as the dimensionality ratio $\kappa$ is positive. Hence, the LLR is larger than classically assumed. For instance, when $\kappa=0.3$, $\alpha(\kappa)\approx1.5$. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, non-asymptotic random matrix theory and convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model.
- Jun 05 2017 cs.CY arXiv:1706.00487v1Objectives: The fee-for-service approach to healthcare leads to the management of a patient's conditions in an independent manner, inducing various negative consequences. It is recognized that a bundled care approach to healthcare-one that manages a collection of health conditions together-may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions should be managed in a bundled program. Study Design: Retrospective inference of clusters of health conditions from an electronic medical record (EMR) system. A survey of healthcare experts to ascertain the plausibility of the clusters for bundled care programs. Methods: We designed a data-driven framework to infer clusters of health conditions via their shared clinical workflows according to EMR utilization by healthcare employees. We evaluated the framework with approximately 16,500 inpatient stays from a large medical center. The plausibility of the clusters for bundled care was assessed through a survey of a panel of healthcare experts using an analysis of variance (ANOVA) under a 95% confidence interval. Results: The framework inferred four condition clusters: 1) fetal abnormalities, 2) late pregnancies, 3) prostate problems, and 4) chronic diseases (with congestive heart failure featuring prominently). Each cluster was deemed plausible by the experts for bundled care. Conclusions: The findings suggest that data from EMRs may provide a basis for discovering new directions in bundled care. Still, translating such findings into actual care management will require further refinement, implementation, and evaluation.
- Some recent works revealed that deep neural networks (DNNs) are vulnerable to so-called adversarial attacks where input examples are intentionally perturbed to fool DNNs. In this work, we revisit the DNN training process that includes adversarial examples into the training dataset so as to improve DNN's resilience to adversarial attacks, namely, adversarial training. Our experiments show that different adversarial strengths, i.e., perturbation levels of adversarial examples, have different working zones to resist the attack. Based on the observation, we propose a multi-strength adversarial training method (MAT) that combines the adversarial training examples with different adversarial strengths to defend adversarial attacks. Two training structures - mixed MAT and parallel MAT - are developed to facilitate the tradeoffs between training time and memory occupation. Our results show that MAT can substantially minimize the accuracy degradation of deep learning systems to adversarial attacks on MNIST, CIFAR-10, CIFAR-100, and SVHN.
- May 30 2017 cs.CY arXiv:1705.09713v1OBJECTIVE: To test the hypothesis that variation in care coordination is related to LOS. DESIGN We applied a spectral co-clustering methodology to simultaneously infer groups of patients and care coordination patterns, in the form of interaction networks of health care professionals, from electronic medical record (EMR) utilization data. The care coordination pattern for each patient group was represented by standard social network characteristics and its relationship with hospital LOS was assessed via a negative binomial regression with a 95% confidence interval. SETTING AND PATIENTS This study focuses on 5,588 adult patients hospitalized for trauma at the Vanderbilt University Medical Center. The EMRs were accessed by healthcare professionals from 179 operational areas during 158,467 operational actions. MAIN OUTCOME MEASURES: Hospital LOS for trauma inpatients, as an indicator of care coordination efficiency. RESULTS: Three general types of care coordination patterns were discovered, each of which was affiliated with a specific patient group. The first patient group exhibited the shortest hospital LOS and was managed by a care coordination pattern that involved the smallest number of operational areas (102 areas, as opposed to 125 and 138 for the other patient groups), but exhibited the largest number of collaborations between operational areas (e.g., an average of 27.1 connections per operational area compared to 22.5 and 23.3 for the other two groups). The hospital LOS for the second and third patient groups was 14 hours (P = 0.024) and 10 hours (P = 0.042) longer than the first patient group, respectively.
- The capacity of a fractal wireless network with direct social interactions is studied in this paper. Specifically, we mathematically formulate the self-similarity of a fractal wireless network by a power-law degree distribution $ P(k) $, and we capture the connection feature between two nodes with degree $ k_{1} $ and $ k_{2} $ by a joint probability distribution $ P(k_{1},k_{2}) $. It is proved that if the source node communicates with one of its direct contacts randomly, the maximum capacity is consistent with the classical result $ \Theta\left(\frac{1}{\sqrt{n\log n}}\right) $ achieved by Kumar \citeGupta2000The. On the other hand, if the two nodes with distance $ d $ communicate according to the probability $ d^{-\beta} $, the maximum capacity can reach up to $ \Theta\left(\frac{1}{\log n}\right) $, which exhibits remarkable improvement compared with the well-known result in \citeGupta2000The.
- May 30 2017 cs.CV arXiv:1705.09882v1This work targets person re-identification (ReID) from depth sensors such as Kinect. Since depth is invariant to illumination and less sensitive than color to day-by-day appearance changes, a natural question is whether depth is an effective modality for Person ReID, especially in scenarios where individuals wear different colored clothes or over a period of several months. We explore the use of recurrent Deep Neural Networks for learning high-level shape information from low-resolution depth images. In order to tackle the small sample size problem, we introduce regularization and a hard temporal attention unit. The whole model can be trained end to end with a hybrid supervised loss. We carry out a thorough experimental evaluation of the proposed method on three person re-identification datasets, which include side views, views from the top and sequences with varying degree of partial occlusion, pose and viewpoint variations. To that end, we introduce a new dataset with RGB-D and skeleton data. In a scenario where subjects are recorded after three months with new clothes, we demonstrate large performance gains attained using Depth ReID compared to a state-of-the-art Color ReID. Finally, we show further improvements using the temporal attention unit in multi-shot setting.
- The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixel-wise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoder-decoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods.
- In this paper we consider the cluster estimation problem under the Stochastic Block Model. We show that the semidefinite programming (SDP) formulation for this problem achieves an error rate that decays exponentially in the signal-to-noise ratio. The error bound implies weak recovery in the sparse graph regime with bounded expected degrees, as well as exact recovery in the dense regime. An immediate corollary of our results yields error bounds under the Censored Block Model. Moreover, these error bounds are robust, continuing to hold under heterogeneous edge probabilities and a form of the so-called monotone attack. Significantly, this error rate is achieved by the SDP solution itself without any further pre- or post-processing, and improves upon existing polynomially-decaying error bounds proved using the Grothendieck\textquoteright s inequality. Our analysis has two key ingredients: (i) showing that the graph has a well-behaved spectrum, even in the sparse regime, after discounting an exponentially small number of edges, and (ii) an order-statistics argument that governs the final error rate. Both arguments highlight the implicit regularization effect of the SDP formulation.
- May 23 2017 cs.CC arXiv:1705.07312v1We study the computational complexity of the infinite-horizon discounted-reward Markov Decision Problem (MDP) with a finite state space $|\mathcal{S}|$ and a finite action space $|\mathcal{A}|$. We show that any randomized algorithm needs a running time at least $\Omega(|\mathcal{S}|^2|\mathcal{A}|)$ to compute an $\epsilon$-optimal policy with high probability. We consider two variants of the MDP where the input is given in specific data structures, including arrays of cumulative probabilities and binary trees of transition probabilities. For these cases, we show that the complexity lower bound reduces to $\Omega\left( \frac{|\mathcal{S}| |\mathcal{A}|}{\epsilon} \right)$. These results reveal a surprising observation that the computational complexity of the MDP depends on the data structure of input.
- High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels -1,0,1 which can aggressively reduce the communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks.
- We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning -- a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.
- Recently deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.
- Pure acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID). However, the phonetic information has been largely overlooked by most of existing neural LID models, although this information has been used in the conventional phonetic LID systems with a great success. We present a phone-aware neural LID architecture, which is a deep LSTM-RNN LID system but accepts output from an RNN-based ASR system. By utilizing the phonetic knowledge, the LID performance can be significantly improved. Interestingly, even if the test language is not involved in the ASR training, the phonetic knowledge still presents a large contribution. Our experiments conducted on four languages within the Babel corpus demonstrated that the phone-aware approach is highly effective.
- Deep neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID). However, the phonetic information has been largely overlooked by most of existing neural LID methods, although this information has been used in the conventional phonetic LID systems with a great success. We present a phonetic temporal neural model for LID, which is an LSTM-RNN LID system but accepts phonetic features produced by a phone-discriminative DNN as the input, rather than raw acoustic features. This new model is a reminiscence of the old phonetic LID methods, but the phonetic knowledge here is much richer: it is at the frame level and involves compacted information of all phones. Our experiments conducted on the Babel database and the AP16-OLR database demonstrate that the temporal phonetic neural approach is very effective, and significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions.
- In this paper, we address a major challenge confronting the Cloud Service Providers (CSPs) utilizing a tiered storage architecture - how to maximize their overall profit over a variety of storage tiers that offer distinct characteristics, as well as file placement and access request scheduling policies.
- Autoencoders have been successful in learning meaningful representations from image datasets. However, their performance on text datasets has not been widely studied. Traditional autoencoders tend to learn possibly trivial representations of text documents due to their confounding properties such as high-dimensionality, sparsity and power-law word distributions. In this paper, we propose a novel k-competitive autoencoder, called KATE, for text documents. Due to the competition between the neurons in the hidden layer, each neuron becomes specialized in recognizing specific data patterns, and overall the model can learn meaningful representations of textual data. A comprehensive set of experiments show that KATE can learn better representations than traditional autoencoders including denoising, contractive, variational, and k-sparse autoencoders. Our model also outperforms deep generative models, probabilistic topic models, and even word representation models (e.g., Word2Vec) in terms of several downstream tasks such as document classification, regression, and retrieval.
- May 03 2017 cs.CL arXiv:1705.00753v1While end-to-end neural machine translation (NMT) has made remarkable progress recently, it still suffers from the data scarcity problem for low-resource language pairs and domains. In this paper, we propose a method for zero-resource NMT by assuming that parallel sentences have close probabilities of generating a sentence in a third language. Based on this assumption, our method is able to train a source-to-target NMT model ("student") without parallel corpora available, guided by an existing pivot-to-target NMT model ("teacher") on a source-pivot parallel corpus. Experimental results show that the proposed method significantly improves over a baseline pivot-based model by +3.0 BLEU points across various language pairs.
- May 02 2017 cs.CV arXiv:1705.00389v2For human pose estimation in monocular images, joint occlusions and overlapping upon human bodies often result in deviated pose predictions. Under these circumstances, biologically implausible pose predictions may be produced. In contrast, human vision is able to predict poses by exploiting geometric constraints of joint inter-connectivity. To address the problem by incorporating priors about the structure of human bodies, we propose a novel structure-aware convolutional network to implicitly take such priors into account during training of the deep network. Explicit learning of such constraints is typically challenging. Instead, we design discriminators to distinguish the real poses from the fake ones (such as biologically implausible ones). If the pose generator (G) generates results that the discriminator fails to distinguish from real ones, the network successfully learns the priors.
- Despite the recent success of deep-learning based semantic segmentation, deploying a pre-trained road scene segmenter to a city whose images are not presented in the training set would not achieve satisfactory performance due to dataset biases. Instead of collecting a large number of annotated images of each city of interest to train or refine the segmenter, we propose an unsupervised learning approach to adapt road scene segmenters across different cities. By utilizing Google Street View and its time-machine feature, we can collect unannotated images for each road scene at different times, so that the associated static-object priors can be extracted accordingly. By advancing a joint global and class-specific domain adversarial learning framework, adaptation of pre-trained segmenters to that city can be achieved without the need of any user annotation or interaction. We show that our method improves the performance of semantic segmentation in multiple cities across continents, while it performs favorably against state-of-the-art approaches requiring annotated training data.
- Apr 26 2017 cs.CL arXiv:1704.07441v1This paper presents the first attempt, up to our knowledge, to classify English writing styles on this scale with the challenge of classifying day to day language written by writers with different backgrounds covering various areas of topics.The paper proposes simple machine learning algorithms and simple to generate features to solve hard problems. Relying on the scale of the data available from large sources of knowledge like Wikipedia. We believe such sources of data are crucial to generate robust solutions for the web with high accuracy and easy to deploy in practice. The paper achieves 74\% accuracy classifying native versus non native speakers writing styles. Moreover, the paper shows some interesting observations on the similarity between different languages measured by the similarity of their users English writing styles. This technique could be used to show some well known facts about languages as in grouping them into families, which our experiments support.
- Apr 26 2017 cs.CL arXiv:1704.07427v1Wikipedia is a useful knowledge source that benefits many applications in language processing and knowledge representation. An important feature of Wikipedia is that of categories. Wikipedia pages are assigned different categories according to their contents as human-annotated labels which can be used in information retrieval, ad hoc search improvements, entity ranking and tag recommendations. However, important pages are usually assigned too many categories, which makes it difficult to recognize the most important ones that give the best descriptions. In this paper, we propose an approach to recognize the most descriptive Wikipedia categories. We observe that historical figures in a precise category presumably are mutually similar and such categorical coherence could be evaluated via texts or Wikipedia links of corresponding members in the category. We rank descriptive level of Wikipedia categories according to their coherence and our ranking yield an overall agreement of 88.27% compared with human wisdom.
- Apr 26 2017 cs.CV arXiv:1704.07502v2Segmenting blood vessels in fundus imaging plays an important role in medical diagnosis. Many algorithms have been proposed. While deep Neural Networks have been attracting enormous attention from computer vision community recent years and several novel works have been done in terms of its application in retinal blood vessel segmentation, most of them are based on supervised learning which requires amount of labeled data, which is both scarce and expensive to obtain. We leverage the power of Deep Convolutional Neural Networks (DCNN) in feature learning, in this work, to achieve this ultimate goal. The highly efficient feature learning of DCNN inspires our novel approach that trains the networks with automatically-generated samples to achieve desirable performance on real-world fundus images. For this, we design a set of rules abstracted from the domain-specific prior knowledge to generate these samples. We argue that, with the high efficiency of DCNN in feature learning, one can achieve this goal by constructing the training dataset with prior knowledge, no manual labeling is needed. This approach allows us to take advantages of supervised learning without labeling. We also build a naive DCNN model to test it. The results on standard benchmarks of fundus imaging show it is competitive to the state-of-the-art methods which implies a potential way to leverage the power of DCNN in feature learning.
- Computational notions of entropy have many applications in cryptography and complexity theory. These notions measure how much (min-)entropy a source $X$ has from the eyes of a computationally bounded party who may hold certain "leakage information" $B$ that is correlated with $X$. In this work, we initiate the study of computational entropy in the quantum setting, where $X$ and/or $B$ may become quantum states and the computationally bounded observer is modeled as a small quantum circuit. Specifically, we investigate to what extent the classical notions of computational entropy generalize to the quantum setting, and whether quantum analogues of classical theorems still hold. For example, we show: - There exist quantum states that are pseudorandom i.e. they cannot be distinguished from the maximally mixed state by polynomial-sized quantum circuits) but have actual entropy 0. - We extend the classical Leakage Chain Rule for pseudoentropy to the case where the leakage information $B$ is quantum (while $X$ remains classical). - We show that a general form of the classical Dense Model Theorem (interpreted as showing the equivalence between two definitions of pseudo-relative-min-entropy) does not extend to quantum states. - As an application, we construct the first quantum leakage-resilient stream-cipher in the bounded-quantum-storage model, assuming the existence of a quantum-secure PRG. Along the way, we develop quantum analogues of a number of classical techniques (e.g. the Leakage Simulation Lemma, proven using a Non-uniform Min-Max Theorem or Boosting), and also identify classical techniques (namely, Gap Amplification) that do not hold in the quantum setting. Moreover, we introduce a variety of notions that combine quantum information and quantum complexity, which raise a number of directions for future work.
- In this work, we introduce $\beta$-expansion, a notion borrowed from number theory, as a theoretical framework to study fast construction of polar codes based on a recursive structure of universal partial order (UPO) and polarization weight (PW) algorithm. We show that polar codes can be recursively constructed from UPO by continuously solving several polynomial equations at each recursive step. From these polynomial equations, we can extract an interval for $\beta$, such that ranking the synthetic channels through a closed-form $\beta$-expansion preserves the property of nested frozen sets, which is a desired feature for low-complex construction. In an example of AWGN channels, we show that this interval for $\beta$ converges to a constant close to $1.1892 \approx 2^{1/4}$ when the code block-length trends to infinity. Both asymptotic analysis and simulation results validate our theoretical claims.
- Apr 20 2017 cs.CV arXiv:1704.05643v1Action recognition from well-segmented 3D skeleton video has been intensively studied. However, due to the difficulty in representing the 3D skeleton video and the lack of training data, action detection from streaming 3D skeleton video still lags far behind its recognition counterpart and image based object detection. In this paper, we propose a novel approach for this problem, which leverages both effective skeleton video encoding and deep regression based object detection from images. Our framework consists of two parts: skeleton-based video image mapping, which encodes a skeleton video to a color image in a temporal preserving way, and an end-to-end trainable fast skeleton action detector (Skeleton Boxes) based on image detection. Experimental results on the latest and largest PKU-MMD benchmark dataset demonstrate that our method outperforms the state-of-the-art methods with a large margin. We believe our idea would inspire and benefit future research in this important area.
- Apr 20 2017 cs.CV arXiv:1704.05645v2This paper presents an image classification based approach for skeleton-based video action recognition problem. Firstly, A dataset independent translation-scale invariant image mapping method is proposed, which transformes the skeleton videos to colour images, named skeleton-images. Secondly, A multi-scale deep convolutional neural network (CNN) architecture is proposed which could be built and fine-tuned on the powerful pre-trained CNNs, e.g., AlexNet, VGGNet, ResNet etal.. Even though the skeleton-images are very different from natural images, the fine-tune strategy still works well. At last, we prove that our method could also work well on 2D skeleton video data. We achieve the state-of-the-art results on the popular benchmard datasets e.g. NTU RGB+D, UTD-MHAD, MSRC-12, and G3D. Especially on the largest and challenge NTU RGB+D, UTD-MHAD, and MSRC-12 dataset, our method outperforms other methods by a large margion, which proves the efficacy of the proposed method.
- Apr 18 2017 cs.CL arXiv:1704.04601v1This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along a word sense selection mechanism given contexts. Prior work about learning multi-sense embeddings suffered from either ambiguity of different-level embeddings or inefficient sense selection. The proposed modular framework, MUSE, implements flexible modules to optimize distinct mechanisms, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training on the proposed modules, and introduce various exploration techniques on sense selection for better robustness. The experiments on benchmark data show that the proposed approach achieves the state-of-the-art performance on synonym selection as well as on contextual word similarities in terms of MaxSimC.
- We propose a novel automata model over the alphabet of rational numbers, which we call register automata over the rationals (RA-Q). It reads a sequence of rational numbers and outputs another rational number. RA-Q is an extension of the well-known register automata (RA) over infinite alphabets, which are finite automata equipped with a finite number of registers/variables for storing values. Like in the standard RA, the RA-Q model allows both equality and ordering tests between values. It, moreover, allows to perform linear arithmetic between certain variables. The model is quite expressive: in addition to the standard RA, it also generalizes other well-known models such as affine programs and arithmetic circuits. The main feature of RA-Q is that despite the use of linear arithmetic, the so-called invariant problem---a generalization of the standard non-emptiness problem---is decidable. We also investigate other natural decision problems, namely, commutativity, equivalence, and reachability. For deterministic RA-Q, commutativity and equivalence are polynomial-time inter-reducible with the invariant problem.
- Output-agreement mechanisms such as ESP Game have been widely used in human computation to obtain reliable human-generated labels. In this paper, we argue that a "time-limited" output-agreement mechanism can be used to create a fast and robust crowd-powered component in interactive systems, particularly dialogue systems, to extract key information from user utterances on the fly. Our experiments on Amazon Mechanical Turk using the Airline Travel Information System (ATIS) dataset showed that the proposed approach achieves high-quality results with an average response time shorter than 9 seconds.
- Apr 12 2017 cs.LO arXiv:1704.03167v1For every $q\in \mathbb N$ let $\textrm{FO}_q$ denote the class of sentences of first-order logic FO of quantifier rank at most $q$. If a graph property can be defined in $\textrm{FO}_q$, then it can be decided in time $O(n^q)$. Thus, minimizing $q$ has favorable algorithmic consequences. Many graph properties amount to the existence of a certain set of vertices of size $k$. Usually this can only be expressed by a sentence of quantifier rank at least $k$. We use the color-coding method to demonstrate that some (hyper)graph problems can be defined in $\textrm{FO}_q$ where $q$ is independent of $k$. This property of a graph problem is equivalent to the question of whether the corresponding parameterized problem is in the class $\textrm{para-AC}^0$. It is crucial for our results that the FO-sentences have access to built-in addition and multiplication. It is known that then FO corresponds to the circuit complexity class uniform $\textrm{AC}^0$. We explore the connection between the quantifier rank of FO-sentences and the depth of $\textrm{AC}^0$-circuits, and prove that $\textrm{FO}_q \subsetneq \textrm{FO}_{q+1}$ for structures with built-in addition and multiplication.
- Apr 10 2017 cs.CV arXiv:1704.02071v1We propose a principled convolutional neural pyramid (CNP) framework for general low-level vision and image processing tasks. It is based on the essential finding that many applications require large receptive fields for structure understanding. But corresponding neural networks for regression either stack many layers or apply large kernels to achieve it, which is computationally very costly. Our pyramid structure can greatly enlarge the field while not sacrificing computation efficiency. Extra benefit includes adaptive network depth and progressive upsampling for quasi-realtime testing on VGA-size input. Our method profits a broad set of applications, such as depth/RGB image restoration, completion, noise/artifact removal, edge refinement, image filtering, image enhancement and colorization.
- Apr 07 2017 cs.CV arXiv:1704.01926v1This paper tackles the problem of semi-supervised video object segmentation, that is, segmenting an object in a sequence given its mask in the first frame. One of the main challenges in this scenario is the change of appearance of the objects of interest. Their semantics, on the other hand, do not vary. This paper investigates how to take advantage of such invariance via the introduction of a semantic prior that guides the appearance model. Specifically, given the segmentation mask of the first frame of a sequence, we estimate the semantics of the object of interest, and propagate that knowledge throughout the sequence to improve the results based on an appearance model. We present Semantically-Guided Video Object Segmentation (SGV), which improves results over previous state of the art on two different datasets using a variety of evaluation metrics, while running in half a second per frame.
- Being able to predict whether a song can be a hit has impor- tant applications in the music industry. Although it is true that the popularity of a song can be greatly affected by exter- nal factors such as social and commercial influences, to which degree audio features computed from musical signals (whom we regard as internal factors) can predict song popularity is an interesting research question on its own. Motivated by the recent success of deep learning techniques, we attempt to ex- tend previous work on hit song prediction by jointly learning the audio features and prediction models using deep learning. Specifically, we experiment with a convolutional neural net- work model that takes the primitive mel-spectrogram as the input for feature learning, a more advanced JYnet model that uses an external song dataset for supervised pre-training and auto-tagging, and the combination of these two models. We also consider the inception model to characterize audio infor- mation in different scales. Our experiments suggest that deep structures are indeed more accurate than shallow structures in predicting the popularity of either Chinese or Western Pop songs in Taiwan. We also use the tags predicted by JYnet to gain insights into the result of different models.
- Apr 06 2017 cs.CV arXiv:1704.01502v1This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit annotation of fine-grained sentence to video region-sequence correspondence, but is only based on weak video-level sentence annotations. It differs from existing video captioning systems in three technical aspects. First, we propose lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, we introduce a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. A winner-takes-all scheme is adopted to weakly associate sentences to region-sequences in the training phase. Third, a sequence-to-sequence learning based language model is trained with the weakly supervised information obtained through the association process. We show that the proposed method can not only produce informative and diverse dense captions, but also outperform state-of-the-art single video captioning methods by a large margin.
- Apr 04 2017 cs.SY arXiv:1704.00189v1In this paper, the structural controllability of the systems over F(z) is studied using a new mathematical method-matroids. Firstly, a vector matroid is defined over F(z). Secondly, the full rank conditions of [sI-A|B] are derived in terms of the concept related to matroid theory, such as rank, base and union. Then the sufficient condition for the linear system and composite system over F(z) to be structurally controllable is obtained. Finally, this paper gives several examples to demonstrate that the married-theoretic approach is simpler than other existing approaches.
- Mar 30 2017 cs.CV arXiv:1703.09746v2Very large-scale Deep Neural Networks (DNNs) have achieved remarkable successes in a large variety of computer vision tasks. However, the high computation intensity of DNNs makes it challenging to deploy these models on resource-limited systems. Some studies used low-rank approaches that approximate the filters by low-rank basis to accelerate the testing. Those works directly decomposed the pre-trained DNNs by Low-Rank Approximations (LRA). How to train DNNs toward lower-rank space for more efficient DNNs, however, remains as an open area. To solve the issue, in this work, we propose Force Regularization, which uses attractive forces to enforce filters so as to coordinate more weight information into lower-rank space. We mathematically and empirically prove that after applying our technique, standard LRA methods can reconstruct filters using much lower basis and thus result in faster DNNs. The effectiveness of our approach is comprehensively evaluated in ResNets, AlexNet, and GoogLeNet. In AlexNet, for example, Force Regularization gains 2x speedup on modern GPU without accuracy loss and 4.05x speedup on CPU by paying small accuracy degradation. Moreover, Force Regularization better initializes the low-rank DNNs such that the fine-tuning can converge faster toward higher accuracy. The obtained lower-rank DNNs can be further sparsified, proving that Force Regularization can be integrated with state-of-the-art sparsity-based acceleration methods.
- Mar 28 2017 cs.CV arXiv:1703.09039v1Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of deep neural network to improve energy-efficiency and throughput without sacrificing performance accuracy or increasing hardware cost are critical to enabling the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various platforms and architectures that support DNNs, and highlight key trends in recent efficient processing techniques that reduce the computation cost of DNNs either solely via hardware design changes or via joint hardware design and network algorithm changes. It will also summarize various development resources that can enable researchers and practitioners to quickly get started on DNN design, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-design, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand trade-offs between various architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand of recent implementation trends and opportunities.
- For robotic vehicles to navigate safely and efficiently in pedestrian-rich environments, it is important to model subtle human behaviors and navigation rules. However, while instinctive to humans, socially compliant navigation is still difficult to quantify due to the stochasticity in people's behaviors. Existing works are mostly focused on using feature-matching techniques to describe and imitate human paths, but often do not generalize well since the feature values can vary from person to person, and even run to run. This work notes that while it is challenging to directly specify the details of what to do (precise mechanisms of human navigation), it is straightforward to specify what not to do (violations of social norms). Specifically, using deep reinforcement learning, this work develops a time-efficient navigation policy that respects common social norms. The proposed method is shown to enable fully autonomous navigation of a robotic vehicle moving at human walking speed in an environment with many pedestrians.
- Mar 28 2017 cs.CV arXiv:1703.08837v1The challenge of person re-identification (re-id) is to match individual images of the same person captured by different non-overlapping camera views against significant and unknown cross-view feature distortion. While a large number of distance metric/subspace learning models have been developed for re-id, the cross-view transformations they learned are view-generic and thus potentially less effective in quantifying the feature distortion inherent to each camera view. Learning view-specific feature transformations for re-id (i.e., view-specific re-id), an under-studied approach, becomes an alternative resort for this problem. In this work, we formulate a novel view-specific person re-identification framework from the feature augmentation point of view, called Camera coRrelation Aware Feature augmenTation (CRAFT). Specifically, CRAFT performs cross-view adaptation by automatically measuring camera correlation from cross-view visual data distribution and adaptively conducting feature augmentation to transform the original features into a new adaptive space. Through our augmentation framework, view-generic learning algorithms can be readily generalized to learn and optimize view-specific sub-models whilst simultaneously modelling view-generic discrimination information. Therefore, our framework not only inherits the strength of view-generic model learning but also provides an effective way to take into account view specific characteristics. Our CRAFT framework can be extended to jointly learn view-specific feature transformations for person re-id across a large network with more than two cameras, a largely under-investigated but realistic re-id setting. Additionally, we present a domain-generic deep person appearance representation which is designed particularly to be towards view invariant for facilitating cross-view adaptation by CRAFT.
- Mar 28 2017 cs.GT arXiv:1703.08636v1We propose definitions of substitutes and complements for pieces of information ("signals") in the context of a decision or optimization problem, with game-theoretic and algorithmic applications. In a game-theoretic context, substitutes capture diminishing marginal value of information to a rational decision maker. We use the definitions to address the question of how and when information is aggregated in prediction markets. Substitutes characterize "best-possible" equilibria with immediate information aggregation, while complements characterize "worst-possible", delayed aggregation. Game-theoretic applications also include settings such as crowdsourcing contests and Q\&A forums. In an algorithmic context, where substitutes capture diminishing marginal improvement of information to an optimization problem, substitutes imply efficient approximation algorithms for a very general class of (adaptive) information acquisition problems. In tandem with these broad applications, we examine the structure and design of informational substitutes and complements. They have equivalent, intuitive definitions from disparate perspectives: submodularity, geometry, and information theory. We also consider the design of scoring rules or optimization problems so as to encourage substitutability or complementarity, with positive and negative results. Taken as a whole, the results give some evidence that, in parallel with substitutable items, informational substitutes play a natural conceptual and formal role in game theory and algorithms.
- Language understanding is a key component in a spoken dialogue system. In this paper, we investigate how the language understanding module influences the dialogue system performance by conducting a series of systematic experiments on a task-oriented neural dialogue system in a reinforcement learning based setting. The empirical study shows that among different types of language understanding errors, slot-level errors can have more impact on the overall performance of a dialogue system compared to intent-level errors. In addition, our experiments demonstrate that the reinforcement learning based dialogue system is able to learn when and what to confirm in order to achieve better performance and greater robustness.
- Mar 21 2017 cs.CC arXiv:1703.06423v1The dichotomy conjecture for the parameterized embedding problem states that the problem of deciding whether a given graph $G$ from some class $K$ of "pattern graphs" can be embedded into a given graph $H$ (that is, is isomorphic to a subgraph of $H$) is fixed-parameter tractable if $K$ is a class of graphs of bounded tree width and $W[1]$-complete otherwise. Towards this conjecture, we prove that the embedding problem is $W[1]$-complete if $K$ is the class of all grids or the class of all walls.