results for au:Yin_X in:cs

- Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts where the only information provided is a set of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence model that encodes a set of objects and their locations as an input sequence using an LSTM network, and decodes this representation using an LSTM language model. We show that our model, despite encoding object layouts as a sequence, can represent spatial relationships between objects, and generate descriptions that are globally coherent and semantically relevant. We test our approach in a task of object-layout captioning by using only object annotations as inputs. We additionally show that our model, combined with a state-of-the-art object detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score) in the test benchmark of the standard MS-COCO Captioning task.
- Jun 28 2017 cs.CV arXiv:1706.08564v1Pedestrian detection is a critical problem in computer vision with significant impact on safety in urban autonomous driving. In this work, we explore how semantic segmentation can be used to boost pedestrian detection accuracy while having little to no impact on network efficiency. We propose a segmentation infusion network to enable joint supervision on semantic segmentation and pedestrian detection. When placed properly, the additional supervision helps guide features in shared layers to become more sophisticated and helpful for the downstream pedestrian detector. Using this approach, we find weakly annotated boxes to be sufficient for considerable performance gains. We provide an in-depth analysis to demonstrate how shared layers are shaped by the segmentation supervision. In doing so, we show that the resulting feature maps become more semantically meaningful and robust to shape and occlusion. Overall, our simultaneous detection and segmentation framework achieves a considerable gain over the state-of-the-art on the Caltech pedestrian dataset, competitive performance on KITTI, and executes 2x faster than competitive methods.
- Jun 01 2017 cs.CV arXiv:1705.11136v1The large pose discrepancy between two face images is one of the fundamental challenges in automatic face recognition. Conventional approaches to pose-invariant face recognition either perform face frontalization on, or learn a pose-invariant representation from, a non-frontal face image. We argue that it is more desirable to perform both tasks jointly to allow them to leverage each other. To this end, this paper proposes a Disentangled Representation learning-Generative Adversarial Network (DR-GAN) with three distinct novelties. First, the encoder-decoder structure of the generator enables DR-GAN to learn a representation that is both generative and discriminative, which can be used for face image synthesis and pose-invariant face recognition. Second, this representation is explicitly disentangled from other face variations such as pose, through the pose code provided to the decoder and pose estimation in the discriminator. Third, DR-GAN can take one or multiple images as the input, and generate one unified identity representation along with an arbitrary number of synthetic face images. Extensive quantitative and qualitative evaluation on a number of controlled and in-the-wild databases demonstrate the superiority of DR-GAN over the state of the art in both learning representations and rotating large-pose face images.
- Apr 21 2017 cs.CV arXiv:1704.06244v3Despite recent advances in face recognition using deep learning, severe accuracy drops are observed for large pose variations in unconstrained environments. Learning pose-invariant features is one solution, but needs expensively labeled large-scale data and carefully designed feature learning algorithms. In this work, we focus on frontalizing faces in the wild under various head poses, including extreme profile views. We propose a novel deep 3D Morphable Model (3DMM) conditioned Face Frontalization Generative Adversarial Network (GAN), termed as FF-GAN, to generate neutral head pose face images. Our framework differs from both traditional GANs and 3DMM based modeling. Incorporating 3DMM into the GAN structure provides shape and appearance priors for fast convergence with less training data, while also supporting end-to-end training. The 3DMM-conditioned GAN employs not only the discriminator and generator loss but also a new masked symmetry loss to retain visual quality under occlusions, besides an identity loss to recover high frequency information. Experiments on face recognition, landmark localization and 3D reconstruction consistently show the advantage of our frontalization method on faces in the wild datasets.
- Feb 16 2017 cs.CV arXiv:1702.04710v2This paper explores multi-task learning (MTL) for face recognition. We answer the questions of how and why MTL can improve the face recognition performance. First, we propose a multi-task Convolutional Neural Network (CNN) for face recognition where identity classification is the main task and pose, illumination, and expression estimations are the side tasks. Second, we develop a dynamic-weighting scheme to automatically assign the loss weight to each side task, which is a crucial problem in MTL. Third, we propose a pose-directed multi-task CNN by grouping different poses to learn pose-specific identity features, simultaneously across all poses. Last but not least, we propose an energy-based weight analysis method to explore how CNN-based MTL works. We observe that the side tasks serve as regularizations to disentangle the variations from the learnt identity features. Extensive experiments on the entire Multi-PIE dataset demonstrate the effectiveness of the proposed approach. To the best of our knowledge, this is the first work using all data in Multi-PIE for face recognition. Our approach is also applicable to in-the-wild datasets for pose-invariant face recognition and achieves comparable or better performance than state of the art on LFW, CFP, and IJB-A datasets.
- Jan 18 2017 cs.SY arXiv:1701.04537v1There is a rapidly growing interest in the use of cloud computing for automotive vehicles to facilitate computation and data intensive tasks. Efficient utilization of on-demand cloud resources holds a significant potential to improve future vehicle safety, comfort, and fuel economy. In the meanwhile, issues like cyber security and resource allocation pose great challenges. In this paper, we treat the resource allocation problem for cloud-based automotive systems. Both private and public cloud paradigms are considered where a private cloud provides an internal, company-owned internet service dedicated to its own vehicles while a public cloud serves all subscribed vehicles. This paper establishes comprehensive models of cloud resource provisioning for both private and public cloud- based automotive systems. Complications such as stochastic communication delays and task deadlines are explicitly considered. In particular, a centralized resource provisioning model is developed for private cloud and chance constrained optimization is exploited to utilize the cloud resources for best Quality of Services. On the other hand, a decentralized auction-based model is developed for public cloud and reinforcement learning is employed to obtain an optimal bidding policy for a "selfish" agent. Numerical examples are presented to illustrate the effectiveness of the developed techniques.
- Jan 13 2017 cs.SY arXiv:1701.03343v1In this work, we investigate a state estimation problem for a full-car semi-active suspension system. To account for the complex calculation and optimization problems, a vehicle-to- cloud-to-vehicle (V2C2V) scheme is utilized. Moving horizon estimation is introduced for the state estimation system design. All the optimization problems are solved in a remotely-embedded agent with high computational ability. Measurements and state estimates are transmitted between the vehicle and the remote agent via networked communication channels. The effectiveness of the proposed method is illustrated via a set of simulations.
- User-perceived quality-of-experience (QoE) is critical in internet video delivery systems. Extensive prior work has studied the design of client-side bitrate adaptation algorithms to maximize single-player QoE. However, multiplayer QoE fairness becomes critical as the growth of video traffic makes it more likely that multiple players share a bottleneck in the network. Despite several recent proposals, there is still a series of open questions. In this paper, we bring the problem space to light from a control theory perspective by formalizing the multiplayer QoE fairness problem and addressing two key questions in the broader problem space. First, we derive the sufficient conditions of convergence to steady state QoE fairness under TCP-based bandwidth sharing scheme. Based on the insight from this analysis that in-network active bandwidth allocation is needed, we propose a non-linear MPC-based, router-assisted bandwidth allocation algorithm that regards each player as closed-loop systems. We use trace-driven simulation to show the improvement over existing approaches. We identify several research directions enabled by the control theoretic modeling and envision that control theory can play an important role on guiding real system design in adaptive video streaming.
- Efficient solutions to NP-complete problems would significantly benefit both science and industry. However, such problems are intractable on digital computers based on the von Neumann architecture, thus creating the need for alternative solutions to tackle such problems. Recently, a deterministic, continuous-time dynamical system (CTDS) was proposed (Nature Physics, 7(12), 966 (2011)) to solve a representative NP-complete problem, Boolean Satisfiability (SAT). This solver shows polynomial analog time-complexity on even the hardest benchmark $k$-SAT ($k \geq 3$) formulas, but at an energy cost through exponentially driven auxiliary variables. With some modifications to the CTDS equations, here we present a novel analog hardware SAT solver, AC-SAT, implementing the CTDS. AC-SAT is intended to be used as a co-processor, and with its modular design can be readily extended to different problem sizes. The circuit is designed and simulated based on a 32nm CMOS technology. SPICE simulation results show speedup factors of $\sim$10$^4$ on even the hardest 3-SAT problems, when compared with a state-of-the-art SAT solver on digital computers. For example, for hard problems with $N=50$ variables and $M=212$ clauses, solutions are found within from a few $ns$ to a few hundred $ns$ with an average power consumption of ${130}$ $mW$.
- Vector linear network coding (LNC) is a generalization of the conventional scalar LNC, such that the data unit transmitted on every edge is an $L$-dimensional vector of data symbols over a base field GF($q$). Vector LNC enriches the choices of coding operations at intermediate nodes, and there is a popular conjecture on the benefit of vector LNC over scalar LNC in terms of alphabet size of data units: there exist (single-source) multicast networks that are vector linearly solvable of dimension $L$ over GF($q$) but not scalar linearly solvable over any field of size $q' \leq q^L$. This paper introduces a systematic way to construct such multicast networks, and subsequently establish explicit instances to affirm the positive answer of this conjecture for \emphinfinitely many alphabet sizes $p^L$ with respect to an \empharbitrary prime $p$. On the other hand, this paper also presents explicit instances with the special property that they do not have a vector linear solution of dimension $L$ over GF(2) but have scalar linear solutions over GF($q'$) for some $q' < 2^L$, where $q'$ can be odd or even. This discovery also unveils that over a given base field, a multicast network that has a vector linear solution of dimension $L$ does not necessarily have a vector linear solution of dimension $L' > L$.
- Mar 17 2016 cs.DC arXiv:1603.05163v1Distributed storage systems provide large-scale reliable data storage services by spreading redundancy across a large group of storage nodes. In such a large system, node failures take place on a regular basis. When a storage node breaks down, a replacement node is expected to regenerate the redundant data as soon as possible in order to maintain the same level of redundancy. Previous results have been mainly focused on the minimization of network traffic in regeneration. However, in practical networks, where link capacities vary in a wide range, minimizing network traffic does not always yield the minimum regeneration time. In this paper, we investigate two approaches to the problem of minimizing regeneration time in networks with heterogeneous link capacities. The first approach is to download different amounts of repair data from the helping nodes according to the link capacities. The second approach generalizes the conventional star-structured regeneration topology to tree-structured topologies so that we can utilize the links between helping nodes with bypassing low-capacity links. Simulation results show that the flexible tree-structured regeneration scheme that combines the advantages of both approaches can achieve a substantial reduction in the regeneration time.
- Analyzing TCP Throughput Stability and Predictability with Implications for Adaptive Video StreamingJun 19 2015 cs.NI arXiv:1506.05541v1Recent work suggests that TCP throughput stability and predictability within a video viewing session can inform the design of better video bitrate adaptation algorithms. Despite a rich tradition of Internet measurement, however, our understanding of throughput stability and predictability is quite limited. To bridge this gap, we present a measurement study of throughput stability using a large-scale dataset from a video service provider. Drawing on this analysis, we propose a simple-but-effective prediction mechanism based on a hidden Markov model and demonstrate that it outperforms other approaches. We also show the practical implications in improving the user experience of adaptive video streaming.
- May 05 2015 cs.CV arXiv:1505.00353v2This paper proposes a novel framework for fluorescence plant video processing. The plant research community is interested in the leaf-level photosynthetic analysis within a plant. A prerequisite for such analysis is to segment all leaves, estimate their structures, and track them over time. We identify this as a joint multi-leaf segmentation, alignment, and tracking problem. First, leaf segmentation and alignment are applied on the last frame of a plant video to find a number of well-aligned leaf candidates. Second, leaf tracking is applied on the remaining frames with leaf candidate transformation from the previous frame. We form two optimization problems with shared terms in their objective functions for leaf alignment and tracking respectively. A quantitative evaluation framework is formulated to evaluate the performance of our algorithm with four metrics. Two models are learned to predict the alignment accuracy and detect tracking failure respectively in order to provide guidance for subsequent plant biology analysis. The limitation of our algorithm is also studied. Experimental results show the effectiveness, efficiency, and robustness of the proposed method.
- Oct 30 2014 cs.LG arXiv:1410.7835v2A Relational Dependency Network (RDN) is a directed graphical model widely used for multi-relational data. These networks allow cyclic dependencies, necessary to represent relational autocorrelations. We describe an approach for learning both the RDN's structure and its parameters, given an input relational database: First learn a Bayesian network (BN), then transform the Bayesian network to an RDN. Thus fast Bayes net learning can provide fast RDN learning. The BN-to-RDN transform comprises a simple, local adjustment of the Bayes net structure and a closed-form transform of the Bayes net parameters. This method can learn an RDN for a dataset with a million tuples in minutes. We empirically compare our approach to state-of-the art RDN learning methods that use functional gradient boosting, on five benchmark datasets. Learning RDNs via BNs scales much better to large datasets than learning RDNs with boosting, and provides competitive accuracy in predictions.
- Classifier ensemble generally should combine diverse component classifiers. However, it is difficult to give a definitive connection between diversity measure and ensemble accuracy. Given a list of available component classifiers, how to adaptively and diversely ensemble classifiers becomes a big challenge in the literature. In this paper, we argue that diversity, not direct diversity on samples but adaptive diversity with data, is highly correlated to ensemble accuracy, and we propose a novel technology for classifier ensemble, learning to diversify, which learns to adaptively combine classifiers by considering both accuracy and diversity. Specifically, our approach, Learning TO Diversify via Weighted Kernels (L2DWK), performs classifier combination by optimizing a direct but simple criterion: maximizing ensemble accuracy and adaptive diversity simultaneously by minimizing a convex loss function. Given a measure formulation, the diversity is calculated with weighted kernels (i.e., the diversity is measured on the component classifiers' outputs which are kernelled and weighted), and the kernel weights are automatically learned. We minimize this loss function by estimating the kernel weights in conjunction with the classifier weights, and propose a self-training algorithm for conducting this convex optimization procedure iteratively. Extensive experiments on a variety of 32 UCI classification benchmark datasets show that the proposed approach consistently outperforms state-of-the-art ensembles such as Bagging, AdaBoost, Random Forests, Gasen, Regularized Selective Ensemble, and Ensemble Pruning via Semi-Definite Programming.
- Alignment-free sequence analysis approaches provide important alternatives over multiple sequence alignment (MSA) in biological sequence analysis because alignment-free approaches have low computation complexity and are not dependent on high level of sequence identity, however, most of the existing alignment-free methods do not employ true full information content of sequences and thus can not accurately reveal similarities and differences among DNA sequences. We present a novel alignment-free computational method for sequence analysis based on Ramanujan-Fourier transform (RFT), in which complete information of DNA sequences is retained. We represent DNA sequences as four binary indicator sequences and apply RFT on the indicator sequences to convert them into frequency domain. The Euclidean distance of the complete RFT coefficients of DNA sequences are used as similarity measure. To address the different lengths in Euclidean space of RFT coefficients, we pad zeros to short DNA binary sequences so that the binary sequences equal the longest length in the comparison sequence data. Thus, the DNA sequences are compared in the same dimensional frequency space without information loss. We demonstrate the usefulness of the proposed method by presenting experimental results on hierarchical clustering of genes and genomes. The proposed method opens a new channel to biological sequence analysis, classification, and structural module identification.
- In an acyclic multicast network, it is well known that a linear network coding solution over GF($q$) exists when $q$ is sufficiently large. In particular, for each prime power $q$ no smaller than the number of receivers, a linear solution over GF($q$) can be efficiently constructed. In this work, we reveal that a linear solution over a given finite field does \emphnot necessarily imply the existence of a linear solution over all larger finite fields. Specifically, we prove by construction that: (i) For every source dimension no smaller than 3, there is a multicast network linearly solvable over GF(7) but not over GF(8), and another multicast network linearly solvable over GF(16) but not over GF(17); (ii) There is a multicast network linearly solvable over GF(5) but not over such GF($q$) that $q > 5$ is a Mersenne prime plus 1, which can be extremely large; (iii) A multicast network linearly solvable over GF($q^{m_1}$) and over GF($q^{m_2}$) is \emphnot necessarily linearly solvable over GF($q^{m_1+m_2}$); (iv) There exists a class of multicast networks with a set $T$ of receivers such that the minimum field size $q_{min}$ for a linear solution over GF($q_{min}$) is lower bounded by $\Theta(\sqrt{|T|})$, but not every larger field than GF($q_{min}$) suffices to yield a linear solution. The insight brought from this work is that not only the field size, but also the order of subgroups in the multiplicative group of a finite field affects the linear solvability of a multicast network.
- As storage systems grow in size, device failures happen more frequently than ever before. Given the commodity nature of hard drives employed, a storage system needs to tolerate a certain number of disk failures while maintaining data integrity, and to recover lost data with minimal interference to normal disk I/O operations. RAID-6, which can tolerate up to two disk failures with the minimum redundancy, is becoming widespread. However, traditional RAID-6 codes suffer from high disk I/O overhead during recovery. In this paper, we propose a new family of RAID-6 codes, the Minimum Disk I/O Repairable (MDR) codes, which achieve the optimal disk I/O overhead for single failure recoveries. Moreover, we show that MDR codes can be encoded with the minimum number of bit-wise XOR operations. Simulation results show that MDR codes help to save about half of disk read operations than traditional RAID-6 codes, and thus can reduce the recovery time by up to 40%.
- Network Coding encourages information coding across a communication network. While the necessity, benefit and complexity of network coding are sensitive to the underlying graph structure of a network, existing theory on network coding often treats the network topology as a black box, focusing on algebraic or information theoretic aspects of the problem. This work aims at an in-depth examination of the relation between algebraic coding and network topologies. We mathematically establish a series of results along the direction of: if network coding is necessary/beneficial, or if a particular finite field is required for coding, then the network must have a corresponding hidden structure embedded in its underlying topology, and such embedding is computationally efficient to verify. Specifically, we first formulate a meta-conjecture, the NC-Minor Conjecture, that articulates such a connection between graph theory and network coding, in the language of graph minors. We next prove that the NC-Minor Conjecture is almost equivalent to the Hadwiger Conjecture, which connects graph minors with graph coloring. Such equivalence implies the existence of $K_4$, $K_5$, $K_6$, and $K_{O(q/\log{q})}$ minors, for networks requiring $\mathbb{F}_3$, $\mathbb{F}_4$, $\mathbb{F}_5$ and $\mathbb{F}_q$, respectively. We finally prove that network coding can make a difference from routing only if the network contains a $K_4$ minor, and this minor containment result is tight. Practical implications of the above results are discussed.
- In this work we present a flexible, probabilistic and reference-free method of error correction for high throughput DNA sequencing data. The key is to exploit the high coverage of sequencing data and model short sequence outputs as independent realizations of a Hidden Markov Model (HMM). We pose the problem of error correction of reads as one of maximum likelihood sequence detection over this HMM. While time and memory considerations rule out an implementation of the optimal Baum-Welch algorithm (for parameter estimation) and the optimal Viterbi algorithm (for error correction), we propose low-complexity approximate versions of both. Specifically, we propose an approximate Viterbi and a sequential decoding based algorithm for the error correction. Our results show that when compared with Reptile, a state-of-the-art error correction method, our methods consistently achieve superior performances on both simulated and real data sets.
- Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the ingle-link clustering algorithm, where distance weights and threshold of the clustering algorithm are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with an character classifier; text candidates with high probabilities are then eliminated and finally texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition dataset; the f measure is over 76% and is significantly better than the state-of-the-art performance of 71%. Experimental results on a publicly available multilingual dataset also show that our proposed method can outperform the other competitive method with the f measure increase of over 9 percent. Finally, we have setup an online demo of our proposed scene text detection system at http://kems.ustb.edu.cn/learning/yin/dtext.