results for au:Lin_D in:cs

- Apr 18 2018 cs.CV arXiv:1804.06253v1Recently, weighted patch representation has been widely studied for alleviating the impact of background information included in bounding box to improve visual tracking results. However, existing weighted patch representation models generally exploit spatial structure information among patches in each frame separately which ignore (1) unary featureof each patch and (2) temporal correlation among patches in different frames. To address this problem, we propose a novel unified temporal coherence and graph optimized ranking model for weighted patch representation in visual tracking problem. There are three main contributions of this paper. First, we propose to employ a flexible graph ranking for patch weight computation which exploits both structure information among patches and unary feature of each patch simultaneously. Second, we propose a new more discriminative ranking model by further considering the temporal correlation among patches in different frames. Third, a neighborhood preserved, low-rank graph is learned and incorporated to build a unified optimized ranking model. Experiments on two benchmark datasets show the benefits of our model.
- Apr 17 2018 cs.CV arXiv:1804.05472v1High-performance object detection relies on expensive convolutional networks to compute features, often leading to significant challenges in applications, e.g. those that require detecting objects from video streams in real time. The key to this problem is to trade accuracy for efficiency in an effective way, i.e. reducing the computing cost while maintaining competitive performance. To seek a good balance, previous efforts usually focus on optimizing the model architectures. This paper explores an alternative approach, that is, to reallocate the computation over a scale-time space. The basic idea is to perform expensive detection sparsely and propagate the results across both scales and time with substantially cheaper networks, by exploiting the strong correlations among them. Specifically, we present a unified framework that integrates detection, temporal propagation, and across-scale refinement on a Scale-Time Lattice. On this framework, one can explore various strategies to balance performance and cost. Taking advantage of this flexibility, we further develop an adaptive scheme with the detector invoked on demand and thus obtain improved tradeoff. On ImageNet VID dataset, the proposed method can achieve a competitive mAP 79.6% at 20 fps, or 79.0% at 62 fps as a performance/speed tradeoff.
- Apr 03 2018 cs.CV arXiv:1804.00389v1Recent years have seen remarkable progress in semantic segmentation. Yet, it remains a challenging task to apply segmentation techniques to video-based applications. Specifically, the high throughput of video streams, the sheer cost of running fully convolutional networks, together with the low-latency requirements in many real-world applications, e.g. autonomous driving, present a significant challenge to the design of the video segmentation framework. To tackle this combined challenge, we develop a framework for video semantic segmentation, which incorporates two novel components: (1) a feature propagation module that adaptively fuses features over time via spatially variant convolution, thus reducing the cost of per-frame computation; and (2) an adaptive scheduler that dynamically allocate computation based on accuracy prediction. Both components work together to ensure low latency while maintaining high segmentation quality. On both Cityscapes and CamVid, the proposed framework obtained competitive performance compared to the state of the art, while substantially reducing the latency, from 360 ms to 119 ms.
- Solving nonlinear algebraic equations is a classic mathematics problem, and common in scientific researches and engineering applications. There are many numeric, symbolic and numeric-symbolic methods of solving (real) solutions. Unlucky, these methods are constrained by some factors, e.g., high complexity, slow serial calculation, and the notorious intermediate expression expansion. Especially when the count of variables is larger than six, the efficiency is decreasing drastically. In this paper, according to the property of physical world, we pay attention to nonlinear algebraic equations whose variables are in fixed constraints, and get meaningful real solutions. Combining with parallelism of GPGPU, we present an efficient algorithm, by searching the solution space globally and solving the nonlinear algebraic equations with real interval solutions. Furthermore, we realize the Hansen-Sengupta method on GPGPU. The experiments show that our method can solve many nonlinear algebraic equations, and the results are accurate and more efficient compared to traditional serial methods.
- Jan 24 2018 cs.CV arXiv:1801.07455v2Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.
- Jan 08 2018 cs.CV arXiv:1801.01687v1Massive classification, a classification task defined over a vast number of classes (hundreds of thousands or even millions), has become an essential part of many real-world systems, such as face recognition. Existing methods, including the deep networks that achieved remarkable success in recent years, were mostly devised for problems with a moderate number of classes. They would meet with substantial difficulties, e.g. excessive memory demand and computational cost, when applied to massive problems. We present a new method to tackle this problem. This method can efficiently and accurately identify a small number of "active classes" for each mini-batch, based on a set of dynamic class hierarchies constructed on the fly. We also develop an adaptive allocation scheme thereon, which leads to a better tradeoff between performance and cost. On several large-scale benchmarks, our method significantly reduces the training cost and memory demand, while maintaining competitive performance.
- In this paper, we construct two classes of permutation polynomials over $\mathbb{F}_{q^2}$ with odd characteristic from rational Rédei functions. A complete characterization of their compositional inverses is also given. These permutation polynomials can be generated recursively. As a consequence, we can generate recursively permutation polynomials with arbitrary number of terms. More importantly, the conditions of these polynomials being permutations are very easy to characterize. For wide applications in practice, several classes of permutation binomials and trinomials are given. With the help of a computer, we find that the number of permutation polynomials of these types is very large.
- The quest for performant networks has been a significant force that drives the advancements of deep learning in recent years. While rewarding, improving network design has never been an easy journey. The large design space combined with the tremendous cost required for network training poses a major obstacle to this endeavor. In this work, we propose a new approach to this problem, namely, predicting the performance of a network before training, based on its architecture. Specifically, we develop a unified way to encode individual layers into vectors and bring them together to form an integrated description via LSTM. Taking advantage of the recurrent network's strong expressive power, this method can reliably predict the performances of various network architectures. Our empirical studies showed that it not only achieved accurate predictions but also produced consistent rankings across datasets -- a key desideratum in performance prediction.
- Nov 20 2017 cs.LO arXiv:1711.06541v1We present Symbolic Quick Error Detection (Symbolic QED), a structured approach for logic bug detection and localization which can be used both during pre-silicon design verification as well as post-silicon validation and debug. This new methodology leverages prior work on Quick Error Detection (QED) which has been demonstrated to drastically reduce the latency, in terms of the number of clock cycles, of error detection following the activation of a logic (or electrical) bug. QED works through software transformations, including redundant execution and control flow checking, of the applied tests. Symbolic QED combines these error-detecting QED transformations with bounded model checking-based formal analysis to generate minimal-length bug activation traces that detect and localize any logic bugs in the design. We demonstrate the practicality and effectiveness of Symbolic QED using the OpenSPARC T2, a 500-million-transistor open-source multicore System-on-Chip (SoC) design, and using "difficult" logic bug scenarios observed in various state-of-the-art commercial multicore SoCs. Our results show that Symbolic QED: (i) is fully automatic, unlike manual techniques in use today that can be extremely time-consuming and expensive; (ii) requires only a few hours in contrast to manual approaches that might take days (or even months) or formal techniques that often take days or fail completely for large designs; and (iii) generates counter-examples (for activating and detecting logic bugs) that are up to 6 orders of magnitude shorter than those produced by traditional techniques. Significantly, this new approach does not require any additional hardware.
- Sparsity inducing regularization is an important part for learning over-complete visual representations. Despite the popularity of $\ell_1$ regularization, in this paper, we investigate the usage of non-convex regularizations in this problem. Our contribution consists of three parts. First, we propose the leaky capped norm regularization (LCNR), which allows model weights below a certain threshold to be regularized more strongly as opposed to those above, therefore imposes strong sparsity and only introduces controllable estimation bias. We propose a majorization-minimization algorithm to optimize the joint objective function. Second, our study over monocular 3D shape recovery and neural networks with LCNR outperforms $\ell_1$ and other non-convex regularizations, achieving state-of-the-art performance and faster convergence. Third, we prove a theoretical global convergence speed on the 3D recovery problem. To the best of our knowledge, this is the first convergence analysis of the 3D recovery problem.
- Nov 07 2017 cs.CR arXiv:1711.01858v2This paper analyzes the security of an image encryption algorithm proposed by Ye and Huang [\textitIEEE MultiMedia, vol. 23, pp. 64-71, 2016]. The Ye-Huang algorithm uses electrocardiography (ECG) signals to generate the initial key for a chaotic system and applies an autoblocking method to divide a plain image into blocks of certain sizes suitable for subsequent encryption. The designers claim that the proposed algorithm is "strong and flexible enough for practical applications". In this paper, we perform a thorough analysis of their algorithm from the view point of modern cryptography. We find it is vulnerable to the known plaintext attack: based on one pair of a known plain-image and its corresponding cipher-image, an adversary is able to derive a mask image, which can be used as an equivalent secret key to successfully decrypt other cipher-images encrypted under the same key with a non-negligible probability of 1/256. Using this as a typical counterexample, we summarize security defects in the design of the Ye-Huang algorithm. The lessons are generally applicable to many other image encryption schemes.
- Oct 23 2017 cs.CV arXiv:1710.07346v1We present a novel and effective approach for generating new clothing on a wearer through generative adversarial learning. Given an input image of a person and a sentence describing a different outfit, our model "redresses" the person as desired, while at the same time keeping the wearer and her/his pose unchanged. Generating new outfits with precise regions conforming to a language description while retaining wearer's body structure is a new challenging task. Existing generative adversarial networks are not ideal in ensuring global coherence of structure given both the input photograph and language description as conditions. We address this challenge by decomposing the complex generative process into two conditional stages. In the first stage, we generate a plausible semantic segmentation map that obeys the wearer's pose as a latent spatial arrangement. An effective spatial constraint is formulated to guide the generation of this semantic segmentation map. In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map. We extended the DeepFashion dataset [8] by collecting sentence descriptions for 79K images. We demonstrate the effectiveness of our approach through both quantitative and qualitative evaluations. A user study is also conducted. The codes and the data are available at http://mmlab.ie.cuhk. edu.hk/projects/FashionGAN/.
- Oct 10 2017 cs.CV arXiv:1710.02534v1Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can encourage distinctiveness, while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. We also showed in our studies that the proposed method is generic and can be used for models with various structures.
- Sep 21 2017 cs.CR arXiv:1709.06724v2At EUROCRYPT 2011, Gentry and Halevi implemented a variant of Gentry's fully homomorphic encryption scheme. The core part in their key generation is to generate an odd-determinant ideal lattice having a particular type of Hermite Normal Form. However, they did not give a rigorous proof for the correctness. We present a better key generation algorithm, improving their algorithm from two aspects. -We show how to deterministically generate ideal lattices with odd determinant, thus increasing the success probability close to 1. -We give a rigorous proof for the correctness. To be more specific, we present a simpler condition for checking whether the ideal lattice has the desired Hermite Normal Form. Furthermore, our condition can be checked more efficiently. As a result, our key generation is about 1.5 times faster. We also give experimental results supporting our claims. Our optimizations are based on the properties of ideal lattices, which might be of independent interests.
- We consider the estimation of Dirichlet Process Mixture Models (DPMMs) in distributed environments, where data are distributed across multiple computing nodes. A key advantage of Bayesian nonparametric models such as DPMMs is that they allow new components to be introduced on the fly as needed. This, however, posts an important challenge to distributed estimation -- how to handle new components efficiently and consistently. To tackle this problem, we propose a new estimation method, which allows new components to be created locally in individual computing nodes. Components corresponding to the same cluster will be identified and merged via a probabilistic consolidation scheme. In this way, we can maintain the consistency of estimation with very low communication cost. Experiments on large real-world data sets show that the proposed method can achieve high scalability in distributed and asynchronous environments without compromising the mixing performance.
- Specialized classifiers, namely those dedicated to a subset of classes, are often adopted in real-world recognition systems. However, integrating such classifiers is nontrivial. Existing methods, e.g. weighted average, usually implicitly assume that all constituents of an ensemble cover the same set of classes. Such methods can produce misleading predictions when used to combine specialized classifiers. This work explores a novel approach. Instead of combining predictions from individual classifiers directly, it first decomposes the predictions into sets of pairwise preferences, treating them as transition channels between classes, and thereon constructs a continuous-time Markov chain, and use the equilibrium distribution of this chain as the final prediction. This way allows us to form a coherent picture over all specialized predictions. On large public datasets, the proposed method obtains considerable improvement compared to mainstream ensemble methods, especially when the classifier coverage is highly unbalanced.
- Aug 01 2017 cs.CV arXiv:1707.09593v1Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach -- learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition of certain object categories, where visual presentations are aligned with subtitles. We believe that object detectors can be learned from such a rich source of information. Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links. On top of this formulation, we further derive a weakly supervised learning algorithm, where object model learning and training set mining are unified in an optimization procedure. Experimental results on a real world dataset demonstrate that this is an effective approach to learning new object detectors.
- May 09 2017 cs.CV arXiv:1705.02953v1Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of TSN framework given limited training samples. Our approach obtains the state-the-of-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
- Apr 21 2017 cs.CV arXiv:1704.06228v2Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
- Apr 12 2017 cs.CV arXiv:1704.03114v2Relationships among objects play a crucial role in image understanding. Despite the great success of deep learning techniques in recognizing individual objects, reasoning about the relationships among objects remains a challenging task. Previous methods often treat this as a classification problem, considering each type of relationship (e.g. "ride") or each distinct visual phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with significant difficulties caused by the high diversity of visual appearance for each kind of relationships or the large number of distinct visual phrases. We propose an integrated framework to tackle this problem. At the heart of this framework is the Deep Relational Network, a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships. On two large datasets, the proposed method achieves substantial improvement over state-of-the-art.
- The millimeter-wave (mmWave) communication is envisioned to provide orders of magnitude capacity improvement. However, it is challenging to realize a sufficient link margin due to high path loss and blockages. To address this difficulty, in this paper, we explore the potential gain of ultra-densification for enhancing mmWave communications from a network-level perspective. By deploying the mmWave base stations (BSs) in an extremely dense and amorphous fashion, the access distance is reduced and the choice of serving BSs is enriched for each user, which are intuitively effective for mitigating the propagation loss and blockages. Nevertheless, co-channel interference under this model will become a performance-limiting factor. To solve this problem, we propose a large-scale channel state information (CSI) based interference coordination approach. Note that the large-scale CSI is highly location-dependent, and can be obtained with a quite low cost. Thus, the scalability of the proposed coordination framework can be guaranteed. Particularly, using only the large-scale CSI of interference links, a coordinated frequency resource block allocation problem is formulated for maximizing the minimum achievable rate of the users, which is uncovered to be a NP-hard integer programming problem. To circumvent this difficulty, a greedy scheme with polynomial-time complexity is proposed by adopting the bisection method and linear integer programming tools. Simulation results demonstrate that the proposed coordination scheme based on large-scale CSI only can still offer substantial gains over the existing methods. Moreover, although the proposed scheme is only guaranteed to converge to a local optimum, it performs well in terms of both user fairness and system efficiency.
- Mar 20 2017 cs.CV arXiv:1703.06029v3Despite the substantial progress in recent years, the image captioning techniques are still far from being perfect.Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve the naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty by Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.
- Mar 10 2017 cs.CV arXiv:1703.03329v2Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of those strongly supervised approaches on these two datasets.
- Mar 09 2017 cs.CV arXiv:1703.02716v1Detecting activities in untrimmed videos is an important but challenging task. The performance of existing methods remains unsatisfactory, e.g., they often meet difficulties in locating the beginning and end of a long complex action. In this paper, we propose a generic framework that can accurately detect a wide variety of activities from untrimmed videos. Our first contribution is a novel proposal scheme that can efficiently generate candidates with accurate temporal boundaries. The other contribution is a cascaded classification pipeline that explicitly distinguishes between relevance and completeness of a candidate instance. On two challenging temporal activity detection datasets, THUMOS14 and ActivityNet, the proposed framework significantly outperforms the existing state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling activities with various temporal structures.
- For binary $[n,k,d]$ linear locally repairable codes (LRCs), two new upper bounds on $k$ are derived. The first one applies to LRCs with disjoint local repair groups, for general values of $n,d$ and locality $r$, containing some previously known bounds as special cases. The second one is based on solving an optimization problem and applies to LRCs with arbitrary structure of local repair groups. Particularly, an explicit bound is derived from the second bound when $d\geq 5$. A specific comparison shows this explicit bound outperforms the Cadambe-Mazumdar bound for $5\leq d\leq 8$ and large values of $n$. Moreover, a construction of binary linear LRCs with $d\geq6$ attaining our second bound is provided.
- Dec 30 2016 cs.CV arXiv:1612.08879v3With the development of deep learning, supervised learning has frequently been adopted to classify remotely sensed images using convolutional networks (CNNs). However, due to the limited amount of labeled data available, supervised learning is often difficult to carry out. Therefore, we proposed an unsupervised model called multiple-layer feature-matching generative adversarial networks (MARTA GANs) to learn a representation using only unlabeled data. MARTA GANs consists of both a generative model $G$ and a discriminative model $D$. We treat $D$ as a feature extractor. To fit the complex properties of remote sensing data, we use a fusion layer to merge the mid-level and global features. $G$ can produce numerous images that are similar to the training data; therefore, $D$ can learn better representations of remotely sensed images using the training data provided by $G$. The classification results on two widely used remote sensing image databases show that the proposed method significantly improves the classification performance compared with other state-of-the-art methods.
- Collaborative Filtering (CF) is widely used in large-scale recommendation engines because of its efficiency, accuracy and scalability. However, in practice, the fact that recommendation engines based on CF require interactions between users and items before making recommendations, make it inappropriate for new items which haven't been exposed to the end users to interact with. This is known as the cold-start problem. In this paper we introduce a novel approach which employs deep learning to tackle this problem in any CF based recommendation engine. One of the most important features of the proposed technique is the fact that it can be applied on top of any existing CF based recommendation engine without changing the CF core. We successfully applied this technique to overcome the item cold-start problem in Careerbuilder's CF based recommendation engine. Our experiments show that the proposed technique is very efficient to resolve the cold-start problem while maintaining high accuracy of the CF recommendations.
- Nov 18 2016 cs.CV arXiv:1611.05725v2A number of studies have shown that increasing the depth or width of convolutional networks is a rewarding approach to improve the performance of image recognition. In our study, however, we observed difficulties along both directions. On one hand, the pursuit for very deep networks is met with a diminishing return and increased training difficulty; on the other hand, widening a network would result in a quadratic growth in both computational cost and memory demand. These difficulties motivate us to explore structural diversity in designing deep networks, a new dimension beyond just depth and width. Specifically, we present a new family of modules, namely the PolyInception, which can be flexibly inserted in isolation or in a composition as replacements of different parts of a network. Choosing PolyInception modules with the guidance of architectural efficiency can improve the expressive power while preserving comparable computational cost. The Very Deep PolyNet, designed following this direction, demonstrates substantial improvements over the state-of-the-art on the ILSVRC 2012 benchmark. Compared to Inception-ResNet-v2, it reduces the top-5 validation error on single crops from 4.9% to 4.25%, and that on multi-crops from 3.7% to 3.45%.
- Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by the lack of expressive power. This issue is primarily due to the fact that conventional MRFs formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRF in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over state-of-the-arts.
- Aug 03 2016 cs.CV arXiv:1608.00859v1Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-the-of-art performance on the datasets of HMDB51 ( $ 69.4\% $) and UCF101 ($ 94.2\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
- Aug 03 2016 cs.CV arXiv:1608.00797v1This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.
- It is known that training deep neural networks, in particular, deep convolutional networks, with aggressively reduced numerical precision is challenging. The stochastic gradient descent algorithm becomes unstable in the presence of noisy gradient updates resulting from arithmetic with limited numeric precision. One of the well-accepted solutions facilitating the training of low precision fixed point networks is stochastic rounding. However, to the best of our knowledge, the source of the instability in training neural networks with noisy gradient updates has not been well investigated. This work is an attempt to draw a theoretical connection between low numerical precision and training algorithm stability. In doing so, we will also propose and verify through experiments methods that are able to improve the training performance of deep convolutional networks in fixed point.
- Jul 07 2016 cs.CR arXiv:1607.01642v2Position scrambling (permutation) is widely used in multimedia encryption schemes and some international encryption standards, such as the Data Encryption Standard and the Advanced Encryption Standard. In this article, the authors re-evaluate the security of a typical image-scrambling encryption algorithm (ISEA). Using the internal correlation remaining in the cipher image, they disclose important visual information of the corresponding plain image in a ciphertext-only attack scenario. Furthermore, they found that the real scrambling domain--the position-scrambling scope of ISEA's scrambled elements--can be used to support an efficient known or chosen-plaintext attack on it. Detailed experimental results have verified these points and demonstrate that some advanced multimedia processing techniques can facilitate the cryptanalysis of multimedia encryption algorithms.
- Jun 21 2016 cs.CL arXiv:1606.06274v1Semantic roles play an important role in extracting knowledge from text. Current unsupervised approaches utilize features from grammar structures, to induce semantic roles. The dependence on these grammars, however, makes it difficult to adapt to noisy and new languages. In this paper we develop a data-driven approach to identifying semantic roles, the approach is entirely unsupervised up to the point where rules need to be learned to identify the position the semantic role occurs. Specifically we develop a modified-ADIOS algorithm based on ADIOS Solan et al. (2005) to learn grammar structures, and use these grammar structures to learn the rules for identifying the semantic roles based on the context in which the grammar structures appeared. The results obtained are comparable with the current state-of-art models that are inherently dependent on human annotated data.
- Apr 19 2016 cs.CV arXiv:1604.05144v1Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure. We note that for the topic of interactive image segmentation, scribbles are very widely used in academic research and commercial software, and are recognized as one of the most user-friendly ways of interacting. In this paper, we propose to use scribbles to annotate images, and develop an algorithm to train convolutional networks for semantic segmentation supervised by scribbles. Our algorithm is based on a graphical model that jointly propagates information from scribbles to unmarked pixels and learns network parameters. We present competitive object semantic segmentation results on the PASCAL VOC dataset by using scribbles as annotations. Scribbles are also favored for annotating stuff (e.g., water, sky, grass) that has no well-defined shape, and our method shows excellent results on the PASCAL-CONTEXT dataset thanks to extra inexpensive scribble annotations. Our scribble annotations on PASCAL VOC are available at http://research.microsoft.com/en-us/um/people/jifdai/downloads/scribble_sup
- Recently, linear codes with few weights have been constructed and extensively studied. In this paper, for an odd prime p, we determined the complete weight enumerator of two classes of p-ary linear codes constructed from defining set. Results show that the codes are at almost seven-weight linear codes and they may have applications in secret sharing schemes.
- Linear codes have been an interesting subject of study for many years. Recently, linear codes with few weights have been constructed and extensively studied. In this paper, for an odd prime p, a class of three-weight linear codes over Fp are constructed. The weight distributions of the linear codes are settled. These codes have applications in authentication codes, association schemes and data storage systems.
- Nov 23 2015 cs.LG arXiv:1511.06393v3In recent years increasingly complex architectures for deep convolution networks (DCNs) have been proposed to boost the performance on image recognition tasks. However, the gains in performance have come at a cost of substantial increase in computation and model storage resources. Fixed point implementation of DCNs has the potential to alleviate some of these complexities and facilitate potential deployment on embedded hardware. In this paper, we propose a quantizer design for fixed point implementation of DCNs. We formulate and solve an optimization problem to identify optimal fixed point bit-width allocation across DCN layers. Our experiments show that in comparison to equal bit-width settings, the fixed point DCNs with optimized bit width allocation offer >20% reduction in the model size without any loss in accuracy on CIFAR-10 benchmark. We also demonstrate that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model. In doing so, we report a new state-of-the-art fixed point performance of 6.78% error-rate on CIFAR-10 benchmark.
- Binary representation is desirable for its memory efficiency, computation speed and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to diverge from real values to binary by approximating step functions. Our final representation is completely binary. We test our approach on MNIST, CIFAR10, and ILSVRC2012 dataset, and systematically study the training dynamics of the binarization process. Our approach can binarize the last layer representation without loss of performance and binarize all the layers with reasonably small degradations. The memory space that it saves may allow more sophisticated models to be deployed, thus compensating the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using complete binary middle representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.
- Recently, linear codes with few weights have been widely studied, since they have applications in data storage systems, communication systems and consumer electronics. In this paper, we present a class of three-weight and five-weight linear codes over Fp, where p is an odd prime and Fp denotes a finite field with p elements. The weight distributions of the linear codes constructed in this paper are also settled. Moreover, the linear codes illustrated in the paper may have applications in secret sharing schemes.
- This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches focusing primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this, by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROGUE scores compared to those produced by the baseline.
- The generalized Hamming weight (GHW) $d_r(C)$ of linear codes $C$ is a natural generalization of the minimum Hamming distance $d(C)(=d_1(C))$ and has become one of important research objects in coding theory since Wei's originary work [23] in 1991. In this paper two general formulas on $d_r(C)$ for irreducible cyclic codes are presented by using Gauss sums and the weight hierarchy $\{d_1(C), d_2(C), \ldots, d_k(C)\}$ $(k=\dim C)$ are completely determined for several cases.
- Oct 02 2014 cs.SC arXiv:1410.0105v1The GVW algorithm, presented by Gao et al., is a signature-based algorithm for computing Gröbner bases. In this paper, a variant of GVW is presented. This new algorithm is called a monomial-oriented GVW algorithm or mo-GVW algorithm for short. The mo-GVW algorithm presents a new frame of GVW and regards \em labeled monomials instead of \em labeled polynomials as basic elements of the algorithm. Being different from the original GVW algorithm, for each labeled monomial, the mo-GVW makes efforts to find the smallest signature that can generate this monomial. The mo-GVW algorithm also avoids generating J-pairs, and uses efficient methods of searching reducers and checking criteria. Thus, the mo-GVW algorithm has a better performance during practical implementations.
- May 20 2014 cs.SC arXiv:1405.4596v3An improved characteristic set algorithm for solving Boolean polynomial systems is pro- posed. This algorithm is based on the idea of converting all the polynomials into monic ones by zero decomposition, and using additions to obtain pseudo-remainders. Three important techniques are applied in the algorithm. The first one is eliminating variables by new gener- ated linear polynomials. The second one is optimizing the strategy of choosing polynomial for zero decomposition. The third one is to compute add-remainders to eliminate the leading variable of new generated monic polynomials. By analyzing the depth of the zero decompo- sition tree, we present some complexity bounds of this algorithm, which are lower than the complexity bounds of previous characteristic set algorithms. Extensive experimental results show that this new algorithm is more efficient than previous characteristic set algorithms for solving Boolean polynomial systems.
- Apr 08 2014 cs.SC arXiv:1404.1428v2The GVW algorithm is a signature-based algorithm for computing Gröbner bases. If the input system is not homogeneous, some J-pairs with higher signatures but lower degrees are rejected by GVW's Syzygy Criterion, instead, GVW have to compute some J-pairs with lower signatures but higher degrees. Consequently, degrees of polynomials appearing during the computations may unnecessarily grow up higher and the computation become more expensive. In this paper, a variant of the GVW algorithm, called M-GVW, is proposed and mutant pairs are introduced to overcome inconveniences brought by inhomogeneous input polynomials. Some techniques from linear algebra are used to improve the efficiency. Both GVW and M-GVW have been implemented in C++ and tested by many examples from boolean polynomial rings. The timings show M-GVW usually performs much better than the original GVW algorithm when mutant pairs are found. Besides, M-GVW is also compared with intrinsic Gröbner bases functions on Maple, Singular and Magma. Due to the efficient routines from the M4RI library, the experimental results show that M-GVW is very efficient.
- Let $p$ be an odd prime with $2$-adic expansion $\sum_{i=0}^kp_i\cdot2^i$. For a sequence $\underline{a}=(a(t))_{t\ge 0}$ over $\mathbb{F}_{p}$, each $a(t)$ belongs to $\{0,1,\ldots, p-1\}$ and has a unique $2$-adic expansion $$a(t)=a_0(t)+a_1(t)⋅2+⋯+a_k(t)\cdot2^k,$$ with $a_i(t)\in\{0, 1\}$. Let $\underline{a_i}$ denote the binary sequence $(a_i(t))_{t\ge 0}$ for $0\le i\le k$. Assume $i_0$ is the smallest index $i$ such that $p_{i}=0$ and $\underline{a}$ and $\underline{b}$ are two different m-sequences generated by a same primitive characteristic polynomial over $\mathbb{F}_p$. We prove that for $i\neq i_0$ and $0\le i\le k$, $\underline{a_i}=\underline{b_i}$ if and only if $\underline{a}=\underline{b}$, and for $i=i_0$, $\underline{a_{i_0}}=\underline{b_{i_0}}$ if and only if $\underline{a}=\underline{b}$ or $\underline{a}=-\underline{b}$. Then the period of $\underline{a_i}$ is equal to the period of $\underline{a}$ if $i\ne i_0$ and half of the period of $\underline{a}$ if $i=i_0$. We also discuss a possible application of the binary sequences $\underline{a_i}$.
- Jan 28 2014 cs.CR arXiv:1401.6604v1We propose a general approach to construct cryptographic significant Boolean functions of $(r+1)m$ variables based on the additive decomposition $\mathbb{F}_{2^{rm}}\times\mathbb{F}_{2^m}$ of the finite field $\mathbb{F}_{2^{(r+1)m}}$, where $r$ is odd and $m\geq3$. A class of unbalanced functions are constructed first via this approach, which coincides with a variant of the unbalanced class of generalized Tu-Deng functions in the case $r=1$. This class of functions have high algebraic degree, but their algebraic immunity does not exceeds $m$, which is impossible to be optimal when $r>1$. By modifying these unbalanced functions, we obtain a class of balanced functions which have optimal algebraic degree and high nonlinearity (shown by a lower bound we prove). These functions have optimal algebraic immunity provided a combinatorial conjecture on binary strings which generalizes the Tu-Deng conjecture is true. Computer investigations show that, at least for small values of number of variables, functions from this class also behave well against fast algebraic attacks.
- Let $\underline{a}$ and $\underline{b}$ be primitive sequences over $\mathbb{Z}/(p^e)$ with odd prime $p$ and $e\ge 2$. For certain compressing maps, we consider the distribution properties of compressing sequences of $\underline{a}$ and $\underline{b}$, and prove that $\underline{a}=\underline{b}$ if the compressing sequences are equal at the times $t$ such that $\alpha(t)=k$, where $\underline{\alpha}$ is a sequence related to $\underline{a}$. We also discuss the $s$-uniform distribution property of compressing sequences. For some compressing maps, we have that there exist different primitive sequences such that the compressing sequences are $s$-uniform. We also discuss that compressing sequences can be $s$-uniform for how many elements $s$.
- An \emphs-graph is a graph with two kinds of edges: \emphsubdivisible edges and \emphreal edges. A \emphrealisation of an s-graph $B$ is any graph obtained by subdividing subdivisible edges of $B$ into paths of arbitrary length (at least one). Given an s-graph $B$, we study the decision problem $\Pi_B$ whose instance is a graph $G$ and question is "Does $G$ contain a realisation of $B$ as an induced subgraph?". For several $B$'s, the complexity of $\Pi_B$ is known and here we give the complexity for several more. Our NP-completeness proofs for $\Pi_B$'s rely on the NP-completeness proof of the following problem. Let $\cal S$ be a set of graphs and $d$ be an integer. Let $\Gamma_{\cal S}^d$ be the problem whose instance is $(G, x, y)$ where $G$ is a graph whose maximum degree is at most d, with no induced subgraph in $\cal S$ and $x, y \in V(G)$ are two non-adjacent vertices of degree 2. The question is "Does $G$ contain an induced cycle passing through $x, y$?". Among several results, we prove that $\Gamma^3_{\emptyset}$ is NP-complete. We give a simple criterion on a connected graph $H$ to decide whether $\Gamma^{+\infty}_{\{H\}}$ is polynomial or NP-complete. The polynomial cases rely on the algorithm three-in-a-tree, due to Chudnovsky and Seymour.
- In this paper, a new construction of quaternary bent functions from quaternary quadratic forms over Galois rings of characteristic 4 is proposed. Based on this construction, several new classes of quaternary bent functions are obtained, and as a consequence, several new classes of quadratic binary bent and semi-bent functions in polynomial forms are derived. This work generalizes the recent work of N. Li, X. Tang and T. Helleseth.
- Mar 28 2013 cs.AI arXiv:1304.1086v1We propose an abductive diagnosis theory that integrates probabilistic, causal and taxonomic knowledge. Probabilistic knowledge allows us to select the most likely explanation; causal knowledge allows us to make reasonable independence assumptions; taxonomic knowledge allows causation to be modeled at different levels of detail, and allows observations be described in different levels of precision. Unlike most other approaches where a causal explanation is a hypothesis that one or more causative events occurred, we define an explanation of a set of observations to be an occurrence of a chain of causation events. These causation events constitute a scenario where all the observations are true. We show that the probabilities of the scenarios can be computed from the conditional probabilities of the causation events. Abductive reasoning is inherently complex even if only modest expressive power is allowed. However, our abduction algorithm is exponential only in the number of observations to be explained, and is polynomial in the size of the knowledge base. This contrasts with many other abduction procedures that are exponential in the size of the knowledge base.
- In distributed storage systems reliability is achieved through redundancy stored at different nodes in the network. Then a data collector can reconstruct source information even though some nodes fail. To maintain reliability, an autonomous and efficient protocol should be used to repair the failed node. The repair process causes traffic and consequently transmission cost in the network. Recent results found the optimal trafficstorage tradeoff, and proposed regenerating codes to achieve the optimality. We aim at minimizing the transmission cost in the repair process. We consider the network topology in the repair, and accordingly modify information flow graphs. Then we analyze the cut requirement and based on the results, we formulate the minimum-cost as a linear programming problem for linear costs. We show that the solution of the linear problem establishes a fundamental lower bound of the repair-cost. We also show that this bound is achievable for minimum storage regenerating, which uses the optimal-cost minimum-storage regenerating (OCMSR) code. We propose surviving node cooperation which can efficiently reduce the repair cost. Further, the field size for the construction of OCMSR codes is discussed. We show the gain of optimal-cost repair in tandem, star, grid and fully connected networks.
- Mar 14 2013 cs.AI arXiv:1303.5415v1Bayesian networks are directed acyclic graphs representing independence relationships among a set of random variables. A random variable can be regarded as a set of exhaustive and mutually exclusive propositions. We argue that there are several drawbacks resulting from the propositional nature and acyclic structure of Bayesian networks. To remedy these shortcomings, we propose a probabilistic network where nodes represent unary predicates and which may contain directed cycles. The proposed representation allows us to represent domain knowledge in a single static network even though we cannot determine the instantiations of the predicates before hand. The ability to deal with cycles also enables us to handle cyclic causal tendencies and to recognize recursive plans.
- Nov 22 2011 cs.CR arXiv:1111.4635v1We find linear (as well as quadratic) relations in a very large class of T-functions. The relations may be used in analysis of T-function-based stream ciphers.
- Oct 03 2011 cs.DB arXiv:1109.6883v1With the growing use of location-based services, location privacy attracts increasing attention from users, industry, and the research community. While considerable effort has been devoted to inventing techniques that prevent service providers from knowing a user's exact location, relatively little attention has been paid to enabling so-called peer-wise privacy--the protection of a user's location from unauthorized peer users. This paper identifies an important efficiency problem in existing peer-privacy approaches that simply apply a filtering step to identify users that are located in a query range, but that do not want to disclose their location to the querying peer. To solve this problem, we propose a novel, privacy-policy enabled index called the PEB-tree that seamlessly integrates location proximity and policy compatibility. We propose efficient algorithms that use the PEB-tree for processing privacy-aware range and kNN queries. Extensive experiments suggest that the PEB-tree enables efficient query processing.
- Ensemble methods, such as stacking, are designed to boost predictive accuracy by blending the predictions of multiple machine learning models. Recent work has shown that the use of meta-features, additional inputs describing each example in a dataset, can boost the performance of ensemble methods, but the greatest reported gains have come from nonlinear procedures requiring significant tuning and training time. Here, we present a linear technique, Feature-Weighted Linear Stacking (FWLS), that incorporates meta-features for improved accuracy while retaining the well-known virtues of linear regression regarding speed, stability, and interpretability. FWLS combines model predictions linearly using coefficients that are themselves linear functions of meta-features. This technique was a key facet of the solution of the second place team in the recently concluded Netflix Prize competition. Significant increases in accuracy over standard linear stacking are demonstrated on the Netflix Prize collaborative filtering dataset.
- Algebraic and fast algebraic attacks are power tools to analyze stream ciphers. A class of symmetric Boolean functions with maximum algebraic immunity were found vulnerable to fast algebraic attacks at EUROCRYPT'06. Recently, the notion of AAR (algebraic attack resistant) functions was introduced as a unified measure of protection against both classical algebraic and fast algebraic attacks. In this correspondence, we first give a decomposition of symmetric Boolean functions, then we show that almost all symmetric Boolean functions, including these functions with good algebraic immunity, behave badly against fast algebraic attacks, and we also prove that no symmetric Boolean functions are AAR functions. Besides, we improve the relations between algebraic degree and algebraic immunity of symmetric Boolean functions.
- We propose a unified framework for deriving and studying soft-in-soft-out (SISO) detection in interference channels using the concept of variational inference. The proposed framework may be used in multiple-access interference (MAI), inter-symbol interference (ISI), and multiple-input multiple-outpu (MIMO) channels. Without loss of generality, we will focus our attention on turbo multiuser detection, to facilitate a more concrete discussion. It is shown that, with some loss of optimality, variational inference avoids the exponential complexity of a posteriori probability (APP) detection by optimizing a closely-related, but much more manageable, objective function called variational free energy. In addition to its systematic appeal, there are several other advantages to this viewpoint. First of all, it provides unified and rigorous justifications for numerous detectors that were proposed on radically different grounds, and facilitates convenient joint detection and decoding (utilizing the turbo principle) when error-control codes are incorporated. Secondly, efficient joint parameter estimation and data detection is possible via the variational expectation maximization (EM) algorithm, such that the detrimental effect of inaccurate channel knowledge at the receiver may be dealt with systematically. We are also able to extend BPSK-based SISO detection schemes to arbitrary square QAM constellations in a rigorous manner using a variational argument.
- Sep 13 2006 cs.CR arXiv:cs/0609057v1We consider a type of zero-knowledge protocols that are of interest for their practical applications within networks like the Internet: efficient zero-knowledge arguments of knowledge that remain secure against concurrent man-in-the-middle attacks. In an effort to reduce the setup assumptions required for efficient zero-knowledge arguments of knowledge that remain secure against concurrent man-in-the-middle attacks, we consider a model, which we call the Authenticated Public-Key (APK) model. The APK model seems to significantly reduce the setup assumptions made by the CRS model (as no trusted party or honest execution of a centralized algorithm are required), and can be seen as a slightly stronger variation of the Bare Public-Key (BPK) model from \citeCGGM,MR, and a weaker variation of the registered public-key model used in \citeBCNP. We then define and study man-in-the-middle attacks in the APK model. Our main result is a constant-round concurrent non-malleable zero-knowledge argument of knowledge for any polynomial-time relation (associated to a language in $\mathcal{NP}$), under the (minimal) assumption of the existence of a one-way function family. Furthermore,We show time-efficient instantiations of our protocol based on known number-theoretic assumptions. We also note a negative result with respect to further reducing the setup assumptions of our protocol to those in the (unauthenticated) BPK model, by showing that concurrently non-malleable zero-knowledge arguments of knowledge in the BPK model are only possible for trivial languages.
- Jul 11 2006 cs.CR arXiv:cs/0607035v3In this paper we resolve an open problem regarding resettable zero knowledge in the bare public-key (BPK for short) model: Does there exist constant round resettable zero knowledge argument with concurrent soundness for $\mathcal{NP}$ in BPK model without assuming \emphsub-exponential hardness? We give a positive answer to this question by presenting such a protocol for any language in $\mathcal{NP}$ in the bare public-key model assuming only collision-resistant hash functions against \emphpolynomial-time adversaries.
- We present an efficient, broad-coverage, principle-based parser for English. The parser has been implemented in C++ and runs on SUN Sparcstations with X-windows. It contains a lexicon with over 90,000 entries, constructed automatically by applying a set of extraction and conversion rules to entries from machine readable dictionaries.