results for au:Xu_Y in:cs

- Apr 20 2018 cs.DS arXiv:1804.06932v1 Since the introduction of retroactive data structures at SODA 2004, a major unsolved problem has been to bound the gap between the best partially retroactive data structure (where changes can be made to the past, but only the present can be queried) and the best fully retroactive data structure (where the past can also be queried) for any problem. It was proved in 2004 that any partially retroactive data structure with operation time $T(n,m)$ can be transformed into a fully retroactive data structure with operation time $O(\sqrt{m} \cdot T(n,m))$, where $n$ is the size of the data structure and $m$ is the number of operations in the timeline [Demaine 2004], but it has been open for 14 years whether such a gap is necessary. In this paper, we prove nearly matching upper and lower bounds on this gap for all $n$ and $m$. We improve the upper bound for $n \ll \sqrt m$ by showing a new transformation with multiplicative overhead $n \log m$. We then prove a lower bound of $\min\{n \log m, \sqrt m\}^{1-o(1)}$ assuming any of the following conjectures:
  - Conjecture I: Circuit SAT requires $2^{n - o(n)}$ time on $n$-input circuits of size $2^{o(n)}$. (Far weaker than the well-believed SETH conjecture, which asserts that CNF SAT with $n$ variables and $O(n)$ clauses already requires $2^{n-o(n)}$ time.)
  - Conjecture II: Online $(\min,+)$ product between an integer $n\times n$ matrix and $n$ vectors requires $n^{3 - o(1)}$ time.
  - Conjecture III (3-SUM Conjecture): Given three sets $A,B,C$ of integers, each of size $n$, deciding whether there exist $a \in A, b \in B, c \in C$ such that $a + b + c = 0$ requires $n^{2 - o(1)}$ time.

  Our lower bound construction illustrates an interesting power of fully retroactive queries: they can be used to quickly solve batched pair evaluation. We believe this technique can prove useful for other data structure lower bounds, especially dynamic ones.
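For intuition, here is a minimal Python sketch (not the paper's construction) of the trivial rollback transformation that these bounds measure against: any ordinary structure becomes fully retroactive by replaying the operation timeline, at $O(m \cdot T(n,m))$ per query; keeping a checkpointed copy every $\sqrt m$ operations is what yields the classical $O(\sqrt{m} \cdot T(n,m))$ overhead. The `FullyRetroactive` wrapper and its API are illustrative.

```python
import bisect

class FullyRetroactive:
    """Naive rollback wrapper: replay the whole timeline up to time t
    for each query. Checkpointing every sqrt(m) operations amortizes
    this to the classical O(sqrt(m) * T) transformation."""
    def __init__(self, make_empty):
        self.make_empty = make_empty   # factory returning a fresh structure
        self.times, self.ops = [], []  # parallel lists, sorted by time

    def insert_op(self, t, op):
        i = bisect.bisect_left(self.times, t)
        self.times.insert(i, t); self.ops.insert(i, op)

    def delete_op(self, t):
        i = bisect.bisect_left(self.times, t)
        if i < len(self.times) and self.times[i] == t:
            del self.times[i]; del self.ops[i]

    def query(self, t, q):
        s = self.make_empty()
        for ti, op in zip(self.times, self.ops):
            if ti > t:
                break
            op(s)                      # replay the timeline prefix
        return q(s)

# usage: a fully retroactive min-tracking set
rs = FullyRetroactive(set)
rs.insert_op(10, lambda s: s.add(5))
rs.insert_op(20, lambda s: s.add(3))
print(rs.query(15, min), rs.query(25, min))  # -> 5 3
```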
- This paper studies a new type of 3D bin packing problem (BPP), in which a number of cuboid-shaped items must be put into a bin one by one orthogonally. The objective is to find a placement of these items that minimizes the surface area of the bin. This problem is motivated by the fact that there is no fixed-sized bin in many real business scenarios and the cost of a bin is proportional to its surface area. Based on previous research on 3D BPP, the surface area is determined by the sequence, spatial locations and orientations of the items. This is a new NP-hard combinatorial optimization problem on bin packing with an unfixed-sized bin, for which we propose a multi-task framework based on Selected Learning that generates the sequence and orientations of the items packed into the bin simultaneously. During training, Selected Learning chooses between loss functions derived from Deep Reinforcement Learning and Supervised Learning according to the training procedure. Numerical results show that the proposed method outperforms Lego baselines by a substantial gain of 7.52%. Moreover, we produce a large-scale 3D bin packing order dataset for studying bin packing problems and will release it to the research community.
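As a concrete illustration of the objective, here is a small Python sketch of the surface-area computation, assuming each placement stores the item's minimum corner and already-oriented dimensions (this representation is ours, not the paper's):

```python
def bin_surface_area(placements):
    """Surface area of the tightest axis-aligned bin containing all
    placed items; each placement is ((x, y, z), (l, w, h)), the item's
    minimum corner and its (already oriented) dimensions."""
    L = max(x + l for (x, y, z), (l, w, h) in placements)
    W = max(y + w for (x, y, z), (l, w, h) in placements)
    H = max(z + h for (x, y, z), (l, w, h) in placements)
    return 2 * (L * W + L * H + W * H)

# e.g. two unit cubes side by side -> a 2x1x1 bin with surface area 10
print(bin_surface_area([((0, 0, 0), (1, 1, 1)), ((1, 0, 0), (1, 1, 1))]))
```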
- Sound event detection (SED) aims to detect what and when sound events happen in an audio clip. Segmenting sound events in the time-frequency (T-F) domain is called T-F segmentation. Many supervised SED algorithms rely on strongly labelled data, which contains labels of the onset and offset times of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence or absence of the sound events is known, without their onset and offset times. In this paper, we propose a SED and T-F segmentation framework trained with weakly labelled data. In the training stage, we propose a segmentation mapping applied on a T-F representation of an audio clip to obtain T-F segmentation masks of sound events. We then apply a classification mapping on each T-F segmentation mask to estimate the presence probability of each sound event. The segmentation mapping and classification mapping are trained jointly. In T-F segmentation, T-F segmentation masks can be obtained by presenting a T-F representation of an audio clip to the trained segmentation mapping. In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. We propose to model the segmentation mapping with a convolutional neural network and the classification mapping with global weighted rank pooling (GWRP). As a byproduct, separated waveforms of sound events can be obtained from their corresponding T-F segmentation masks. Experiments on the remixed DCASE 2013 dataset show that the proposed method obtains an area under the curve (AUC) score of 0.948 in audio tagging and 0.893 in sound event detection, outperforming deep neural network baselines of 0.719 and 0.616, respectively.
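The GWRP used for the classification mapping has a simple closed form: sort the mask values in descending order and average them under geometrically decaying weights. A minimal NumPy sketch (the hyperparameter r is illustrative):

```python
import numpy as np

def gwrp(mask, r=0.9):
    """Global weighted rank pooling of a T-F segmentation mask:
    sort values in descending order and take a weighted mean with
    geometrically decaying weights. r=1 recovers average pooling,
    r -> 0 recovers max pooling."""
    v = np.sort(mask.ravel())[::-1]          # descending order
    w = r ** np.arange(v.size)               # 1, r, r^2, ...
    return float(np.sum(w * v) / np.sum(w))

mask = np.random.rand(64, 100)               # freq x time mask in [0, 1]
print(gwrp(mask, r=1.0))                     # ~ mask.mean()
print(gwrp(mask, r=1e-9))                    # ~ mask.max()
```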
- Apr 04 2018 cs.CV arXiv:1804.00931v1 In this paper, we present a detailed design of the dynamic video segmentation network (DVSNet) for fast and efficient semantic video segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates highly accurate semantic segmentations, but is deeper and slower. The latter is much faster, but its output requires further processing to generate less accurate semantic segmentations. We explore the use of a decision network to adaptively assign different frame regions to the two networks based on a metric called the expected confidence score. Frame regions with a higher expected confidence score traverse the flow network, while frame regions with a lower expected confidence score pass through the segmentation network. We have performed extensive experiments on various configurations of DVSNet and investigated a number of variants of the proposed decision network. The experimental results show that DVSNet is able to achieve up to 70.4% mIoU at 19.8 fps on the Cityscapes dataset. A high-speed version of DVSNet is able to deliver 30.4 fps with 63.2% mIoU on the same dataset. DVSNet is also able to reduce up to 95% of the computational workload.
- Apr 03 2018 cs.CV arXiv:1804.00100v2 Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation is superior to the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state-of-the-art methods on the ActivityNet Captions dataset with a relative gain of over 100% (Meteor score increases from 4.82 to 9.65).
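A minimal NumPy sketch of one plausible form of such context gating — a learned sigmoid gate that mixes the current event representation with its context; the exact parameterization in the paper may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(h_event, h_ctx, W, b):
    """Gated fusion of an event feature with its surrounding context:
    a learned sigmoid gate decides, per dimension, how much of the
    current event versus its context enters the fused representation.
    A simplified reading of the paper's context gating mechanism."""
    g = sigmoid(W @ np.concatenate([h_event, h_ctx]) + b)
    return g * h_event + (1.0 - g) * h_ctx

d = 8
rng = np.random.default_rng(0)
h_event, h_ctx = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, 2 * d)) * 0.1, np.zeros(d)
print(context_gate(h_event, h_ctx, W, b))
```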
- Apr 02 2018 cs.CV arXiv:1803.11527v1 Deep neural networks have enjoyed remarkable success on various vision tasks; however, it remains challenging to apply CNNs to domains lacking a regular underlying structure, such as 3D point clouds. Towards this end, we propose a novel convolutional architecture, termed SpiderCNN, to efficiently extract geometric features from point clouds. SpiderCNN is comprised of units called SpiderConv, which extend convolutional operations from regular grids to irregular point sets that can be embedded in $\mathbb{R}^n$ by parametrizing a family of convolutional filters. We elaborately design each filter as a product of a simple step function that captures local geodesic information and a Taylor polynomial that ensures expressiveness. SpiderCNN inherits the multi-scale hierarchical architecture of classical CNNs, which allows it to extract deep semantic features. Experiments on ModelNet40 demonstrate that SpiderCNN achieves state-of-the-art accuracy of 92.4% on standard benchmarks, and shows competitive performance on segmentation tasks.
- Mar 29 2018 cs.SI arXiv:1803.10402v1 Multiplayer Online Battle Arena (MOBA) games have recently received increasing worldwide popularity. In such games, players compete in teams against each other by controlling selected game avatars, each of which is designed with different strengths and weaknesses. Intuitively, putting together game avatars that complement each other (synergy) and suppress those of opponents (opposition) would result in a stronger team. An in-depth understanding of synergy and opposition relationships among game avatars benefits players in making avatar drafting decisions and in better predicting match events. However, due to the intricate design of and complex interactions between game avatars, a thorough understanding of their relationships is not a trivial task. In this paper, we propose a latent variable model, namely Game Avatar Embedding (GAE), to learn numerical representations of avatars which encode synergy and opposition relationships between pairs of avatars. The merits of our model are twofold: (1) the captured synergy and opposition relationships are consistent with experienced human players' perception; (2) the learned numerical representations of game avatars enable many important downstream tasks, such as similar avatar search, match outcome prediction, and avatar pick recommendation. To the best of our knowledge, no previous model is able to support both features simultaneously. Our quantitative and qualitative evaluations on real match data from three commercial MOBA games illustrate the benefits of our model.
- Social awareness and social ties are becoming increasingly popular with emerging mobile and handheld devices. The social trust degree, which describes the strength of social ties, has drawn much research interest in many areas of wireless communications, such as resource sharing and cooperative communication. In this paper, we propose a hybrid cooperative beamforming and jamming scheme to secure communication based on the social trust degree under a stochastic geometry framework. The friendly nodes are categorized into relays and jammers according to their locations and social trust degrees with the source node. We analyze the connection outage probability (COP) and secrecy outage probability (SOP) of the network. To achieve this target, we propose a double Gamma ratio (DGR) approach through Gamma approximation, based on which the COP and SOP are tractably obtained in closed form. We further consider the SOP in the presence of Poisson Point Process (PPP) distributed eavesdroppers and derive an upper bound. The simulation results verify our theoretical findings, and validate that the social trust degree has a dramatic influence on the security performance of the network.
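The Gamma-approximation step underlying such DGR analyses is typically second-order moment matching. A small NumPy sketch (illustrative, not the paper's exact derivation):

```python
import numpy as np

def gamma_moment_match(samples):
    """Fit a Gamma(k, theta) distribution to a nonnegative random
    quantity (e.g. aggregate relay/jammer power) by matching its first
    two moments: k = mean^2 / var, theta = var / mean. This is the
    usual Gamma-approximation step behind ratio-based outage analyses."""
    m, v = np.mean(samples), np.var(samples)
    return m * m / v, v / m                  # shape k, scale theta

rng = np.random.default_rng(1)
agg = rng.exponential(2.0, size=(100000, 4)).sum(axis=1)  # sum of powers
print(gamma_moment_match(agg))   # close to (4, 2) for this true Gamma sum
```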
- Immersive media streaming, especially virtual reality (VR)/360-degree video streaming, which is very bandwidth demanding, has become increasingly popular due to the rapid growth of multimedia and networking deployments. To make better use of resources and achieve better quality of experience (QoE) perceived by users, this paper develops an application-layer scheme to jointly exploit the available bandwidth from the LTE and Wi-Fi networks in 360-degree video streaming. The proposed scheme and the corresponding solution algorithms utilize the saliency of the video, the prediction of users' views and the status information of users to obtain an optimal association of users with different Wi-Fi access points (APs) for maximizing the system's utility. Besides, a novel buffer strategy is proposed to mitigate the influence of the short-time prediction problem when transmitting 360-degree videos in time-varying networks. The promising performance and low complexity of the proposed scheme and algorithms are validated in simulations with various 360-degree videos.
- Mar 20 2018 cs.CV arXiv:1803.06655v1 A novel warp for natural image stitching is proposed that utilizes the properties of the cylindrical warp and a horizontal pixel selection strategy. The proposed ratio-preserving half-cylindrical warp is a combination of homography and cylindrical warps; it guarantees alignment through the homography and exhibits less projective distortion thanks to the cylindrical warp. Unlike previous approaches that apply the cylindrical warp before the homography, we use partition lines to divide the image into different parts and apply the homography in the overlapping region, while a composition of homography and cylindrical warps is applied in the non-overlapping region. The pixel selection strategy then samples points in the horizontal direction and reconstructs the image via interpolation to further reduce horizontal distortion by maintaining the ratio as in a similarity transformation. By applying the half-cylindrical warp and horizontal pixel selection, projective distortion in both the vertical and horizontal directions is mitigated simultaneously. Experiments show that our warp is efficient and produces more natural-looking stitched results than previous methods.
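For reference, a NumPy sketch of a plain (full) cylindrical warp with focal length f — the textbook projection that serves as the building block here, not the paper's complete ratio-preserving half-cylindrical warp:

```python
import numpy as np

def cylindrical_warp(img, f):
    """Resample an image onto a cylinder of focal length f (pixels),
    via inverse mapping with nearest-neighbor sampling. Illustrates the
    cylindrical component the paper blends with homography; the paper
    restricts the cylindrical part to the non-overlapping region."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    out = np.zeros_like(img)
    ys, xs = np.mgrid[0:h, 0:w]
    theta = (xs - cx) / f                  # angle along the cylinder
    hgt = (ys - cy) / f                    # height on the cylinder
    X = np.tan(theta)                      # back-project to the plane
    Y = hgt / np.cos(theta)
    src_x = np.round(f * X + cx).astype(int)
    src_y = np.round(f * Y + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

print(cylindrical_warp(np.random.rand(240, 320), f=300.0).shape)
```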
- Mar 14 2018 cs.CV arXiv:1803.04793v1 Sparse representation classification (SRC) achieves good results on recognition problems with sufficient training samples per subject. However, SRC does not perform well on small-sample data. In this paper, an inverse-projection group sparse representation model based on constructing a low-rank variation dictionary is presented for breast tumor classification. The proposed low-rank variation dictionary tackles the tumor recognition problem from the viewpoint of detecting and using variations in the gene expression profiles of normal subjects and patients, rather than using these samples directly. The inverse-projection group sparse representation model is constructed by making full use of existing samples and the group effect of microarray gene data. Extensive experiments on public breast tumor microarray gene expression datasets demonstrate that the proposed technique is competitive with state-of-the-art methods. The results on the Breast-1, Breast-2 and Breast-3 databases are 80.81%, 89.10% and 100%, respectively, better than the latest results in the literature.
- Encoders of the AOM/AV1 codec consider an input video sequence as a succession of frames grouped into Golden-Frame (GF) groups. The coding structure of a GF group is fixed for a given GF group size. In the current AOM/AV1 encoder, video frames are coded using a hierarchical, multilayer coding structure within one GF group. It has been observed that the use of a multilayer coding structure may result in worse coding performance if the GF group exhibits consistent stillness across its frames. This paper proposes a new approach that adaptively designs the GF group coding structure through the use of stillness detection. Our approach develops an automatic stillness detection scheme using three metrics extracted from each GF group. It then differentiates still GF groups from non-still GF groups and uses different GF coding structures accordingly. Experimental results demonstrate a consistent coding gain using the new approach.
- Network quantization is an effective solution to compress deep neural networks for practical usage. Existing network quantization methods cannot sufficiently exploit depth information to generate low-bit compressed networks. In this paper, we propose two novel network quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit quantization (ternary). We are the first to consider network quantization at both the width and depth levels. At the width level, parameters are divided into two parts: one for quantization and the other for re-training to eliminate the quantization loss. SLQ leverages the distribution of the parameters to improve the width level. At the depth level, we introduce incremental layer compensation to quantize layers iteratively, which decreases the quantization loss in each iteration. The proposed approaches are validated with extensive experiments on state-of-the-art neural networks including AlexNet, VGG-16, GoogleNet and ResNet-18. Both SLQ and MLQ achieve impressive results.
- Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. Mar 05 2018 cs.LG arXiv:1803.00710v2 In e-commerce platforms such as Amazon and TaoBao, ranking items in a search session is a typical multi-step decision-making problem. Learning to rank (LTR) methods have been widely applied to ranking problems. However, such methods often treat the different ranking steps in a session as independent, even though they may be highly correlated with each other. To better utilize the correlation between different ranking steps, in this paper, we propose to use reinforcement learning (RL) to learn an optimal ranking policy which maximizes the expected accumulative rewards in a search session. Firstly, we formally define the concept of the search session Markov decision process (SSMDP) to formulate the multi-step ranking problem. Secondly, we analyze the properties of SSMDPs and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the problems of high reward variance and unbalanced reward distribution in an SSMDP. Experiments are conducted in simulation and on the TaoBao search engine. The results demonstrate that our algorithm performs much better than online LTR methods, with more than 40% and 30% growth of total transaction amount in the simulation and the real application, respectively.
- We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To solve these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted extensive evaluation on 60 GB of real-world Chinese news data through detailed pilot user experience studies, although our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.
- Mar 02 2018 cs.CL arXiv:1803.00179v1 Semantic matching of natural language sentences or identifying the relationship between two sentences is a core research problem underlying many natural language tasks. Depending on whether training data is available, prior research has proposed both unsupervised distance-based schemes and supervised deep learning schemes for sentence matching. However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interactions between them. In this paper, we propose Hierarchical Sentence Factorization---a technique to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a "predicate-argument" form. The proposed sentence factorization technique leads to the invention of: 1) a new unsupervised distance metric which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic training, based on factorized sentence hierarchies. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models based on the convolutional neural network (CNN) and long short-term memory (LSTM).
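A minimal NumPy sketch of the entropy-regularized (Sinkhorn) optimal transport distance between two bags of word embeddings — a generic OT solver for intuition; the paper's metric adds a penalty that preserves word order in the reordered sentences:

```python
import numpy as np

def sinkhorn_distance(C, a, b, reg=0.1, iters=200):
    """Entropy-regularized optimal transport between histograms a, b
    given pairwise cost matrix C (e.g. distances between word
    embeddings). A generic Sinkhorn solver, not the paper's penalized
    variant."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # transport plan
    return float(np.sum(P * C))

rng = np.random.default_rng(0)
E1, E2 = rng.normal(size=(5, 50)), rng.normal(size=(7, 50))  # embeddings
C = np.linalg.norm(E1[:, None, :] - E2[None, :, :], axis=-1)
C /= C.max()                                 # normalize for stability
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)  # uniform word weights
print(sinkhorn_distance(C, a, b))
```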
- Given $n$ vectors $x_0, x_1, \ldots, x_{n-1}$ in $\{0,1\}^{m}$, how do we find two vectors whose pairwise Hamming distance is minimum? This problem is known as the Closest Pair Problem. If these vectors are generated uniformly at random except that two of them are correlated with Pearson correlation coefficient $\rho$, then the problem is called the Light Bulb Problem. In this work, we propose a novel coding-based scheme for the Closest Pair Problem. We design both randomized and deterministic algorithms, which achieve the best-known running time when the minimum distance is very small compared to the length of the input vectors. When applied to the Light Bulb Problem, our algorithms yield state-of-the-art deterministic running time when the Pearson correlation coefficient $\rho$ is very large.
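For scale, the exact quadratic-time baseline these algorithms compete with is a popcount scan over all pairs; a short Python sketch:

```python
from itertools import combinations

def closest_pair_hamming(vectors):
    """Exact O(n^2) baseline for the Closest Pair Problem: pack each
    0/1 vector into a Python int and use popcount on the XOR. The
    paper's coding-based algorithms beat this quadratic scan when the
    minimum distance is small relative to the vector length m."""
    packed = [int("".join(map(str, v)), 2) for v in vectors]
    best = None
    for (i, x), (j, y) in combinations(enumerate(packed), 2):
        d = bin(x ^ y).count("1")        # Hamming distance via popcount
        if best is None or d < best[0]:
            best = (d, i, j)
    return best                          # (distance, index_i, index_j)

vecs = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
print(closest_pair_hamming(vecs))        # -> (1, 0, 1)
```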
- Identifying the relationship between two text objects is a core research problem underlying many natural language processing tasks. A wide range of deep learning schemes have been proposed for text matching, mainly focusing on sentence matching, question answering, or query-document matching. We point out that existing approaches do not perform well at matching long documents, which is critical, for example, to AI-based news article understanding and event or story formation. The reason is that these methods either omit or fail to fully utilize the complicated semantic structures in long documents. In this paper, we propose a graph approach to text matching, especially targeting long document matching, such as identifying whether two news articles report the same event in the real world, possibly with different narratives. We propose the Concept Interaction Graph to yield a graph representation of a document, with vertices representing different concepts, each being one or a group of coherent keywords in the document, and with edges representing the interactions between different concepts, connected by sentences in the document. Based on the graph representation of document pairs, we further propose a Siamese Encoded Graph Convolutional Network that learns vertex representations through a Siamese neural network and aggregates the vertex features through Graph Convolutional Networks to generate the matching result. Extensive evaluation of the proposed approach on two labeled news article datasets created at Tencent for its intelligent news products shows that the proposed graph approach to long document matching significantly outperforms a wide range of state-of-the-art methods.
- Feb 19 2018 cs.MM arXiv:1802.06057v1 Immersive video offers the freedom to navigate inside a virtualized environment. Instead of streaming the bulky immersive videos entirely, viewport (also referred to as field of view, FoV) adaptive streaming is preferred. We often stream high-quality content within the current viewport, while reducing the quality of the representation elsewhere to save network bandwidth. Considering that we could refine the quality when focusing on a new FoV, in this paper we model the perceptual impact of the quality variations (through adapting the quantization stepsize and spatial resolution) with respect to the refinement duration, and yield a product of two closed-form exponential functions that well explains the joint quantization- and resolution-induced quality impact. The analytical model is cross-validated using another set of data, where both the Pearson and Spearman's rank correlation coefficients are close to 0.98. Our work is devised to optimize adaptive FoV streaming of immersive video under limited network resources. Numerical results show that our proposed model significantly improves the quality of experience of users, with about 9.36% BD-Rate (Bjontegaard Delta Rate) improvement on average as compared to other representative methods, particularly under limited bandwidth.
- In this paper, we present an accurate approach to estimating vehicles' pose and shape from off-board multi-view images. The images are taken by monocular cameras and have small overlaps. We utilize state-of-the-art convolutional neural networks (CNNs) to extract vehicles' semantic keypoints and introduce a Cross Projection Optimization (CPO) method to estimate the 3D pose. During the iterative CPO process, an adaptive shape adjustment method named Hierarchical Wireframe Constraint (HWC) is implemented to estimate the shape. Our approach is evaluated in both simulated and real-world scenes for performance verification. It is shown that our algorithm outperforms existing monocular and stereo methods for vehicles' pose and shape estimation. This approach provides a new and robust solution for off-board visual vehicle localization and tracking, which can be applied to massive surveillance camera networks for intelligent transportation.
- Jan 29 2018 cs.CR arXiv:1801.08737v1 In this work, we provide the first lattice-based group signature that offers full dynamicity (i.e., users have the flexibility of joining and leaving the group), and thus resolve a prominent open problem posed by previous works. Moreover, we achieve this non-trivial feat in a relatively simple manner. Our design approach consists of upgrading Libert et al.'s fully static construction (EUROCRYPT 2016) - which is arguably the most efficient lattice-based group signature to date - directly into the fully dynamic setting via simple-but-insightful tweaks. Somewhat surprisingly, our scheme even produces slightly shorter signatures than the former, thanks to an adaptation of a technique proposed by Ling et al. (PKC 2013) that allows us to prove inequalities in zero-knowledge without relying on any inequality check. The scheme satisfies the strong security requirements of Bootle et al.'s model (ACNS 2016), under the Short Integer Solution (SIS) and the Learning With Errors (LWE) assumptions. Furthermore, we demonstrate how to equip the obtained group signature scheme with the deniability functionality in a simple way. This attractive functionality, put forward by Ishida et al. (CANS 2016), enables the tracing authority to provide evidence that a given user is not the owner of a signature in question. In the process, we design a zero-knowledge protocol for proving that a given LWE ciphertext does not decrypt to a particular message.
- Jan 26 2018 cs.CR arXiv:1801.08323v1 Group signature is a fundamental cryptographic primitive, aiming to protect the anonymity and ensure the accountability of users. It allows group members to anonymously sign messages on behalf of the whole group, while incorporating a tracing mechanism to identify the signer of any suspected signature. Most of the existing group signature schemes, however, do not guarantee security once users' secret keys are exposed. To reduce the potential damage caused by key exposure attacks, Song (CCS 2001) put forward the concept of forward-secure group signatures (FSGS). For the time being, all known secure FSGS schemes are based on number-theoretic assumptions and are vulnerable against quantum computers. In this work, we construct the first lattice-based FSGS scheme. In Nakanishi et al.'s model, our scheme achieves forward-secure traceability under the Short Integer Solution (SIS) assumption and offers full anonymity under the Learning With Errors (LWE) assumption. At the heart of our construction is a scalable lattice-based key-evolving mechanism, allowing users to periodically update their secret keys and to efficiently prove in zero-knowledge that the key-evolution process is done correctly. To realize this essential building block, we first employ the Bonsai-tree structure by Cash et al. (EUROCRYPT 2010) to handle the key evolution process, and then develop Langlois et al.'s construction (PKC 2014) to design its supporting zero-knowledge protocol. In comparison to all known lattice-based group signatures (which are *not* forward-secure), our scheme only incurs a very reasonable overhead: at most O(log N), where N is the number of group users, and at most O(log^3 T), where T is the number of time periods, in the bit-sizes of keys and signatures.
- Jan 25 2018 cs.CL arXiv:1801.07746v2 The science of happiness is an area of positive psychology concerned with understanding what behaviors make people happy in a sustainable fashion. Recently, there has been interest in developing technologies that help incorporate the findings of the science of happiness into users' daily lives by steering them towards behaviors that increase happiness. With the goal of building technology that can understand how people express their happy moments in text, we crowd-sourced HappyDB, a corpus of 100,000 happy moments that we make publicly available. This paper describes HappyDB and its properties, and outlines several important NLP problems that can be studied with the help of the corpus. We also apply several state-of-the-art analysis techniques to analyze HappyDB. Our results demonstrate the need for deeper NLP techniques to be developed, which makes HappyDB an exciting resource for follow-on research.
- Jan 24 2018 cs.DC astro-ph.IM arXiv:1801.07548v1 With much large scientific equipment being constructed and put into use, astronomy has stepped into the big data era. New methods and infrastructure for big data processing have become a requirement for many astronomers. Many new technologies, such as cloud computing, MapReduce, Hadoop and Spark, have sprung up in recent years. In contrast to high performance computing (HPC), data is at the center of these new technologies. Therefore, a new computing architecture infrastructure that can be shared by both HPC and big data processing is necessary. Based on the Astronomy Cloud project of the Chinese Virtual Observatory (China-VO), we have made great efforts to optimize the design of the hybrid computing platform, which includes the hardware architecture, cluster management, and job and resource scheduling.
- Jan 23 2018 cs.CV arXiv:1801.06940v1 We develop a novel cross-modality generation framework that learns to generate predicted modalities from given modalities in MR images without real acquisition. Our proposed method performs image-to-image translation by means of a deep learning model that leverages conditional generative adversarial networks (cGANs). Our framework jointly exploits the low-level features (pixel-wise information) and high-level representations (e.g. brain tumors, brain structures like gray matter, etc.) across modalities, which are important for resolving the challenging complexity of brain structures. Based on our proposed framework, we first propose a method for cross-modality registration by fusing the deformation fields to adopt the cross-modality information from predicted modalities. Second, we propose an approach for MRI segmentation, translated multichannel segmentation (TMS), where given modalities, along with predicted modalities, are segmented by fully convolutional networks (FCNs) in a multi-channel manner. Both methods successfully adopt the cross-modality information to improve performance without adding any extra data. Experiments demonstrate that our proposed framework advances the state-of-the-art on five MRI datasets. We also observe encouraging results in cross-modality registration and segmentation on several widely adopted datasets. Overall, our work can serve as an auxiliary method in clinical diagnosis and be applied to various tasks in medical fields. Keywords: Image-to-image, cross-modality, registration, segmentation, MRI
- Cell movement in the early phase of C. elegans development is regulated by a highly complex process in which a set of rules and connections are formulated at distinct scales. Previous efforts have shown that agent-based modeling (ABM) systems operating at multiple scales can integrate physical and biological rules and provide new avenues to study developmental systems. However, the application of these systems to model cell movement is still challenging and requires a comprehensive understanding of regulation networks at the right scales. Recent developments in deep learning and reinforcement learning provide an unprecedented opportunity to explore cell movement using 3D time-lapse images. We present a deep reinforcement learning approach within an ABM system to characterize cell movement in C. elegans embryogenesis. Our modeling system captures the complexity of cell movement patterns in the embryo and overcomes the local optimization problem encountered by traditional rule-based ABMs that use greedy algorithms. We tested our model with two real developmental processes: the anterior movement of the Cpaaa cell via intercalation and the rearrangement of the left-right asymmetry. In the first case, model results showed that Cpaaa's intercalation is an active directional cell movement caused by continuous effects from a longer distance, as opposed to a passive movement caused by neighbor cell movements. This is because the learning-based simulation found that a passive movement model could not lead Cpaaa to the predefined destination. In the second case, a leader-follower mechanism well explained the collective cell movement pattern. These results showed that our approach of introducing deep reinforcement learning into ABMs can test regulatory mechanisms by exploring cell migration paths from a reverse engineering perspective. This model opens new doors to exploring the large datasets generated by live imaging.
- Jan 09 2018 cs.NI arXiv:1801.02077v1 In this paper, we propose a two-layer framework to learn the optimal handover (HO) controllers in possibly large-scale wireless systems supporting mobile Internet-of-Things (IoT) users or traditional cellular users, where the user mobility patterns could be heterogeneous. In particular, our proposed framework first partitions the user equipments (UEs) with different mobility patterns into clusters, such that the mobility patterns are similar within each cluster. Then, within each cluster, an asynchronous multi-user deep reinforcement learning scheme is developed to control the HO processes across the UEs in the cluster, with the goal of lowering the HO rate while ensuring certain system throughput. In this scheme, we use a deep neural network (DNN) as an HO controller learned by each UE via reinforcement learning in a collaborative fashion. Moreover, we use supervised learning to initialize the DNN controller before the execution of reinforcement learning, in order to exploit what we already know from traditional HO schemes and to mitigate the negative effects of random exploration at the initial stage. Furthermore, we show that the adopted global-parameter-based asynchronous framework enables us to train faster with more UEs, which nicely addresses the scalability issue of supporting large systems. Finally, simulation results demonstrate that the proposed framework can achieve better performance than state-of-the-art online schemes in terms of HO rates.
- Jan 09 2018 cs.CV arXiv:1801.02475v2 In this paper, we propose an effective but extensible residual one-dimensional convolutional neural network as a base network; based on this network, we propose four subnets to explore the features of skeleton sequences from different aspects. Given a skeleton sequence, the spatial information is encoded in the skeleton joint coordinates within a frame, and the temporal information is represented by multiple frames. Limited by the skeleton sequence representation, two-dimensional convolutional neural networks cannot be used directly, so we chose the one-dimensional convolution layer as the basic layer. Each subnet can extract discriminative features from a different aspect. Our first subnet is a two-stream network that explores both temporal and spatial information. The second is a body-part network, which captures micro spatial features and macro temporal features. The third is an attention network, whose main contribution is to focus on the key frames and feature channels that are highly related to the action classes in a skeleton sequence. The last subnet, a frame-difference network, mainly processes the changes in joints between consecutive frames. The four subnets are ensembled together by late fusion; the key requirement of an ensemble method is that each subnet reach a certain performance while diversity exists between the subnets. Each subnet shares a well-performing base network, and the differences between subnets guarantee the diversity. Experimental results show that the ensemble network achieves state-of-the-art performance on three widely used datasets.
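A minimal PyTorch sketch of the kind of residual 1D-convolution block such a base network stacks (channel counts and kernel size are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class Residual1D(nn.Module):
    """A plain residual block built from 1D convolutions, the kind of
    base unit stacked into a shared base network for skeleton
    sequences (joint-coordinate channels over time)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                # x: (batch, channels, frames)
        return self.act(self.body(x) + x)  # residual connection

x = torch.randn(2, 64, 100)              # joint coordinates over 100 frames
print(Residual1D(64)(x).shape)            # -> torch.Size([2, 64, 100])
```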
- The decoupled fractional Laplacian wave equation can describe seismic wave propagation in attenuating media. Fourier pseudospectral implementations, which solve the equation in the spatial frequency domain, are the only existing methods for solving the equation. For earth media with curved boundaries, pseudospectral methods can be less attractive for handling irregular computational domains. In this paper, we propose a radial basis function collocation method that can easily tackle irregular domain problems. Unlike the pseudospectral methods, the proposed method solves the equation in the physical variable domain. The directional fractional Laplacian is chosen from among the various definitions of the fractional Laplacian. In particular, the vector Grünwald-Letnikov formula is employed to approximate the fractional directional derivative of the radial basis functions. The convergence and stability of the method are numerically investigated using a synthetic solution and long-time simulations, respectively. The method's flexibility is studied by considering homogeneous and multi-layer media with regular and irregular geometric boundaries.
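A one-dimensional NumPy illustration of the Grünwald-Letnikov weights the paper applies directionally (the weight recurrence is standard; the grid and test function are ours):

```python
import numpy as np

def gl_fractional_derivative(f, alpha, h):
    """Grünwald-Letnikov approximation of the order-alpha fractional
    derivative of samples f on a uniform grid with spacing h, using
    the binomial-weight recurrence w_0 = 1,
    w_k = w_{k-1} * (k - 1 - alpha) / k."""
    n = len(f)
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - alpha) / k
    out = np.empty(n)
    for i in range(n):
        # D^alpha f(x_i) ~ h^{-alpha} * sum_k w_k f(x_i - k h)
        out[i] = np.dot(w[: i + 1], f[i::-1]) / h ** alpha
    return out

x = np.linspace(0.0, 1.0, 101)
# sanity check: alpha = 1 reduces to a backward difference, ~ d/dx x = 1
print(gl_fractional_derivative(x, alpha=1.0, h=x[1] - x[0])[1:4])
```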
- Jan 04 2018 cs.CV arXiv:1801.00926v3 Glaucoma is a chronic eye disease that leads to irreversible vision loss. The cup-to-disc ratio (CDR) plays an important role in the screening and diagnosis of glaucoma. Thus, the accurate and automatic segmentation of the optic disc (OD) and optic cup (OC) from fundus images is a fundamental task. Most existing methods segment them separately and rely on hand-crafted visual features from fundus images. In this paper, we propose a deep learning architecture, named M-Net, which solves the OD and OC segmentation jointly in a one-stage multi-label system. The proposed M-Net mainly consists of a multi-scale input layer, a U-shape convolutional network, a side-output layer, and a multi-label loss function. The multi-scale input layer constructs an image pyramid to achieve multiple levels of receptive field sizes. The U-shape convolutional network is employed as the main body network structure to learn a rich hierarchical representation, while the side-output layer acts as an early classifier that produces a companion local prediction map for different scale layers. Finally, a multi-label loss function is proposed to generate the final segmentation map. To further improve the segmentation performance, we also introduce the polar transformation, which provides a representation of the original image in the polar coordinate system. The experiments show that our M-Net system achieves state-of-the-art OD and OC segmentation results on the ORIGA dataset. Simultaneously, the proposed method also obtains satisfactory glaucoma screening performance with the CDR value calculated on both the ORIGA and SCES datasets.
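A minimal NumPy sketch of the polar-transformation idea — resampling the image on a (radius, angle) grid about a center point; M-Net uses the disc center and its own sampling details, so sizes here are illustrative:

```python
import numpy as np

def polar_transform(img, out_r=64, out_theta=128):
    """Map an image into polar coordinates about its center using
    nearest-neighbor sampling, so roughly circular structures (like
    the OD/OC in a fundus image) become roughly horizontal bands."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rmax = min(cx, cy)
    rs = np.linspace(0, rmax, out_r)
    ts = np.linspace(0, 2 * np.pi, out_theta, endpoint=False)
    R, T = np.meshgrid(rs, ts, indexing="ij")
    ys = np.clip(np.round(cy + R * np.sin(T)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + R * np.cos(T)).astype(int), 0, w - 1)
    return img[ys, xs]                     # shape (out_r, out_theta)

print(polar_transform(np.random.rand(400, 400)).shape)  # -> (64, 128)
```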
- Jan 03 2018 cs.NI arXiv:1801.00095v1 With the rapid growth of congestion-sensitive and data-intensive applications, traditional settlement-free peering agreements with best-effort delivery often do not meet the QoS requirements of content providers (CPs). Meanwhile, Internet access providers (IAPs) feel that revenues from end-users are not sufficient to recoup the upgrade costs of network infrastructures. Consequently, some IAPs have begun to offer CPs a new type of peering agreement, called paid peering, under which they provide CPs with better data delivery quality for a fee. In this paper, we model a network platform where an IAP makes decisions on the peering types offered to CPs and the prices charged to CPs and end-users. We study the optimal peering schemes for the IAP, i.e., whether to offer CPs both paid and settlement-free peering to choose from or only one of them, when the objective is profit or welfare maximization. Our results show that 1) the IAP should always offer the paid and settlement-free peering under the profit-optimal and welfare-optimal schemes, respectively, 2) whether to simultaneously offer the other peering type is largely driven by the type of data traffic, e.g., text or video, and 3) regulators might want to encourage the IAP to allocate more network capacity to settlement-free peering to increase user welfare.
- Social awareness and social ties are becoming increasingly popular with emerging mobile and handheld devices. The social trust degree, which describes the strength of social ties, has drawn much research interest in many fields, including secure cooperative communications. Such a trust degree reflects users' willingness to cooperate, which impacts the selection of cooperative users in practical networks. In this paper, we propose a cooperative relay and jamming selection scheme to secure communication based on the social trust degree under a stochastic geometry framework. We analyze the secrecy outage probability (SOP) of the system. To achieve this target, we propose a double Gamma ratio (DGR) approach through Gamma approximation, based on which the SOP is tractably obtained in closed form. The simulation results verify our theoretical findings, and validate that the social trust degree has a dramatic influence on the network's secrecy performance.
- This paper proposes a distributed multiple relay selection scheme to maximize the satisfaction experience of unmanned aerial vehicle (UAV) communication networks. A multi-radio, multi-channel (MRMC) UAV communication system is considered. One source UAV can select one or more relay radios, and each relay radio can be shared equally by multiple source UAVs. Without a central controller, source UAVs with heterogeneous requirements compete for channels dominated by relay radios. In order to optimize global satisfaction performance, we model the UAV communication network as a many-to-many matching market without substitutability. We design a potential matching approach to address the optimization problem, in which optimizing the local matching process leads to improved global matching results. Simulation results show that the proposed distributed matching approach yields good satisfaction performance, close to the global optimum. Moreover, the many-to-many potential matching approach significantly outperforms existing schemes in terms of global satisfaction within a reasonable convergence time.
- This paper investigates an energy-efficient non-orthogonal transmission design problem for two downlink receivers that have strict reliability and finite blocklength (latency) constraints. The Shannon capacity formula widely used in traditional designs requires the assumption of infinite blocklength and thus is no longer appropriate. We adopt the recently developed finite-blocklength coding rate formula to explicitly specify the trade-off between reliability and code blocklength. However, conventional successive interference cancellation (SIC) may become infeasible due to heterogeneous blocklengths. We thus consider several scenarios with different channel conditions and with/without SIC. By carefully examining the problem structure, we present in closed form the optimal power and code blocklength for energy-efficient transmission. Simulation results provide interesting insights into the conditions under which non-orthogonal transmission is more energy efficient than orthogonal transmission schemes such as TDMA.
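The finite-blocklength rate formula referred to is commonly taken to be the AWGN normal approximation of Polyanskiy, Poor and Verdú; a short Python sketch of how blocklength and reliability trade against rate, assuming that formula (the paper's exact channel model may differ):

```python
import math
from scipy.stats import norm

def finite_blocklength_rate(snr, n, eps):
    """Normal approximation to the maximal coding rate (bits per
    channel use) over an AWGN channel at blocklength n and error
    probability eps: R ~ C - sqrt(V/n) * Qinv(eps), with capacity
    C = log2(1 + snr) and dispersion V = (1 - (1+snr)^-2) * log2(e)^2."""
    C = math.log2(1 + snr)
    V = (1 - (1 + snr) ** -2) * math.log2(math.e) ** 2
    return C - math.sqrt(V / n) * norm.isf(eps)  # norm.isf = Qinv

# shorter blocklength (lower latency) visibly costs rate at high reliability:
print(finite_blocklength_rate(snr=10.0, n=200, eps=1e-5))
print(finite_blocklength_rate(snr=10.0, n=2000, eps=1e-5))
```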
- Dec 01 2017 cs.CV arXiv:1711.11317v3 The visual attributes of cells, such as nuclear morphology and chromatin openness, are critical for histopathology image analysis. By learning cell-level visual representations, we can obtain a rich mix of features that are highly reusable for various tasks, such as cell-level classification, nuclei segmentation, and cell counting. In this paper, we propose a unified generative adversarial networks architecture with a new loss formulation to perform robust cell-level visual representation learning in an unsupervised setting. Our model is not only label-free and easily trained but also capable of cell-level unsupervised classification with interpretable visualization, achieving promising results in the unsupervised classification of bone marrow cellular components. Based on the proposed cell-level visual representation learning, we further develop a pipeline that exploits the varieties of cellular elements to perform histopathology image classification, the advantages of which are demonstrated on bone marrow datasets.
- Nov 30 2017 cs.CV arXiv:1711.10658v2 Recently, many methods of person re-identification (Re-ID) have relied on part-based feature representations to learn a discriminative pedestrian descriptor. However, the spatial context between these parts is ignored, since an independent extractor is applied to each separate part. In this paper, we propose to apply Long Short-Term Memory (LSTM) in an end-to-end way to model the pedestrian, seen as a sequence of body parts from head to foot. Integrating the contextual information strengthens the discriminative ability of the local representations. We also leverage the complementary information between local and global features. Furthermore, we integrate both an identification task and a ranking task in one network, where a discriminative embedding and a similarity measurement are learned concurrently. This results in a novel three-branch framework named Deep-Person, which learns highly discriminative features for person Re-ID. Experimental results demonstrate that Deep-Person outperforms the state-of-the-art methods by a large margin on three challenging datasets, including Market-1501, CUHK03, and DukeMTMC-reID. Specifically, combined with a re-ranking approach, we achieve 90.84% mAP on Market-1501 under the single-query setting.
- Nov 27 2017 cs.CV arXiv:1711.08608v2 We propose a registration algorithm for 2D CT/MRI medical images with a new unsupervised end-to-end strategy using convolutional neural networks. The contributions of our algorithm are threefold: (1) We transplant traditional image registration algorithms to an end-to-end convolutional neural network framework, while maintaining the unsupervised nature of image registration problems. The image-to-image integrated framework can simultaneously learn both image features and the transformation matrix for registration. (2) Training with additional unlabeled data can further improve the registration performance by approximately 10%. (3) The registration speed is 100x faster than traditional methods. The proposed network is easy to implement and can be trained efficiently. Experiments demonstrate that our system achieves state-of-the-art results on 2D brain registration and comparable results on 2D liver registration. It can be extended to register other organs beyond the liver and brain, such as the kidney, lung, and heart.
- First-order methods have been widely used for solving large-scale problems. However, many existing works only consider unconstrained problems or those with simple constraints. In this paper, we develop two first-order methods for constrained convex programs whose constraint set is represented by affine equations and smooth nonlinear inequalities. Both methods are based on the classic augmented Lagrangian function. They update the multipliers in the same way as the augmented Lagrangian method (ALM) but employ different primal variable updates. The first method, at each iteration, applies a single proximal gradient step to the primal variable, and the second method is a block-update version of the first one. For the first method, we establish global iterate convergence as well as global sublinear and local linear convergence, and for the second method, we show a global sublinear convergence result in expectation. Numerical experiments are carried out on basis pursuit denoising and a convex quadratically constrained quadratic program to show the empirical performance of the proposed methods. Their numerical behavior closely matches the established theoretical results.
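A NumPy sketch of the first method, specialized to the equality-constrained case for brevity (the paper also handles smooth nonlinear inequalities; the step sizes and toy instance are illustrative):

```python
import numpy as np

def prox_grad_alm(f_grad, prox, A, b, x0, beta, eta, iters=1000):
    """Sketch of the paper's first method for min f(x) + g(x)
    s.t. Ax = b: one proximal gradient step on the augmented
    Lagrangian per iteration, then the standard ALM multiplier update.
    f_grad is grad f, prox is the proximal operator of g."""
    x, y = x0.copy(), np.zeros(A.shape[0])
    for _ in range(iters):
        grad = f_grad(x) + A.T @ (y + beta * (A @ x - b))
        x = prox(x - eta * grad, eta)     # single proximal step on x
        y = y + beta * (A @ x - b)        # multiplier (dual) update
    return x, y

# toy instance: min 0.5*||x||^2 + ||x||_1  s.t.  Ax = b
rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 20)), rng.normal(size=5)
beta = 1.0
eta = 1.0 / (1.0 + beta * np.linalg.norm(A, 2) ** 2)  # ~1/L step size
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
x, y = prox_grad_alm(lambda x: x, soft, A, b, np.zeros(20), beta, eta)
print(np.linalg.norm(A @ x - b))          # feasibility residual
```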
- Minimizing job scheduling time is a fundamental issue in data center networks that has been extensively studied in recent years. Incoming jobs require different CPU and memory units and span different numbers of time slots. The traditional solution is to design efficient heuristic algorithms with performance guarantees under certain assumptions. In this paper, we improve a recently proposed job scheduling algorithm using deep reinforcement learning and extend it to multiple server clusters. Our study reveals that the deep reinforcement learning method has the potential to outperform traditional resource allocation algorithms in a variety of complicated environments.
- The augmented Lagrangian method (ALM) has been widely used for solving constrained optimization problems. In practice, the subproblems for updating primal variables in the ALM framework can usually only be solved inexactly. The convergence and local convergence speed of ALM have been extensively studied. However, the global convergence rate of inexact ALM is still open for problems with nonlinear inequality constraints. In this paper, we work on general convex programs with both equality and inequality constraints. For these problems, we establish the global convergence rate of inexact ALM and estimate its iteration complexity in terms of the number of gradient evaluations needed to produce a solution with a specified accuracy. We first establish an ergodic convergence rate result for inexact ALM with constant penalty parameters or geometrically increasing penalty parameters. Based on this convergence rate result, we apply Nesterov's optimal first-order method to each primal subproblem and estimate the iteration complexity of the inexact ALM. We show that if the objective is convex, then $O(\varepsilon^{-1})$ gradient evaluations are sufficient to guarantee an $\varepsilon$-optimal solution in terms of both the primal objective and feasibility violation. If the objective is strongly convex, the result can be improved to $O(\varepsilon^{-\frac{1}{2}}|\log\varepsilon|)$. Finally, by relating it to the inexact proximal point algorithm, we establish a nonergodic convergence rate result for inexact ALM with geometrically increasing penalty parameters. We show that the nonergodic iteration complexity result is of the same order as the ergodic one. Numerical experiments on quadratically constrained quadratic programming are conducted to compare the performance of the inexact ALM under different settings.
- Nov 16 2017 cs.CL arXiv:1711.04964v2 This paper presents a novel neural model - the Dynamic Fusion Network (DFN) - for machine reading comprehension (MRC). DFNs differ from most state-of-the-art models in their use of a dynamic multi-strategy attention process, in which passages, questions and answer candidates are jointly fused into attention vectors, along with a dynamic multi-step reasoning module for generating answers. With the use of reinforcement learning, for each input sample that consists of a question, a passage and a list of candidate answers, an instance of DFN with a sample-specific network architecture can be dynamically constructed by determining what attention strategy to apply and how many reasoning steps to take. Experiments show that DFNs achieve the best result reported on RACE, a challenging MRC dataset that contains real human reading questions in a wide variety of types. A detailed empirical analysis also demonstrates that DFNs can produce attention vectors that summarize information from questions, passages and answer candidates more effectively than other popular MRC models.
- Nov 15 2017 cs.CV arXiv:1711.04226v2 Recognizing text from natural images is a hot research topic in computer vision due to its various applications. Despite several decades of enduring research on optical character recognition (OCR), recognizing text from natural images is still a challenging task. This is because scene texts are often in irregular (e.g. curved, arbitrarily oriented or seriously distorted) arrangements, which have not yet been well addressed in the literature. Existing methods on text recognition mainly work with regular (horizontal and frontal) texts and cannot be trivially generalized to handle irregular texts. In this paper, we develop the arbitrary orientation network (AON) to directly capture the deep features of irregular texts, which are combined into an attention-based decoder to generate character sequences. The whole network can be trained end-to-end using only images and word-level annotations. Extensive experiments on various benchmarks, including the CUTE80, SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed AON-based method achieves state-of-the-art performance on irregular datasets and is comparable to major existing methods on regular datasets.
- Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events in an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known, while the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio clip to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, the sources of audio events and the times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. The source code of our work has been published.
- We analyse a linear regression problem with nonconvex regularization called smoothly clipped absolute deviation (SCAD) under an overcomplete Gaussian basis for Gaussian random data. We propose an approximate message passing (AMP) algorithm that handles the nonconvex regularization, namely SCAD-AMP, and analytically show that its stability condition corresponds to the de Almeida--Thouless condition in the spin glass literature. Through asymptotic analysis, we show the correspondence between the density evolution of SCAD-AMP and the replica symmetric solution. Numerical experiments confirm that for a sufficiently large system size, SCAD-AMP achieves the optimal performance predicted by the replica method. Through replica analysis, a phase transition between the replica symmetric (RS) and replica symmetry breaking (RSB) regions is found in the parameter space of SCAD. The appearance of an RS region for a nonconvex penalty is a significant advantage, as it indicates a region where the landscape of the optimization problem is smooth. Furthermore, we analytically show that the statistical representation performance of the SCAD penalty is better than that of L1-based methods, and that the minimum representation error under the RS assumption is obtained at the edge of the RS/RSB phase boundary. The correspondence between the convergence of the existing coordinate descent algorithm and the RS/RSB transition is also indicated.
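The elementwise SCAD thresholding function that an AMP iteration applies is standard (Fan and Li); a NumPy sketch:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """One-dimensional SCAD thresholding operator: soft-thresholding
    near zero, a linear interpolation in the middle range, and the
    identity (no shrinkage, hence no bias) for large inputs."""
    z = np.asarray(z, dtype=float)
    return np.where(
        np.abs(z) <= 2 * lam,
        np.sign(z) * np.maximum(np.abs(z) - lam, 0.0),      # soft threshold
        np.where(
            np.abs(z) <= a * lam,
            ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),  # transition
            z,                                               # unbiased tail
        ),
    )

z = np.linspace(-5, 5, 11)
print(scad_threshold(z, lam=1.0))
```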
- This paper investigates the classification of the Audio Set dataset. Audio Set is a large-scale weakly labelled dataset of sound clips. Previous work used multiple instance learning (MIL) to classify weakly labelled data. In MIL, a bag consists of several instances, and a bag is labelled positive if at least one instance in the bag is positive; a bag is labelled negative if all the instances in the bag are negative. We propose an attention model to tackle the MIL problem and explain this attention model from a novel probabilistic perspective. We define a probability space on each bag, where each instance in the bag has a trainable probability measure for each class. The classification of a bag is then the expectation of the classification output of the instances in the bag with respect to the learned probability measure. Experimental results show that our proposed attention model, implemented with a fully connected deep neural network, obtains an mAP of 0.327 on the Audio Set dataset, outperforming Google's baseline of 0.314 and a recurrent neural network of 0.325.
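  A minimal sketch of such an attention-based MIL pooling, assuming instance features arrive as a (batch, instances, features) tensor; the layer shapes and the sigmoid/softmax choices are illustrative, not quoted from the paper.

  ```python
  import torch
  import torch.nn.functional as F

  class AttentionMIL(torch.nn.Module):
      """Bag-level prediction as the expectation of instance-level
      predictions under a learned, per-class probability measure
      over the instances in the bag."""
      def __init__(self, in_dim, n_classes):
          super().__init__()
          self.cls = torch.nn.Linear(in_dim, n_classes)   # instance classifier
          self.att = torch.nn.Linear(in_dim, n_classes)   # instance "measure"

      def forward(self, x):                  # x: (batch, instances, in_dim)
          p = torch.sigmoid(self.cls(x))     # instance-level probabilities
          w = F.softmax(self.att(x), dim=1)  # normalize over instances per class
          return (w * p).sum(dim=1)          # bag-level expectation
  ```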
- This paper proposes a practical approach for automatic sleep stage classification based on a multi-level feature learning framework and a Recurrent Neural Network (RNN) classifier, using heart rate and wrist actigraphy derived from a wearable device. The feature learning framework is designed to extract low- and mid-level features: low-level features capture temporal and frequency domain properties, and mid-level features learn compositions and structural information of signals. Since sleep staging is a sequential problem with long-term dependencies, we take advantage of RNNs with Bidirectional Long Short-Term Memory (BLSTM) architectures for sequence data learning. To simulate the actual situation of daily sleep, experiments are conducted with a resting group, in which sleep is recorded in the resting state, and a comprehensive group, in which both resting and non-resting sleep are included. We evaluate the algorithm based on an eight-fold cross validation to classify five sleep stages (W, N1, N2, N3, and REM). The proposed algorithm achieves weighted precision, recall and F1 scores of 58.0%, 60.3%, and 58.2% in the resting group and 58.5%, 61.1%, and 58.5% in the comprehensive group, respectively. Various comparison experiments demonstrate the effectiveness of feature learning and BLSTM. We further explore the influence of the depth and width of RNNs on performance. Our method is specifically proposed for wearable devices and is expected to be applicable for long-term sleep monitoring at home. Without using much prior domain knowledge, our method has the potential to generalize to sleep disorder detection.
- Connections between nodes of fully connected neural networks are usually represented by weight matrices. In this article, functional transfer matrices are introduced as alternatives to weight matrices: instead of using real weights, a functional transfer matrix uses real functions with trainable parameters to represent the connections between nodes. Multiple functional transfer matrices are then stacked together with bias vectors and activations to form deep functional transfer neural networks. These neural networks can be trained within the framework of back-propagation, based on a revision of the delta rules and the error transmission rule for functional connections. In experiments, it is demonstrated that the revised rules can be used to train a range of functional connections: 20 different functions are applied to neural networks with up to 10 hidden layers, and most of them achieve high test accuracies on the MNIST database. It is also demonstrated that a functional transfer matrix with a memory function can roughly memorise a non-cyclical sequence of 400 digits.
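  A toy sketch of one functional transfer layer, assuming the illustrative choice f(x; w) = sin(w x) for the connection function (the paper evaluates 20 different functions); the class name, initialisation and output activation are hypothetical.

  ```python
  import torch

  class FunctionalTransferLayer(torch.nn.Module):
      """Each connection (i, j) applies a parameterised scalar function
      to its input instead of a plain multiplication; the per-node
      outputs are then summed, as a weight matrix would do."""
      def __init__(self, in_dim, out_dim):
          super().__init__()
          self.w = torch.nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
          self.b = torch.nn.Parameter(torch.zeros(out_dim))

      def forward(self, x):                  # x: (batch, in_dim)
          # broadcast to (batch, out_dim, in_dim), apply f elementwise,
          # then sum over the input dimension
          y = torch.sin(self.w.unsqueeze(0) * x.unsqueeze(1))
          return torch.tanh(y.sum(dim=-1) + self.b)
  ```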
- A power plant is a complex and nonstationary system for which traditional machine learning modeling approaches fall short of expectations. Ensemble-based online learning methods provide an effective way to continuously learn from the dynamic environment and autonomously update models to respond to environmental changes. This paper proposes such an online ensemble regression approach to model power plant performance, which is critically important for operation optimization. Experimental results on both simulated and real data show that the proposed method achieves less than 1% mean absolute percentage error, which meets the general expectations in field operations.
- Data-driven predictive analytics are in use today across a number of industrial applications, but further integration is hindered by the requirement of similarity between training and test data distributions. This paper addresses the need to learn from possibly nonstationary data streams, i.e., under concept drift, a phenomenon commonly seen in practical applications. A simple dual-learner ensemble strategy, the alternating learners framework, is proposed: a long-memory model learns stable concepts from a long relevant time window, while a short-memory model learns transient concepts from a small recent window. The difference in prediction performance of these two models is monitored and induces an alternating policy to select, update and reset the two models. The method features an online updating mechanism to maintain ensemble accuracy and a concept-dependent trigger to focus on relevant data. Empirical studies demonstrate effective tracking and prediction when the streaming data carry abrupt and/or gradual changes.
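  A rough sketch of such an alternating policy, under stated assumptions: `long_model` and `short_model` are any regressors exposing `predict`, `clone` is an assumed deep-copy hook, and the smoothed-error trigger with a fixed `margin` is an illustrative stand-in for the paper's concept-dependent trigger.

  ```python
  def alternating_step(long_model, short_model, x, y_true,
                       err_long, err_short, alpha=0.1, margin=0.05):
      """One monitoring step of a dual-learner ensemble (sketch)."""
      yl, ys = long_model.predict(x), short_model.predict(x)
      # exponentially smoothed absolute error per learner
      err_long = (1 - alpha) * err_long + alpha * abs(yl - y_true)
      err_short = (1 - alpha) * err_short + alpha * abs(ys - y_true)
      if err_long - err_short > margin:
          # drift suspected: the short-memory learner is clearly better,
          # so reset the long-memory model to the recent concept
          long_model = short_model.clone()   # assumed copy method
          err_long = err_short
      y_hat = ys if err_short < err_long else yl
      return y_hat, long_model, err_long, err_short
  ```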
- In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes a 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNNs) on top, which explicitly incorporate knowledge of human body configuration (i.e., kinematics, symmetry, motor coordination). The model thus enforces high-level constraints over human poses. For learning, we develop a pose sample simulator to augment training samples with virtual camera views, which further improves the generalizability of our model. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol in a cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under this setting, while our method handles such challenges well.
- This letter investigates the problem of anti-jamming communications in a dynamic and unknown environment through online learning. Different from existing studies, which need to know (or estimate) the jamming patterns and parameters, we use the spectrum waterfall, i.e., the raw spectrum environment, directly. First, to cope with the challenge of the infinite state space of raw spectrum information, a deep anti-jamming Q-network is constructed. Then, a deep anti-jamming reinforcement learning algorithm is proposed to obtain the optimal anti-jamming strategies. Finally, simulation results validate the proposed approach. The proposed approach relies only on locally observed information and does not need to estimate the jamming patterns and parameters, which implies that it can be widely used in various anti-jamming scenarios.
- Oct 10 2017 cs.CV arXiv:1710.03011v1Almost all existing visual saliency models focus on predicting a universal saliency map across all observers. Yet psychology studies suggest that the visual attention of different observers can vary considerably under some specific circumstances, especially when they view scenes with multiple salient objects. However, little work explores this visual attention difference, probably because of the lack of a proper dataset, as well as the complex correlations among visual attention, personal preferences, and image contents. In this paper, we set out to study this heterogeneous visual attention pattern across observers and build the first dataset for personalized saliency detection. Further, we propose to decompose a personalized saliency map (PSM) into a universal saliency map (USM), which can be predicted by any existing saliency detection model, and a discrepancy between the two. Personalized saliency detection is then cast as the task of estimating the discrepancy between the PSM and the USM. To tackle this task we propose two solutions: i) the discrepancy estimation for different observers is cast as different but related tasks, and we feed the image and its USM into a multi-task convolutional neural network framework to estimate the discrepancy between PSM and USM for each observer; ii) as the discrepancy is related to both image contents and the observers' person-specific information, we feed the image and its associated USM into a convolutional neural network with person-specific information encoded filters to estimate the discrepancy. Extensive experimental results demonstrate the effectiveness of our models for PSM prediction as well as their generalization capability to unseen observers.
- Oct 06 2017 cs.GT arXiv:1710.01918v1This paper investigates incentive mechanism design from a novel and practically important perspective in which mobile users, as contributors, do not join simultaneously and a requester desires large efforts from early contributors. A two-stage Tullock contest framework is constructed: at the second stage the potential contributors compete for a splittable reward by exerting efforts, and at the first stage the requester can orchestrate the incentive mechanism to maximize his crowdsensing efficiency given the rewarding budget. A general reward discrimination mechanism is developed for timeliness-sensitive crowdsensing, where an earlier contributor usually has a larger maximum achievable reward and thus allocates more effort. Owing to the lack of joining time information, two practical implementations, namely earliest-n and termination time, are announced to the contributors. For each of them, we formulate a Stackelberg Bayesian game in which the joining time of a contributor is his type and is not available to his opponents. The uniqueness of the Bayesian Nash equilibrium (BNE) is proved for each strategy. To maximize the requester's efficiency, we compute the optimal number of rewarded contributors in the earliest-n scheme and the optimal deadline in the termination time scheme. Our contest framework is applicable not only to closed crowdsensing with a fixed number of contributors, but also to open crowdsensing in which the arrival of contributors is governed by a stochastic process. Extensive simulations show that, with appropriate reward discrimination, the requester is able to achieve a much higher efficiency through the optimal selection of the number of rewarded contributors and the termination time.
- In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labeled with one or a few audio tags but without timestamps of the audio events; this is called weakly labeled data. Two sub-tasks are defined in this challenge, audio tagging and sound event detection, both using this weakly labeled data. A convolutional recurrent neural network (CRNN) with learnable gated linear unit (GLU) non-linearities applied on the log Mel spectrogram is proposed. In addition, a temporal attention method is proposed along the frames to predict the locations of each audio event in a chunk from the weakly labeled data. As a team, we ranked 1st and 2nd in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and an equal error rate of 0.73, respectively.
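  A minimal sketch of a gated convolutional block with a GLU non-linearity, output = linear(x) * sigmoid(gate(x)); the channel counts, kernel size and 2-D layout (channels, mel bins, frames) are illustrative assumptions, not the challenge system's exact configuration.

  ```python
  import torch

  class GatedConvBlock(torch.nn.Module):
      """Convolution with a gated linear unit: one branch provides the
      linear response, the other a sigmoid gate that selects which
      T-F regions pass through."""
      def __init__(self, in_ch, out_ch, k=3):
          super().__init__()
          self.lin = torch.nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
          self.gate = torch.nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

      def forward(self, x):                 # x: (batch, ch, mel, time)
          return self.lin(x) * torch.sigmoid(self.gate(x))
  ```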
- Sep 20 2017 cs.GT arXiv:1709.06367v1We consider the well-studied game-theoretic version of machine scheduling in which jobs correspond to self-interested users and machines correspond to resources. Here each user chooses a machine trying to minimize her own cost, and such selfish behavior typically results in some equilibrium which is not globally optimal: an equilibrium is an allocation where no user can reduce her own cost by moving to another machine, which in general need not minimize the makespan, i.e., the maximum load over the machines. We provide tight bounds on two well-studied notions in algorithmic game theory, namely the price of anarchy and the strong price of anarchy, in a machine scheduling setting that lies between the related and the unrelated machine cases. Both notions study the social cost (makespan) of the worst equilibrium compared to the optimum, with the strong price of anarchy restricting attention to a stronger form of equilibria. Our results extend a prior study comparing the price of anarchy to the strong price of anarchy for two related machines (Epstein, Acta Informatica 2010), thus providing further insights on the relation between these concepts. Our exact bounds give a qualitative and quantitative comparison between the two models. The bounds also show that our setting is indeed easier than the case of two unrelated machines: in the latter, the strong price of anarchy is $2$, while in ours it is strictly smaller.
- Tracking humans that interact with other subjects or the environment remains unsolved in visual tracking, because the visibility of the humans of interest in videos is unknown and may vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover complete human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose change is mostly attributed to the subject's interactions with its surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to represent the causal-effect relations between an object's visibility fluent and its activities, and develop a probabilistic graph model to jointly reason about visibility fluent changes (e.g., from visible to invisible) and track humans in videos. We formulate this joint task as an iterative search for a feasible causal graph structure, which enables fast search algorithms, e.g., dynamic programming. We apply the proposed method to challenging video sequences to evaluate its capability of estimating visibility fluent changes of subjects and tracking subjects of interest over time. Results and comparisons demonstrate that our method outperforms alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactions.
- Cross-view video understanding is an important yet under-explored area in computer vision. In this paper, we introduce a joint parsing framework that integrates view-centric proposals into scene-centric parse graphs, which represent a coherent scene-centric understanding of cross-view scenes. Our key observations are that overlapping fields of view embed rich appearance and geometry correlations, and that knowledge fragments corresponding to individual vision tasks are governed by consistency constraints available in commonsense knowledge. The proposed joint parsing framework represents such correlations and constraints explicitly and generates semantic scene-centric parse graphs. Quantitative experiments show that scene-centric predictions in the parse graph outperform view-centric predictions.
- Sep 14 2017 cs.CV arXiv:1709.04344v3How to effectively approximate real-valued parameters with binary codes plays a central role in neural network binarization. In this work, we reveal an important fact: binarizing different layers has widely varying effects on the compression ratio of the network and the loss of performance. Based on this fact, we propose a novel and flexible neural network binarization method by introducing the concept of layer-wise priority, which binarizes parameters in inverse order of their layer depth. In each training step, our method selects a specific network layer, minimizes the discrepancy between the original real-valued weights and their binary approximations, and fine-tunes the whole network accordingly. During this iterative process, importantly, we can flexibly decide whether to binarize the remaining floating-point layers, exploring a trade-off between the loss of performance and the compression ratio of the model. The resulting binary network is applied to efficient pedestrian detection. Extensive experimental results on several benchmarks show that, under the same compression ratio, our method achieves a much lower miss rate and faster detection speed than the state-of-the-art neural network binarization method.
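  A sketch of the per-layer binarization step, assuming the standard closed-form choice B = alpha * sign(W) with alpha = mean(|W|), which minimises ||W - alpha B||^2 over alpha and B in {-1, +1} (XNOR-style binarization); the paper's exact discrepancy objective and fine-tuning schedule are not reproduced here.

  ```python
  import torch

  def binarize_layer(weight):
      """Binary approximation of one layer's real-valued weights:
      alpha * sign(W) with the optimal scalar alpha = mean(|W|)."""
      alpha = weight.abs().mean()
      return alpha * torch.sign(weight)

  def binarization_order(layers):
      """Layer-wise priority: deepest layers are binarized first
      (each layer is a dict with an illustrative 'depth' key)."""
      return sorted(layers, key=lambda layer: layer["depth"], reverse=True)
  ```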
- Sep 12 2017 cs.CV arXiv:1709.03272v3In this paper, we introduce a novel end-to-end framework for multi-oriented scene text detection from an instance-aware semantic segmentation perspective. We present Fused Text Segmentation Networks, which combine multi-level features during feature extraction, as text instances may rely on finer feature expression compared to general objects. The network detects and segments text instances jointly and simultaneously, leveraging merits from both the semantic segmentation task and the region proposal based object detection task. Without involving any extra pipelines, our approach surpasses the current state of the art on multi-oriented scene text detection benchmarks, reaching an Hmean of 84.1% on ICDAR2015 Incidental Scene Text and 82.0% on MSRA-TD500. Moreover, we report a baseline on Total-Text, which contains curved text, suggesting the effectiveness of the proposed approach.
- Sep 08 2017 cs.CV arXiv:1709.02054v3Scene text recognition has been a hot research topic in computer vision due to its various applications. The state of the art is the attention-based encoder-decoder framework, which learns the mapping between input images and output sequences in a purely data-driven way. However, we observe that existing attention-based methods perform poorly on complicated and/or low-quality images. One major reason is that existing methods cannot obtain accurate alignments between feature areas and targets for such images; we call this phenomenon "attention drift". To tackle this problem, we propose the FAN (Focusing Attention Network) method, which employs a focusing attention mechanism to automatically draw back the drifted attention. FAN consists of two major components: an attention network (AN) that is responsible for recognizing character targets, as in existing methods, and a focusing network (FN) that is responsible for adjusting attention by evaluating whether the AN attends properly to the target areas in the images. Furthermore, unlike existing methods, we adopt a ResNet-based network to enrich the deep representations of scene text images. Extensive experiments on various benchmarks, including the IIIT5k, SVT and ICDAR datasets, show that the FAN method substantially outperforms existing methods.
- Sep 05 2017 cs.SD arXiv:1709.00551v2In this technical report, we present several methods for task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labelled training data. The data are YouTube video excerpts focusing on transportation and warnings, owing to their industrial applications. There are two sub-tasks: audio tagging and sound event detection from weakly labelled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features, and an attention-based scheme for localizing specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective for improving performance. In summary, we obtain a 61% F-value for the audio tagging sub-task and a 0.73 error rate (ER) for the sound event detection sub-task on the development set, while the official multilayer perceptron (MLP) based baseline obtains only a 13.1% F-value for audio tagging and an ER of 1.02 for sound event detection.
- Aug 22 2017 cs.AI arXiv:1708.05930v1In this paper, a new type of 3D bin packing problem (BPP) is proposed, in which a number of cuboid-shaped items must be put into a bin one by one orthogonally. The objective is to find a placement of these items that minimizes the surface area of the bin. This problem is based on the fact that there is no fixed-size bin in many real business scenarios and the cost of a bin is proportional to its surface area. Our research shows that this problem is NP-hard. Based on previous research on 3D BPP, the surface area is determined by the sequence, spatial locations and orientations of items; among these factors, the sequence of items plays a key role in minimizing the surface area. Inspired by recent achievements of deep reinforcement learning (DRL) techniques, especially the Pointer Network, on combinatorial optimization problems such as the TSP, a DRL-based method is applied to optimize the sequence of items to be packed into the bin. Numerical results show that the method proposed in this paper achieves about a 5% improvement over the heuristic method.
- Existing research on block-diagonal representation mainly focuses on casting block-diagonal regularization on training data, while only little attention is dedicated to concurrently learning block-diagonal representations of both training and test data. In this paper, we propose a discriminative block-diagonal low-rank representation (BDLRR) method for recognition. In particular, BDLRR is formulated as a joint optimization problem that shrinks the unfavorable representation from off-block-diagonal elements and strengthens the compact block-diagonal representation, under the semi-supervised framework of low-rank representation. To this end, we first impose penalty constraints on the negative representation to eliminate correlations between different classes, such that the incoherence criterion of the extra-class representation is boosted. Moreover, a constructed subspace model is developed to enhance the self-expressive power of the training samples and to build a representation bridge between the training and test samples, such that the coherence of the learned intra-class representation is consistently heightened. Finally, the resulting optimization problem is solved by an alternating optimization strategy, and a simple recognition algorithm on the learned representation is used for final prediction. Extensive experimental results demonstrate that the proposed method achieves strong recognition results on four face image datasets, three character datasets, and the fifteen-scene multi-category dataset, not only showing superior potential for image recognition but also outperforming state-of-the-art methods.
- This paper investigates the application of simultaneous wireless information and power transfer (SWIPT) to cooperative non-orthogonal multiple access (NOMA). A new cooperative multiple-input single-output (MISO) SWIPT NOMA protocol is proposed, in which a user with a strong channel condition acts as an energy-harvesting (EH) relay to help a user with a poor channel condition. A power splitting (PS) scheme is adopted at the EH relay. By jointly optimizing the PS ratio and the beamforming vectors, the design objective is to maximize the data rate of the "strong user" while satisfying the QoS requirement of the "weak user". This boils down to a challenging nonconvex problem. To resolve this issue, the semidefinite relaxation (SDR) technique is applied to relax the quadratic terms related to the beamformers, and the relaxed problem is then solved to global optimality by a two-dimensional exhaustive search. We prove rank-one optimality, which establishes the equivalence between the relaxed problem and the original one. To further reduce the high complexity of the exhaustive search, an iterative algorithm based on successive convex approximation (SCA) is proposed, which efficiently attains at least a stationary point. In view of potential application scenarios, e.g., the IoT, the single-input single-output (SISO) case of the cooperative SWIPT NOMA system is also studied. The formulated problem is proved to be strictly unimodal with respect to the PS ratio; hence, a golden section search (GSS) based algorithm with a closed-form solution at each step is proposed to find the unique global optimal solution. It is worth pointing out that the SCA method can also converge to the optimal solution in the SISO case. In numerical simulations, the proposed algorithms are shown to converge within a few iterations, and the SWIPT-aided NOMA protocol outperforms existing transmission protocols.
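  A generic golden section search for maximising a strictly unimodal objective on [0, 1], as would be applied to the PS ratio in the SISO case; the rate function `f` itself is problem-specific and left as a caller-supplied callable, so this is only a sketch of the search, not of the paper's closed-form per-step solution.

  ```python
  import math

  def golden_section_search(f, lo=0.0, hi=1.0, tol=1e-6):
      """Maximise a strictly unimodal f on [lo, hi]."""
      invphi = (math.sqrt(5) - 1) / 2            # 1/phi ~ 0.618
      a, b = lo, hi
      c, d = b - invphi * (b - a), a + invphi * (b - a)
      fc, fd = f(c), f(d)
      while b - a > tol:
          if fc < fd:                            # maximum lies in [c, b]
              a, c, fc = c, d, fd
              d = a + invphi * (b - a)
              fd = f(d)
          else:                                  # maximum lies in [a, d]
              b, d, fd = d, c, fc
              c = b - invphi * (b - a)
              fc = f(c)
      return (a + b) / 2

  # toy usage with a stand-in unimodal "rate" function
  rho_star = golden_section_search(lambda r: r * (1 - r))
  ```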
- This paper provides a unified account of two schools of thinking in information retrieval modelling: generative retrieval, which focuses on predicting relevant documents given a query, and discriminative retrieval, which focuses on predicting relevancy given a query-document pair. We propose a game-theoretic minimax game to iteratively optimise both models. On one hand, the discriminative model, aiming to mine signals from labelled and unlabelled data, provides guidance to train the generative model towards fitting the underlying relevance distribution over documents given the query. On the other hand, the generative model, acting as an attacker to the current discriminative model, generates difficult examples for the discriminative model in an adversarial way by minimising its discrimination objective. Through the competition between these two models, we show that the unified framework takes advantage of both schools of thinking: (i) the generative model learns to fit the relevance distribution over documents via signals from the discriminative model, and (ii) the discriminative model is able to exploit the unlabelled data selected by the generative model to achieve a better estimation for document ranking. Our experimental results demonstrate significant performance gains, as much as 23.96% on Precision@5 and 15.50% on MAP over strong baselines, in a variety of applications including web search, item recommendation, and question answering.
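  In GAN-style notation, a minimax objective of the form described above can be written as follows (a standard formulation consistent with the description, not a verbatim quotation of the paper's equation), where $D_\phi(d\mid q_n)$ is the discriminator's relevance score and $p_\theta$ the generative retrieval model over documents: $J = \min_\theta \max_\phi \sum_{n=1}^{N} \Big( \mathbb{E}_{d\sim p_{\mathrm{true}}(d\mid q_n)}\big[\log D_\phi(d\mid q_n)\big] + \mathbb{E}_{d\sim p_\theta(d\mid q_n)}\big[\log\big(1-D_\phi(d\mid q_n)\big)\big]\Big)$.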
- May 30 2017 cs.NI arXiv:1705.09999v1Network Function Virtualization (NFV) sheds new light on the design, deployment, and management of cloud networks. Many network functions, such as firewalls, load balancers, and intrusion detection systems, can be virtualized on servers. However, network operators often have to sacrifice programmability in order to achieve high throughput, especially at the network edge where complex network functions are required. Here, we design, implement, and evaluate the Hybrid Modular Switch (HyMoS). This hybrid hardware/software switch is designed to meet the requirements of modern-day NFV applications by providing high throughput with a high degree of programmability. HyMoS utilizes P4-compatible Network Interface Cards (NICs), a PCI Express interface, and a CPU to act as line cards, switch fabric, and fabric controller, respectively. In our implementation of HyMoS, the PCI Express interface is turned into a non-blocking switch fabric with a throughput of hundreds of gigabits per second. Compared to existing NFV infrastructure, HyMoS offers modularity in hardware and software as well as a higher degree of programmability by supporting a superset of the P4 language.
- Determining deep holes is an important topic in decoding Reed-Solomon codes. Let $l\ge 1$ be an integer and $a_1,\ldots,a_l$ be arbitrarily given $l$ distinct elements of the finite field ${\bf F}_q$ of $q$ elements with the odd prime number $p$ as its characteristic. Let $D={\bf F}_q\backslash\{a_1,\ldots,a_l\}$ and $k$ be an integer such that $2\le k\le q-l-1$. In this paper, we study the deep holes of generalized projective Reed-Solomon code ${\rm GPRS}_q(D, k)$ of length $q-l+1$ and dimension $k$ over ${\bf F}_q$. For any $f(x)\in {\bf F}_q[x]$, we let $f(D)=(f(y_1),\ldots,f(y_{q-l}))$ if $D=\{y_1, ..., y_{q-l}\}$ and $c_{k-1}(f(x))$ be the coefficient of $x^{k-1}$ of $f(x)$. By using Dür's theorem on the relation between the covering radius and minimum distance of ${\rm GPRS}_q(D, k)$, we show that if $u(x)\in {\bf F}_q[x]$ with $\deg (u(x))=k$, then the received codeword $(u(D), c_{k-1}(u(x)))$ is a deep hole of ${\rm GPRS}_q(D, k)$ if and only if the sum $\sum\limits_{y\in I}y$ is nonzero for any subset $I\subseteq D$ with $\#(I)=k$. We show also that if $j$ is an integer with $1\leq j\leq l$ and $u_j(x):= \lambda_j(x-a_j)^{q-2}+\nu_j x^{k-1}+f_{\leq k-2}^{(j)}(x)$ with $\lambda_j\in {\bf F}_q^*$, $\nu_j\in {\bf F}_q$ and $f_{\leq{k-2}}^{(j)}(x)\in{\bf F}_q[x]$ being a polynomial of degree at most $k-2$, then $(u_j(D), c_{k-1}(u_j(x)))$ is a deep hole of ${\rm GPRS}_q(D, k)$ if and only if the sum $\binom{q-2}{k-1}(-a_j)^{q-1-k}\prod\limits_{y\in I}(a_j-y)+e$ is nonzero for any subset $I\subseteq D$ with $\#(I)=k$, where $e$ is the identity of the group ${\bf F}_q^*$. This implies that $(u_j(D), c_{k-1}(u_j(x)))$ is a deep hole of ${\rm GPRS}_q(D, k)$ if $p|k$.
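  A toy brute-force check of the k-subset-sum criterion above, restricted to a prime field ${\bf F}_p$ so that field arithmetic is just integer arithmetic mod p; the parameter names are illustrative and the check is only feasible for tiny $q$ and $k$.

  ```python
  from itertools import combinations

  def subset_sum_condition(D, k, p):
      """Check, in F_p for prime p, the paper's criterion that every
      k-subset I of D has nonzero sum, which characterises when
      (u(D), c_{k-1}(u)) with deg(u) = k is a deep hole."""
      return all(sum(I) % p != 0 for I in combinations(D, k))

  # toy usage: F_7 with evaluation set D = F_7 \ {0} and k = 3
  p, k = 7, 3
  D = [x for x in range(p) if x != 0]
  print(subset_sum_condition(D, k, p))
  ```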
- The last several years have witnessed a surge of asynchronous (async-) parallel computing methods, due to the extremely big data involved in many modern applications and the advancement of multi-core machines and computer clusters. In optimization, most work on async-parallel methods addresses unconstrained problems or those with block-separable constraints. In this paper, we propose an async-parallel method based on block coordinate update (BCU) for solving convex problems with nonseparable linear constraints. Running on a single node, the method becomes a novel randomized primal-dual BCU with adaptive stepsize for multi-block affinely constrained problems. For these problems, Gauss-Seidel cyclic primal-dual BCU needs strong convexity to guarantee convergence. In contrast, merely assuming convexity, we show that the objective value sequence generated by the proposed algorithm converges in probability to the optimal value, and the constraint residual converges to zero. In addition, we establish an ergodic $O(1/k)$ convergence result, where $k$ is the number of iterations. Numerical experiments demonstrate the efficiency of the proposed method and its significantly better speed-up over its sync-parallel counterpart.
- The minimum feedback arc set problem asks to delete a minimum number of arcs (directed edges) from a digraph (directed graph) to make it free of any directed cycles. In this work we approach this fundamental cycle-constrained optimization problem by considering the generalized task of dividing the digraph into D layers of equal size. We solve the D-segmentation problem using replica-symmetric mean field theory and belief-propagation heuristic algorithms. The minimum feedback arc density of a given random digraph ensemble is then obtained by extrapolating the theoretical results to the limit of large D. A divide-and-conquer algorithm (nested-BPR) is devised to solve the minimum feedback arc set problem with very good performance and high efficiency.
- May 16 2017 cs.AR arXiv:1705.04981v1In this paper, we investigate the challenges of applying Statistical Static Timing Analysis (SSTA) in a hierarchical design flow, where modules supplied by IP vendors are used to hide design details for IP protection and to reduce the complexity of design and verification. For the three basic circuit types, combinational, flip-flop-based and latch-controlled, we propose methods to extract timing models which contain interfacing as well as compressed internal constraints. Using these compact timing models, the runtime of full-chip timing analysis can be reduced, while circuit details from IP vendors are not exposed. We also propose a method to reconstruct the correlation between modules during full-chip timing analysis. This correlation cannot be incorporated into the timing models because it depends on the layout of the corresponding modules in the chip. In addition, we investigate how to apply the extracted timing models with the reconstructed correlation to evaluate the performance of the complete design. Experiments demonstrate that full-chip timing analysis using the extracted timing models and reconstructed correlation can be several times faster than analyzing the flattened circuit directly, while the accuracy of statistical timing analysis is well maintained.