# Search SciRate

### results for au:Chen_L in:cs

• Since the introduction of retroactive data structures at SODA 2004, a major unsolved problem has been to bound the gap between the best partially retroactive data structure (where changes can be made to the past, but only the present can be queried) and the best fully retroactive data structure (where the past can also be queried) for any problem. It was proved in 2004 that any partially retroactive data structure with operation time $T(n,m)$ can be transformed into a fully retroactive data structure with operation time $O(\sqrt{m} \cdot T(n,m))$, where $n$ is the size of the data structure and $m$ is the number of operations in the timeline [Demaine 2004], but it has been open for 14 years whether such a gap is necessary. In this paper, we prove nearly matching upper and lower bounds on this gap for all $n$ and $m$. We improve the upper bound for $n \ll \sqrt m$ by showing a new transformation with multiplicative overhead $n \log m$. We then prove a lower bound of $\min\{n \log m, \sqrt m\}^{1-o(1)}$ assuming any of the following conjectures: - Conjecture I: Circuit SAT requires $2^{n - o(n)}$ time on $n$-input circuits of size $2^{o(n)}$. (Far weaker than the well-believed SETH conjecture, which asserts that CNF SAT with $n$ variables and $O(n)$ clauses already requires $2^{n-o(n)}$ time.) - Conjecture II: Online $(\min,+)$ product between an integer $n\times n$ matrix and $n$ vectors requires $n^{3 - o(1)}$ time. - Conjecture III (3-SUM Conjecture): Given three sets $A,B,C$ of integers, each of size $n$, deciding whether there exist $a \in A, b \in B, c \in C$ such that $a + b + c = 0$ requires $n^{2 - o(1)}$ time. Our lower bound construction illustrates an interesting power of fully retroactive queries: they can be used to quickly solve batched pair evaluation. We believe this technique can prove useful for other data structure lower bounds, especially dynamic ones.
• Anonymous Single-Sign-On authentication schemes have been proposed to allow users to access a service protected by a verifier without revealing their identity which has become more important due to the introduction of strong privacy regulations. In this paper we describe a new approach whereby anonymous authentication to different verifiers is achieved via authorisation tags and pseudonyms. The particular innovation of our scheme is authentication can only occur between a user and its designated verifier for a service, and the verification cannot be performed by any other verifier. The benefit of this authentication approach is that it prevents information leakage of a user's service access information, even if the verifiers for these services collude which each other. Our scheme also supports a trusted third party who is authorised to de-anonymise the user and reveal her whole services access information if required. Furthermore, our scheme is lightweight because it does not rely on attribute or policy-based signature schemes to enable access to multiple services. The scheme's security model is given together with a security proof, an implementation and a performance evaluation.
• In this work, we apply an attention-gated network to real-time automated scan plane detection for fetal ultrasound screening. Scan plane detection in fetal ultrasound is a challenging problem due the poor image quality resulting in low interpretability for both clinicians and automated algorithms. To solve this, we propose incorporating self-gated soft-attention mechanisms. A soft-attention mechanism generates a gating signal that is end-to-end trainable, which allows the network to contextualise local information useful for prediction. The proposed attention mechanism is generic and it can be easily incorporated into any existing classification architectures, while only requiring a few additional parameters. We show that, when the base network has a high capacity, the incorporated attention mechanism can provide efficient object localisation while improving the overall performance. When the base network has a low capacity, the method greatly outperforms the baseline approach and significantly reduces false positives. Lastly, the generated attention maps allow us to understand the model's reasoning process, which can also be used for weakly supervised object localisation.
• In traditional networks, interfaces of network nodes are duplex. But, emerging communication technologies such as visible light communication, millimeter-wave communications, can only provide a unidirectional interface when cost is limited. It's urgent to find effective solutions to utilize such new unidirectional communication skills. Decoupling implies separating one single resource to two independent resources. This idea can be applied at physical layer, link layer, network layer, even transport layer. TCP decoupling is an end to end solution provided at transport layer. With decoupled TCP, two distinct unidirectional path can be created to meet the requirements of reliable information transfer. However, it is not an easy task to decouple a bidirectional logical path at transport layer. In this paper, we dwell on the idea of TCP decoupling. Advantages of decoupling at transport layer are analyzed also. In addition, an experiment is carried out to figure out how to implement a decouple TCP. Our results show decoupling at transport layer is possible and the modified protocol is available.
• Cloudlet deployment and resource allocation for mobile users (MUs) have been extensively studied in existing works for computation resource scarcity. However, most of them failed to jointly consider the two techniques together, and the selfishness of cloudlet and access point (AP) are ignored. Inspired by the group-buying mechanism, this paper proposes three-stage auction schemes by combining cloudlet placement and resource assignment, to improve the social welfare subject to the economic properties. We first divide all MUs into some small groups according to the associated APs. Then the MUs in same group can trade with cloudlets in a group-buying way through the APs. Finally, the MUs pay for the cloudlets if they are the winners in the auction scheme. We prove that our auction schemes can work in polynomial time. We also provide the proofs for economic properties in theory. For the purpose of performance comparison, we compare the proposed schemes with HAF, which is a centralized cloudlet placement scheme without auction. Numerical results confirm the correctness and efficiency of the proposed schemes.
• Grasping skill is a major ability that a wide number of real-life applications require for robotisation. State-of-the-art robotic grasping methods perform prediction of object grasp locations based on deep neural networks which require huge amount of labeled data for training and prove impracticable in robotics. In this paper, we propose to generate a large scale synthetic dataset with ground truth, which we refer to as the Jacquard grasping dataset. Specifically, the proposed Jacquard dataset builds on a subset of ShapeNet with its numerous shape objects and features at a scale of millions varied object grasping positions for a large diversity of more than 11k objects. Beyond the simulation of the grasping scene with the underlying object, the ground truth for a successful grasping position is also based on tentatives of a simulated grasping robot. We carried out experiments using an off-the-shelf CNN, with three different evaluation metrics, including real grasping robot trials. The results show that Jacquard enables much better generalization skills than a human labeled dataset thanks to its diversity of objects and grasping positions. For the purpose of reproducible research in robotics, we are releasing along with the Jacquard dataset a web interface for researchers to evaluate the successfulness of their grasping position detections using our dataset.
• To support future IoT networks with dense sensor connectivity, a technique called over-the-air-computation (AirComp) was recently developed which enables a data-fusion to receive a desired function of sensing-data from concurrent-transmissions by exploiting the superposition property of a multi-access-channel. This work aims at further developing AirComp for next-generation multi-antenna multi-modal sensor networks. Specifically, we design beamforming and channel-feedback techniques for multi-function AirComp. Given the objective of minimizing sum-mean-squared-error of computed functions, the optimization of receive-beamforming for multi-function AirComp is a NP-hard problem. The approximate problem based on tightening transmission-power constraints, however, is shown to be solvable using differential-geometry. The solution is proved to be the weighted-centroid of points on a Grassmann-manifold, where each point represents the subspace spanned by the channel matrix of a sensor. As a by-product, the beamforming problem is found to have the same form as the classic problem of multicast-beamforming, establishing the AirComp-multicasting-duality. Its significance lies in making the said Grassmannian-centroid solution transferable to the latter problem which otherwise is solved using the computation-intensive semidefinite-relaxation-technique. Last, building on the AirComp-beamforming solution, an efficient channel-feedback technique is designed for a data-fusion to receive the beamformer from distributed sensor transmissions of designed signals that are functions of local channel-state-information.
• Mar 29 2018 cs.CV arXiv:1803.10404v2
Cross-modality generation is an emerging topic that aims to synthesize data in one modality based on information in a different modality. In this paper, we consider a task of such: given an arbitrary audio speech and one lip image of arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. To perform well in this task, it inevitably requires a model to not only consider the retention of target identity, photo-realistic of synthesized images, consistency and smoothness of lip images in a sequence, but more importantly, learn the correlations between audio speech and lip movements. To solve the collective problems, we explore the best modeling of the audio-visual correlations in building and training a lip-movement generator network. Specifically, we devise a method to fuse audio and image embeddings to generate multiple lip images at once and propose a novel correlation loss to synchronize lip changes and speech changes. Our final model utilizes a combination of four losses for a comprehensive consideration in generating lip movements; it is trained in an end-to-end fashion and is robust to lip shapes, view angles and different facial characteristics. Thoughtful experiments on three datasets ranging from lab-recorded to lips in-the-wild show that our model significantly outperforms other state-of-the-art methods extended to this task.
• Distributed model training is vulnerable to worst-case system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To tolerate node failures and adversarial attacks, recent work suggests using variants of the geometric median to aggregate distributed updates at the PS, in place of bulk averaging. Although median-based update rules are robust to adversarial nodes, their computational cost can be prohibitive in large-scale settings and their convergence guarantees often require relatively strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are then used by the parameter server to eliminate the effects of adversarial updates. We present problem-independent robustness guarantees for DRACO and show that the model it produces is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches.
• A variety of rating-based recommendation methods have been extensively studied including the well-known collaborative filtering approaches and some network diffusion-based methods, however, social trust relations are not sufficiently considered when making recommendations. In this paper, we contribute to the literature by proposing a trust-based recommendation method, named CosRA+T, after integrating the information of trust relations into the resource-redistribution process. Specifically, a tunable parameter is used to scale the resources received by trusted users before the redistribution back to the objects. Interestingly, we find an optimal scaling parameter for the proposed CosRA+T method to achieve its best recommendation accuracy, and the optimal value seems to be universal under several evaluation metrics across different datasets. Moreover, results of extensive experiments on the two real-world rating datasets with trust relations, Epinions and FriendFeed, suggest that CosRA+T has a remarkable improvement in overall accuracy, diversity, and novelty. Our work takes a step towards designing better recommendation algorithms by employing multiple resources of social network information.
• We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.
• Mixed Reality (MR) is a powerful interactive technology that yields new types of user experience. We present a semantic based interactive MR framework that exceeds the current geometry level approaches, a step change in generating high-level context-aware interactions. Our key insight is to build semantic understanding in MR that not only can greatly enhance user experience through object-specific behaviours, but also pave the way for solving complex interaction design challenges. The framework generates semantic properties of the real world environment through dense scene reconstruction and deep image understanding. We demonstrate our approach with a material-aware prototype system for generating context-aware physical interactions between the real and the virtual objects. Quantitative and qualitative evaluations are carried out and the results show that the framework delivers accurate and fast semantic information in interactive MR environment, providing effective semantic level interactions.
• Meaningful facial parts can convey key cues for both facial action unit detection and expression prediction. Textured 3D face scan can provide both detailed 3D geometric shape and 2D texture appearance cues of the face which are beneficial for Facial Expression Recognition (FER). However, accurate facial parts extraction as well as their fusion are challenging tasks. In this paper, a novel system for 3D FER is designed based on accurate facial parts extraction and deep feature fusion of facial parts. In particular, each textured 3D face scan is firstly represented as a 2D texture map and a depth map with one-to-one dense correspondence. Then, the facial parts of both texture map and depth map are extracted using a novel 4-stage process consists of facial landmark localization, facial rotation correction, facial resizing, facial parts bounding box extraction and post-processing procedures. Finally, deep fusion Convolutional Neural Networks (CNNs) features of all facial parts are learned from both texture maps and depth maps, respectively and nonlinear SVMs are used for expression prediction. Experiments are conducted on the BU-3DFE database, demonstrating the effectiveness of combing different facial parts, texture and depth cues and reporting the state-of-the-art results in comparison with all existing methods under the same setting.
• Convolutional Neural Networks (CNNs) need large amounts of data with ground truth annotation, which is a challenging problem that has limited the development and fast deployment of CNNs for many computer vision tasks. We propose a novel framework for depth estimation from monocular images with corresponding confidence in a self-supervised manner. A fully differential patch-based cost function is proposed by using the Zero-Mean Normalized Cross Correlation (ZNCC) that takes multi-scale patches as a matching strategy. This approach greatly increases the accuracy and robustness of the depth learning. In addition, the proposed patch-based cost function can provide a 0 to 1 confidence, which is then used to supervise the training of a parallel network for confidence map learning and estimation. Evaluation on KITTI dataset shows that our method outperforms the state-of-the-art results.
• As the demand for enabling high-level autonomous driving has increased in recent years and visual perception is one of the critical features to enable fully autonomous driving, in this paper, we introduce an efficient approach for simultaneous object detection, depth estimation and pixel-level semantic segmentation using a shared convolutional architecture. The proposed network model, which we named Driving Scene Perception Network (DSPNet), uses multi-level feature maps and multi-task learning to improve the accuracy and efficiency of object detection, depth estimation and image segmentation tasks from a single input image. Hence, the resulting network model uses less than 850 MiB of GPU memory and achieves 14.0 fps on NVIDIA GeForce GTX 1080 with a 1024x512 input image, and both precision and efficiency have been improved over combination of single tasks.
• Comprehending meaning from natural language is a primary objective of Natural Language Processing (NLP), and text comprehension is the cornerstone for achieving this objective upon which all other problems like chat bots, language translation and others can be achieved. We report a Summary-Attentive Reader we designed to better emulate the human reading process, along with a dictiontary-based solution regarding out-of-vocabulary (OOV) words in the data, to generate answer based on machine comprehension of reading passages and question from the SQuAD benchmark. Our implementation of these features with two popular models (Match LSTM and Dynamic Coattention) was able to reach close to matching the results obtained from humans.
• Modern switches have packet processing capacity of up to multi-tera bits per second, and they are also becoming more and more programmable. We seek to understand whether the programmability can translate packet processing capacity to computational power for parallel computing applications. In this paper, we first develop a simple mathematical model to understand the costs and overheads of data plane computation. Then we validate the the performance benefits of offloading computation to network. Using experiments on real data center network, we finnd that offloading computation to the data plane results in up to 20x speed-up for a simple Map-Reduce application. Motivated by this, we propose a parallel programming framework, p4mr, to help users efficiently program multiple switches. We successfully build and test a prototype of p4mr on a simulated testbed.
• In the era of Internet of Things (IoT) technologies the potential for privacy invasion is becoming a major concern especially in regards to healthcare data and Ambient Assisted Living (AAL) environments. Systems that offer AAL technologies make extensive use of personal data in order to provide services that are context-aware and personalized. This makes privacy preservation a very important issue especially since the users are not always aware of the privacy risks they could face. A lot of progress has been made in the deep learning field, however, there has been lack of research on privacy preservation of sensitive personal data with the use of deep learning. In this paper we focus on a Long Short Term Memory (LSTM) Encoder-Decoder, which is a principal component of deep learning, and propose a new encoding technique that allows the creation of different AAL data views, depending on the access level of the end user and the information they require access to. The efficiency and effectiveness of the proposed method are demonstrated with experiments on a simulated AAL dataset. Qualitatively, we show that the proposed model learns privacy operations such as disclosure, deletion and generalization and can perform encoding and decoding of the data with almost perfect recovery.
• Multispectral images contain many clues of surface characteristics of the objects, thus can be widely used in many computer vision tasks, e.g., recolorization and segmentation. However, due to the complex illumination and the geometry structure of natural scenes, the spectra curves of a same surface can look very different. In this paper, a Low Rank Multispectral Image Intrinsic Decomposition model (LRIID) is presented to decompose the shading and reflectance from a single multispectral image. We extend the Retinex model, which is proposed for RGB image intrinsic decomposition, for multispectral domain. Based on this, a low rank constraint is proposed to reduce the ill-posedness of the problem and make the algorithm solvable. A dataset of 12 images is given with the ground truth of shadings and reflectance, so that the objective evaluations can be conducted. The experiments demonstrate the effectiveness of proposed method.
• Online optimization has been a successful framework for solving large-scale problems under computational constraints and partial information. Current methods for online convex optimization require either a projection or exact gradient computation at each step, both of which can be prohibitively expensive for large-scale applications. At the same time, there is a growing trend of non-convex optimization in machine learning community and a need for online methods. Continuous submodular functions, which exhibit a natural diminishing returns condition, have recently been proposed as a broad class of non-convex functions which may be efficiently optimized. Although online methods have been introduced, they suffer from similar problems. In this work, we propose Meta-Frank-Wolfe, the first online projectionfree algorithm that uses stochastic gradient estimates. The algorithm relies on a careful sampling of gradients in each round and achieves the optimal $O(\sqrt{T})$ adversarial regret bounds for convex and continuous submodular optimization. We also propose One-Shot Frank-Wolfe, a simpler algorithm which requires only a single stochastic gradient estimate in each round and achieves a $O(T^{2/3})$ stochastic regret bound for convex and continuous submodular optimization. We apply our methods to develop a novel "lifting" framework for the online discrete submodular maximization and also see that they outperform current state of the art techniques on an extensive set of experiments.
• Domain adaptation (DA) is transfer learning which aims to learn an effective predictor on target data from source data despite data distribution mismatch between source and target. We present in this paper a novel unsupervised DA method for cross-domain visual recognition which simultaneously optimizes the three terms of a theoretically established error bound. Specifically, the proposed DA method iteratively searches a latent shared feature subspace where not only the divergence of data distributions between the source domain and the target domain is decreased as most state-of-the-art DA methods do, but also the inter-class distances are increased to facilitate discriminative learning. Moreover, the proposed DA method sparsely regresses class labels from the features achieved in the shared subspace while minimizing the prediction errors on the source data and ensuring label consistency between source and target. Data outliers are also accounted for to further avoid negative knowledge transfer. Comprehensive experiments and in-depth analysis verify the effectiveness of the proposed DA method which consistently outperforms the state-of-the-art DA methods on standard DA benchmarks, i.e., 12 cross-domain image classification tasks.
• There is increasing interest in learning algorithms that involve interaction between human and machine. Comparison-based queries are among the most natural ways to get feedback from humans. A challenge in designing comparison-based interactive learning algorithms is coping with noisy answers. The most common fix is to submit a query several times, but this is not applicable in many situations due to its prohibitive cost and due to the unrealistic assumption of independent noise in different repetitions of the same query. In this paper, we introduce a new weak oracle model, where a non-malicious user responds to a pairwise comparison query only when she is quite sure about the answer. This model is able to mimic the behavior of a human in noise-prone regions. We also consider the application of this weak oracle model to the problem of content search (a variant of the nearest neighbor search problem) through comparisons. More specifically, we aim at devising efficient algorithms to locate a target object in a database equipped with a dissimilarity metric via invocation of the weak comparison oracle. We propose two algorithms termed WORCS-I and WORCS-II (Weak-Oracle Comparison-based Search), which provably locate the target object in a number of comparisons close to the entropy of the target distribution. While WORCS-I provides better theoretical guarantees, WORCS-II is applicable to more technically challenging scenarios where the algorithm has limited access to the ranking dissimilarity between objects. A series of experiments validate the performance of our proposed algorithms.
• The rapidly growing body of research in adversarial machine learning has demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarially generated images. This underscores the urgent need for practical defense that can be readily deployed to combat attacks in real-time. Observing that many attack strategies aim to perturb image pixels in ways that are visually imperceptible, we place JPEG compression at the core of our proposed Shield defense framework, utilizing its capability to effectively "compress away" such pixel manipulation. To immunize a DNN model from artifacts introduced by compression, Shield "vaccinates" a model by re-training it with compressed images, where different compression levels are applied to generate multiple vaccinated models that are ultimately used together in an ensemble defense. On top of that, Shield adds an additional layer of protection by employing randomization at test time that compresses different regions of an image using random compression levels, making it harder for an adversary to estimate the transformation performed. This novel combination of vaccination, ensembling, and randomization makes Shield a fortified multi-pronged protection. We conducted extensive, large-scale experiments using the ImageNet dataset, and show that our approaches eliminate up to 94% of black-box attacks and 98% of gray-box attacks delivered by the recent, strongest attacks, such as Carlini-Wagner's L2 and DeepFool. Our approaches are fast and work without requiring knowledge about the model.
• In this paper, we consider an online optimization process, where the objective functions are not convex (nor concave) but instead belong to a broad class of continuous submodular functions. We first propose a variant of the Frank-Wolfe algorithm that has access to the full gradient of the objective functions. We show that it achieves a regret bound of $O(\sqrt{T})$ (where $T$ is the horizon of the online optimization problem) against a $(1-1/e)$-approximation to the best feasible solution in hindsight. However, in many scenarios, only an unbiased estimate of the gradients are available. For such settings, we then propose an online stochastic gradient ascent algorithm that also achieves a regret bound of $O(\sqrt{T})$ regret, albeit against a weaker $1/2$-approximation to the best feasible solution in hindsight. We also generalize our results to $\gamma$-weakly submodular functions and prove the same sublinear regret bounds. Finally, we demonstrate the efficiency of our algorithms on a few problem instances, including non-convex/non-concave quadratic programs, multilinear extensions of submodular set functions, and D-optimal design.
• Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 semantic image segmentation dataset and achieve a performance of 89% on the test set without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.
• In this paper we study the (Bichromatic) Maximum Inner Product Problem (Max-IP), in which we are given sets $A$ and $B$ of vectors, and the goal is to find $a \in A$ and $b \in B$ maximizing inner product $a \cdot b$. Max-IP is very basic and serves as the base problem in the recent breakthrough of [Abboud et al., FOCS 2017] on hardness of approximation for polynomial-time problems. It is also used (implicitly) in the argument for hardness of exact $\ell_2$-Furthest Pair (and other important problems in computational geometry) in poly-log-log dimensions in [Williams, SODA 2018]. We have three main results regarding this problem. First, we study the best multiplicative approximation ratio for Boolean Max-IP in sub-quadratic time. We show that, for Max-IP with two sets of $n$ vectors from $\{0,1\}^{d}$, there is an $n^{2 - \Omega(1)}$ time $\left( d/\log n \right)^{\Omega(1)}$-multiplicative-approximating algorithm, and we show this is conditionally optimal, as such a $\left(d/\log n\right)^{o(1)}$-approximating algorithm would refute SETH. Second, we achieve a similar characterization for the best additive approximation error to Boolean Max-IP. We show that, for Max-IP with two sets of $n$ vectors from $\{0,1\}^{d}$, there is an $n^{2 - \Omega(1)}$ time $\Omega(d)$-additive-approximating algorithm, and this is conditionally optimal, as such an $o(d)$-approximating algorithm would refute SETH [Rubinstein, STOC 2018]. Last, we revisit the hardness of solving Max-IP exactly for vectors with integer entries. We show that, under SETH, for Max-IP with sets of $n$ vectors from $\mathbb{Z}^{d}$ for some $d = 2^{O(\log^{*} n)}$, every exact algorithm requires $n^{2 - o(1)}$ time. With the reduction from [Williams, SODA 2018], it follows that $\ell_2$-Furthest Pair and Bichromatic $\ell_2$-Closest Pair in $2^{O(\log^{*} n)}$ dimensions require $n^{2 - o(1)}$ time.
• Glioma is one of the most common and aggressive types of primary brain tumors. The accurate segmentation of subcortical brain structures is crucial to the study of gliomas in that it helps the monitoring of the progression of gliomas and aids the evaluation of treatment outcomes. However, the large amount of required human labor makes it difficult to obtain the manually segmented Magnetic Resonance Imaging (MRI) data, limiting the use of precise quantitative measurements in the clinical practice. In this work, we try to address this problem by developing a 3D Convolutional Neural Network~(3D CNN) based model to automatically segment gliomas. The major difficulty of our segmentation model comes with the fact that the location, structure, and shape of gliomas vary significantly among different patients. In order to accurately classify each voxel, our model captures multi-scale contextual information by extracting features from two scales of receptive fields. To fully exploit the tumor structure, we propose a novel architecture that hierarchically segments different lesion regions of the necrotic and non-enhancing tumor~(NCR/NET), peritumoral edema~(ED) and GD-enhancing tumor~(ET). Additionally, we utilize densely connected convolutional blocks to further boost the performance. We train our model with a patch-wise training schema to mitigate the class imbalance problem. The proposed method is validated on the BraTS 2017 dataset and it achieves Dice scores of 0.72, 0.83 and 0.81 for the complete tumor, tumor core and enhancing tumor, respectively. These results are comparable to the reported state-of-the-art results, and our method is better than existing 3D-based methods in terms of compactness, time and space efficiency.
• Solving nonlinear algebraic equations is a classic mathematics problem, and common in scientific researches and engineering applications. There are many numeric, symbolic and numeric-symbolic methods of solving (real) solutions. Unlucky, these methods are constrained by some factors, e.g., high complexity, slow serial calculation, and the notorious intermediate expression expansion. Especially when the count of variables is larger than six, the efficiency is decreasing drastically. In this paper, according to the property of physical world, we pay attention to nonlinear algebraic equations whose variables are in fixed constraints, and get meaningful real solutions. Combining with parallelism of GPGPU, we present an efficient algorithm, by searching the solution space globally and solving the nonlinear algebraic equations with real interval solutions. Furthermore, we realize the Hansen-Sengupta method on GPGPU. The experiments show that our method can solve many nonlinear algebraic equations, and the results are accurate and more efficient compared to traditional serial methods.
• Jan 31 2018 cs.CV arXiv:1801.09718v1
Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural language processing with abstract reasoning, the problem is considered as AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction we decided to develop a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm the promises and show significant, few percent improvements of accuracy on the complex "counting" task.
• Cooperative device to device (CD2D) communication has been considered to be a solution to capacity shortage problem. Combining multi-homing and CD2D techniques together can potentially improve network performance. We propose a novel multi-homing CD2D (MH-CD2D) network, in which multiple homing mobile devices (MMDs) act as relays for the cooperative communications of ordinary mobile devices (OMDs). We formulate such joint bandwidth-relay allocation problem as a two-stage game, in order to deal with two challenges: how to motivate MMDs to lease spare bandwidths and help OMDs to choose appropriate MMD relays. In the first stage, we use a non-cooperative game to model the competition between MMDs in terms of shared bandwidth and price. In the second stage, we model the behavior of OMDs selecting MMDs by an evolutionary game. We prove that there exists Nash equilibrium in the game and propose a distributed incentive scheme named IMES to solve the joint bandwidth-relay allocation problem. Extensive simulation results show that the equilibrium can be achieved and the best response price of one MMD increases with the other's best price in the Stackelberg game. The utility of MMDs increases with the number of OMDs in each OMD group at the evolutionary equilibrium. The proposed algorithms are able to reduce average service delay by more than 25% in comparison to the randomized scheme which is frequently used in existing works. On average, IMES outperforms existing scheme by about 20.37% in terms of utility of MMDs.
• An intelligent robot agent based on domain ontology, machine learning mechanism, and Fuzzy Markup Language (FML) for students and robot co-learning is presented in this paper. The machine-human co-learning model is established to help various students learn the mathematical concepts based on their learning ability and performance. Meanwhile, the robot acts as a teacher's assistant to co-learn with children in the class. The FML-based knowledge base and rule base are embedded in the robot so that the teachers can get feedback from the robot on whether students make progress or not. Next, we inferred students' learning performance based on learning content's difficulty and students' ability, concentration level, as well as teamwork sprit in the class. Experimental results show that learning with the robot is helpful for disadvantaged and below-basic children. Moreover, the accuracy of the intelligent FML-based agent for student learning is increased after machine learning mechanism.
• Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
• Modern networks are of huge sizes as well as high dynamics, which challenges the efficiency of community detection algorithms. In this paper, we study the problem of overlapping community detection on distributed and dynamic graphs. Given a distributed, undirected and unweighted graph, the goal is to detect overlapping communities incrementally as the graph is dynamically changing. We propose an efficient algorithm, called \textitrandomized Speaker-Listener Label Propagation Algorithm (rSLPA), based on the \textitSpeaker-Listener Label Propagation Algorithm (SLPA) by relaxing the probability distribution of label propagation. Besides detecting high-quality communities, rSLPA can incrementally update the detected communities after a batch of edge insertion and deletion operations. To the best of our knowledge, rSLPA is the first algorithm that can incrementally capture the same communities as those obtained by applying the detection algorithm from the scratch on the updated graph. Extensive experiments are conducted on both synthetic and real-world datasets, and the results show that our algorithm can achieve high accuracy and efficiency at the same time.
• Mobile Edge Computing (MEC) pushes computing functionalities away from the centralized cloud to the network edge, thereby meeting the latency requirements of many emerging mobile applications and saving backhaul network bandwidth. Although many existing works have studied computation offloading policies, service caching is an equally, if not more important, design topic of MEC, yet receives much less attention. Service caching refers to caching application services and their related databases/libraries in the edge server (e.g. MEC-enabled BS), thereby enabling corresponding computation tasks to be executed. Because only a small number of application services can be cached in resource-limited edge server at the same time, which services to cache has to be judiciously decided to maximize the edge computing performance. In this paper, we investigate the extremely compelling but much less studied problem of dynamic service caching in MEC-enabled dense cellular networks. We propose an efficient online algorithm, called OREO, which jointly optimizes dynamic service caching and task offloading to address a number of key challenges in MEC systems, including service heterogeneity, unknown system dynamics, spatial demand coupling and decentralized coordination. Our algorithm is developed based on Lyapunov optimization and Gibbs sampling, works online without requiring future information, and achieves provable close-to-optimal performance. Simulation results show that our algorithm can effectively reduce computation latency for end users while keeping energy consumption low.
• Correctly estimating the discrepancy between two data distributions has always been an important task in Machine Learning. Recently, Cuturi proposed the Sinkhorn distance which makes use of an approximate Optimal Transport cost between two distributions as a distance to describe distribution discrepancy. Although it has been successfully adopted in various machine learning applications (e.g. in Natural Language Processing and Computer Vision) since then, the Sinkhorn distance also suffers from two unnegligible limitations. The first one is that the Sinkhorn distance only gives an approximation of the real Wasserstein distance, the second one is the `divide by zero' problem which often occurs during matrix scaling when setting the entropy regularization coefficient to a small value. In this paper, we introduce a new Brenier approach for calculating a more accurate Wasserstein distance between two discrete distributions, this approach successfully avoids the two limitations shown above for Sinkhorn distance and gives an alternative way for estimating distribution discrepancy.
• In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input an MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters
• Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories, e.g. a better detector would result by transforming the differences between a dog classifier and a dog detector onto the cat class, than would by transforming from the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
• This paper presents HeNet, a hierarchical ensemble neural network, applied to classify hardware-generated control flow traces for malware detection. Deep learning-based malware detection has so far focused on analyzing executable files and runtime API calls. Static code analysis approaches face challenges due to obfuscated code and adversarial perturbations. Behavioral data collected during execution is more difficult to obfuscate but recent research has shown successful attacks against API call based malware classifiers. We investigate control flow based characterization of a program execution to build robust deep learning malware classifiers. HeNet consists of a low-level behavior model and a top-level ensemble model. The low-level model is a per-application behavior model, trained via transfer learning on a time-series of images generated from control flow trace of an execution. We use Intel$^\circledR$ Processor Trace enabled processor for low overhead execution tracing and design a lightweight image conversion and segmentation of the control flow trace. The top-level ensemble model aggregates the behavior classification of all the trace segments and detects an attack. The use of hardware trace adds portability to our system and the use of deep learning eliminates the manual effort of feature engineering. We evaluate HeNet against real-world exploitations of PDF readers. HeNet achieves 100\% accuracy and 0\% false positive on test set, and higher classification accuracy compared to classical machine learning algorithms.
• With more and more household objects built on planned obsolescence and consumed by a fast-growing population, hazardous waste recycling has become a critical challenge. Given the large variability of household waste, current recycling platforms mostly rely on human operators to analyze the scene, typically composed of many object instances piled up in bulk. Helping them by robotizing the unitary extraction is a key challenge to speed up this tedious process. Whereas supervised deep learning has proven very efficient for such object-level scene understanding, e.g., generic object detection and segmentation in everyday scenes, it however requires large sets of per-pixel labeled images, that are hardly available for numerous application contexts, including industrial robotics. We thus propose a step towards a practical interactive application for generating an object-oriented robotic grasp, requiring as inputs only one depth map of the scene and one user click on the next object to extract. More precisely, we address in this paper the middle issue of object seg-mentation in top views of piles of bulk objects given a pixel location, namely seed, provided interactively by a human operator. We propose a twofold framework for generating edge-driven instance segments. First, we repurpose a state-of-the-art fully convolutional object contour detector for seed-based instance segmentation by introducing the notion of edge-mask duality with a novel patch-free and contour-oriented loss function. Second, we train one model using only synthetic scenes, instead of manually labeled training data. Our experimental results show that considering edge-mask duality for training an encoder-decoder network, as we suggest, outperforms a state-of-the-art patch-based network in the present application context.
• A fundamental challenge in wireless multicast has been how to simultaneously achieve high-throughput and low-delay for reliably serving a large number of users. In this paper, we show how to harness substantial throughput and delay gains by exploiting multi-channel resources. We develop a new scheme called Multi-Channel Moving Window Codes (MC-MWC) for multi-channel multi-session wireless multicast. The salient features of MC-MWC are three-fold. (i) High throughput: we show that MC-MWC achieves order-optimal throughput in the many-user many-channel asymptotic regime. Moreover, the number of channels required by a conventional channel-allocation based scheme is shown to be doubly-exponentially larger than that required by MC-MWC. (ii) Low delay: using large deviations theory, we show that the delay of MC-MWC decreases linearly with the number of channels, while the delay reduction of conventional schemes is no more than a finite constant. (iii) Low feedback overhead: the feedback overhead of MC-MWC is a constant that is independent of both the number of receivers in each session and the number of sessions in the network. Finally, our trace-driven simulation and numerical results validate the analytical results and show that the implementation complexity of MC-MWC is low.
• Many applications require the immutable and consistent sharing of data across organizational boundaries. Because conventional datastores cannot provide this functionality, blockchains have been proposed as one possible solution. Yet public blockchains are energy inefficient, hard to scale and suffer from limited throughput and high latencies, while permissioned blockchains depend on specially designated nodes, potentially leak meta-information, and also suffer from scale and performance bottlenecks. This paper presents CreDB, a datastore that provides blockchain-like guarantees of integrity using trusted execution environments. CreDB employs four novel mechanisms to support a new class of applications. First, it creates a permanent record of every transaction, known as a witness, that clients can then use not only to audit the database but to prove to third parties that desired actions took place. Second, it associates with every object an inseparable and inviolable policy, which not only performs access control but enables the datastore to implement state machines whose behavior is amenable to analysis. Third, timeline inspection allows authorized parties to inspect and reason about the history of changes made to the data. Finally, CreDB provides a protected function evaluation mechanism that allows integrity-protected computation over private data. The paper describes these mechanisms, and the applications they collectively enable, in detail. We have fully implemented a prototype of CreDB on Intel SGX. Evaluation shows that CreDB can serve as a drop-in replacement for other NoSQL stores, such as MongoDB while providing stronger integrity guarantees.
• Deep Learning can significantly benefit cancer proteomics and genomics. In this study, we attempt to determine a set of critical proteins that are associated with the FLT3-ITD mutation in newly-diagnosed acute myeloid leukemia patients. A Deep Learning network consisting of autoencoders forming a hierarchical model from which high-level features are extracted without labeled training data. Dimensional reduction reduced the number of critical proteins from 231 to 20. Deep Learning found an excellent correlation between FLT3-ITD mutation with the levels of these 20 critical proteins (accuracy 97%, sensitivity 90%, specificity 100%). Our Deep Learning network could hone in on 20 proteins with the strongest association with FLT3-ITD. The results of this study allow a novel approach to determine critical protein pathways in the FLT3-ITD mutation, and provide proof-of-concept for an accurate approach to model big data in cancer proteomics and genomics.
• Domain adaptation (DA) aims to generalize a learning model across training and testing data despite the mismatch of their data distributions. In light of a theoretical estimation of upper error bound, we argue in this paper that an effective DA method should 1) search a shared feature subspace where source and target data are not only aligned in terms of distributions as most state of the art DA methods do, but also discriminative in that instances of different classes are well separated; 2) account for the geometric structure of the underlying data manifold when inferring data labels on the target domain. In comparison with a baseline DA method which only cares about data distribution alignment between source and target, we derive three different DA models, namely CDDA, GA-DA, and DGA-DA, to highlight the contribution of Close yet Discriminative DA(CDDA) based on 1), Geometry Aware DA (GA-DA) based on 2), and finally Discriminative and Geometry Aware DA (DGA-DA) implementing jointly 1) and 2). Using both synthetic and real data, we show the effectiveness of the proposed approach which consistently outperforms state of the art DA methods over 36 image classification DA tasks through 6 popular benchmarks. We further carry out in-depth analysis of the proposed DA method in quantifying the contribution of each term of our DA model and provide insights into the proposed DA methods in visualizing both real and synthetic data.
• The objective of the triple scoring task in WSDM Cup 2017 is to compute relevance scores for knowledge-base triples of type-like relations. For example, consider Julius Caesar who has had various professions, including Politician and Author. For two given triples (Julius Caesar, profession, Politician) and (Julius Caesar, profession, Author), the former triple is likely to have a higher relevance score (also called "triple score") because Julius Caesar was well-known as a politician and not as an author. Accurate prediction of such triple scores greatly benefits real-world applications, such as information retrieval or knowledge base query. In these scenarios, being able to rank all relations (Profession/Nationality) can help improve the user experience. We propose a triple scoring model which integrates knowledge from both latent features and explicit features via an ensemble approach. The latent features consist of representations for a person learned by using a word2vec model and representations for profession/nationality values extracted from a pre-trained GloVe embedding model. In addition, we extract explicit features for person entities from the Freebase knowledge base. Experimental results show that the proposed method performs competitively at WSDM Cup 2017, ranking at the third place with an accuracy of 79.72% for predicting within two places of the ground truth score.
• Automated segmentation of intracranial arteries on magnetic resonance angiography (MRA) allows for quantification of cerebrovascular features, which provides tools for understanding aging and pathophysiological adaptations of the cerebrovascular system. Using a convolutional autoencoder (CAE) for segmentation is promising as it takes advantage of the autoencoder structure in effective noise reduction and feature extraction by representing high dimensional information with low dimensional latent variables. In this report, an optimized CAE model (Y-net) was trained to learn a 3D segmentation model of intracranial arteries from 49 cases of MRA data. The trained model was shown to perform better than the three traditional segmentation methods in both binary classification and visual evaluation.
• Aiming to address the fast multi-object tracking for dense small object in the cluster background, we review track orientated multi-hypothesis tracking(TOMHT) with consideration of batch optimization. Employing autocorrelation based motion score test and staged hypotheses merging approach, we build our homologous hypothesis generation and management method. A new one-to-many constraint is proposed and applied to tackle the track exclusions during complex occlusions. Besides, to achieve better results, we develop a multi-appearance segmentation for detection, which exploits tree-like topological information and realizes one threshold for one object. Experimental results verify the strength of our methods, indicating speed and performance advantages of our tracker.
• In this work, we tackle the problem of instance segmentation, the task of simultaneously solving object detection and semantic segmentation. Towards this goal, we present a model, called MaskLab, which produces three outputs: box detection, semantic segmentation, and direction prediction. Building on top of the Faster-RCNN object detector, the predicted boxes provide accurate localization of object instances. Within each region of interest, MaskLab performs foreground/background segmentation by combining semantic and direction prediction. Semantic segmentation assists the model in distinguishing between objects of different semantic classes including background, while the direction prediction, estimating each pixel's direction towards its corresponding center, allows separating instances of the same semantic class. Moreover, we explore the effect of incorporating recent successful methods from both segmentation and detection (i.e. atrous convolution and hypercolumn). Our proposed model is evaluated on the COCO instance segmentation benchmark and shows comparable performance with other state-of-art models.
• We propose an approach for forecasting video of complex human activity involving multiple people. Direct pixel-level prediction is too simple to handle the appearance variability in complex activities. Hence, we develop novel intermediate representations. An architecture combining a hierarchical temporal model for predicting human poses and encoder-decoder convolutional neural networks for rendering target appearances is proposed. Our hierarchical model captures interactions among people by adopting a dynamic group-based interaction mechanism. Next, our appearance rendering network encodes the targets' appearances by learning adaptive appearance filters using a fully convolutional network. Finally, these filters are placed in encoder-decoder neural networks to complete the rendering. We demonstrate that our model can generate videos that are superior to state-of-the-art methods, and can handle complex human activity scenarios in video forecasting.
• We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training. SP-AEN aims to tackle the inherent problem --- semantic loss --- in the prevailing family of embedding-based ZSL, where some semantics would be discarded during training if they are non-discriminative for training classes, but could become critical for recognizing test classes. Specifically, SP-AEN prevents the semantic loss by introducing an independent visual-to-semantic space embedder which disentangles the semantic space into two subspaces for the two arguably conflicting objectives: classification and reconstruction. Through adversarial learning of the two subspaces, SP-AEN can transfer the semantics from the reconstructive subspace to the discriminative one, accomplishing the improved zero-shot recognition of unseen classes. Comparing with prior works, SP-AEN can not only improve classification but also generate photo-realistic images, demonstrating the effectiveness of semantic preservation. On four popular benchmarks: CUB, AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art methods by an absolute performance difference of 12.2\%, 9.3\%, 4.0\%, and 3.6\% in terms of harmonic mean values
• The interpolation based algebraic decoding for Reed-Solomon (RS) codes can correct errors beyond half of the code's minimum Hamming distance. Using soft information, algebraic soft decoding (ASD) can further enhance the performance. This paper introduces two ASD algorithms that utilize a new interpolation technique, the module minimization (MM). Achieving the desirable Gröbner basis, the MM interpolation is simpler than the existing Koetter's interpolation that is an iterative polynomial construction process. We will demonstrate how the MM technique can be utilized to solve the interpolation problem in the Koetter-Vardy decoding and the algebraic Chase decoding, respectively. Their re-encoding transformed variants will also be introduced. This transform reduces the size of module entries, leading to a reduced MM complexity. It is more effective for high rate codes. For an RS code of length $n$ and dimension $k$, the MM complexity is $O(n(n - k)l^5)$, where $l$ is the $y$-degree of the interpolated polynomial. Re-encoding transform further reduces the MM complexity to $O((n-k)^2l^5)$. The MM complexity characterizes complexity of the two ASD algorithms. Simulation results will be further provided to substantiate the proposals' decoding and complexity performances.
• Voice input has been tremendously improving the user experience of mobile devices by freeing our hands from typing on the small screen. Speech recognition is the key technology that powers voice input, and it is usually outsourced to the cloud for the best performance. However, the cloud might compromise users' privacy by identifying their identities by voice, learning their sensitive input content via speech recognition, and then profiling the mobile users based on the content. In this paper, we design an intermediate between users and the cloud, named VoiceMask, to sanitize users' voice data before sending it to the cloud for speech recognition. We analyze the potential privacy risks and aim to protect users' identities and sensitive input content from being disclosed to the cloud. VoiceMask adopts a carefully designed voice conversion mechanism that is resistant to several attacks. Meanwhile, it utilizes an evolution-based keyword substitution technique to sanitize the voice input content. The two sanitization phases are all performed in the resource-limited mobile device while still maintaining the usability and accuracy of the cloud-supported speech recognition service. We implement the voice sanitizer on Android systems and present extensive experimental results that validate the effectiveness and efficiency of our app. It is demonstrated that we are able to reduce the chance of a user's voice being identified from 50 people by 84% while keeping the drop of speech recognition accuracy within 14.2%.
• Little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics are taken into account. Stemming from the characteristic of temporal structures of music in nature, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for audio modality and text modality (lyrics). Different modality data are converted to the same canonical space where inter modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures for learning the paired temporal correlation of audio and lyrics. Pre-trained Doc2vec model followed by fully-connected layers (fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: i) pre-trained CNN followed by fully-connected layers is investigated for representing music audio. ii) We further suggest an end-to-end architecture that simultaneously trains convolutional layers and fully-connected layers to better learn temporal structures of music audio. Particularly, our end-to-end deep architecture contains two properties: simultaneously implementing feature learning and cross-modal correlation learning, and learning joint representation by considering temporal structures. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
• In the adversarial-perturbation problem of neural networks, an adversary starts with a neural network model $F$ and a point ${\bf x}$ that $F$ classifies correctly, and applies a \emphsmall perturbation to $\bf x$ to produce another point ${\bf x}'$ that $F$ classifies \emphincorrectly. In this paper, we propose taking into account \emphthe inherent confidence information produced by models when studying adversarial perturbations, where a natural measure of "confidence" is $\|F({\bf x})\|_\infty$ (i.e. how confident $F$ is about its prediction?). Motivated by a thought experiment based on the manifold assumption, we propose a "goodness property" of models which states that \emphconfident regions of a good model should be well separated. We give formalizations of this property and examine existing robust training objectives in view of them. Interestingly, we find that a recent objective by Madry et al. encourages training a model that satisfies well our formal version of the goodness property, but has a weak control of points that are wrong but with low confidence. However, if Madry et al.'s model is indeed a good solution to their objective, then good and bad points are now distinguishable and we can try to embed uncertain points back to the closest confident region to get (hopefully) correct predictions. We thus propose embedding objectives and algorithms, and perform an empirical study using this method. Our experimental results are encouraging: Madry et al.'s model wrapped with our embedding procedure achieves almost perfect success rate in defending against attacks that the base model fails on, while retaining good generalization behavior.
• A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: ($i$) from observed data fed through the encoder to yield codes, and ($ii$) from latent codes drawn from a simple prior and propagated through the decoder to manifest data. Lower bounds are learned for marginal log-likelihood fits observed data and latent codes. When learning with the variational bound, one seeks to minimize the symmetric Kullback-Leibler divergence of joint density functions from ($i$) and ($ii$), while simultaneously seeking to maximize the two marginal log-likelihoods. To facilitate learning, a new form of adversarial training is developed. An extensive set of experiments is performed, in which we demonstrate state-of-the-art data reconstruction and generation on several image benchmark datasets.
• A number of image-processing problems can be formulated as optimization problems. The objective function typically contains several terms specifically designed for different purposes. Parameters in front of these terms are used to control the relative weights among them. It is of critical importance to tune these parameters, as quality of the solution depends on their values. Tuning parameter is a relatively straightforward task for a human, as one can intelligently determine the direction of parameter adjustment based on the solution quality. Yet manual parameter tuning is not only tedious in many cases, but becomes impractical when a number of parameters exist in a problem. Aiming at solving this problem, this paper proposes an approach that employs deep reinforcement learning to train a system that can automatically adjust parameters in a human-like manner. We demonstrate our idea in an example problem of optimization-based iterative CT reconstruction with a pixel-wise total-variation regularization term. We set up a parameter tuning policy network (PTPN), which maps an CT image patch to an output that specifies the direction and amplitude by which the parameter at the patch center is adjusted. We train the PTPN via an end-to-end reinforcement learning procedure. We demonstrate that under the guidance of the trained PTPN for parameter tuning at each pixel, reconstructed CT images attain quality similar or better than in those reconstructed with manually tuned parameters.
• We present a planner for a robot to keep an object stable under a sequence of external forces. We assume a robot with multiple manipulators, with a gripper at the end of each manipulator. We make two key contributions: First, given an external force, we propose a method to check the feasibility of gripper placements on the object to resist the force. This method takes into account both the manipulator joint torque limits and the force/torque limits of the grippers-object system. Second, given a sequence of external forces, we propose a planner that generates a sequence of gripper placements and manipulator configurations, along with the motion to move between these configurations. The planner minimizes the number of different gripper placements required to resist the forces, resulting in efficient planning and execution. We perform experiments to verify the predictions of our feasibility checking method and to measure the performance of our planner.
• Although the word-popularity based negative sampler has shown superb performance in the skip-gram model, the theoretical motivation behind oversampling popular (non-observed) words as negative samples is still not well understood. In this paper, we start from an investigation of the gradient vanishing issue in the skip-gram model without a proper negative sampler. By performing an insightful analysis from the stochastic gradient descent (SGD) learning perspective, we demonstrate that, both theoretically and intuitively, negative samples with larger inner product scores are more informative than those with lower scores for the SGD learner in terms of both convergence rate and accuracy. Understanding this, we propose an alternative sampling algorithm that dynamically selects informative negative samples during each SGD update. More importantly, the proposed sampler accounts for multi-dimensional self-embedded features during the sampling process, which essentially makes it more effective than the original popularity-based (one-dimensional) sampler. Empirical experiments further verify our observations, and show that our fine-grained samplers gain significant improvement over the existing ones without increasing computational complexity.
• The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using the Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.
• 2D face analysis techniques, such as face landmarking, face recognition and face verification, are reasonably dependent on illumination conditions which are usually uncontrolled and unpredictable in the real world. An illumination robust preprocessing method thus remains a significant challenge in reliable face analysis. In this paper we propose a novel approach for improving lighting normalization through building the underlying reflectance model which characterizes interactions between skin surface, lighting source and camera sensor, and elaborates the formation of face color appearance. Specifically, the proposed illumination processing pipeline enables the generation of Chromaticity Intrinsic Image (CII) in a log chromaticity space which is robust to illumination variations. Moreover, as an advantage over most prevailing methods, a photo-realistic color face image is subsequently reconstructed which eliminates a wide variety of shadows whilst retaining the color information and identity details. Experimental results under different scenarios and using various face databases show the effectiveness of the proposed approach to deal with lighting variations, including both soft and hard shadows, in face recognition.
• In this paper we propose a new Koopman operator approach to the decomposition of nonlinear dynamical systems using Koopman Gramians. We introduce the notion of an input-Koopman operator, and show how input-Koopman operators can be used to cast a nonlinear system into the classical state-space form, and identify conditions under which input and state observable functions are well separated. We then extend an existing method of dynamic mode decomposition for learning Koopman operators from data known as deep dynamic mode decomposition to systems with controls or disturbances. We illustrate the accuracy of the method in learning an input-state separable Koopman operator for an example system, even when the underlying system exhibits mixed state-input terms. We next introduce a nonlinear decomposition algorithm, based on Koopman Gramians, that maximizes internal subsystem observability and disturbance rejection from unwanted noise from other subsystems. We derive a relaxation based on Koopman Gramians and multi-way partitioning for the resulting NP-hard decomposition problem. We lastly illustrate the proposed algorithm with the swing dynamics for an IEEE 39-bus system.
• Mobile Edge Computing (MEC) pushes computing functionalities away from the centralized cloud to the proximity of data sources, thereby reducing service provision latency and saving backhaul network bandwidth. Although computation offloading has been extensively studied in the literature, service caching is an equally, if not more, important design topic of MEC, yet receives much less attention. Service caching refers to caching application services and their related data (libraries/databases) in the edge server, e.g. MEC-enabled Base Station (BS), enabling corresponding computation tasks to be executed. Since only a small number of services can be cached in resource-limited edge server at the same time, which services to cache has to be judiciously decided to maximize the system performance. In this paper, we investigate collaborative service caching in MEC-enabled dense small cell (SC) networks. We propose an efficient decentralized algorithm, called CSC (Collaborative Service Caching), where a network of small cell BSs optimize service caching collaboratively to address a number of key challenges in MEC systems, including service heterogeneity, spatial demand coupling, and decentralized coordination. Our algorithm is developed based on parallel Gibbs sampling by exploiting the special structure of the considered problem using graphing coloring. The algorithm significantly improves the time efficiency compared to conventional Gibbs sampling, yet guarantees provable convergence and optimality. CSC is further extended to the SC network with selfish BSs, where a coalitional game is formulated to incentivize collaboration. A coalition formation algorithm is developed by employing the merge-and-split rules and ensures the stability of the SC coalitions.
• Recently, the low rank and sparse (LRS) matrix decomposition has been introduced as an effective mean to solve the multi-view registration. It views each available relative motion as a block element to reconstruct one matrix so as to approximate the low rank matrix, where global motions can be recovered for multi-view registration. However, this approach is sensitive to the sparsity of the reconstructed matrix and it treats all block elements equally in spite of their varied reliability. Therefore, this paper proposes an effective approach for multi-view registration by the weighted LRS decomposition. Based on the anti-symmetry property of relative motions, it firstly proposes a completion strategy to reduce the sparsity of the reconstructed matrix. The reduced sparsity of reconstructed matrix can improve the robustness of LRS decomposition. Then, it proposes the weighted LRS decomposition, where each block element is assigned with one estimated weight to denote its reliability. By introducing the weight, more accurate registration results can be recovered from the estimated low rank matrix with good efficiency. Experimental results tested on public data sets illustrate the superiority of the proposed approach over the state-of-the-art approaches on robustness, accuracy, and efficiency.
• A Triangle Generative Adversarial Network ($\Delta$-GAN) is developed for semi-supervised cross-domain joint distribution matching, where the training data consists of samples from each domain, and supervision of domain correspondence is provided by only a few paired samples. $\Delta$-GAN consists of four neural networks, two generators and two discriminators. The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs. The generators and discriminators are trained together using adversarial learning. Under mild assumptions, in theory the joint distributions characterized by the two generators concentrate to the data distribution. In experiments, three different kinds of domain pairs are considered, image-label, image-image and image-attribute pairs. Experiments on semi-supervised image classification, image-to-image translation and attribute-based image generation demonstrate the superiority of the proposed approach.
• Recommender systems take the key responsibility to help users discover items that they might be interested in. Many recommendation algorithms are built upon similarity measures, which usually result in low intra-list diversity. The deficiency in capturing the whole range of user interest often leads to poor satisfaction. To solve this problem, increasing attention has been paid on improving the diversity of recommendation results in recent years. In this paper, we propose a novel method to improve the diversity of top-$N$ recommendation results based on the determinantal point process (DPP), which is an elegant model for characterizing the repulsion phenomenon. We propose an acceleration algorithm to greatly speed up the process of the result inference, making our algorithm practical for large-scale scenarios. We also incorporate a tunable parameter into the DPP model which allows the users to smoothly control the level of diversity. More diversity metrics are introduced to better evaluate diversification algorithms. We have evaluated our algorithm on several public datasets, and compared it thoroughly with other reference algorithms. Results show that our proposed algorithm provides a much better accuracy-diversity trade-off with comparable efficiency.
• Sep 14 2017 cs.CV arXiv:1709.04295v1
3D face dense tracking aims to find dense inter-frame correspondences in a sequence of 3D face scans and constitutes a powerful tool for many face analysis tasks, e.g., 3D dynamic facial expression analysis. The majority of the existing methods just fit a 3D face surface or model to a 3D target surface without considering temporal information between frames. In this paper, we propose a novel method for densely tracking sequences of 3D face scans, which ex- tends the non-rigid ICP algorithm by adding a novel specific criterion for temporal information. A novel fitting framework is presented for automatically tracking a full sequence of 3D face scans. The results of experiments carried out on the BU4D-FE database are promising, showing that the proposed algorithm outperforms state-of-the-art algorithms for 3D face dense tracking.
• Inspired by the generation power of generative adversarial networks (GANs) in image domains, we introduce a novel hierarchical architecture for learning characteristic topological features from a single arbitrary input graph via GANs. The hierarchical architecture consisting of multiple GANs preserves both local and global topological features and automatically partitions the input graph into representative stages for feature learning. The stages facilitate reconstruction and can be used as indicators of the importance of the associated topological structures. Experiments show that our method produces subgraphs retaining a wide range of topological features, even in early reconstruction stages (unlike a single GAN, which cannot easily identify such features, let alone reconstruct the original graph). This paper is firstline research on combining the use of GANs and graph topological analysis.
• With the rapid development of mobile apps, the availability of a large number of mobile apps in application stores brings challenge to locate appropriate apps for users. Providing accurate mobile app recommendation for users becomes an imperative task. Conventional approaches mainly focus on learning users' preferences and app features to predict the user-app ratings. However, most of them did not consider the interactions among the context information of apps. To address this issue, we propose a broad learning approach for \textbfContext-\textbfAware app recommendation with \textbfTensor \textbfAnalysis (CATA). Specifically, we utilize a tensor-based framework to effectively integrate user's preference, app category information and multi-view features to facilitate the performance of app rating prediction. The multidimensional structure is employed to capture the hidden relationships between multiple app categories with multi-view features. We develop an efficient factorization method which applies Tucker decomposition to learn the full-order interactions within multiple categories and features. Furthermore, we employ a group $\ell_{1}-$norm regularization to learn the group-wise feature importance of each view with respect to each app category. Experiments on two real-world mobile app datasets demonstrate the effectiveness of the proposed method.
• Multilayer network analysis has become a vital tool for understanding different relationships and their interactions in a complex system, where each layer in a multilayer network depicts the topological structure of a group of nodes corresponding to a particular relationship. The interactions among different layers imply how the interplay of different relations on the topology of each layer. For a single-layer network, network embedding methods have been proposed to project the nodes in a network into a continuous vector space with a relatively small number of dimensions, where the space embeds the social representations among nodes. These algorithms have been proved to have a better performance on a variety of regular graph analysis tasks, such as link prediction, or multi-label classification. In this paper, by extending a standard graph mining into multilayer network, we have proposed three methods ("network aggregation," "results aggregation" and "layer co-analysis") to project a multilayer network into a continuous vector space. From the evaluation, we have proved that comparing with regular link prediction methods, "layer co-analysis" achieved the best performance on most of the datasets, while "network aggregation" and "results aggregation" also have better performance than regular link prediction methods.