results for au:Fan_F in:cs

- Oct 03 2017 cs.DB arXiv:1710.00204v1Even though many machine algorithms have been proposed for entity resolution, it remains very challenging to find a solution with quality guarantees. In this paper, we propose a novel HUman and Machine cOoperative (HUMO) framework for entity resolution (ER), which divides an ER workload between machine and human. HUMO enables a mechanism for quality control that can flexibly enforce both precision and recall levels. We introduce the optimization problem of HUMO, minimizing human cost given a quality requirement, and then present three optimization approaches: a conservative baseline one purely based on the monotonicity assumption of precision, a more aggressive one based on sampling and a hybrid one that can take advantage of the strengths of both previous approaches. Finally, we demonstrate by extensive experiments on real and synthetic datasets that HUMO can achieve high-quality results with reasonable return on investment (ROI) in terms of human cost, and it performs considerably better than the state-of-the-art alternative in quality control.
- The artificial neural network is a popular framework in machine learning. To empower individual neurons, we recently suggested that the current type of neurons could be upgraded to 2nd order counterparts, in which the linear operation between inputs to a neuron and the associated weights is replaced with a nonlinear quadratic operation. A single 2nd order neurons already has a strong nonlinear modeling ability, such as implementing basic fuzzy logic operations. In this paper, we develop a general backpropagation (BP) algorithm to train the network consisting of 2nd-order neurons. The numerical studies are performed to verify of the generalized BP algorithm.
- In machine learning, the use of an artificial neural network is the mainstream approach. Such a network consists of layers of neurons. These neurons are of the same type characterized by the two features: (1) an inner product of an input vector and a matching weighting vector of trainable parameters and (2) a nonlinear excitation function. Here we investigate the possibility of replacing the inner product with a quadratic function of the input vector, thereby upgrading the 1st order neuron to the 2nd order neuron, empowering individual neurons, and facilitating the optimization of neural networks. Also, numerical examples are provided to illustrate the feasibility and merits of the 2nd order neurons. Finally, further topics are discussed.
- The recently proposed generalized epidemic modeling framework (GEMF) \citesahneh2013generalized lays the groundwork for systematically constructing a broad spectrum of stochastic spreading processes over complex networks. This article builds an algorithm for exact, continuous-time numerical simulation of GEMF-based processes. Moreover the implementation of this algorithm, GEMFsim, is available in popular scientific programming platforms such as MATLAB, R, Python, and C; GEMFsim facilitates simulating stochastic spreading models that fit in GEMF framework. Using these simulations one can examine the accuracy of mean-field-type approximations that are commonly used for analytical study of spreading processes on complex networks.
- Mar 16 2015 cs.IR arXiv:1503.03961v1Since the length of microblog texts, such as tweets, is strictly limited to 140 characters, traditional Information Retrieval techniques suffer from the vocabulary mismatch problem severely and cannot yield good performance in the context of microblogosphere. To address this critical challenge, in this paper, we propose a new language modeling approach for microblog retrieval by inferring various types of context information. In particular, we expand the query using knowledge terms derived from Freebase so that the expanded one can better reflect users' search intent. Besides, in order to further satisfy users' real-time information need, we incorporate temporal evidences into the expansion method, which can boost recent tweets in the retrieval results with respect to a given topic. Experimental results on two official TREC Twitter corpora demonstrate the significant superiority of our approach over baseline methods.
- Dec 05 2014 cs.CG arXiv:1412.1680v3Given a real-valued function $f$ defined over a manifold $M$ embedded in $\mathbb{R}^d$, we are interested in recovering structural information about $f$ from the sole information of its values on a finite sample $P$. Existing methods provide approximation to the persistence diagram of $f$ when geometric noise and functional noise are bounded. However, they fail in the presence of aberrant values, also called outliers, both in theory and practice. We propose a new algorithm that deals with outliers. We handle aberrant functional values with a method inspired from the k-nearest neighbors regression and the local median filtering, while the geometric outliers are handled using the distance to a measure. Combined with topological results on nested filtrations, our algorithm performs robust topological analysis of scalar fields in a wider range of noise models than handled by current methods. We provide theoretical guarantees and experimental results on the quality of our approximation of the sampled scalar field.
- Detecting the dimension of a hidden manifold from a point sample has become an important problem in the current data-driven era. Indeed, estimating the shape dimension is often the first step in studying the processes or phenomena associated to the data. Among the many dimension detection algorithms proposed in various fields, a few can provide theoretical guarantee on the correctness of the estimated dimension. However, the correctness usually requires certain regularity of the input: the input points are either uniformly randomly sampled in a statistical setting, or they form the so-called $(\varepsilon,\delta)$-sample which can be neither too dense nor too sparse. Here, we propose a purely topological technique to detect dimensions. Our algorithm is provably correct and works under a more relaxed sampling condition: we do not require uniformity, and we also allow Hausdorff noise. Our approach detects dimension by determining local homology. The computation of this topological structure is much less sensitive to the local distribution of points, which leads to the relaxation of the sampling conditions. Furthermore, by leveraging various developments in computational topology, we show that this local homology at a point $z$ can be computed \emphexactly for manifolds using Vietoris-Rips complexes whose vertices are confined within a local neighborhood of $z$. We implement our algorithm and demonstrate the accuracy and robustness of our method using both synthetic and real data sets.
- The efficiency of extracting topological information from point data depends largely on the complex that is built on top of the data points. From a computational viewpoint, the most favored complexes for this purpose have so far been Vietoris-Rips and witness complexes. While the Vietoris-Rips complex is simple to compute and is a good vehicle for extracting topology of sampled spaces, its size is huge--particularly in high dimensions. The witness complex on the other hand enjoys a smaller size because of a subsampling, but fails to capture the topology in high dimensions unless imposed with extra structures. We investigate a complex called the \em graph induced complex that, to some extent, enjoys the advantages of both. It works on a subsample but still retains the power of capturing the topology as the Vietoris-Rips complex. It only needs a graph connecting the original sample points from which it builds a complex on the subsample thus taming the size considerably. We show that, using the graph induced complex one can (i) infer the one dimensional homology of a manifold from a very lean subsample, (ii) reconstruct a surface in three dimension from a sparse subsample without computing Delaunay triangulations, (iii) infer the persistent homology groups of compact sets from a sufficiently dense sample. We provide experimental evidences in support of our theory.
- Algorithms for persistent homology and zigzag persistent homology are well-studied for persistence modules where homomorphisms are induced by inclusion maps. In this paper, we propose a practical algorithm for computing persistence under $\mathbb{Z}_2$ coefficients for a sequence of general simplicial maps and show how these maps arise naturally in some applications of topological data analysis. First, we observe that it is not hard to simulate simplicial maps by inclusion maps but not necessarily in a monotone direction. This, combined with the known algorithms for zigzag persistence, provides an algorithm for computing the persistence induced by simplicial maps. Our main result is that the above simple minded approach can be improved for a sequence of simplicial maps given in a monotone direction. A simplicial map can be decomposed into a set of elementary inclusions and vertex collapses--two atomic operations that can be supported efficiently with the notion of simplex annotations for computing persistent homology. A consistent annotation through these atomic operations implies the maintenance of a consistent cohomology basis, hence a homology basis by duality. While the idea of maintaining a cohomology basis through an inclusion is not new, maintaining them through a vertex collapse is new, which constitutes an important atomic operation for simulating simplicial maps. Annotations support the vertex collapse in addition to the usual inclusion quite naturally. Finally, we exhibit an application of this new tool in which we approximate the persistence diagram of a filtration of Rips complexes where vertex collapses are used to tame the blow-up in size.