results for au:Agarwal_P in:cs

- Oct 16 2017 cs.CV arXiv:1710.04803v1Biometric identification systems have become immensely popular and important because of their high reliability and efficiency. However, person identification at a distance still remains a challenging problem. Gait can be seen as an essential biometric feature for human recognition and identification: it can be easily acquired from a distance and does not require any user cooperation, making it suitable for surveillance. However, the task of recognizing an individual from gait can be adversely affected by varying viewpoints, which makes it even more challenging. Our proposed approach tackles this problem by identifying spatio-temporal features, supported by extensive experimentation and a dedicated training mechanism. In this paper, we propose a 3-D convolutional deep neural network for person identification from gait under multiple views. It is a 2-stage network: a classification network first identifies the viewing angle, after which a second set of networks (one per angle) identifies the person under that particular viewing angle. We have tested this network on the publicly available CASIA-B database and achieved state-of-the-art results. The proposed system is much more efficient in terms of time and space, and performs better for almost all angles.
- Sep 06 2017 cs.LG arXiv:1709.01073v2We consider the problem of estimating the remaining useful life (RUL) of a system or a machine from sensor data. Many approaches for RUL estimation based on sensor data make assumptions about how machines degrade. Additionally, sensor data from machines is noisy and often suffers from missing values in many practical settings. We propose Embed-RUL: a novel approach for RUL estimation from sensor data that does not rely on any degradation-trend assumptions, is robust to noise, and handles missing values. Embed-RUL utilizes a sequence-to-sequence model based on Recurrent Neural Networks (RNNs) to generate embeddings for multivariate time series subsequences. The embeddings for normal and degraded machines tend to be different, and are therefore found to be useful for RUL estimation. We show that the embeddings capture the overall pattern in the time series while filtering out the noise, so that the embeddings of two machines with similar operational behavior are close to each other, even when their sensor readings have significant and varying levels of noise content. We perform experiments on a publicly available turbofan engine dataset and a proprietary real-world dataset, and demonstrate that Embed-RUL outperforms the previously reported state-of-the-art on several metrics.
- Jun 28 2017 cs.LG arXiv:1706.08838v1Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence-to-sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from the UCR TSC Archive and industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddings given by a domain-specific RNN, as well as (ii) a nearest neighbor classifier based on Dynamic Time Warping.
- Jun 23 2017 cs.AI arXiv:1706.07160v2Explaining the behavior of a black-box machine learning model at the instance level is useful for building trust. However, what is also important is understanding how the model behaves globally. Such an understanding provides insight into both the data on which the model was trained and the generalization power of the rules it learned. We present here an approach that learns rules to globally explain the behavior of black-box machine learning models. Collectively these rules represent the logic learned by the model and are hence useful for gaining insight into its behavior. We demonstrate the power of the approach on three publicly available data sets.
- We study a path-planning problem amid a set $\mathcal{O}$ of obstacles in $\mathbb{R}^2$, in which we wish to compute a short path between two points while also maintaining a high clearance from $\mathcal{O}$; the clearance of a point is its distance from a nearest obstacle in $\mathcal{O}$. Specifically, the problem asks for a path minimizing the reciprocal of the clearance integrated over the length of the path. We present the first polynomial-time approximation scheme for this problem. Let $n$ be the total number of obstacle vertices and let $\varepsilon \in (0,1]$. Our algorithm computes in time $O(\frac{n^2}{\varepsilon ^2} \log \frac{n}{\varepsilon})$ a path of total cost at most $(1+\varepsilon)$ times the cost of the optimal path.
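The cost functional being minimized can be evaluated numerically for any fixed polygonal path: integrate the reciprocal clearance with respect to arc length over each segment. The sketch below is only an illustration of the objective, not of the paper's approximation scheme, and it models obstacles as points for simplicity (the function names are my own):

```python
import numpy as np

def clearance(p, obstacles):
    # Distance from p to the nearest obstacle; obstacles are modeled
    # as points here for simplicity (the paper allows polygonal ones).
    return min(np.linalg.norm(p - o) for o in obstacles)

def path_cost(path, obstacles, steps=200):
    # Trapezoidal approximation of the cost: the integral of
    # 1/clearance over the length of each segment of the path.
    total = 0.0
    for a, b in zip(path[:-1], path[1:]):
        ts = np.linspace(0.0, 1.0, steps)
        pts = a + np.outer(ts, b - a)
        vals = np.array([1.0 / clearance(p, obstacles) for p in pts])
        ds = np.linalg.norm(b - a) / (steps - 1)
        total += ds * (vals[0] / 2 + vals[1:-1].sum() + vals[-1] / 2)
    return total

obstacle = [np.array([0.0, 0.0])]
near = [np.array([-1.0, 0.1]), np.array([1.0, 0.1])]  # hugs the obstacle
far = [np.array([-1.0, 1.0]), np.array([1.0, 1.0])]   # keeps clearance
assert path_cost(near, obstacle) > path_cost(far, obstacle)
```

The assertion illustrates the trade-off the problem formalizes: a slightly longer path with high clearance is cheaper than a short path that grazes an obstacle.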
- A regret minimizing set Q is a small-size representation of a much larger database P so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors. We show that k-regret minimization is NP-Complete for all dimensions d >= 3. This settles an open problem from Chester et al. [VLDB 2014], and resolves the complexity status of the problem for all d: the problem is known to have a polynomial-time solution for d <= 2. In addition, we propose two new approximation schemes for regret minimization, both with provable guarantees, one based on coresets and another based on hitting sets. We also carry out extensive experimental evaluation, and show that our schemes compute regret-minimizing sets comparable in size to those of the greedy algorithm proposed in [VLDB 2014], but our schemes are significantly faster and scalable to large data sets.
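For intuition, the k-regret ratio for one weight vector is straightforward to compute, and the worst case over all weight vectors can be estimated by sampling. This is a minimal sketch of the quantity being minimized, not of the coreset or hitting-set schemes themselves:

```python
import numpy as np

def k_regret_ratio(P, Q, w, k=1):
    # Regret of answering the top-1 query on Q instead of the
    # top-k query on the full dataset P, for one weight vector w.
    best_k_P = np.sort(P @ w)[::-1][k - 1]  # score of the top-k item in P
    best_Q = (Q @ w).max()                  # score of the top-1 item in Q
    return max(0.0, (best_k_P - best_Q) / best_k_P)

def max_regret(P, Q, k=1, trials=5000, seed=0):
    # Monte-Carlo estimate of the worst-case regret ratio over
    # nonnegative user weight vectors -- the quantity to be minimised.
    rng = np.random.default_rng(seed)
    return max(k_regret_ratio(P, Q, w, k)
               for w in rng.random((trials, P.shape[1])))

P = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
assert max_regret(P, P) == 0.0  # keeping the whole dataset: no regret
print(max_regret(P, P[[0]]))    # keeping only the first item: large regret
```

Dropping the second row hurts any user whose weight vector favors the second attribute, which is why a single representative subset must hedge against all weight vectors at once.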
- Jan 05 2017 cs.DB arXiv:1701.01094v1Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product 'SKUs', or follow different norms for categorization. Record linkage techniques can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate matches. In this paper, we present an ensemble model combining minimal supervision using Bayesian network models together with unsupervised textual matching for automating such 'attribute fusion'. We present results of our approach on a large volume of real-life data from a market-research scenario and compare with a standard record matching algorithm. Finally, we illustrate how attribute fusion using machine learning could be included as a data-lake management feature, especially as our approach also provides confidence values for matches, enabling human intervention, if required.
- Jan 04 2017 cs.LG arXiv:1701.00597v1Discovering causal models from observational and interventional data is an important first step preceding what-if analysis or counterfactual reasoning. As has been shown before, the direction of pairwise causal relations can, under certain conditions, be inferred from observational data via standard gradient-boosted classifiers (GBC) using carefully engineered statistical features. In this paper we apply deep convolutional neural networks (CNNs) to this problem by plotting attribute pairs as 2-D scatter plots that are fed to the CNN as images. We evaluate our approach on the 'Cause-Effect Pairs' NIPS 2013 Data Challenge. We observe that a weighted ensemble of the CNN with the earlier GBC approach yields significant improvement. Further, we observe that when less training data is available, our approach performs better than the GBC-based approach, suggesting that CNN models pre-trained to determine the direction of pairwise causal relations could have wider applicability in causal discovery and in enabling what-if or counterfactual analysis.
- Dec 21 2016 cs.AI arXiv:1612.06528v1We investigate solving discrete optimisation problems using the estimation of distribution (EDA) approach via a novel combination of deep belief networks (DBNs) and inductive logic programming (ILP). While DBNs are used to learn the structure of successively better feasible solutions, ILP enables the incorporation of domain-based background knowledge related to the goodness of solutions. Recent work showed that ILP could be an effective way to use domain knowledge in an EDA scenario. However, in a purely ILP-based EDA, sampling successive populations is either inefficient or not straightforward. In our neuro-symbolic EDA, an ILP engine is used to construct a model for good solutions using domain-based background knowledge. These rules are introduced as Boolean features in the last hidden layer of the DBNs used for EDA-based optimization. This incorporation of logical ILP features requires some changes while training and sampling from DBNs: (a) our DBNs need to be trained with data for units at the input layer as well as some units in an otherwise hidden layer, and (b) we would like the samples generated to be drawn from instances entailed by the logical model. We demonstrate the viability of our approach on instances of two optimisation problems: predicting optimal depth-of-win for the KRK endgame, and job-shop scheduling. Our results are promising: (i) on each iteration of distribution estimation, samples obtained with an ILP-assisted DBN have a substantially greater proportion of good solutions than samples generated using a DBN without ILP features, and (ii) on termination of distribution estimation, samples obtained using an ILP-assisted DBN contain more near-optimal samples than samples from a DBN without ILP features. These results suggest that the use of ILP-constructed theories could be useful for incorporating complex domain knowledge into deep models for estimation-of-distribution-based procedures.
- Many approaches for estimating the Remaining Useful Life (RUL) of a machine from its operational sensor data make assumptions about how a system degrades or a fault evolves, e.g., exponential degradation. However, in many domains degradation may not follow a pattern. We propose a Long Short Term Memory based Encoder-Decoder (LSTM-ED) scheme to obtain an unsupervised health index (HI) for a system using multi-sensor time-series data. LSTM-ED is trained to reconstruct the time-series corresponding to the healthy state of a system. The reconstruction error is used to compute the HI, which is then used for RUL estimation. We evaluate our approach on the publicly available Turbofan Engine and Milling Machine datasets. We also present results on a real-world industry dataset from a pulverizer mill, where we find significant correlation between the LSTM-ED based HI and maintenance costs.
- Aug 04 2016 cs.AI arXiv:1608.01093v2Our interest in this paper is in optimisation problems that are intractable to solve by direct numerical optimisation, but nevertheless have significant amounts of relevant domain-specific knowledge. The category of heuristic search techniques known as estimation of distribution algorithms (EDAs) seeks to incrementally sample from probability distributions in which optimal (or near-optimal) solutions have increasingly higher probabilities. Can we use domain knowledge to assist the estimation of these distributions? To answer this in the affirmative, we need: (a) a general-purpose technique for the incorporation of domain knowledge when constructing models for optimal values; and (b) a way of using these models to generate new data samples. Here we investigate a combination of the use of Inductive Logic Programming (ILP) for (a), and standard logic-programming machinery to generate new samples for (b). Specifically, on each iteration of distribution estimation, an ILP engine is used to construct a model for good solutions. The resulting theory is then used to guide the generation of new data instances, which are now restricted to those derivable using the ILP model in conjunction with the background knowledge. We demonstrate the approach on two optimisation problems (predicting optimal depth-of-win for the KRK endgame, and job-shop scheduling). Our results are promising: (a) on each iteration of distribution estimation, samples obtained with an ILP theory have a substantially greater proportion of good solutions than samples without a theory; and (b) on termination of distribution estimation, samples obtained with an ILP theory contain more near-optimal samples than samples without a theory. Taken together, these results suggest that the use of ILP-constructed theories could be a useful technique for incorporating complex domain knowledge into estimation of distribution procedures.
- Jul 13 2016 cs.MA arXiv:1607.03340v1This paper addresses the issues concerning the rescheduling of a static timetable in case of a disaster in a large and complex railway network. The proposed approach modifies the schedule so as to minimise the overall delay of trains. This is achieved by representing the rescheduling problem as a Petri net, while the highly uncertain disaster recovery times in the model are handled as Markov Decision Processes (MDPs). To solve the rescheduling problem, a Distributed Constraint Optimisation (DCOP) based strategy involving autonomous agents is used to generate the desired schedule. The proposed approach is evaluated on the actual schedule of the Eastern Railways, India, by constructing various disaster scenarios using the Java Agent DEvelopment Framework (JADE). When compared to existing approaches, the proposed framework substantially reduces the delay of trains after rescheduling.
- Mechanical devices such as engines, vehicles, and aircraft are typically instrumented with numerous sensors to capture the behavior and health of the machine. However, there are often external factors or variables which are not captured by sensors, leading to time-series which are inherently unpredictable. For instance, manual controls and/or unmonitored environmental conditions or load may lead to inherently unpredictable time-series. Detecting anomalies in such scenarios becomes challenging using standard approaches based on mathematical models that rely on stationarity, or prediction models that utilize prediction errors to detect anomalies. We propose a Long Short Term Memory Networks based Encoder-Decoder scheme for Anomaly Detection (EncDec-AD) that learns to reconstruct 'normal' time-series behavior, and thereafter uses reconstruction error to detect anomalies. We experiment with three publicly available quasi-predictable time-series datasets: power demand, space shuttle, and ECG, and two real-world engine datasets with both predictable and unpredictable behavior. We show that EncDec-AD is robust and can detect anomalies from predictable, unpredictable, periodic, aperiodic, and quasi-periodic time-series. Further, we show that EncDec-AD is able to detect anomalies from short time-series (length as small as 30) as well as long time-series (length as large as 500).
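The scoring step of EncDec-AD can be sketched as follows. Training an LSTM encoder-decoder is out of scope here, so a moving-average 'reconstruction' stands in for it (an assumption purely for illustration); the idea of scoring reconstruction errors against statistics fitted on anomaly-free data is the abstract's:

```python
import numpy as np

def reconstruct(x, window=5):
    # Stand-in for the trained encoder-decoder: a centred moving average
    # reproduces smooth 'normal' behaviour but not abrupt anomalies.
    return np.convolve(x, np.ones(window) / window, mode="same")

def anomaly_scores(x, mu, sigma):
    # Score pointwise reconstruction errors against error statistics
    # fitted on anomaly-free data, as EncDec-AD does.
    e = np.abs(x - reconstruct(x))
    return ((e - mu) / sigma) ** 2

# Fit error statistics on a clean periodic signal ...
t = np.linspace(0, 8 * np.pi, 400)
normal = np.sin(t)
e_norm = np.abs(normal - reconstruct(normal))
mu, sigma = e_norm.mean(), e_norm.std() + 1e-9

# ... then score a copy with an injected spike.
test = normal.copy()
test[200] += 3.0
scores = anomaly_scores(test, mu, sigma)
print(int(np.argmax(scores)))  # 200: the spike scores highest
```

Because the reconstruction is learned only from normal behavior, anomalies need not be predictable to be detected; they only need to reconstruct poorly.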
- Jun 02 2016 cs.CG arXiv:1606.00112v1Nearest-neighbor search, which returns the nearest neighbor of a query point in a set of points, is an important and widely studied problem in many fields, and it has a wide range of applications. In many of them, such as sensor databases, location-based services, face recognition, and mobile data, the location of data is imprecise. We therefore study nearest-neighbor queries in a probabilistic framework in which the location of each input point is specified as a probability distribution function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability, and (ii) estimating the probability of a point being the nearest neighbor of a query point, either exactly or within a specified additive error.
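As a sketch of the second query type, the probability of each uncertain point being the nearest neighbor can be estimated by Monte Carlo sampling over discrete location distributions. The paper gives exact and bounded-error algorithms; the estimator below is only an illustrative baseline, and its input format is an assumption:

```python
import numpy as np

def nn_probabilities(query, points, trials=5000, seed=1):
    # Each uncertain point is (locations, probs): a discrete distribution
    # over its possible locations. Estimate, for every point, the
    # probability that it realizes as the nearest neighbor of `query`.
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(points))
    for _ in range(trials):
        drawn = [np.asarray(locs[rng.choice(len(locs), p=probs)])
                 for locs, probs in points]
        dists = [np.linalg.norm(loc - query) for loc in drawn]
        counts[int(np.argmin(dists))] += 1
    return counts / trials

query = np.array([0.0, 0.0])
points = [
    ([(1.0, 0.0), (5.0, 0.0)], [0.5, 0.5]),  # nearer half of the time
    ([(2.0, 0.0)], [1.0]),                   # fixed location
]
p = nn_probabilities(query, points)
print(p.round(2))  # close to [0.5, 0.5]
```

In this toy instance the first point is the nearest neighbor exactly when it realizes at distance 1, which happens with probability 0.5, so both estimates converge to one half.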
- Large-scale graph-structured data arising from social networks, databases, knowledge bases, web graphs, etc. is now available for analysis and mining. Graph-mining often involves 'relationship queries', which seek a ranked list of interesting interconnections among a given set of entities, corresponding to nodes in the graph. While relationship queries have been studied for many years, using various terminologies, e.g., keyword-search, Steiner-tree in a graph, etc., the solutions proposed in the literature so far have not focused on scaling relationship queries to large graphs having billions of nodes and edges, such as are now publicly available in the form of 'linked-open-data'. In this paper, we present an algorithm for distributed keyword search (DKS) on large graphs, based on the graph-parallel computing paradigm Pregel. We also present an analytical proof that our algorithm produces an optimally ranked list of answers if run to completion. Even if terminated early, our algorithm produces approximate answers along with bounds. We describe an optimized implementation of our DKS algorithm along with time-complexity analysis. Finally, we report and analyze experiments using an implementation of DKS on Giraph, the graph-parallel computing framework based on Pregel, and demonstrate that we can efficiently process relationship queries on large-scale subsets of linked-open-data.
- We give the first subquadratic-time approximation schemes for dynamic time warping (DTW) and edit distance (ED) of several natural families of point sequences in $\mathbb{R}^d$, for any fixed $d \ge 1$. In particular, our algorithms compute $(1+\varepsilon)$-approximations of DTW and ED in time near-linear for point sequences drawn from $\kappa$-packed or $\kappa$-bounded curves, and subquadratic for backbone sequences. Roughly speaking, a curve is $\kappa$-packed if the length of its intersection with any ball of radius $r$ is at most $\kappa \cdot r$, and a curve is $\kappa$-bounded if the sub-curve between two curve points does not go too far from the two points compared to the distance between the two points. In backbone sequences, consecutive points are spaced at approximately equal distances apart, and no two points lie very close together. Recent results suggest that a subquadratic algorithm for DTW or ED is unlikely for an arbitrary pair of point sequences even for $d=1$. Our algorithms work by constructing a small set of rectangular regions that cover the entries of the dynamic programming table commonly used for these distance measures. The weights of entries inside each rectangle are roughly the same, so we are able to use efficient procedures to approximately compute the cheapest paths through these rectangles.
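For reference, the quadratic dynamic program whose table these rectangles cover looks as follows for two 1-D sequences (exact DTW, not the $(1+\varepsilon)$-approximation):

```python
import numpy as np

def dtw(a, b):
    # Classical O(nm) dynamic program: D[i][j] is the DTW distance
    # between prefixes a[:i] and b[:j] under the |a_i - b_j| cost.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeat
```

The approximation schemes exploit the observation that, for the sequence families above, large rectangular blocks of this table carry nearly uniform costs, so the cheapest path can be approximated without filling every entry.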
- Sep 21 2015 cs.CG arXiv:1509.05751v2The Gromov-Hausdorff (GH) distance is a natural way to measure distance between two metric spaces. We prove that it is $\mathrm{NP}$-hard to approximate the Gromov-Hausdorff distance better than a factor of $3$ for geodesic metrics on a pair of trees. We complement this result by providing a polynomial time $O(\min\{n, \sqrt{rn}\})$-approximation algorithm for computing the GH distance between a pair of metric trees, where $r$ is the ratio of the longest edge length in both trees to the shortest edge length. For metric trees with unit length edges, this yields an $O(\sqrt{n})$-approximation algorithm.
- Let $P$ be a set of $n$ points in $\mathrm{R}^2$, and let $\mathrm{DT}(P)$ denote its Euclidean Delaunay triangulation. We introduce the notion of an edge of $\mathrm{DT}(P)$ being stable. Defined in terms of a parameter $\alpha>0$, a Delaunay edge $pq$ is called $\alpha$-stable if the (equal) angles at which $p$ and $q$ see the corresponding Voronoi edge $e_{pq}$ are at least $\alpha$. A subgraph $G$ of $\mathrm{DT}(P)$ is called a $(c\alpha, \alpha)$-stable Delaunay graph ($\mathrm{SDG}$ in short), for some constant $c \ge 1$, if every edge in $G$ is $\alpha$-stable and every $c\alpha$-stable edge of $\mathrm{DT}(P)$ is in $G$. We show that if an edge is stable in the Euclidean Delaunay triangulation of $P$, then it is also a stable edge, though for a different value of $\alpha$, in the Delaunay triangulation of $P$ under any convex distance function that is sufficiently close to the Euclidean norm, and vice-versa. In particular, a $6\alpha$-stable edge in $\mathrm{DT}(P)$ is $\alpha$-stable in the Delaunay triangulation under the distance function induced by a regular $k$-gon for $k \ge 2\pi/\alpha$, and vice-versa. Exploiting this relationship and the analysis in a companion paper, we present a linear-size kinetic data structure (KDS) for maintaining an $(8\alpha,\alpha)$-$\mathrm{SDG}$ as the points of $P$ move. If the points move along algebraic trajectories of bounded degree, the KDS processes a nearly quadratic number of events during the motion, each of which can be processed in $O(\log n)$ time. Finally, we show that a number of useful properties of $\mathrm{DT}(P)$ are retained by the $\mathrm{SDG}$ of $P$.
- Robot localization is one of the most important problems in robotics. Most existing approaches assume that a map of the environment is available beforehand and focus on accurate metrical localization. In this paper, we address the localization problem when the map of the environment is not available beforehand, and the robot relies on a hand-drawn map from a non-expert user. We address this problem by expressing the robot pose in pixel coordinates and simultaneously estimating a local deformation of the hand-drawn map. Experiments show that we are able to localize the robot in the correct room with a success rate of up to 80%.
- Accurate metrical localization is one of the central challenges in mobile robotics. Many existing methods aim at localizing after building a map with the robot. In this paper, we present a novel approach that instead uses geotagged panoramas from the Google Street View as a source of global positioning. We model the problem of localization as a non-linear least squares estimation in two phases. The first estimates the 3D position of tracked feature points from short monocular camera sequences. The second computes the rigid body transformation between the Street View panoramas and the estimated points. The only input of this approach is a stream of monocular camera images and odometry estimates. We quantified the accuracy of the method by running the approach on a robotic platform in a parking lot by using visual fiducials as ground truth. Additionally, we applied the approach in the context of personal localization in a real urban scenario by using data from a Google Tango tablet.
- While analyzing vehicular sensor data, we found that frequently occurring waveforms could serve as features for further analysis, such as rule mining, classification, and anomaly detection. The discovery of waveform patterns, also known as time-series motifs, has been studied extensively; however, available techniques for discovering frequently occurring time-series motifs were found lacking in either efficiency or quality: standard subsequence clustering results in poor quality, to the extent that it has even been termed 'meaningless'. Variants of hierarchical clustering using techniques for efficient discovery of 'exact pair motifs' find high-quality frequent motifs, but at the cost of high computational complexity, making such techniques unusable for our voluminous vehicular sensor data. We show that good-quality frequent motifs can be discovered using bounded spherical clustering of time-series subsequences, which we refer to as COIN clustering, with near-linear complexity in time-series size. COIN clustering addresses many of the challenges that previously led to subsequence clustering being viewed as meaningless. We describe an end-to-end motif-discovery procedure using a sequence of pre- and post-processing techniques that remove trivial matches and shifted motifs, which also plagued previous subsequence-clustering approaches. We demonstrate that our technique efficiently discovers frequent motifs in voluminous vehicular sensor data as well as in publicly available data sets.
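A one-pass bounded spherical ('leader') clustering of z-normalized subsequences can be sketched as below. This simplification omits COIN's pruning of trivial matches and shifted motifs, and the radius value is an assumption for the toy example:

```python
import numpy as np

def sliding_windows(x, w):
    # z-normalized length-w subsequences (stride 1)
    subs = np.lib.stride_tricks.sliding_window_view(x, w).astype(float)
    subs = subs - subs.mean(axis=1, keepdims=True)
    std = subs.std(axis=1, keepdims=True)
    return subs / np.where(std == 0, 1.0, std)

def spherical_cluster(subs, radius):
    # One-pass 'leader' clustering: a subsequence joins the first
    # center within `radius`, otherwise it founds a new cluster.
    centers, members = [], []
    for i, s in enumerate(subs):
        for c, idx in zip(centers, members):
            if np.linalg.norm(s - c) <= radius:
                idx.append(i)
                break
        else:
            centers.append(s)
            members.append([i])
    return centers, members

# A motif repeated five times is recovered as one large cluster.
signal = np.tile(np.sin(np.linspace(0, 2 * np.pi, 20)), 5)
centers, members = spherical_cluster(sliding_windows(signal, 20), 1.0)
print(sorted(members[0])[:5])  # occurrences at offsets 0, 20, 40, ...
```

Each subsequence touches each cluster center at most once, which is the source of the near-linear behavior when the number of distinct motifs is small.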
- All multi-component product manufacturing companies face the problem of warranty cost estimation. Failure rate analysis of components plays a key role in this problem. The data source used for failure rate analysis has traditionally been past failure data of components. However, failure rate analysis can be improved by fusing additional information, such as symptoms observed during after-sale service of the product, geographical information (hilly or plains areas), and information from tele-diagnostic analytics. In this paper, we propose an approach which learns the dependency between part failures and symptoms gleaned from such diverse sources of information, to predict the expected number of failures with better accuracy. We also indicate how the optimum warranty period can be computed. We demonstrate, through empirical results, that our method can improve warranty cost estimates significantly.
- Aug 19 2014 cs.LG arXiv:1408.3733v1Vehicular sensor data consists of multiple time-series arising from a number of sensors. Using such multi-sensor data we would like to detect occurrences of specific events that vehicles encounter, e.g., corresponding to particular maneuvers that a vehicle makes or conditions that it encounters. Events are characterized by similar waveform patterns re-appearing within one or more sensors. Further, such patterns can be of variable duration. In this work, we propose a method for detecting such events in time-series data using a novel feature descriptor motivated by similar ideas in image processing. We define the shape histogram: a constant-dimension descriptor that nevertheless captures patterns of variable duration. We demonstrate the efficacy of using shape histograms as features to detect events in an SVM-based, multi-sensor, supervised learning scenario, i.e., multiple time-series are used to detect an event. We present results on real-life vehicular sensor data and show that our technique performs better than available pattern detection implementations on our data, and that it can also be used to combine features from multiple sensors, resulting in better accuracy than using any single sensor. Since previous work on pattern detection in time-series has been in the single-series context, we also present results using our technique on multiple standard time-series datasets and show that it is the most versatile in terms of how it ranks compared to other published results.
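The key property of the shape histogram, a constant-dimension descriptor computed over variable-duration windows, can be illustrated with a simple variant that bins quantized point-to-point slope angles; the actual binning used in the paper may well differ:

```python
import numpy as np

def shape_histogram(seq, bins=8):
    # Normalized histogram of quantized point-to-point slope angles:
    # the descriptor has `bins` dimensions whatever the window length.
    slopes = np.diff(np.asarray(seq, dtype=float))
    hist, _ = np.histogram(np.arctan(slopes), bins=bins,
                           range=(-np.pi / 2, np.pi / 2))
    return hist / max(hist.sum(), 1)

short = np.sin(np.linspace(0, 2 * np.pi, 30))
longer = np.sin(np.linspace(0, 2 * np.pi, 90))  # same shape, 3x duration
print(shape_histogram(short).shape, shape_histogram(longer).shape)  # (8,) (8,)
```

A fixed-length descriptor like this is what makes variable-duration patterns usable as features in a standard SVM, which expects inputs of constant dimension.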
- Jun 26 2014 cs.CG arXiv:1406.6599v1We study the convex-hull problem in a probabilistic setting, motivated by the need to handle data uncertainty inherent in many applications, including sensor databases, location-based services and computer vision. In our framework, the uncertainty of each input site is described by a probability distribution over a finite number of possible locations, including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of $\beta$-hull that may be a useful representation of uncertain hulls.
- Jun 17 2014 cs.CG arXiv:1406.4005v2We consider maintaining the contour tree $\mathbb{T}$ of a piecewise-linear triangulation $\mathbb{M}$ that is the graph of a time-varying height function $h: \mathbb{R}^2 \rightarrow \mathbb{R}$. We carefully describe the combinatorial changes in $\mathbb{T}$ that happen as $h$ varies over time and how these changes relate to topological changes in $\mathbb{M}$. We present a kinetic data structure that maintains the contour tree of $h$ over time. Our data structure maintains certificates that fail only when $h(v)=h(u)$ for two adjacent vertices $v$ and $u$ in $\mathbb{M}$, or when two saddle vertices lie on the same contour of $\mathbb{M}$. A certificate failure is handled in $O(\log n)$ time. We also show how our data structure can be extended to handle a set of general update operations on $\mathbb{M}$, and how it can be applied to maintain the topological persistence pairs of time-varying functions.
- Let $P$ be a set of $n$ points and $Q$ a convex $k$-gon in ${\mathbb R}^2$. We analyze in detail the topological (or discrete) changes in the structure of the Voronoi diagram and the Delaunay triangulation of $P$, under the convex distance function defined by $Q$, as the points of $P$ move along prespecified continuous trajectories. Assuming that each point of $P$ moves along an algebraic trajectory of bounded degree, we establish an upper bound of $O(k^4n\lambda_r(n))$ on the number of topological changes experienced by the diagrams throughout the motion; here $\lambda_r(n)$ is the maximum length of an $(n,r)$-Davenport-Schinzel sequence, and $r$ is a constant depending on the algebraic degree of the motion of the points. Finally, we describe an algorithm for efficiently maintaining the above structures, using the kinetic data structure (KDS) framework.
- In many government applications we often find that information about entities, such as persons, is available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises having multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide his driving-licence number while applying for a passport, or vice-versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph-traversal, locality-sensitive hashing, iterative match-merge, and graph-clustering to discover unique entities based on a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally while having to re-resolve only a small subset of a previously resolved entity-document collection. We present performance and quality results on two data-sets: a real-world database of companies and a large synthetically generated `population' database. We also demonstrate the benefit of using inter-document references for clustering in the form of enhanced recall of documents for resolution.
- Oct 22 2013 cs.CG arXiv:1310.5647v1 Let $\mathcal{C}=\{C_1,\ldots,C_n\}$ be a set of $n$ pairwise-disjoint convex sets of constant description complexity, and let $\pi$ be a probability density function (pdf for short) over the non-negative reals. For each $i$, let $K_i$ be the Minkowski sum of $C_i$ with a disk of radius $r_i$, where each $r_i$ is a random non-negative number drawn independently from the distribution determined by $\pi$. We show that the expected complexity of the union of $K_1, \ldots, K_n$ is $O(n^{1+\varepsilon})$ for any $\varepsilon > 0$; here the constant of proportionality depends on $\varepsilon$ and on the description complexity of the sets in $\mathcal{C}$, but not on $\pi$. If each $C_i$ is a convex polygon with at most $s$ vertices, then we show that the expected complexity of the union is $O(s^2n\log n)$. Our bounds hold in the stronger model in which we are given an arbitrary multi-set $R=\{r_1,\ldots,r_n\}$ of expansion radii, each a non-negative real number. We assign them to the members of $\mathcal{C}$ by a random permutation, where all permutations are equally likely to be chosen; the expectations are now with respect to these permutations. We also present an application of our results to a problem that arises in analyzing the vulnerability of a network to a physical attack.
- Mar 08 2013 cs.CG arXiv:1303.1585v1 With recent advances in sensing and tracking technology, trajectory data is becoming increasingly pervasive and analysis of trajectory data is becoming exceedingly important. A fundamental problem in analyzing trajectory data is that of identifying common patterns between pairs or among groups of trajectories. In this paper, we consider the problem of identifying similar portions between a pair of trajectories, each observed as a sequence of points sampled from it. We present new measures of trajectory similarity --- both local and global --- between a pair of trajectories to distinguish between similar and dissimilar portions. Our model is robust under noise and outliers, it does not make any assumptions about the sampling rates of either trajectory, and it works even if the trajectories are only partially observed. Additionally, the model yields a scalar similarity score which can be used to rank multiple pairs of trajectories according to similarity, e.g. in clustering applications. We also present efficient algorithms for computing the similarity under our measures; the worst-case running time is quadratic in the number of sample points. Finally, we present an extensive experimental study evaluating the effectiveness of our approach on real datasets, comparing it with earlier approaches, and illustrating many issues that arise in trajectory data. Our experiments show that our approach is highly accurate in distinguishing similar and dissimilar portions as compared to earlier methods, even with sparse sampling.
- We present Roadmap Sparsification by Edge Contraction (RSEC), a simple and effective algorithm for reducing the size of a motion-planning roadmap. The algorithm has minimal effect on the quality of paths that can be extracted from the new roadmap. The primitive operation used by RSEC is edge contraction: the contraction of a roadmap edge to a single vertex and the connection of the new vertex to the neighboring vertices of the contracted edge. For certain scenarios, we compress more than 98% of the edges and vertices at the cost of increasing the average shortest-path length by at most 2%.
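The edge-contraction primitive is simple to state in code. The sketch below contracts one edge of an undirected graph stored as adjacency sets; RSEC additionally decides *which* edges may be contracted without degrading path quality, a test this toy graph (hypothetical) omits.

```python
def contract_edge(adj, u, v):
    """Contract edge (u, v): merge v into u, reconnecting v's other
    neighbors to u, then delete v. adj maps vertex -> set of neighbors."""
    for w in adj[v]:
        if w != u:
            adj[u].add(w)
            adj[w].discard(v)
            adj[w].add(u)
    adj[u].discard(v)
    del adj[v]

# toy roadmap (hypothetical): path a-b-c with a spur b-d
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
contract_edge(adj, "b", "c")   # merge vertex c into b
print(adj)  # c is gone; b keeps neighbors a and d
```

Repeated application of this operation, restricted to edges whose removal changes shortest-path lengths only slightly, is what yields the compression figures quoted above.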
- Let $P$ be a set of $n$ points in $\mathbb{R}^d$. We present a linear-size data structure for answering range queries on $P$ with constant-complexity semialgebraic sets as ranges, in time close to $O(n^{1-1/d})$. It essentially matches the performance of similar structures for simplex range searching, and, for $d\ge 5$, significantly improves earlier solutions by the first two authors obtained in 1994. This almost settles a long-standing open problem in range searching. The data structure is based on the polynomial-partitioning technique of Guth and Katz [arXiv:1011.4105], which shows that for a parameter $r$, $1 < r \le n$, there exists a $d$-variate polynomial $f$ of degree $O(r^{1/d})$ such that each connected component of $\mathbb{R}^d\setminus Z(f)$ contains at most $n/r$ points of $P$, where $Z(f)$ is the zero set of $f$. We present an efficient randomized algorithm for computing such a polynomial partition, which is of independent interest and is likely to have additional applications.
- Apr 25 2012 cs.CG arXiv:1204.5333v1 The Fréchet distance is a similarity measure between two curves $A$ and $B$: Informally, it is the minimum length of a leash required to connect a dog, constrained to be on $A$, and its owner, constrained to be on $B$, as they walk without backtracking along their respective curves from one endpoint to the other. The advantage of this measure over other measures, such as the Hausdorff distance, is that it takes into account the ordering of the points along the curves. The discrete Fréchet distance replaces the dog and its owner by a pair of frogs that can only reside on $n$ and $m$ specific pebbles on the curves $A$ and $B$, respectively. These frogs hop from a pebble to the next without backtracking. The discrete Fréchet distance can be computed by a rather straightforward quadratic dynamic programming algorithm. However, despite a considerable amount of work on this problem and its variations, there is no subquadratic algorithm known, even for approximation versions of the problem. In this paper we present a subquadratic algorithm for computing the discrete Fréchet distance between two sequences of points in the plane, of respective lengths $m\le n$. The algorithm runs in $O(\dfrac{mn\log\log n}{\log n})$ time and uses $O(n+m)$ storage. Our approach uses the geometry of the problem in a subtle way to encode legal positions of the frogs as states of a finite automaton.
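The "straightforward quadratic dynamic programming algorithm" mentioned above (the baseline the paper improves on, not its subquadratic contribution) can be sketched directly: the recurrence takes, at each pair of pebbles, the larger of the current pebble distance and the cheapest legal predecessor state of the two frogs.

```python
from math import dist  # Python 3.8+

def discrete_frechet(A, B):
    """Quadratic DP for the discrete Fréchet distance between
    point sequences A and B (lists of coordinate tuples)."""
    n, m = len(A), len(B)
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(A[i], B[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:                      # only frog on B may have hopped
                dp[i][j] = max(dp[i][j - 1], d)
            elif j == 0:                      # only frog on A may have hopped
                dp[i][j] = max(dp[i - 1][j], d)
            else:                             # either frog (or both) hopped
                dp[i][j] = max(min(dp[i - 1][j], dp[i][j - 1],
                                   dp[i - 1][j - 1]), d)
    return dp[n - 1][m - 1]

A = [(0, 0), (1, 0), (2, 0)]
B = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(A, B))  # 1.0
```

This runs in $O(nm)$ time and space; the paper's algorithm shaves a $\log n / \log\log n$ factor off the time bound.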
- Clustering is an unsupervised data-mining technique: it groups similar objects together and separates dissimilar ones. Each object in the data set is assigned a class label in the clustering process using a distance measure. This paper surveys the problems faced in practice when clustering algorithms are implemented. It also reviews the most extensively used tools, which are readily available and provide support functions that ease implementation. Once algorithms have been implemented, they must also be tested for validity; several validation indexes exist for measuring performance and accuracy, and these are also discussed here.
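As an example of the kind of validation index referred to above, here is a minimal pure-Python silhouette coefficient. The choice of silhouette is illustrative (it is one standard index among several); the points and labels are hypothetical.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes every cluster has >= 2 points.
    Per point: a = mean intra-cluster distance, b = mean distance to the
    nearest other cluster; the score is (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        a = sum(dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i) / (labels.count(labels[i]) - 1)
        b = min(sum(dist(p, q) for j, q in enumerate(points) if labels[j] == l)
                / labels.count(l)
                for l in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]    # two well-separated clusters
labs = [0, 0, 1, 1]
print(round(silhouette(pts, labs), 2))  # 0.9
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.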
- We propose the construction of an unobservable communications network using social networks. The communication endpoints are vertices on a social network. Probabilistically unobservable communication channels are built by leveraging image steganography and the social image sharing behavior of users. All communication takes place along the edges of a social network overlay connecting friends. We show that such a network can provide decent bandwidth even with a far from optimal routing mechanism such as restricted flooding. We show that such a network is indeed usable by constructing a botnet on top of it, called Stegobot. It is designed to spread via social malware attacks and steal information from its victims. Unlike conventional botnets, Stegobot traffic does not introduce new communication endpoints between bots. We analyzed a real-world dataset of image sharing between members of an online social network. Analysis of Stegobot's network throughput indicates that stealthy as it is, it is also functionally powerful -- capable of channeling fair quantities of sensitive data from its victims to the botmaster at tens of megabytes every month.
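The image-steganography channel underlying the scheme above can be illustrated with the simplest embedding technique, least-significant-bit (LSB) substitution. This is a generic sketch, not the embedding actually used by Stegobot (the abstract does not specify one); the pixel values and message bits are hypothetical.

```python
def embed(pixels, message_bits):
    """Hide one bit in the least-significant bit of each pixel value."""
    out = list(pixels)
    for i, bit in enumerate(message_bits):
        out[i] = (out[i] & ~1) | bit
    return out

def extract(pixels, n):
    """Recover the first n hidden bits."""
    return [p & 1 for p in pixels[:n]]

cover = [130, 201, 87, 44, 156, 99, 210, 3]   # hypothetical 8-bit pixel values
bits = [1, 0, 1, 1, 0]
stego = embed(cover, bits)
print(extract(stego, 5))  # [1, 0, 1, 1, 0]
```

Each pixel changes by at most one intensity level, which is what makes the channel hard to observe; real deployments must also survive the recompression that image-sharing sites apply.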
- Apr 05 2011 cs.CG arXiv:1104.0622v1 We consider the problem of maintaining the Euclidean Delaunay triangulation $\DT$ of a set $P$ of $n$ moving points in the plane, along algebraic trajectories of constant description complexity. Since the best known upper bound on the number of topological changes in the full $\DT$ is nearly cubic, we seek to maintain a suitable portion of it that is less volatile yet retains many useful properties. We introduce the notion of a stable Delaunay graph, which is a dynamic subgraph of the Delaunay triangulation. The stable Delaunay graph (a) is easy to define, (b) experiences only a nearly quadratic number of discrete changes, (c) is robust under small changes of the norm, and (d) possesses certain useful properties. The stable Delaunay graph ($\SDG$ in short) is defined in terms of a parameter $\alpha>0$, and consists of Delaunay edges $pq$ for which the angles at which $p$ and $q$ see their Voronoi edge $e_{pq}$ are at least $\alpha$. We show that (i) $\SDG$ always contains at least roughly one third of the Delaunay edges; (ii) it contains the $\beta$-skeleton of $P$, for $\beta=1+\Omega(\alpha^2)$; (iii) it is stable, in the sense that its edges survive for long periods of time, as long as the orientations of the segments connecting (nearby) points of $P$ do not change by much; and (iv) stable Delaunay edges remain stable (with an appropriate redefinition of stability) if we replace the Euclidean norm by any sufficiently close norm. In particular, we can approximate the Euclidean norm by a polygonal norm (namely, a regular $k$-gon, with $k=\Theta(1/\alpha)$), and keep track of a Euclidean $\SDG$ by maintaining the full Delaunay triangulation of $P$ under the polygonal norm. We describe two kinetic data structures for maintaining $\SDG$. Both structures use $O^*(n)$ storage and process $O^*(n^2)$ events during the motion, each in $O^*(1)$ time.
- Dec 14 2010 cs.CG arXiv:1012.2694v1 Let P be a set of n points in R^3. The 2-center problem for P is to find two congruent balls of minimum radius whose union covers P. We present two randomized algorithms for computing a 2-center of P. The first algorithm runs in O(n^3 log^5 n) expected time, and the second algorithm runs in O((n^2 log^5 n) /(1-r*/r_0)^3) expected time, where r* is the radius of the 2-center balls of P and r_0 is the radius of the smallest enclosing ball of P. The second algorithm is faster than the first one as long as r* is not too close to r_0, which is equivalent to the condition that the centers of the two covering balls be not too close to each other.
- Mar 31 2010 cs.CG arXiv:1003.5874v1 Given a set $P$ of $n$ points in $\mathbb{R}^d$, an $\varepsilon$-kernel $K \subseteq P$ approximates the directional width of $P$ in every direction within a relative $(1-\varepsilon)$ factor. In this paper we study the stability of $\varepsilon$-kernels under dynamic insertion and deletion of points to $P$ and under changes of the approximation factor $\varepsilon$. In the first case, we say an algorithm for dynamically maintaining an $\varepsilon$-kernel is stable if at most $O(1)$ points change in $K$ as one point is inserted or deleted from $P$. We describe an algorithm to maintain an $\varepsilon$-kernel of size $O(1/\varepsilon^{(d-1)/2})$ in $O(1/\varepsilon^{(d-1)/2} + \log n)$ time per update. Not only does our algorithm maintain a stable $\varepsilon$-kernel, its update time is faster than any known algorithm that maintains an $\varepsilon$-kernel of size $O(1/\varepsilon^{(d-1)/2})$. Next, we show that if there is an $\varepsilon$-kernel of $P$ of size $k$, which may be dramatically less than $O(1/\varepsilon^{(d-1)/2})$, then there is an $(\varepsilon/2)$-kernel of $P$ of size $O(\min\{1/\varepsilon^{(d-1)/2},\, k^{\lfloor d/2\rfloor}\log^{d-2}(1/\varepsilon)\})$. Moreover, there exists a point set $P$ in $\mathbb{R}^d$ and a parameter $\varepsilon > 0$ such that if every $\varepsilon$-kernel of $P$ has size at least $k$, then any $(\varepsilon/2)$-kernel of $P$ has size $\Omega(k^{\lfloor d/2\rfloor})$.
- We describe algorithms for finding the regression of t, a sequence of values, to the closest sequence s by mean squared error, so that s is always increasing (isotonicity) and so the values of two consecutive points do not increase by too much (Lipschitz). The isotonicity constraint can be replaced with a unimodular constraint, where there is exactly one local maximum in s. These algorithms are generalized from sequences of values to trees of values. For each scenario we describe near-linear time algorithms.
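The plain isotonic step (without the Lipschitz or unimodular constraints, and on sequences rather than trees) is classically solved by the pool-adjacent-violators algorithm; a minimal sketch, shown only as background for the problem the paper generalizes:

```python
def isotonic_regression(t):
    """Pool-adjacent-violators: the closest non-decreasing sequence
    to t under mean squared error."""
    # each block is (sum, count); merge while the previous block's
    # mean exceeds the current one's, since an L2-optimal fit must
    # replace any decreasing run by its average
    blocks = []
    for v in t:
        s, c = float(v), 1
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

print(isotonic_regression([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

Each value is pushed and popped at most once, so the sequence version already runs in linear time; the Lipschitz and tree variants are where the paper's algorithmic work lies.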
- Dec 22 2009 cs.NI arXiv:0912.4115v2 Multi-channel wireless networks are increasingly being employed as infrastructure networks, e.g. in metro areas. Nodes in these networks frequently employ directional antennas to improve spatial throughput. In such networks, given a source and destination, it is of interest to compute an optimal path and a channel assignment on every link in the path such that the path bandwidth is the same as that of the link bandwidth, and such that the path satisfies the constraint that no two consecutive links on the path are assigned the same channel, referred to as the "Channel Discontinuity Constraint" (CDC). CDC-paths are also quite useful for TDMA systems, where consecutive links along a path are preferably assigned different time slots. This paper contains several contributions. We first present an $O(N^{2})$ distributed algorithm for discovering the shortest CDC-path between a given source and destination. This improves the running time of the $O(N^{3})$ centralized algorithm of Ahuja et al. for finding the minimum-weight CDC-path. Our second result is a generalized $t$-spanner for CDC-paths: for any $\theta>0$ we show how to construct a sub-network containing only $O(\frac{N}{\theta})$ edges, such that the length of shortest CDC-paths between arbitrary sources and destinations increases by only a factor of at most $(1-2\sin{\tfrac{\theta}{2}})^{-2}$. We propose a novel algorithm to compute the spanner in a distributed manner using only $O(n\log{n})$ messages. An important consequence of this scheme arises when directional antennas are used: in that case, it suffices to consider only the two closest nodes in each cone.
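A centralized CDC-path search can be sketched as Dijkstra's algorithm over (node, incoming-channel) states: the channel of the edge used to reach a node becomes part of the state, and transitions that would repeat that channel are pruned. This illustrates the constraint only; the paper's algorithms are distributed, and the graph below is hypothetical.

```python
import heapq

def shortest_cdc_path(edges, source, target):
    """Shortest path in which no two consecutive edges share a channel.
    edges: iterable of (u, v, channel, weight), undirected."""
    adj = {}
    for u, v, ch, w in edges:
        adj.setdefault(u, []).append((v, ch, w))
        adj.setdefault(v, []).append((u, ch, w))
    # state = (node, channel of the arriving edge); None at the source
    pq = [(0, 0, source, None)]        # (dist, tiebreak, node, channel)
    tick = 1
    best = {(source, None): 0}
    while pq:
        d, _, node, ch = heapq.heappop(pq)
        if node == target:
            return d
        if d > best.get((node, ch), float("inf")):
            continue
        for nxt, nch, w in adj.get(node, []):
            if nch == ch:              # channel discontinuity constraint
                continue
            nd = d + w
            if nd < best.get((nxt, nch), float("inf")):
                best[(nxt, nch)] = nd
                heapq.heappush(pq, (nd, tick, nxt, nch))
                tick += 1
    return None

edges = [("s", "a", 1, 1), ("a", "t", 1, 1),   # cheap, but repeats channel 1
         ("s", "b", 1, 2), ("b", "t", 2, 2)]
print(shortest_cdc_path(edges, "s", "t"))  # 4 (the 2-hop channel-1 path is illegal)
```

The state expansion multiplies the vertex count by the number of channels, which is why the constrained problem remains polynomially solvable even though a plain shortest path on the original graph may be infeasible.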
- There is an increasing need for high-density data storage devices, driven by the increased demand for consumer electronics. In this work, we consider a data storage system that operates by encoding information as topographic profiles on a polymer medium. A cantilever probe with a sharp tip (a few nm in radius) is used to create and sense the presence of topographic profiles, resulting in a density of a few Tb per square inch. The prevalent mode of using the cantilever probe is the static mode, which is harsh on the probe and the media. In this article, the high-quality-factor dynamic mode of operation, which is less harsh on the media and the probe, is analyzed. The read operation is modeled as a communication channel that incorporates system memory due to inter-symbol interference and the cantilever state. We demonstrate an appropriate level of abstraction of this complex nanoscale system that obviates the need for an involved physical model. Next, a solution to the maximum likelihood sequence detection problem based on the Viterbi algorithm is devised. Experimental and simulation results demonstrate that the performance of this detector is several orders of magnitude better than that of other existing schemes.
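A Viterbi-based maximum likelihood sequence detector for a channel with memory can be sketched for a generic two-tap inter-symbol-interference model. The taps, the binary signaling, and the squared-error metric below are illustrative assumptions, not the paper's measured channel model (which also tracks the cantilever state):

```python
def viterbi_mlsd(received, taps=(1.0, 0.5)):
    """Viterbi ML sequence detection for a two-tap ISI channel
    y_k = h0*x_k + h1*x_{k-1} + noise, with x_k in {-1, +1}.
    Taps are hypothetical; state = previous symbol."""
    h0, h1 = taps
    states = (-1, 1)
    cost = {-1: 0.0, 1: 0.0}        # best metric ending in each state
    back = []                        # survivor choices per step
    for y in received:
        new_cost, choice = {}, {}
        for x in states:             # candidate current symbol
            best_p, best_m = None, float("inf")
            for p in states:         # candidate previous symbol
                m = cost[p] + (y - (h0 * x + h1 * p)) ** 2
                if m < best_m:
                    best_m, best_p = m, p
            new_cost[x], choice[x] = best_m, best_p
        cost = new_cost
        back.append(choice)
    s = min(cost, key=cost.get)      # cheapest final symbol
    seq = []
    for choice in reversed(back):    # trace the survivor path backwards
        seq.append(s)
        s = choice[s]
    return seq[::-1]

# noiseless observations for x = [+1, -1, +1] preceded by a -1 symbol
print(viterbi_mlsd([0.5, -0.5, 0.5]))  # [1, -1, 1]
```

The number of trellis states grows with the channel memory, which is why a suitably abstracted (short-memory) channel model, as argued for in the paper, matters for detector complexity.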
- Jun 27 2008 cs.CG arXiv:0806.4326v2 For a set P of n points in R^2, the Euclidean 2-center problem computes a pair of congruent disks of the minimal radius that cover P. We extend this to the (2,k)-center problem where we compute the minimal radius pair of congruent disks to cover n-k points of P. We present a randomized algorithm with O(n k^7 log^3 n) expected running time for the (2,k)-center problem. We also study the (p,k)-center problem in R^2 under the $\ell_\infty$-metric. We give solutions for p=4 in O(k^O(1) n log n) time and for p=5 in O(k^O(1) n log^5 n) time.
- Here we present the results of the NSF-funded Workshop on Computational Topology, which met on June 11 and 12 in Miami Beach, Florida. This report identifies important problems involving both computation and topology.
- Sep 01 1998 cs.CG arXiv:cs/9808008v1 Problems presented at the open-problem session of the 14th Annual ACM Symposium on Computational Geometry are listed.