Data Structures and Algorithms (cs.DS)

  • PDF
    In this paper, we consider a coverage problem for uncertain points in a tree. Let T be a tree containing a set P of n (weighted) demand points, and the location of each demand point P_i∈P is uncertain but is known to appear in one of m_i points on T each associated with a probability. Given a covering range \lambda, the problem is to find a minimum number of points (called centers) on T to build facilities for serving (or covering) these demand points in the sense that for each uncertain point P_i∈P, the expected distance from P_i to at least one center is no more than $\lambda$. The problem has not been studied before. We present an O(|T|+M\log^2 M) time algorithm for the problem, where |T| is the number of vertices of T and M is the total number of locations of all uncertain points of P, i.e., M=\sum_P_i∈Pm_i. In addition, by using this algorithm, we solve a k-center problem on T for the uncertain points of P.
  • PDF
    Small depth networks arise in a variety of network related applications, often in the form of maximum flow and maximum weighted matching. Recent works have generalized such methods to include costs arising from concave functions. In this paper we give an algorithm that takes a depth $D$ network and strictly increasing concave weight functions of flows on the edges and computes a $(1 - \epsilon)$-approximation to the maximum weight flow in time $mD \epsilon^{-1}$ times an overhead that is logarithmic in the various numerical parameters related to the magnitudes of gradients and capacities. Our approach is based on extending the scaling algorithm for approximate maximum weighted matchings by [Duan-Pettie JACM`14] to the setting of small depth networks, and then generalizing it to concave functions. In this more restricted setting of linear weights in the range $[w_{\min}, w_{\max}]$, it produces a $(1 - \epsilon)$-approximation in time $O(mD \epsilon^{-1} \log( w_{\max} /w_{\min}))$. The algorithm combines a variety of tools and provides a unified approach towards several problems involving small depth networks.
  • PDF
    Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large and high-dimensional data, computing the PCA (i.e., the singular vectors corresponding to a number of dominant singular values of the data matrix) becomes a challenging task. In this work, a single-pass randomized algorithm is proposed to compute PCA with only one pass over the data. It is suitable for processing extremely large and high-dimensional data stored in slow memory (hard disk) or the data generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy, which has orders of magnitude smaller error than an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the proposed algorithm is able to compute the first 50 principal components in just 24 minutes on a typical 24-core computer, with less than 1 GB memory cost.
  • PDF
    We consider the problem of summarizing a multi set of elements in $\{1, 2, \ldots , n\}$ under the constraint that no element appears more than $\ell$ times. The goal is then to answer \emphrank queries --- given $i\in\{1, 2, \ldots , n\}$, how many elements in the multi set are smaller than $i$? --- with an additive error of at most $\Delta$ and in constant time. For this problem, we prove a lower bound of $\mathcal B_{\ell,n,\Delta}\triangleq$ $\left\lfloor{\frac{n}{\left\lceil{\Delta / \ell}\right\rceil}}\right\rfloor $ $\log\big({\max\{\left\lfloor{\ell / \Delta}\right\rfloor,1\} + 1}\big)$ bits and provide a \emphsuccinct construction that uses $\mathcal B_{\ell,n,\Delta}(1+o(1))$ bits. Next, we generalize our data structure to support processing of a stream of integers in $\{0,1,\ldots,\ell\}$, where upon a query for some $i\le n$ we provide a $\Delta$-additive approximation for the sum of the \emphlast $i$ elements. We show that this too can be done using $\mathcal B_{\ell,n,\Delta}(1+o(1))$ bits and in constant time. This yields the first sub linear space algorithm that computes approximate sliding window sums in $O(1)$ time, where the window size is given at the query time; additionally, it requires only $(1+o(1))$ more space than is needed for a fixed window size.
  • PDF
    Achieving the goals in the title (and others) relies on a cardinality-wise scanning of the ideals of the poset. Specifically, the relevant numbers attached to the k+1 element ideals are inferred from the corresponding numbers of the k-element (order) ideals. Crucial in all of this is a compressed representation (using wildcards) of the ideal lattice. The whole scheme invites distributed computation.
  • PDF
    In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac1z$, we say that a string $P$ of length $m$ matches a weighted sequence $X$ at starting position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+m-1$ in $X$ is at least $\frac1z$. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an $O(nz)$-time construction of an $O(nz)$-sized index for a weighted sequence of length $n$ over an integer alphabet that answers pattern matching queries in optimal, $O(m+\mathit{Occ})$ time, where $\mathit{Occ}$ is the number of occurrences reported. Our new index is based on a non-trivial construction of a family of $\lfloor z \rfloor$ weighted sequences of an especially simple form that are equivalent to a general weighted sequence. This new combinatorial insight allowed us to obtain: a construction of the index in the case of a constant-sized alphabet with the same complexities as in (Barton et al., CPM 2016) but with a simple implementation; a deterministic construction in the case of a general integer alphabet (the construction of Barton et al. in this case was randomised); an improvement of the space complexity from $O(nz^2)$ to $O(nz)$ of a more general index for weighted sequences that was presented in (Biswas et al., EDBT 2016); and a significant improvement of the complexities of the approximate variant of the index of Biswas et al.
  • PDF
    We consider the well-studied Hospital Residents (HR) problem in the presence of lower quotas (LQ). The input instance consists of a bipartite graph $G = (\mathcal{R} \cup \mathcal{H}, E)$ where $\mathcal{R}$ and $\mathcal{H}$ denote sets of residents and hospitals respectively. Every vertex has a preference list that imposes a strict ordering on its neighbors. In addition, each hospital $h$ has an associated upper-quota $q^+(h)$ and lower-quota $q^-(h)$. A matching $M$ in $G$ is an assignment of residents to hospitals, and $M$ is said to be feasible if every resident is assigned to at most one hospital and a hospital $h$ is assigned at least $q^-(h)$ and at most $q^+(h)$ residents. Stability is a de-facto notion of optimality in a model where both sets of vertices have preferences. A matching is stable if no unassigned pair has an incentive to deviate from it. It is well-known that an instance of the HRLQ problem need not admit a feasible stable matching. In this paper, we consider the notion of popularity for the HRLQ problem. A matching $M$ is popular if no other matching $M'$ gets more votes than $M$ when vertices vote between $M$ and $M'$. When there are no lower quotas, there always exists a stable matching and it is known that every stable matching is popular. We show that in an HRLQ instance, although a feasible stable matching need not exist, there is always a matching that is popular in the set of feasible matchings. We give an efficient algorithm to compute a maximum cardinality matching that is popular amongst all the feasible matchings in an HRLQ instance.
  • PDF
    String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size ($\Sigma$) or allow more mismatches ($M$). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to $O(\Sigma^{M})$. We propose a \textbffast algorithm for calculating \underlineGapped $k$-mer \underlineKernel using \underlineCounting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger $\Sigma$ and $M$, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the $\Sigma^{M}$ term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively GaKCo is shared as an open source tool at \urlhttps://github.com/QData/GaKCo-SVM
  • PDF
    Osborne's iteration is a method for balancing $n\times n$ matrices which is widely used in linear algebra packages, as balancing preserves eigenvalues and stabilizes their numeral computation. The iteration can be implemented in any norm over $\mathbb{R}^n$, but it is normally used in the $L_2$ norm. The choice of norm not only affects the desired balance condition, but also defines the iterated balancing step itself. In this paper we focus on Osborne's iteration in any $L_p$ norm, where $p < \infty$. We design a specific implementation of Osborne's iteration in any $L_p$ norm that converges to a strictly $\epsilon$-balanced matrix in $\tilde{O}(\epsilon^{-2}n^{9} K)$ iterations, where $K$ measures, roughly, the \em number of bits required to represent the entries of the input matrix. This is the first result that proves that Osborne's iteration in the $L_2$ norm (or any $L_p$ norm, $p < \infty$) strictly balances matrices in polynomial time. This is a substantial improvement over our recent result (in SODA 2017) that showed weak balancing in $L_p$ norms. Previously, Schulman and Sinclair (STOC 2015) showed strong balancing of Osborne's iteration in the $L_\infty$ norm. Their result does not imply any bounds on strict balancing in other norms.

Recent comments

Māris Ozols Feb 21 2017 15:35 UTC

I'm wondering if this result could have any interesting consequences for Hamiltonian complexity. The LCL problem sounds very much like a local Hamiltonian problem, with the run-time of an LCL algorithm corresponding to the range of local interactions in the Hamiltonian.

Maybe one caveat is that thi

...(continued)
Zoltán Zimborás Jan 12 2017 20:38 UTC

Here is a nice description, with additional links, about the importance of this work if it turns out to be flawless (thanks a lot to Martin Schwarz for this link): [dichotomy conjecture][1].

[1]: http://processalgebra.blogspot.com/2017/01/has-feder-vardi-dichotomy-conjecture.html

Māris Ozols Oct 21 2016 21:06 UTC

Very nice! Now we finally know how to fairly cut a cake in a finite number of steps! What is more, the number of steps is expected to go down from the whopping $n^{n^{n^{n^{n^n}}}}$ to just barely $n^{n^n}$. I can't wait to get my slice!

https://www.quantamagazine.org/20161006-new-algorithm-solve

...(continued)
Ashley Apr 21 2015 18:42 UTC

Thanks for the further comments and spotting the new typos. To reply straight away to the other points:

First, the resulting states might as well stay in the same bin (even though, as you rightly note, the bins no longer correspond to the same bit-strings as before). All that matters is that the

...(continued)
Perplexed Platypus Apr 21 2015 14:55 UTC

Thanks for updating the paper so promptly. The updated version addresses all my concerns so far. However I noticed a few extra (minor) things while reading through it.

On page 15, last step of 2(b): if $|\psi_r\rangle$ and $|\psi_t\rangle$ were in the same bin but the combination operation failed

...(continued)
Ashley Apr 20 2015 16:27 UTC

Thank you for these very detailed and helpful comments. I have uploaded a new version of the paper to the arXiv to address them, which should appear tomorrow. I will reply to the comments in more detail (and justify the cases where I didn't modify the paper as suggested) when I receive them through

...(continued)
Perplexed Platypus Apr 13 2015 22:37 UTC

**Summary and recommendation**

This paper considers a $d$-dimensional version of the problem of finding a given pattern within a text, for random patterns and text. The text is assumed to be picked uniformly at random and has size $n^d$ while the pattern has size $m^d$ and is either uniformly ran

...(continued)
Ashley Apr 12 2015 13:01 UTC

Thanks for the clarification. In fact it seems that I do have this option switched on, with the correct author identifier, so I'm not sure why I didn't get an email about these comments.

Perplexed Platypus Apr 10 2015 13:18 UTC

Hi Ashley,

Thanks for your reply, it was very helpful! I thought about e-mailing you but I wanted to preserve my confidentiality as a reviewer. Also, I wanted to see if it is feasible to use SciRate as a platform for interacting with authors during the review process.

I encourage you (and **ot

...(continued)
Ashley Apr 09 2015 20:03 UTC

Hi,

Thank you for your very detailed comments / questions about the technical points in this paper. I did happen to check Scirate today but in general (as I suspect with many other people) I don't check it regularly, so for reliable replies it's better just to email me. To reply to your questions i

...(continued)