Mar 20 2018 cs.DB
The view and the view update are known mechanism for controlling access of data and for integrating data of different schemas. Despite intensive and long research on them in both the database community and the programming language community, we are facing difficulties to use them in practice. The main reason is that we are lacking of control over the view update strategy to deal with inherited ambiguity of view update for a given view. This vision paper aims to provide a new language-based approach to controlling and integrating decentralized data based on the view, and establish a software foundation for systematic construction of such data management systems. Our key observation is that a view should be defined through a view update strategy rather than a query. In other words, the view definition should be extracted from the view update strategy, which is in sharp contrast to the traditional approaches where the view update strategy is derived from the view definition. In this paper, we present the first programmable architecture with a declarative language for specifying update strategies over views, whose unique view definition can be automatically derived, and show how it can be effectively used to control data access, integrate data generally allowing coexistence of GAV (global as view) and LAV (local as view), and perform both analysis and updates on the integrated data. We demonstrate its usefulness through development of a privacy-preserving ride-sharing alliance system, discuss its application scope, and highlight future challenges.
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog (the so-called warded Datalog+-) under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. Moreover, this translation has the potential for further extensions -- above all to capture the bag semantics of the semantic web query language SPARQL.
We study the design of differentially private algorithms for adaptive analysis of dynamically growing databases, where a database accumulates new data entries while the analysis is ongoing. We provide a collection of tools for machine learning and other types of data analysis that guarantee differential privacy and accuracy as the underlying databases grow arbitrarily large. We give both a general technique and a specific algorithm for adaptive analysis of dynamically growing databases. Our general technique is illustrated by two algorithms that schedule black box access to some algorithm that operates on a fixed database to generically transform private and accurate algorithms for static databases into private and accurate algorithms for dynamically growing databases. These results show that almost any private and accurate algorithm can be rerun at appropriate points of data growth with minimal loss of accuracy, even when data growth is unbounded. Our specific algorithm directly adapts the private multiplicative weights algorithm to the dynamic setting, maintaining the accuracy guarantee of the static setting through unbounded data growth. Along the way, we develop extensions of several other differentially private algorithms to the dynamic setting, which may be of independent interest for future work on the design of differentially private algorithms for growing databases.
Government policies aim to address public issues and problems and therefore play a pivotal role in peoples lives. The creation of public policies, however, is complex given the perspective of large and diverse stakeholders involvement, considerable human participation, lengthy processes, complex task specification and the non-deterministic nature of the process. The inherent complexities of the policy process impart challenges for designing a computing system that assists in supporting and automating the business process pertaining to policy setup, which also raises concerns for setting up a tracking service in the policy-making environment. A tracking service informs how decisions have been taken during policy creation and can provide useful and intrinsic information regarding the policy process. At present, there exists no computing system that assists in tracking the complete process that has been employed for policy creation. To design such a system, it is important to consider the policy environment challenges; for this a novel network and goal based approach has been framed and is covered in detail in this paper. Furthermore, smart governance objectives that include stakeholders participation and citizens involvement have been considered. Thus, the proposed approach has been devised by considering smart governance principles and the knowledge environment of policy making where tasks are largely dependent on policy makers decisions and on individual policy objectives. Our approach reckons the human dimension for deciding and defining autonomous process activities at run time. Furthermore, with the network-based approach, so-called provenance data tracking is employed which enables the capture of policy process.
Mar 20 2018 cs.DB
In this paper we present the GFP-growth (Guided FP-growth) algorithm, a novel method for finding the count of a given list of itemsets in large data. Unlike FP-growth, our algorithm is designed to focus on the specific multiple itemsets of interest and hence its time and memory costs are better. We prove that the GFP-growth algorithm yields the exact frequency-counts for the required itemsets. We show that for a number of different problems, a solution can be devised which takes advantage of the efficient implementation of multi-targeted mining for boosting the performance. In particular, we study in detail the problem of generating the minority-class rules from imbalanced data, a scenario that appears in many real-life domains such as medical applications, failure prediction, network and cyber security, and maintenance. We develop the Minority-Report Algorithm that uses the GFP-growth for boosting performance. We prove some theoretical properties of the Minority-Report Algorithm and demonstrate its superior performance using simulations and real data.
Serverless architectures organized around loosely-coupled function invocations represent an emerging design for many applications. Recent work mostly focuses on user-facing products and event-driven processing pipelines. In this paper, we explore a completely different part of the application space and examine the feasibility of analytical processing on big data using a serverless architecture. We present Flint, a prototype Spark execution engine that takes advantage of AWS Lambda to provide a pure pay-as-you-go cost model. With Flint, a developer uses PySpark exactly as before, but without needing an actual Spark cluster. We describe the design, implementation, and performance of Flint, along with the challenges associated with serverless analytics.