Databases (cs.DB)

  • PDF
    The execution logs that are used for process mining in practice are often obtained by querying an operational database and storing the result in a flat file. Consequently, the data processing power of the database system can no longer be applied to this information, leading to constrained flexibility in the definition of mining patterns and limited execution performance in mining large logs. Enabling process mining directly on a database - instead of via intermediate storage in a flat file - therefore provides additional flexibility and efficiency. To help facilitate this ideal of in-database process mining, this paper formally defines a database operator that extracts the 'directly follows' relation from an operational database. This operator can be used both to do in-database process mining and to flexibly evaluate process-mining-related queries, such as: "which employee most frequently changes the 'amount' attribute of a case from one task to the next". We define the operator using the well-known relational algebra that forms the formal underpinning of relational databases. We formally prove equivalence properties of the operator that are useful for query optimization and present time-complexity properties of the operator. By doing so, this paper formally defines the necessary relational algebraic elements of a 'directly follows' operator, which are required for implementation of such an operator in a DBMS.
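    As a rough illustration of what a 'directly follows' relation contains, the sketch below computes it over a small in-memory event log. It is not the paper's relational-algebra operator: the column layout (case id, timestamp, activity), the sample rows, and the function name `directly_follows` are all assumptions made for this example.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical event log rows: (case_id, timestamp, activity).
events = [
    (1, 1, "register"),
    (1, 2, "check"),
    (1, 3, "pay"),
    (2, 1, "register"),
    (2, 2, "pay"),
]

def directly_follows(rows):
    """Count pairs (a, b) where b is the next event after a in the same case."""
    # Sort by case, then by timestamp, so consecutive rows of a case are adjacent.
    ordered = sorted(rows, key=itemgetter(0, 1))
    counts = Counter()
    for _, case_rows in groupby(ordered, key=itemgetter(0)):
        activities = [activity for _, _, activity in case_rows]
        # Pair each activity with its immediate successor within the case.
        counts.update(zip(activities, activities[1:]))
    return counts

df = directly_follows(events)
```

    Inside a DBMS the same pairs would be derived by the query engine rather than by application code, which is the setting the abstract argues for.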
  • PDF
    We consider answering queries where the underlying data is available only over limited interfaces which provide lookup access to the tuples matching a given binding, but possibly restricting the number of output tuples returned. Interfaces imposing such "result bounds" are common in accessing data via the web. Given a query over a set of relations as well as some integrity constraints that relate the queried relations to the data sources, we examine the problem of deciding if the query is answerable over the interfaces; that is, whether there exists a plan that returns all answers to the query, assuming the source data satisfies the integrity constraints. The first component of our analysis of answerability is a reduction to a query containment problem with constraints. The second component is a set of "schema simplification" theorems capturing limitations on how interfaces with result bounds can be useful to obtain complete answers to queries. These results also help to show decidability for the containment problem that captures answerability, for many classes of constraints. The final component in our analysis of answerability is a "linearization" method, showing that query containment with certain guarded dependencies -- including those that emerge from answerability problems -- can be reduced to query containment for a well-behaved class of linear dependencies. Putting these components together, we get a detailed picture of how to check answerability over result-bounded services.
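    To make the notion of a result bound concrete, the sketch below models a lookup interface that returns at most `bound` matching tuples. The class `BoundedLookup`, the sample data, and the choice to return the first matches are assumptions for illustration only; a real service may return any subset of that size, and the example does not reproduce the paper's analysis.

```python
class BoundedLookup:
    """Hypothetical lookup service that returns at most `bound` tuples per binding."""

    def __init__(self, rows, bound):
        self._rows = list(rows)
        self._bound = bound

    def lookup(self, key):
        # All tuples matching the binding, truncated to the result bound.
        matches = [row for row in self._rows if row[0] == key]
        return matches[: self._bound]

rows = [("alice", "db"), ("alice", "ml"), ("alice", "pl"), ("bob", "db")]
service = BoundedLookup(rows, bound=2)

def has_any(svc, key):
    # An existence check needs only one witness, so a bound of at least 1 suffices.
    return len(svc.lookup(key)) > 0

def all_values(svc, key):
    # Enumeration may be incomplete when more tuples match than the bound allows.
    return [value for _, value in svc.lookup(key)]

complete = [value for k, value in rows if k == "alice"]
partial = all_values(service, "alice")
```

    Under these assumptions the existence query is answered correctly despite the bound, while the enumeration for "alice" misses a tuple. The abstract's answerability problem asks when a plan can guarantee complete answers in spite of such limits.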
  • PDF
    Managing dynamic information in large multi-site, multi-species, and multi-discipline consortia is a challenging task for data management applications. In academic research studies, the goal for informatics teams is often to build applications that provide extract-transform-load (ETL) functionality to archive and catalog source data that has been collected by the research teams. In consortia that cross species and methodological or scientific domains, building interfaces that supply data in a usable fashion and make intuitive sense to scientists from dramatically different backgrounds increases the complexity for developers. Further, reusing source data from outside one's scientific domain is fraught with ambiguities in understanding the data types, analysis methodologies, and how to combine the data with those from other research teams. We report on the design, implementation, and performance of a semantic data management application to support the NIMH-funded Conte Center at the University of California, Irvine. The Center is testing a theory of the consequences of "fragmented" (unpredictable, high entropy) early-life experiences on adolescent cognitive and emotional outcomes in both humans and rodents. It employs cross-species neuroimaging, epigenomic, molecular, and neuroanatomical approaches in humans and rodents to assess the potential consequences of fragmented unpredictable experience on brain structure and circuitry. To address this multi-technology, multi-species approach, the system uses semantic web techniques based on the Neuroimaging Data Model (NIDM) to facilitate data ETL functionality. We find that this approach yields a low-cost, easy-to-maintain, and semantically meaningful information management system that allows the diverse research teams to access and use the data.