Data Platforms and Artificial Intelligence

Challenges and Applications

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

Polyglot Persistence

To each application, the appropriate DBMS

Polyglot Persistence

To each application, the appropriate DBMS: works well for OLTP

  • What about OLAP?

Polyglot Persistence

Support multiple models in the same database

How can we meta-model graphs into a relational database?

Support multiple models in the same database

CREATE TABLE node (
    node_id        INT PRIMARY KEY,
    created_at     TIMESTAMP
);
CREATE TABLE edge (
    edge_id        INT PRIMARY KEY,
    from_node_id   INT NOT NULL,
    to_node_id     INT NOT NULL,
    edge_type      VARCHAR(255) NOT NULL,
    created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (from_node_id) REFERENCES node(node_id) ON DELETE CASCADE,
    FOREIGN KEY (to_node_id)   REFERENCES node(node_id) ON DELETE CASCADE
);
CREATE TABLE node_label (
    node_id    INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
    label      VARCHAR(255) NOT NULL,
    PRIMARY KEY (node_id, label)
);
CREATE TABLE node_property (
    node_id        INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
    property_key   VARCHAR(255) NOT NULL,
    property_value TEXT,
    PRIMARY KEY (node_id, property_key)
);
CREATE TABLE edge_property (
    edge_id        INT NOT NULL REFERENCES edge(edge_id) ON DELETE CASCADE,
    property_key   VARCHAR(255) NOT NULL,
    property_value TEXT,
    PRIMARY KEY (edge_id, property_key)
);
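
With this meta-model, a one-hop traversal becomes a pair of joins. A minimal sketch (node 42 and the FOLLOWS edge type are made up for illustration):

-- All nodes that node 42 reaches via a 'FOLLOWS' edge, with their properties
SELECT n.node_id, p.property_key, p.property_value
FROM edge e
JOIN node n ON n.node_id = e.to_node_id
LEFT JOIN node_property p ON p.node_id = n.node_id
WHERE e.from_node_id = 42
  AND e.edge_type = 'FOLLOWS';

Multi-hop traversals require recursive CTEs, which is one reason native graph engines tend to outperform this relational encoding.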

Support multiple models in the same database

How can we meta-model JSON documents into a relational database?

Support multiple models in the same database

CREATE TABLE document (
    document_id INT PRIMARY KEY,
    created_at  TIMESTAMP
);
CREATE TABLE node (
    node_id        INT PRIMARY KEY,
    document_id    INT NOT NULL REFERENCES document(document_id) ON DELETE CASCADE,
    parent_node_id INT NULL     REFERENCES node(node_id)         ON DELETE CASCADE,
    key            VARCHAR(255) NULL,
    index_in_array INT NULL,
    node_type      VARCHAR(10) NOT NULL CHECK (node_type IN ('OBJECT','ARRAY','VALUE'))
);
CREATE TABLE node_value (
    node_id       INT PRIMARY KEY REFERENCES node(node_id) ON DELETE CASCADE,
    value_type    VARCHAR(10) NOT NULL CHECK (value_type IN ('STRING','NUMBER','BOOLEAN','NULL')),
    value_string  TEXT NULL,
    value_number  NUMERIC NULL,
    value_boolean BOOLEAN NULL
);
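
As a sketch of how a concrete document shreds into these tables, consider {"user": "Jack", "tags": ["run", "walk"]} (all identifiers are made up for illustration):

-- Root object, its two members, and the array elements
INSERT INTO document (document_id, created_at) VALUES (1, CURRENT_TIMESTAMP);
INSERT INTO node (node_id, document_id, parent_node_id, key, index_in_array, node_type) VALUES
    (10, 1, NULL, NULL,   NULL, 'OBJECT'),  -- the document root
    (11, 1, 10,   'user', NULL, 'VALUE'),   -- "user": ...
    (12, 1, 10,   'tags', NULL, 'ARRAY'),   -- "tags": [...]
    (13, 1, 12,   NULL,   0,    'VALUE'),   -- tags[0]
    (14, 1, 12,   NULL,   1,    'VALUE');   -- tags[1]
INSERT INTO node_value (node_id, value_type, value_string) VALUES
    (11, 'STRING', 'Jack'), (13, 'STRING', 'run'), (14, 'STRING', 'walk');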

Polyglot Persistence in PostgreSQL

PostgreSQL enables polyglot persistence by blending:

  • JSON for semi-structured document data
  • Apache AGE for graph data (Cypher + property graphs)
  • TimescaleDB for time-series and analytical workloads
  • PostGIS for geospatial and GIS operations

Multiple data models coexist in one engine and one SQL interface

  • Enables hybrid relational/document/graph/time-series/geospatial systems

CREATE TABLE sensor_event (
    event_id     SERIAL PRIMARY KEY,
    payload      JSONB,                 -- document model
    location     geography(Point, 4326) -- PostGIS
);
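
A minimal usage sketch, assuming the PostGIS extension is installed (sensor names and coordinates are illustrative):

-- Insert an event mixing the document and geospatial models
INSERT INTO sensor_event (payload, location) VALUES (
    '{"sensor": "t-101", "temp": 21.5}'::jsonb,
    ST_GeogFromText('POINT(11.34 44.49)')   -- lon/lat
);

-- One SQL query spanning both models: JSONB filtering plus a spatial predicate
SELECT event_id, payload->>'sensor' AS sensor
FROM sensor_event
WHERE (payload->>'temp')::numeric > 20
  AND ST_DWithin(location, ST_GeogFromText('POINT(11.34 44.49)'), 1000);  -- within 1 km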

Polyglot persistence: main challenges

Data model heterogeneity

  • Support multiple models in the same database
  • Or integrate data from different databases using different query languages

Inter-collection: different records in different collections have different schemas

  • Not a new problem: think federated databases, corporate mergers, etc.

Intra-collection: different records in the same collection have different schemas

  • Emerged with NoSQL databases

Data inconsistency: reconciling inconsistent versions of the same data (inter- or intra-collection)

Data model heterogeneity

Polyglot Persistence

Basic solutions

Some DBMSs offer multi-model support.

  • Extended RDBMSs
    • KV implementable as a table with two fields: a string key and a blob value
    • Cypher query language on top of a relational implementation of a graph
    • Hstore data type in PostgreSQL for a wide-column-like implementation (see the sketch after this list)
    • Scalability issues remain
  • Multi-model NoSQL DBMSs
    • ArangoDB, OrientDB
    • Support all NoSQL data models, but not the relational one
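
A minimal sketch of the KV and hstore points above (table and key names are illustrative):

-- Key-value store as a two-field table
CREATE TABLE kv_store (
    k VARCHAR(255) PRIMARY KEY,
    v BYTEA                       -- opaque blob value
);

-- Wide-column-like rows via PostgreSQL's hstore extension
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE wide_row (
    row_key VARCHAR(255) PRIMARY KEY,
    columns hstore                -- arbitrary per-row columns
);
INSERT INTO wide_row VALUES ('user:1', 'name => Jack, city => Bologna');
SELECT row_key, columns -> 'city' AS city FROM wide_row;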

Some approaches suggest strategies to model everything within RDBMSs

A taxonomy for distributed solutions (Tan et al. 2017)

Federated database system

  • Homogeneous data stores, exposes a single standard query interface
  • Features a mediator-wrapper architecture, employs schema-mapping and entity-merging techniques for integration of relational data

Polyglot system

  • Homogeneous data stores, exposes multiple query interfaces
  • Takes advantage of the semantic expressiveness of multiple interfaces (e.g., declarative, procedural)

Multistore system

  • Heterogeneous data stores, exposes a single query interface
  • Provides a unified querying layer by adopting ontologies and applying schema-mapping and entity-resolution techniques

Polystore system

  • Heterogeneous data stores, exposes multiple query interfaces
  • Choose from a variety of query interfaces to seamlessly query data residing in multiple data stores

Advanced solutions

The challenge is to balance two often conflicting forces.

  • Location Independence: A query is written, and the system figures out which storage engine it targets
  • Semantic Completeness: A query can exploit the full set of features provided by a storage engine

Example of a polystore

  • Island = a middleware application to support a set of operations on a given data model
  • Shim = a wrapper to convert from the island’s query language to the target DB’s query language

BigDAWG (Mattson, Rogers, and Elmore 2019)

Advanced solutions

BigDAWG middleware consists of

  • Optimizer: parses the input query and creates a set of viable query plan trees with possible engines for each subquery
  • Monitor: uses performance data from prior queries to determine the query plan tree with the best engine for each subquery
  • Executor: figures out how to best join the collections and then executes the query
  • Migrator: moves data from engine to engine when the plan calls for such data motion

Middleware

… and of course we have metadata

Catalog: stores metadata about the system

  • Databases: Databases, their engine membership, and connection authentication information.
  • Objects: Data objects (i.e., tables), field names, and object-to-database membership.
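
A minimal relational sketch of such a catalog (the table layout is illustrative, not BigDAWG’s actual schema):

CREATE TABLE catalog_database (
    database_id    INT PRIMARY KEY,
    engine         VARCHAR(50)  NOT NULL,  -- e.g., 'postgres', 'scidb'
    connection_uri VARCHAR(255) NOT NULL,
    credentials    VARCHAR(255)
);
CREATE TABLE catalog_object (
    object_id   INT PRIMARY KEY,
    database_id INT NOT NULL REFERENCES catalog_database(database_id),
    object_name VARCHAR(255) NOT NULL,     -- e.g., a table or a collection
    field_names TEXT                       -- field names of the object
);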

Middleware

Most notable multistore/polystore proposals

Beyond data model heterogeneity

Entity resolution

  • Every approach needs some kind of integrated knowledge
  • Ample research from federated database systems
  • Usually “out-of-scope”

Schema heterogeneity

Heterogeneous data stored with variant schemas and structural forms

  • Management of schema heterogeneity and data inconsistency
    • Usually addressed as different problems in the literature
  • Missing/additional attributes
  • Different names/types of attributes
  • Different nested structures

Schema heterogeneity

Two main problems

  • Understand the data
  • Query the data

Understanding the data

Early work on XML

  • To deal with the widespread lack of DTDs and XSDs
  • Extract regular expressions to describe the content of elements in a set of XML documents

Recent work on JSON

Schema profiling

Schema profiles explain

  • What are the differences between schemas
  • When/why is one schema used instead of the other

The problem of schema profiling is quite similar to a classification problem.

  • Classifiers also describe the rules for assigning a class to an observation based on the observation’s features
  • Based on the requirements collected from potential users, decision trees emerged as the most suitable formalism
SchemaID | User | Activity  | Weight | Duration | Repetitions
---------|------|-----------|--------|----------|------------
S1       | Jack | Run       |        | 108      |
S2       | John | Leg press | 80     | 4        | 23
S1       | Kate | Walk      |        | 42       |
S3       | John | Push-ups  |        | 8        | 40

Schema profiling

The documents are the observations

The schemas are the classes


Preliminary activities

Semi-structured interviews with 5 users

  • Application domains: fitness equipment sales, software development
  • Understand goals, requirements, and visualization format
  • No single complete/correct description of the dataset was available

Definition of schema profile characteristics

  • Explicativeness
  • Precision
  • Conciseness

Explicativeness

Value-based (VB)

Schema-based (SB)

Value-based (VB) conditions are preferred to schema-based (SB) ones

  • SB: acknowledge a difference between schemas
  • VB: explain it in terms of the values taken by an attribute

The fewer the SB conditions, the higher the explicativeness

Precision

A decision tree is precise if all the leaves are pure.

  • A leaf \(v_j\) is pure if all its observations belong to the same class (i.e., \(entropy(v_j) = 0\))
  • Entropy is strictly related to precision: divisive approaches typically stop only when the leaves are all pure

Entropy formulation

Leaf purity
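
In the standard decision-tree formulation (the paper’s notation may differ):

\[ entropy(v_j) = -\sum_{s \in S} p_{j,s} \log_2 p_{j,s} \]

where \(p_{j,s}\) is the fraction of documents in leaf \(v_j\) having schema \(s\); a leaf is pure when a single schema accounts for all its documents, i.e., \(entropy(v_j)=0\).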

Precision and conciseness

Minimization of entropy often leads to splitting observations of the same class among several leaves

  • Entropy’s sole focus is on node purity
  • More frequent when the number of classes is high

Documents with the same schema are sparse

Precision and conciseness

In classification, precision is typically considered more important than readability.

  • In schema profiling, however, readability is critical
  • A pure focus on precision conflicts with the conciseness requirement

Conciseness

A maximally concise schema profile is one where there is a single rule for each schema

  • Entropy: a leaf is pure if it contains only documents with the same class
  • Schema entropy: a schema is pure if all its documents are in the same leaf

Entropy

Schema entropy
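
A sketch of the intuition (the exact formulation in the paper may differ): schema entropy measures how the documents of a single schema spread across the leaves,

\[ schemaEntropy(s) = -\sum_{j} q_{s,j} \log_2 q_{s,j} \]

where \(q_{s,j}\) is the fraction of documents with schema \(s\) that fall into leaf \(v_j\); a schema is pure when all its documents end up in a single leaf, i.e., \(schemaEntropy(s)=0\).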

Conciseness

Purity

Schema profiling example

Schema profiling algorithm

Introduced the notion of schema entropy loss

Loss

Defined a criterion for comparing two splits in the decision tree

Gain

Query the data

One thing is understanding the data; another is enabling querying over heterogeneous data.

What we need

  1. Integration techniques to solve schema heterogeneity and produce a global knowledge base
  2. Query rewriting techniques to translate queries on the global knowledge base into queries on the actual schemas

Focus on OLAP queries

Integration techniques

Integration at the intensional level

  • Schema matching and mapping (Rahm and Bernstein 2001)
    • A match is a correspondence between attributes
    • A mapping is a function to explain the relationship between attributes
      • E.g., S1.FullName = CONCAT(S2.FirstName, S2.LastName)

Integration at the extensional level

  • Entity resolution (a.k.a. record linkage or duplicate detection)
    • Identifying (or linking, or grouping) different records referring to the same real-world entity
    • Aims at removing redundancy and increasing conciseness
  • Data fusion (Mandreoli and Montangero 2019)
    • Fuse records on the same real-world entity into a single record and resolve possible conflicts
    • Aims at increasing correctness of data

Integration

OLAP querying

A first approach to OLAP on heterogeneous data

OLAP querying

Some limitations

  • Expensive querying
    • Does not scale well with the number of schemas
  • Expensive integration
    • High levels of heterogeneity imply complex rewriting rules (requiring knowledge and time)
    • Assuming that a global schema can always be obtained is a bit pretentious

“One does not simply define a global schema”

New integration techniques

(Curry 2020)

Dataspace

Replace the global schema with a dataspace (Curry 2020)

A dataspace is a lightweight integration approach providing basic query expressive power on a variety of data sources, bypassing the complexity of traditional integration approaches and possibly returning best-effort or approximate answers (Franklin, Halevy, and Maier 2005)

Replace traditional integration with a pay-as-you-go approach.

Dataspace

Introducing new concepts

  • Entities: representation of a real-world entity
    • E.g., Customers, Products, Orders, etc.
  • Features: univocal representation of a group of semantically equivalent attributes
    • E.g., CustomerName = { S1.name, S2.fullname, S3.customer, S4.cName, ... }
    • Mapping functions must be defined/definable between every pair of attributes

The dataspace becomes an abstract view in terms of features and entities.
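
For instance, a query over the feature CustomerName can be rewritten against the local schemas roughly as follows (a sketch; collection names are made up to match the example above):

-- Dataspace query: "give me all customer names"
SELECT name     AS CustomerName FROM s1_customers
UNION ALL
SELECT fullname AS CustomerName FROM s2_customers
UNION ALL
SELECT customer AS CustomerName FROM s3_customers;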

New OLAP querying

Entities and features

Execution plan

Previous issues

  • Expensive querying
    • Schema heterogeneity solved at query time
    • Requires complex (but feasible) algorithms
  • Expensive integration
    • Pay-as-you-go approach is quicker, iterative, and more flexible
    • Dataspace is conceptual, untied to logical data modeling

Now we have a multistore dealing with multiple data models and schema heterogeneity

Data inconsistency

Intra-collection

  • Due to denormalized data modeling

Inter-collection

  • Due to analytical data offloading
    • To reduce costs and optimize performance, the historical depth of databases is kept limited
    • After some years, data are offloaded to cheaper/bigger storage, e.g., cloud storage, data lakes
    • Offloading implies a change of data model, a change of schema, and, inevitably, an overlap of instances with the original data
  • Due to multi-cloud architectures
    • Enables the exploitation of data spread across different providers and architectures, all the while overcoming data silos through data virtualization
    • Typical in the presence of many company branches

Solutions?

  • Traditional ETL
  • Solve inconsistencies on the fly

Data fusion

Merge operator

  • Originally introduced as “full outer join merge” (Naumann, Freytag, and Leser 2004)
  • Aims to keep as much information as possible when joining the records of two schemas (see the sketch below)
    • Avoid any loss of records
    • Resolve mappings by providing transcoded output
    • Resolve conflicts whenever necessary
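
In SQL terms, the intuition behind the merge can be sketched with a full outer join whose conflicting attributes are resolved, e.g., via COALESCE (table and column names are illustrative):

-- Merge two customer collections without losing records
SELECT COALESCE(a.customer_id, b.customer_id) AS customer_id,  -- keep every record
       COALESCE(a.name, b.name)               AS name,         -- naive conflict resolution
       a.loyalty_points,                                        -- only in source A
       b.last_order_date                                        -- only in source B
FROM customers_a a
FULL OUTER JOIN customers_b b ON a.customer_id = b.customer_id;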

Merge operator

Fusion and entity resolution

Merge operator

On-the-fly data fusion

Merge operator in a query plan

  • Take the data from heterogeneous sources (in different colors)
  • Extract records of the single entities (e.g., customer, products)
  • Merge each entity
  • Join and produce the final result

Now we have a multistore dealing with multiple data models, schema heterogeneity, and data inconsistency

  • Are we done? Not yet!

On-the-fly data fusion

Main issue: performance

  • Collections accessed more than once
  • Most of the effort is pushed onto the middleware

What can we do about it?

  • Exploit the local DBMSs more
  • Exploit local data modelling
  • Carry out multi-entity merges

Issues

  • Several query plans could be devised
  • Hard to find the most efficient one

Logical optimization

Logical rules to transform a query plan into a more efficient one

  • Predicate push-down: applying selection predicates as close to the source as possible (see the sketch after this list)
    • Not always feasible (e.g., in the presence of inconsistent data)
  • Column pruning: extracting only the attributes relevant to the query
    • Not a given when writing a custom query language
  • Join sequence reordering: changing the order in which binary joins are performed
    • Not so easy when merges are involved as well
    • Not so easy when data comes from different sources
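
A sketch of predicate push-down and column pruning on a cross-store join (collection and column names are illustrative):

-- Without push-down: the middleware joins everything, then filters
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'IT';

-- With push-down (and pruning): the selection and projection run inside the
-- store holding customers, shrinking the data shipped to the middleware
SELECT o.order_id, c.name
FROM orders o
JOIN (SELECT customer_id, name FROM customers WHERE country = 'IT') c
  ON o.customer_id = c.customer_id;

A relational optimizer applies such rewritings automatically; across heterogeneous stores, the middleware must perform them explicitly.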

Same query, several query plans

What is the most efficient solution?

  • Single-entity merge and subsequent joins
  • Nest relational data and multi-merge with documents
  • Join relational data and multi-merge with flattened documents

Depends on several factors

  • On the capabilities of each DBMS/middleware
  • On the presence of indexes and statistics
  • On the resources available to each DBMS/middleware
  • On the number of records involved on each side

… which can change over time

Normalized vs nested representation

Cost modelling

White-box cost modelling

  • Associate theoretical formulas with each query operator, then build up the cost of a query by summing the cost of each operation
  • Cost can be determined in terms of disk I/O, CPU, and network
  • Requires an enormous effort to effectively model the many factors that contribute to query costs in a heterogeneous multistore
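
As a flavor of such formulas (a textbook-style sketch, not the actual model of the paper):

\[ cost(scan(R)) \approx pages(R), \qquad cost(R \bowtie S) \approx pages(R) + pages(S) + transfer(R,S) \]

where \(pages(\cdot)\) counts the disk pages to read and \(transfer(\cdot)\) accounts for moving data between stores; every operator of every engine needs such a formula, hence the modeling effort.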

Cost modelling

Relational algebra

  • Use algebraic structures for modeling data and defining queries on it with well-founded semantics
  • Operate on relations, i.e., homogeneous sets of tuples \(S=\{(s_{j1},s_{j2},\ldots,s_{jn})\mid j\in 1\ldots m\}\)
    • \(m\) is the number of tuples in a table
    • \(n\) is the number of columns
    • All entries in each column have the same type
  • A relation also has a unique tuple called the header, which gives each column a unique name (or attribute) inside the relation
  • Projection is a unary operation \(\Pi _{a_{1},\ldots ,a_{n}}(R)\) where \(a_{1},\ldots ,a_{n}\) is a set of attribute names.
  • Selection is a unary operation \(\sigma _{\varphi }(R)\)
    • E.g., \(\sigma _{{\text{isFriend = true}}\,\lor \,{\text{isBusinessContact = true}}}({\text{addressBook}})\)
  • Cartesian product is a binary operator \(R\times S:=\{(r_{1},r_{2},\dots ,r_{n},s_{1},s_{2},\dots ,s_{m})|(r_{1},r_{2},\dots ,r_{n})\in R,(s_{1},s_{2},\dots ,s_{m})\in S\}\)
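
The selection example above corresponds directly to SQL:

SELECT *
FROM addressBook
WHERE isFriend = true OR isBusinessContact = true;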

Cost modelling

Nested relational algebra

Nested data

Cost modelling

White-box cost modelling example (Forresi, Francia, et al. 2021)

Operations

Statistics

Cost modelling

Black-box cost modelling

  • Hide the behavior of an execution engine within a black-box
    • The known information is mostly limited to the issued queries and the given response times
  • Cost is determined in terms of time
  • Easily adapts to evolving environments
  • Suffers from cold-start

Cost modelling

Black-box cost modelling example (Forresi et al. 2023)

Regression tree

References

Alotaibi, Rana, Bogdan Cautis, Alin Deutsch, Moustafa Latrache, Ioana Manolescu, and Yifei Yang. 2020. “ESTOCADA: Towards Scalable Polystore Systems.” Proc. VLDB Endow. 13 (12): 2949–52. https://doi.org/10.14778/3415478.3415516.
Curry, Edward. 2020. “Dataspaces: Fundamentals, Principles, and Techniques.” In Real-Time Linked Dataspaces - Enabling Data Ecosystems for Intelligent Systems, edited by Edward Curry, 45–62. Springer. https://doi.org/10.1007/978-3-030-29665-0_3.
Dasgupta, Subhasis, Kevin L. Coakley, and Amarnath Gupta. 2016. “Analytics-Driven Data Ingestion and Derivation in the AWESOME Polystore.” In 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington DC, USA, December 5-8, 2016, edited by James Joshi, George Karypis, Ling Liu, Xiaohua Hu, Ronay Ak, Yinglong Xia, Weijia Xu, et al., 2555–64. IEEE Computer Society. https://doi.org/10.1109/BIGDATA.2016.7840897.
DiScala, Michael, and Daniel J. Abadi. 2016. “Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data.” In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, edited by Fatma Özcan, Georgia Koutrika, and Sam Madden, 295–310. ACM. https://doi.org/10.1145/2882903.2882924.
Forresi, Chiara, Matteo Francia, Enrico Gallinucci, and Matteo Golfarelli. 2021. “Optimizing Execution Plans in a Multistore.” In Advances in Databases and Information Systems - 25th European Conference, ADBIS 2021, Tartu, Estonia, August 24-26, 2021, Proceedings, edited by Ladjel Bellatreche, Marlon Dumas, Panagiotis Karras, and Raimundas Matulevicius, 12843:136–51. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-030-82472-3_11.
———. 2023. “Cost-Based Optimization of Multistore Query Plans.” Inf. Syst. Frontiers 25 (5): 1925–51. https://doi.org/10.1007/S10796-022-10320-2.
Forresi, Chiara, Enrico Gallinucci, Matteo Golfarelli, and Hamdi Ben Hamadou. 2021. “A Dataspace-Based Framework for OLAP Analyses in a High-Variety Multistore.” The VLDB Journal 30 (6): 1017–40. https://doi.org/10.1007/S00778-021-00682-5.
Franklin, Michael J., Alon Y. Halevy, and David Maier. 2005. “From Databases to Dataspaces: A New Abstraction for Information Management.” SIGMOD Rec. 34 (4): 27–33. https://doi.org/10.1145/1107499.1107502.
Gallinucci, Enrico, Matteo Golfarelli, and Stefano Rizzi. 2018. “Schema Profiling of Document-Oriented Databases.” Inf. Syst. 75: 13–25. https://doi.org/10.1016/J.IS.2018.02.007.
———. 2019. “Approximate OLAP of Document-Oriented Databases: A Variety-Aware Approach.” Inf. Syst. 85: 114–30. https://doi.org/10.1016/J.IS.2019.02.004.
Jeffery, Shawn R., Michael J. Franklin, and Alon Y. Halevy. 2008. “Pay-as-You-Go User Feedback for Dataspace Systems.” In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, edited by Jason Tsong-Li Wang, 847–60. ACM. https://doi.org/10.1145/1376616.1376701.
Klettke, Meike, Uta Störl, and Stefanie Scherzinger. 2015. “Schema Extraction and Structural Outlier Detection for JSON-Based NoSQL Data Stores.” In Datenbanksysteme für Business, Technologie Und Web (BTW), 16. Fachtagung Des GI-Fachbereichs "Datenbanken Und Informationssysteme" (DBIS), 4.-6.3.2015 in Hamburg, Germany. Proceedings, edited by Thomas Seidl, Norbert Ritter, Harald Schöning, Kai-Uwe Sattler, Theo Härder, Steffen Friedrich, and Wolfram Wingerath, P-241:425–44. LNI. GI. https://dl.gi.de/handle/20.500.12116/2420.
Kolev, Boyan, Patrick Valduriez, Carlyna Bondiombouy, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. “CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language.” Distributed Parallel Databases 34 (4): 463–503. https://doi.org/10.1007/S10619-015-7185-Y.
Mandreoli, Federica, and Manuela Montangero. 2019. “Dealing with Data Heterogeneity in a Data Fusion Perspective: Models, Methodologies, and Algorithms.” In Data Handling in Science and Technology, 31:235–70. Elsevier.
Mattson, Tim, Jennie Rogers, and Aaron J. Elmore. 2019. “The BigDAWG Polystore System.” In Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker, edited by Michael L. Brodie, 22:279–89. ACM Books. ACM / Morgan & Claypool. https://doi.org/10.1145/3226595.3226620.
Naumann, Felix, Johann Christoph Freytag, and Ulf Leser. 2004. “Completeness of Integrated Information Sources.” Inf. Syst. 29 (7): 583–615. https://doi.org/10.1016/J.IS.2003.12.005.
Rahm, Erhard, and Philip A. Bernstein. 2001. “A Survey of Approaches to Automatic Schema Matching.” VLDB J. 10 (4): 334–50. https://doi.org/10.1007/S007780100057.
Ruiz, Diego Sevilla, Severino Feliciano Morales, and Jesús García Molina. 2015. “Inferring Versioned Schemas from NoSQL Databases and Its Applications.” In Conceptual Modeling - 34th International Conference, ER 2015, Stockholm, Sweden, October 19-22, 2015, Proceedings, edited by Paul Johannesson, Mong-Li Lee, Stephen W. Liddle, Andreas L. Opdahl, and Oscar Pastor López, 9381:467–80. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-319-25264-3_35.
Tahara, Daniel, Thaddeus Diamond, and Daniel J. Abadi. 2014. “Sinew: A SQL System for Multi-Structured Data.” In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, edited by Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, 815–26. ACM. https://doi.org/10.1145/2588555.2612183.
Tan, Ran, Rada Chirkova, Vijay Gadepally, and Timothy G. Mattson. 2017. “Enabling Query Processing Across Heterogeneous Data Models: A Survey.” In 2017 IEEE International Conference on Big Data (IEEE BigData 2017), Boston, MA, USA, December 11-14, 2017, edited by Jian-Yun Nie, Zoran Obradovic, Toyotaro Suzumura, Rumi Ghosh, Raghunath Nambiar, Chonggang Wang, Hui Zang, et al., 3211–20. IEEE Computer Society. https://doi.org/10.1109/BIGDATA.2017.8258302.
Wang, Lanjun, Oktie Hassanzadeh, Shuo Zhang, Juwei Shi, Limei Jiao, Jia Zou, and Chen Wang. 2015. “Schema Management for Document Stores.” Proc. VLDB Endow. 8 (9): 922–33. https://doi.org/10.14778/2777598.2777601.