Data Platforms and Artificial Intelligence

Challenges and Applications

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

Polyglot Persistence

To each application, the appropriate DBMS

Polyglot Persistence

To each application, the appropriate DBMS: works well for OLTP

  • What about OLAP?

Polyglot Persistence

Support multiple models in the same database

How can we meta-model graphs into a relational database?

Support multiple models in the same database

CREATE TABLE node (
    node_id        INT PRIMARY KEY,
    created_at     TIMESTAMP
);
CREATE TABLE edge (
    edge_id        INT PRIMARY KEY,
    from_node_id   INT NOT NULL,
    to_node_id     INT NOT NULL,
    edge_type      VARCHAR(255) NOT NULL,
    created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (from_node_id) REFERENCES node(node_id) ON DELETE CASCADE,
    FOREIGN KEY (to_node_id)   REFERENCES node(node_id) ON DELETE CASCADE
);
CREATE TABLE node_label (
    node_id    INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
    label      VARCHAR(255) NOT NULL,
    PRIMARY KEY (node_id, label)
);
CREATE TABLE node_property (
    node_id        INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
    property_key   VARCHAR(255) NOT NULL,
    property_value TEXT,
    PRIMARY KEY (node_id, property_key)
);
CREATE TABLE edge_property (
    edge_id        INT NOT NULL REFERENCES edge(edge_id) ON DELETE CASCADE,
    property_key   VARCHAR(255) NOT NULL,
    property_value TEXT,
    PRIMARY KEY (edge_id, property_key)
);
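
With this meta-model, a one-hop traversal becomes a pair of joins. A minimal sketch (node 42 and the FOLLOWS edge type are made up for illustration):

-- All nodes that node 42 reaches via a 'FOLLOWS' edge, with their properties
SELECT n.node_id, p.property_key, p.property_value
FROM edge e
JOIN node n ON n.node_id = e.to_node_id
LEFT JOIN node_property p ON p.node_id = n.node_id
WHERE e.from_node_id = 42
  AND e.edge_type = 'FOLLOWS';

Multi-hop traversals require recursive CTEs, which is one reason native graph engines tend to outperform this relational encoding.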

Support multiple models in the same database

How can we meta-model JSON documents into a relational database?

Support multiple models in the same database

CREATE TABLE document (
    document_id INT PRIMARY KEY,
    created_at  TIMESTAMP
);
CREATE TABLE node (
    node_id        INT PRIMARY KEY,
    document_id    INT NOT NULL REFERENCES document(document_id) ON DELETE CASCADE,
    parent_node_id INT NULL     REFERENCES node(node_id)         ON DELETE CASCADE,
    key            VARCHAR(255) NULL,
    index_in_array INT NULL,
    node_type      VARCHAR(10) NOT NULL CHECK (node_type IN ('OBJECT','ARRAY','VALUE'))
);
CREATE TABLE node_value (
    node_id       INT PRIMARY KEY REFERENCES node(node_id) ON DELETE CASCADE,
    value_type    VARCHAR(10) NOT NULL CHECK (value_type IN ('STRING','NUMBER','BOOLEAN','NULL')),
    value_string  TEXT NULL,
    value_number  NUMERIC NULL,
    value_boolean BOOLEAN NULL
);
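
As a sketch of how a concrete document shreds into these tables, consider {"user": "Jack", "tags": ["run", "walk"]} (all identifiers are made up for illustration):

-- Root object, its two members, and the array elements
INSERT INTO document (document_id, created_at) VALUES (1, CURRENT_TIMESTAMP);
INSERT INTO node (node_id, document_id, parent_node_id, key, index_in_array, node_type) VALUES
    (10, 1, NULL, NULL,   NULL, 'OBJECT'),  -- the document root
    (11, 1, 10,   'user', NULL, 'VALUE'),   -- "user": ...
    (12, 1, 10,   'tags', NULL, 'ARRAY'),   -- "tags": [...]
    (13, 1, 12,   NULL,   0,    'VALUE'),   -- tags[0]
    (14, 1, 12,   NULL,   1,    'VALUE');   -- tags[1]
INSERT INTO node_value (node_id, value_type, value_string) VALUES
    (11, 'STRING', 'Jack'), (13, 'STRING', 'run'), (14, 'STRING', 'walk');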

Polyglot Persistence in PostgreSQL

PostgreSQL enables polyglot persistence by blending:

  • JSON for semi-structured document data
  • Apache AGE for graph data (Cypher + property graphs)
  • TimescaleDB for time-series and analytical workloads
  • PostGIS for geospatial and GIS operations

Multiple data models coexist in one engine and one SQL interface

  • Enables hybrid relational/document/graph/time-series/geospatial systems

CREATE TABLE sensor_event (
    event_id     SERIAL PRIMARY KEY,
    payload      JSONB,                 -- document model
    location     geography(Point, 4326) -- PostGIS
);
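
A minimal usage sketch, assuming the PostGIS extension is installed (sensor names and coordinates are illustrative):

-- Insert an event mixing the document and geospatial models
INSERT INTO sensor_event (payload, location) VALUES (
    '{"sensor": "t-101", "temp": 21.5}'::jsonb,
    ST_GeogFromText('POINT(11.34 44.49)')   -- lon/lat
);

-- One SQL query spanning both models: JSONB filtering plus a spatial predicate
SELECT event_id, payload->>'sensor' AS sensor
FROM sensor_event
WHERE (payload->>'temp')::numeric > 20
  AND ST_DWithin(location, ST_GeogFromText('POINT(11.34 44.49)'), 1000);  -- within 1 km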

Polyglot persistence: main challenges

Data model heterogeneity

  • Support multiple models in the same database
  • Or integrate data from different databases using different query languages

Inter-collection: different records in different collections have different schemas

  • Not a new problem: think federated databases, corporate mergers, etc.

Intra-collection: different records in the same collection have different schemas

  • Emerged with NoSQL databases

Data inconsistency: reconciling inconsistent versions of the same data (inter- or intra-collection)

Data model heterogeneity

Polyglot Persistence

Basic solutions

Some DBMSs offer multi-model support.

  • Extended RDBMSs
    • KV implementable as a table with two fields: a string key and a blob value
    • Cypher query language on top of a relational implementation of a graph
    • Hstore data type in PostgreSQL for a wide-column-like implementation (see the sketch after this list)
    • Scalability issues remain
  • Multi-model NoSQL DBMSs
    • ArangoDB, OrientDB
    • Support all NoSQL data models, but not the relational one
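
A minimal sketch of the KV and hstore points above (table and key names are illustrative):

-- Key-value store as a two-field table
CREATE TABLE kv_store (
    k VARCHAR(255) PRIMARY KEY,
    v BYTEA                       -- opaque blob value
);

-- Wide-column-like rows via PostgreSQL's hstore extension
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE wide_row (
    row_key VARCHAR(255) PRIMARY KEY,
    columns hstore                -- arbitrary per-row columns
);
INSERT INTO wide_row VALUES ('user:1', 'name => Jack, city => Bologna');
SELECT row_key, columns -> 'city' AS city FROM wide_row;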

Some approaches suggest strategies to model everything within RDBMSs

A taxonomy for distributed solutions (Tan et al. 2017)

Federated database system

  • Homogeneous data stores, exposes a single standard query interface
  • Features a mediator-wrapper architecture, employs schema-mapping and entity-merging techniques for integration of relational data

Polyglot system

  • Homogeneous data stores, exposes multiple query interfaces
  • Takes advantage of the semantic expressiveness of multiple interfaces (e.g., declarative, procedural)

Multistore system

  • Heterogeneous data stores, exposes a single query interface
  • Provides a unified querying layer by adopting ontologies and applying schema-mapping and entity-resolution techniques

Polystore system

  • Heterogeneous data stores, exposes multiple query interfaces
  • Choose from a variety of query interfaces to seamlessly query data residing in multiple data stores

Advanced solutions

The challenge is to balance two often conflicting forces.

  • Location Independence: A query is written, and the system figures out which storage engine it targets
  • Semantic Completeness: A query can exploit the full set of features provided by a storage engine

Example of a polystore

  • Island = a middleware application to support a set of operations on a given data model
  • Shim = a wrapper to convert from the island’s query language to the target DB’s query language

BigDAWG (Mattson, Rogers, and Elmore 2019)

Advanced solutions

BigDAWG middleware consists of

  • Optimizer: parses the input query and creates a set of viable query plan trees with possible engines for each subquery
  • Monitor: uses performance data from prior queries to determine the query plan tree with the best engine for each subquery
  • Executor: figures out how to best join the collections and then executes the query
  • Migrator: moves data from engine to engine when the plan calls for such data motion

Middleware

… and of course we have metadata

Catalog: stores metadata about the system

  • Databases: Databases, their engine membership, and connection authentication information.
  • Objects: Data objects (i.e., tables), field names, and object-to-database membership.
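
A minimal relational sketch of such a catalog (the table layout is illustrative, not BigDAWG’s actual schema):

CREATE TABLE catalog_database (
    database_id    INT PRIMARY KEY,
    engine         VARCHAR(50)  NOT NULL,  -- e.g., 'postgres', 'scidb'
    connection_uri VARCHAR(255) NOT NULL,
    credentials    VARCHAR(255)
);
CREATE TABLE catalog_object (
    object_id   INT PRIMARY KEY,
    database_id INT NOT NULL REFERENCES catalog_database(database_id),
    object_name VARCHAR(255) NOT NULL,     -- e.g., a table or a collection
    field_names TEXT                       -- field names of the object
);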

Middleware

Most notable multistore/polystore proposals

Beyond data model heterogeneity

Entity resolution

  • Every approach needs some kind of integrated knowledge
  • Ample research from federated database systems
  • Usually “out-of-scope”

Schema heterogeneity

Heterogeneous data stored with variant schemas and structural forms

  • Management of schema heterogeneity and data inconsistency
    • Usually addressed as different problems in the literature
  • Missing/additional attributes
  • Different names/types of attributes
  • Different nested structures

Schema heterogeneity

Two main problems

  • Understand the data
  • Query the data

Understanding the data

Early work on XML

  • To deal with the widespread lack of DTDs and XSDs
  • Extract regular expressions to describe the content of elements in a set of XML documents

Recent work on JSON

Schema profiling

Schema profiles explain

  • What are the differences between schemas
  • When/why is one schema used instead of the other

The problem of schema profiling is quite similar to a classification problem.

  • Classifiers also describe the rules for assigning a class to an observation based on the observation’s features
  • Based on the requirements collected from potential users, decision trees emerged as the most suitable formalism
SchemaID | User | Activity  | Weight | Duration | Repetitions
---------|------|-----------|--------|----------|------------
S1       | Jack | Run       |        | 108      |
S2       | John | Leg press | 80     | 4        | 23
S1       | Kate | Walk      |        | 42       |
S3       | John | Push-ups  |        | 8        | 40

Schema profiling

The documents are the observations

The schemas are the classes


Preliminary activities

Semi-structured interviews with 5 users

  • Application domains: fitness equipment sales, software development
  • Understand goals, requirements, and visualization format
  • No single complete/correct description of the dataset was available

Definition of schema profile characteristics

  • Explicativeness
  • Precision
  • Conciseness

Explicativeness

Value-based (VB)

Schema-based (SB)

Value-based (VB) conditions are preferred to schema-based (SB) ones

  • SB: acknowledge a difference between schemas
  • VB: explain it in terms of the values taken by an attribute

The fewer the SB conditions, the higher the explicativeness

Precision

A decision tree is precise if all the leaves are pure.

  • A leaf \(v_j\) is pure if all its observations belong to the same class (i.e., \(entropy(v_j) = 0\))
  • Entropy is strictly related to precision: divisive approaches typically stop only when the leaves are all pure

Entropy formulation

Leaf purity
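
In the standard decision-tree formulation (the paper’s notation may differ):

\[ entropy(v_j) = -\sum_{s \in S} p_{j,s} \log_2 p_{j,s} \]

where \(p_{j,s}\) is the fraction of documents in leaf \(v_j\) having schema \(s\); a leaf is pure when a single schema accounts for all its documents, i.e., \(entropy(v_j)=0\).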

Precision and conciseness

Minimization of entropy often leads to splitting observations of the same class among several leaves

  • Entropy’s sole focus is on node purity
  • More frequent when the number of classes is high

Documents with the same schema are sparse

Precision and conciseness

In classification, precision is typically considered more important than readability.

  • In schema profiling, however, readability is critical
  • A pure focus on precision conflicts with the conciseness requirement

Conciseness

A maximally concise schema profile is one where there is a single rule for each schema

  • Entropy: a leaf is pure if it contains only documents with the same class
  • Schema entropy: a schema is pure if all its documents are in the same leaf

Entropy

Schema entropy
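
A sketch of the intuition (the exact formulation in the paper may differ): schema entropy measures how the documents of a single schema spread across the leaves,

\[ schemaEntropy(s) = -\sum_{j} q_{s,j} \log_2 q_{s,j} \]

where \(q_{s,j}\) is the fraction of documents with schema \(s\) that fall into leaf \(v_j\); a schema is pure when all its documents end up in a single leaf, i.e., \(schemaEntropy(s)=0\).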

Conciseness

Purity

Schema profiling example

Schema profiling algorithm

Introduced the notion of schema entropy loss

Loss

Defined a criterion for comparing two splits in the decision tree

Gain

Query the data

One thing is understanding the data; another is enabling querying over heterogeneous data.

What we need

  1. Integration techniques to solve schema heterogeneity and produce a global knowledge base
  2. Query rewriting techniques to translate queries on the global knowledge base into queries on the actual schemas

Focus on OLAP queries

Integration techniques

Integration at the intensional level

  • Schema matching and mapping (Rahm and Bernstein 2001)
    • A match is a correspondence between attributes
    • A mapping is a function to explain the relationship between attributes
      • E.g., S1.FullName = CONCAT(S2.FirstName, S2.LastName)

Integration at the extensional level

  • Entity resolution (a.k.a. record linkage or duplicate detection)
    • Identifying (or linking, or grouping) different records referring to the same real-world entity
    • Aims at removing redundancy and increasing conciseness
  • Data fusion (Mandreoli and Montangero 2019)
    • Fuse records on the same real-world entity into a single record and resolve possible conflicts
    • Aims at increasing correctness of data

Integration

OLAP querying

A first approach to OLAP on heterogeneous data

OLAP querying

Some limitations

  • Expensive querying
    • Does not scale well with the number of schemas
  • Expensive integration
    • High levels of heterogeneity imply complex rewriting rules (requiring knowledge and time)
    • Assuming that a global schema can always be obtained is a bit pretentious

“One does not simply define a global schema”

New integration techniques

(Curry 2020)

Dataspace

Replace the global schema with a dataspace (Curry 2020)

A dataspace is a lightweight integration approach providing basic query expressive power on a variety of data sources, bypassing the complexity of traditional integration approaches and possibly returning best-effort or approximate answers (Franklin, Halevy, and Maier 2005)

Replace traditional integration with a pay-as-you-go approach.

Dataspace

Introducing new concepts

  • Entities: representation of a real-world entity
    • E.g., Customers, Products, Orders, etc.
  • Features: univocal representation of a group of semantically equivalent attributes
    • E.g., CustomerName = { S1.name, S2.fullname, S3.customer, S4.cName, ... }
    • Mapping functions must be defined/definable between every pair of attributes

The dataspace becomes an abstract view in terms of features and entities.
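
For instance, a query over the feature CustomerName can be rewritten against the local schemas roughly as follows (a sketch; collection names are made up to match the example above):

-- Dataspace query: "give me all customer names"
SELECT name     AS CustomerName FROM s1_customers
UNION ALL
SELECT fullname AS CustomerName FROM s2_customers
UNION ALL
SELECT customer AS CustomerName FROM s3_customers;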

New OLAP querying

Entities and features

Execution plan

Previous issues

  • Expensive querying
    • Schema heterogeneity solved at query time
    • Requires complex (but feasible) algorithms
  • Expensive integration
    • Pay-as-you-go approach is quicker, iterative, and more flexible
    • Dataspace is conceptual, untied to logical data modeling

Now we have a multistore dealing with multiple data models and schema heterogeneity

Data inconsistency

Intra-collection

  • Due to denormalized data modeling

Inter-collection

  • Due to analytical data offloading
    • To reduce costs and optimize performance, the historical depth of databases is kept limited
    • After some years, data are offloaded to cheaper/bigger storage, e.g., cloud storage, data lakes
    • Offloading implies a change of data model, a change of schema, and, inevitably, an overlap of instances with the original data
  • Due to multi-cloud architectures
    • Enables the exploitation of data spread across different providers and architectures, all the while overcoming data silos through data virtualization
    • Typical in the presence of many company branches

Solutions?

  • Traditional ETL
  • Solve inconsistencies on the fly

Data fusion

Merge operator

  • Originally introduced as “full outer join merge” (Naumann, Freytag, and Leser 2004)
  • Aims to keep as much information as possible when joining the records of two schemas (see the sketch below)
    • Avoid any loss of records
    • Resolve mappings by providing transcoded output
    • Resolve conflicts whenever necessary
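
In SQL terms, the intuition behind the merge can be sketched with a full outer join whose conflicting attributes are resolved, e.g., via COALESCE (table and column names are illustrative):

-- Merge two customer collections without losing records
SELECT COALESCE(a.customer_id, b.customer_id) AS customer_id,  -- keep every record
       COALESCE(a.name, b.name)               AS name,         -- naive conflict resolution
       a.loyalty_points,                                        -- only in source A
       b.last_order_date                                        -- only in source B
FROM customers_a a
FULL OUTER JOIN customers_b b ON a.customer_id = b.customer_id;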

Merge operator

Fusion and entity resolution

Merge operator

On-the-fly data fusion

Merge operator in a query plan

  • Take the data from heterogeneous sources (in different colors)
  • Extract records of the single entities (e.g., customer, products)
  • Merge each entity
  • Join and produce the final result

Now we have a multistore dealing with multiple data models, schema heterogeneity, and data inconsistency

  • Are we done? Not yet!

On-the-fly data fusion

Main issue: performance

  • Collections accessed more than once
  • Most of the effort is pushed onto the middleware

What can we do about it?

  • Exploit the local DBMSs more
  • Exploit local data modelling
  • Carry out multi-entity merges

Issues

  • Several query plans could be devised
  • Hard to find the most efficient one

Logical optimization

Logical rules to transform a query plan into a more efficient one

  • Predicate push-down: applying selection predicates as close to the source as possible (see the sketch after this list)
    • Not always feasible (e.g., in the presence of inconsistent data)
  • Column pruning: extracting only the attributes relevant to the query
    • Not a given when writing a custom query language
  • Join sequence reordering: changing the order in which binary joins are performed
    • Not so easy when merges are involved as well
    • Not so easy when data comes from different sources
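
A sketch of predicate push-down and column pruning on a cross-store join (collection and column names are illustrative):

-- Without push-down: the middleware joins everything, then filters
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'IT';

-- With push-down (and pruning): the selection and projection run inside the
-- store holding customers, shrinking the data shipped to the middleware
SELECT o.order_id, c.name
FROM orders o
JOIN (SELECT customer_id, name FROM customers WHERE country = 'IT') c
  ON o.customer_id = c.customer_id;

A relational optimizer applies such rewritings automatically; across heterogeneous stores, the middleware must perform them explicitly.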

Same query, several query plans

What is the most efficient solution?

  • Single-entity merge and subsequent joins
  • Nest relational data and multi-merge with documents
  • Join relational data and multi-merge with flattened documents

Depends on several factors

  • On the capabilities of each DBMS/middleware
  • On the presence of indexes and statistics
  • On the resources available to each DBMS/middleware
  • On the number of records involved on each side

… which can change over time

Normalized vs nested representation

Cost modelling

White-box cost modelling

  • Associate theoretical formulas with each query operator, then build up the cost of a query by summing the cost of each operation
  • Cost can be determined in terms of disk I/O, CPU, and network
  • Requires an enormous effort to effectively model the many factors that contribute to query costs in a heterogeneous multistore
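
As a flavor of such formulas (a textbook-style sketch, not the actual model of the paper):

\[ cost(scan(R)) \approx pages(R), \qquad cost(R \bowtie S) \approx pages(R) + pages(S) + transfer(R,S) \]

where \(pages(\cdot)\) counts the disk pages to read and \(transfer(\cdot)\) accounts for moving data between stores; every operator of every engine needs such a formula, hence the modeling effort.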

Cost modelling

Relational algebra

  • Use algebraic structures for modeling data and defining queries on it with well-founded semantics
  • Operate on relations, i.e., homogeneous sets of tuples \(S=\{(s_{j1},s_{j2},\ldots,s_{jn})\mid j\in 1\ldots m\}\)
    • \(m\) is the number of tuples in a table
    • \(n\) is the number of columns
    • All entries in each column have the same type
  • A relation also has a unique tuple called the header, which gives each column a unique name (or attribute) inside the relation
  • Projection is a unary operation \(\Pi _{a_{1},\ldots ,a_{n}}(R)\) where \(a_{1},\ldots ,a_{n}\) is a set of attribute names.
  • Selection is a unary operation \(\sigma _{\varphi }(R)\)
    • E.g., \(\sigma _{{\text{isFriend = true}}\,\lor \,{\text{isBusinessContact = true}}}({\text{addressBook}})\)
  • Cartesian product is a binary operator \(R\times S:=\{(r_{1},r_{2},\dots ,r_{n},s_{1},s_{2},\dots ,s_{m})|(r_{1},r_{2},\dots ,r_{n})\in R,(s_{1},s_{2},\dots ,s_{m})\in S\}\)
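
The selection example above corresponds directly to SQL:

SELECT *
FROM addressBook
WHERE isFriend = true OR isBusinessContact = true;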

Cost modelling

Nested relational algebra

Nested data

Cost modelling

White-box cost modelling example (Forresi, Francia, et al. 2021)

Operations

Statistics

Cost modelling

Black-box cost modelling

  • Hide the behavior of an execution engine within a black-box
    • The known information is mostly limited to the issued queries and the given response times
  • Cost is determined in terms of time
  • Easily adapts to evolving environments
  • Suffers from cold-start

Cost modelling

Black-box cost modelling example (Forresi et al. 2023)

Regression tree

References

Alotaibi, Rana, Bogdan Cautis, Alin Deutsch, Moustafa Latrache, Ioana Manolescu, and Yifei Yang. 2020. “ESTOCADA: Towards Scalable Polystore Systems.” Proc. VLDB Endow. 13 (12): 2949–52. https://doi.org/10.14778/3415478.3415516.
Curry, Edward. 2020. “Dataspaces: Fundamentals, Principles, and Techniques.” In Real-Time Linked Dataspaces - Enabling Data Ecosystems for Intelligent Systems, edited by Edward Curry, 45–62. Springer. https://doi.org/10.1007/978-3-030-29665-0_3.
Dasgupta, Subhasis, Kevin L. Coakley, and Amarnath Gupta. 2016. “Analytics-Driven Data Ingestion and Derivation in the AWESOME Polystore.” In 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington DC, USA, December 5-8, 2016, edited by James Joshi, George Karypis, Ling Liu, Xiaohua Hu, Ronay Ak, Yinglong Xia, Weijia Xu, et al., 2555–64. IEEE Computer Society. https://doi.org/10.1109/BIGDATA.2016.7840897.
DiScala, Michael, and Daniel J. Abadi. 2016. “Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data.” In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, edited by Fatma Özcan, Georgia Koutrika, and Sam Madden, 295–310. ACM. https://doi.org/10.1145/2882903.2882924.
Forresi, Chiara, Matteo Francia, Enrico Gallinucci, and Matteo Golfarelli. 2021. “Optimizing Execution Plans in a Multistore.” In Advances in Databases and Information Systems - 25th European Conference, ADBIS 2021, Tartu, Estonia, August 24-26, 2021, Proceedings, edited by Ladjel Bellatreche, Marlon Dumas, Panagiotis Karras, and Raimundas Matulevicius, 12843:136–51. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-030-82472-3_11.
———. 2023. “Cost-Based Optimization of Multistore Query Plans.” Inf. Syst. Frontiers 25 (5): 1925–51. https://doi.org/10.1007/S10796-022-10320-2.
Forresi, Chiara, Enrico Gallinucci, Matteo Golfarelli, and Hamdi Ben Hamadou. 2021. “A Dataspace-Based Framework for OLAP Analyses in a High-Variety Multistore.” The VLDB Journal 30 (6): 1017–40. https://doi.org/10.1007/S00778-021-00682-5.
Franklin, Michael J., Alon Y. Halevy, and David Maier. 2005. “From Databases to Dataspaces: A New Abstraction for Information Management.” SIGMOD Rec. 34 (4): 27–33. https://doi.org/10.1145/1107499.1107502.
Gallinucci, Enrico, Matteo Golfarelli, and Stefano Rizzi. 2018. “Schema Profiling of Document-Oriented Databases.” Inf. Syst. 75: 13–25. https://doi.org/10.1016/J.IS.2018.02.007.
———. 2019. “Approximate OLAP of Document-Oriented Databases: A Variety-Aware Approach.” Inf. Syst. 85: 114–30. https://doi.org/10.1016/J.IS.2019.02.004.
Jeffery, Shawn R., Michael J. Franklin, and Alon Y. Halevy. 2008. “Pay-as-You-Go User Feedback for Dataspace Systems.” In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, edited by Jason Tsong-Li Wang, 847–60. ACM. https://doi.org/10.1145/1376616.1376701.
Klettke, Meike, Uta Störl, and Stefanie Scherzinger. 2015. “Schema Extraction and Structural Outlier Detection for JSON-Based NoSQL Data Stores.” In Datenbanksysteme für Business, Technologie Und Web (BTW), 16. Fachtagung Des GI-Fachbereichs "Datenbanken Und Informationssysteme" (DBIS), 4.-6.3.2015 in Hamburg, Germany. Proceedings, edited by Thomas Seidl, Norbert Ritter, Harald Schöning, Kai-Uwe Sattler, Theo Härder, Steffen Friedrich, and Wolfram Wingerath, P-241:425–44. LNI. GI. https://dl.gi.de/handle/20.500.12116/2420.
Kolev, Boyan, Patrick Valduriez, Carlyna Bondiombouy, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. “CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language.” Distributed Parallel Databases 34 (4): 463–503. https://doi.org/10.1007/S10619-015-7185-Y.
Mandreoli, Federica, and Manuela Montangero. 2019. “Dealing with Data Heterogeneity in a Data Fusion Perspective: Models, Methodologies, and Algorithms.” In Data Handling in Science and Technology, 31:235–70. Elsevier.
Mattson, Tim, Jennie Rogers, and Aaron J. Elmore. 2019. “The BigDAWG Polystore System.” In Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker, edited by Michael L. Brodie, 22:279–89. ACM Books. ACM / Morgan & Claypool. https://doi.org/10.1145/3226595.3226620.
Naumann, Felix, Johann Christoph Freytag, and Ulf Leser. 2004. “Completeness of Integrated Information Sources.” Inf. Syst. 29 (7): 583–615. https://doi.org/10.1016/J.IS.2003.12.005.
Rahm, Erhard, and Philip A. Bernstein. 2001. “A Survey of Approaches to Automatic Schema Matching.” VLDB J. 10 (4): 334–50. https://doi.org/10.1007/S007780100057.
Ruiz, Diego Sevilla, Severino Feliciano Morales, and Jesús García Molina. 2015. “Inferring Versioned Schemas from NoSQL Databases and Its Applications.” In Conceptual Modeling - 34th International Conference, ER 2015, Stockholm, Sweden, October 19-22, 2015, Proceedings, edited by Paul Johannesson, Mong-Li Lee, Stephen W. Liddle, Andreas L. Opdahl, and Oscar Pastor López, 9381:467–80. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-319-25264-3_35.
Tahara, Daniel, Thaddeus Diamond, and Daniel J. Abadi. 2014. “Sinew: A SQL System for Multi-Structured Data.” In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, edited by Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, 815–26. ACM. https://doi.org/10.1145/2588555.2612183.
Tan, Ran, Rada Chirkova, Vijay Gadepally, and Timothy G. Mattson. 2017. “Enabling Query Processing Across Heterogeneous Data Models: A Survey.” In 2017 IEEE International Conference on Big Data (IEEE BigData 2017), Boston, MA, USA, December 11-14, 2017, edited by Jian-Yun Nie, Zoran Obradovic, Toyotaro Suzumura, Rumi Ghosh, Raghunath Nambiar, Chonggang Wang, Hui Zang, et al., 3211–20. IEEE Computer Society. https://doi.org/10.1109/BIGDATA.2017.8258302.
Wang, Lanjun, Oktie Hassanzadeh, Shuo Zhang, Juwei Shi, Limei Jiao, Jia Zou, and Chen Wang. 2015. “Schema Management for Document Stores.” Proc. VLDB Endow. 8 (9): 922–33. https://doi.org/10.14778/2777598.2777601.