Challenges and Applications
To each application, the appropriate DBMS: the relational model works well for OLTP, but other workloads call for other data models
Polyglot Persistence
How can we meta-model graphs into a relational database?
CREATE TABLE node (
  node_id INT PRIMARY KEY,
  created_at TIMESTAMP
);
CREATE TABLE edge (
  edge_id INT PRIMARY KEY,
  from_node_id INT NOT NULL,
  to_node_id INT NOT NULL,
  edge_type VARCHAR(255) NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (from_node_id) REFERENCES node(node_id) ON DELETE CASCADE,
  FOREIGN KEY (to_node_id) REFERENCES node(node_id) ON DELETE CASCADE
);
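With just these two tables, graph traversals become recursive SQL. A sketch (PostgreSQL-style syntax; the start node id 1 is hypothetical):

```sql
-- All nodes reachable from node 1, following edges forward.
-- UNION (not UNION ALL) deduplicates, so the recursion terminates on cycles.
WITH RECURSIVE reachable(node_id) AS (
    SELECT 1                       -- hypothetical start node
  UNION
    SELECT e.to_node_id
    FROM edge e
    JOIN reachable r ON e.from_node_id = r.node_id
)
SELECT node_id FROM reachable;
```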
CREATE TABLE node_label (
  node_id INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
  label VARCHAR(255) NOT NULL,
  PRIMARY KEY (node_id, label)
);
CREATE TABLE node_property (
  node_id INT NOT NULL REFERENCES node(node_id) ON DELETE CASCADE,
  property_key VARCHAR(255) NOT NULL,
  property_value TEXT,
  PRIMARY KEY (node_id, property_key)
);
CREATE TABLE edge_property (
  edge_id INT NOT NULL REFERENCES edge(edge_id) ON DELETE CASCADE,
  property_key VARCHAR(255) NOT NULL,
  property_value TEXT,
  PRIMARY KEY (edge_id, property_key)
);
How can we meta-model JSON documents into a relational database?
CREATE TABLE document (
  document_id INT PRIMARY KEY,
  created_at TIMESTAMP
);
CREATE TABLE node (
  node_id INT PRIMARY KEY,
  document_id INT NOT NULL REFERENCES document(document_id) ON DELETE CASCADE,
  parent_node_id INT NULL REFERENCES node(node_id) ON DELETE CASCADE,
  key VARCHAR(255) NULL,
  index_in_array INT NULL,
  node_type VARCHAR(10) NOT NULL CHECK (node_type IN ('OBJECT','ARRAY','VALUE'))
);
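The node table alone is enough to navigate the document tree; for instance, listing the children of a (hypothetical) object node with id 1:

```sql
-- Children of node 1: object keys or array positions, depending on the parent.
SELECT key, index_in_array, node_type
FROM node
WHERE parent_node_id = 1
ORDER BY index_in_array, key;
```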
CREATE TABLE node_value (
  node_id INT PRIMARY KEY REFERENCES node(node_id) ON DELETE CASCADE,
  value_type VARCHAR(10) NOT NULL CHECK (value_type IN ('STRING','NUMBER','BOOLEAN','NULL')),
  value_string TEXT NULL,
  value_number NUMERIC NULL,
  value_boolean BOOLEAN NULL
);
PostgreSQL enables polyglot persistence by blending:
- JSON for semi-structured document data
- Apache AGE for graph data (Cypher + property graphs)
- TimescaleDB for time-series and analytical workloads
- PostGIS for geospatial and GIS operations
Multiple data models coexist in one engine and one SQL interface
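For contrast with the hand-rolled relational meta-model above, a sketch of the native JSON route in PostgreSQL (the table and data are hypothetical; the `jsonb` type and the `->`, `->>`, and `@>` operators are standard PostgreSQL):

```sql
CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  profile JSONB NOT NULL
);

INSERT INTO customer VALUES
  (1, '{"name": "Ada", "tags": ["vip"], "address": {"city": "Bologna"}}');

-- Extract a nested field as text, filtering by JSON containment.
SELECT profile -> 'address' ->> 'city' AS city
FROM customer
WHERE profile @> '{"tags": ["vip"]}';
```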
Data model heterogeneity
Inter-collection: different records in different collections have different schemas
Intra-collection: different records in the same collection have different schemas
Polyglot Persistence
Some DBMSs offer multi-model support.
Some approaches suggest strategies to model everything within RDBMSs
- Federated database system
- Polyglot system
- Multistore system
- Polystore system
The challenge is to balance two often conflicting forces.
Example of a polystore
BigDAWG (Mattson, Rogers, and Elmore 2019)
BigDAWG middleware consists of islands (query interfaces, each tied to a data model), shims (translating island queries to engine queries), and casts (migrating data between engines)
… and of course we have metadata
Catalog: stores metadata about the system
Most notable multistore/polystore proposals
Entity resolution
Heterogeneous data stored with variant schemas and structural forms
Two main problems
Early work on XML
Recent work on JSON
Schema profiles explain
The problem of schema profiling is quite similar to a classification problem.
| SchemaID | User | Activity | Weight | Duration | Repetitions |
|---|---|---|---|---|---|
| S1 | Jack | Run | | 108 | |
| S2 | John | Leg press | 80 | 4 | 23 |
| S1 | Kate | Walk | | 42 | |
| S3 | John | Push-ups | | 8 | 40 |
The documents are the observations
The schemas are the classes
Schema profiling
Semi-structured interviews with 5 users
Definition of schema profile characteristics


Value-based (VB) conditions are preferred to schema-based (SB) ones
The fewer the SB conditions, the higher the explicativeness
A decision tree is precise if all the leaves are pure.
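Purity can be quantified with the standard entropy measure (a textbook definition, not specific to the cited approach): with p_i the fraction of documents of schema i in a leaf,

```latex
H(\mathrm{leaf}) = -\sum_{i=1}^{k} p_i \log_2 p_i
```

A leaf is pure exactly when H = 0, i.e., all of its documents share one schema.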


Minimization of entropy often leads to splitting observations of the same class among several leaves
Documents with the same schema are sparse
Typically, precision is more important than readability.
A maximally concise schema profile is one where there is a single rule for each schema


Purity
Examples
Introduced the notion of schema entropy loss

Defined a criterion for comparing two splits in the decision tree

One thing is understanding the data; another is enabling querying over heterogeneous data.
What we need
Focus on OLAP queries
Integration at the intensional level
S1.FullName = CONCAT(S2.FirstName, S2.LastName)
Integration at the extensional level
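At the extensional level, the instances of both sources must be combined under the mapped schema. A hedged sketch, reusing the mapping above (S1 and S2 as table names are an assumption for illustration):

```sql
-- Union the two sources under S1's schema, applying the intensional mapping to S2.
SELECT FullName FROM S1
UNION
SELECT CONCAT(S2.FirstName, S2.LastName) FROM S2;
```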

A first approach to OLAP on heterogeneous data

Some limitations
“One does not simply define a global schema”
Replace the global schema with a dataspace (Curry 2020)
A dataspace is a lightweight integration approach providing basic query expressive power on a variety of data sources, bypassing the complexity of traditional integration approaches and possibly returning best-effort or approximate answers (Franklin, Halevy, and Maier 2005)
Replace traditional integration with a pay-as-you-go approach.
Introducing new concepts
Customers, Products, Orders, etc.CustomerName = { S1.name, S2.fullname, S3.customer, S4.cName, ... }The dataspace becomes an abstract view in terms of features and entities.


Previous issues
Now we have a multistore dealing with multiple data models and schema heterogeneity
Intra-collection
Inter-collection
Solutions?
Merge operator


Merge operator in a query plan
Now we have a multistore dealing with multiple data models, schema heterogeneity, and data inconsistency

Main issue: performance
What can we do about it?
Issues

Logical rules to transform a query plan into a more efficient one
What is the most efficient solution?
Depends on several factors
… which can change over time

White-box cost modelling
Relational algebra
Nested relational algebra
Nested data
White-box cost modelling example (Forresi, Francia, et al. 2021)
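The formulas of the cited example are not reproduced here; as a generic illustration of the white-box style (an assumption, not the cited model), a selection over a relation R can be costed analytically from catalog statistics such as page count and predicate selectivity:

```latex
\mathrm{cost}(\sigma_p(R)) = \mathrm{pages}(R)
\qquad
|\sigma_p(R)| = \mathrm{sel}(p) \cdot |R|
```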


Black-box cost modelling
Black-box cost modelling example (Forresi et al. 2023)
Regression tree
Matteo Francia - Data Platforms and Artificial Intelligence - A.Y. 2025/26