Challenges and Applications
Lacking smart support to govern the complexity of data and transformations
Data transformations must be governed to prevent the data platform (DP) from turning into a data swamp
Knowledge representation
Knowledge exploitation
A classification of metadata (Sharma and Thusoo 2016)
Technical metadata
Operational metadata
Business metadata
Another classification of metadata (Sawadogo et al. 2019)
Intra-object metadata
Inter-object metadata
Global metadata
Pull strategy
Push strategy
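The two strategies can be contrasted with a minimal sketch. It assumes the usual reading of the terms: with pull, the catalog actively scans the sources for metadata; with push, the sources send metadata to the catalog when they produce it. `MetadataCatalog`, `scan_landing_area`, and the sample metadata are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetadataCatalog:
    """Toy in-memory catalog (hypothetical, for illustration only)."""
    records: Dict[str, dict] = field(default_factory=dict)

    def push(self, object_id: str, metadata: dict) -> None:
        # Push strategy: the source (or the job that wrote the object)
        # sends its metadata to the catalog as soon as it is produced.
        self.records.setdefault(object_id, {}).update(metadata)

    def pull(self, sources: List[Callable[[], Dict[str, dict]]]) -> int:
        # Pull strategy: the catalog scans each source and collects
        # whatever metadata it finds; returns the number of objects seen.
        collected = 0
        for scan in sources:
            for object_id, metadata in scan().items():
                self.records.setdefault(object_id, {}).update(metadata)
                collected += 1
        return collected


def scan_landing_area() -> Dict[str, dict]:
    # Stand-in for a crawler that lists files and reads their properties.
    return {"2021_sales.csv": {"size": 2048, "format": "csv"}}


catalog = MetadataCatalog()
catalog.push("2021_sales.csv", {"owner": "ingestion-job"})
n_pulled = catalog.pull([scan_landing_area])
```

In this sketch the pull path needs scheduling and access to the sources, while the push path needs every producer to cooperate; a real platform may combine both.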
Still one of the main issues in data platforms!
Semantic enrichment
Data indexing
Link generation and conservation
Data polymorphism
Data versioning
Usage tracking

Constance: (Hai, Geisler, and Quix 2016)

GEMMS: (Quix, Hai, and Vatov 2016)

GOODS: (Halevy et al. 2016)


Ground: (Hellerstein et al. 2017)


KAYAK: (Maccioni and Torlone 2018)

Three areas:


Tune the trade-off between the level of detail of the functionalities and the required computational effort
MOSES: (Francia et al. 2021)

| Functionality | Supported |
|---|---|
| Semantic enrichment | Yes |
| Data indexing | No |
| Link generation | Yes |
| Data polymorphism | Yes |
| Data versioning | Yes |
| Usage tracking | Yes |
Back in the database community
R. Angles et al. Foundations of Modern Query Languages for Graph Databases

Formal definition:

Graph traversal: “the ability to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description” [Marko Rodriguez]
Totally opposite to set theory (on which relational databases are based)
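As a rough illustration of traversal to arbitrary depth, the sketch below walks a small cyclic graph breadth-first. The adjacency-dictionary representation and the `traverse` function are assumptions made for this example, not part of any graph database API.

```python
from collections import deque
from typing import Dict, List, Set

# Adjacency lists: node -> nodes reachable through one outgoing edge.
Graph = Dict[str, List[str]]


def traverse(graph: Graph, start: str) -> List[str]:
    """Breadth-first traversal to arbitrary depth; safe on cyclic graphs."""
    visited: Set[str] = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:  # skip nodes already seen (cycles)
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# "d" points back to "a", so the graph contains a cycle.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
reachable = traverse(graph, "a")
```

Each hop follows an edge directly from the current node; expressing the same walk over relational tables would typically take one additional self-join per hop.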
Discoverability is a key requirement for data platforms
Two levels of querying
Return all objects of a given project
Return small objects with a given name pattern in the landing area
MATCH (o:Object)-[]-(d:DataLakeArea)
WHERE d.name = "Landing" AND o.name STARTS WITH "2021_" AND o.size < 100000
RETURN o
Schema-driven search: return objects that contain information referring to a given Domain
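The landing-area query and the schema-driven search can be mirrored on an in-memory catalog as a sketch. `CatalogObject`, its fields, and the sample objects are hypothetical; in particular, storing the domains directly on the object is a simplifying assumption standing in for a path through schema and attributes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogObject:
    name: str
    area: str
    size: int  # bytes
    # Domains that the object's schema attributes refer to (assumed model).
    domains: Set[str] = field(default_factory=set)


def small_landing_objects(objects: List[CatalogObject], prefix: str,
                          max_size: int) -> List[str]:
    # Basic search: filter on area, name prefix, and size.
    return [o.name for o in objects
            if o.area == "Landing"
            and o.name.startswith(prefix)
            and o.size < max_size]


def objects_about(objects: List[CatalogObject], domain: str) -> List[str]:
    # Schema-driven search: objects whose schema refers to the domain.
    return [o.name for o in objects if domain in o.domains]


objects = [
    CatalogObject("2021_sales.csv", "Landing", 2048, {"Sales"}),
    CatalogObject("2021_clicks.json", "Landing", 5_000_000, {"Web"}),
    CatalogObject("2020_sales.csv", "Landing", 1024, {"Sales"}),
    CatalogObject("sales_clean.parquet", "Archive", 4096, {"Sales"}),
]

by_name = small_landing_objects(objects, "2021_", 100_000)
by_domain = objects_about(objects, "Sales")
```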

Provenance-driven search
MATCH (obj1:Object)-[:readsFrom]-(o:Operation)-[:writesTo]-(obj2:Object)
CREATE (obj1)-[:ancestorOf]->(obj2)
Discover objects obtained from a given ancestor
Discover object(s) from which another has originated
Example: an ML team wants to use datasets that were publicized as canonical for certain domains, but finds them too “groomed” for ML
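Both provenance questions reduce to reachability over the `ancestorOf` links. The sketch below derives those links from operations (what each one reads and writes) and then follows them transitively in either direction. The operation tuples and file names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (operation name, objects it reads from, objects it writes to)
Operation = Tuple[str, List[str], List[str]]


def ancestor_edges(operations: List[Operation]) -> Dict[str, Set[str]]:
    """Map each object to the objects directly derived from it."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for _name, inputs, outputs in operations:
        for src in inputs:
            for dst in outputs:
                children[src].add(dst)
    return children


def descendants(children: Dict[str, Set[str]], ancestor: str) -> Set[str]:
    """All objects obtained, directly or indirectly, from `ancestor`."""
    found: Set[str] = set()
    stack = [ancestor]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def ancestors(children: Dict[str, Set[str]], obj: str) -> Set[str]:
    """All objects from which `obj` originated (walk the edges backwards)."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for src, dsts in children.items():
        for dst in dsts:
            parents[dst].add(src)
    return descendants(parents, obj)


operations = [
    ("clean", ["raw.csv"], ["clean.csv"]),
    ("aggregate", ["clean.csv"], ["canonical.parquet"]),
]
edges = ancestor_edges(operations)
derived = descendants(edges, "raw.csv")
origins = ancestors(edges, "canonical.parquet")
```

In the ML team scenario, `ancestors` is the lookup of interest: starting from the canonical dataset, it leads back to the less processed objects it was derived from.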

Similarity-driven search
Discover datasets to be merged in a certain query
Discover datasets to be joined in a certain query
Group similar objects and enrich the search results
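The slides do not fix a similarity measure. As one possible sketch, the example below scores pairs of datasets by the Jaccard similarity of their attribute names, which is an assumption chosen for simplicity. The schemas and the 0.5 threshold are illustrative only.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(schemas: Dict[str, Set[str]],
                  threshold: float) -> List[Tuple[str, str, float]]:
    """Pairs of datasets whose attribute sets overlap at least `threshold`."""
    pairs = []
    for n1, n2 in combinations(sorted(schemas), 2):
        score = jaccard(schemas[n1], schemas[n2])
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs


schemas = {
    "sales_2020": {"id", "date", "amount"},
    "sales_2021": {"id", "date", "amount"},
    "customers": {"id", "name", "city"},
}
candidates = similar_pairs(schemas, threshold=0.5)
```

Under this measure, a score close to 1 suggests datasets with the same schema, which are candidates for merging. A low but non-zero score, such as the shared `id` between sales and customers, hints at a possible join key. Name overlap alone is a weak signal, so a real platform would likely also compare data types and value distributions.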

Semantics-driven search
Search objects without having any knowledge of their physical or intensional properties, but simply exploiting their traceability to a certain semantic concept

Profiling
MATCH (o:Object)-[]-(:OntologyType {name:"Table"}),
(o)-[]-(s:Schema)-[]-(a:Attribute),
(o)-[r:similarTo]-(o2:Object),
(o)-[:ancestorOf]->(o3:Object),
(o4:Object)-[:ancestorOf]->(o)
RETURN o, s, a, r, o2, o3, o4
Matteo Francia - Data Platforms and Artificial Intelligence - A.Y. 2025/26