Data Platforms and Artificial Intelligence

Challenges and Applications

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

Metadata Challenges

Smart support to govern the complexity of data and transformations is lacking

Data transformations must be governed to prevent the data platform (DP) from turning into a swamp

  • Amplified in data science, where data scientists prevail over data architects
  • Leverage descriptive metadata and maintenance to keep control over the data

Metadata Challenges

Knowledge representation

  • Which metadata must be captured
  • How should metadata be organized

Knowledge exploitation

  • Which features do metadata enable

Which metadata must be captured?

A classification of metadata (Sharma and Thusoo 2016)

Technical metadata

  • Capture the form and structure of each dataset
  • E.g.: type of data (text, JSON, Avro); structure of the data (the fields and their types)

Operational metadata

  • Capture lineage, quality, profile, and provenance of the data
  • E.g.: source and target locations of data, size, number of records, and lineage

Business metadata

  • Capture what the data mean to the user
  • E.g.: business names, descriptions, tags, quality, and masking rules for privacy
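
For instance, the three kinds of metadata could coexist on a single dataset node in a property-graph metadata store. The following Cypher sketch is purely illustrative (labels and property names are assumptions, not the metamodel of any specific system):

// Technical metadata: form and structure of the dataset
CREATE (o:Object {name: "orders_2021", format: "JSON", fields: ["order_id: int", "amount: float"]})
// Operational metadata: provenance, size, number of records
SET o.sourcePath = "s3://landing/orders/", o.sizeBytes = 1048576, o.records = 20000
// Business metadata: what the dataset means to the user
SET o.description = "Daily orders from the e-commerce channel", o.tags = ["sales", "masked-PII"]
RETURN o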

Another classification of metadata (Sawadogo et al. 2019)

Intra-object metadata

  • Properties provide a general description of an object in the form of key-value pairs
  • Summaries and previews provide an overview of the content or structure of an object
  • Semantic metadata are annotations that help understand the meaning of data

Inter-object metadata

  • Object groupings organize objects into collections; an object can belong to several collections simultaneously
  • Similarity links reflect the strength of the similarity between two objects
  • Parenthood relationships reflect the fact that an object can be the result of joining several others

Global metadata

  • Semantic resources, i.e., knowledge bases (ontologies, taxonomies, thesauri, dictionaries) used to generate other metadata and improve analyses
  • Indexes, i.e., data structures that help find an object quickly
  • Logs, used to track user interactions with the data lake
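
The inter-object metadata above map naturally onto relationships of a graph metadata store. A minimal, purely illustrative Cypher sketch (each statement is standalone; relationship names mirror those used in the queries later in this section):

// Object groupings: an object can belong to several collections
MATCH (o:Object {name: "orders_2021"}), (c:Collection {name: "Sales"})
CREATE (o)-[:belongsTo]->(c)

// Similarity links carry the type and strength of the similarity
MATCH (o1:Object {id: 1}), (o2:Object {id: 2})
CREATE (o1)-[:similarTo {similarityType: "affinity", value: 0.87}]->(o2)

// Parenthood: an object derived from another one
MATCH (parent:Object {id: 1}), (child:Object {id: 3})
CREATE (parent)-[:ancestorOf]->(child)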

Capturing the metadata

Pull strategy

  • The system actively collects new metadata
  • Requires scheduling: when does the system activate itself?
    • Event-based (CRUD)
    • Time-based
  • Requires wrappers: what does the system capture?
    • Based on data type and/or application
    • Comprehensive monitoring is practically unfeasible

Push strategy

  • The system passively receives new metadata
  • Requires an API layer
  • Mandatory for operational metadata

Still one of the main issues in data platforms!
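
As an example of the push strategy, a transformation job could send its own operational metadata to the metadata store at the end of its run. A minimal sketch, assuming a graph-based store and hypothetical parameter names ($sourceName, $targetName, $start, $end, $records):

// Pushed by the job itself through the metadata API layer
MERGE (src:Object {name: $sourceName})
MERGE (tgt:Object {name: $targetName})
CREATE (op:Operation {type: "batch-transformation", startedAt: $start, endedAt: $end, records: $records})
CREATE (op)-[:readsFrom]->(src)
CREATE (op)-[:writesTo]->(tgt)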

How should metadata be organized?

Semantic enrichment

  • Add contextual descriptions (e.g., tags) for easier interpretation.

Data indexing

  • Use structures to retrieve datasets by characteristics like keywords.

Link generation and conservation

  • Detect similarities or maintain links between datasets.

Data polymorphism

  • Store multiple representations of the same data to avoid repeated pre-processing.

Data versioning

  • Track data changes while preserving previous states.

Usage tracking

  • Record user interactions with the data.
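
A few of these features, sketched as standalone Cypher statements (assuming a Neo4j-like store; the Tag and LogEntry nodes and the relationship types are illustrative):

// Semantic enrichment: attach a contextual tag to an object
MATCH (o:Object {name: "orders_2021"})
MERGE (t:Tag {name: "sales"})
CREATE (o)-[:taggedWith]->(t)

// Data indexing: index the property used for keyword lookups
CREATE INDEX object_name IF NOT EXISTS FOR (o:Object) ON (o.name)

// Usage tracking: log a user interaction with an object
MATCH (o:Object {name: "orders_2021"})
CREATE (:LogEntry {user: "alice", action: "read", at: datetime()})-[:refersTo]->(o)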

Constance: (Hai, Geisler, and Quix 2016)

  • Few details given on metamodel and functionalities.
  • No metadata collected on operations.

GEMMS: (Quix, Hai, and Vatov 2016)

  • No discussion about the functionalities provided.
  • No metadata collected on operations and agents.

GOODS: (Halevy et al. 2016)

  • Crawls Google’s storage systems to extract basic metadata on datasets and their relationships with other datasets.
  • Performs metadata inference (e.g., determine the schema of a dataset, trace the provenance of data, or annotate data with their semantics).
  • Strictly coupled with the Google platform.
  • Mainly focuses on object description and searches.
  • No formal description of the metamodel.

Ground: (Hellerstein et al. 2017)

  • Version graphs represent data versions.
  • Model graphs represent application metadata, i.e., how data are interpreted for use.
  • Lineage graphs capture usage information.
  • Not enough details given to clarify which metadata are actually handled.
  • Functionalities are described at a high level.

KAYAK: (Maccioni and Torlone 2018)

  • Supports users in creating and optimizing data processing pipelines.
  • Only goal-related metadata are collected.

MOSES

Three areas:

  • Technical
  • Operational
  • Business

  • Not pre-defined
  • Domain-independent
  • Extensible

Tune the trade-off between the level of detail of the functionalities and the required computational effort

MOSES: (Francia et al. 2021)

Functionality         Supported
Semantic enrichment   Yes
Data indexing         No
Link generation       Yes
Data polymorphism     Yes
Data versioning       Yes
Usage tracking        Yes

How would you implement the meta-model?

The Property Graph Data Model

Property graphs are back in the database community spotlight

  • Meant to be queried and processed
  • THERE IS NO STANDARD!

R. Angles et al. Foundations of Modern Query Languages for Graph Databases

Example of Property Graph

Example of property graph
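
A toy property graph, expressed as a Cypher creation statement (node labels and property names are illustrative):

CREATE (o:Object {name: "orders_2021", size: 20000}),
       (p:Project {name: "ABC"}),
       (o)-[:belongsTo {since: 2021}]->(p)

Nodes and edges carry labels (Object, Project, belongsTo) and key-value properties (name, size, since).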

Formal definition:

Formal definition of property graph
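
One common formalization, in the spirit of Angles et al., defines a property graph as a tuple $G = (N, E, \rho, \lambda, \sigma)$ where:

  • $N$ is a finite set of nodes and $E$ a finite set of edges, with $N \cap E = \emptyset$
  • $\rho : E \to N \times N$ maps each edge to its ordered pair of source and target nodes
  • $\lambda : N \cup E \to Lab$ is a function assigning labels to nodes and edges
  • $\sigma : (N \cup E) \times Prop \to Val$ is a partial function assigning values to the properties of nodes and edges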

Traversal Navigation

Graph traversal: “the ability to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description” [Marko Rodriguez]

Totally opposite to set theory (on which relational databases are based)

  • Sets of elements are operated by means of the relational algebra
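
For instance, an arbitrary-depth traversal along provenance links can be expressed directly as a path pattern (the ancestorOf relationship and the id property mirror the queries below):

MATCH path = (:Object {id:123})-[:ancestorOf*1..]->(o:Object)
RETURN o, length(path)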

Return all objects of a given project

MATCH (o:Object)-[]-(:Project {name:"ABC"})
RETURN o

Return small objects with a given name pattern in the landing area

MATCH (o:Object)-[]-(d:DataLakeArea)
WHERE d.name = "Landing" AND o.name STARTS WITH "2021_" AND o.size < 100000
RETURN o

Schema-driven search: return objects that contain information referring to a given Domain

MATCH (o:Object)-[]-(:Schema)-[]-(a:Attribute), (a)-[]-(:Domain {name: "FiscalCode"})
RETURN o

Provenance-driven search

MATCH (obj1:Object)-[:readsFrom]-(o:Operation)-[:writesTo]-(obj2:Object)
CREATE (obj1)-[:ancestorOf]->(obj2)

Discover objects obtained from a given ancestor

MATCH (:Object {id:123})-[:ancestorOf*]-(obj:Object)
RETURN obj

Discover object(s) from which another has originated

MATCH (obj:Object)-[:ancestorOf*]-(:Object {id:123})
RETURN obj

Example: an ML team wants to use datasets that were publicized as canonical for certain domains, but finds these datasets too “groomed” for ML

  • Provenance links can be used to browse upstream and identify the less-groomed datasets that were used to derive the canonical datasets

Similarity-driven search

Discover datasets to be merged in a certain query

MATCH (:Object {id:123})-[r:similarTo]-(o:Object)
WHERE r.similarityType="affinity"
RETURN o

Discover datasets to be joined in a certain query

MATCH (:Object {id:123})-[r:similarTo]-(o:Object)
WHERE r.similarityType="joinability"
RETURN o

Group similar objects and enrich the search results

  • List the main objects from each group
  • Restrict the search to the objects of a single group
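
A rough sketch of the first point, which ranks objects by the size of their affinity neighborhood as a crude proxy for the main object of each group (the value property holding the similarity score is a hypothetical name):

MATCH (o:Object)-[r:similarTo]-(o2:Object)
WHERE r.similarityType = "affinity" AND r.value > 0.8
WITH o, collect(o2) AS grp
RETURN o, size(grp) AS groupSize
ORDER BY groupSize DESC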

Semantics-driven search

MATCH (o:Object)-[:isDescribedBy]-(:OntologyTerm {uri:"http://..."})
RETURN o

Search objects without having any knowledge of their physical or intensional properties, but simply exploiting their traceability to a certain semantic concept

MATCH (o:Object)-[*]-(any), (any)-[:isDescribedBy]-(:OntologyTerm {uri:"http://..."})
RETURN o

Profiling

MATCH (o:Object)-[]-(:OntologyType {name:"Table"}),
      (o)-[]-(s:Schema)-[]-(a:Attribute),
      (o)-[r:similarTo]-(o2:Object),
      (o)-[:ancestorOf]-(o3:Object),
      (o4:Object)-[:ancestorOf]-(o)
RETURN o, s, a, r, o2, o3, o4
  • Shows an object’s properties and lists its relationships with other objects in terms of similarity and provenance
  • Computes a representation of the intensional features that mostly characterize a group of objects (see the slides on schema heterogeneity)

References

Francia, Matteo, Enrico Gallinucci, Matteo Golfarelli, Anna Giulia Leoni, Stefano Rizzi, and Nicola Santolini. 2021. “Making Data Platforms Smarter with MOSES.” Future Generation Computer Systems 125: 299–313. https://doi.org/10.1016/J.FUTURE.2021.06.031.
Hai, Rihan, Sandra Geisler, and Christoph Quix. 2016. “Constance: An Intelligent Data Lake System.” In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, edited by Fatma Özcan, Georgia Koutrika, and Sam Madden, 2097–2100. ACM. https://doi.org/10.1145/2882903.2899389.
Halevy, Alon Y., Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. “Managing Google’s Data Lake: An Overview of the Goods System.” IEEE Data Eng. Bull. 39 (3): 5–14. http://sites.computer.org/debull/A16sept/p5.pdf.
Hellerstein, Joseph M., Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, et al. 2017. “Ground: A Data Context Service.” In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf.
Maccioni, Antonio, and Riccardo Torlone. 2018. “KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake.” In International Conference on Advanced Information Systems Engineering, 474–89. Springer.
Quix, Christoph, Rihan Hai, and Ivan Vatov. 2016. “GEMMS: A Generic and Extensible Metadata Management System for Data Lakes.” In Proceedings of the CAiSE’16 Forum, at the 28th International Conference on Advanced Information Systems Engineering (CAiSE 2016), Ljubljana, Slovenia, June 13-17, 2016, edited by Sergio España, Mirjana Ivanovic, and Milos Savic, 1612:129–36. CEUR Workshop Proceedings. CEUR-WS.org. https://ceur-ws.org/Vol-1612/paper17.pdf.
Sawadogo, Pegdwendé N., Étienne Scholly, Cécile Favre, Éric Ferey, Sabine Loudcher, and Jérôme Darmont. 2019. “Metadata Systems for Data Lakes: Models and Features.” In New Trends in Databases and Information Systems, ADBIS 2019 Short Papers, Workshops BBIGAP, QAUCA, SemBDM, SIMPDA, M2P, MADEISD, and Doctoral Consortium, Bled, Slovenia, September 8-11, 2019, Proceedings, edited by Tatjana Welzer, Johann Eder, Vili Podgorelec, Robert Wrembel, Mirjana Ivanovic, Johann Gamper, Mikolaj Morzy, Theodoros Tzouramanis, Jérôme Darmont, and Aida Kamisalic Latific, 1064:440–51. Communications in Computer and Information Science. Springer. https://doi.org/10.1007/978-3-030-30278-8_43.
Sharma, Ben, and Ashish Thusoo. 2016. “Architecting Data Lakes.” O’Reilly Media, Inc.