Data Platforms and Artificial Intelligence

Challenges and Applications

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

Metadata Challenges

Smart support to govern the complexity of data and transformations is lacking

Data transformations must be governed to prevent the data platform (DP) from turning into a swamp

  • Amplified in data science, where data scientists prevail over data architects
  • Leverage descriptive metadata and maintenance to keep control over the data

Metadata Challenges

Knowledge representation

  • Which metadata must be captured
  • How should metadata be organized

Knowledge exploitation

  • Which features do metadata enable

Which metadata must be captured?

A classification of metadata (Sharma and Thusoo 2016)

Technical metadata

  • Capture the form and structure of each dataset
  • E.g.: type of data (text, JSON, Avro); structure of the data (the fields and their types)

Operational metadata

  • Capture lineage, quality, profile, and provenance of the data
  • E.g.: source and target locations of data, size, number of records, and lineage

Business metadata

  • Capture what the data mean to the user
  • E.g.: business names, descriptions, tags, quality, and masking rules for privacy
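
For instance, the three kinds of metadata could coexist on a single dataset node in a property-graph metadata store. The following Cypher sketch is purely illustrative (labels and property names are assumptions, not the metamodel of any specific system):

// Technical metadata: form and structure of the dataset
CREATE (o:Object {name: "orders_2021", format: "JSON", fields: ["order_id: int", "amount: float"]})
// Operational metadata: provenance, size, number of records
SET o.sourcePath = "s3://landing/orders/", o.sizeBytes = 1048576, o.records = 20000
// Business metadata: what the dataset means to the user
SET o.description = "Daily orders from the e-commerce channel", o.tags = ["sales", "masked-PII"]
RETURN o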

Another classification of metadata (Sawadogo et al. 2019)

Intra-object metadata

  • Properties provide a general description of an object in the form of key-value pairs
  • Summaries and previews provide an overview of the content or structure of an object
  • Semantic metadata are annotations that help understand the meaning of data

Inter-object metadata

  • Object groupings organize objects into collections; an object can belong to several collections simultaneously
  • Similarity links reflect the strength of the similarity between two objects
  • Parenthood relationships reflect the fact that an object can be the result of joining several others

Global metadata

  • Semantic resources, i.e., knowledge bases (ontologies, taxonomies, thesauri, dictionaries) used to generate other metadata and improve analyses
  • Indexes, i.e., data structures that help find an object quickly
  • Logs, used to track user interactions with the data lake
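
The inter-object metadata above map naturally onto relationships of a graph metadata store. A minimal, purely illustrative Cypher sketch (each statement is standalone; relationship names mirror those used in the queries later in this section):

// Object groupings: an object can belong to several collections
MATCH (o:Object {name: "orders_2021"}), (c:Collection {name: "Sales"})
CREATE (o)-[:belongsTo]->(c)

// Similarity links carry the type and strength of the similarity
MATCH (o1:Object {id: 1}), (o2:Object {id: 2})
CREATE (o1)-[:similarTo {similarityType: "affinity", value: 0.87}]->(o2)

// Parenthood: an object derived from another one
MATCH (parent:Object {id: 1}), (child:Object {id: 3})
CREATE (parent)-[:ancestorOf]->(child)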

Capturing the metadata

Pull strategy

  • The system actively collects new metadata
  • Requires scheduling: when does the system activate itself?
    • Event-based (CRUD)
    • Time-based
  • Requires wrappers: what does the system capture?
    • Based on data type and/or application
    • Comprehensive monitoring is practically unfeasible

Push strategy

  • The system passively receives new metadata
  • Requires an API layer
  • Mandatory for operational metadata

Still one of the main issues in data platforms!
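
As an example of the push strategy, a transformation job could send its own operational metadata to the metadata store at the end of its run. A minimal sketch, assuming a graph-based store and hypothetical parameter names ($sourceName, $targetName, $start, $end, $records):

// Pushed by the job itself through the metadata API layer
MERGE (src:Object {name: $sourceName})
MERGE (tgt:Object {name: $targetName})
CREATE (op:Operation {type: "batch-transformation", startedAt: $start, endedAt: $end, records: $records})
CREATE (op)-[:readsFrom]->(src)
CREATE (op)-[:writesTo]->(tgt)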

How should metadata be organized?

Semantic enrichment

  • Add contextual descriptions (e.g., tags) for easier interpretation.

Data indexing

  • Use structures to retrieve datasets by characteristics like keywords.

Link generation and conservation

  • Detect similarities or maintain links between datasets.

Data polymorphism

  • Store multiple representations of the same data to avoid repeated pre-processing.

Data versioning

  • Track data changes while preserving previous states.

Usage tracking

  • Record user interactions with the data.
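
A few of these features, sketched as standalone Cypher statements (assuming a Neo4j-like store; the Tag and LogEntry nodes and the relationship types are illustrative):

// Semantic enrichment: attach a contextual tag to an object
MATCH (o:Object {name: "orders_2021"})
MERGE (t:Tag {name: "sales"})
CREATE (o)-[:taggedWith]->(t)

// Data indexing: index the property used for keyword lookups
CREATE INDEX object_name IF NOT EXISTS FOR (o:Object) ON (o.name)

// Usage tracking: log a user interaction with an object
MATCH (o:Object {name: "orders_2021"})
CREATE (:LogEntry {user: "alice", action: "read", at: datetime()})-[:refersTo]->(o)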

Constance: (Hai, Geisler, and Quix 2016)

  • Few details given on metamodel and functionalities.
  • No metadata collected on operations.

GEMMS: (Quix, Hai, and Vatov 2016)

  • No discussion about the functionalities provided.
  • No metadata collected on operations and agents.

GOODS: (Halevy et al. 2016)

  • Crawls Google’s storage systems to extract basic metadata on datasets and their relationships with other datasets.
  • Performs metadata inference (e.g., determine the schema of a dataset, trace the provenance of data, or annotate data with their semantics).
  • Strictly coupled with the Google platform.
  • Mainly focuses on object description and searches.
  • No formal description of the metamodel.

Ground: (Hellerstein et al. 2017)

  • Version graphs represent data versions.
  • Model graphs represent application metadata, i.e., how data are interpreted for use.
  • Lineage graphs capture usage information.
  • Not enough details given to clarify which metadata are actually handled.
  • Functionalities are described at a high level.

KAYAK: (Maccioni and Torlone 2018)

  • Supports users in creating and optimizing data processing pipelines.
  • Only goal-related metadata are collected.

MOSES

Three areas:

  • Technical
  • Operational
  • Business

  • Not pre-defined
  • Domain-independent
  • Extensible

Tune the trade-off between the level of detail of the functionalities and the required computational effort

MOSES: (Francia et al. 2021)

Functionality         Supported
Semantic enrichment   Yes
Data indexing         No
Link generation       Yes
Data polymorphism     Yes
Data versioning       Yes
Usage tracking        Yes

How would you implement the meta-model?

The Property Graph Data Model

Property graphs are back in the database community spotlight

  • Meant to be queried and processed
  • THERE IS NO STANDARD!

R. Angles et al. Foundations of Modern Query Languages for Graph Databases

Example of Property Graph

Example of property graph
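
A toy property graph, expressed as a Cypher creation statement (node labels and property names are illustrative):

CREATE (o:Object {name: "orders_2021", size: 20000}),
       (p:Project {name: "ABC"}),
       (o)-[:belongsTo {since: 2021}]->(p)

Nodes and edges carry labels (Object, Project, belongsTo) and key-value properties (name, size, since).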

Formal definition:

Formal definition of property graph
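
One common formalization, in the spirit of Angles et al., defines a property graph as a tuple $G = (N, E, \rho, \lambda, \sigma)$ where:

  • $N$ is a finite set of nodes and $E$ a finite set of edges, with $N \cap E = \emptyset$
  • $\rho : E \to N \times N$ maps each edge to its ordered pair of source and target nodes
  • $\lambda : N \cup E \to Lab$ is a function assigning labels to nodes and edges
  • $\sigma : (N \cup E) \times Prop \to Val$ is a partial function assigning values to the properties of nodes and edges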

Traversal Navigation

Graph traversal: “the ability to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description” [Marko Rodriguez]

Totally opposite to set theory (on which relational databases are based)

  • Sets of elements are operated by means of the relational algebra
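
For instance, an arbitrary-depth traversal along provenance links can be expressed directly as a path pattern (the ancestorOf relationship and the id property mirror the queries below):

MATCH path = (:Object {id:123})-[:ancestorOf*1..]->(o:Object)
RETURN o, length(path)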

Return all objects of a given project

MATCH (o:Object)-[]-(:Project {name:"ABC"})
RETURN o

Return small objects with a given name pattern in the landing area

MATCH (o:Object)-[]-(d:DataLakeArea)
WHERE d.name = "Landing" AND o.name STARTS WITH "2021_" AND o.size < 100000
RETURN o

Schema-driven search: return objects that contain information referring to a given Domain

MATCH (o:Object)-[]-(:Schema)-[]-(a:Attribute), (a)-[]-(:Domain {name: "FiscalCode"})
RETURN o

Provenance-driven search

MATCH (obj1:Object)-[:readsFrom]-(o:Operation)-[:writesTo]-(obj2:Object)
CREATE (obj1)-[:ancestorOf]->(obj2)

Discover objects obtained from a given ancestor

MATCH (:Object {id:123})-[:ancestorOf*]-(obj:Object)
RETURN obj

Discover object(s) from which another has originated

MATCH (obj:Object)-[:ancestorOf*]-(:Object {id:123})
RETURN obj

Example: an ML team wants to use datasets that were publicized as canonical for certain domains, but finds these datasets too “groomed” for ML

  • Provenance links can be used to browse upstream and identify the less-groomed datasets that were used to derive the canonical datasets

Similarity-driven search

Discover datasets to be merged in a certain query

MATCH (:Object {id:123})-[r:similarTo]-(o:Object)
WHERE r.similarityType="affinity"
RETURN o

Discover datasets to be joined in a certain query

MATCH (:Object {id:123})-[r:similarTo]-(o:Object)
WHERE r.similarityType="joinability"
RETURN o

Group similar objects and enrich the search results

  • List the main objects from each group
  • Restrict the search to the objects of a single group
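
A rough sketch of the first point, which ranks objects by the size of their affinity neighborhood as a crude proxy for the main object of each group (the value property holding the similarity score is a hypothetical name):

MATCH (o:Object)-[r:similarTo]-(o2:Object)
WHERE r.similarityType = "affinity" AND r.value > 0.8
WITH o, collect(o2) AS grp
RETURN o, size(grp) AS groupSize
ORDER BY groupSize DESC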

Semantics-driven search

MATCH (o:Object)-[:isDescribedBy]-(:OntologyTerm {uri:"http://..."})
RETURN o

Search objects without having any knowledge of their physical or intensional properties, but simply exploiting their traceability to a certain semantic concept

MATCH (o:Object)-[*]-(any), (any)-[:isDescribedBy]-(:OntologyTerm {uri:"http://..."})
RETURN o

Profiling

MATCH (o:Object)-[]-(:OntologyType {name:"Table"}),
      (o)-[]-(s:Schema)-[]-(a:Attribute),
      (o)-[r:similarTo]-(o2:Object),
      (o)-[:ancestorOf]-(o3:Object),
      (o4:Object)-[:ancestorOf]-(o)
RETURN o, s, a, r, o2, o3, o4
  • Shows an object’s properties and lists its relationships with other objects in terms of similarity and provenance
  • Computes a representation of the intensional features that mostly characterize a group of objects (see the slides on schema heterogeneity)

References

Francia, Matteo, Enrico Gallinucci, Matteo Golfarelli, Anna Giulia Leoni, Stefano Rizzi, and Nicola Santolini. 2021. “Making Data Platforms Smarter with MOSES.” Future Generation Computer Systems 125: 299–313. https://doi.org/10.1016/J.FUTURE.2021.06.031.
Hai, Rihan, Sandra Geisler, and Christoph Quix. 2016. “Constance: An Intelligent Data Lake System.” In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, edited by Fatma Özcan, Georgia Koutrika, and Sam Madden, 2097–2100. ACM. https://doi.org/10.1145/2882903.2899389.
Halevy, Alon Y., Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. “Managing Google’s Data Lake: An Overview of the Goods System.” IEEE Data Eng. Bull. 39 (3): 5–14. http://sites.computer.org/debull/A16sept/p5.pdf.
Hellerstein, Joseph M., Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, et al. 2017. “Ground: A Data Context Service.” In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf.
Maccioni, Antonio, and Riccardo Torlone. 2018. “KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake.” In International Conference on Advanced Information Systems Engineering, 474–89. Springer.
Quix, Christoph, Rihan Hai, and Ivan Vatov. 2016. “GEMMS: A Generic and Extensible Metadata Management System for Data Lakes.” In Proceedings of the CAiSE’16 Forum, at the 28th International Conference on Advanced Information Systems Engineering (CAiSE 2016), Ljubljana, Slovenia, June 13-17, 2016, edited by Sergio España, Mirjana Ivanovic, and Milos Savic, 1612:129–36. CEUR Workshop Proceedings. CEUR-WS.org. https://ceur-ws.org/Vol-1612/paper17.pdf.
Sawadogo, Pegdwendé N., Étienne Scholly, Cécile Favre, Éric Ferey, Sabine Loudcher, and Jérôme Darmont. 2019. “Metadata Systems for Data Lakes: Models and Features.” In New Trends in Databases and Information Systems, ADBIS 2019 Short Papers, Workshops BBIGAP, QAUCA, SemBDM, SIMPDA, M2P, MADEISD, and Doctoral Consortium, Bled, Slovenia, September 8-11, 2019, Proceedings, edited by Tatjana Welzer, Johann Eder, Vili Podgorelec, Robert Wrembel, Mirjana Ivanovic, Johann Gamper, Mikolaj Morzy, Theodoros Tzouramanis, Jérôme Darmont, and Aida Kamisalic Latific, 1064:440–51. Communications in Computer and Information Science. Springer. https://doi.org/10.1007/978-3-030-30278-8_43.
Sharma, Ben, and Ashish Thusoo. 2016. “Architecting Data Lakes.” O’Reilly Media, Inc.