Big Data and Cloud Platforms (Module 2)

Towards data platforms

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

How did we get here?

Data-Driven Innovation

  • Use of data and analytics to foster new products, processes, and markets
  • Drive discovery and execution of innovation, achieving new services with a business value

Analytics

  • A catch-all term for different business intelligence (BI) and application-related initiatives
    • E.g., analyzing information from a particular domain
    • E.g., applying BI capabilities to a specific content area (e.g., sales, service, supply chain)

Advanced Analytics

  • (Semi-)Autonomous examination of data to discover deeper insights, make predictions, or generate recommendations (e.g., through data/text mining and machine learning)

Augmented Analytics

  • Use of technologies such as machine learning and AI to assist with data preparation, insight generation, and insight explanation to augment how people explore and analyze data

https://www.gartner.com/en/information-technology/glossary (accessed 2022-08-01)

How did we get here?

From descriptive to prescriptive analytics

Data platform

Companies are collecting tons of data to enable advanced analytics.

  • Raw data is difficult to obtain, interpret, and maintain
  • Data is more and more heterogeneous
  • There is a need for curating data to make it consumable

Where are we collecting/processing data?

  • Getting value from data is not (only) a matter of storage
  • Need integrated and multilevel analytical skills and techniques

Data platform

“It is a capital mistake to theorize before one has data. Insensibly, one begins to twist the facts to suit theories, instead of theories to suit facts.”

– Sherlock Holmes

Getting value from data is not (only) a matter of storage

  • Any example?

Introduction

Managing large photo collections is increasingly challenging.

Common issues include:

  • Difficulty searching and organizing images
  • Loss of contextual information about each photo
  • Time-consuming manual tagging and categorization

Why Metadata Matters

Metadata provides critical information about images:

  • File-level info: name, size, creation date
  • Camera info: make, model, GPS, exposure
  • Content info: objects, people, and scenes

Well-structured metadata enables:

  • Fast search and filtering
  • Smarter organization and categorization
  • Enhanced user experience

Using YOLO for Metadata Extraction

YOLO (You Only Look Once) is a state-of-the-art object detection system.

  • Automatically detects objects in images
  • Provides labels, bounding boxes, and confidence scores
  • Converts visual content into structured metadata

How YOLO Sees the Image

  • Input image is divided into a grid (e.g., 13x13)
  • Each grid cell is responsible for objects whose center falls in the cell
  • Each cell predicts bounding boxes + confidence + class probabilities

What Each Cell Predicts

For each bounding box:

  • Coordinates: (x, y, width, height)
  • Confidence: probability an object exists × IoU with ground truth
  • Class probabilities: likelihood of each object class

How Bounding Boxes Are Calculated

  • Each cell predicts offsets relative to the cell
  • Uses anchor boxes for different object sizes
  • Confidence score = Pr(object) × IOU(predicted box, ground truth)

YOLO Neural Network

  • CNN backbone extracts features from the image
  • Detection layers predict boxes and class probabilities for all grid cells
  • Outputs a single tensor containing all predictions

Post-Processing Predictions

  1. Thresholding: Remove boxes with confidence < threshold
  2. Non-Maximum Suppression (NMS): Remove overlapping boxes that predict the same object

Result: clean set of detected objects with labels, boxes, and confidence
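
A minimal sketch of this post-processing in Python (the tuple-based detection format, the thresholds, and the box coordinates are illustrative assumptions, not tied to any specific YOLO implementation):

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def post_process(detections, conf_threshold=0.5, iou_threshold=0.45):
    """detections: list of (box, confidence, label). Returns the kept detections."""
    # 1. Thresholding: drop low-confidence boxes
    candidates = [d for d in detections if d[1] >= conf_threshold]
    # 2. NMS: keep the most confident box, drop overlapping boxes with the same label
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for det in candidates:
        if all(det[2] != k[2] or iou(det[0], k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept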

YOLO Versions

  • YOLOv1: Original, fast, less accurate for small objects
  • YOLOv2/v3: Anchor boxes, improved accuracy
  • YOLOv4/v5/v8: Modern improvements, faster, real-time capable

Why YOLO is Fast

  • Single forward pass per image → all objects detected at once
  • No sliding windows or region proposals (unlike R-CNN)
  • Real-time capable: 30–150 FPS depending on model

MongoDB for Accessible Metadata

MongoDB stores per-image JSON metadata efficiently:

  • Flexible schema for diverse metadata
  • Enables querying by labels, confidence, dates, or GPS
  • Scalable for large galleries
  • Integrates easily with web apps or dashboards

Making Metadata Accessible

Sample queries:

  • Find all images containing “dog”:
db.images.find({"detections.label": "dog"})
  • Find images taken after a certain date:
db.images.find({"file_info.created_time": {"$gte": "2025-01-01"}})

End of the case study

From databases to data platforms

Data platform

Database (Özsu 2018)

“A database is a structured and persistent collection of information about some aspect of the real world, organized and stored in a way that facilitates efficient retrieval and modification. The structure of a database is determined by an abstract data model. Primarily, it is this structure that differentiates a database from a data file.”

Relational database

Data platform

At the beginning, computer science was seen as a subsidiary discipline that made information management faster and cheaper


  • … but did not create (new) profits in itself

The main goal of databases in companies has been that of storing operational data (Codd 1970)

  • Data generated by operations carried out within business processes

Online transaction processing (OLTP) is a type of database system used in transaction-oriented applications, such as many operational systems.

  • “Online” refers to the fact that such systems are expected to respond to user requests and process them in real-time.
  • Large number of database transactions (writing and reading data) in real-time

Data platform

An exponential increase in operational data has made computers the only tools suitable for decision-making performed by business users

  • The massive use of techniques for analyzing enterprise data made information systems a key factor to achieve business goals

The role of computer science in companies has radically changed since the early 1970s.

  • ICT systems turned from simple tools to improve process efficiency…
  • … into key factors of company organizations, capable of deeply impacting the structure of business processes

Problem: how do we manage data heterogeneity?

Data platform

Schemaless databases

There is no predefined schema that the data must conform to before being added to the database. As a result, you don’t need to know the structure of your data, enabling you to store all your data more easily and quickly.

NoSQL

Problem: how do we transform data into actionable business insights?

Data platform

Big Data must be transformed into Small Data so that it can be exploited for decision-making purposes

  • Small data is data that is “small” enough for human comprehension.
  • It is data in a volume and format that makes it accessible, informative and actionable.

Big vs small data

Data platform

Scenario:

  • Large company, with several branches
  • Managers wish to quantify and evaluate the contribution given from each branch to the global profit

Typical scenario

Data platform

Data Warehouse

Data platform

Evolving generations of data platforms

Data platform

Data Warehouse (DWH) (Golfarelli and Rizzi 2009)

“A collection of data that supports decision-making processes. It provides the following features: subject-oriented, integrated and consistent, not volatile.”

DWH

Data platform

A DWH is a collection of data that supports decision-making processes. It provides the following features:

  • It is subject-oriented;
  • It is integrated and consistent;
  • It shows its evolution over time and it is not volatile

The multidimensional model is the key for representing and querying information in a DW

  • (Business) Facts of interest are represented in (hyper) cubes where:
    • Each cell stores numerical measures that quantify the fact from different points of view;
    • Each axis is a dimension for analyzing measure values;
    • Each dimension can be the root of a hierarchy of attributes used to aggregate measure values

Online analytical processing (OLAP) to answer multi-dimensional analytical queries.

  • Interactive sessions of analysis reading large amounts of data.
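
As a rough intuition, aggregating measures along dimensions can be sketched in pandas with a group-by; a minimal example with made-up data (column names and values are illustrative), which mimics what an OLAP roll-up does on a cube:

import pandas as pd

# A tiny cube: dimensions (city, month) and one measure (revenue)
sales = pd.DataFrame({
    "city":    ["Bologna", "Bologna", "Milan", "Milan"],
    "region":  ["Emilia-Romagna", "Emilia-Romagna", "Lombardy", "Lombardy"],
    "month":   ["2024-01", "2024-02", "2024-01", "2024-02"],
    "revenue": [100, 120, 200, 180],
})

# Roll up from city to region: aggregate the measure at a coarser level
by_region = sales.groupby(["region", "month"], as_index=False)["revenue"].sum()
print(by_region)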

Data platform

Querying the cube with OLAP operators: roll up

Multidimensional cube

Roll up

Data platform

Querying the cube with OLAP operators: drill down

Roll up

Drill down

Data platform

The Dimensional Fact Model (Golfarelli, Maio, and Rizzi 1998b) is a graphical conceptual model for DWH design, devised to:

  • create an environment in which user queries may be formulated intuitively
  • make communication possible between designers and end users with the goal of formalizing requirement specifications
  • build a stable platform for logical design (independently of the target logical model)
  • provide clear and expressive design documentation

DFM

Data platform

DFM

Corresponding Entity-Relationship representation (Golfarelli, Maio, and Rizzi 1998a)

OLTP vs OLAP?

Data platform: OLTP vs OLAP

Characteristic   | OLTP                                             | OLAP
Type of workload | Constant transactions (reads/writes)             | Periodic large writes, interactive large reads
Examples         | Accounting database, online retail transactions  | Reporting, decision support
Type             | Operational data                                 | Business data
Data retention   | Short-term (2-6 months)                          | Long-term (2-5 years)
Storage          | Gigabytes (GB)                                   | Terabytes (TB) / Petabytes (PB)
Users            | Many                                             | Few

Data platform

Evolving generations of data platforms

Data platform

Data lake (Couto et al. 2019)

“A DL is a central repository system for storage, processing, and analysis of raw data, in which the data is kept in its original format and is processed to be queried only when needed. It can store a varied amount of formats in big data ecosystems, from unstructured, semi-structured, to structured data sources.”

Data lake

Data platform

Data lake

Data platform

The data lake started with the Apache Hadoop movement, using the Hadoop File System (HDFS) for cheap storage

  • Schema-on-read architecture
  • Agility of storing any data at low cost
  • Eludes the problems of quality and governance

Downsides of data lakes

  • Security
    • All the data is stored and managed as files
    • No fine-grained access control on the contents of files, but only coarse-grained access governing who can access what files or directories
  • Quality
    • Hard to prevent data corruption and manage schema changes
    • Challenging to ensure atomic operations when writing a group of files
    • No roll-back mechanism
  • Query performance
    • Formats are not optimized for fast access

It is often said that the lake easily turns into a swamp

DWH vs Data Lake?

Data platform: DWH vs Data Lake

                  | Data warehouse                                          | Data lake
Data model        | Relational                                              | Non-relational and relational
Schema            | Designed prior to implementation (schema-on-write)      | Written at the time of analysis (schema-on-read)
Price/performance | Faster queries, higher-cost storage                     | Slower queries, lower-cost storage
Data quality      | Highly curated data, the central version of the truth   | Any (raw) data
Users             | Business analysts                                       | Data scientists, developers, and business analysts
Analytics         | Batch reporting, BI, and visualizations                 | Machine learning, predictive analytics, data discovery, and profiling

Data platform

A two-tier data lake + DWH architecture is dominant in the industry

  • HDFS replaced by cloud data lakes (e.g., S3, ADLS, GCS)
  • Data lake directly accessible to a wide range of analytics engines
  • A subset of data is “ETL-ed” to a DWH for important decision support and BI apps

While the data lake and DWH architecture is ostensibly cheap, a two-tier architecture is highly complex for users

  • Problems?

Data platform

(Some) main problems:

  • Data is first ELTed into lakes, and then again ETLed into DWHs
    • Reliability. Keeping the data lake and DWH consistent is difficult and costly
    • Data staleness. The data in the DWH is stale compared to that of the data lake, with new data frequently taking days to load
  • Limited support for advanced analytics.
    • Businesses want to ask predictive questions using their DWHs
    • Machine learning systems do not work well directly on top of DWHs
  • Process large datasets using complex non-SQL code

Data platform

Evolving generations of data platforms

Data lakehouse (Armbrust et al. 2021)

Data lakehouse

Data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of DWHs, enabling business intelligence (BI) and machine learning (ML) on all data

Key technologies used to implement Data Lakehouses

  • Databricks’ Delta Tables
  • Apache Hudi
  • Apache Iceberg

Data lakehouse

https://databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html

https://www.databricks.com/glossary/data-lakehouse

Data lakehouse

Main features:

  • Store data in a low-cost object store using a standard file format such as Apache Parquet
  • Implement a transactional metadata layer on top of the object store that defines which objects are part of a table version
  • Implement management features within the metadata layer

Challenges:

  • The metadata layer is insufficient to achieve good SQL performance
  • DWHs use several techniques to get state-of-the-art performance
    • Storing hot data on fast devices such as SSDs, maintaining statistics, building efficient indexes, etc.
  • In a Lakehouse it is not possible to change the format, but it is possible to implement other optimizations that leave the data files unchanged

Data lakehouse

Challenges:

  • Most data lakes (e.g., cloud object stores) are simple key-value stores without cross-key consistency.
  • Multi-object updates aren’t atomic, and there’s no isolation between queries.
    • Queries updating multiple objects can expose readers to partial updates as each object is written individually.
  • For large tables with millions of objects, metadata operations are costly.
    • The latency of cloud object stores can make data-skipping checks slower than the query itself.

Delta Lake uses a transaction log compacted into Apache Parquet for much faster metadata operations on large tabular datasets.

  • The log resides in the _delta_log subdirectory.
  • It contains:
    • JSON objects with sequential IDs for log records (i.e., commits)
    • Periodic checkpoints summarizing the log up to that point
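
A minimal sketch of inspecting such a log with plain Python; the table path is hypothetical, and each commit file is assumed to contain newline-delimited JSON actions (e.g., add, remove, commitInfo), as in Delta Lake:

import json
from pathlib import Path

log_dir = Path("/data/my_table/_delta_log")  # hypothetical Delta table location

# Commits are JSON files with zero-padded sequential IDs (e.g., 00000000000000000000.json)
for commit_file in sorted(log_dir.glob("*.json")):
    print(f"--- commit {commit_file.stem} ---")
    # Each line of a commit is one JSON action describing a change to the table
    for line in commit_file.read_text().splitlines():
        action = json.loads(line)
        print(list(action.keys()))  # e.g., ['add'], ['remove'], ['commitInfo']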

Delta lake

Data lakehouse

                      | Data Warehouse                       | Data Lake                              | Data Lakehouse
Data format           | Proprietary format                   | Open format                            | Open format
Types of data         | Structured, limited semi-structured  | All types                              | All types
Data access           | SQL-only, no file access             | File access                            | APIs for SQL, R, Python, etc.
Reliability           | High quality, ACID transactions      | Low quality, “data swamp”              | High quality, ACID transactions
Governance & security | Fine-grained (row/column-level)      | Weak governance; file-level security   | Fine-grained (row/column-level)
Performance           | High                                 | Low                                    | High
Scalability           | Expensive at scale                   | Scales cheaply for any data type       | Scales cheaply for any data type
Use case              | BI                                   | Machine learning                       | Unified for BI and ML

Data platform

Data lakes (or lakehouses) have increasingly taken the role of data hubs

  • Eliminate up-front costs of ingestion and ETL since data are stored in the original format
  • Once in DL, data are available for analysis by everyone in the organization

Drawing a sharp line between storage/computation/analysis is hard

  • Is a database just storage?
  • What about SQL/OLAP?

Blurring of the architectural borderlines

  • DL is often replaced by “data platform” or “data ecosystem”
  • Encompass systems supporting data-intensive storage, computation, analysis

Data platform

Data platform

A unified infrastructure that facilitates the ingestion, storage, management, and exploitation of large volumes of heterogeneous data. It provides a collection of independent and well-integrated services meeting end-to-end data needs.

  • Unified: is conceptually a data backbone
  • Independent: a service is not coupled with any other
  • Well-integrated: services have interfaces that enable easy and frictionless composition
  • End-to-end: services cover the entire data life cycle

Rationale: relieve users from the complexity of administration and provision

  • Not only technological skills, but also privacy, access control, etc.
  • Users should only focus on functional aspects

Data platform

Are we done? No!

  • Lacking smart support to govern the complexity of data and transformations
  • Data transformations must be governed to prevent DP from turning into a swamp
    • Amplified in data science, with data scientists prevailing over data architects
    • Leverage descriptive metadata and maintenance to keep control over data

Which functionalities for (automated) data management can you think about?

Data governance

Managing data platforms

  • Tracking data pipelines: Data provenance
  • Tracking changes: Data versioning
  • Saving storage: Compression
  • Understanding the data: Data profiling
  • Integrating data sources: Entity resolution

Data governance

Data governance

Techniques to ensure that data is secure, private, accurate, available, and usable

  • It includes the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle
  • Every organization needs data governance as industries proceed on their digital-transformation journeys

A data steward is a role that ensures that data governance processes are followed and that guidelines are enforced, and recommends improvements to data governance processes.

Data governance

Case study: data governance

Problem: how can we manage a data lake?

Context:

  • You are working for a consultancy company specialized in data engineering.
    • This is a typical use case for internships or thesis projects in DTM or ISI.
  • Your client is a large enterprise that has accumulated massive amounts of heterogeneous data across departments.
            ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
            │                                                DATA LAKE                                                        │
            └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
            ┌──────────────────────────┐    ┌────────────────────────────┐    ┌───────────────────────────┐   ┌───────────────┐
            │  bucket: dept-a          │    │  bucket: dept-b            │    │  bucket: shared           │   │  bucket: ...  │
            │  (sales)                 │    │  (support)                 │    │  (reference)              │   │               │
            │                          │    │                            │    │                           │   │               │
            │  /customers/             │    │  /customers/               │    │  /geo/                    │   │               │
            │   ├─ customers_part1.csv |    │   ├─ customers_v2.json     │    │   ├─ countries.xlsx       │   │               │
            │   ├─ customers_part2.csv |    │   └─ preferences.csv       │    │   └─ continents.parquet   │   │               │
            │   └─ README.md           │    │                            │    │                           │   │               │
            │                          │    │  /tickets/                 │    │  /products/               │   │               │
            │  /sales/                 │    │   ├─ tickets_2023.csv      │    │   ├─ products_v1.json     │   │               │
            │   ├─ sales_part1.csv     │    │   └─ tickets_2024.csv      │    │   └─ products_v2.json     │   │               │
            │   ├─ sales_part2.csv     │    │                            │    │                           │   │               │
            │   ├─ sales_part3.csv     │    │  /exports/                 │    │  /dictionaries/           │   │               │
            │   └─ sales_part4.csv     │    │   └─ customer_dump.parquet |    |   └─ country_codes.csv    │   |               │
            │                          │    │                            │    │                           │   │               │
            │  /.../                   │    │  /.../                     │    │  /.../                    │   │               │
            │   └─ ...                 │    │   └─ ...                   │    |    └─ ...                 |   |               |  
            │                          │    │                            │    │                           │   │               │
            │  /logs/                  │    │  /logs/                    │    │  /schemas/                │   │               │
            │   └─ ingestion.log       │    │   └─ api_events.log        │    │   └─ inferred_schema.yaml |   │               |
            └──────────────────────────┘    └────────────────────────────┘    └───────────────────────────┘   └───────────────┘

Objectives: assess and organize the data lake to produce unified data assets ready for downstream analytics.

Brainstorming: which problems do you see?

Brainstorming: problems

Data lakes collect data from different sources (e.g., departments)

Files may have:

  • Different formats (CSV, Excel, JSON)
  • Different schemas (column names vary)
  • Duplicated or overlapping concepts

A (table) schema is the blueprint or structure that defines how data is organized (within a table), specifying the names, data types, and constraints (for each column).

Users need to find and integrate relevant data but they don’t know:

  • What is inside each file
  • Which files describe the same entities

… and more

Brainstorming: where do we start?

Step 0: What do we have in the data lake?

ls -R /home/datalake/data/bronze/
/home/datalake/data/bronze/:
department_a  department_b

/home/datalake/data/bronze/department_a:
countries.xlsx  customers  products  sales

/home/datalake/data/bronze/department_a/customers:
customers_part1.csv  customers_part2.csv

/home/datalake/data/bronze/department_a/products:
products_a.json  products_b.json

/home/datalake/data/bronze/department_a/sales:
sales_part1.csv  sales_part2.csv  sales_part3.csv  sales_part4.csv  sales_part5.csv

/home/datalake/data/bronze/department_b:
customers  products  reviews

/home/datalake/data/bronze/department_b/customers:
customers_v2.csv

/home/datalake/data/bronze/department_b/products:
products_c.json

/home/datalake/data/bronze/department_b/reviews:
product_reviews.csv

Step 0: What do we have in the data lake?

Let’s keep it simple and focus only on CSV, Excel, and JSON files

Step 1: Schema Extraction

In CSV files, the schema is in the header (first row)

  • For each file, read the first rows to collect column names

Examples:

head -n1 /home/datalake/data/bronze/department_a/customers/customers_part1.csv
customer_id,first_name,email,country,telephone
head -n1 /home/datalake/data/bronze/department_a/sales/sales_part1.csv
order_key,cust_id,product_id,quantity,date,amount_usd

What about .json and .xlsx files?

Step 1: Schema Extraction

For each file, read the first rows to collect column names

Examples:

head -n1 /home/datalake/data/bronze/department_a/customers/customers_part1.csv
customer_id,first_name,email,country,telephone
head -n1 /home/datalake/data/bronze/department_a/sales/sales_part1.csv 
order_key,cust_id,product_id,quantity,date,amount_usd

Problems:

  • What about .json and .xlsx files?
  • Manual integration is:
    • Time-consuming
    • Error-prone
    • Not scalable
  • Even if we build a Python program (e.g., with Pandas), we have many files to manually analyze
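
A minimal sketch of automating the extraction across the three formats with pandas (paths follow the listing above; JSON files are assumed to be record-oriented, and only a few rows are read per file):

from pathlib import Path
import pandas as pd

def extract_schema(path: Path) -> list[str]:
    """Return the column names of a CSV, Excel, or JSON file."""
    if path.suffix == ".csv":
        df = pd.read_csv(path, nrows=5)   # header plus a few rows are enough
    elif path.suffix == ".xlsx":
        df = pd.read_excel(path, nrows=5)
    elif path.suffix == ".json":
        df = pd.read_json(path)           # assumes a list of records
    else:
        return []
    return list(df.columns)

schemas = {}
for path in Path("/home/datalake/data/bronze").rglob("*"):
    if path.suffix in {".csv", ".xlsx", ".json"}:
        schemas[path.name] = extract_schema(path)

for name, cols in schemas.items():
    print(name, cols)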

Step 2: Clustering by Schema Similarity

We can use (agglomerative) clustering to group files with similar schemas

Example of clustering

How do we define the schema similarity (or distance)?

Example:

  • \(Schema_A = \{`id', `name', `email'\}\)
  • \(Schema_B = \{`id', `fullname', `mail'\}\)

Step 2.a: Jaccard Index (Similarity Measures)

Measures overlap of sets: \(J(Schema_A, Schema_B) = \frac{|Schema_A \cap Schema_B|}{|Schema_A \cup Schema_B|}\)

Example:

  • \(Schema_A = \{`id', `name', `email'\}\)
  • \(Schema_B = \{`id', `fullname', `mail'\}\)
  • \(|Schema_A \cap Schema_B| = |\{`id'\}| = 1\)
  • \(|Schema_A \cup Schema_B| = |\{`id', `name', `email', `fullname', `mail'\}| = 5\)
  • \(J(Schema_A, Schema_B) = \frac{1}{5} = 0.2\)
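
A minimal sketch of this measure in Python:

def jaccard(schema_a: set[str], schema_b: set[str]) -> float:
    """Jaccard index between two schemas (sets of column names)."""
    if not schema_a and not schema_b:
        return 1.0
    return len(schema_a & schema_b) / len(schema_a | schema_b)

schema_a = {"id", "name", "email"}
schema_b = {"id", "fullname", "mail"}
print(jaccard(schema_a, schema_b))  # 0.2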

What do you think about this solution?

Step 2.a: Jaccard Index (Similarity Measures)

                    | countries.xlsx | customers_part1.csv | customers_part2.csv | products_a.json | products_b.json | sales_part1.csv
countries.xlsx      | 1              | 0                   | 0                   | 0               | 0               | 0
customers_part1.csv | 0              | 1                   | 1                   | 0               | 0               | 0
customers_part2.csv | 0              | 1                   | 1                   | 0               | 0               | 0
products_a.json     | 0              | 0                   | 0                   | 1               | 0.625           | 0.1
products_b.json     | 0              | 0                   | 0                   | 0.625           | 1               | 0.0769231

Agglomerative Clustering
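
A minimal sketch of the clustering step with SciPy's hierarchical clustering, using one minus the similarity as the distance; the matrix values reuse those above, and the cut threshold is an illustrative choice:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

files = ["countries.xlsx", "customers_part1.csv", "customers_part2.csv",
         "products_a.json", "products_b.json"]

# Pairwise Jaccard similarities (values from the matrix above); distance = 1 - similarity
similarity = np.array([
    [1, 0, 0, 0,     0    ],
    [0, 1, 1, 0,     0    ],
    [0, 1, 1, 0,     0    ],
    [0, 0, 0, 1,     0.625],
    [0, 0, 0, 0.625, 1    ],
])
distance = 1 - similarity

# Condensed distance matrix -> average-linkage dendrogram -> flat clusters
condensed = squareform(distance, checks=False)
dendrogram = linkage(condensed, method="average")
labels = fcluster(dendrogram, t=0.5, criterion="distance")

for file, label in zip(files, labels):
    print(label, file)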

Step 2.b: Fuzzy Matching (Similarity Measures)

Capture spelling variations in column names

  • String similarity based on edit distance
  • Two attributes match if they differ by only a few characters (i.e., if their similarity is above a threshold \(\alpha\))

Example:

  • \(Schema_A = \{`id', `name', `email'\}\)
  • \(Schema_B = \{`id', `fullname', `mail'\}\)
  • \(|Schema_A \cap Schema_B| = |\{`id', `email'\}| = 2\)
  • \(|Schema_A \cup Schema_B| = |\{`id', `name', `email', `fullname'\}| = 4\)
  • \(J_{fuz}(Schema_A, Schema_B) = \frac{2}{4} = 0.5\)
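
A minimal sketch of a fuzzy Jaccard in Python, using difflib's similarity ratio as a stand-in for a normalized edit distance; the threshold \(\alpha = 0.8\) is an illustrative choice:

from difflib import SequenceMatcher

def similar(a: str, b: str, alpha: float = 0.8) -> bool:
    """Two column names match if their string similarity is above alpha."""
    return SequenceMatcher(None, a, b).ratio() >= alpha

def fuzzy_jaccard(schema_a: set[str], schema_b: set[str], alpha: float = 0.8) -> float:
    # Columns of A that fuzzily match some column of B (and vice versa)
    matched_a = {a for a in schema_a if any(similar(a, b, alpha) for b in schema_b)}
    matched_b = {b for b in schema_b if any(similar(a, b, alpha) for a in schema_a)}
    intersection = len(matched_a)                       # matched pairs, counted on A's side
    union = len(schema_a) + len(schema_b) - len(matched_b)
    return intersection / union if union else 1.0

schema_a = {"id", "name", "email"}
schema_b = {"id", "fullname", "mail"}
print(fuzzy_jaccard(schema_a, schema_b))  # 0.5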

Step 2.b: Clustering by Schema Similarity

What about the computational cost/complexity?

\(\operatorname {lev} (a,b)={\begin{cases}|a|&{\text{ if }}|b|=0,\\|b|&{\text{ if }}|a|=0,\\\operatorname {lev} {\big (}\operatorname {tail} (a),\operatorname {tail} (b){\big )}&{\text{ if }}\operatorname {head} (a)=\operatorname {head} (b),\\1+\min {\begin{cases}\operatorname {lev} {\big (}\operatorname {tail} (a),b{\big )}\\\operatorname {lev} {\big (}a,\operatorname {tail} (b){\big )}\\\operatorname {lev} {\big (}\operatorname {tail} (a),\operatorname {tail} (b){\big )}\\\end{cases}}&{\text{ otherwise}}\end{cases}}\)

The complexity of computing the distance between two strings of length \(m\) and \(n\) is \(\Theta (m \cdot n)\)
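
A minimal dynamic-programming implementation, which fills an (m+1) × (n+1) table and therefore runs in \(\Theta(m \cdot n)\):

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[m][n]

print(levenshtein("email", "mail"))  # 1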

Step 2.b: Clustering by Schema Similarity

Given a metric to measure the distance between two elements of a set, we can use a BK-tree (Burkhard and Keller 1973).

  • All nodes in a subtree have the same distance to the root node
  • The weight of the edge connecting the subtree to the root equals that distance
  • Each subtree of a BK-tree is itself a BK-tree

In the average case, insertion and lookup take \(\Theta (\log n)\)
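
A minimal BK-tree sketch with insertion and range search; it reuses the levenshtein function from the previous sketch as the metric, and the max_dist value is illustrative:

class BKTree:
    """BK-tree over strings, parameterized by a distance metric."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # (word, {edge_weight: subtree})

    def insert(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node, children = self.root
        while True:
            d = self.distance(word, node)
            if d == 0:
                return  # already present
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def search(self, word, max_dist):
        """All stored words within max_dist of word (triangle-inequality pruning)."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node, children = stack.pop()
            d = self.distance(word, node)
            if d <= max_dist:
                results.append(node)
            # Only subtrees with edge weight in [d - max_dist, d + max_dist] can match
            for weight, child in children.items():
                if d - max_dist <= weight <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(levenshtein)  # levenshtein from the previous sketch
for col in ["name", "fullname", "mail", "email", "country"]:
    tree.insert(col)
print(tree.search("emial", max_dist=2))  # e.g., ['email']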

Step 2.c: Embeddings (Similarity Measures)

Fuzzy Matching does not capture semantic similarity between column names

  • We can use transformers to represent words as vectors
  • Capturing semantic meaning, not just spelling

See the demo

Step 2.c: Embeddings (Similarity Measures)

Example of word embeddings

Computing the embeddings:

{
  'telephone': [-0.11484879  0.06089507  0.01468573 ... ],
  'first_name': [-0.11216891  0.04061386 -0.01322238 ... ],
  'customer_id': [-0.06831077  0.03470041 -0.02761879 ... ],
... }

Computing the similarity between embeddings:

sim_cos('country', 'location'): 0.63932264,
sim_cos('location', 'name'): 0.39198124,
sim_cos('name', 'first_name'): 0.6186802

Example:

  • \(Schema_A = \{`id', `name', `email', `country'\}\)
  • \(Schema_B = \{`id', `first\_name', `mail', `location'\}\)
  • \(|Schema_A \cap Schema_B| = |\{`id', `email', `name', `country'\}| = 4\)
  • \(|Schema_A \cup Schema_B| = |\{`id', `email', `name', `country'\}| = 4\)
  • \(J_{emb}(Schema_A, Schema_B) = \frac{4}{4} = 1.0\)
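
A minimal sketch using the sentence-transformers library to embed column names and compare them with cosine similarity; the model name is an assumption, and any sentence-embedding model would do:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight model

columns = ["country", "location", "name", "first_name"]
embeddings = model.encode(columns)               # one vector per column name

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings[0], embeddings[1]))  # country vs location
print(cosine(embeddings[2], embeddings[3]))  # name vs first_name

# Two columns are then considered a match when their cosine similarity exceeds
# a threshold, and the Jaccard index is computed on the matched sets.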

Step 2.c: Embeddings (Similarity Measures)

                    | countries.xlsx | customers_part1.csv | customers_part2.csv | products_a.json | products_b.json | sales_part1.csv
countries.xlsx      | 1              | 0.4                 | 0.4                 | 0               | 0               | 0
customers_part1.csv | 0.2            | 1                   | 1                   | 0.2             | 0.153846        | 0
customers_part2.csv | 0.2            | 1                   | 1                   | 0.2             | 0.153846        | 0
products_a.json     | 0              | 0.2                 | 0.2                 | 1               | 0.625           | 0.3
products_b.json     | 0              | 0.153846            | 0.153846            | 0.625           | 1               | 0.230769

Agglomerative Clustering

Step 3: Merging

Apply unified schema to all files in a cluster

Concatenate into a single governed dataset

Step 3: Problems?

Step 3: Merging

Within each cluster:

  • Detect overlapping columns via hybrid similarity
  • Map to canonical names
  • Collapse multiple variants into a single column

Table 1:

customer_id | first_name | email    | country
1           | Alice      | a@ex.com | US
2           | Bob        | b@ex.com | UK

Table 2:

customer_id | name  | mail       | location
1           | Chuck | c@mail.com | United States
2           | Dan   | d@mail.com | United Kingdom

Unified table:

customer_id | first_name | email      | country
1           | Alice      | a@ex.com   | US
2           | Bob        | b@ex.com   | UK
1           | Chuck      | c@mail.com | United States
2           | Dan        | d@mail.com | United Kingdom
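
A minimal sketch of this merge with pandas; the mapping from variant to canonical column names is assumed to come from the previous matching step:

import pandas as pd

# Hypothetical mapping from variant column names to canonical names
canonical = {"name": "first_name", "mail": "email", "location": "country"}

table_1 = pd.DataFrame({
    "customer_id": [1, 2], "first_name": ["Alice", "Bob"],
    "email": ["a@ex.com", "b@ex.com"], "country": ["US", "UK"],
})
table_2 = pd.DataFrame({
    "customer_id": [1, 2], "name": ["Chuck", "Dan"],
    "mail": ["c@mail.com", "d@mail.com"], "location": ["United States", "United Kingdom"],
})

# Map variants to canonical names, then concatenate into one governed dataset
unified = pd.concat([table_1, table_2.rename(columns=canonical)], ignore_index=True)
print(unified)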

Are we done?

Conclusion

  • Automated semantic clustering and schema unification
  • Provides governed integration in data lakes
  • Enables reliable, auditable, and transparent data pipelines

End of the case study

Data governance: data profiling

Data profiling

Data profiling

Data profiling

Data profiling (Naumann 2014)

  • A broad range of methods to efficiently analyze a given data set
  • E.g., in a relational scenario, tables of a relational database are scanned to derive metadata, such as data types, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules

Characteristics of data profiling

Data profiling: single column (Iris dataset)

Schema and statistics

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
#   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
0   sepal length (cm)  150 non-null    float64
1   sepal width (cm)   150 non-null    float64
2   petal length (cm)  150 non-null    float64
3   petal width (cm)   150 non-null    float64
4   species            150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Data profiling: single columns (Iris dataset)

Statistics

df.describe()
      | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
count | 150               | 150              | 150               | 150
mean  | 5.80917           | 3.06167          | 3.72667           | 1.18333
std   | 0.823805          | 0.449123         | 1.75234           | 0.752289
min   | 4.3               | 2                | 1                 | 0.1
25%   | 5.1               | 2.8              | 1.5               | 0.3
50%   | 5.75              | 3                | 4.25              | 1.3
75%   | 6.4               | 3.4              | 5.1               | 1.8
max   | 7.7               | 4.4              | 6.7               | 2.5

How much metadata are we producing?

  • 8 statistics × 4 columns = 32 metadata values
  • The dataset contains 150 rows
  • Metadata-to-data ratio: \(\frac{32}{150} \approx 21.3\%\)

Data profiling: multiple columns (Iris dataset)

df.corr(method='pearson', numeric_only=True)
                  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
sepal length (cm) | 1                 | -0.106926        | 0.862175          | 0.80148
sepal width (cm)  | -0.106926         | 1                | -0.432089         | -0.369509
petal length (cm) | 0.862175          | -0.432089        | 1                 | 0.962577
petal width (cm)  | 0.80148           | -0.369509        | 0.962577          | 1

Data profiling

The results of data profiling are computationally expensive to discover

  • E.g., discovering keys/dependencies usually involves some sorting step for each considered column

… and we need to verify them for groups of columns in a database

Complexity: given a table with four columns \(\{w, x, y, z\}\), how many groups of columns?

w x y z
1 1 2 2
1 2 1 4

Data profiling

Given a table with four columns \(\{w, x, y, z\}\)

w x y z
1 1 2 2
1 2 1 4

Powerset of four columns
  • To extract the (distinct) cardinality of each column, I will consider \(\binom{|C|}{1}=|C|\) columns \(\{(w), (x), (y), (z)\}\)
  • To extract the correlations between pairs of columns, I will consider \(\binom{|C|}{2}\) groups \(\{(w, x), (w, y), (w, z), (x, y), ...\}\)
  • Extracting the relationships among all possible groups of columns generalizes to \(\sum_{n=1}^{|C|}\binom{|C|}{n}=2^{|C|}-1\) groups
    • 3 columns → 7 groups, 4 columns → 15 groups, 5 columns → 31 groups, 10 columns → 1023 groups, …
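
A minimal sketch enumerating all column groups with itertools, confirming the \(2^{|C|}-1\) count:

from itertools import combinations

columns = ["w", "x", "y", "z"]

groups = [combo
          for size in range(1, len(columns) + 1)
          for combo in combinations(columns, size)]

print(len(groups))   # 15 = 2^4 - 1
print(groups[:6])    # ('w',), ('x',), ('y',), ('z',), ('w', 'x'), ('w', 'y'), ...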

Data profiling

Use cases

  • Query optimization
    • Performed by DBMS to support query optimization with statistics about tables and columns
    • Profiling results can be used to estimate the selectivity of operators and the cost of a query plan
  • Data cleansing (the typical use case of data profiling)
    • Prepare a cleansing process by revealing errors (e.g., in formatting), missing values, or outliers
  • Data integration and analytics

Data profiling: statistic query optimization

Table R

a b
10 20
20 30
30 40

min(R.a) = 10

max(R.a) = 30

No need to access R.a to check a predicate such as R.a >= 10: the statistics already tell us that min(R.a) = 10, so every tuple satisfies it

Data governance: data provenance

Data provenance (Herschel, Diestelkämper, and Ben Lahmar 2017)

Provenance (also referred to as lineage, pedigree, parentage, genealogy)

  • The description of the origins of data and the process by which it arrived at the database
  • Not only data products (e.g., tables, files), but also the processes that created them

Data provenance: astronomers

Examples of use cases (Simmhan et al. 2005)

  • Astronomy changed from being an individualistic to a collective enterprise
  • Telescope time is devoted/allocated to systematic sky surveys, and analysis is performed using data from the archives
  • A federation of astronomical data resources coupled with computational resources needed to exploit the data scientifically
  • Astronomers are increasingly relying on data that they did not take themselves
  • Raw data bear many instrumental signatures that must be removed in the process of generating data products (Mann 2002)

Data provenance

Provenance is a graph, for which we have a standard: PROV-DM (Moreau and Groth 2022)

Entity

  • Physical/conceptual things

Activity

  • Dynamic aspects of the world, such as actions
  • How entities come into existence, often making use of previously existing entities

Agent

  • A person, a piece of software
  • Takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place

PROV standard

Data provenance

Measures of centrality

  • Betweenness centrality (A)
    • Number of shortest paths between pairs of nodes that pass through a given node
  • Closeness centrality (B)
    • Reciprocal of the sum of distances to all other nodes
  • Eigenvector centrality (C)
    • The score of a node is influenced by the scores of adjacent nodes (cf. PageRank)
  • Degree centrality (D)
    • Number of adjacent nodes
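
A minimal sketch computing these measures on a toy provenance graph with NetworkX (the graph itself is illustrative):

import networkx as nx

# Toy provenance graph: datasets/activities as nodes, derivations as edges
g = nx.DiGraph()
g.add_edges_from([
    ("raw_sales.csv", "clean_sales"), ("clean_sales", "sales_table"),
    ("raw_customers.csv", "clean_customers"), ("clean_customers", "sales_table"),
    ("sales_table", "monthly_report"),
])

print(nx.betweenness_centrality(g))                 # (A) nodes on many shortest paths
print(nx.closeness_centrality(g))                   # (B) based on distances to other nodes
print(nx.eigenvector_centrality(g.to_undirected(),  # (C) score depends on neighbors' scores
                                max_iter=1000))
print(nx.degree_centrality(g))                      # (D) number of adjacent nodes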

Centrality measures

Data provenance

Characteristics of data provenance (Simmhan et al. 2005)

Data provenance

Granularity (Simmhan et al. 2005)

  • Fine-grained (instance level): tracking data items (e.g., a tuple in a dataset) transformations
  • Coarse-grained (schema-level): tracking dataset transformations

Data provenance: fine-grained vs coarse-grained

How would you document this table update?

Original data

product_id sales_v1
“p1” 10
“p2” 20
“p3” 30
“p4” 10
“p5” 20
“p6” 30

Update action: sales_v2 = sales_v1 + 1

product_id sales_v2
“p1” 11
“p2” 21
“p3” 31
“p4” 11
“p5” 21
“p6” 31
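
A minimal sketch contrasting the two granularities on this update; the structure of the provenance records is an illustrative choice:

import pandas as pd

sales = pd.DataFrame({"product_id": ["p1", "p2", "p3", "p4", "p5", "p6"],
                      "sales_v1": [10, 20, 30, 10, 20, 30]})
sales["sales_v2"] = sales["sales_v1"] + 1

# Coarse-grained: one record for the whole dataset-level transformation
coarse = {"input": "sales_v1", "output": "sales_v2",
          "transformation": "sales_v2 = sales_v1 + 1"}

# Fine-grained: one record per output tuple, pointing back to the input tuple
fine = [{"output": (row.product_id, row.sales_v2),
         "derived_from": (row.product_id, row.sales_v1),
         "transformation": "+1"}
        for row in sales.itertuples()]

print(coarse)
print(fine[0])  # provenance of the first tuple only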

Data provenance: fine-grained vs coarse-grained

Coarse grained

Fine grained

Data provenance: fine-grained vs coarse-grained

Coarse-grained provenance is appropriate to ensure a broad coverage of heterogeneous transformations involving several applications

Fine-grained provenance is typically used for single vertical applications

  • It requires collecting huge amounts of detailed information to enable very detailed tracing

Choosing a granularity is the result of a trade-off between accuracy and computational effort

  • Storing only the name and the version of a clustering algorithm enables an approximate reproducibility of the results
  • Storing all its parameters makes this functionality much more accurate

Data provenance

Queries (Ikeda and Widom 2009)

  • Where provenance: given some output, which inputs did the output come from?
  • How provenance: given some output, how were the inputs manipulated?
  • Why provenance: given some output, why was data generated?
    • E.g., in the form of a proof tree that locates source data items contributing to its creation

Fine grained

Data provenance

Use cases for data provenance.

  • Accountability and auditing
  • Data quality
    • Monitoring of the quality (e.g., accuracy) of the objects produced
    • Notify when a transformation pipeline is not behaving as expected
  • Debugging
    • Inferring the cause of pipeline failures is challenging
    • Store inputs of each operation with versions and environmental settings (RAM, CPUs, etc.)

Data governance: data versioning

Data versioning

Version control

  • A class of systems responsible for managing changes to computer programs, documents, or data collections
  • Changes are identified by a number/letter code, termed the revision/version number

However, data pipelines are not only about code but also about

  • Model version control
  • Data version control
  • Model parameter tracking
  • Model performance comparison

CRISP-DM

Data versioning

Support CRUD (Create, Read, Update, Delete) operations with versions.

For example, on AWS (PUT, GET, DELETE)

GET

PUT

DELETE

What about updating?
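
A minimal sketch of versioned PUT/GET/DELETE with boto3; the bucket and key names are hypothetical, and versioning is assumed to be enabled on the bucket:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-datalake-bucket", "bronze/customers/customers_part1.csv"  # hypothetical

# PUT: every write creates a new version of the object
s3.put_object(Bucket=bucket, Key=key, Body=b"customer_id,first_name\n1,Alice\n")
s3.put_object(Bucket=bucket, Key=key, Body=b"customer_id,first_name\n1,Alice\n2,Bob\n")

# GET: by default the latest version is returned; older versions stay addressable
versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
old = s3.get_object(Bucket=bucket, Key=key, VersionId=versions[-1]["VersionId"])

# DELETE: inserts a delete marker; previous versions remain recoverable
s3.delete_object(Bucket=bucket, Key=key)

# There is no in-place UPDATE: updating means PUTting a new version of the object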

Data governance: data compression

Compression

Data compression is the process of encoding data using fewer bits than the original representation

  • Lossless: reduces bits by identifying and eliminating redundancy (no information is lost)
    • Examples?
  • Lossy: reduces bits by removing less important information
    • Examples?
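
A minimal sketch contrasting the two in Python, using zlib for lossless compression and a crude rounding of measurements as a stand-in for lossy compression:

import zlib

# Lossless: the original bytes are recovered exactly
text = b"Beer, Diapers, Cola, Beer, Cola, Beer, Diapers" * 100
compressed = zlib.compress(text)
assert zlib.decompress(compressed) == text
print(len(text), "->", len(compressed), "bytes")

# Lossy (illustrative): rounding measurements drops precision that cannot be recovered
readings = [21.4913, 21.4921, 21.4898]
rounded = [round(r, 1) for r in readings]   # smaller to store, information is lost
print(readings, "->", rounded)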

Compression

Data compression

Compression

Summarization: find a concise representation of a dataset in a comprehensible, informative manner

Transaction dataset

TID | Products
T1  | Beer, Diapers, Cola
T2  | Cola, Fanta
T3  | Beer, Cola
T4  | Beer, Diapers

Frequent itemset mining

Itemset 1: {Cola}

Itemset 2: {Beer}

Itemset 3: {Beer, Diapers}
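
A minimal sketch that mines these itemsets by brute-force counting (fine at this toy scale; real miners such as Apriori or FP-Growth prune the search space). The minimum support of 2 is an illustrative choice:

from itertools import combinations

transactions = [
    {"Beer", "Diapers", "Cola"},
    {"Cola", "Fanta"},
    {"Beer", "Cola"},
    {"Beer", "Diapers"},
]
min_support = 2  # an itemset must appear in at least 2 transactions

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        support = sum(set(candidate) <= t for t in transactions)
        if support >= min_support:
            frequent[candidate] = support

print(frequent)  # e.g., ('Beer',): 3, ('Cola',): 3, ('Beer', 'Diapers'): 2, ...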

Characteristics of data summarization (Ahmed 2019)

Data governance: entity resolution

Entity resolution

Entity resolution (Papadakis et al. 2020), also known as entity matching or linking

  • Find records that refer to the same entity across different data sources

Example of intra-source entity resolution

Example of inter-source entity resolution

Data governance: data catalog

Data catalog

A data catalog is an organized inventory of the company’s data (Olesen-Bagneux 2023)

  • The data catalog provides an overview at a metadata level only
  • No actual data values are exposed.
  • Everyone sees everything without fear of exposing confidential or sensitive data.

Data catalog

Data catalog

Metadata is organized into domains.

  • A domain is a group of entities sharing knowledge, goals, methods of operation, and communication

Organization of domains

Data catalog: processes vs capabilities

Processes describe how a company performs its tasks.

Capabilities describe what tasks a company performs.

  • Neither processes nor capabilities reflect the business units of your company 1:1.

Process oriented

Capability oriented

Case study: data governance

Check Colab

Open In Colab

Case study: data catalog

Link to the DataHub catalog

End of the case study

Data governance: meta-metadata management

Data platform

Are we done? No!

  • Metadata can become bigger than the data itself

We need meta-metadata (or meta-models)…

  • … chasing our own tails

Data management is still a (research) issue in data platforms.

How do we build data platforms?

Data fabric

Data fabric is a “vision for data management […] that seamlessly connects different clouds, whether they are private or public environments.” (2016)

Frictionless access and sharing of data in a distributed data environment

  • Enables a single and consistent data management framework
  • Continuously identifies and connects data from disparate applications to discover unique business-relevant relationships

It is a unified architecture with an integrated set of technologies and services.

  • Designed to deliver integrated and enriched data at the right time, in the right method, and to the right data consumer
  • Combines key data management technologies such as data governance, integration, pipelining, and orchestration

Gartner (2021). Data fabric architecture

K2View Whitepaper (2021). What is a Data Fabric? The Complete Guide

Data fabric: example from Cloudera

Cloudera SDX

Data fabric

It is a design concept

  • It optimizes data management by automating repetitive tasks
  • According to Gartner estimates, 25% of data management vendors will provide a complete framework for data fabric by 2024 – up from 5% today

Providers

Providers

Gartner (2021). Data fabric is key to modernizing data management

Data fabric

  • Catalog all data: Cover business glossary, design-time, and runtime metadata.
  • Enable self-service capabilities: Support data discovery, profiling, exploration, quality checks, and consumption.
  • Provide a knowledge graph: Visualize links among data, people, processes, and systems for actionable insight.
  • Deliver automatic integration: Help IT and business users with data integration, transformation, virtualization, and federation.
  • Manage unified lifecycle: Maintain a consistent end-to-end Data Fabric lifecycle across platforms and teams.

Data fabric

Active metadata: leverage open APIs to connect platform services using metadata.

Active metadata is always on

  • Automatically and continually collect metadata from various sources.

Active metadata is “intelligent”:

  • Constantly process metadata to connect the dots and create intelligence.
  • E.g., auto-classify sensitive data, automatic suggestions to document data.

Active metadata is action-oriented

  • Active metadata should drive actions such as curating recommendations

Active metadata is open by default

  • Use APIs to hook into every piece of the data platform.

Data fabric

Gartner (2021). Data fabric is key to modernizing data management

Automation through orchestration

The orchestrator is the component in charge of controlling the execution of computation activities

  • Either through a regular scheduling of the activities
  • Or by triggering a process in response to a certain event

Several entities (either processes or human beings) can cover this role to activate some data processes

Automation through orchestration

Dynamic/condition-based behavior

  • Decide what data process should be activated under different conditions
  • Decide how to tune the parameters in case of parametric data processes

Triggering

  • Decide when to trigger a certain data process

Scoping

  • Assess the trustworthiness of objects to decide if a certain data process should be activated or not

Resource estimation or prediction (based on previous executions and current settings)

  • Decide the optimal amount of resources required to terminate successfully while leaving sufficient resources to the other concurrent processes
  • Negotiate the resources with the cluster’s resource manager

Automation through orchestration

R1 Compute/CPU resource provisioning

  • Determine the right amount of resources
  • Continuous management in a dynamic environment

R2 Storage

  • Choose the right cloud storage resource, data location, and format (if the application is parametric)

R3 Data movement

  • Dynamically transfer large datasets between compute and storage resources

R4 Synchronization and asynchronization

  • Manage data flow dependencies across analytics tasks

Automation through orchestration

R5 Analytic Task Scheduling and Execution

  • Coordinate workflows across diverse big data models
  • Track and capture data provenance

R6 Service Level Agreement

  • Meet user-defined QoS, such as strict deadlines

R7 Security

  • Support anonymous computation, verify outcomes in multi-party settings, etc.

R8 Monitoring and Failure-Tolerance

  • Ensure streamlined execution
  • Handle or predict failures as they occur

Data mesh

Distributed data architecture under centralized governance for interoperability, enabled by a shared and harmonized self-serve data infrastructure

  • Domain-oriented decentralized data ownership
    • Decentralization and distribution of responsibility to people who are closest to the data, to support continuous change and scalability
    • Each domain exposes its own operational/analytical APIs
  • Data as a product (quantum)
    • Products must be discoverable, addressable, trustworthy, self-describing, and secure
  • Self-serve data infrastructure as a platform
    • High-level abstraction of infrastructure to provision and manage the lifecycle of data products
  • Federated computational governance
    • A governance model that embraces decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology, and automated execution of decisions by the platform

Zhamak Dehghani (2019). Data monolith to mesh

Zhamak Dehghani (2020). Data mesh principles

Zhamak Dehghani (2021). Introduction to Data Mesh

Data mesh

Data Mesh organizes data around business domain owners and transforms relevant data assets (data sources) to data products that can be consumed by distributed business users from various business domains or functions.

  • Data products are created, governed, and used in an autonomous, decentralized, and self-service manner
  • Self-service capabilities, which we have already referenced as a Data Fabric capability, enable business organizations to entertain a data marketplace with shopping-for-data characteristics

Data mesh

What makes data a product?

A data product is raw data transformed into a business context

  • Data products are registered in the knowledge catalog through specifications (XML, JSON, etc.).
  • Data products need to be searchable and discoverable by potential data product consumers and business users.
    • Cataloged: All data products need to be registered in the knowledge catalog.
    • Data product description: The data product needs to be well-described.
    • Access methods: for example, REST APIs, SQL, NoSQL, etc., and where to find the data asset.
    • Policies and rules: who is allowed to consume the data product for what purpose.
    • SLAs: agreements regarding the data product availability, performance characteristics, functions, and cost of data product usage.
  • Data products themselves are not stored in the knowledge catalog

Case study: data mesh

Check the seminars from HERA.

End of the case study

Data mesh vs data fabric

They are design concepts, not things.

  • They are not mutually exclusive
  • They are architectural frameworks, not architectures
    • The frameworks must be adapted and customized to your needs, data, processes, and terminology
    • Gartner estimates 25% of data management vendors will provide a complete data fabric solution by 2024 – up from 5% today

Alex Woodie, 2021 https://www.datanami.com/2021/10/25/data-mesh-vs-data-fabric-understanding-the-differences/

Dave Wells, 2021 https://www.eckerson.com/articles/data-architecture-complex-vs-complicated

Data mesh vs data fabric

Both provide an architectural framework to access data across multiple technologies and platforms.

Data fabric

  • Attempts to centralize and coordinate data management
  • Tackles the complexity of data and metadata in a smart way that works well together
  • Focus on the architectural, technical capabilities, and intelligent analysis to produce active metadata supporting a smarter, AI-infused system to orchestrate various data integration styles

Data mesh

  • Emphasis on decentralization and data domain autonomy
  • Focuses on organizational change; it is more about people and process
  • Data are primarily organized around domain owners who create business-focused data products, which can be aggregated and consumed across distributed consumers

Alex Woodie, 2021 https://www.datanami.com/2021/10/25/data-mesh-vs-data-fabric-understanding-the-differences/

Dave Wells, 2021 https://www.eckerson.com/articles/data-architecture-complex-vs-complicated

Data mesh vs data fabric

Data mesh vs data fabric

Data mesh vs data fabric

Data Fabric and Mesh are the results of the data architecture evolution.

  • Many capabilities were in existence already long before the terms were coined

Takeaway:

  • Abstract the “building blocks” of such platforms
  • Let them evolve according to scalability and flexibility requirements

Data mesh

2024

2025

As new technologies and solutions mature to support a centralized approach to data access, distributed approaches like Data Mesh are expected to fall increasingly out of favor in enterprise IT.


Wooclap (examples of exam questions)

(Some) References


Example of data platform: Hadoop-based

A data platform on the Hadoop stack requires several tools.

Hadoop-based data platform

How many levels of complexity are hidden here?

How do you provision it?

  • Manual provisioning on-premises
  • Semi-automatic provisioning on-premises
  • Automatic provisioning in the cloud

“Example” of architecture

https://xkcd.com/2347/

(*) “By now this xkcd is an overused template” (A. Tappi)

On-premises manual provisioning

It is hardly advisable, except for small, local tests.

Technical challenges

  • Installation: how do I set up a new machine?
  • Networking: how do I cable dozens of machines?
  • Management: how do I replace a broken disk?
  • Upgrade: how do I extend the cluster with new services/machines?
  • (energy and cooling, software licenses, insurance…)

Technological challenges

  • How do we configure a distributed environment?
  • How do we set up/integrate/control independent services?
  • How do we orchestrate data flows?

Business challenges

  • Can we afford to spend resources on tasks that are not mission-oriented?
  • No free lunch, each choice has a cost/benefit
  • How much time does it take to master a technology?
  • How many people do I need?

From DevOps…

DevOps combines development and operations to increase the efficiency, speed, and security of software development and delivery compared to traditional processes.

DevOps practices enable software development (dev) and operations (ops) teams to accelerate delivery through automation, collaboration, fast feedback, and iterative improvement.

DevOps

https://about.gitlab.com/topics/devops/ (accessed 2023-06-03)

… to DataOps

DataOps refers to a general process aimed at shortening the end-to-end data analytics life-cycle time by introducing automation in the data collection, validation, and verification process

Evolution of DataOps (Munappy et al. 2020)

Evolution of DataOps

DataOps

From DevOps to DataOps

  • “A collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization”
  • Data analytics improves in terms of velocity, quality, predictability, and scale of software engineering and deployment

Some key rules

  • Establish progress and performance measurements at every stage
  • Automate as many stages of the data flow as possible
  • Establish governance discipline (governance-as-code)
  • Design process for growth and extensibility

DevOps

DataOps

Gartner, 2020 https://www.gartner.com/smarterwithgartner/how-dataops-amplifies-data-and-analytics-business-value

Andy Palmer, 2015 https://www.tamr.com/blog/from-devops-to-dataops-by-andy-palmer/

William Vorhies, 2017 https://www.datasciencecentral.com/profiles/blogs/dataops-it-s-a-secret

Summing up

Storage should be flexible enough to support heterogeneous data models and raw data.

  • From operational databases to DWHs (why?)
  • From relational data models to NoSQL (why?)
  • Data lake to (directly) ingest raw data

Storage, per se, is insufficient to get value from the data (examples?)

  • We also need data processing and fruition
  • Data lakes are blurring into data platforms

Data platforms support end-to-end data needs (which ones?)

  • Building data platforms is hard (why?)
  • Managing data platforms is hard; exploit metadata to ease this task
  • Data lineage, compression, profiling, resolution, etc.

Open question: How do we deploy working data platforms?

How do I know if I am prepared for the exam?

Open a random slide, are you able to explain its content and connect it to the rest of the topics?

References

Ahmed, Mohiuddin. 2019. “Data Summarization: A Survey.” Knowledge and Information Systems 58 (2): 249–73.
Armbrust, Michael, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. “Lakehouse: A New Generation of Open Platforms That Unify Data Warehousing and Advanced Analytics.” In Proceedings of CIDR, 8:28.
Barika, Mutaz, Saurabh Garg, Albert Y Zomaya, Lizhe Wang, Aad Van Moorsel, and Rajiv Ranjan. 2019. “Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions.” ACM Computing Surveys (CSUR) 52 (5): 1–41.
Bruno, Nicolas, and Surajit Chaudhuri. 2002. “Exploiting Statistics on Query Expressions for Optimization.” In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 263–74.
Burkhard, Walter A., and Robert M. Keller. 1973. “Some Approaches to Best-Match File Searching.” Communications of the ACM 16 (4): 230–36.
Codd, E. F. 1970. “A Relational Model of Data for Large Shared Data Banks.” Commun. ACM 13 (6): 377–87. https://doi.org/10.1145/362384.362685.
Couto, Julia, Olimar Teixeira Borges, Duncan D Ruiz, Sabrina Marczak, and Rafael Prikladnicki. 2019. “A Mapping Study about Data Lakes: An Improved Definition and Possible Architectures.” In SEKE, 453–578.
Forresi, Chiara, Enrico Gallinucci, Matteo Golfarelli, and Hamdi Ben Hamadou. 2021. “A Dataspace-Based Framework for OLAP Analyses in a High-Variety Multistore.” The VLDB Journal 30 (6): 1017–40. https://doi.org/10.1007/S00778-021-00682-5.
Golfarelli, Matteo, Dario Maio, and Stefano Rizzi. 1998a. “Conceptual Design of Data Warehouses from E/R Schema.” In Thirty-First Annual Hawaii International Conference on System Sciences, Kohala Coast, Hawaii, USA, January 6-9, 1998, 334–43. IEEE Computer Society. https://doi.org/10.1109/HICSS.1998.649228.
———. 1998b. “The Dimensional Fact Model: A Conceptual Model for Data Warehouses.” Int. J. Cooperative Inf. Syst. 7 (2-3): 215–47. https://doi.org/10.1142/S0218843098000118.
Golfarelli, Matteo, and Stefano Rizzi. 2009. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, Inc.
Herschel, Melanie, Ralf Diestelkämper, and Houssem Ben Lahmar. 2017. “A Survey on Provenance: What for? What Form? What From?” The VLDB Journal 26 (6): 881–906.
Ikeda, Robert, and Jennifer Widom. 2009. “Data Lineage: A Survey.” Stanford InfoLab.
Levenshtein, V. I. 1966. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Soviet Physics Doklady 10 (8).
Mann, Bob. 2002. “Some Data Derivation and Provenance Issues in Astronomy.” In Workshop on Data Derivation and Provenance, Chicago.
Moreau, Luc, and Paul Groth. 2022. Provenance: An Introduction to PROV. Springer Nature.
Munappy, Aiswarya Raj, David Issa Mattos, Jan Bosch, Helena Holmström Olsson, and Anas Dakkak. 2020. “From Ad-Hoc Data Analytics to Dataops.” In Proceedings of the International Conference on Software and System Processes, 165–74.
Naumann, Felix. 2014. “Data Profiling Revisited.” ACM SIGMOD Record 42 (4): 40–49.
Olesen-Bagneux, Ole. 2023. The Enterprise Data Catalog. O’Reilly Media, Inc.
Özsu, M. Tamer. 2018. “Database.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 931–32. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4614-8265-9_80734.
Papadakis, George, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. “Blocking and Filtering Techniques for Entity Resolution: A Survey.” ACM Computing Surveys (CSUR) 53 (2): 1–42.
Simmhan, Yogesh L, Beth Plale, Dennis Gannon, et al. 2005. “A Survey of Data Provenance Techniques.” Computer Science Department, Indiana University, Bloomington IN 47405: 69.