Data pipelines on cloud (Computing)
Computing
We can choose the XaaS configuration to build our pipelines

Amazon Elastic Compute Cloud (EC2)
The instance type determines the hardware
An Amazon Machine Image (AMI) is a software template
Interact with an EC2 instance as with any other computer
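As a small illustration of how the instance type encodes the hardware choice, the sketch below splits a type name such as m4.xlarge into its parts. The regular expression and the field names are assumptions made for this example; they cover the common naming pattern only, not every type AWS offers.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str      # e.g. "m" (general purpose)
    generation: int  # e.g. 4
    attributes: str  # optional extra letters, e.g. "a" in "m5a"
    size: str        # e.g. "xlarge"


# Simplified pattern (an assumption): letters, digits, optional letters, ".", size.
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split an instance type name into family, generation, attributes, size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


parsed = parse_instance_type("m4.xlarge")
print(parsed)
```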




Amazon EMR is a data platform based on the Hadoop stack
Example of workload
EMR cluster
Master group
Core groups
(Optional) Task instances
The central component of Amazon EMR is the cluster
The node type identifies the role within the cluster
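The three node types can be summarized in a small lookup table. This is a simplified sketch: the class and field names are hypothetical, and the roles are reduced to the essentials (the master coordinates the cluster, core nodes run tasks and store HDFS data, task nodes only run tasks).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class NodeType(Enum):
    MASTER = "MASTER"
    CORE = "CORE"
    TASK = "TASK"


@dataclass(frozen=True)
class NodeRole:
    coordinates_cluster: bool
    stores_hdfs_data: bool
    runs_tasks: bool
    optional: bool


# Simplified view of each role (illustrative, not an AWS data structure).
ROLES: Dict[NodeType, NodeRole] = {
    NodeType.MASTER: NodeRole(True, False, False, optional=False),
    NodeType.CORE: NodeRole(False, True, True, optional=False),
    NodeType.TASK: NodeRole(False, False, True, optional=True),
}


def hdfs_node_count(node_counts: Dict[NodeType, int]) -> int:
    """Count the nodes that hold HDFS data for a given cluster layout."""
    return sum(
        count for node_type, count in node_counts.items()
        if ROLES[node_type].stores_hdfs_data
    )


layout = {NodeType.MASTER: 1, NodeType.CORE: 2, NodeType.TASK: 3}
print(hdfs_node_count(layout))
```

Under this model, adding or removing task nodes does not change the HDFS capacity, which is why they are the natural candidates for elastic scaling.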
On-Demand Instance
Spot Instance
Spot Instance cost strategies
Capacity-optimized strategy
Lowest-price strategy
Choose whether to launch master, core, or task nodes on Spot Instances
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
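The difference between the two Spot strategies can be sketched as two selection rules over a set of capacity pools. The pool names, prices, and spare-capacity figures below are invented for illustration; AWS does not expose spare capacity as a number, so this only conveys the intuition.

```python
from typing import List, NamedTuple


class SpotPool(NamedTuple):
    name: str              # hypothetical pool label (e.g. an availability zone)
    price_per_hour: float  # made-up Spot price in USD
    spare_capacity: int    # made-up measure of unused capacity


def lowest_price(pools: List[SpotPool]) -> SpotPool:
    """Lowest-price strategy: pick the cheapest pool."""
    return min(pools, key=lambda pool: pool.price_per_hour)


def capacity_optimized(pools: List[SpotPool]) -> SpotPool:
    """Capacity-optimized strategy: pick the pool with the most spare
    capacity, i.e. the one least likely to be interrupted."""
    return max(pools, key=lambda pool: pool.spare_capacity)


pools = [
    SpotPool("us-east-1a", 0.060, 120),
    SpotPool("us-east-1b", 0.045, 15),
    SpotPool("us-east-1c", 0.052, 300),
]
print(lowest_price(pools).name, capacity_optimized(pools).name)
```

With these invented figures the cheapest pool is also the one with the least spare capacity, which is the trade-off between the two strategies: lower cost against a lower risk of interruption.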
Amazon EMR provides two main file systems
hdfs://path (or just path)
s3://DOC-EXAMPLE-BUCKET1/path (EMRFS)
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html
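The path prefix selects the file system. A minimal sketch of that rule, assuming only the two schemes listed above (the function name is hypothetical):

```python
from urllib.parse import urlparse


def file_system_for(path: str) -> str:
    """Return which EMR file system a path refers to, based on its scheme."""
    scheme = urlparse(path).scheme
    if scheme in ("", "hdfs"):
        # "hdfs://path" or just "path": data stored on the cluster itself
        return "HDFS"
    if scheme == "s3":
        # "s3://bucket/path": data stored in Amazon S3, accessed through EMRFS
        return "EMRFS"
    raise ValueError(f"scheme not covered by this sketch: {scheme!r}")


for example in ("hdfs://path", "path", "s3://DOC-EXAMPLE-BUCKET1/path"):
    print(example, "->", file_system_for(example))
```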
Software and steps
Hardware
General cluster settings
Security
Infrastructure as code: using CLI (command line interface)
aws emr create-cluster \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --termination-protected \
  --applications Name=Hadoop Name=Hive Name=Hue Name=JupyterEnterpriseGateway Name=Spark \
  --ebs-root-volume-size 10 \
  --ec2-attributes '{"KeyName":"bigdata","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-5fa2f912","EmrManagedSlaveSecurityGroup":"sg-07818b5690a50b3f1","EmrManagedMasterSecurityGroup":"sg-0e2f5550a2cb98f79"}' \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --release-label emr-6.2.0 \
  --log-uri 's3n://aws-logs-604905954159-us-east-1/elasticmapreduce/' \
  --name 'BigData' \
  --instance-groups '[
    {
      "InstanceCount": 1,
      "BidPrice": "OnDemandPrice",
      "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [{
          "VolumeSpecification": { "SizeInGB": 32, "VolumeType": "gp2" },
          "VolumesPerInstance": 2
        }]
      },
      "InstanceGroupType": "MASTER",
      "InstanceType": "m4.xlarge",
      "Name": "Master - 1"
    },
    {
      "InstanceCount": 1,
      "BidPrice": "OnDemandPrice",
      "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [{
          "VolumeSpecification": { "SizeInGB": 32, "VolumeType": "gp2" },
          "VolumesPerInstance": 2
        }]
      },
      "InstanceGroupType": "CORE",
      "InstanceType": "m4.xlarge",
      "Name": "Core - 2"
    }
  ]' \
  --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
  --region us-east-1

Creating a cluster (it takes ~10 minutes)

STARTING: EMR provisions an EC2 instance for each node of the cluster
BOOTSTRAPPING: EMR runs actions that you specify on each instance
RUNNING: a step for the cluster is currently being run
WAITING: after steps run successfully
TERMINATING: after manual shutdown
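The cluster states listed above can be read as a small state machine. The transition table below is an assumption inferred from those descriptions (for example, that a waiting cluster returns to RUNNING when a new step arrives); it omits other states that EMR reports.

```python
from typing import Dict, List, Set

# Assumed transitions between the five states listed above (simplified).
TRANSITIONS: Dict[str, Set[str]] = {
    "STARTING": {"BOOTSTRAPPING", "TERMINATING"},
    "BOOTSTRAPPING": {"RUNNING", "WAITING", "TERMINATING"},
    "RUNNING": {"WAITING", "TERMINATING"},
    "WAITING": {"RUNNING", "TERMINATING"},
    "TERMINATING": set(),
}


def is_valid_lifecycle(states: List[str]) -> bool:
    """Check that a sequence of states starts at STARTING and only
    follows transitions allowed by the table above."""
    if not states or states[0] != "STARTING":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


typical = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING", "TERMINATING"]
print(is_valid_lifecycle(typical))
```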
A step is a user-defined unit of processing
Step states
PENDING: The step is waiting to be run
RUNNING: The step is currently running
COMPLETED: The step completed successfully
CANCELLED: The step was canceled before running because an earlier step failed
FAILED: The step failed while running

Matteo Francia - Big Data and Cloud Platforms (Module 2) - A.Y. 2025/26
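The step states can be illustrated with a local simulation of a sequence of steps. This is not the EMR API: the `run_steps` function and the step names are hypothetical, and the sketch assumes that a failure cancels every step still pending, as described for the CANCELLED state.

```python
from typing import Callable, Dict, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run steps in order and return the final state of each one."""
    states = {name: "PENDING" for name, _ in steps}
    failed = False
    for name, action in steps:
        if failed:
            # An earlier step failed: this step is canceled before running.
            states[name] = "CANCELLED"
            continue
        states[name] = "RUNNING"
        try:
            action()
            states[name] = "COMPLETED"
        except Exception:
            states[name] = "FAILED"
            failed = True
    return states


def failing_step() -> None:
    raise RuntimeError("simulated failure")


result = run_steps([
    ("load", lambda: None),
    ("transform", failing_step),
    ("export", lambda: None),
])
print(result)
```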