ETL Options in AWS
Enterprises and organizations all want to extract value from their data. To achieve this, the data usually has to be transformed, loaded into a data warehouse or another modern data platform, and then analyzed for insights. In this article, I want to discuss the AWS services that cater to ETL needs, focusing on the core ETL services: EMR, Glue and Lambda.
EMR (Elastic MapReduce)
- EMR is a managed service for processing large amounts of data using big data frameworks like Hadoop, Pig, Hive, Spark and Flink, as well as machine learning frameworks such as TensorFlow and MXNet.
- There are two patterns here: persistent clusters, where an EMR cluster stays up and running for long periods, and transient clusters, where we bring up the cluster, run the load and shut it down. Some use cases for each:
  - Transient clusters: nightly loads or loads to a DWH, churning through large volumes of data as part of a daily batch process, or a machine learning job.
  - Persistent clusters: streaming jobs, machine learning notebooks, or a platform running many jobs throughout the day that load data lakes and the DWH.
- To orchestrate the dependencies between jobs we can use AWS Step Functions and Livy. Other schedulers and orchestration tools, such as Airflow or Luigi running on EC2, can also be used.
- We can perform both batch and stream processing and also run machine learning models in EMR.
- It can read from and write to DynamoDB, RDS, Elasticsearch, Redshift, Kinesis and S3.
- We can create Jupyter notebooks in which to perform data cleaning, transformations and machine learning modelling, and share them.
- It can use spot instances for core and task nodes, and can auto scale the nodes to handle spikes in load.
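The transient pattern described above can be sketched as a request for boto3's EMR `run_job_flow` call. This is a minimal sketch under stated assumptions, not a production configuration: the helper name `transient_emr_request`, the release label, the instance types, the step name and the default role names are all illustrative choices that do not come from this article. The part that makes the cluster transient is `KeepJobFlowAliveWhenNoSteps=False`, which lets the cluster terminate once its steps finish.

```python
def transient_emr_request(name, script_s3_path, log_s3_path,
                          core_count=2, release_label="emr-6.15.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, instance types, role names)
    are assumptions for illustration; adjust them to your account.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_s3_path,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                # Keep the master node on-demand so the cluster is stable.
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                # Core nodes on spot instances to reduce cost.
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": core_count},
            ],
            # Transient cluster: shut down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "nightly-load",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_emr_request(
    "nightly-dwh-load",
    "s3://example-bucket/jobs/load.py",   # hypothetical script path
    "s3://example-bucket/emr-logs/",      # hypothetical log path
)
```

With AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. For a persistent cluster, the same request with `KeepJobFlowAliveWhenNoSteps` set to `True` would keep the cluster running after the step completes.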
Glue
- Glue is a fully managed, serverless ETL service.
- At a high level it has two components: the Data Catalog and ETL.
- Glue crawlers crawl the data and build the Data Catalog, exposing data in S3 as tables that can be queried, for example with Athena.
- Glue ETL jobs can run Spark code (Python or Scala) as well as Python shell scripts.
- Glue ETL can read from and write to S3, Aurora, RDS and Redshift, and can also reach on-prem databases via JDBC (in which case the job has to run in a VPC).
- Glue Workflows can orchestrate and schedule Glue ETL jobs and crawlers.
- We can create a development endpoint, which creates an EMR cluster under the hood, and run Glue scripts against it during active development.
- It also provides SageMaker notebooks and Zeppelin notebooks.
- We can perform both batch and stream processing with Glue ETL.
- It does not provide auto scaling yet.
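To make the job types above concrete, here is a minimal sketch that builds the arguments for boto3's Glue `create_job` call. The helper `glue_job_request`, the worker type, the worker count and the capacity value are assumptions chosen for illustration. The `Command` names `glueetl`, `gluestreaming` and `pythonshell` are the values Glue uses for Spark, Spark streaming and Python shell jobs, and `Connections` is where a JDBC connection (and with it the VPC settings) is attached.

```python
# Maps a friendly job type to the Command name Glue expects.
COMMAND_NAMES = {
    "spark": "glueetl",
    "streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def glue_job_request(name, role_arn, script_s3_path,
                     job_type="spark", connection_name=None):
    """Build keyword arguments for boto3's Glue create_job call.

    Sizing values below are illustrative assumptions, not recommendations.
    """
    if job_type not in COMMAND_NAMES:
        raise ValueError(f"unknown job type: {job_type}")

    request = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": COMMAND_NAMES[job_type],
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
    }
    if job_type == "python_shell":
        # Python shell jobs are sized in DPUs rather than workers.
        request["MaxCapacity"] = 0.0625
    else:
        request["WorkerType"] = "G.1X"
        request["NumberOfWorkers"] = 2
    if connection_name:
        # A Glue connection carries the JDBC URL and VPC settings.
        request["Connections"] = {"Connections": [connection_name]}
    return request


spark_job = glue_job_request(
    "orders-to-redshift",
    "arn:aws:iam::123456789012:role/example-glue-role",  # hypothetical role
    "s3://example-bucket/scripts/orders.py",             # hypothetical script
    connection_name="example-onprem-jdbc",               # hypothetical connection
)
```

With credentials configured, the job could be created with `boto3.client("glue").create_job(**spark_job)`.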
Lambda
- Lambda is serverless and runs a piece of code in a stateless fashion.
- It can be triggered by many event sources, such as an API, S3 events, SNS and SQS.
- It has many use cases, such as microservices for web apps and backends, chatbots, Alexa skills and stream processing; data processing is one of them.
- We can write Lambda code in Python, Java, Node.js, C# and Go, and can also bundle custom libraries.
- A Lambda function needs a memory configuration, which can range from 128 MB to 3 GB, and a maximum run time (timeout), which can currently be up to 15 minutes. It supports concurrent executions and is highly scalable.
- We can use Lambda for lightweight ETL, batch or streaming, that needs little memory (< 3 GB) and completes in a short time (< 15 minutes). Typical ETL use cases are moving a file from one S3 bucket to another, or reading a file and storing it in DynamoDB or RDS with minimal transformation.
- The cost of Lambda depends on three factors: the number of executions, the memory allocated and the run time. With respect to run time, we are charged only for the time the function actually runs, not for the maximum run time we configure.
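Two of the points above lend themselves to a short sketch: the "move a file from one S3 bucket to another" use case, and the three-factor cost model. The code below is illustrative only. `DEST_BUCKET`, the optional `s3_client` parameter (added so the handler can be exercised without AWS access) and `estimate_cost` are hypothetical names, and the prices in the usage line are placeholders rather than actual AWS rates.

```python
from urllib.parse import unquote_plus

DEST_BUCKET = "example-dest-bucket"  # hypothetical destination bucket


def handler(event, context, s3_client=None):
    """Copy every object named in an S3 event into DEST_BUCKET."""
    if s3_client is None:
        import boto3  # available by default in the Lambda Python runtime
        s3_client = boto3.client("s3")

    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        copied.append(key)
    return {"copied": copied}


def estimate_cost(invocations, memory_mb, avg_duration_s,
                  price_per_request, price_per_gb_second):
    """Combine the three cost factors: executions, memory and run time.

    Prices are passed in because they vary by region and over time.
    """
    request_cost = invocations * price_per_request
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return request_cost + gb_seconds * price_per_gb_second


# Placeholder prices, for illustration only.
monthly = estimate_cost(1000, 1024, 2.0,
                        price_per_request=0.001,
                        price_per_gb_second=0.01)
```

Because the duration term uses the average time a function actually runs, raising the configured timeout does not change the estimate, which matches the billing behaviour described above.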
Final thoughts
Use the right tool for the right job. Lambdas are great for lightweight ETL tasks and integrate well with services like SQS, SNS and Step Functions. They can be very cost effective as well. If you already have a persistent EMR cluster in the environment, any new medium-to-heavy ETL workloads should leverage it. If you are starting fresh and don't have an EMR presence in the environment, Glue is the best tool. Right now the only disadvantage I see with Glue is the absence of an auto scaling feature, which I am sure AWS is working on. A lot of new features, such as streaming and machine learning capabilities, were added recently. Also, it is OK to have both EMR and Glue in the ecosystem.
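The rule of thumb above can be written down as a small function. This is only a sketch of the reasoning in this article, not a complete decision procedure: the function name `suggest_etl_service` is hypothetical, and the default limits are the 3 GB and 15 minute Lambda figures quoted earlier.

```python
def suggest_etl_service(peak_memory_gb, runtime_minutes, has_persistent_emr,
                        lambda_memory_limit_gb=3, lambda_timeout_minutes=15):
    """Suggest a service following the article's rule of thumb."""
    # Lightweight jobs that fit within Lambda's limits go to Lambda.
    if (peak_memory_gb < lambda_memory_limit_gb
            and runtime_minutes < lambda_timeout_minutes):
        return "Lambda"
    # Heavier jobs reuse a persistent EMR cluster if one already exists.
    if has_persistent_emr:
        return "EMR"
    # Otherwise, starting fresh, Glue is the default choice.
    return "Glue"


choice = suggest_etl_service(peak_memory_gb=16, runtime_minutes=90,
                             has_persistent_emr=False)
```

Real decisions will also weigh cost, team skills and existing tooling, which a function like this does not capture.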