June 12, 2020
AWS Glue with an example
AWS Glue is a fully managed, serverless ETL service. It is used for ETL workloads and, perhaps most importantly, in data lake ecosystems. Its high-level capabilities are covered in one of my previous posts here; in this post I want to go into detail on the Glue Catalog and Glue Jobs, and walk through an example that illustrates a simple job.
- The Glue catalog is a metadata repository that is populated automatically when Glue crawlers crawl your datasets. It contains tables, grouped within databases, that the crawlers create, and these tables can be queried via AWS Athena. Crawlers can crawl S3, RDS, DynamoDB, Redshift and any on-prem database that can be reached via JDBC. The crawled datasets can then be used as source or target connections when developing Glue jobs.
- Crawlers work by running a set of built-in classifiers against the dataset in a fixed order. If a classifier can't recognize the data, the crawler invokes the next one. We can also build a custom classifier and use that. Crawlers can be scheduled to run at regular intervals, and there is an option to either update the schema of the data catalog tables or ignore schema changes.
- Here are some bad experiences I had with crawlers that you should watch out for. One, crawling an S3 dataset that has millions of small files takes a long time and can be a costly affair. Two, make sure the data is organized in partitions and that each folder being crawled contains files with a similar schema. If that is not the case, say a folder has 100 files with different schemas, the crawler can end up creating 100 tables.
- Glue provides two shells to execute code: the Python shell and the Spark shell. The Python shell runs plain Python code in a non-distributed environment. The Spark shell is a distributed environment that runs Spark code written in either PySpark or Scala.
- In the Spark shell, in addition to data frames and the other constructs Spark has, Glue adds a new construct called the dynamic frame which, unlike a data frame, requires no schema up front. Dynamic frames have a few transformation functions of their own, and we can always convert a dynamic frame to a data frame and back to use either one's transformation functions.
- A Glue dev endpoint provides an environment to author and test jobs. It is an EMR cluster with all the Glue utilities installed.
- Here are some of the important configuration parameters for Glue jobs.
- Bookmark is a feature that tracks data already processed by the job in previous runs by persisting state information. For example, if we have a Glue job that reads partitioned data from S3, it reads only the new partitions when the bookmark option is enabled (it is disabled by default). This feature can also be used for relational sources, where it keeps track of new data using the keys defined for the table.
- DPU (Data Processing Unit) is the term used to denote the processing power allocated to a Glue job. A single DPU is 4 vCPUs and 16 GB of memory. A Spark job requires a minimum of 2 DPUs. The worker type determines what kind of nodes are used: Standard, G.1X (for memory-intensive jobs) and G.2X (for ML transforms).
- A delay notification threshold can be set so that a notification is raised (via CloudWatch) when a job runs longer than the threshold time.
- Runtime parameters can be passed to the job as well.
- Other Features
- Metrics - Glue provides the Spark web UI to monitor and debug jobs. Job logs can be viewed in CloudWatch.
- It also supports streaming ETL, where we can set up continuous ingestion pipelines from sources like Kinesis and Kafka and load the data into S3 or other data stores.
- It has a feature called workflows to orchestrate crawlers and jobs with dependencies and triggers.
- It also provides an interface to create and work with SageMaker and Zeppelin notebooks.
- User-defined custom Python packages/modules can be used in both shells by zipping them and storing them in S3.
- There is a warm-up time involved when you run a Glue job. It used to be around 10 minutes; with Glue 2.0 it has been brought down to about 1 minute.
- Pricing - You are billed only for the time the ETL job takes to run, with no upfront cost and no charge for startup or shutdown time. The charge is based on the number of DPUs used by the job, at $0.44 per DPU-hour. Glue version 2.0 has a 1-minute minimum billing duration and older versions have a 10-minute minimum billing duration. Data catalog storage and crawler runs are charged separately.
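To make the pricing bullet concrete, here is a rough cost estimator built from the numbers quoted above. It is only a sketch: the function name is my own, it assumes billing is per second once the minimum duration is reached, and the $0.44 rate is the figure from this post, so check the current price for your region before relying on it.

```python
# Back-of-the-envelope Glue job cost, using the figures quoted in this post.
PRICE_PER_DPU_HOUR = 0.44        # USD; may differ by region and over time
MIN_BILLED_SECONDS_V2 = 60       # Glue 2.0: 1-minute minimum
MIN_BILLED_SECONDS_LEGACY = 600  # older versions: 10-minute minimum


def estimate_job_cost(dpus, runtime_seconds, glue_2=True):
    """Return the estimated cost in USD of a single job run.

    Assumption: usage is billed per second, but never for less than the
    minimum billing duration of the Glue version in use.
    """
    minimum = MIN_BILLED_SECONDS_V2 if glue_2 else MIN_BILLED_SECONDS_LEGACY
    billed_seconds = max(runtime_seconds, minimum)
    return dpus * (billed_seconds / 3600) * PRICE_PER_DPU_HOUR


if __name__ == "__main__":
    # A 10-DPU job that runs for 30 minutes.
    print(f"10 DPUs, 30 min: ${estimate_job_cost(10, 30 * 60):.2f}")
    # A tiny 20-second job on 2 DPUs: the minimum billing duration dominates.
    print(f"2 DPUs, 20 s (Glue 2.0): ${estimate_job_cost(2, 20):.4f}")
    print(f"2 DPUs, 20 s (older):    ${estimate_job_cost(2, 20, glue_2=False):.4f}")
```

Under these assumptions a 10-DPU job running for 30 minutes costs about $2.20, and for very short jobs the 10-minute minimum on older Glue versions makes each run ten times more expensive than on Glue 2.0.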
Here is an example of a Glue PySpark job that reads from S3, filters the data and writes to DynamoDB. The job can be created from the console or, as is more usual, with infrastructure-as-code tools like AWS CloudFormation, Terraform etc. The code can be found here.
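For readers who just want the shape of such a job, below is a minimal sketch of what a read-filter-write script might look like. It is not the linked code: the `amount` column, the threshold, and the parameter names `source_database`, `source_table` and `target_table` are all made up for illustration. It assumes the S3 data has already been crawled into the Glue catalog and that the job runs on a Glue version that supports writing to DynamoDB.

```python
import sys

MIN_AMOUNT = 100.0  # hypothetical threshold, for illustration only


def is_large_order(record, min_amount=MIN_AMOUNT):
    """Filter predicate: keep records whose 'amount' meets the threshold.

    'amount' is an invented column name; replace it with a real field.
    """
    amount = record["amount"]
    return amount is not None and float(amount) >= min_amount


def run_job(argv):
    # Glue and Spark libraries are imported here because they are only
    # available inside the Glue runtime (or a dev endpoint).
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Filter
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Runtime parameters, passed to the job as --source_database and so on.
    args = getResolvedOptions(
        argv, ["JOB_NAME", "source_database", "source_table", "target_table"]
    )

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the crawled S3 dataset through the catalog as a dynamic frame.
    # The transformation_ctx is what job bookmarks use to track progress.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=args["source_database"],
        table_name=args["source_table"],
        transformation_ctx="source",
    )

    # Keep only the records that satisfy the predicate.
    filtered = Filter.apply(frame=source, f=is_large_order)

    # Write the filtered records to the DynamoDB table.
    glue_context.write_dynamic_frame_from_options(
        frame=filtered,
        connection_type="dynamodb",
        connection_options={"dynamodb.output.tableName": args["target_table"]},
    )

    # Persists bookmark state when bookmarks are enabled for the job.
    job.commit()


# Only start the job when a --JOB_NAME argument is present, as in a Glue run.
if any(arg.startswith("--JOB_NAME") for arg in sys.argv):
    run_job(sys.argv)
```

Keeping the filter as a plain Python function means it can be unit tested locally with ordinary dictionaries, without a Spark cluster, while the Glue-specific calls stay inside `run_job`.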