January 5, 2020 · AWS Data Lake

Building a Data Lake in AWS

Data lakes are one of the hot topics in the current data landscape, along with data science and cloud data warehouses. There are many definitions, metaphors and views out there on what constitutes a data lake. At its core, a data lake is a central repository that stores all the raw data, regardless of its source or format. It can also hold refined data derived from that raw data, and the whole can be treated as a single source of truth for organizational needs.

Data is growing faster than we think, and there is a constant effort to unlock value from it, to build data products and to democratize the data. A data lake is the right place to store data for these kinds of efforts.

We can build a data lake on-prem, in the cloud or as a hybrid. Here I want to explain what it takes to build one on AWS, focusing on which components and AWS services are required to build a data lake ecosystem.

Storage

S3 : Storage is the heart of the data lake, and S3 is best suited for it. It is a very cost-effective, reliable and durable service. Based on how you want to organize the data, you can have different sets of buckets. Organizations generally take a zone/layer-based approach with a landing zone, a raw zone and a refined zone. There are a few best practices to follow while storing data in S3, like partitioning the data, compressing the files and using a columnar format like ORC or Parquet for refined data.
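
To make the zone layout concrete, here is a minimal sketch in Python with boto3; the bucket names and the sales dataset are hypothetical, and the object key follows the Hive-style key=value partition convention so Glue and Athena can discover the partitions.

```python
# Minimal sketch of a zone-based S3 layout (hypothetical bucket names).
import boto3

s3 = boto3.client("s3")

# One bucket per zone is a common convention.
LANDING = "acme-datalake-landing"
RAW = "acme-datalake-raw"
REFINED = "acme-datalake-refined"

# Refined data stored as compressed Parquet, partitioned by date.
key = "sales/year=2020/month=01/day=05/part-0000.snappy.parquet"
s3.upload_file("part-0000.snappy.parquet", REFINED, key)
```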

Compute

Compute is required to refine the data, to deliver data to external parties and to load data from the data lake into a cloud data warehouse.

Lambda : It is a serverless service that lets you run code without managing servers. It is a good fit for copying S3 files and other lightweight compute tasks; the limitation is that the code should complete within 15 minutes as of this writing.
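
As an illustration, a minimal Lambda handler that copies newly landed objects into the raw zone might look like the sketch below; the target bucket name is an assumption.

```python
# Minimal sketch of a Lambda handler reacting to an S3 "ObjectCreated"
# event on the landing bucket and copying the object to the raw zone.
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "acme-datalake-raw"  # hypothetical target bucket

def handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=RAW_BUCKET,
            Key=src_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
```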

Glue : It is a serverless ETL service where you can run Python and Spark code. It is a batch service, extends the Spark framework with dynamic frames and lets you set up workflows. It can also connect to on-prem data sources via JDBC and can be used to ingest data into the data lake.
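
A Glue job script is plain Python running on Glue's Spark runtime. The sketch below reads a cataloged table as a DynamicFrame and writes it back as partitioned Parquet; the database, table and S3 path are hypothetical.

```python
# Minimal sketch of a Glue job: catalog table in, Parquet out.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read from the Glue Data Catalog (hypothetical database/table).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="sales"
)

# Write compressed Parquet, partitioned by date, to the refined zone.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://acme-datalake-refined/sales/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)
```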

EMR : It is a managed Hadoop cluster where you can run big data workloads using Spark and other frameworks. When to use Glue versus EMR is a topic of its own, but at a high level, use EMR if you need auto scaling, have to install specific libraries or have real-time processing needs.
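
For example, submitting a Spark step to an already running EMR cluster via boto3 could look like this sketch; the cluster ID and script location are assumptions.

```python
# Minimal sketch of adding a spark-submit step to an EMR cluster.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
    Steps=[{
        "Name": "refine-sales",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://acme-datalake-scripts/refine_sales.py",  # hypothetical
            ],
        },
    }],
)
```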

Kinesis : If you need to ingest streaming data or process data in real time, Kinesis can be used. It has components like Streams to collect data in real time, Firehose to deliver and store the data, and Kinesis Analytics to process data in real time.
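
Producing into a stream is a single API call. The sketch below pushes a JSON event into a hypothetical stream; the partition key determines which shard receives the record.

```python
# Minimal sketch of writing one event to a Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"order_id": 42, "amount": 19.99}  # hypothetical payload
kinesis.put_record(
    StreamName="clickstream-events",       # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["order_id"]),   # controls shard routing
)
```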

Catalog

A data catalog is another important component of a data lake: it tells us what data we have and where we have it.

Glue Crawler/Catalog : Glue has a crawler that crawls the S3 files and builds a schema and table on top of them. It is not a full-fledged catalog service where we can search, look up lineage and so on; it just catalogs the data sets, which can then be queried using the Athena service.
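
Once the crawler has built a table, it can be queried through Athena's API. A minimal sketch, assuming a hypothetical database, table and results bucket:

```python
# Minimal sketch of running an Athena query against a cataloged table.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sales WHERE year = '2020' LIMIT 10",
    QueryExecutionContext={"Database": "datalake_raw"},   # hypothetical
    ResultConfiguration={
        "OutputLocation": "s3://acme-datalake-athena-results/"
    },
)
print(response["QueryExecutionId"])  # poll this id to fetch results
```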

Logging, Monitoring and Alerting

CloudWatch : It is the service to capture logs from processes, raise alerts on failure events and monitor the system. The logs can be shipped to third-party tools for further analysis and monitoring.
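
As an example of alerting, the sketch below creates an alarm on a hypothetical custom metric that pipeline jobs publish on failure; the namespace, metric name and SNS topic are all assumptions.

```python
# Minimal sketch of a CloudWatch alarm on a custom failure metric.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sales-pipeline-failures",
    Namespace="DataLake",            # hypothetical custom namespace
    MetricName="PipelineFailures",   # hypothetical custom metric
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    # Hypothetical SNS topic that pages the on-call engineer.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],
)
```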

Security

IAM: The Identity and Access Management service is used to control access to resources; always apply the principle of least privilege while granting access.
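
A least-privilege policy scopes access down to the exact actions and prefixes a consumer needs. A minimal sketch, with hypothetical names, granting read-only access to a single prefix of the raw bucket:

```python
# Minimal sketch of creating a narrowly scoped read-only IAM policy.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        # Hypothetical: only the sales prefix of the raw bucket.
        "Resource": "arn:aws:s3:::acme-datalake-raw/sales/*",
    }],
}

iam.create_policy(
    PolicyName="datalake-raw-sales-readonly",
    PolicyDocument=json.dumps(policy),
)
```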

KMS: The Key Management Service manages the keys used to encrypt data at rest. S3 offers several server-side encryption options (SSE-S3, SSE-KMS, SSE-C), and the right choice depends on how much control you need over the keys.
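
For example, writing an object with SSE-KMS is just two extra parameters on the S3 call; the bucket name and key alias below are assumptions.

```python
# Minimal sketch of an SSE-KMS encrypted upload to S3.
import boto3

s3 = boto3.client("s3")

with open("part-0000.snappy.parquet", "rb") as body:
    s3.put_object(
        Bucket="acme-datalake-refined",  # hypothetical bucket
        Key="sales/year=2020/month=01/day=05/part-0000.snappy.parquet",
        Body=body,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/datalake-refined",  # hypothetical key alias
    )
```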

Orchestration

Step Functions : It is a workflow orchestration service used to orchestrate jobs, with native integrations for Lambda, Glue and other AWS services.
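
A minimal sketch of a two-step workflow (a Glue job followed by a Lambda notification), expressed in Amazon States Language and registered via boto3; every ARN and name here is hypothetical. The .sync suffix makes Step Functions wait for the Glue job to finish before moving on.

```python
# Minimal sketch of registering a Glue-then-Lambda state machine.
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RefineSales",
    "States": {
        "RefineSales": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "refine-sales"},  # hypothetical job
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            # Hypothetical Lambda that posts a completion message.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="sales-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-datalake",  # hypothetical
)
```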

CI/CD

Infrastructure as code and continuous delivery pipelines are an essential part of modern software delivery. The CloudFormation service provisions resources in AWS and enables infrastructure as code, while services like CodePipeline, together with CodeBuild, CodeDeploy and CodeCommit, enable CI/CD pipelines for fast and reliable deployments.
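
As a minimal illustration, a stack that provisions a data lake bucket can be created straight from code; the stack and bucket names are assumptions.

```python
# Minimal sketch of deploying a CloudFormation stack from Python.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  RawZoneBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: acme-datalake-raw
"""

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="datalake-storage", TemplateBody=TEMPLATE)
```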

Wrapping up …

AWS Lake Formation is a fully managed service launched late last year and made generally available a couple of months ago; it makes it easier to build, secure and manage data lakes. It is a kind of template or blueprint that automates steps like ingesting, cleansing, moving, cataloging and securing the data. This service can help if you are setting up a data lake, and it can reduce the setup time from months to days.

There are other services like Athena (query service), Redshift (cloud data warehouse), SageMaker (machine learning) and QuickSight (visualization) that exist in the ecosystem and complement the data lake.

Data lakes are here to stay, even as next-generation data architectures beyond the data lake, like the data mesh, evolve. We have to pay attention to how we build the data lake, with the right governance, data quality and security, or it can turn into a data swamp. The best way to start is to start small: take one line of business or domain at a time and build upon it.