AWS Serverless Data Lake: Built Real-Time with Apache Hudi, AWS Glue and Kinesis Streams
In an enterprise system, populating a data lake relies heavily on interdependent batch processes. Typically these data lakes are refreshed only every few hours. Today's business demands high-quality data not in a matter of hours or days but in minutes or seconds.
The typical steps to update a data lake are (a) build the incremental data, and (b) read the existing data lake files, apply the incremental changes, and rewrite the files (note: S3 objects are immutable, so any change means writing a whole new object). This also raises the challenge of ACID compliance between readers and writers of the data lake.
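To make step (b) concrete, here is a minimal in-memory sketch of the read, merge and rewrite cycle. The function name and the sample rows are made up for illustration; a real job would read and write Parquet files on S3 rather than Python lists.

```python
def rewrite_with_increment(existing_rows, incremental_rows, key="id"):
    """Return every row the rewritten data lake file would contain."""
    # Read the existing file into a lookup keyed by record id.
    merged = {row[key]: row for row in existing_rows}
    # Apply the increment: new ids are inserted, known ids are overwritten.
    for change in incremental_rows:
        merged[change[key]] = change
    # The whole file is written out again, even if only one row changed.
    return list(merged.values())


existing = [{"id": 1, "email": "a@old.example"}, {"id": 2, "email": "b@old.example"}]
incremental = [{"id": 2, "email": "b@new.example"}, {"id": 3, "email": "c@new.example"}]
rewritten = rewrite_with_increment(existing, incremental)
```

The cost is visible even in the toy version: every unchanged row is copied again on each run, which is why this pattern is usually run in batches a few hours apart.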
Apache Hudi stands for Hadoop Upserts Deletes and Incrementals. Hudi is a data storage framework that sits on top of HDFS, S3 and similar storage. It brings in streaming primitives that let you incrementally process updates and deletes of records and fetch only the records that have changed.
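The idea behind these primitives can be shown with a toy in-memory model. This is not Hudi's API, and the class and method names are invented for this sketch; it only shows how tagging every record with the commit that last wrote it makes "give me what changed since commit N" a cheap question.

```python
class ToyTable:
    """Toy model of a table that tags each record with a commit number."""

    def __init__(self):
        self.rows = {}    # record key -> (commit number, row)
        self.commit = 0   # last commit number handed out

    def upsert(self, rows, key="id"):
        """Insert new records or overwrite existing ones in one commit."""
        self.commit += 1
        for row in rows:
            self.rows[row[key]] = (self.commit, row)
        return self.commit

    def delete(self, keys):
        """Remove records by key in one commit."""
        self.commit += 1
        for record_key in keys:
            self.rows.pop(record_key, None)
        return self.commit

    def changed_since(self, commit):
        """Return rows written by any commit later than the given one."""
        return [row for written_at, row in self.rows.values() if written_at > commit]


table = ToyTable()
first = table.upsert([{"id": "1", "email": "a@example.com"},
                      {"id": "2", "email": "b@example.com"}])
table.upsert([{"id": "2", "email": "b2@example.com"}])
changed = table.changed_since(first)  # only the row for id "2"
```

A downstream consumer that remembers the last commit it processed can keep calling `changed_since` instead of rescanning the full table. In this toy version deleted keys simply disappear and are not reported by `changed_since`.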
In our setup, DynamoDB is the primary database. Changes in DynamoDB need to be reflected in the S3 data lake almost immediately. The setup that brings this together:
- Enable Change Data Capture (CDC) on DynamoDB. The changes are pushed to a Kinesis stream.
- A Glue (Spark) job acts as a consumer of this change stream. The changes are micro-batched using a window length. In the script, this length is 100 seconds. The records are then processed and pushed to S3 using the Hudi connector libraries.
- The script assumes a simple customer record in DynamoDB with the following attributes being inserted or changed: id (partition key), custName (sort key), email, registrationDate.
- Various configuration settings tell Hudi how to behave. We have enabled sync with Hive, which means the table metadata is also created in the AWS Glue Data Catalog. The table can then be queried with Amazon Athena in near real time.
- The configuration names a partition key. For demonstration purposes, this is the last digit of the customer id, which creates 10 partitions (0–9) and spreads customer data across them. Another setting tells Hudi to create Hive-style partitions.
- You can also control the number of commits retained through configuration. This lets you time travel through the data.
- The Hudi write happens through a connector. There are two lines for this in the script (one commented out): you can use either the Marketplace connector or your own custom connector.
- Finally, note the Glue line where we set up the consumer to pick up a batch of records every 100 seconds.
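The settings described in the bullets above can be sketched in plain Python. This is not the script from my repo, only an illustration under stated assumptions: the table, database and bucket names are hypothetical, using id alone as the Hudi record key is an assumption, the `partitionkey`, `update_ts` and `is_deleted` column names are made up, and the change event is assumed to use the DynamoDB JSON layout (`eventName`, `dynamodb.Keys`, `dynamodb.NewImage`). Check the `hoodie.*` keys against the Hudi version you actually deploy.

```python
TABLE_NAME = "customer_hudi"       # hypothetical table name
DATABASE_NAME = "datalake_db"      # hypothetical Glue Catalog database

# Micro-batch settings, typically passed to
# glueContext.forEachBatch(frame=..., batch_function=..., options=STREAM_OPTIONS)
STREAM_OPTIONS = {
    "windowSize": "100 seconds",
    "checkpointLocation": "s3://example-bucket/checkpoints/",  # hypothetical path
}


def flatten_change(event):
    """Turn one DynamoDB change event into a flat row ready for a Hudi upsert."""
    ddb = event["dynamodb"]
    # REMOVE events may carry only the keys, so fall back to them.
    image = ddb.get("NewImage") or ddb["Keys"]
    # DynamoDB JSON wraps each value in a one-entry type map, e.g. {"S": "abc"}.
    row = {name: next(iter(typed.values())) for name, typed in image.items()}
    row["partitionkey"] = str(row["id"])[-1]            # last character of the id
    row["update_ts"] = ddb.get("ApproximateCreationDateTime")
    row["is_deleted"] = event["eventName"] == "REMOVE"
    return row


def hudi_write_options(commits_retained=10):
    """Build the Hudi write options: upsert, Hive sync, partitioning, retention."""
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "id",            # assumption
        "hoodie.datasource.write.precombine.field": "update_ts",
        "hoodie.datasource.write.partitionpath.field": "partitionkey",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": DATABASE_NAME,
        "hoodie.datasource.hive_sync.table": TABLE_NAME,
        "hoodie.datasource.hive_sync.partition_fields": "partitionkey",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(commits_retained),
    }


sample_event = {  # made-up event for illustration
    "eventName": "MODIFY",
    "dynamodb": {
        "ApproximateCreationDateTime": 1700000000000,
        "Keys": {"id": {"S": "cust-1047"}, "custName": {"S": "Jane"}},
        "NewImage": {
            "id": {"S": "cust-1047"},
            "custName": {"S": "Jane"},
            "email": {"S": "jane@example.com"},
            "registrationDate": {"S": "2021-01-05"},
        },
    },
}
row = flatten_change(sample_event)
options = hudi_write_options()
```

With Hive-style partitioning enabled, the sample row would land under a folder such as `partitionkey=7/`, since the id ends in 7.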
Setting up the Hudi Connector in AWS Glue
- Marketplace connector: Go to AWS Marketplace and search for “Apache Hudi Connector”. The steps from there are simple and guided through the AWS console.
- Custom connector: In some organizations, AWS Marketplace access is not available. In that case you need two JAR files: (a) the Hudi-Spark bundle, hudi-spark-bundle_2.11-0.5.3-rc2, which I have compiled and placed in my GitHub repo for easy reference, and (b) an Avro-Schema JAR, also available in my repo. To create the connector, go to AWS Glue Studio -> Create Custom connector. Select the hudi-spark-bundle_2.11-0.5.3-rc2 JAR as the S3 URL, and set Connector Type to Spark and Class Name to org.apache.hudi.DefaultSource. When creating your Glue job with the custom connector, also include the Avro-Schema JAR as a dependent JAR.
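Whichever connector you choose, the Glue write call differs mainly in its connection type. The helper below is a hypothetical sketch of how the two variants could be assembled. The `marketplace.spark` and `custom.spark` connection types and the `className`, `connectionName` and `path` options follow the Glue connector conventions as I understand them; the connection name, bucket path and the `org.apache.hudi` class name value are assumptions to verify against your own connector.

```python
def connector_write_settings(hudi_options, target_path, connection_name,
                             use_marketplace=True):
    """Return (connection_type, connection_options) for a Glue connector write."""
    # Marketplace and custom connectors use different Glue connection types.
    connection_type = "marketplace.spark" if use_marketplace else "custom.spark"
    # Copy so the caller's Hudi options are not modified.
    connection_options = dict(hudi_options)
    connection_options.update({
        "className": "org.apache.hudi",      # assumption: Hudi Spark data source
        "connectionName": connection_name,   # name given when creating the connection
        "path": target_path,                 # S3 location of the Hudi table
    })
    return connection_type, connection_options


conn_type, conn_options = connector_write_settings(
    {"hoodie.table.name": "customer_hudi"},       # hypothetical options
    "s3://example-bucket/customer_hudi/",         # hypothetical path
    "hudi-connection",                            # hypothetical connection name
    use_marketplace=False,
)
# Inside the Glue job these would be handed to
# glueContext.write_dynamic_frame.from_options(
#     frame=..., connection_type=conn_type, connection_options=conn_options)
```

Keeping the switch in one place mirrors the two write lines in the script, where one is commented out depending on the connector in use.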
That is it. All inserts and updates in DynamoDB will now be reflected seamlessly in your S3 data lake, with near-real-time access. The script and JARs are available in my GitHub repo.
Happy coding!