Recently I built an application that uses AWS Lambda to load data from a data lake into Redshift at regular intervals. The steps to compile the adapter for the AWS Lambda environment are given here. I also uploaded it to GitHub here, so one can use it without going through the compilation steps.
AWS Lambda has gained huge momentum in the last couple of years and has enabled software architects and developers to build on FaaS (Function as a Service). As much as Lambda helps in scaling applications, it has limitations such as execution duration and available memory. For long-running jobs, typically backend or batch processing, the 5-minute execution limit can be a deal breaker. But with appropriate data partitioning and architecture, it is still an excellent option for enterprises to scale their applications cost-effectively.
In a recent project, I architected a pipeline that loads data from a data lake into Redshift. The data is produced by an engine in batches and pushed to S3. It is partitioned by time, and a consumer Python application loads it at regular intervals into a Redshift staging environment. For a scalable solution, the data lake can be populated by multiple producers, and similarly one or more consumers can drain the data lake queue to load into Redshift. The data from multiple staging tables is then loaded into the final table after deduplication and data augmentation.
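To make the flow above more concrete, here is a minimal sketch of how a consumer might build the SQL for one time partition. Everything specific in it is an assumption for illustration, not the project's actual code: the bucket name, the hourly `events/YYYY/MM/DD/HH/` key layout, the IAM role ARN, the JSON file format, the table names, and the `event_id` key column are all hypothetical. The delete-then-insert merge is one common way to dedupe from a staging table in Redshift, not necessarily the method used in the project. The functions only return SQL strings; running them through a database connection is left out.

```python
from datetime import datetime, timezone

# Hypothetical placeholders, not taken from the original project.
BUCKET = "example-datalake"
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"


def partition_prefix(ts: datetime) -> str:
    """Return the S3 prefix of the hourly partition containing ts (assumed layout)."""
    return f"s3://{BUCKET}/events/{ts:%Y/%m/%d/%H}/"


def copy_sql(staging_table: str, ts: datetime) -> str:
    """Build a Redshift COPY statement that loads one partition into a staging table."""
    return (
        f"COPY {staging_table} "
        f"FROM '{partition_prefix(ts)}' "
        f"IAM_ROLE '{IAM_ROLE}' "
        "FORMAT AS JSON 'auto';"  # file format is an assumption
    )


def merge_sql(staging_table: str, final_table: str, key: str) -> list[str]:
    """Build a delete-then-insert merge that replaces rows sharing the same key."""
    return [
        "BEGIN;",
        # Remove rows in the final table that the staging table is about to replace.
        f"DELETE FROM {final_table} USING {staging_table} "
        f"WHERE {final_table}.{key} = {staging_table}.{key};",
        # DISTINCT drops exact duplicate rows within the staging table itself.
        f"INSERT INTO {final_table} SELECT DISTINCT * FROM {staging_table};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    ts = datetime(2018, 3, 5, 14, tzinfo=timezone.utc)
    print(copy_sql("staging_events", ts))
    for statement in merge_sql("staging_events", "events", "event_id"):
        print(statement)
```

Because each invocation only touches one time partition, a single run stays small, which is what keeps the work within Lambda's execution limit. Several consumers can run the same code against different partitions and staging tables.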