Portfolio

Projects

Open-Source Data Lakehouse

Architected and built a scalable data lakehouse on Kubernetes (EKS) using open-source tools including Apache Airflow, Apache Spark, Apache Iceberg, Trino, Hive Metastore, and JupyterLab.
Used Amazon S3 as the storage layer for cost-efficient, scalable data access.
Automated complex ETL workflows as Apache Airflow DAGs that orchestrate Spark jobs on Iceberg tables for batch and real-time data processing.
Implemented cloud-native authentication with IAM Roles for Service Accounts (IRSA) so that Spark, Airflow, and Trino workloads are authorized through their Kubernetes service accounts, improving security and scalability.
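
As an illustration of the IRSA setup described above, here is a minimal sketch of the one piece that ties a Kubernetes service account to an IAM role: the `eks.amazonaws.com/role-arn` annotation. The helper name `irsa_service_account`, the service account name, the namespace, and the role ARN are all hypothetical placeholders, not values from the actual project.

```python
import json


def irsa_service_account(name: str, namespace: str, role_arn: str) -> dict:
    """Build a Kubernetes ServiceAccount manifest annotated for IRSA.

    Pods that run under this service account on EKS can assume the IAM
    role named in the annotation. All argument values are supplied by
    the caller; nothing here is taken from the real deployment.
    """
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # Hypothetical names and ARN, for illustration only.
    manifest = irsa_service_account(
        name="spark",
        namespace="data",
        role_arn="arn:aws:iam::111122223333:role/example-spark-s3-access",
    )
    # JSON is valid input for `kubectl apply -f`.
    print(json.dumps(manifest, indent=2))
```

One service account per workload (Spark, Airflow, Trino) keeps each IAM role scoped to what that workload needs.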

Real-Time Location-Based Marketing

Developed a Spark Streaming solution to capture real-time location data from ISP towers.
Extracted customer data from the data warehouse, validated it, and loaded it into the data lake for targeted marketing.
Integrated with Apache Cassandra to track message recipients, and enforced a 6-month data retention policy for GDPR compliance.
Automated data processing to efficiently send relevant customer data to external marketing APIs.
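
The 6-month retention policy on recipient tracking data could be enforced with Cassandra's per-write TTL. The sketch below is an assumption about how that might look, not the project's actual code: it approximates six months as 183 days, and the table and column names are hypothetical.

```python
from datetime import timedelta

# Assumption: "6 months" is approximated as 183 days. The exact
# definition used in the project is not stated.
RETENTION = timedelta(days=183)


def retention_ttl_seconds(retention: timedelta = RETENTION) -> int:
    """Convert the retention period to whole seconds, the unit CQL TTL expects."""
    return int(retention.total_seconds())


def build_insert_cql(table: str, ttl_seconds: int) -> str:
    """Build a parameterized CQL INSERT whose rows expire after ttl_seconds.

    The column names (customer_id, campaign_id, sent_at) are hypothetical.
    """
    return (
        f"INSERT INTO {table} (customer_id, campaign_id, sent_at) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds}"
    )


if __name__ == "__main__":
    print(build_insert_cql("marketing.sent_messages", retention_ttl_seconds()))
```

With a TTL on each write, expired rows are removed by Cassandra itself, so no separate cleanup job is needed to honor the retention window.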

User Data Anonymization in Data Lake

Implemented a GDPR-compliant data anonymization pipeline using Apache Hudi and Apache Iceberg within a Docker-based data lake environment.
Masked Personally Identifiable Information (PII) to safeguard user privacy.
Conducted performance testing and evaluation to validate pipeline scalability and effectiveness.
Documented implementation outcomes and provided best practice recommendations for production deployment.
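
To illustrate the PII masking step, here is a minimal sketch that replaces selected fields with a keyed SHA-256 digest. Keyed hashing is only one possible masking technique and is swapped in here as an example; the portfolio does not state which method the pipeline used. The field names in `PII_FIELDS` and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical PII columns; the real schema is not given here.
PII_FIELDS = ("email", "phone")


def mask_value(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 hex digest; equal inputs map to equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, key: bytes, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of the record with each non-null PII field masked.

    Non-PII fields and null values pass through unchanged, and the
    input record is not modified.
    """
    masked = dict(record)
    for field in pii_fields:
        if masked.get(field) is not None:
            masked[field] = mask_value(str(masked[field]), key)
    return masked


if __name__ == "__main__":
    # Illustrative key only; a real key would come from a secret store.
    row = {"email": "user@example.com", "phone": None, "city": "Oslo"}
    print(mask_record(row, key=b"example-key"))
```

Because the digest is deterministic for a given key, masked columns can still be joined and counted, which keeps downstream analytics usable after masking.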