
Projects
Open-Source Data Lakehouse
- Architected and built a scalable data lakehouse on Kubernetes (EKS) using open-source tools including Apache Airflow, Apache Spark, Apache Iceberg, Trino, Hive Metastore, and JupyterLab.
- Used Amazon S3 as the object storage layer for scalable, cost-efficient data access.
- Automated complex ETL workflows leveraging Apache Airflow DAGs with Spark and Iceberg for efficient batch and real-time data processing.
- Implemented secure cloud-native authentication using IAM Roles for Service Accounts (IRSA) to seamlessly authorize workloads across Spark, Airflow, and Trino, enhancing security and scalability.
Real-Time Location-Based Marketing
- Developed a Spark Streaming solution to capture real-time location data from ISP towers.
- Extracted customer data from the data warehouse, validated it, and loaded it into the data lake for targeted marketing.
- Integrated with Apache Cassandra to track message recipients and enforced a 6-month data retention policy to support GDPR compliance.
- Automated data processing to efficiently send relevant customer data to external marketing APIs.
User Data Anonymization in Data Lake
- Implemented a GDPR-compliant data anonymization pipeline using Apache Hudi and Apache Iceberg within a Docker-based data lake environment.
- Masked Personally Identifiable Information (PII) to safeguard user privacy.
- Conducted performance testing and evaluation to validate pipeline scalability and effectiveness.
- Documented implementation outcomes and provided best practice recommendations for production deployment.
Contact

