Portfolio

Projects

Open-Source Data Lakehouse

  • Architected and built a scalable data lakehouse on Kubernetes (EKS) using open-source tools including Apache Airflow, Apache Spark, Apache Iceberg, Trino, Hive Metastore, and JupyterLab.
  • Used Amazon S3 as the object storage layer for the lakehouse tables, for scalable and cost-efficient data access.
  • Automated ETL workflows as Apache Airflow DAGs that run Spark jobs against Iceberg tables for batch and real-time data processing.
  • Implemented cloud-native authentication with IAM Roles for Service Accounts (IRSA), so Spark, Airflow, and Trino workloads are authorized through their Kubernetes service accounts without long-lived static credentials.
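
As a rough illustration of how Spark, Iceberg, Hive Metastore, S3, and IRSA fit together in a setup like this, the sketch below assembles the Spark settings such a job would typically need. It is not the project's actual configuration: the catalog name, metastore address, and bucket are hypothetical, and the credentials provider class assumes the Hadoop S3A connector built on AWS SDK v1.

```python
def iceberg_spark_conf(catalog, metastore_uri, warehouse):
    """Build Spark settings for an Iceberg catalog backed by Hive Metastore."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog and points it at Hive Metastore.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": metastore_uri,
        f"{prefix}.warehouse": warehouse,
        # With IRSA, the pod receives a web identity token; this provider
        # (assumption: S3A on AWS SDK v1) exchanges it for role credentials.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider",
    }


# Hypothetical values, for illustration only.
conf = iceberg_spark_conf(
    catalog="lakehouse",
    metastore_uri="thrift://hive-metastore:9083",
    warehouse="s3a://example-bucket/warehouse",
)

# Print the settings as spark-submit arguments.
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Keeping the settings in one function makes it easier for an Airflow task and an interactive JupyterLab session to share the same catalog definition.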

Real-Time Location-Based Marketing

  • Developed a Spark Streaming solution to capture real-time location data from ISP towers.
  • Extracted customer data from the data warehouse into the data lake and validated it for targeted marketing.
  • Integrated with Apache Cassandra to track message recipients, and enforced a 6-month data retention policy for GDPR compliance.
  • Automated the delivery of relevant customer data to external marketing APIs.
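
One common way to enforce a retention window in Cassandra is a per-row TTL, so expired recipient records are dropped by the database itself. The sketch below only builds the CQL statement and is not taken from the project: the keyspace, table, and column names are hypothetical, and "6 months" is approximated as 183 days.

```python
from datetime import timedelta

# Assumption: the 6-month retention policy is approximated as 183 days.
RETENTION = timedelta(days=183)


def recipient_insert_cql(keyspace, table, retention=RETENTION):
    """Return a prepared-statement CQL INSERT whose rows expire after `retention`."""
    ttl_seconds = int(retention.total_seconds())
    return (
        f"INSERT INTO {keyspace}.{table} (customer_id, campaign_id, sent_at) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds}"
    )


# Hypothetical keyspace and table names.
statement = recipient_insert_cql("marketing", "message_recipients")
print(statement)
```

The `?` markers are bind parameters, so the statement can be prepared once with a Cassandra driver and reused for every message sent.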

User Data Anonymization in Data Lake

  • Implemented a GDPR-compliant data anonymization pipeline using Apache Hudi and Apache Iceberg in a Docker-based data lake environment.
  • Masked Personally Identifiable Information (PII) to safeguard user privacy.
  • Conducted performance testing and evaluation to validate pipeline scalability and effectiveness.
  • Documented implementation outcomes and provided best-practice recommendations for production deployment.
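
To make the masking step concrete, here is a minimal sketch of one possible approach, keyed hashing with HMAC-SHA256. It is not necessarily the technique used in the project: the field names and key are hypothetical, and keyed hashing is strictly pseudonymization, since anyone holding the key can re-link values.

```python
import hashlib
import hmac

# Hypothetical set of columns treated as PII.
PII_FIELDS = {"name", "email", "phone"}


def mask_value(value, key):
    """Replace a value with its HMAC-SHA256 digest (hex), keyed by `key`."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, key, pii_fields=PII_FIELDS):
    """Return a copy of `record` with non-null PII fields masked."""
    return {
        field: mask_value(str(value), key)
        if field in pii_fields and value is not None
        else value
        for field, value in record.items()
    }


# Demo key only; a real key would come from a secrets manager.
key = b"demo-key-not-for-production"
row = {"user_id": 42, "email": "jane@example.com", "country": "DE"}
masked = mask_record(row, key)
print(masked)
```

Because the digest is deterministic for a given key, masked columns can still be joined or counted across tables, while the original values are not stored.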

Contact