Rahul Dastidar is a Data Engineering Consultant with 7+ years of experience in cloud and distributed computing, specializing in PySpark development and Master Data Management. He has a proven track record of optimizing data pipelines and ensuring data integrity.
Designed and implemented scalable PySpark pipelines on Databricks and AWS EMR.
Reduced Spark job execution time by up to 35% and achieved 20% cost savings through job tuning.
Automated end-to-end CI/CD for PySpark pipelines using GitHub Actions and Jenkins.
Established data governance frameworks ensuring 99% data integrity.
Built and optimized PySpark pipelines handling 1B+ records, making processing 40% faster.
Integrated enterprise datasets into Reltio MDM, enabling a unified customer 360 view.
Improved data survivorship and match/merge accuracy by 25% through advanced rule configuration.
Key outcomes:
Designed and implemented PySpark pipelines for multi-terabyte datasets.
Optimized Spark jobs, reducing execution time by 35%.
Automated ingestion pipelines using AWS Lambda and Step Functions.
Key outcomes:
Led migration of customer master data into Reltio MDM.
Built PySpark ETL workflows for master data cleansing.
Ensured 99% data integrity through governance frameworks.
Key outcomes:
Architected big data pipelines on AWS EMR for real-time data ingestion.
Developed ETL jobs integrating customer data with MDM systems.
Tuned Spark jobs, achieving 20% cost savings.
Key outcomes:
Designed data workflows with PySpark and Hive to improve reporting accuracy.
Configured ELK-based monitoring for Spark job failures.
Collaborated on data modeling and cleansing strategies.