Rahul Dastidar  ·  Senior PySpark Data Engineer  ·  7+ yrs

Senior
7+ years experience · Remote
Available within 48 hrs

Proof of scale

1B+ records processed
35% reduction in execution time
20% cost savings
99% data integrity

About Rahul

Rahul Dastidar is a Data Engineering Consultant with 7+ years of experience in cloud and distributed computing, specializing in PySpark development and Master Data Management. He has a proven track record of optimizing data pipelines and ensuring data integrity.

7+ years of commercial experience across the skills listed below.

Skills (16)

PySpark · AWS · Apache Spark · GitHub Actions · Reltio MDM · Azure · REST APIs · Databricks · AWS Lambda · AWS Step Functions · SQL · Azure Databricks · AWS EMR · Informatica · Hive · Elasticsearch

Why hire Rahul?

Production deploy authority · Mentored 5+ juniors

Designed and implemented scalable PySpark pipelines on Databricks and AWS EMR.

Achieved up to 35% reduction in execution time and 20% cost savings through optimization.

Automated end-to-end CI/CD for PySpark pipelines using GitHub Actions & Jenkins.

Established data governance frameworks ensuring 99% data integrity.

Built and optimized PySpark pipelines handling 1B+ records with 40% faster processing.

Successfully integrated enterprise datasets into Reltio MDM, enabling a unified customer 360 view (see the REST integration sketch after this list).

Improved data survivorship and match/merge accuracy by 25% through advanced rule configuration.
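
For illustration, a minimal sketch of pushing a record into Reltio over its REST entities API; the environment URL, tenant ID, attribute names, and token handling are assumptions for the example, not details of any client integration.

```python
import requests

# Hypothetical environment, tenant, and credentials; real values come from the client's Reltio setup.
RELTIO_BASE = "https://example-env.reltio.com/reltio/api/EXAMPLE_TENANT"
ACCESS_TOKEN = "REPLACE_WITH_OAUTH_TOKEN"  # obtained separately from Reltio's auth service

def upsert_customer(crosswalk_value: str, first_name: str, last_name: str) -> dict:
    """Create or update one Individual entity; the payload is a simplified sketch of Reltio's entities API."""
    payload = [{
        "type": "configuration/entityTypes/Individual",
        "attributes": {
            "FirstName": [{"value": first_name}],
            "LastName": [{"value": last_name}],
        },
        # Crosswalk ties the Reltio entity back to the source-system record (source name assumed).
        "crosswalks": [{"type": "configuration/sources/CRM", "value": crosswalk_value}],
    }]
    resp = requests.post(
        f"{RELTIO_BASE}/entities",
        json=payload,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example call with illustrative values:
# upsert_customer("CRM-0001", "Asha", "Verma")
```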

Project highlights (4)

Project 1: DevOps & Data Engineering Consultant

  • Designed and implemented PySpark pipelines on Databricks to process and transform multi-terabyte datasets.
  • Integrated enterprise data sources with Reltio MDM using REST APIs for entity synchronization.
  • Built data quality rules (deduplication, enrichment, survivorship logic) before ingestion into Reltio.
  • Optimized Spark jobs with partitioning and caching, reducing execution time by 35%.
  • Automated workflows with AWS Lambda + Step Functions, orchestrating ingestion pipelines.
PySpark · Databricks · Reltio MDM · REST APIs · Apache Spark · AWS Lambda · AWS Step Functions · SQL

Key outcomes:

  • Designed and implemented PySpark pipelines for multi-terabyte datasets.

  • Optimized Spark jobs with partitioning and caching, reducing execution time by 35% (see the sketch below).

  • Automated ingestion pipelines using AWS Lambda and Step Functions.
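
For context, a minimal PySpark sketch of the partition-and-cache pattern described above; the S3 paths, column names, and partition count are illustrative assumptions, not details of the actual Databricks pipelines.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer_transform_sketch").getOrCreate()

# Hypothetical input path and schema.
raw = spark.read.parquet("s3://example-bucket/raw/customers/")

# Repartition on the join/aggregation key to spread the shuffle evenly,
# then cache because the DataFrame is reused by more than one downstream step.
customers = raw.repartition(200, "customer_id").cache()

# Reuse 1: per-customer activity counts.
activity = customers.groupBy("customer_id").agg(F.count("*").alias("event_count"))

# Reuse 2: recent records for a separate curated output (column name assumed).
recent = customers.filter(F.col("updated_at") >= "2024-01-01")

activity.write.mode("overwrite").parquet("s3://example-bucket/curated/activity/")
recent.write.mode("overwrite").parquet("s3://example-bucket/curated/recent/")
```

Caching only pays off when a DataFrame is reused across actions; for a single pass it mostly adds memory pressure, so the gain comes from combining it with key-based repartitioning.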

Project 2: Senior Manager – Cloud & Data Engineering

  • Led migration of client's customer master data into Reltio MDM, configuring match & merge rules.
  • Built PySpark ETL workflows in Azure Databricks for cleansing, enrichment, and deduplication of master records.
  • Implemented Reltio entity modeling and integrated downstream applications through REST APIs.
  • Established data governance frameworks ensuring 99% data integrity.
Reltio MDM · PySpark · Azure Databricks · REST APIs · Apache Spark

Key outcomes:

  • Led migration of customer master data into Reltio MDM.

  • Built PySpark ETL workflows for master data cleansing and deduplication (see the sketch below).

  • Ensured 99% data integrity through governance frameworks.
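
A hedged sketch of the deduplication and survivorship step such a cleansing workflow typically includes; the key columns and the "latest update wins" rule are illustrative assumptions, not the client's actual match-and-merge configuration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dedup_survivorship_sketch").getOrCreate()

# Hypothetical master-data extract; column names are assumptions.
records = spark.read.parquet("s3://example-bucket/raw/master_records/")

# Standardise a few attributes before matching.
cleansed = (records
    .withColumn("email", F.lower(F.trim("email")))
    .withColumn("name", F.initcap(F.trim("name"))))

# Simple survivorship rule for the sketch: within each natural key,
# keep the most recently updated record (real MDM match/merge rules are richer).
w = Window.partitionBy("source_system", "natural_key").orderBy(F.col("updated_at").desc())
golden = (cleansed
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == 1)
    .drop("rank"))

golden.write.mode("overwrite").parquet("s3://example-bucket/curated/golden_records/")
```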

Project 3: System Architect – AWS & Big Data

  • Architected big data pipelines with PySpark on AWS EMR, enabling real-time ingestion of enterprise data.
  • Developed ETL jobs integrating customer data with MDM systems (Reltio & Informatica).
  • Automated synchronization between Reltio MDM and analytics platforms, ensuring high-quality master data.
  • Monitored Spark cluster performance and tuned jobs for cost efficiency (20% savings).
PySpark · AWS EMR · Reltio MDM · Informatica · Apache Spark

Key outcomes:

  • Architected big data pipelines on AWS EMR for real-time data ingestion.

  • Developed ETL jobs integrating customer data with MDM systems.

  • Tuned Spark jobs for cost efficiency, achieving roughly 20% savings (see the configuration sketch below).
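
To illustrate the kind of tuning behind that figure, a hedged sketch of cost-oriented Spark session settings; the specific values are assumptions and would be sized to the actual EMR cluster and data volume.

```python
from pyspark.sql import SparkSession

# Illustrative settings only; real tuning depends on cluster shape and input size.
spark = (SparkSession.builder
    .appName("emr_cost_tuning_sketch")
    # Let executor count scale with load instead of holding idle capacity.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Right-size shuffle parallelism for the typical input volume (value assumed).
    .config("spark.sql.shuffle.partitions", "400")
    # Adaptive query execution coalesces small shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate())
```

The same settings can also be supplied through spark-submit arguments or the cluster's configuration rather than in application code.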

Project 4: Technical Architect – Cloud & Automation

  • Designed data workflows with PySpark + Hive, improving reporting accuracy.
  • Configured ELK-based monitoring for Spark job failures and data pipeline health.
  • Collaborated with data architects on data modeling & cleansing strategies.
PySpark · Hive · Elasticsearch · Apache Spark

Key outcomes:

  • Designed data workflows with PySpark and Hive to improve reporting accuracy.

  • Configured ELK-based monitoring for Spark job failures (see the sketch below).

  • Collaborated on data modeling and cleansing strategies.
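
A minimal sketch of how a pipeline run's health might be indexed into Elasticsearch for the ELK dashboards; the cluster URL, index name, and document fields are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

from elasticsearch import Elasticsearch  # official Python client

# Hypothetical cluster URL and index name.
es = Elasticsearch("http://localhost:9200")

def report_run(job_name: str, status: str, duration_s: float, error: Optional[str] = None) -> None:
    """Index one pipeline-run document so Kibana dashboards and alerts can pick it up."""
    es.index(
        index="spark-job-health",
        document={
            "job_name": job_name,
            "status": status,              # e.g. "SUCCEEDED" or "FAILED"
            "duration_seconds": duration_s,
            "error": error,
            "@timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )

# Example usage after a Spark job finishes (values are illustrative).
report_run("customer_cleansing_daily", "SUCCEEDED", 812.4)
```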

Industry experience

HealthTech

2 projects
  • System Architect – AWS & Big Data: PySpark · AWS EMR · Reltio MDM · Informatica +1
  • Technical Architect – Cloud & Automation: PySpark · Hive · Elasticsearch · Apache Spark

Ready to work with Rahul?

Schedule an interview and onboard within 48 hours. No long hiring cycles.

At a Glance

Experience: 7+ years
Work mode: Remote
Starting from: ₹1.6 L/mo
Direct hire: Possible
Start within: 48 hours

Single contract. No agency markup confusion.

Typically responds within 4 business hours.

5-day replacement guarantee
48-hour onboarding, single invoice
Direct chat — no recruiter middleman
Seniority signals
Owns production deploys · Greenfield architect · System owner · Code reviewer · Mentor / leads juniors
Verified: Vetted by Witarist
Technical skills assessed & verified
Background & identity checked
English communication verified
Ready to onboard in 48 hours

Not sure if this is the right fit?

Tell us your requirements and we'll match you with the best candidates.
