Rahul Dastidar · Senior PySpark Data Engineer · 7+ yrs

Senior

7+ years experience · remote

Available within 48 hrs

Proof of scale

1B+ records processed

99% data integrity

35% reduction in execution time

20% cost savings

About Rahul

Rahul Dastidar is a Data Engineering Consultant with 7+ years of experience in cloud and distributed computing, specializing in PySpark development and Master Data Management. He has cut pipeline execution time by up to 35% and established data governance frameworks ensuring 99% data integrity.

7+ years of commercial experience in

HealthTech

Skills (16)

PySpark · AWS · Apache Spark · GitHub Actions · Reltio MDM · Azure · REST APIs · Databricks · AWS Lambda · AWS Step Functions · SQL · Azure Databricks · AWS EMR · Informatica · Hive · Elasticsearch

Why hire Rahul?

Production deploy authority · Mentored 5+ juniors

Designed and implemented scalable PySpark pipelines on Databricks and AWS EMR.

Achieved up to 35% reduction in execution time and 20% cost savings through optimization.

Automated end-to-end CI/CD for PySpark pipelines using GitHub Actions & Jenkins.

Established data governance frameworks ensuring 99% data integrity.

Built and optimized PySpark pipelines handling 1B+ records with 40% faster processing.

Integrated enterprise datasets into Reltio MDM, enabling a unified Customer 360 view.

Improved data survivorship and match/merge accuracy by 25% through advanced rule configuration.

Project highlights (4)

Project 1 – DevOps & Data Engineering Consultant

Focuses on PySpark development and data pipeline implementation on Databricks.
Integrates enterprise data with Reltio MDM and builds data quality rules.

Designed and implemented PySpark pipelines on Databricks to process and transform multi-terabyte datasets.
Integrated enterprise data sources with Reltio MDM using REST APIs for entity synchronization.
Built data quality rules (deduplication, enrichment, survivorship logic) before ingestion into Reltio.
Optimized Spark jobs with partitioning and caching, reducing execution time by 35%.
Automated workflows with AWS Lambda + Step Functions, orchestrating ingestion pipelines.

PySpark · Databricks · Reltio MDM · REST APIs · Apache Spark · AWS Lambda · AWS Step Functions · SQL

Key outcomes:

Designed and implemented PySpark pipelines for multi-terabyte datasets.
Optimized Spark jobs, reducing execution time by 35%.
Automated ingestion pipelines using AWS Lambda and Step Functions.

Project 2 – Senior Manager - Cloud & Data Engineering

Led migration of client's customer master data into Reltio MDM.
Developed PySpark ETL workflows for master record cleansing and enrichment.

Led migration of client's customer master data into Reltio MDM, configuring match & merge rules.
Built PySpark ETL workflows in Azure Databricks for cleansing, enrichment, and deduplication of master records.
Implemented Reltio entity modeling and integrated downstream applications through REST APIs.
Established data governance frameworks ensuring 99% data integrity.

Reltio MDM · PySpark · Azure Databricks · REST APIs · Apache Spark

Key outcomes:

Led migration of customer master data into Reltio MDM.
Built PySpark ETL workflows for master data cleansing.
Ensured 99% data integrity through governance frameworks.

Project 3 – System Architect – AWS & Big Data

Architected big data pipelines with PySpark on AWS EMR for real-time data ingestion.
Developed ETL jobs for integrating customer data with MDM systems (Reltio & Informatica).

Architected big data pipelines with PySpark on AWS EMR, enabling real-time ingestion of enterprise data.
Developed ETL jobs integrating customer data with MDM systems (Reltio & Informatica).
Automated synchronization between Reltio MDM and analytics platforms, ensuring high-quality master data.
Monitored Spark cluster performance and tuned jobs for cost efficiency (20% savings).

PySpark · AWS EMR · Reltio MDM · Informatica · Apache Spark

Key outcomes:

Architected big data pipelines on AWS EMR for real-time data ingestion.
Developed ETL jobs integrating customer data with MDM systems.
Tuned Spark jobs for cost efficiency, achieving 20% savings.

Project 4 – Technical Architect – Cloud & Automation

Designed data workflows using PySpark + Hive to improve reporting accuracy.
Configured ELK-based monitoring for Spark job failures and data pipeline health.

Designed data workflows with PySpark + Hive, improving reporting accuracy.
Configured ELK-based monitoring for Spark job failures and data pipeline health.
Collaborated with data architects on data modeling & cleansing strategies.

PySpark · Hive · Elasticsearch · Apache Spark

Key outcomes:

Designed data workflows with PySpark and Hive to improve reporting accuracy.
Configured ELK-based monitoring for Spark job failures.
Collaborated on data modeling and cleansing strategies.

Industry experience

HealthTech

2 projects

• Project: System Architect – AWS & Big Data (PySpark · AWS EMR · Reltio MDM · Informatica · Apache Spark)
• Project: Technical Architect – Cloud & Automation (PySpark · Hive · Elasticsearch · Apache Spark)

Ready to work with Rahul?

Schedule an interview and onboard within 48 hours. No long hiring cycles.

At a Glance

Experience: 7+ years

Work mode: remote

Starting from: ₹1.6 L/mo

Direct hire: Possible

Start within: 48 hours

Single contract. No agency markup confusion.

Typically responds within 4 business hours.

5-day replacement guarantee

48-hour onboarding, single invoice

Direct chat — no recruiter middleman

Seniority signals

Owns production deploys · Greenfield architect · System owner · Code reviewer · Mentor / leads juniors

Vetted by Witarist

Technical skills assessed & verified

Background & identity checked

English communication verified

Ready to onboard in 48 hours

