Soumyadeep is a Data Engineer with 8+ years of experience building and modernizing data solutions across multiple industries. He has deep expertise in cloud platforms and big data processing, along with technical leadership experience delivering data pipelines and analytics.
Ownership of the full data pipeline lifecycle, from ingestion to visualization, across multiple projects.
Extensive experience with multi-cloud deployments (AWS, Azure, GCP) and hybrid solutions.
Proven ability to migrate and modernize ETL processes using cloud-native services.
Strong background in big data processing and real-time streaming architectures.
Demonstrated technical leadership in managing big data initiatives.
Robust, scalable pipelines with Databricks + Data Factory + Data Fusion
Streaming ingestion with Kafka + Kinesis
CDC into Synapse + Snowflake (via Snowpipe)
Overview: This project focused on modern data engineering practices, using cloud-native services to build robust, scalable data solutions.
Responsibilities:
Installed and configured software, services, VMs, and databases using Terraform with Azure Resource Manager across AWS, Google Cloud, and Azure.
Performed data streaming, filtering, extraction, and analysis using Hive, PySpark, Kafka, AWS EMR, Lambda, Dataproc, and Databricks.
Developed a scalable data pipeline with Databricks and Data Factory to extract Parquet files from ADLS Gen2, transform them, and load them into Synapse for advanced analytics (see the sketch after the key outcomes below).
Key outcomes:
Successfully configured multi-cloud infrastructure using Terraform.
Implemented data streaming and analysis pipelines with a variety of big data tools.
Created a scalable data pipeline for advanced analytics.
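A minimal PySpark sketch of the ADLS Gen2 → Databricks → Synapse pattern described above. Storage account, container, workspace, and table names are placeholders, and the Databricks Synapse connector (com.databricks.spark.sqldw) is assumed to be available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the session is predefined; getOrCreate() keeps this self-contained.
spark = SparkSession.builder.appName("adls-to-synapse").getOrCreate()

# Read raw Parquet from ADLS Gen2 (path is a placeholder).
raw = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/sales/")

# Illustrative transformations: type the date column and drop invalid rows.
clean = (
    raw.withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("amount") > 0)
)

# Load into Synapse via the Databricks Synapse connector, staging through ADLS.
(clean.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://example-ws.sql.azuresynapse.net:1433;database=dw")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.sales_clean")
    .option("tempDir", "abfss://staging@examplestorage.dfs.core.windows.net/tmp")
    .mode("append")
    .save())
```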
Overview: This project focused on data ingestion, transformation, and analysis in a data warehousing context, including streaming and NoSQL data.
Responsibilities:
Performed data ingestion, transformation, and analysis on the Redshift, BigQuery, Synapse, and Snowflake data warehouse platforms.
Constructed scalable data pipelines with Data Fusion and Dataproc to extract CSV/JSON from GCS, transform it, and load it into BigQuery.
Built a streaming ingestion solution using Kafka to capture CDC events from SQL databases and write the records to Synapse for analysis (see the sketch after the key outcomes below).
Worked with NoSQL databases (HBase, Cassandra, MongoDB) and distributed SQL query engines (Drill, Presto, Phoenix, Athena) for data analysis.
Developed an ETL pipeline using AWS Glue to extract CSV/JSON from S3, transform it, and load it into Redshift.
Key outcomes:
Implemented streaming CDC solutions using Kafka.
Managed diverse data sources including NoSQL and distributed SQL engines.
Developed comprehensive ETL pipelines for cloud data warehouses.
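A hedged sketch of the Kafka CDC flow above, using Spark Structured Streaming. Broker, topic, schema, and table names are placeholders; a Debezium-style payload and the spark-sql-kafka and SQL Server JDBC packages on the classpath are assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-cdc-to-synapse").getOrCreate()

# Illustrative CDC payload; the real schema depends on the connector in use.
schema = StructType([
    StructField("op", StringType()),      # c = create, u = update, d = delete
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

# Subscribe to the CDC topic (broker and topic names are placeholders).
cdc = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sqlserver.dbo.orders")
    .load())

parsed = (cdc
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

def write_batch(df, batch_id):
    # Append each micro-batch to Synapse over JDBC (URL/table are placeholders).
    (df.write.format("jdbc")
        .option("url", "jdbc:sqlserver://example-ws.sql.azuresynapse.net:1433;database=dw")
        .option("dbtable", "dbo.orders_cdc")
        .mode("append")
        .save())

(parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/checkpoints/orders_cdc")
    .start()
    .awaitTermination())
```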
Overview: This project involved big data transformation, pipeline setup, and visualization using cloud and open-source tools.
Responsibilities:
Wrote PySpark and Scala programs for big data transformation activities.
Set up scalable data pipelines and data ingestion using Databricks, Data Factory, Data Fusion, and Glue to transform unstructured and semi-structured data.
Developed dashboards and visualizations in Power BI and Tableau.
Built a streaming ingestion pipeline using Kinesis Data Firehose to ingest CloudWatch Logs data and write it to Redshift (see the sketch after the key outcomes below).
Key outcomes:
Successfully transformed big data using PySpark and Scala.
Implemented scalable data pipelines for diverse data formats.
Delivered data insights through dashboards and visualizations.
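A boto3 sketch of the CloudWatch Logs → Kinesis Data Firehose → Redshift ingestion described above. The log group, stream name, ARNs, and IAM role are placeholders, and the delivery stream is assumed to already be configured with a Redshift destination:

```python
import boto3

logs = boto3.client("logs")
firehose = boto3.client("firehose")

# Subscribe a CloudWatch log group to the Firehose delivery stream.
logs.put_subscription_filter(
    logGroupName="/app/production",
    filterName="to-redshift",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/app-logs",
    roleArn="arn:aws:iam::123456789012:role/cwl-to-firehose",
)

# Firehose buffers incoming records and COPYs them into Redshift;
# a test record can be pushed directly to verify the delivery stream.
firehose.put_record(
    DeliveryStreamName="app-logs",
    Record={"Data": b'{"level": "INFO", "msg": "pipeline smoke test"}\n'},
)
```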
Overview: This project involved leading big data initiatives focused on streaming, extraction, analysis, and transformation, along with cloud integration.
Responsibilities:
Performed big data streaming, extraction, filtering, analysis, and transformation using HDFS, MapReduce, Pig, Hive, PySpark, Python, Scala, Sqoop, Flume, and HBase.
Used AWS, Azure, and Google Cloud as service providers for big data products.
Loaded data into Snowflake from S3 buckets using Snowpipe, creating external stages and Snowflake streams for change data capture (see the sketch after the key outcomes below).
Wrote Python, Scala, and Spark programs to analyze big data workloads (CSV, JSON, Parquet, relational and NoSQL databases) and store them in data warehouses for analytical purposes.
Key outcomes:
Led big data streaming and transformation efforts across multiple tools.
Implemented Change Data Capture (CDC) into Snowflake from S3.
Developed custom programs for big data workload analysis and warehousing.
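A sketch of the Snowpipe load path above using the Snowflake Python connector. Connection parameters, the storage integration, bucket, and table names are placeholders, and the S3 event notifications required for AUTO_INGEST are assumed to be configured separately:

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="example-account",
    user="etl_user",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# External stage over the S3 landing bucket; the storage integration
# (s3_int) is assumed to exist already.
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_stage
      URL = 's3://example-landing-bucket/events/'
      STORAGE_INTEGRATION = s3_int
      FILE_FORMAT = (TYPE = JSON)
""")

# Snowpipe auto-ingests new files as they land in the stage.
cur.execute("""
    CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
      COPY INTO RAW.EVENTS FROM @raw_stage
""")

# A stream tracks row-level changes on the landing table for downstream CDC.
cur.execute("CREATE STREAM IF NOT EXISTS events_stream ON TABLE RAW.EVENTS")
```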