Saidulu is a Data Engineer with 6+ years of experience specializing in Azure data engineering. He has a proven track record of building scalable data infrastructure and optimizing performance for large datasets.
Led the migration of on-premises workloads to Azure Cloud, leveraging Azure Data Factory and Azure Databricks.
Implemented performance tuning techniques in Spark and Azure Data Factory, optimizing data processing and pipeline efficiency.
Designed and built scalable infrastructure on Azure for collecting, processing, and analyzing large datasets.
Developed dynamic data pipelines using parameterization and control tables, enhancing flexibility and reusability (sketched after this list).
Collaborated with business users to gather requirements and report project progress, ensuring alignment with business needs.
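The control-table-driven parameterization above can be illustrated with a minimal PySpark sketch. Every name in it is an assumption for illustration: the control table `etl.control_table` and its `source_path`, `target_table`, and `is_active` columns are hypothetical, not taken from the actual projects.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("control-table-pipeline").getOrCreate()

# Hypothetical control table: one row per dataset to move, holding the
# source location, the target table name, and an active flag.
control = spark.table("etl.control_table").where("is_active = true")

for row in control.collect():
    # Each control row parameterizes one copy: read the raw source,
    # then land it in the curated target table.
    df = spark.read.parquet(row["source_path"])
    df.write.mode("overwrite").saveAsTable(row["target_table"])
```

On the ADF side, the same idea is typically expressed with a Lookup activity feeding a ForEach loop over the control rows.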
Overview: This project focused on data extraction, integration, and migration from various sources to Azure Data Lake.
Responsibilities:
Extracted and integrated data from different sources into Azure Data Lake using Azure Data Factory and Azure Databricks ETL pipelines.
Converted data into appropriate formats to optimize reads and memory usage and to compute key metrics.
Implemented Spark using Python in Databricks, leveraging the DataFrame and Spark SQL APIs for faster data processing (sketched below).
Tuned ADF Copy activities and Mapping Data Flow transformations for efficiency.
Migrated data from on-premises databases and legacy applications to Azure using ADF and ADLS.
Converted SQL, T-SQL, and SSIS flows into Spark SQL and DataFrames in Azure Databricks, extracting from MySQL and loading into ADLS to create unmanaged tables.
Validated results and created test documents for migrated tables; pushed notebooks to Azure Repos and managed pipeline changes.
Fetched and processed data from Data Lake Gen2 and SQL databases in Azure Databricks.
Parameterized datasets for dynamic object discovery and movement into curated zones.
Fetched files from raw Data Lake containers, applied transformations, loaded curated data into database objects, and created snapshot and incremental datasets (an incremental-load sketch follows the key outcomes).
Validated, debugged, and published pipelines, and created daily triggers for scheduling.
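A minimal sketch of the Databricks pattern described above, combining the DataFrame and Spark SQL APIs and landing curated output as an unmanaged (external) table whose files remain in ADLS. The storage account, container paths, columns, and table names are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-etl").getOrCreate()

# Read raw files from an ADLS Gen2 container (path is illustrative).
raw = (spark.read.format("csv").option("header", "true")
       .load("abfss://raw@examplestore.dfs.core.windows.net/sales/"))

# DataFrame API: normalize types before aggregation.
typed = (raw.withColumn("amount", F.col("amount").cast("double"))
            .withColumn("order_date", F.to_date("order_date")))

# Spark SQL API over the same data.
typed.createOrReplaceTempView("sales")
daily = spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM sales GROUP BY order_date"
)

# Unmanaged table: the metastore tracks metadata, ADLS keeps the files.
(daily.write.mode("overwrite")
      .option("path", "abfss://curated@examplestore.dfs.core.windows.net/daily_sales/")
      .saveAsTable("curated.daily_sales"))
```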
Key outcomes:
Tuned Azure Data Factory Copy activities and Mapping Data Flow transformations for efficiency.
Validated migrated data results and created comprehensive test documents.
Managed notebook version control and deployment using Azure Repos, including branching and merging strategies.
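The snapshot/incremental loading mentioned in the responsibilities could look roughly like the watermark pattern below. The `raw.orders` and `curated.orders` tables and the `modified_at` watermark column are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Highest watermark already present in the curated zone.
last_loaded = (spark.table("curated.orders")
               .agg(F.max("modified_at").alias("wm"))
               .collect()[0]["wm"])

# Pull only rows newer than the watermark from the raw zone.
delta = spark.table("raw.orders")
if last_loaded is not None:
    delta = delta.where(F.col("modified_at") > F.lit(last_loaded))

# Append the delta; a full snapshot would use mode("overwrite") instead.
delta.write.mode("append").saveAsTable("curated.orders")
```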
Overview: This project focused on demand forecasting, processing large datasets from HDFS to identify user behavior and product expectations.
Responsibilities:
Participated in requirements gathering, project inception, and story sizing.
Developed technical specifications based on client requirements.
Analyzed data with Spark SQL queries and scripts to understand user behavior and identify the facilities users wanted, based on product history (sketched below).
Contributed to a demand-forecasting processor that reads data from HDFS and stores results in HBase.
Created partitioned Hive tables to store processed results in tabular form.
Built DataFrames from case classes for the required input data.
Created RDDs and DataFrames from input data and performed transformations with Spark Core.
Wrote SQL queries to process data using Spark SQL.
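A sketch of the Spark SQL analysis and partitioned Hive storage described above. The HDFS path, event schema, and table names are hypothetical; the HBase write is omitted because it needs a separate connector, so a partitioned Hive table stands in for the stored results.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("demand-forecast-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Raw clickstream-style events from HDFS (path and columns illustrative).
events = spark.read.parquet("hdfs:///data/events/")
events.createOrReplaceTempView("events")

# Spark SQL over user behavior: product view counts per day.
views = spark.sql("""
    SELECT event_date, product_id, COUNT(*) AS views
    FROM events
    WHERE action = 'view'
    GROUP BY event_date, product_id
""")

# Partitioned Hive table: date partitions keep daily scans cheap.
(views.write.mode("overwrite")
      .partitionBy("event_date")
      .saveAsTable("analytics.product_views"))
```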
Key outcomes:
Analyzed user behavior and product expectations using Spark SQL queries and scripts.
Processed demand-forecasting data from HDFS.
Developed DataFrames using case classes for efficient data processing (a PySpark analogue is sketched below).
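Case classes are a Scala construct for typed Spark input; the closest PySpark analogue of that pattern is an explicit StructType schema, sketched below with hypothetical field names.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DateType)

spark = SparkSession.builder.appName("typed-input").getOrCreate()

# An explicit schema plays the role a Scala case class plays for typed
# input: it pins column names and types instead of relying on inference.
schema = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("product_id", StringType(), nullable=False),
    StructField("quantity", IntegerType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
])

orders = spark.read.schema(schema).csv("hdfs:///data/orders/")
orders.printSchema()
```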
Overview: This project involved creating dynamic data pipelines with Azure Data Factory and Databricks, covering data ingestion, transformation, and export.
Responsibilities:
Created dynamic pipelines using parameterization and control tables.
Created Hive tables, loaded data, and wrote Hive queries.
Imported data from Oracle to Hive using Sqoop.
Replaced Hive's default Derby metastore with MySQL.
Loaded and transformed large sets of structured and semi-structured (log) data.
Implemented business logic using Databricks with PySpark and ADF.
Imported required tables from RDBMS sources to Azure using ADF, mapping data to the target data model with Mapping Data Flows.
Used Hive as an abstraction over structured data in HDFS, implementing partitions, dynamic partitions, and buckets on Hive tables (sketched below).
Exported analyzed data to relational databases using Sqoop for visualization and Power BI reporting.
Applied Spark SQL performance-tuning techniques, including execution plan analysis, caching and broadcasting, the Tungsten execution engine, and the Catalyst optimizer.
Loaded files to HDFS and wrote Hive queries; reused Hive queries in Spark SQL for analysis.
Used Parquet, Avro, and ORC file formats for efficient compression.
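The partition-and-bucket layout described above can be expressed directly from PySpark; a minimal sketch with hypothetical table and column names. Partitions prune scans by date, while buckets pre-organize rows by customer for joins and aggregations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-layout")
         .enableHiveSupport()
         .getOrCreate())

# Illustrative staging data already registered as a table.
staging = spark.table("sales_staging")

# Partition by date (scan pruning) and bucket by customer (pre-shuffled
# layout for customer-keyed joins); sortBy orders rows within buckets.
(staging.write.mode("overwrite")
        .partitionBy("order_date")
        .bucketBy(16, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("sales_curated"))
```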
Key outcomes:
Created dynamic pipelines with parameterization and control tables, improving pipeline flexibility.
Implemented business logic efficiently using Databricks with PySpark and Azure Data Factory.
Optimized Hive tables with Partitions, Dynamic Partitions, and Buckets for efficient data abstraction and querying.
Improved Spark SQL performance through caching, broadcasting, and execution plan analysis (see the sketch below).
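The tuning techniques named above (caching, broadcasting, and inspection of the plans Catalyst and Tungsten produce) fit in one minimal PySpark sketch; the table names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-sql-tuning").getOrCreate()

facts = spark.table("sales_curated")   # large fact table (illustrative)
dims = spark.table("product_dim")      # small dimension table

# Caching: keep a reused intermediate result in memory across actions.
recent = facts.where("order_date >= '2024-01-01'").cache()

# Broadcasting: ship the small dimension table to every executor so the
# large fact table is never shuffled for this join.
joined = recent.join(broadcast(dims), "product_id")

# Execution plan analysis: inspect the Catalyst/Tungsten physical plan.
joined.explain(mode="formatted")
```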