pronnoy@data-eng:~ zsh • main

// Lead Data Engineer | System Architect | Cloud Platform Builder

Pronnoy Dutta

❯~/pipeline

5.5+ years designing, building, and operating large-scale distributed data pipelines and cloud-native platforms. Specialized in Spark optimization, data modeling, and end-to-end platform ownership on AWS.

src/about.py

about.py • src/core/

I'm a Lead Data Engineer at Axtria, leading a team that owns the full lifecycle of batch data pipelines powering business-critical analytics. My work involves architecting cloud-native data platforms on AWS, diagnosing Spark performance bottlenecks, and translating business requirements into scalable solutions.

Previously at Infosys, I built cloud-based analytics pipelines and star-schema data warehouses for enterprise clients.

I'm an AWS Community Builder (Security team), hold multiple AWS certifications, and am passionate about distributed systems, data modeling, and building pipelines that don't page you at 3 AM.

Pipeline Architect

End-to-end design & delivery at enterprise scale

Perf Engineer

Spark tuning, query optimization, resource mgmt

Cloud Native

AWS certified architect, production platforms

Team Lead

Engineering leadership, delivery & stability

logs/git_log.sh

main • Apr 2025 to Present • Axtria, Gurugram

axtria_lead.py

Project Lead — Data Engineering

Associate → Sr. Associate → Project Lead

Own end-to-end architecture and delivery of batch data pipelines on AWS (S3, Glue, EMR, Redshift) supporting analytics for a 50M-patient commercial pharma platform, maintaining 99.9% SLA compliance across all production runs.
Lead a team of 4 data engineers — managing sprint planning, technical design reviews, code quality standards, and career development for junior members.
Serve as primary escalation owner for production data incidents; resolved 20+ P1/P2 events and implemented proactive monitoring guardrails that cut repeat incidents by 60%.
Translated evolving business requirements from pharma analytics stakeholders into scalable data solutions, consistently delivering within committed timelines.
Drove adoption of CI/CD practices and Git-based branching strategy across the team, reducing deployment errors and improving release confidence.

99.9% SLA Compliance

4 Engineers Led

20+ P1/P2 Resolved

60% Fewer Incidents

main • May 2024 to Apr 2025 • Axtria, Gurugram

axtria_senior.py

Senior Data Engineer

Migrated 15+ legacy Hive and SQL pipelines to PySpark-based distributed processing on AWS EMR, achieving a 30% reduction in end-to-end pipeline runtime — cutting daily processing wall-time from ~9 hrs to ~6 hrs.
Diagnosed and resolved Spark performance bottlenecks (data skew, suboptimal partitioning, oversized shuffles) on pipelines ingesting 200M records/day, cutting EMR cluster spend by an estimated ~₹33L/yr.
Designed SCD Type-2 dimensional data models supporting 3+ years of full historical auditability for patient-level pharma data, enabling downstream teams to retire ad-hoc reconciliation scripts entirely.
Managed ingestion of JSON, CSV, Parquet, and ORC formats across multiple upstream source systems, implementing schema-on-read patterns with AWS Glue Data Catalog.
Mentored 2 junior engineers on PySpark optimization patterns and AWS data services — both promoted within the project cycle.
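
One common way to relieve the kind of data skew mentioned above is key salting: a rotating suffix is appended to a hot key so its rows spread over several partitions instead of one. The sketch below shows the idea in plain Python rather than Spark. The key names, the CRC32-based partitioner, and the partition and bucket counts are illustrative assumptions, not the production configuration or necessarily the technique used on these pipelines.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (a stand-in for a shuffle partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, row_index: int, salt_buckets: int) -> str:
    """Append a rotating salt so one logical key maps to several physical keys."""
    return f"{key}#{row_index % salt_buckets}"


def partition_loads(keys, num_partitions: int) -> list:
    """Count how many rows land in each partition."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return [loads.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed input: one hot key owns 90% of the rows.
keys = ["hot_account"] * 900 + [f"account_{i}" for i in range(100)]

before = partition_loads(keys, num_partitions=8)
after = partition_loads(
    [salted_key(k, i, salt_buckets=16) for i, k in enumerate(keys)],
    num_partitions=8,
)

print("max partition load before salting:", max(before))
print("max partition load after salting: ", max(after))
```

The trade-off is that any aggregation on a salted key needs a second pass that strips the salt and merges the partial results per original key.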

30% Perf Gain

200M Records/Day

9→6 hr Runtime Cut

15+ Pipelines Migrated

main • Mar 2022 to Apr 2024 • Axtria, Gurugram

axtria_associate.py

Associate Data Engineer

Built a Python-based Tableau refresh orchestration framework automating end-of-cycle updates across 30+ dashboards, eliminating 7 hours of manual effort per cycle (~350 hrs/yr saved).
Developed and maintained production ETL pipelines using PySpark, AWS Glue, and Control-M with zero-defect delivery across 12 consecutive releases.
Implemented data quality validation checks (null checks, referential integrity, row count reconciliation) at each pipeline stage, reducing data defects reaching downstream consumers by 80%.
Built star-schema fact and dimension tables in Amazon Redshift supporting 5 analytics use cases including territory performance, brand uptake, and rep activity tracking.
Collaborated with business analysts and data scientists to define data contracts, schema agreements, and SLA expectations for 8+ data products.
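
The three validation checks named above (null checks, referential integrity, row count reconciliation) can be sketched in a few lines of plain Python. This is a minimal illustration on in-memory rows, not the pipeline's actual implementation. The function names, column names, and sample records are all hypothetical.

```python
def null_check(rows, required_columns):
    """Return indices of rows where any required column is missing or None."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(col) is None for col in required_columns)
    ]


def referential_integrity_check(fact_rows, dim_rows, key):
    """Return fact-side key values that have no matching dimension row."""
    known = {row[key] for row in dim_rows}
    return sorted({row[key] for row in fact_rows if row[key] not in known})


def row_count_reconciles(source_count, target_count, rejected_count=0):
    """True when every source row is either loaded or explicitly rejected."""
    return source_count == target_count + rejected_count


# Hypothetical sample data for one pipeline stage.
dim_territory = [{"territory_id": "T1"}, {"territory_id": "T2"}]
fact_calls = [
    {"call_id": 1, "territory_id": "T1", "rep_id": "R10"},
    {"call_id": 2, "territory_id": "T9", "rep_id": "R11"},  # orphan territory
    {"call_id": 3, "territory_id": "T2", "rep_id": None},   # missing rep
]

null_failures = null_check(fact_calls, ["call_id", "rep_id"])
orphans = referential_integrity_check(fact_calls, dim_territory, "territory_id")
counts_ok = row_count_reconciles(source_count=3, target_count=2, rejected_count=1)

print(null_failures, orphans, counts_ok)
```

Running checks like these at each stage, rather than only at the end, makes it easier to tell which stage introduced a defect.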

350 hr Saved/Year

30+ Dashboards

80% Fewer Defects

8+ Data Products

feature/infosys • Nov 2020 to Mar 2022 • Infosys, Remote

sales_pipeline.py

Systems Engineer

Designed and implemented cloud-based monthly sales analytics pipelines using AWS S3, Python, and PySpark for a retail client operating 500+ stores across 3 regions — automating report generation that previously required 3 days of manual effort per cycle.
Architected a star-schema data warehouse with 8 fact and dimension tables tracking revenue, billing frequency, basket size, and customer engagement KPIs, which became the single source of truth for executive dashboards.
Developed KPI computation logic for 12+ business metrics (revenue growth, churn rate, store-level performance), reducing analyst ad-hoc query turnaround from days to under 2 hours.
Optimised complex SQL queries on 100M+ row transaction tables, reducing report generation time by 40% through indexing and query restructuring.
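
To make the star-schema and KPI work above concrete, here is a small sketch using Python's built-in `sqlite3` module: one fact table joined to one dimension table, followed by a month-over-month revenue growth calculation. The table names, columns, and figures are invented for illustration, and SQLite stands in for the actual warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (
        store_key INTEGER REFERENCES dim_store(store_key),
        sales_month TEXT,
        revenue REAL
    );
    INSERT INTO dim_store VALUES (1, 'North'), (2, 'North'), (3, 'South');
    INSERT INTO fact_sales VALUES
        (1, '2021-01', 100.0), (2, '2021-01', 100.0), (3, '2021-01', 50.0),
        (1, '2021-02', 120.0), (2, '2021-02', 130.0), (3, '2021-02', 40.0);
""")

# Facts carry the measures; the dimension supplies the grouping attribute.
rows = conn.execute("""
    SELECT d.region, f.sales_month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_key = f.store_key
    GROUP BY d.region, f.sales_month
    ORDER BY d.region, f.sales_month
""").fetchall()


def revenue_growth(previous, current):
    """Fractional change from the previous period to the current one."""
    return (current - previous) / previous


monthly = {(region, month): total for region, month, total in rows}
growth = {
    region: revenue_growth(monthly[(region, "2021-01")], monthly[(region, "2021-02")])
    for region in ("North", "South")
}
print(growth)
```

Keeping the KPI formula in one named function, separate from the aggregation query, means the same definition can be reused across reports.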

40% Faster Reports

100M+ Rows Optimised

12+ KPIs Built

3 days Manual Work Cut

AWS S3 • PySpark • Python • Star Schema • Data Warehousing

config/skills.yml

languages.yml

Languages

Python: EXPERT

SQL: EXPERT

Java: INTERMEDIATE

C#: INTERMEDIATE

big_data.yml

Distributed & Big Data

Spark / PySpark: EXPERT

Hadoop / YARN: ADVANCED

Hive: ADVANCED

Kubernetes: INTERMEDIATE

cloud_aws.yml

Cloud Platforms (AWS)

S3 / Data Lakes: EXPERT

Glue / EMR: ADVANCED

Redshift: ADVANCED

RDS: ADVANCED

devops.yml

Architecture & DevOps

Data Modeling / Star Schema: EXPERT

Airflow: ADVANCED

Docker / CI/CD: ADVANCED

Git / Linux: EXPERT

projects/ • SELECT * FROM builds

pharma_platform.py

Production

Pharma Commercial Data Platform

Cloud-native batch data platform on AWS processing 200M records/day of pharma commercial data — prescriptions, sales force activity, and HCP engagement. Migrated 15+ legacy Hive pipelines to PySpark, achieving 30% performance improvement and reducing daily processing window from 9 hrs to 6 hrs.

200M+ Records/Day

30% Faster

99.9% SLA

350 hr Saved/Yr

Python • PySpark • AWS EMR • AWS Glue • Redshift • Airflow • Control-M

retail_warehouse.sql

Shipped

Retail Sales Analytics Data Warehouse

End-to-end sales analytics pipeline and data warehouse for a retail client with 500+ stores across 3 regions. Automated monthly reporting that previously took 3 days of manual effort. Built an 8-table star-schema warehouse powering executive-level revenue and engagement dashboards, with KPI logic covering 12+ business metrics.

40% Faster Reports

12+ KPIs

100M+ Rows

Python • PySpark • AWS S3 • SQL • Star Schema • Amazon Redshift

tableau_orchestrator.py

Shipped

Tableau Refresh Automation

Python-based orchestration framework automating end-of-cycle Tableau dashboard updates via Airflow across 30+ dashboards — eliminated 7 hours of manual effort per cycle (~350 hrs/yr saved).

350 hr Saved/Yr

30+ Dashboards

Python • Airflow • Tableau • AWS

scd2_model.sql

Production

SCD Type-2 Dimensional Data Model

Dimensional data model with SCD Type-2 change tracking supporting 3+ years of full historical auditability for patient-level pharma data. Enabled downstream teams to retire ad-hoc reconciliation scripts entirely.
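
For readers unfamiliar with SCD Type-2, the core idea is that a change never overwrites a row: the current version is closed with an end date and a new version is opened, so any past state can be looked up. The sketch below shows that mechanic in plain Python. The column names (`valid_from`, `valid_to`, `is_current`), the `9999-12-31` open-ended sentinel, and the sample key and attributes are assumed conventions for illustration, not the model's actual schema.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel for "still current"


def apply_scd2(history, key, attributes, effective):
    """Return a new history list with an SCD Type-2 change applied.

    If the current version of `key` already carries `attributes`, nothing
    changes. Otherwise the current version is closed at `effective` and a
    new current version is appended.
    """
    updated = []
    current_matches = False
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["attributes"] == attributes:
                current_matches = True
                updated.append(row)
            else:
                updated.append({**row, "valid_to": effective, "is_current": False})
        else:
            updated.append(row)
    if not current_matches:
        updated.append({
            "key": key, "attributes": attributes,
            "valid_from": effective, "valid_to": OPEN_END, "is_current": True,
        })
    return updated


def as_of(history, key, when):
    """Look up the attributes of `key` that were valid on the date `when`."""
    for row in history:
        if row["key"] == key and row["valid_from"] <= when < row["valid_to"]:
            return row["attributes"]
    return None


history = []
history = apply_scd2(history, "P001", {"segment": "A"}, date(2022, 1, 1))
history = apply_scd2(history, "P001", {"segment": "B"}, date(2023, 6, 1))
history = apply_scd2(history, "P001", {"segment": "B"}, date(2024, 1, 1))  # unchanged, no new row

print(len(history), as_of(history, "P001", date(2022, 12, 31)))
```

The half-open interval (`valid_from` inclusive, `valid_to` exclusive) keeps adjacent versions from overlapping, so exactly one version answers any point-in-time lookup.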

3+ yr History

SCD-2 Audit Trail

Data Modeling • SQL • Redshift • Star Schema • PySpark

data/achievements.json

Education

B.Tech, Computer Science

Bharati Vidyapeeth College of Engineering, Pune

July 2016 — June 2020

Senior Secondary School

Delhi Public School, Gurgaon

2016

AWS Community Builders

Official Member — Security Team

Achievements Unlocked

AWS Security Specialty

Amazon Web Services

Unlocked

AWS Solutions Architect Associate

Amazon Web Services

Unlocked

CCNA

Cisco Certified Network Associate

Unlocked

docs/contact.md

README.md

Let's build something together.

Open for data engineering roles, architecture consulting, or challenging data problems at scale.

pronnoy1998@gmail.com

Gurugram, Haryana, India
