Pronnoy Dutta
5.5+ years designing, building, and operating large-scale distributed data pipelines and cloud-native platforms. Specialized in Spark optimization, data modeling, and end-to-end platform ownership on AWS.
I'm a Lead Data Engineer at Axtria, leading a team that owns the full lifecycle of batch data pipelines powering business-critical analytics. My work involves architecting cloud-native data platforms on AWS, diagnosing Spark performance bottlenecks, and translating business requirements into scalable solutions.
Previously at Infosys, I built cloud-based analytics pipelines and star-schema data warehouses for enterprise clients.
I'm an AWS Community Builder (Security team), hold multiple AWS certifications, and am passionate about distributed systems, data modeling, and building pipelines that don't page you at 3 AM.
Pipeline Architect
End-to-end design & delivery at enterprise scale
Perf Engineer
Spark tuning, query optimization, resource mgmt
Cloud Native
AWS-certified architect, production platforms
Team Lead
Engineering leadership, delivery & stability
Project Lead — Data Engineering
- Own end-to-end architecture and delivery of batch data pipelines on AWS (S3, Glue, EMR, Redshift) supporting analytics for a 50M-patient commercial pharma platform, maintaining 99.9% SLA compliance across all production runs.
- Lead a team of 4 data engineers — managing sprint planning, technical design reviews, code quality standards, and career development for junior members.
- Serve as primary escalation owner for production data incidents; resolved 20+ P1/P2 events and implemented proactive monitoring guardrails that cut repeat incidents by 60%.
- Translated evolving business requirements from pharma analytics stakeholders into scalable data solutions, consistently delivering within committed timelines.
- Drove adoption of CI/CD practices and Git-based branching strategy across the team, reducing deployment errors and improving release confidence.
Senior Data Engineer
- Migrated 15+ legacy Hive and SQL pipelines to PySpark-based distributed processing on AWS EMR, achieving a 30% reduction in end-to-end pipeline runtime — cutting daily processing wall-time from ~9 hrs to ~6 hrs.
- Diagnosed and resolved Spark performance bottlenecks (data skew, suboptimal partitioning, oversized shuffles) on pipelines ingesting 200M records/day, reducing EMR cluster spend by an estimated ₹33L/yr.
- Designed SCD Type-2 dimensional data models supporting 3+ years of full historical auditability for patient-level pharma data, enabling downstream teams to retire ad-hoc reconciliation scripts entirely.
- Managed ingestion of JSON, CSV, Parquet, and ORC formats across multiple upstream source systems, implementing schema-on-read patterns with AWS Glue Data Catalog.
- Mentored 2 junior engineers on PySpark optimization patterns and AWS data services — both promoted within the project cycle.
Associate Data Engineer
- Built a Python-based Tableau refresh orchestration framework automating end-of-cycle updates across 30+ dashboards, eliminating 7 hours of manual effort per cycle (~350 hrs/yr saved).
- Developed and maintained production ETL pipelines using PySpark, AWS Glue, and Control-M with zero-defect delivery across 12 consecutive releases.
- Implemented data quality validation checks (null checks, referential integrity, row count reconciliation) at each pipeline stage, reducing data defects reaching downstream consumers by 80%.
- Built star-schema fact and dimension tables in Amazon Redshift supporting 5 analytics use cases including territory performance, brand uptake, and rep activity tracking.
- Collaborated with business analysts and data scientists to define data contracts, schema agreements, and SLA expectations for 8+ data products.
Systems Engineer
- Designed and implemented cloud-based monthly sales analytics pipelines using AWS S3, Python, and PySpark for a retail client operating 500+ stores across 3 regions — automating report generation that previously required 3 days of manual effort per cycle.
- Architected a star-schema data warehouse with 8 fact and dimension tables tracking revenue, billing frequency, basket size, and customer engagement KPIs; it became the single source of truth for executive dashboards.
- Developed KPI computation logic for 12+ business metrics (revenue growth, churn rate, store-level performance), reducing analyst ad-hoc query turnaround from days to under 2 hours.
- Optimized complex SQL queries on 100M+ row transaction tables, reducing report generation time by 40% through indexing and query restructuring.
Pharma Commercial Data Platform
Cloud-native batch data platform on AWS processing 200M records/day of pharma commercial data: prescriptions, sales force activity, and HCP engagement. Migrated 15+ legacy Hive pipelines to PySpark, achieving a 30% reduction in pipeline runtime and shrinking the daily processing window from ~9 hrs to ~6 hrs.
Retail Sales Analytics Data Warehouse
End-to-end sales analytics pipeline and data warehouse for a retail client with 500+ stores across 3 regions. Automated monthly reporting that previously took 3 days of manual effort. Built an 8-table star-schema warehouse powering executive-level revenue and engagement dashboards, with KPI logic covering 12+ business metrics.
Tableau Refresh Automation
Python-based orchestration framework automating end-of-cycle Tableau dashboard updates via Airflow across 30+ dashboards — eliminated 7 hours of manual effort per cycle (~350 hrs/yr saved).
SCD Type-2 Dimensional Data Model
Dimensional data model with SCD Type-2 change tracking supporting 3+ years of full historical auditability for patient-level pharma data. Enabled downstream teams to retire ad-hoc reconciliation scripts entirely.
Education
B.Tech, Computer Science
Senior Secondary School
AWS Community Builders
Achievements Unlocked
AWS Security Specialty
Amazon Web Services
AWS Solutions Architect Associate
Amazon Web Services
CCNA
Cisco Certified Network Associate
Let's build something together.
Open for data engineering roles, architecture consulting, or challenging data problems at scale.