system_boot.sh
$ initializing data_engineer_os v2.6...
[OK] Loading core modules
[OK] Spark runtime connected
[OK] AWS services authenticated
[INFO] Pipeline orchestrator ready
[OK] Data models validated
[WARN] Visitor detected — requesting access
$ system ready. awaiting input_
pronnoy@data-eng:~ • zsh • main
// Lead Data Engineer | System Architect | Cloud Platform Builder

Pronnoy Dutta

~/pipeline

5.5+ years designing, building, and operating large-scale distributed data pipelines and cloud-native platforms. Specialized in Spark optimization, data modeling, and end-to-end platform ownership on AWS.

5.5+ Yrs Experience
200M Records/Day
30% Perf Gain
3 Certifications
src/about.py

I'm a Lead Data Engineer at Axtria, leading a team that owns the full lifecycle of batch data pipelines powering business-critical analytics. My work involves architecting cloud-native data platforms on AWS, diagnosing Spark performance bottlenecks, and translating business requirements into scalable solutions.

Previously at Infosys, I built cloud-based analytics pipelines and star-schema data warehouses for enterprise clients.

I'm an AWS Community Builder (Security team), hold multiple AWS certifications, and am passionate about distributed systems, data modeling, and building pipelines that don't page you at 3 AM.

Pipeline Architect

End-to-end design & delivery at enterprise scale

Perf Engineer

Spark tuning, query optimization, resource management

Cloud Native

AWS certified architect, production platforms

Team Lead

Engineering leadership, delivery & stability

logs/git_log.sh
main • Apr 2025 — Present • Axtria, Gurugram
axtria_lead.py

Project Lead — Data Engineering

Associate → Sr. Associate → Project Lead
  • Own end-to-end architecture and delivery of batch data pipelines on AWS (S3, Glue, EMR, Redshift) supporting analytics for a 50M-patient commercial pharma platform, maintaining 99.9% SLA compliance across all production runs.
  • Lead a team of 4 data engineers — managing sprint planning, technical design reviews, code quality standards, and career development for junior members.
  • Serve as primary escalation owner for production data incidents; resolved 20+ P1/P2 events and implemented proactive monitoring guardrails that cut repeat incidents by 60%.
  • Translated evolving business requirements from pharma analytics stakeholders into scalable data solutions, consistently delivering within committed timelines.
  • Drove adoption of CI/CD practices and Git-based branching strategy across the team, reducing deployment errors and improving release confidence.
99.9% SLA Compliance
4 Engineers Led
20+ P1/P2 Resolved
60% Fewer Incidents
main • May 2024 — Apr 2025 • Axtria, Gurugram
axtria_senior.py

Senior Data Engineer

  • Migrated 15+ legacy Hive and SQL pipelines to PySpark-based distributed processing on AWS EMR, achieving a 30% reduction in end-to-end pipeline runtime — cutting daily processing wall-time from ~9 hrs to ~6 hrs.
  • Diagnosed and resolved Spark performance bottlenecks (data skew, suboptimal partitioning, oversized shuffles) on pipelines ingesting 200M records/day, reducing estimated compute costs by ~₹33L/yr in EMR cluster spend.
  • Designed SCD Type-2 dimensional data models supporting 3+ years of full historical auditability for patient-level pharma data, enabling downstream teams to retire ad-hoc reconciliation scripts entirely.
  • Managed ingestion of JSON, CSV, Parquet, and ORC formats across multiple upstream source systems, implementing schema-on-read patterns with AWS Glue Data Catalog.
  • Mentored 2 junior engineers on PySpark optimization patterns and AWS data services — both promoted within the project cycle.
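One of the skew fixes mentioned above can be illustrated with the classic key-salting pattern: a hot join/group key is split across N sub-keys so work spreads over many partitions instead of piling onto one. This is a plain-Python sketch of the idea, not the production PySpark code; names like `SALT_BUCKETS` and `salted_key` are illustrative only.

```python
import random
from collections import Counter

SALT_BUCKETS = 8  # illustrative: number of sub-keys a hot key is split into

def salted_key(key: str, rng: random.Random) -> str:
    """Append a random salt suffix so one hot key maps to several buckets."""
    return f"{key}#{rng.randrange(SALT_BUCKETS)}"

rng = random.Random(42)
# Simulated skewed workload: one key dominates the input.
records = ["hot_key"] * 1000 + ["cold_key"] * 10

plain = Counter(records)                               # all 1000 rows on one key
salted = Counter(salted_key(k, rng) for k in records)  # spread over ~8 sub-keys

assert plain["hot_key"] == 1000
assert max(salted.values()) < 1000  # the hot key no longer lands on one partition
```

In Spark, the same idea means salting the skewed side of a join and exploding the small side across all salt values, then aggregating twice (per sub-key, then per original key).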
30% Perf Gain
200M Records/Day
9→6 hr Runtime Cut
15+ Pipelines Migrated
main • Mar 2022 — Apr 2024 • Axtria, Gurugram
axtria_associate.py

Associate Data Engineer

  • Built Python-based Tableau refresh orchestration framework automating end-of-cycle dashboard updates across 30+ dashboards, eliminating 7 hours of manual effort per cycle (~350 hrs/yr saved).
  • Developed and maintained production ETL pipelines using PySpark, AWS Glue, and Control-M with zero-defect delivery across 12 consecutive releases.
  • Implemented data quality validation checks (null checks, referential integrity, row count reconciliation) at each pipeline stage, reducing data defects reaching downstream consumers by 80%.
  • Built star-schema fact and dimension tables in Amazon Redshift supporting 5 analytics use cases including territory performance, brand uptake, and rep activity tracking.
  • Collaborated with business analysts and data scientists to define data contracts, schema agreements, and SLA expectations for 8+ data products.
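The stage-level quality gates listed above (null checks, referential integrity, row-count reconciliation) can be sketched as three small predicates. This is a plain-Python analogue with hypothetical helper names; the production checks ran inside the PySpark/Glue pipelines.

```python
def null_check(rows, column):
    """Fail if any row is missing a value in a required column."""
    return all(row.get(column) is not None for row in rows)

def referential_integrity(child_rows, fk_column, parent_keys):
    """Every foreign key in the child set must exist in the parent dimension."""
    return all(row[fk_column] in parent_keys for row in child_rows)

def row_count_reconciliation(source_count, target_count):
    """Row counts must match exactly between pipeline stages."""
    return source_count == target_count

# Toy data: prescription facts referencing an HCP dimension.
facts = [{"rx_id": 1, "hcp_id": 10}, {"rx_id": 2, "hcp_id": 11}]
hcp_dim = {10, 11, 12}

assert null_check(facts, "hcp_id")
assert referential_integrity(facts, "hcp_id", hcp_dim)
assert row_count_reconciliation(len(facts), 2)
```

Running all three gates after every stage is what keeps a defect from silently propagating to downstream consumers.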
350 hr Saved/Year
30+ Dashboards
80% Fewer Defects
8+ Data Products
feature/infosys • Nov 2020 — Mar 2022 • Infosys, Remote
sales_pipeline.py

Systems Engineer

  • Designed and implemented cloud-based monthly sales analytics pipelines using AWS S3, Python, and PySpark for a retail client operating 500+ stores across 3 regions — automating report generation that previously required 3 days of manual effort per cycle.
  • Architected a star-schema data warehouse with 8 fact and dimension tables tracking revenue, billing frequency, basket size, and customer engagement KPIs — became the single source of truth for executive dashboards.
  • Developed KPI computation logic for 12+ business metrics (revenue growth, churn rate, store-level performance), reducing analyst ad-hoc query turnaround from days to under 2 hours.
  • Optimized complex SQL queries on 100M+ row transaction tables, reducing report generation time by 40% through indexing and query restructuring.
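The warehouse shape described above, fact tables joined to dimensions and queried per region, can be sketched in miniature with SQLite. Table and column names here are illustrative, not the client's actual schema.

```python
import sqlite3

# Toy star schema: one sales fact table plus a store dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    sale_id  INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES dim_store(store_id),
    revenue  REAL
);
-- The kind of index on the join key that sped up reporting queries.
CREATE INDEX ix_sales_store ON fact_sales(store_id);
INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);
""")

# KPI query: revenue by region, the shape of the executive-dashboard metrics.
rows = con.execute("""
    SELECT d.region, SUM(f.revenue) AS revenue
    FROM fact_sales f JOIN dim_store d USING (store_id)
    GROUP BY d.region ORDER BY d.region
""").fetchall()

assert rows == [('North', 200.0), ('South', 50.0)]
```

Keeping the KPI logic as queries over a conformed star schema, rather than per-report ad-hoc SQL, is what turned days of analyst turnaround into hours.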
40% Faster Reports
100M+ Rows Optimized
12+ KPIs Built
3 days Manual Work Cut
AWS S3 • PySpark • Python • Star Schema • Data Warehousing
config/skills.yml
languages.yml
Languages
Python • EXPERT
SQL • EXPERT
Java • INTERMEDIATE
C# • INTERMEDIATE
big_data.yml
Distributed & Big Data
Spark / PySpark • EXPERT
Hadoop / YARN • ADVANCED
Hive • ADVANCED
Kubernetes • INTERMEDIATE
cloud_aws.yml
Cloud Platforms (AWS)
S3 / Data Lakes • EXPERT
Glue / EMR • ADVANCED
Redshift • ADVANCED
RDS • ADVANCED
devops.yml
Architecture & DevOps
Data Modeling / Star Schema • EXPERT
Airflow • ADVANCED
Docker / CI/CD • ADVANCED
Git / Linux • EXPERT
projects/ • SELECT * FROM builds
pharma_platform.py
Production

Pharma Commercial Data Platform

Cloud-native batch data platform on AWS processing 200M records/day of pharma commercial data — prescriptions, sales force activity, and HCP engagement. Migrated 15+ legacy Hive pipelines to PySpark, achieving 30% performance improvement and reducing daily processing window from 9 hrs to 6 hrs.

200M+ Records/Day
30% Faster
99.9% SLA
350 hr Saved/Yr
Python • PySpark • AWS EMR • AWS Glue • Redshift • Airflow • Control-M
retail_warehouse.sql
Shipped

Retail Sales Analytics Data Warehouse

End-to-end sales analytics pipeline and data warehouse for a retail client with 500+ stores across 3 regions. Automated monthly reporting that previously took 3 days of manual effort. Built an 8-table star-schema warehouse powering executive-level revenue and engagement dashboards, with KPI logic covering 12+ business metrics.

40% Faster Reports
12+ KPIs
100M+ Rows
Python • PySpark • AWS S3 • SQL • Star Schema • Amazon Redshift
tableau_orchestrator.py
Shipped

Tableau Refresh Automation

Python-based orchestration framework automating end-of-cycle Tableau dashboard updates via Airflow across 30+ dashboards — eliminated 7 hours of manual effort per cycle (~350 hrs/yr saved).

350 hr Saved/Yr
30+ Dashboards
Python • Airflow • Tableau • AWS
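The core of an orchestration framework like this is a retry loop around each dashboard refresh so one transient failure doesn't abort the whole end-of-cycle run. A minimal sketch, with illustrative function and parameter names (the real framework drove Tableau refreshes, not a stub):

```python
def refresh_with_retry(refresh_fn, dashboard, retries=2):
    """Call refresh_fn(dashboard); retry on failure up to `retries` times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return refresh_fn(dashboard)
        except RuntimeError as exc:   # transient refresh failure
            last_error = exc
    raise last_error

calls = []
def flaky_refresh(name):
    """Stub refresh that fails once, then succeeds."""
    calls.append(name)
    if len(calls) < 2:
        raise RuntimeError("refresh timed out")
    return f"{name}: refreshed"

assert refresh_with_retry(flaky_refresh, "territory_perf") == "territory_perf: refreshed"
assert len(calls) == 2  # one failure, one successful retry
```

Wrapping 30+ dashboards in a loop over a helper like this, scheduled from Airflow, is what replaces the 7 hours of manual clicking per cycle.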
scd2_model.sql
Production

SCD Type-2 Dimensional Data Model

Dimensional data model with SCD Type-2 change tracking supporting 3+ years of full historical auditability for patient-level pharma data. Enabled downstream teams to retire ad-hoc reconciliation scripts entirely.

3+ yr History
SCD-2 Audit Trail
Data Modeling • SQL • Redshift • Star Schema • PySpark
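The SCD Type-2 mechanic behind this model: when a tracked attribute changes, the current dimension row is expired and a new version is opened, so history is never overwritten. A plain-Python sketch with illustrative column names (`valid_from`/`valid_to`/`is_current`); the production model was SQL on Redshift.

```python
from datetime import date

def apply_scd2(dim_rows, incoming, key, tracked, as_of):
    """Merge incoming records into a Type-2 dimension, versioning on change."""
    for rec in incoming:
        current = next(
            (r for r in dim_rows if r[key] == rec[key] and r["is_current"]), None
        )
        if current and all(current[c] == rec[c] for c in tracked):
            continue                      # no change: keep the current version
        if current:                       # change detected: expire old version
            current["valid_to"] = as_of
            current["is_current"] = False
        dim_rows.append({**rec, "valid_from": as_of,
                         "valid_to": None, "is_current": True})
    return dim_rows

dim = [{"patient_id": 1, "segment": "A",
        "valid_from": date(2022, 1, 1), "valid_to": None, "is_current": True}]
apply_scd2(dim, [{"patient_id": 1, "segment": "B"}],
           "patient_id", ["segment"], date(2023, 6, 1))

assert len(dim) == 2                                  # both versions retained
assert dim[0]["is_current"] is False and dim[0]["valid_to"] == date(2023, 6, 1)
assert dim[1]["segment"] == "B" and dim[1]["is_current"] is True
```

Because every version row survives with its validity window, downstream teams can reconstruct the dimension as of any date, which is what made the ad-hoc reconciliation scripts unnecessary.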
data/achievements.json

Education

B.Tech, Computer Science

Bharati Vidyapeeth College of Engineering, Pune
July 2016 — June 2020

Senior Secondary School

Delhi Public School, Gurgaon
2016

AWS Community Builders

Official Member — Security Team

Achievements Unlocked

AWS Security Specialty

Amazon Web Services

Unlocked

AWS Solutions Architect Associate

Amazon Web Services

Unlocked

CCNA

Cisco Certified Network Associate

Unlocked
docs/contact.md
README.md

Let's build something together.

Open for data engineering roles, architecture consulting, or challenging data problems at scale.

Gurugram, Haryana, India
send_message.sh