Led the full-stack architectural overhaul of upGrad's Learning Management System: improving Core Web Vitals by 75%, bringing median (P50) API response times below 200ms, and shipping offline-first PWA support so learners on 2G in Tier 3 Indian cities could finish their coursework.

75%: Core Web Vitals improvement
3M+: Active learners served
<200ms: API response time (P50)
~40%: Load-time reduction on slow devices
upGrad is India's largest online higher education platform — MBA programs, tech bootcamps, professional certifications in partnership with top universities. By 2019, the platform had accumulated three years of rapid-growth technical debt: a monolithic React frontend with no code splitting, server-rendered pages with no caching strategy, and a Node.js backend handling everything from course delivery to payments in a single service.
The user base had also diversified in ways the original architecture couldn't handle. A significant cohort of learners were engineers in Tier 2 and Tier 3 cities: high career ambitions, but often on mobile data plans, often switching between Wi-Fi and 4G, sometimes on 2G. For these learners, a 6-second load time isn't an inconvenience; it's an abandoned session.
The mandate: rebuild the learner experience stack for performance, resilience, and global delivery — without taking the platform offline for the 3M+ learners already in the middle of enrolled programs.
Migrated from a mixed JS/JSX monolith to a fully typed TypeScript codebase. The type migration was incremental: all new files ran in strict mode from day one, while legacy pages started with looser compiler settings that we tightened module by module over six months. Introduced route-based code splitting (dynamic imports), which cut the initial JS bundle for first-load pages from 2.8MB to 680KB.
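A minimal sketch of the route-based splitting pattern, not the production router. The route paths, the `PageModule` shape, and the `fakeChunk` helper are all hypothetical; in a bundled app each loader would be a real dynamic import such as `() => import("./pages/Lesson")`, which is what prompts the bundler to emit a separate chunk. Inline stand-ins are used here so the example runs without a bundler.

```typescript
// Shape of a page module as a dynamic import would resolve it (hypothetical).
interface PageModule {
  render: () => string;
}

type Loader = () => Promise<PageModule>;

// Wraps a loader so a chunk is requested at most once, even if the route
// is visited repeatedly. A failed load clears the cache so it can be retried.
function lazy(loader: Loader): Loader {
  let cached: Promise<PageModule> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader().catch((err) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Counts how often each stand-in chunk is "downloaded".
const loadCounts: Record<string, number> = {};

// Stand-in for `() => import("./pages/<name>")` (assumption for illustration).
function fakeChunk(name: string): Loader {
  return async () => {
    loadCounts[name] = (loadCounts[name] ?? 0) + 1;
    return { render: () => `<main>${name}</main>` };
  };
}

// Hypothetical route table: nothing is loaded until a route is visited.
const routes: Record<string, Loader> = {
  "/dashboard": lazy(fakeChunk("dashboard")),
  "/lesson": lazy(fakeChunk("lesson")),
};

async function navigate(path: string): Promise<string> {
  const load = routes[path];
  if (load === undefined) {
    return "<main>not found</main>";
  }
  const page = await load();
  return page.render();
}
```

Only the code for the visited route is fetched on first load; every other route's chunk stays off the critical path until the learner navigates to it.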
Extracted the course delivery, progress tracking, and assessment subsystems into dedicated services. The Node.js layer owned learner session, content rendering, and API aggregation. Golang services handled the performance-critical paths: progress write throughput (50K+ concurrent learners during live sessions), video progress checkpointing, and the search index for the content library. Go's goroutine model absorbed burst traffic without the event-loop saturation we were hitting in Node for write-heavy endpoints.
Core course content (lesson HTML, video manifests, quiz payloads) is pre-cached on enrollment via a background sync strategy built on Workbox. When a learner goes offline mid-lesson, the service worker intercepts failed fetches and serves cached content. Progress events are queued in IndexedDB and replayed when connectivity is restored. We tested under simulated 2G conditions using Chrome's network throttling and fixed three critical failure modes before launch: interleaved quiz submissions, video seek state recovery, and certificate unlock timing.
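The queue-and-replay behaviour for progress events can be sketched as follows. This is an illustration under stated assumptions, not the shipped service worker code: `QueueStore` is a hypothetical storage contract that IndexedDB would back in the browser, `MemoryStore` is an in-memory stand-in so the example runs anywhere, and the `send` callback stands in for the real progress API call.

```typescript
interface LearnerProgressEvent {
  id: string; // client-generated, so the server can deduplicate replays
  lessonId: string;
  percent: number;
}

// Hypothetical storage contract; IndexedDB would implement this in the browser.
interface QueueStore {
  add(event: LearnerProgressEvent): Promise<void>;
  all(): Promise<LearnerProgressEvent[]>;
  remove(id: string): Promise<void>;
}

// In-memory stand-in used only to keep the sketch self-contained.
class MemoryStore implements QueueStore {
  private items: LearnerProgressEvent[] = [];

  async add(event: LearnerProgressEvent): Promise<void> {
    this.items.push(event);
  }

  async all(): Promise<LearnerProgressEvent[]> {
    return [...this.items];
  }

  async remove(id: string): Promise<void> {
    this.items = this.items.filter((e) => e.id !== id);
  }
}

type Sender = (event: LearnerProgressEvent) => Promise<void>;

class ProgressQueue {
  constructor(private store: QueueStore, private send: Sender) {}

  // Always persist first, then try to deliver. New events therefore never
  // overtake older ones that are still waiting in the queue.
  async record(event: LearnerProgressEvent): Promise<void> {
    await this.store.add(event);
    await this.flush();
  }

  // Replays queued events in order and stops at the first failure, leaving
  // the remainder queued. Returns how many events were delivered.
  async flush(): Promise<number> {
    let delivered = 0;
    for (const event of await this.store.all()) {
      try {
        await this.send(event);
      } catch {
        break; // still offline (or server error): retry on the next flush
      }
      await this.store.remove(event.id);
      delivered++;
    }
    return delivered;
  }

  async pending(): Promise<number> {
    return (await this.store.all()).length;
  }
}
```

In a browser, `flush` would be triggered when connectivity returns, for example from an `online` event listener or a background sync callback. Because an event is removed only after a successful send, a crash between the two steps can resend it, which is why each event carries an id the server can deduplicate on.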
Course content is served via CloudFront with regional edge caching. Cache keys are structured to maximize hit rate: content is immutable once published (versioned URLs, long TTLs), while progress and personalization data is always origin-fetched. Redis sits in front of the Golang services as a write-through cache for learner progress state, reducing PostgreSQL write pressure by ~70% during peak concurrent sessions. Cache invalidation on enrollment changes is event-driven via SQS.
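One way to express the immutable-versus-origin-fetched split is as a function that picks a `Cache-Control` header per path. This is a sketch only: the `/content/<id>/v<n>/<file>` URL layout and the one-year TTL are assumptions for illustration, not the actual CloudFront configuration.

```typescript
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

// Hypothetical URL layout: publishing a new version produces a new URL,
// so a previously cached URL never needs to be invalidated.
function versionedUrl(contentId: string, version: number, file: string): string {
  return `/content/${contentId}/v${version}/${file}`;
}

const VERSIONED_CONTENT = /^\/content\/[^/]+\/v\d+\/.+$/;

// Published, versioned content gets a long TTL and the `immutable` directive.
// Everything else (progress, personalization) is never stored by caches,
// so each request for it reaches the origin.
function cacheControlFor(path: string): string {
  if (VERSIONED_CONTENT.test(path)) {
    return `public, max-age=${ONE_YEAR_SECONDS}, immutable`;
  }
  return "private, no-store";
}
```

For example, `cacheControlFor(versionedUrl("course-101", 3, "lesson-1.html"))` yields a long-lived public policy, while a path such as `/api/progress/course-101` falls through to `private, no-store`.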
We defined explicit SLOs: LCP < 2.5s on simulated 4G mid-tier mobile, API P95 < 500ms for core learner journeys, error rate < 0.1%. Datadog dashboards and alerting were configured before the first traffic migration. Performance regressions failed the CI build: we ran Lighthouse CI and a custom Core Web Vitals budget check on every PR, blocking merges that degraded LCP, CLS, or FID beyond threshold.
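A budget check of this kind can be sketched in a few lines. The 2.5s LCP budget comes from the SLO above; the CLS and FID values are assumptions taken from Google's published "good" thresholds, and the function and field names are hypothetical rather than the actual CI script.

```typescript
interface VitalsSample {
  lcpMs: number; // Largest Contentful Paint, milliseconds
  cls: number;   // Cumulative Layout Shift, unitless
  fidMs: number; // First Input Delay, milliseconds
}

// LCP budget from the SLO; CLS and FID budgets are assumed values.
const BUDGET: VitalsSample = { lcpMs: 2500, cls: 0.1, fidMs: 100 };

// Returns one message per metric that exceeds its budget.
// An empty array means the PR is within budget.
function budgetViolations(
  measured: VitalsSample,
  budget: VitalsSample = BUDGET
): string[] {
  const violations: string[] = [];
  for (const key of Object.keys(budget) as (keyof VitalsSample)[]) {
    if (measured[key] > budget[key]) {
      violations.push(`${key}: ${measured[key]} exceeds budget ${budget[key]}`);
    }
  }
  return violations;
}
```

A CI step would run this against the metrics collected for the PR build and exit with a non-zero status when the returned list is non-empty, which is what blocks the merge.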
The 75% Core Web Vitals improvement wasn't an engineering vanity metric — it had direct downstream effects. Load-time reduction on slow-device profiles (~40% in lab conditions, consistent with Lighthouse CI data) directly reduced abandonment for the low-bandwidth user segment. The platform's next growth phase — entering Tier 3 markets aggressively — was unblocked in part by this infrastructure work. Faster load times and PWA support meant upGrad could credibly offer a comparable experience to learners who had previously churned due to connectivity issues.
From an engineering leadership perspective: the performance standards I established (Lighthouse CI budgets, API SLOs, typed codebase conventions) became the baseline for the platform team after my tenure. The architectural patterns — offline-first caching, split Go/Node services for performance-critical writes — persisted in the system through subsequent infrastructure cycles.