EdTech · Performance · PWA · Scale

upGrad LMS Rebuild

Led the full-stack architectural overhaul of upGrad's Learning Management System: improving Core Web Vitals by 75%, bringing median (P50) API response times under 200ms, and shipping offline-first PWA support so learners on 2G in Tier 3 Indian cities could finish their coursework.

Client: upGrad · Role: Lead Software Engineer · Period: Dec 2019 – Sept 2021
React · PWA · Performance · Offline-First · Node.js · Golang · Scale

75%

Core Web Vitals improvement

3M+

Active learners served

<200ms

API response time (P50)

~40%

Load-time reduction on slow devices

Context & Challenge

upGrad is India's largest online higher education platform — MBA programs, tech bootcamps, professional certifications in partnership with top universities. By 2019, the platform had accumulated three years of rapid-growth technical debt: a monolithic React frontend with no code splitting, server-rendered pages with no caching strategy, and a Node.js backend handling everything from course delivery to payments in a single service.

The user base had also diversified in ways the original architecture couldn't handle. A significant cohort of learners were engineers in Tier 2 and Tier 3 cities: high career ambitions, but often on mobile data plans, often switching between WiFi and 4G, sometimes on 2G. For these learners a 6-second load time isn't an inconvenience; it's an abandoned session.

The mandate: rebuild the learner experience stack for performance, resilience, and global delivery — without taking the platform offline for the 3M+ learners already in the middle of enrolled programs.

Architecture Decisions

React + TypeScript for Learner UX

Migrated from a mixed JS/JSX monolith to a fully typed TypeScript codebase. The type migration was incremental: we enforced strict mode on all new files from day one, relaxed it for legacy pages, and tightened it module by module over six months. Introduced route-based code splitting (dynamic imports) that cut the initial JS bundle for first-load pages from 2.8MB to 680KB, a reduction of roughly 76%.
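A minimal sketch of the route-based splitting idea, not the production router. The route paths, the `PageModule` shape, and the `lazyRoute` helper are hypothetical names chosen for illustration. In a real bundle each loader would be a dynamic `import()` that Webpack emits as its own chunk; stub loaders are used here so the example runs on its own.

```typescript
// Hypothetical page contract; real pages would export React components.
type PageModule = { render: () => string };
type Loader = () => Promise<PageModule>;

// Wrap a loader so its chunk is requested at most once, however often the
// route is visited. A failed load clears the memo so a later visit can retry,
// which matters on flaky connections.
function lazyRoute(loader: Loader): Loader {
  let pending: Promise<PageModule> | undefined;
  return () => {
    if (pending === undefined) {
      pending = loader().catch((err: unknown) => {
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}

// Counter used only to make the "one request per chunk" behaviour observable.
let lessonChunkRequests = 0;

// In production this would be: lazyRoute(() => import("./pages/Lesson"))
const loadLesson = lazyRoute(async () => {
  lessonChunkRequests += 1;
  return { render: () => "lesson" };
});
const loadDashboard = lazyRoute(async () => ({ render: () => "dashboard" }));

// Illustrative route table; only the visited route's chunk is ever loaded.
const routes: Record<string, Loader> = {
  "/dashboard": loadDashboard,
  "/lesson": loadLesson,
};

async function navigate(path: string): Promise<string> {
  const load = routes[path];
  if (load === undefined) {
    throw new Error(`No route registered for ${path}`);
  }
  const page = await load();
  return page.render();
}
```

The first-load bundle then only needs the router and the landing route; everything else is fetched when `navigate` is called for that path.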

Node.js + Golang Microservices Split

Extracted the course delivery, progress tracking, and assessment subsystems into dedicated services. The Node.js layer owned learner session, content rendering, and API aggregation. Golang services handled the performance-critical paths: progress write throughput (50K+ concurrent learners during live sessions), video progress checkpointing, and the search index for the content library. Go's goroutine model absorbed burst traffic without the event-loop saturation we were hitting in Node for write-heavy endpoints.

Offline-First PWA via Service Workers + Workbox

Core course content (lesson HTML, video manifests, quiz payloads) is pre-cached on enrollment via a background sync strategy built on Workbox. When a learner goes offline mid-lesson, the service worker intercepts failed fetches and serves cached content. Progress events are queued in IndexedDB and replayed when connectivity returns. We tested against simulated 2G conditions using Chrome's network throttling and fixed three critical failure modes before launch: interleaved quiz submissions, video seek state recovery, and certificate unlock timing.
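The queue-and-replay behaviour can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory array stands in for IndexedDB, `send` is synchronous where a real service worker would await a `fetch`, and the `QueuedProgress` fields and the client-generated `id` used for de-duplication are hypothetical.

```typescript
// Hypothetical shape of a queued progress event.
interface QueuedProgress {
  id: string; // client-generated idempotency key (assumed)
  lessonId: string;
  positionSec: number;
  recordedAt: number; // epoch milliseconds
}

// Returns true when the server acknowledged the event.
type Send = (event: QueuedProgress) => boolean;

class ProgressQueue {
  private events: QueuedProgress[] = [];
  private seen = new Set<string>();

  // Ignore events already queued or delivered, so retries cannot double-count.
  enqueue(event: QueuedProgress): void {
    if (this.seen.has(event.id)) {
      return;
    }
    this.seen.add(event.id);
    this.events.push(event);
  }

  // Replay in recorded order and stop at the first failure, so a later
  // event never overtakes an earlier one. Returns the number delivered.
  replay(send: Send): number {
    this.events.sort((a, b) => a.recordedAt - b.recordedAt);
    let delivered = 0;
    while (this.events.length > 0) {
      const next = this.events[0];
      if (next === undefined || !send(next)) {
        break;
      }
      this.events.shift();
      delivered += 1;
    }
    return delivered;
  }

  pending(): number {
    return this.events.length;
  }
}
```

Stopping at the first failed send keeps undelivered events in the queue for the next connectivity change, which is one way to avoid out-of-order writes such as interleaved quiz submissions.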

Edge-Aware CDN + Redis Cache Strategy

Course content is served via CloudFront with regional edge caching. Cache keys are structured to maximize hit rate: content is immutable once published (versioned URLs, long TTLs), while progress and personalization data is always origin-fetched. Redis sits in front of the Golang services as a write-through cache for learner progress state, reducing PostgreSQL write pressure by ~70% during peak concurrent sessions. Cache invalidation on enrollment changes is event-driven via SQS.
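One way to express the split between immutable published content and origin-fetched learner data is a small policy function. The URL layout, the `ResourceKind` names, and the exact header values below are assumptions for illustration, not upGrad's actual configuration; the `Cache-Control` directives themselves are standard HTTP.

```typescript
// Hypothetical resource categories matching the two classes described above.
type ResourceKind = "published-content" | "learner-progress" | "personalization";

// Put the version in the path so each publish produces a new URL.
// Old cache entries then never need invalidating; they simply stop being requested.
function publishedContentUrl(courseId: string, lessonId: string, version: number): string {
  const course = encodeURIComponent(courseId);
  const lesson = encodeURIComponent(lessonId);
  return `/content/${course}/${lesson}/v${version}/lesson.json`;
}

// Long-lived, shared caching for versioned content; no caching at all for
// per-learner data, so it is always fetched from origin.
function cacheControlFor(kind: ResourceKind): string {
  if (kind === "published-content") {
    return "public, max-age=31536000, immutable"; // one year, assumed TTL
  }
  return "private, no-store";
}
```

For example, `publishedContentUrl("mba 101", "l1", 3)` yields `/content/mba%20101/l1/v3/lesson.json`, and republishing as version 4 yields a different URL that the edge treats as a fresh object.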

SLO Tracking & Performance Guardrails

We defined explicit SLOs: LCP < 2.5s on simulated 4G mid-tier mobile, API P95 < 500ms for core learner journeys, error rate < 0.1%. Datadog dashboards and alerting were configured before the first traffic migration. Performance regressions failed the CI pipeline: every PR ran Lighthouse CI and a custom Core Web Vitals budget check, and merges that degraded LCP, CLS, or FID beyond threshold were blocked.
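A budget check of this kind can be sketched as a pure function that a CI step calls with measured values. Only the 2.5s LCP limit comes from the SLOs above; the CLS and FID limits are assumed placeholders borrowed from Google's published "good" thresholds, and the `VitalsSample` and `budgetViolations` names are hypothetical.

```typescript
interface VitalsSample {
  lcpMs: number; // Largest Contentful Paint, milliseconds
  cls: number; // Cumulative Layout Shift, unitless
  fidMs: number; // First Input Delay, milliseconds
}

// LCP limit from the stated SLO; CLS and FID limits are assumptions.
const defaultBudget: VitalsSample = { lcpMs: 2500, cls: 0.1, fidMs: 100 };

// A metric passes only when it is strictly under its limit.
// Returns one human-readable message per violated metric.
function budgetViolations(
  measured: VitalsSample,
  limits: VitalsSample = defaultBudget
): string[] {
  const violations: string[] = [];
  if (measured.lcpMs >= limits.lcpMs) {
    violations.push(`LCP ${measured.lcpMs}ms is not under ${limits.lcpMs}ms`);
  }
  if (measured.cls >= limits.cls) {
    violations.push(`CLS ${measured.cls} is not under ${limits.cls}`);
  }
  if (measured.fidMs >= limits.fidMs) {
    violations.push(`FID ${measured.fidMs}ms is not under ${limits.fidMs}ms`);
  }
  return violations;
}
```

A CI step would exit non-zero whenever the returned list is non-empty, which is what blocks the merge.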

Business Impact

The 75% Core Web Vitals improvement wasn't an engineering vanity metric — it had direct downstream effects. Load-time reduction on slow-device profiles (~40% in lab conditions, consistent with Lighthouse CI data) directly reduced abandonment for the low-bandwidth user segment. The platform's next growth phase — entering Tier 3 markets aggressively — was unblocked in part by this infrastructure work. Faster load times and PWA support meant upGrad could credibly offer a comparable experience to learners who had previously churned due to connectivity issues.

From an engineering leadership perspective: the performance standards I established (Lighthouse CI budgets, API SLOs, typed codebase conventions) became the baseline for the platform team after my tenure. The architectural patterns — offline-first caching, split Go/Node services for performance-critical writes — persisted in the system through subsequent infrastructure cycles.

Stack

React · TypeScript · Node.js · Golang · Redis · PostgreSQL · Service Workers · Workbox · GraphQL · CDN (CloudFront) · AWS · Webpack · Code Splitting · Web Vitals API · Datadog