Node.js executes JavaScript on a single main thread, so by default one process keeps only one CPU core busy with application code. Scaling therefore means using multiple cores, distributing load across processes and machines, and eliminating I/O bottlenecks. This guide covers patterns from single-server clustering to multi-region deployments.

Understanding the Event Loop Bottleneck

  Request → Event Loop → async I/O (non-blocking)
              ↓
         CPU work (blocking!)
  

CPU-intensive tasks (parsing huge JSON payloads, image resizing, synchronous crypto) block the event loop, so every other request on that process waits until they finish. Solutions:

  • Worker threads for CPU-bound work
  • Separate microservices for heavy computation
  • Horizontal scaling — more instances behind a load balancer

Monitor event loop lag with perf_hooks or prom-client event loop metrics.
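
One dependency-free way to approximate event loop lag is to schedule a repeating timer and measure how late each tick fires. The sketch below assumes Node.js 16 or later (where `performance` is a global); `startLagSampler` and its parameters are illustrative names, not part of any library. For production use, `monitorEventLoopDelay` from `perf_hooks` offers a histogram-based measurement instead.

```javascript
// Approximates event loop lag via timer drift: a timer due in `intervalMs`
// that fires later than that was delayed by work blocking the loop.
function startLagSampler(intervalMs, onSample) {
    let expected = performance.now() + intervalMs;

    const timer = setInterval(() => {
        const now = performance.now();
        // Lag is how far past the expected time this tick actually ran.
        onSample(Math.max(0, now - expected));
        expected = now + intervalMs;
    }, intervalMs);

    // Do not keep the process alive just for monitoring.
    timer.unref();

    // The returned function stops sampling.
    return () => clearInterval(timer);
}

// Hypothetical usage: warn when a tick runs more than 100 ms late.
const stopSampling = startLagSampler(500, (lagMs) => {
    if (lagMs > 100) console.warn(`Event loop lag: ${lagMs.toFixed(1)} ms`);
});
```

Timer drift only samples at the chosen interval, so short stalls between ticks can go unnoticed; treat it as a rough signal rather than a precise measurement.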

Cluster Module (Multi-Core)

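import cluster from 'node:cluster';
import os from 'node:os';
import process from 'node:process';

if (cluster.isPrimary) {
    // Prefer availableParallelism() where it exists (Node 18.14+); fall back to the CPU count
    const numCPUs = os.availableParallelism?.() ?? os.cpus().length;
    console.log(`Primary ${process.pid} spawning ${numCPUs} workers`);

    for (let i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    // Replace any worker that exits so capacity stays constant
    cluster.on('exit', (worker, code, signal) => {
        console.log(`Worker ${worker.process.pid} died (${signal ?? code}), restarting`);
        cluster.fork();
    });
} else {
    // Each worker runs the HTTP server; workers share the same listening port
    await import('./server.js');
}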
  

Each worker is a separate process with its own memory. PM2 automates this:

pm2 start dist/server.js -i max --name api
pm2 startup && pm2 save
  

Load Balancing

                      Nginx / ALB
                   /     |     \
              Node-1  Node-2  Node-3
                   \     |     /
                      Redis
                    PostgreSQL
  

Nginx upstream

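upstream node_api {
    least_conn;
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
    server 10.0.1.12:3000;
}

server {
    listen 80;

    location / {
        proxy_pass http://node_api;
        proxy_http_version 1.1;
        # Pass WebSocket upgrade requests through to Node
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        # Preserve the original host and client address for the app
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}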
  

Use least_conn for long-lived connections. For uniform short requests, omit the directive: round robin is Nginx's default and has no directive of its own.
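
To make the difference concrete, the following sketch models the two selection strategies in plain JavaScript. It illustrates the idea only and is not how Nginx implements them (Nginx also takes server weights into account); the `servers` objects and the function names are hypothetical.

```javascript
// Round robin: hand out servers in a fixed rotation, ignoring current load.
function createRoundRobin(servers) {
    let next = 0;
    return () => {
        const server = servers[next];
        next = (next + 1) % servers.length;
        return server;
    };
}

// Least connections: pick whichever server has the fewest active connections.
function pickLeastConnections(servers) {
    return servers.reduce((best, s) => (s.active < best.active ? s : best));
}

// Hypothetical snapshot: Node-1 is holding many long-lived WebSocket connections.
const servers = [
    { name: 'Node-1', active: 40 },
    { name: 'Node-2', active: 3 },
    { name: 'Node-3', active: 12 },
];

const roundRobin = createRoundRobin(servers);
console.log(roundRobin().name);                  // Node-1, even though it is the busiest
console.log(pickLeastConnections(servers).name); // Node-2
```

With short, uniform requests the active counts stay close together and both strategies behave alike; the gap only opens up when some connections stay open far longer than others.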

Stateless Application Design

Each instance must handle any request:

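Stateful (avoid)          Stateless (prefer)
In-memory sessions        Redis session store
Local file uploads        S3 / object storage
In-process caches only    Shared Redis cache
WebSocket on one node     Redis adapter for Socket.IO

For example, the Socket.IO Redis adapter relays events between instances so a message emitted on one node reaches clients connected to another:

import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

const pub = createClient({ url: process.env.REDIS_URL });
const sub = pub.duplicate();

// node-redis v4+ clients must be connected before the adapter uses them
await Promise.all([pub.connect(), sub.connect()]);

// `io` is the application's existing Socket.IO server instance
io.adapter(createAdapter(pub, sub));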
  

Connection Pooling

Database connections are expensive. Limit per instance:

import { Pool } from 'pg';

const pool = new Pool({
    connectionString: process.env.DATABASE_URL,
    max: 20,              // max connections per instance
    idleTimeoutMillis: 30_000,
    connectionTimeoutMillis: 5_000,
});
  

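Total connections = processes × pool.max, where processes counts every cluster worker on every machine. Stay under the database's max_connections (PostgreSQL defaults to 100; managed services such as RDS derive the limit from instance memory). Use PgBouncer for connection multiplexing when the total would exceed that limit.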
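
The arithmetic above can be turned around to derive a safe `max` for each pool. The sketch below is a minimal example under stated assumptions: the function name, the `reserved` headroom for migrations and admin sessions, and the sample numbers are all hypothetical.

```javascript
// Largest pool.max each Node.js process can use without exceeding the
// database connection limit. `reserved` is headroom kept free for
// migrations, cron jobs, and admin sessions (an assumed value).
function maxPoolSizePerInstance(dbMaxConnections, instances, reserved = 10) {
    if (instances <= 0) throw new RangeError('instances must be positive');
    const usable = dbMaxConnections - reserved;
    return Math.max(0, Math.floor(usable / instances));
}

// Example: a 200-connection limit shared by 12 processes.
console.log(maxPoolSizePerInstance(200, 12)); // 15 (12 × 15 = 180, within the 190 usable)

// A pool max of 20 would need 240 connections at 12 processes,
// which exceeds the limit once scaling reaches that count.
console.log(12 * 20 > 200); // true
```

Run the calculation against the maximum instance count the autoscaler may reach, not the current one, so that scaling out cannot exhaust the database.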

Caching Layers

  Client → CDN → API Gateway cache → Redis → Database
  
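Cache-aside read with Redis: return the cached value on a hit, otherwise load from the database and cache the result for 5 minutes:

import Redis from 'ioredis';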

const redis = new Redis(process.env.REDIS_URL);

async function getUser(id: string) {
    const cached = await redis.get(`user:${id}`);
    if (cached) return JSON.parse(cached);

    const user = await db.user.findUnique({ where: { id } });
    await redis.setex(`user:${id}`, 300, JSON.stringify(user));
    return user;
}
  

Invalidate on write:

await db.user.update({ where: { id }, data });
await redis.del(`user:${id}`);
  

Rate Limiting at Scale

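In-memory rate limiters break down with multiple instances: each process keeps its own counters, so a client can get up to N times the intended limit across N instances. Keep the counters in Redis instead: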

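import { RateLimiterRedis } from 'rate-limiter-flexible';

const limiter = new RateLimiterRedis({
    storeClient: redis,   // the shared ioredis client
    keyPrefix: 'rl',
    points: 100,          // requests allowed...
    duration: 60,         // ...per 60-second window
});

// Behind a proxy, req.ip is the client address only if Express 'trust proxy' is configured
app.use(async (req, res, next) => {
    try {
        await limiter.consume(req.ip);
        next();
    } catch (err) {
        // Store failures reject with an Error; limit hits reject with a result object
        if (err instanceof Error) return next(err);
        res.set('Retry-After', String(Math.ceil(err.msBeforeNext / 1000)));
        res.status(429).json({ error: 'Too many requests' });
    }
});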
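
Why a shared store matters can be shown with a toy fixed-window counter. In this sketch a plain `Map` stands in both for per-process memory and for Redis; the class, the helper, and the client address are hypothetical, and a real limiter also has to handle window expiry and atomic increments, which are left out here.

```javascript
// Toy fixed-window limiter: counts hits per key in whatever store it is given.
class FixedWindowLimiter {
    constructor(store, points) {
        this.store = store;   // Map standing in for process memory or Redis
        this.points = points; // requests allowed per window
    }

    // Returns true if the request is allowed, false if it exceeds the limit.
    consume(key) {
        const used = (this.store.get(key) ?? 0) + 1;
        this.store.set(key, used);
        return used <= this.points;
    }
}

// Sends `total` requests from one client, alternating between instances
// the way a load balancer would, and counts how many are allowed.
function countAllowed(limiters, total) {
    let allowed = 0;
    for (let i = 0; i < total; i++) {
        if (limiters[i % limiters.length].consume('203.0.113.7')) allowed++;
    }
    return allowed;
}

// Two instances, each with its own in-memory store: the limit doubles.
const isolated = [
    new FixedWindowLimiter(new Map(), 100),
    new FixedWindowLimiter(new Map(), 100),
];
console.log(countAllowed(isolated, 300)); // 200

// Two instances sharing one store (the role Redis plays): the limit holds.
const sharedStore = new Map();
const shared = [
    new FixedWindowLimiter(sharedStore, 100),
    new FixedWindowLimiter(sharedStore, 100),
];
console.log(countAllowed(shared, 300)); // 100
```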
  

Auto-Scaling

Kubernetes Horizontal Pod Autoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  

Scale on CPU, memory, or custom metrics (request rate, queue depth). Set minReplicas ≥ 2 for availability.
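
For sizing intuition, the autoscaler's core calculation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured minimum and maximum. The sketch below reproduces only that formula; the function name and input shape are hypothetical, and the real controller additionally applies a tolerance band and stabilization windows that are omitted here.

```javascript
// Core HPA scaling formula, clamped to the min/max replica bounds.
function desiredReplicas({ current, currentUtilization, targetUtilization, min, max }) {
    const raw = Math.ceil(current * (currentUtilization / targetUtilization));
    return Math.min(max, Math.max(min, raw));
}

// 4 pods averaging 90% CPU against a 70% target: ceil(4 × 90 / 70) = 6.
console.log(desiredReplicas({
    current: 4,
    currentUtilization: 90,
    targetUtilization: 70,
    min: 2,
    max: 20,
})); // 6
```

Because the result is proportional to the current replica count, a low target such as 50% scales out earlier and leaves more headroom, at the cost of running more pods.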

Health Checks and Graceful Shutdown

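app.get('/health', async (req, res) => {
    try {
        await pool.query('SELECT 1');
        await redis.ping();
        res.json({ status: 'ok' });
    } catch {
        res.status(503).json({ status: 'degraded' });
    }
});

process.on('SIGTERM', () => {
    console.log('Shutting down gracefully');

    // Force exit if draining takes too long (10 s is an assumed budget)
    setTimeout(() => process.exit(1), 10_000).unref();

    // Stop accepting connections; the callback runs once in-flight requests finish
    server.close(async () => {
        await pool.end();
        await redis.quit();
        process.exit(0);
    });
});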
  

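Kubernetes sends SIGTERM before removing a pod, then SIGKILL once the termination grace period (30 seconds by default) expires, so stop accepting new connections and finish in-flight requests within that window.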

Capacity Planning

Estimate required instances:

Required RPS = peak traffic × safety factor (1.5–2×)
Per-instance RPS = load test result (e.g., 500 RPS at p95 < 200ms)
Instances = ceil(Required RPS / Per-instance RPS)
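
As a worked example of the formula, the sketch below plugs in hypothetical numbers (a 2,400 RPS peak, a 2× safety factor, and the 500 RPS per-instance figure used above) and rounds up, since a fractional instance cannot be deployed. The function name is illustrative.

```javascript
// Instances needed to serve peak traffic with a safety margin, rounded up.
function requiredInstances(peakRps, safetyFactor, perInstanceRps) {
    if (perInstanceRps <= 0) throw new RangeError('perInstanceRps must be positive');
    const requiredRps = peakRps * safetyFactor;
    return Math.ceil(requiredRps / perInstanceRps);
}

console.log(requiredInstances(2400, 2, 500)); // 10 (4800 / 500 = 9.6, rounded up)
```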
  

Load test with k6:

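import http from 'k6/http';
import { check } from 'k6';

export const options = {
    // `target` is the number of virtual users (VUs), not requests per second
    stages: [
        { duration: '2m', target: 100 },  // ramp up to 100 VUs
        { duration: '5m', target: 500 },  // ramp from 100 to 500 VUs
        { duration: '2m', target: 0 },    // ramp down to 0
    ],
};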

export default function () {
    const res = http.get('https://api.example.com/users');
    check(res, { 'status is 200': (r) => r.status === 200 });
}
  

Observability at Scale

Centralize logs (structured JSON), metrics (Prometheus), and traces (OpenTelemetry):

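import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api');

async function getUserTraced(id) {
    // startActiveSpan makes the span current, so nested spans attach to it
    return tracer.startActiveSpan('getUser', async (span) => {
        try {
            return await fetchUser(id);
        } catch (err) {
            span.recordException(err);
            span.setStatus({ code: SpanStatusCode.ERROR });
            throw err;
        } finally {
            span.end();
        }
    });
}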
  

Alert on: error rate > 1%, p95 latency doubling, event loop lag > 100ms.
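
The three thresholds can be expressed as a small check over one window of metrics. This is a sketch only: in practice such rules would usually live in the monitoring system rather than in application code, and the function names, the input shape, and the nearest-rank percentile method are assumptions made for illustration.

```javascript
// Nearest-rank percentile over a list of latency samples (ms).
function percentile(samples, p) {
    if (samples.length === 0) return 0;
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(0, rank - 1)];
}

// Returns the names of the alert conditions that fire for this window.
function evaluateAlerts({ requests, errors, latenciesMs, baselineP95Ms, eventLoopLagMs }) {
    const alerts = [];
    if (requests > 0 && errors / requests > 0.01) alerts.push('error-rate');
    if (percentile(latenciesMs, 95) >= 2 * baselineP95Ms) alerts.push('p95-latency');
    if (eventLoopLagMs > 100) alerts.push('event-loop-lag');
    return alerts;
}

// Hypothetical window: 2.5% errors, p95 of 450 ms against a 200 ms baseline.
const sample = {
    requests: 1000,
    errors: 25,
    latenciesMs: [120, 130, 110, 400, 150, 140, 125, 135, 145, 450],
    baselineP95Ms: 200,
    eventLoopLagMs: 40,
};
console.log(evaluateAlerts(sample)); // [ 'error-rate', 'p95-latency' ]
```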

Scaling Checklist

  • Application stateless; sessions in Redis
  • Cluster mode or multiple K8s replicas
  • Load balancer with health checks
  • DB connection pooling with total limit calculated
  • Redis for cache, rate limits, pub/sub
  • CDN for static assets
  • Graceful shutdown handling
  • Load tested at 2× expected peak
  • Auto-scaling policies configured

Scaling Node.js is less about the runtime and more about architecture: stateless services, shared stores, and measured capacity drive reliable growth.