2024-03-15
8 min read

Scaling WebSockets for 10 Million Users

WebSockets · Redis · Scaling

Introduction

Building real-time applications that scale to millions of users requires a fundamental shift in how we think about state and connection management. In this log, I'll break down the architecture used to support 10M+ users at Zoho.

The Challenge: Connection Limits

Traditional HTTP request-response cycles are stateless. WebSockets, however, maintain a persistent connection per client. Each open connection consumes a file descriptor, and a single server can only keep a limited number of file descriptors (sockets) open.

# Standard limit check
ulimit -n
# Often defaults to 1024, needs to be raised to 65535+
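Raising the limit typically involves both the current shell session and system-level configuration; the exact files vary by distribution and init system, so treat this as a sketch of the usual approach rather than a universal recipe.

```shell
# Raise the soft limit for the current shell session
ulimit -n 65535

# Persist across reboots for PAM-based logins via /etc/security/limits.conf:
#   <domain> <type> <item>  <value>
#   *        soft  nofile  65535
#   *        hard  nofile  65535

# For services managed by systemd, set the limit in the unit file instead:
#   [Service]
#   LimitNOFILE=65535
```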

Architecture: The Pub/Sub Layer

To scale horizontally, we cannot rely on sticky sessions alone. We need a way to broadcast messages across different server nodes.

Redis Pub/Sub

We utilized Redis as a lightweight message broker. When a user connects to Node A, and a message needs to be sent to them from a process on Node B, Node B publishes to a Redis channel that Node A is subscribed to.

  1. Client connects to Load Balancer
  2. Load Balancer routes to Socket Server Instance
  3. Instance subscribes to Redis channel user:{userId}
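The flow above can be sketched with an in-process stand-in for Redis Pub/Sub. The `user:{userId}` channel naming follows the convention above; the `Broker` and `SocketServer` classes and the node names are illustrative, not the actual production implementation.

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis Pub/Sub: channel -> subscriber callbacks."""
    def __init__(self):
        self._channels = defaultdict(list)

    def subscribe(self, channel, callback):
        self._channels[channel].append(callback)

    def publish(self, channel, message):
        # Redis delivers the message to every current subscriber of the channel
        for callback in self._channels[channel]:
            callback(message)

class SocketServer:
    """One socket-server instance (e.g. "Node A") holding local connections."""
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.delivered = []  # messages pushed to locally connected clients

    def on_client_connect(self, user_id):
        # Step 3: subscribe to the per-user channel so any node can reach us
        self.broker.subscribe(f"user:{user_id}", self._push_to_client)

    def _push_to_client(self, message):
        self.delivered.append(message)

broker = Broker()
node_a = SocketServer("Node A", broker)
node_a.on_client_connect(42)  # client lands on Node A via the load balancer

# A process on Node B publishes without knowing where user 42 is connected
broker.publish("user:42", "hello from Node B")
print(node_a.delivered)  # ['hello from Node B']
```

The key property this illustrates: the publisher never needs to know which instance holds the connection, so instances stay interchangeable behind the load balancer.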

Optimization Techniques

  • Heartbeat Tuning: Adjusting ping/pong intervals to balance load vs. zombie connection detection.
  • Binary Formats: Using Protobuf instead of JSON for high-frequency telemetry data reduced bandwidth by 40%.
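The heartbeat tradeoff in the first bullet can be sketched as a simple reaper: shorter ping intervals detect zombie connections sooner but cost more traffic. The interval and timeout values here are illustrative defaults, not the tuned production numbers.

```python
import time

PING_INTERVAL = 25.0   # seconds between server pings (illustrative)
PONG_TIMEOUT  = 60.0   # drop a connection silent for longer than this

class ConnectionRegistry:
    """Tracks last-pong timestamps and reaps zombie connections."""
    def __init__(self, now=time.monotonic):
        self._now = now        # injectable clock, so the example is deterministic
        self._last_pong = {}

    def on_connect(self, conn_id):
        self._last_pong[conn_id] = self._now()

    def on_pong(self, conn_id):
        # Client answered our ping; it is still alive
        self._last_pong[conn_id] = self._now()

    def reap_zombies(self):
        """Return and forget connections that missed the pong deadline."""
        cutoff = self._now() - PONG_TIMEOUT
        dead = [c for c, t in self._last_pong.items() if t < cutoff]
        for c in dead:
            del self._last_pong[c]
        return dead

# Simulated clock to demonstrate the reaper deterministically
clock = [0.0]
reg = ConnectionRegistry(now=lambda: clock[0])
reg.on_connect("conn-1")
reg.on_connect("conn-2")
clock[0] = 30.0
reg.on_pong("conn-1")      # conn-1 keeps responding; conn-2 goes silent
clock[0] = 70.0
print(reg.reap_zombies())  # ['conn-2']
```

Lowering PONG_TIMEOUT frees descriptors from dead clients faster, at the risk of dropping clients on slow networks; raising it does the opposite.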

Conclusion

Scaling is not just about adding more servers; it's about reducing the state each server needs to hold and ensuring efficient inter-node communication.