Google Docs makes real-time collaboration look easy. Multiple users edit the same document simultaneously, changes appear instantly, and conflicts resolve automatically. After building CollabDesk, a real-time collaborative workspace, I learned that this illusion of simplicity hides extraordinary complexity.
Our first prototype worked perfectly for two users on the same network. By user three, we saw race conditions. By user ten, conflicts corrupted the document state. By user fifty, the server couldn't keep up. We spent six months solving problems that aren't visible in the product but are critical to its reliability.
This is what it takes to build real-time collaboration that actually works at scale.
The Naive Approach
Our initial architecture was simple:
Client A types "hello"
→ Send to server via WebSocket
→ Server broadcasts to all clients
→ Client B receives and renders "hello"
For a single client typing slowly on a fast network, this works. For multiple clients typing simultaneously, it breaks catastrophically.
The failure modes:
Race condition 1: Concurrent edits
- Client A inserts "x" at position 5
- Client B inserts "y" at position 5
- Server receives both, applies in arbitrary order
- Final state depends on network latency and message ordering (see the sketch after this list)
Race condition 2: Stale positions
- Client A has text: "hello world" (cursor at position 9, inside "world")
- Client B deletes "world" → server applies "delete positions 6-11", text is now "hello "
- Before the delete reaches Client A, it types "foo" and sends "insert 'foo' at position 9"
- Server tries to apply the insert at position 9, but the text is now only 6 characters long
- State corrupted
Race condition 3: Lost updates
- Client A and B both edit position 10
- Server processes A's edit first, broadcasts to B
- B receives A's edit, but B's local edit hasn't been acknowledged
- B has two conflicting versions: local optimistic edit and server's broadcast of A's edit
- B doesn't know which to keep
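To make the first race concrete, here's a toy illustration (not our actual code) of how position-based inserts diverge depending on apply order:
function applyInsert(text, pos, str) {
  // Naive position-based insert
  return text.slice(0, pos) + str + text.slice(pos)
}

const base = 'hello'
// Clients A and B concurrently insert 'x' and 'y' at position 5.
// Whichever edit the server happens to apply first yields a different document:
console.log(applyInsert(applyInsert(base, 5, 'x'), 5, 'y')) // 'helloyx'
console.log(applyInsert(applyInsert(base, 5, 'y'), 5, 'x')) // 'helloxy'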
We tried locking. Only one client can edit at a time. Others must wait. This eliminates conflicts but destroys the user experience. Real-time collaboration requires allowing simultaneous edits.
The solution: Conflict-free Replicated Data Types (CRDTs) or Operational Transformation (OT).
Operational Transformation vs. CRDTs
Both OT and CRDTs solve the same problem (concurrent edits) with different approaches.
Operational Transformation:
- Transform operations based on context
- Operation from Client A is transformed to account for Operation from Client B
- Requires a central server to order operations and apply transformations
- Complex but used by Google Docs
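To illustrate what "transform operations" means in practice, here's a minimal sketch of the insert-vs-insert case; real OT has to define a transform like this for every pair of operation types, which is where the complexity lives:
// Transform insert `op` against a concurrent insert `other` so both replicas
// converge no matter which order they apply the two edits.
// Ties at the same position are broken by site id so both sides agree.
function transformInsert(op, other) {
  if (other.position < op.position ||
      (other.position === op.position && other.site < op.site)) {
    return { ...op, position: op.position + other.text.length }
  }
  return op
}

const a = { site: 'A', position: 5, text: 'x' }
const b = { site: 'B', position: 5, text: 'y' }
transformInsert(a, b) // unchanged: insert 'x' at 5
transformInsert(b, a) // shifted: insert 'y' at 6, so both replicas end with "...xy"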
CRDTs:
- Data structures designed to merge automatically
- Each client has a local replica
- Replicas sync eventually, guaranteed to converge
- No central server required (but we used one for simplicity)
- Simpler conceptually, used by Figma and collaborative editors
We chose CRDTs using Yjs, a JavaScript CRDT library. Here's why:
Simplicity: Yjs handles conflict resolution automatically. We didn't have to implement OT transformation logic.
Offline support: CRDTs work offline. Clients sync when reconnected. OT requires a server.
Proven: Yjs is battle-tested in production (used by Notion, Linear, and others).
TypeScript support: First-class TypeScript types, crucial for our codebase.
The trade-off? CRDT data structures use more memory. For text, Yjs stores character-level metadata (unique ID per character). A 10KB document might consume 50KB in memory. For our use case (documents less than 1MB), this was acceptable.
Yjs Architecture
Yjs represents text as a CRDT data structure where each character has a unique ID. Operations don't reference positions (which change). They reference character IDs (which are immutable).
Example:
import * as Y from 'yjs'
// Create a shared document
const ydoc = new Y.Doc()
const ytext = ydoc.getText('content')
// Client A: Insert "hello"
ytext.insert(0, 'hello')
// Internal representation:
// [h:A1, e:A2, l:A3, l:A4, o:A5]
// Client B: Insert " world" at position 5
ytext.insert(5, ' world')
// [h:A1, e:A2, l:A3, l:A4, o:A5, :B1, w:B2, o:B3, r:B4, l:B5, d:B6]
// Client A: Delete "hello" (positions 0-5)
ytext.delete(0, 5)
// [ :B1, w:B2, o:B3, r:B4, l:B5, d:B6]
// Deletion marks characters A1-A5 as deleted but preserves their IDs
When clients sync, they exchange operations (not states). Yjs merges operations using character IDs, not positions. This automatically resolves conflicts.
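A standalone sketch of that convergence, using two independent Y.Doc replicas and exchanging updates directly (no network or server involved):
import * as Y from 'yjs'

const docA = new Y.Doc()
const docB = new Y.Doc()

// Concurrent edits on two replicas that haven't synced yet
docA.getText('content').insert(0, 'hello')
docB.getText('content').insert(0, 'world')

// Exchange updates in both directions
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA))
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB))

// Both replicas now hold identical text; Yjs decides the relative order of
// the two concurrent inserts deterministically, not by arrival order
console.log(docA.getText('content').toString() === docB.getText('content').toString()) // true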
Our WebSocket integration:
import { WebsocketProvider } from 'y-websocket'
// Client-side
const ydoc = new Y.Doc()
const provider = new WebsocketProvider(
  'wss://server.com',
  'document-id',
  ydoc
)
// That's it! Yjs handles sync automatically
Yjs broadcasts local changes to the server, and the server broadcasts to other clients. The server doesn't need to understand CRDT semantics. It's a dumb relay. All conflict resolution happens client-side.
WebSocket Connection Management
Our server initially used Node.js with the ws library:
const WebSocket = require('ws')
const wss = new WebSocket.Server({ port: 8080 })
wss.on('connection', (ws, req) => {
  const docId = req.url.split('/')[1]
  // Remember which document this socket belongs to
  ws.docId = docId
  ws.on('message', (message) => {
    // Broadcast to all other clients in this document
    wss.clients.forEach((client) => {
      if (client !== ws && client.docId === docId && client.readyState === WebSocket.OPEN) {
        client.send(message)
      }
    })
  })
  ws.on('close', () => {
    console.log('Client disconnected')
  })
})
This worked for 10 users. At 50 users, the server started dropping messages. At 100 users, it crashed.
The problems:
Problem 1: Lack of backpressure
Broadcasting to 100 clients means 100 socket writes per message. If the server can't write fast enough, messages queue in memory. At high throughput, memory usage explodes.
Problem 2: No connection pooling
Each WebSocket holds an open TCP connection. 1,000 users = 1,000 open connections. The default file descriptor limit (typically around 1,024 on Linux) caps how many sockets the process can hold open. We hit the limit.
Problem 3: No message batching
Every keystroke triggered a broadcast. With 50 users typing, that's 50 broadcasts per second minimum. The server spent all its time broadcasting.
The fixes:
Fix 1: Backpressure handling
const MAX_BUFFER = 64 * 1024 // threshold for a "slow" client; 64KB here is an arbitrary example value
ws.on('message', (message) => {
  wss.clients.forEach((client) => {
    if (client !== ws && client.readyState === WebSocket.OPEN) {
      if (client.bufferedAmount > MAX_BUFFER) {
        // Client is slow, skip this message
        console.warn('Skipping message to slow client')
      } else {
        client.send(message)
      }
    }
  })
})
We skip messages to clients that can't keep up. This prevents one slow client from consuming all server memory.
Fix 2: Increase file descriptor limit
ulimit -n 10000
This allows 10,000 concurrent connections. We also optimized connection pooling to reuse closed sockets.
Fix 3: Message batching
Instead of broadcasting every message immediately, we batch messages and send every 50ms:
const messageQueue = []

ws.on('message', (message) => {
  messageQueue.push(message)
})

// Flush the queue every 50ms as a single batch
setInterval(() => {
  if (messageQueue.length > 0) {
    const batch = messageQueue.splice(0, messageQueue.length)
    wss.clients.forEach((client) => {
      if (client.readyState === WebSocket.OPEN) {
        // Clients receive an array of updates and apply each entry in order
        client.send(JSON.stringify(batch))
      }
    })
  }
}, 50)
This reduced broadcast frequency from 50+ per second to 20 per second (one batch every 50ms). CPU usage dropped 40%.
Persistence and Recovery
CRDTs handle live synchronization, but what happens when all clients disconnect? The server needs to persist the document state.
We used Redis for in-memory persistence:
const redis = require('redis')
const client = redis.createClient()

wss.on('connection', (ws, req) => {
  const docId = req.url.split('/')[1]
  // Replay stored updates so the new client catches up
  client.lrange(`document:${docId}:updates`, 0, -1, (err, updates) => {
    updates.forEach((update) => {
      ws.send(update)
    })
  })
  ws.on('message', (message) => {
    const update = message // Yjs update
    // Append the update to this document's log in Redis
    client.rpush(`document:${docId}:updates`, update)
    // Broadcast to other clients
    broadcast(docId, update)
  })
})
This approach stores individual updates. Over time, update history grows large. We implemented periodic snapshotting:
// Every 5 minutes, compact each open document's update log into a snapshot
function compactDocument(docId) {
  client.lrange(`document:${docId}:updates`, 0, -1, (err, updates) => {
    // Apply all updates to reconstruct the current state
    const ydoc = new Y.Doc()
    updates.forEach((update) => {
      Y.applyUpdate(ydoc, Buffer.from(update))
    })
    // Encode current state as a single snapshot update
    const snapshot = Y.encodeStateAsUpdate(ydoc)
    // Replace update history with the snapshot
    client.del(`document:${docId}:updates`)
    client.rpush(`document:${docId}:updates`, snapshot)
  })
}

setInterval(() => {
  // openDocumentIds: however the server tracks documents with connected clients (illustrative)
  openDocumentIds.forEach(compactDocument)
}, 5 * 60 * 1000)
This keeps Redis storage manageable. A document with 10,000 updates compresses to a single snapshot less than 100KB.
Handling Network Failures
Real networks have failures. Clients disconnect, reconnect, and lose messages. Our system needed to handle:
Temporary disconnect: Client loses connection for 5 seconds, reconnects. Yjs should sync missed updates.
Prolonged disconnect: Client is offline for an hour, accumulates local changes. On reconnect, sync with server.
Conflicting offline edits: Two clients edit offline, reconnect simultaneously. CRDTs resolve conflicts.
The Yjs WebSocket provider handles this automatically, resyncing state whenever the connection comes back:
const provider = new WebsocketProvider('wss://server.com', docId, ydoc)
provider.on('status', (event) => {
  if (event.status === 'connected') {
    // Yjs automatically syncs missed updates
    console.log('Synced with server')
  }
})
On reconnect, Yjs sends its local state and receives the server's state. Missing updates are applied, and the document converges.
We tested this by:
- Client A and B online, synced
- Disconnect Client B
- Client A edits the document
- Reconnect Client B
- Verify Client B receives Client A's edits
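The same scenario can be reproduced without a network by using Yjs state vectors, which is roughly what the provider does under the hood on reconnect (a simplified sketch):
import * as Y from 'yjs'

const docA = new Y.Doc()
const docB = new Y.Doc()

// Both clients start in sync
docA.getText('content').insert(0, 'shared')
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA))

// "Disconnect" B; A keeps editing
docA.getText('content').insert(6, ' state')

// On "reconnect", B's state vector tells A exactly which updates B is missing
const missing = Y.encodeStateAsUpdate(docA, Y.encodeStateVector(docB))
Y.applyUpdate(docB, missing)

console.log(docB.getText('content').toString()) // 'shared state'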
Yjs passed every test. This is the power of mature CRDT libraries. You get correct conflict resolution without implementing it yourself.
Scaling to 200+ Users
At 200 concurrent users, our single Node.js server couldn't handle the load. We needed horizontal scaling.
The challenge: WebSocket connections are stateful. If Client A connects to Server 1 and Client B connects to Server 2, how do they sync?
We used Redis Pub/Sub:
Client A → Server 1 → Redis Pub/Sub → Server 2 → Client B
Each server subscribes to a Redis channel. When a client sends an update, the server publishes it to Redis. All servers (including the sender) receive the update and broadcast to their connected clients.
const redis = require('redis')
const subscriber = redis.createClient()
const publisher = redis.createClient()
// Each server subscribes to the channel for every document it hosts
// (shown here for a single docId)
subscriber.subscribe(`document:${docId}`)

subscriber.on('message', (channel, message) => {
  // Broadcast to all local clients
  wss.clients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(message)
    }
  })
})

ws.on('message', (message) => {
  // Publish to Redis so every server (including this one) relays it
  publisher.publish(`document:${docId}`, message)
})
This allows multiple servers to handle clients for the same document. We deployed 5 servers behind a load balancer, each handling ~50 connections. Total capacity: 250 concurrent users.
The cost increased:
- Single server: $20/month (1 EC2 instance)
- Multi-server + Redis: $50/month (5 EC2 instances + ElastiCache)
For 200+ users, this was justified.
Presence and Cursors
Real-time collaboration isn't just document edits. It's also presence (who's online) and cursor positions (where others are editing).
Yjs provides an awareness API:
const awareness = provider.awareness
// Set local user info
awareness.setLocalStateField('user', {
  name: 'Alice',
  color: '#ff0000'
})
awareness.setLocalStateField('cursor', {
  position: 42,
  selection: [42, 50]
})
// Listen to remote users
awareness.on('change', () => {
  const states = awareness.getStates()
  states.forEach((state, clientId) => {
    console.log(`User ${state.user.name} is at position ${state.cursor.position}`)
  })
})
Awareness data is ephemeral (not persisted). It's broadcast like document updates but doesn't affect document state.
We used this to render:
- User avatars showing who's online
- Colored cursors showing where each user is editing
- Selection highlights
The UX impact was massive. Seeing others' cursors made collaboration feel real. It also prevented conflicts because users avoided editing the same region simultaneously.
Performance Optimization
At scale, we optimized aggressively:
1. Compress messages
Yjs updates can be large. We used gzip compression:
const zlib = require('zlib')
ws.on('message', (message) => {
  zlib.gzip(message, (err, compressed) => {
    if (err) return console.error('Compression failed', err)
    publisher.publish(`document:${docId}`, compressed)
  })
})
This reduced bandwidth by 60%.
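The receiving path has to undo the compression before relaying to browsers. A sketch, assuming the Redis clients are created with return_buffers: true so the binary payload survives the round trip:
subscriber.on('message', (channel, compressed) => {
  zlib.gunzip(compressed, (err, message) => {
    if (err) return console.error('Failed to decompress update', err)
    // Relay the original Yjs update to local clients
    wss.clients.forEach((client) => {
      if (client.readyState === WebSocket.OPEN) {
        client.send(message)
      }
    })
  })
})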
2. Debounce cursor updates
Cursor positions change on every keystroke. Broadcasting every cursor movement floods the network. We debounced:
let cursorTimer
editor.on('selectionChange', (position) => {
  clearTimeout(cursorTimer)
  cursorTimer = setTimeout(() => {
    awareness.setLocalStateField('cursor', { position })
  }, 100)
})
Cursor updates now send at most 10 times per second instead of 50+.
3. Lazy load historical updates
Instead of sending all historical updates on connect, we send the latest snapshot plus the last 100 updates. If the client needs older history, it requests explicitly.
This reduced initial load time from 3 seconds to 300ms.
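A sketch of that connect path. It assumes the compaction job is adapted to write the latest snapshot to its own document:<docId>:snapshot key (a hypothetical variation on the Redis layout shown earlier), so the server can send the snapshot plus only the recent tail of the log:
wss.on('connection', (ws, req) => {
  const docId = req.url.split('/')[1]
  // Hypothetical key: latest snapshot written by the compaction job
  client.get(`document:${docId}:snapshot`, (err, snapshot) => {
    if (snapshot) ws.send(snapshot)
    // Negative indices select the last 100 entries of the update log;
    // older history is only fetched if the client explicitly asks for it
    client.lrange(`document:${docId}:updates`, -100, -1, (err, updates) => {
      updates.forEach((update) => ws.send(update))
    })
  })
})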
Cost Analysis
Running CollabDesk for 200 users:
Infrastructure:
- 5 EC2 instances (t3.medium): $35/month
- ElastiCache (Redis): $15/month
- Load balancer: $20/month
- S3 storage (document snapshots): $2/month
Total: $72/month
For context, our first prototype cost $20/month (single server). Scaling to 200 users increased costs 3.6x. Revenue per user was $5/month, so 200 users generated $1,000/month. Infrastructure was 7% of revenue.
This is sustainable. At 1,000 users, we'd scale to 20 servers (~$200/month infrastructure) and $5,000/month revenue. Infrastructure remains less than 5% of revenue.
What Went Wrong
Despite careful engineering, we had production incidents:
Incident 1: Redis memory exhaustion
A single document accumulated 50,000 updates (unusual but possible). Redis ran out of memory, crashed, and took down all servers. We lost 10 minutes of edit history.
Fix: Implement update compaction more aggressively (every 1 minute instead of 5 minutes).
Incident 2: Infinite loop in CRDT merge
A corrupted update caused Yjs to enter an infinite loop, pegging CPU at 100%. The server became unresponsive.
Fix: Wrap Yjs operations in try-catch and isolate document processing (one document's failure shouldn't crash the server).
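A sketch of that guard; the recovery action (drop the update, flag the document, page someone) is up to you:
// Apply a remote update defensively so one bad document can't take the
// whole process down
function safeApplyUpdate(ydoc, update, docId) {
  try {
    Y.applyUpdate(ydoc, update)
    return true
  } catch (err) {
    console.error(`Rejected corrupted update for document ${docId}`, err)
    return false
  }
}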
Incident 3: Network partition
Two servers disconnected from Redis for 30 seconds. Clients on different servers diverged. On reconnect, CRDTs merged, but some edits appeared in unexpected order.
Fix: None. This is a fundamental limitation of distributed systems. CRDTs guarantee eventual consistency, not order preservation during partitions.
Lessons Learned
CRDTs are complex but worth it: Implementing OT would have taken months. Yjs worked in days.
Test under load: Our single-server prototype worked perfectly. Scaling revealed race conditions we never anticipated.
WebSockets require care: Backpressure, connection limits, and message batching aren't optional. They're critical for reliability.
Monitor everything: We added Prometheus metrics for WebSocket connections, message throughput, Redis latency, and CRDT operation time. These metrics caught issues before users noticed.
Persistence is non-trivial: Snapshotting, compaction, and recovery logic took longer to implement than the core collaboration feature.
Expect the unexpected: Network partitions, corrupted updates, and edge cases happen in production. Design for failure.
The Future: Peer-to-Peer Collaboration
Our architecture uses a central server for simplicity. But CRDTs enable peer-to-peer sync. Clients could connect directly using WebRTC, eliminating the server entirely.
Benefits:
- Lower latency (no server hop)
- Lower cost (no server infrastructure)
- Better privacy (data stays on clients)
Challenges:
- NAT traversal (not all clients can connect directly)
- Persistence (who stores the canonical state?)
- Discovery (how do clients find each other?)
We're experimenting with a hybrid architecture: peer-to-peer for active editing, a server for persistence and discovery.
Closing Thoughts
Real-time collaboration is one of the most technically interesting problems in web development. It combines distributed systems, networking, conflict resolution, and performance optimization. It looks effortless from the user's perspective but requires extraordinary engineering.
CollabDesk taught me that the hard part isn't building a prototype. It's making it work reliably at scale, handling failures gracefully, and maintaining performance under load.
CRDTs made this achievable. Without Yjs, we would still be debugging OT edge cases. With Yjs, we focused on the unique aspects of our product instead of reinventing collaboration primitives.
If you're building real-time collaboration, use a mature CRDT library. The time you save is worth the trade-offs. And test under load early. The problems you discover will shape your architecture.
Real-time collaboration at scale isn't easy. But it's solvable. And when it works, it feels like magic.