
Design a Real-Time Collaborative Document Editing System (like Google Docs)

Problem Statement:

You need to design a cloud-based real-time collaborative document editor that allows multiple users to edit the same document simultaneously. The system should support real-time updates, conflict resolution, and offline editing.


Requirements:

Functional Requirements:

  1. Users can create, edit, and delete documents.
  2. Multiple users can edit the same document concurrently and see real-time changes.
  3. The system should handle conflicting edits and merge them efficiently.
  4. Users should be able to view document history and revert to previous versions.
  5. Users should be able to edit documents offline, and changes should sync once they’re back online.
  6. The system should be fast, even for large documents with thousands of edits per second.

Non-Functional Requirements:

  1. Low latency (sub-100ms updates for real-time collaboration).
  2. Scalability: The system should support millions of users simultaneously.
  3. Fault tolerance: Ensure minimal data loss even if servers crash.
  4. Security: Handle role-based access control (RBAC) for documents (read-only, edit, admin).
  5. High availability: 99.99% uptime with geo-redundancy.
  6. Efficient storage: Maintain versions without excessive data duplication.

What I Expect:

  1. A quick architecture diagram on https://excalidraw.com/ outlining the major blocks.
  2. Any database design / schema.

Discussion Points:

  1. Data Model & Storage
    • How will you store documents? (SQL vs NoSQL, CRDT-based storage, event logs)
    • How do you efficiently store document history without excessive duplication?
  2. Concurrency & Conflict Resolution
    • Which technique would you use to merge concurrent edits? (CRDTs, Operational Transformations, or custom locking mechanisms)
    • How do you handle two users making changes to the same word at the same time?
  3. Real-Time Synchronization
    • How would you design the event propagation system for real-time updates?
    • Would you use WebSockets, Server-Sent Events (SSE), or gRPC?
  4. Offline Editing & Sync
    • How do you handle offline users editing a document and later syncing changes?
    • How would you resolve conflicts when an offline user makes a conflicting edit?
  5. Scalability & Performance
    • How do you scale the system to handle millions of documents and users?
    • Would you use sharding, leader-based consensus, or event-driven architecture?
    • How would you ensure that real-time updates are fast and efficient?
  6. Security & Access Control
    • How do you prevent unauthorized users from modifying a document?
    • How do you implement role-based access control (RBAC)?
  7. Versioning & Undo Feature
    • How do you store document history efficiently?
    • Would you store delta changes or full snapshots?

Key Points

  • A real-time collaborative document editing system can be built with ShareJS for real-time updates and MongoDB for storage, supporting concurrent editing and offline syncing.
  • Operational Transformation (OT) handles conflicts from concurrent edits. Conflict-Free Replicated Data Types (CRDTs), via libraries such as Automerge, could simplify offline editing but may need additional conflict resolution for text.
  • Sharding plus a pub-sub system such as Redis provides the scalability, low latency, and high availability needed for millions of users.

System Overview

This design aims to create a cloud-based real-time collaborative document editor, similar to Google Docs, that supports multiple users editing simultaneously, handles conflicts, and allows offline editing with sync capabilities. The system will be fast, scalable, and secure, meeting all functional and non-functional requirements.

Architecture

The system architecture includes:

  • Clients: Web browsers using WebSockets for real-time communication.
  • Load Balancer: Distributes connections to appropriate shards based on document ID for scalability.
  • Shards: Each shard handles a subset of documents, with multiple server instances running ShareJS for real-time collaboration.
  • Database: A MongoDB cluster stores document data, with replication for fault tolerance.
  • Pub-Sub System: Redis facilitates real-time updates within shards, ensuring low latency.

For offline editing, clients queue local operations and sync them when reconnected, leveraging ShareJS’s OT for conflict resolution.

Database Design

The database schema includes:

  • documents collection: Stores current document state and version.
  • permissions collection: Manages role-based access control (RBAC) for users.
  • operations collection: Logs all operations for versioning and undo functionality.

This design ensures efficient storage and quick access to document history without excessive duplication.



Detailed System Design and Analysis

This section provides a comprehensive analysis of designing a real-time collaborative document editing system, addressing all requirements and discussion points. The design leverages established technologies and methodologies to ensure scalability, performance, and user experience.

System Requirements and Design Goals

The system must support:

  • Functional Requirements: Creation, editing, and deletion of documents; real-time concurrent editing; conflict resolution; document history and versioning; offline editing with sync; high performance for large documents.
  • Non-Functional Requirements: Low latency (sub-100ms updates), scalability for millions of users, fault tolerance, security with RBAC, high availability (99.99% uptime), and efficient storage.

The design balances these requirements by using Operational Transformation (OT) for real-time collaboration and client-side operation queues for offline editing.

Data Model and Storage

Storage Strategy

The system uses ShareJS, which implements OT, for real-time collaboration. Documents are stored in a MongoDB cluster for scalability and fault tolerance. The storage strategy involves:

  • Document State: Stored as JSON objects in the documents collection, with each document having a type (e.g., “text”) and current data.
  • Operation History: Maintained in the operations collection for versioning and undo, logging each operation with details like document ID, operation data, timestamp, and user ID.
  • Snapshots: Considered for efficiency, where periodic snapshots of the document state are stored to reduce the need to replay long operation logs for historical versions.

Database Schema

The database design is as follows:

Collection | Fields | Description
documents | _id (string), type (string), data (object), v (integer) | Stores current document state and version number
permissions | _id (string), users (array of objects with user_id and role) | Manages RBAC for each document
operations | _id (string), document_id (string), operation_data (object), timestamp (date), user_id (string) | Logs all operations for versioning and undo

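This schema supports efficient storage and retrieval: an index on document_id gives quick access to a document's operations, and permissions can be looked up by document ID as well (assuming each permissions record's _id is the ID of the document it governs).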
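
To make the schema concrete, here is a minimal sketch of one record per collection as plain Python dictionaries. The field names come from the schema above; the sample values, the shape of data and operation_data, the helper name operations_for, and the assumption that a permissions record shares its _id with the document are all illustrative rather than part of the design.

```python
from datetime import datetime, timezone

# One sample record per collection. All values are made up for illustration.
document = {"_id": "doc-1", "type": "text", "data": {"text": "Hello"}, "v": 2}

# Assumption: a permissions record reuses the document's ID as its own _id.
permissions = {
    "_id": "doc-1",
    "users": [
        {"user_id": "alice", "role": "admin"},
        {"user_id": "bob", "role": "read-only"},
    ],
}

operations = [
    {"_id": "op-2", "document_id": "doc-1", "operation_data": {"insert": "o", "pos": 4},
     "timestamp": datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc), "user_id": "alice"},
    {"_id": "op-1", "document_id": "doc-1", "operation_data": {"insert": "Hell", "pos": 0},
     "timestamp": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc), "user_id": "alice"},
    {"_id": "op-9", "document_id": "doc-2", "operation_data": {"insert": "x", "pos": 0},
     "timestamp": datetime(2024, 1, 1, 12, 0, 1, tzinfo=timezone.utc), "user_id": "bob"},
]


def operations_for(ops, document_id):
    """Return one document's operations in timestamp order.

    In the database this is the query that the document_id index serves.
    """
    matching = [op for op in ops if op["document_id"] == document_id]
    return sorted(matching, key=lambda op: op["timestamp"])


history = operations_for(operations, "doc-1")
```

Here history holds the two doc-1 operations, oldest first, which is the order needed to replay them.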

SQL vs. NoSQL

MongoDB was chosen over SQL due to its flexibility with JSON-like documents and scalability for handling large volumes of concurrent writes and reads, essential for real-time collaboration.

CRDT-Based Storage

Initially, CRDTs like Automerge were considered for their offline-first capabilities and conflict-free merging. However, for real-time text editing, OT was preferred due to better handling of concurrent edits without manual conflict resolution, which Automerge might require for text overlaps. Automerge remains a viable option for offline editing, but the design leans toward OT for consistency.

Concurrency and Conflict Resolution

Technique Selection

  • Operational Transformation (OT): Chosen for real-time collaboration, as implemented by ShareJS. OT transforms concurrent operations to maintain document consistency, so that when two users edit the same word simultaneously, the server rewrites their operations and both changes are merged.
  • Conflict Resolution: OT handles conflicts by transforming operations based on their order and position, preserving all changes without loss. For example, if User A inserts text at position 5 while User B concurrently deletes the character at position 5, the server transforms B's delete against A's insert, shifting it right by the length of the inserted text, so both edits apply and every replica converges to the same text.
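
The insert-versus-delete case can be sketched in a few lines. This is a toy transform for plain strings, not ShareJS's implementation: the Insert and Delete classes and the apply and transform functions are hypothetical names, deletes remove exactly one character, and the insert/insert and delete/delete cases are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int  # index of the single character to remove


def apply(doc: str, op) -> str:
    """Apply one operation to a string and return the new string."""
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op, applied):
    """Rewrite `op` so it has the same effect after `applied` has already run."""
    if isinstance(op, Delete) and isinstance(applied, Insert):
        # An insert at or before the deleted character pushes it to the right.
        return Delete(op.pos + len(applied.text)) if applied.pos <= op.pos else op
    if isinstance(op, Insert) and isinstance(applied, Delete):
        # A delete strictly before the insertion point pulls it to the left.
        return Insert(op.pos - 1, op.text) if applied.pos < op.pos else op
    raise NotImplementedError("insert/insert and delete/delete are omitted from this sketch")


doc = "Hello world"
a = Insert(5, ",")  # User A inserts a comma at position 5
b = Delete(5)       # User B deletes the space at position 5

# Both application orders must end in the same text.
via_a_first = apply(apply(doc, a), transform(b, a))
via_b_first = apply(apply(doc, b), transform(a, b))
```

Both orders produce "Hello,world": the comma is kept and the space is removed, regardless of which operation the server sees first.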

Comparison with CRDTs

CRDTs were evaluated, particularly Automerge, for their decentralized merging capabilities. However, for text editing, CRDTs might preserve conflicting edits (e.g., both insertions at the same position), requiring application-level resolution, which could disrupt real-time flow. OT’s centralized approach ensures a single consistent view, making it more suitable.

Real-Time Synchronization

Event Propagation System

  • WebSockets: Used for bidirectional communication, enabling clients to send edits and receive updates in real-time. Each document has a unique channel in the pub-sub system (Redis), ensuring updates are broadcast to all connected users.
  • Pub-Sub Implementation: Redis facilitates efficient message passing within each shard, with server instances subscribing to document channels to propagate changes, which helps keep propagation within the sub-100ms target.
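
The per-document channel idea can be sketched without any infrastructure. The class below is an in-memory stand-in for the Redis pub-sub layer, not Redis client code; the DocumentChannels name, the "doc:<id>" channel naming, and the message shape are assumptions made for illustration.

```python
from collections import defaultdict


class DocumentChannels:
    """In-memory stand-in for per-document pub-sub channels (Redis in the design)."""

    def __init__(self):
        # channel name -> callbacks, one per subscribed server instance or socket
        self._subscribers = defaultdict(list)

    @staticmethod
    def channel_name(doc_id: str) -> str:
        # Assumed naming scheme: one channel per document.
        return f"doc:{doc_id}"

    def subscribe(self, doc_id: str, callback) -> None:
        self._subscribers[self.channel_name(doc_id)].append(callback)

    def publish(self, doc_id: str, message: dict) -> int:
        """Deliver `message` to every subscriber of the document; return how many received it."""
        callbacks = self._subscribers.get(self.channel_name(doc_id), [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


channels = DocumentChannels()
alice_inbox, bob_inbox = [], []
channels.subscribe("doc-1", alice_inbox.append)
channels.subscribe("doc-1", bob_inbox.append)

# A server instance that accepted an edit publishes it once; every subscriber gets a copy.
delivered = channels.publish("doc-1", {"v": 4, "op": {"insert": "!", "pos": 5}})
```

In the real system each callback would write to a client's WebSocket, and subscribers of other documents would never see the message.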
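
Technology Choice

WebSockets were preferred over Server-Sent Events (SSE) and gRPC. SSE is one-way (server to client), so client edits would need a separate HTTP channel, and gRPC, although it supports bidirectional streaming, cannot be used directly from browsers without a translation layer such as gRPC-Web. gRPC could still be considered for high-performance backend communication, but WebSockets align better with browser-based clients.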

Offline Editing and Sync

Handling Offline Users

  • When offline, clients queue local operations using ShareJS’s client-side capabilities, storing them locally. Upon reconnection, these operations are sent to the server.
  • The server applies these operations, transforming them based on the current document state to handle any intervening changes, ensuring consistency.

Conflict Resolution for Offline Edits

  • The server uses OT to merge offline operations with the current state. If conflicts arise (e.g., offline user edited the same part as online users), OT transforms the operations to resolve them, maintaining document integrity.
  • This approach ensures that offline edits are not lost and are seamlessly integrated, with the server broadcasting the updated state to all connected clients.
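
A rough sketch of the reconnect step follows, under simplifying assumptions: operations are inserts only, the server wins position ties, and the client remembers the version it went offline at. The names Insert, transform, and rebase are hypothetical and do not correspond to ShareJS APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert, other_wins_ties: bool) -> Insert:
    """Rewrite `op` so it can run after `other`; both start from the same document state."""
    if other.pos < op.pos or (other.pos == op.pos and other_wins_ties):
        return Insert(op.pos + len(other.text), op.text)
    return op


def rebase(queued, concurrent):
    """Rebase the client's offline queue over the server operations it missed.

    `concurrent` holds the server operations committed since the client's last
    known version. Each queued operation is transformed over them, and they are
    transformed over the queued operation in turn so the next one lines up.
    """
    rebased = []
    for op in queued:
        carried_forward = []
        for server_op in concurrent:
            rebased_op = transform(op, server_op, other_wins_ties=True)
            carried_forward.append(transform(server_op, op, other_wins_ties=False))
            op = rebased_op
        rebased.append(op)
        concurrent = carried_forward
    return rebased


# Client went offline at "abc" and typed "12" at the end.
offline_queue = [Insert(3, "1"), Insert(4, "2")]
# Meanwhile another user prepended "X" on the server.
missed_server_ops = [Insert(0, "X")]

server_text = "Xabc"
for op in rebase(offline_queue, missed_server_ops):
    server_text = apply(server_text, op)
```

After the sync, server_text is "Xabc12": the offline edits land after the same characters they followed on the client. A production system would also need to handle deletes and persist the queue across restarts.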

Scalability and Performance

Scaling Strategy

  • Sharding: Documents are distributed across multiple shards based on document ID, with each shard handling a subset. This distributes load and ensures scalability for millions of users and documents.
  • Leader-Based Consensus: Each shard has a primary server instance for document updates, with secondary instances for failover, ensuring consistency and availability.
  • Event-Driven Architecture: The pub-sub system (Redis) enables event-driven updates, reducing server load by broadcasting changes efficiently.
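
Routing by document ID can be as simple as hashing the ID and taking it modulo the shard count. The sketch below shows that idea; the shard_for name and the choice of SHA-256 are assumptions, and the design does not specify a hashing scheme.

```python
import hashlib


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document ID to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash (unlike Python's built-in hash) gives the same answer on every server.
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Every edit to the same document lands on the same shard.
first = shard_for("doc-1", 16)
second = shard_for("doc-1", 16)
```

Plain modulo hashing remaps most documents whenever the shard count changes. Consistent hashing is a common way to limit that movement, at the cost of a more involved lookup.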

Ensuring Fast Updates

  • Low latency is achieved by routing users to the nearest data center (which the geo-redundant deployment makes possible) and using WebSockets for real-time communication. Redis keeps its data in memory, so message passing adds little delay, which helps meet the sub-100ms requirement.
  • For large documents, ShareJS’s OT implementation is optimized for frequent updates, with periodic snapshots reducing the need for full operation replays.

Security and Access Control

Preventing Unauthorized Access

  • All communications are encrypted using HTTPS for web traffic and secure WebSockets, ensuring data privacy.
  • Authentication is handled through an identity provider, with user sessions validated before allowing operations.

Implementing RBAC

  • The permissions collection stores user roles (read-only, edit, admin) for each document. Before applying operations, the server checks the user’s role, denying unauthorized actions. This ensures fine-grained access control, meeting security requirements.
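
The role check itself is small. In the sketch below, the three roles come from the requirements, while the action names ("read", "write", "manage"), the role-to-action mapping, and the is_allowed helper are assumptions chosen for illustration.

```python
# Assumed mapping from role to permitted actions; only the role names come from the design.
ROLE_ACTIONS = {
    "read-only": {"read"},
    "edit": {"read", "write"},
    "admin": {"read", "write", "manage"},
}


def is_allowed(permissions: dict, user_id: str, action: str) -> bool:
    """Return True if the user's role on this document permits the action.

    Users without an entry, and unknown roles, are denied by default.
    """
    for entry in permissions["users"]:
        if entry["user_id"] == user_id:
            return action in ROLE_ACTIONS.get(entry["role"], set())
    return False


doc_permissions = {
    "_id": "doc-1",
    "users": [
        {"user_id": "alice", "role": "admin"},
        {"user_id": "bob", "role": "read-only"},
    ],
}

# The server would run this check before applying an incoming operation.
bob_can_write = is_allowed(doc_permissions, "bob", "write")
```

Denying by default means a missing or malformed permissions entry blocks the edit instead of letting it through.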

Versioning and Undo Feature

Efficient Storage of History

  • Document history is maintained in the operations collection, logging each edit with timestamp and user ID. This allows replaying operations to reconstruct any version, supporting undo functionality.
  • To optimize storage, periodic snapshots are stored in the documents collection, reducing the need to process long operation logs for historical access.

Delta Changes vs. Full Snapshots

  • The system uses delta changes (operations) for real-time updates, stored in the operation log. Full snapshots are taken at intervals (e.g., every 1000 operations) to balance storage efficiency and quick access, ensuring users can revert to previous versions without excessive computation.
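
Reconstructing a past version then means loading the nearest earlier snapshot and replaying only the operations after it. The sketch below assumes operations are stored as simple tuples and that a snapshot always exists for version 0; the tuple format and the function names are illustrative.

```python
def apply(text: str, op: tuple) -> str:
    """Apply ("insert", pos, string) or ("delete", pos, length) to `text`."""
    kind, pos, arg = op
    if kind == "insert":
        return text[:pos] + arg + text[pos:]
    if kind == "delete":
        return text[:pos] + text[pos + arg:]
    raise ValueError(f"unknown operation kind: {kind}")


def text_at_version(snapshots: dict, ops: list, version: int) -> str:
    """Rebuild the document as of `version`.

    `snapshots` maps a version number to the full text at that version and must
    contain version 0. `ops[i]` is the operation that turns version i into i + 1.
    """
    base = max(v for v in snapshots if v <= version)
    text = snapshots[base]
    for op in ops[base:version]:  # replay only what the snapshot does not cover
        text = apply(text, op)
    return text


ops = [("insert", 0, "a"), ("insert", 1, "b"), ("insert", 2, "c"), ("delete", 0, 1)]
snapshots = {0: "", 2: "ab"}  # a snapshot every 2 operations instead of every 1000

restored = text_at_version(snapshots, ops, 3)
```

Here restored is "abc", rebuilt from the version 2 snapshot plus one replayed operation. A larger snapshot interval saves storage but lengthens the worst-case replay.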

Hybrid Approach Consideration

While OT is central to real-time collaboration, the initial exploration of CRDTs like Automerge highlights a potential hybrid approach for offline editing, where CRDTs could simplify syncing but require additional conflict resolution for text. This dual consideration adds flexibility but increases complexity, which was ultimately resolved by favoring OT for consistency.

Conclusion

This design leverages ShareJS for OT-based real-time collaboration, MongoDB for scalable storage, and Redis for efficient pub-sub, ensuring low latency, high availability, and support for offline editing. The sharding mechanism and RBAC implementation meet scalability and security needs, with operation logs and snapshots providing robust versioning and undo features.
