Posted on

Design of a Real-Time Collaborative Document Editing System (like Google Docs)

Below is a comprehensive design for a cloud-based real-time collaborative document editing system that meets the specified functional and non-functional requirements. The system allows multiple users to edit documents simultaneously, supports offline editing with synchronization, and scales to millions of users while ensuring security, low latency, and efficient storage.


Solution Overview

The system is designed to provide a seamless collaborative editing experience similar to Google Docs, addressing challenges such as real-time synchronization, conflict resolution, offline support, scalability, and security. Here’s how each component is architected to meet the requirements.


Key Components and Design Decisions

1. Data Model & Storage

  • Document Storage:
    Documents are stored using a Conflict-Free Replicated Data Type (CRDT)-based system (e.g., Yjs or Automerge). CRDTs represent the document as a sequence of operations (e.g., insert, delete, update) rather than plain text, enabling efficient merging of concurrent edits. This approach ensures that multiple users can edit the same document without conflicts.
  • Versioning Storage:
    An event log records every operation applied to the document, allowing reconstruction of any previous version by replaying the log. To optimize performance and avoid replaying the entire log for older versions, periodic snapshots of the document state are saved (e.g., every 100 operations or at fixed time intervals). This balances storage efficiency with fast version retrieval.
  • Database Choice:
    A NoSQL database (e.g., DynamoDB or MongoDB) is used to persist the CRDT data and event logs, as it supports high write throughput and horizontal scaling better than traditional SQL databases.

Why?
CRDTs natively handle concurrent edits, and the event log with snapshots minimizes storage overhead while enabling efficient versioning.


2. Concurrency & Conflict Resolution

  • Technique:
    Use operation-based CRDTs (e.g., Logoot or Treedoc) to manage concurrent edits. Each edit is an operation with a unique identifier and timestamp, ensuring that operations can be applied in any order and still converge to the same document state.
  • Handling Same-Word Edits:
    If two users edit the same word simultaneously (e.g., User A inserts “x” at position 5, and User B deletes character 5), the CRDT assigns each operation a unique identifier derived from the user ID and a logical timestamp. Because operations reference these identifiers rather than raw indexes, the merge preserves both edits: “x” is inserted, and the delete still removes the character it originally targeted, so neither change is lost. A short convergence sketch follows this section.

Why?
CRDTs simplify conflict resolution compared to Operational Transformations (OT) and eliminate the need for locking, providing a seamless user experience under high concurrency.
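
To make the convergence concrete, here is a minimal sketch using Yjs (one of the libraries mentioned above). The two replicas exchange updates in memory here, whereas the real system would ship the same updates over the network.

// Minimal convergence demo with Yjs (npm install yjs).
// Two in-memory replicas edit concurrently, exchange updates, and converge.
const Y = require('yjs');

const docA = new Y.Doc();
const docB = new Y.Doc();

docA.getText('content').insert(0, 'hello world');
// Bring B up to date with A's initial state.
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));

// Concurrent edits: A inserts at position 5, B deletes the character at position 5.
docA.getText('content').insert(5, 'x');
docB.getText('content').delete(5, 1);

// Exchange updates in both directions (the server would broadcast these).
Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA));
Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB));

// Both replicas converge to the same string, regardless of apply order.
console.log(docA.getText('content').toString() === docB.getText('content').toString()); // true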


3. Real-Time Synchronization

  • Communication Protocol:
    Use WebSockets for real-time, bidirectional communication between clients and the server. When a user makes an edit, the operation is sent to the server via WebSocket, which broadcasts it to all connected clients editing the same document. Clients apply the operation to their local CRDT state.
  • Event Propagation:
    The server acts as a central coordinator, receiving operations from clients and pushing them to others in near real-time (targeting sub-100ms latency). WebSocket connections are maintained for each active user per document.

Why?
WebSockets offer low-latency, full-duplex communication, making them ideal for real-time updates compared to Server-Sent Events (unidirectional) or gRPC (more complex for this use case).
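
A stripped-down relay along these lines, using the ws package, might look as follows; authentication, persistence, and CRDT validation are omitted, and the documentId query parameter is an assumed convention for illustration.

// Minimal WebSocket relay (npm install ws): every operation received for a
// document is forwarded to the other clients editing that document.
const { WebSocketServer, WebSocket } = require('ws');

const wss = new WebSocketServer({ port: 8080 });
const rooms = new Map(); // documentId -> Set of sockets

wss.on('connection', (socket, request) => {
  // Assumed convention: clients connect to ws://host:8080/?documentId=abc
  const documentId = new URL(request.url, 'ws://localhost').searchParams.get('documentId');
  if (!rooms.has(documentId)) rooms.set(documentId, new Set());
  rooms.get(documentId).add(socket);

  socket.on('message', (operation) => {
    // Relay the CRDT operation to every other participant in the same document.
    for (const peer of rooms.get(documentId)) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(operation);
      }
    }
  });

  socket.on('close', () => rooms.get(documentId).delete(socket));
});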


4. Offline Editing & Sync

  • Offline Editing:
    When a user goes offline, their edits are stored locally in the browser using IndexedDB as a queue of operations. The client continues to apply these operations to its local CRDT state, allowing uninterrupted editing.
  • Synchronization:
    Upon reconnection, the queued operations are sent to the server. The server merges them with the current document state using the CRDT merge function, resolving conflicts automatically. If the merge result is ambiguous (e.g., significant divergence), users can optionally review changes via a manual conflict resolution interface.

Why?
Local storage ensures offline functionality, and CRDTs handle conflict resolution naturally, minimizing data loss during sync.
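
A minimal sketch of the client-side queue idea, assuming the browser's online event and an in-memory queue; a production client would persist the queue to IndexedDB so edits survive page reloads, and applyLocally / sendToServer are placeholders for the CRDT engine and WebSocket layer.

// Client-side sketch: apply edits locally, queue them while offline, and
// flush the queue when connectivity returns.
const pendingOps = [];

// Placeholders for illustration: wire these to the CRDT engine and WebSocket.
const applyLocally = (op) => { /* apply op to the local Yjs/Automerge doc */ };
const sendToServer = (op) => { /* socket.send(JSON.stringify(op)) */ };

function submitOperation(op) {
  applyLocally(op);       // update the local CRDT state immediately
  if (navigator.onLine) {
    sendToServer(op);     // e.g., over the WebSocket connection
  } else {
    pendingOps.push(op);  // queue for later sync
  }
}

window.addEventListener('online', () => {
  // On reconnect, replay queued operations in order; the server merges them
  // into the current document state via the CRDT merge function.
  while (pendingOps.length > 0) {
    sendToServer(pendingOps.shift());
  }
});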


5. Scalability & Performance

  • Sharding:
    Documents are sharded across multiple servers based on document ID, distributing the load and enabling horizontal scaling. Each shard manages a subset of documents and their associated event logs.
  • Event-Driven Architecture:
    A message broker (e.g., Kafka or RabbitMQ) handles operation propagation. When an edit occurs, the operation is published to the broker, and relevant servers and clients consume it. This decouples the system, improving scalability and fault tolerance.
  • Caching:
    Frequently accessed documents and their current states are cached in memory (e.g., using Redis) to reduce database load and ensure fast reads.
  • Performance Optimization:
    Sub-100ms latency is achieved through WebSockets, in-memory caching, and efficient CRDT operations. Periodic snapshots reduce computation for version reconstruction.

Why?
Sharding and an event-driven approach scale the system to millions of users, while caching and snapshots ensure high performance even with large documents and frequent edits.
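
A sketch of deterministic shard routing by document ID; the fixed shard count is an assumption, and a production deployment would likely use consistent hashing to ease resharding.

// Deterministic shard routing: hash the document ID and map it to a shard.
// A fixed modulo is shown for clarity.
const crypto = require('crypto');

const SHARD_COUNT = 16; // assumed number of shards

function shardFor(documentId) {
  const digest = crypto.createHash('sha1').update(documentId).digest();
  // Use the first 4 bytes of the digest as an unsigned integer.
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

console.log(shardFor('doc-42')); // always routes doc-42 to the same shard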


6. Security & Access Control

  • Authentication:
    Users are authenticated using OAuth2 or JWT, ensuring only authorized individuals can access the system.
  • Authorization:
    Implement Role-Based Access Control (RBAC) with roles such as:
    • Owner: Can edit, share, and delete the document.
    • Editor: Can edit the document.
    • Viewer: Can only view the document.
    Role assignments are stored in a centralized database and checked for every operation (read, write, delete).
  • Encryption:
    • In Transit: Use TLS to secure all WebSocket and HTTP communications.
    • At Rest: Encrypt sensitive document data in the database using AES-256.

Why?
RBAC ensures fine-grained permissions, and encryption protects data confidentiality, meeting security requirements.
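
An Express-style sketch of the role check described above; the lookupRole helper and the role-to-permission mapping are assumptions for illustration.

// Express-style RBAC guard (sketch): resolves the caller's role for the
// document and rejects the request if the role lacks the needed permission.
const PERMISSIONS = {
  owner: ['read', 'write', 'share', 'delete'],
  editor: ['read', 'write'],
  viewer: ['read'],
};

function requirePermission(action) {
  return async (req, res, next) => {
    // Assumed helper: looks up the role in the permissions store (e.g., cache or DB).
    const role = await lookupRole(req.user.id, req.params.documentId);
    if ((PERMISSIONS[role] || []).includes(action)) return next();
    return res.status(403).json({ error: 'forbidden' });
  };
}

// Usage: app.delete('/documents/:documentId', requirePermission('delete'), handler);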


7. Versioning & Undo Feature

  • Document History:
    The event log stores all operations (delta changes) applied to the document, enabling reconstruction of any version by replaying operations from the start or the nearest snapshot. Snapshots are taken periodically to optimize retrieval.
  • Undo Feature:
    Clients maintain a local stack of recent operations. An undo reverses the last operation (e.g., deleting an inserted character) and sends the reversal to the server, which propagates it to other clients.

Why?
Delta changes are storage-efficient, and snapshots improve performance. The local undo stack provides a responsive user experience.
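
A sketch of the snapshot-plus-replay idea; loadNearestSnapshot, loadOpsSince, and applyOperation are placeholders for the storage and CRDT calls.

// Version reconstruction (sketch): start from the nearest snapshot at or
// before the requested version, then replay the remaining operations.
async function reconstructVersion(documentId, targetVersion) {
  const snapshot = await loadNearestSnapshot(documentId, targetVersion); // { version, state }
  let state = snapshot.state;
  const ops = await loadOpsSince(documentId, snapshot.version, targetVersion);
  for (const op of ops) {
    state = applyOperation(state, op); // CRDT apply
  }
  return state;
}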


Final Architecture

  • Clients: Web browsers or mobile apps connect via WebSockets for real-time collaboration and use IndexedDB for offline edits.
  • API Gateway: Authenticates users, enforces RBAC, and routes requests to services.
  • Document Service: Manages document CRUD operations, coordinates real-time updates, and applies CRDT logic.
  • CRDT Engine: Merges operations and maintains document consistency.
  • Event Log Database: Persists operations and snapshots (e.g., DynamoDB).
  • Message Broker: Distributes operations across servers and clients (e.g., Kafka).
  • Caching Layer: Stores document states in memory (e.g., Redis).
  • Storage Layer: Holds encrypted document data and metadata.

Meeting Requirements

Functional Requirements

  1. CRUD Operations: Supported via the document service.
  2. Concurrent Editing: Enabled by CRDTs and WebSockets.
  3. Conflict Resolution: Handled automatically by operation-based CRDTs.
  4. Version History: Provided by the event log and snapshots.
  5. Offline Editing: Supported with local storage and sync via CRDTs.
  6. Speed: Optimized with caching and efficient CRDT operations.

Non-Functional Requirements

  1. Low Latency: Sub-100ms updates via WebSockets and caching.
  2. Scalability: Achieved with sharding and event-driven architecture.
  3. Fault Tolerance: Message broker and geo-redundant storage ensure minimal data loss.
  4. Security: RBAC, TLS, and encryption protect access and data.
  5. High Availability: Sharding and redundancy target 99.99% uptime.
  6. Efficient Storage: Delta changes and periodic snapshots minimize duplication.

Conclusion

This design delivers a robust, scalable, and secure real-time collaborative document editing system. By leveraging CRDTs for conflict-free editing, WebSockets for low-latency synchronization, and an event-driven architecture for scalability, it meets all specified requirements while providing a user experience comparable to Google Docs.

Posted on

Design a Real-Time Collaborative Document Editing System (like Google Docs)

Problem Statement:

You need to design a cloud-based real-time collaborative document editor that allows multiple users to edit the same document simultaneously. The system should support real-time updates, conflict resolution, and offline editing.


Requirements:

Functional Requirements:

  1. Users can create, edit, and delete documents.
  2. Multiple users can edit the same document concurrently and see real-time changes.
  3. The system should handle conflicting edits and merge them efficiently.
  4. Users should be able to view document history and revert to previous versions.
  5. Users should be able to edit documents offline, and changes should sync once they’re back online.
  6. The system should be fast, even for large documents with thousands of edits per second.

Non-Functional Requirements:

  1. Low latency (sub-100ms updates for real-time collaboration).
  2. Scalability: The system should support millions of users simultaneously.
  3. Fault tolerance: Ensure minimal data loss even if servers crash.
  4. Security: Handle role-based access control (RBAC) for documents (read-only, edit, admin).
  5. High availability: 99.99% uptime with geo-redundancy.
  6. Efficient storage: Maintain versions without excessive data duplication.

What I Expect –

  1. A quick architecture diagram on – https://excalidraw.com/ outlining major blocks
  2. Any database design / schema.

Discussion Points:

  1. Data Model & Storage
    • How will you store documents? (SQL vs NoSQL, CRDT-based storage, event logs)
    • How do you efficiently store document history without excessive duplication?
  2. Concurrency & Conflict Resolution
    • Which technique would you use to merge concurrent edits? (CRDTs, Operational Transformations, or custom locking mechanisms)
    • How do you handle two users making changes to the same word at the same time?
  3. Real-Time Synchronization
    • How would you design the event propagation system for real-time updates?
    • Would you use WebSockets, Server-Sent Events (SSE), or gRPC?
  4. Offline Editing & Sync
    • How do you handle offline users editing a document and later syncing changes?
    • How would you resolve conflicts when an offline user makes a conflicting edit?
  5. Scalability & Performance
    • How do you scale the system to handle millions of documents and users?
    • Would you use sharding, leader-based consensus, or event-driven architecture?
    • How would you ensure that real-time updates are fast and efficient?
  6. Security & Access Control
    • How do you prevent unauthorized users from modifying a document?
    • How do you implement role-based access control (RBAC)?
  7. Versioning & Undo Feature
    • How do you store document history efficiently?
    • Would you store delta changes or full snapshots?

Key Points

  • A real-time collaborative document editing system can be built around ShareJS for real-time updates and MongoDB for storage, supporting concurrent editing and offline syncing.
  • Operational Transformation (OT) is a proven technique for handling conflicting edits; Conflict-Free Replicated Data Types (CRDTs) such as Automerge can simplify offline merging but may need extra conflict handling for text.
  • Sharding combined with a pub-sub layer such as Redis provides the scalability, low latency, and high availability needed to serve millions of users.

System Overview

This design aims to create a cloud-based real-time collaborative document editor, similar to Google Docs, that supports multiple users editing simultaneously, handles conflicts, and allows offline editing with sync capabilities. The system will be fast, scalable, and secure, meeting all functional and non-functional requirements.

Architecture

The system architecture includes:

  • Clients: Web browsers using WebSockets for real-time communication.
  • Load Balancer: Distributes connections to appropriate shards based on document ID for scalability.
  • Shards: Each shard handles a subset of documents, with multiple server instances running ShareJS for real-time collaboration.
  • Database: A MongoDB cluster stores document data, with replication for fault tolerance.
  • Pub-Sub System: Redis facilitates real-time updates within shards, ensuring low latency.

For offline editing, clients queue local operations and sync them when reconnected, leveraging ShareJS’s OT for conflict resolution.

Database Design

The database schema includes:

  • documents collection: Stores current document state and version.
  • permissions collection: Manages role-based access control (RBAC) for users.
  • operations collection: Logs all operations for versioning and undo functionality.

This design ensures efficient storage and quick access to document history without excessive duplication.



Detailed System Design and Analysis

This section provides a comprehensive analysis of designing a real-time collaborative document editing system, addressing all requirements and discussion points. The design leverages established technologies and methodologies to ensure scalability, performance, and user experience.

System Requirements and Design Goals

The system must support:

  • Functional Requirements: Creation, editing, and deletion of documents; real-time concurrent editing; conflict resolution; document history and versioning; offline editing with sync; high performance for large documents.
  • Non-Functional Requirements: Low latency (sub-100ms updates), scalability for millions of users, fault tolerance, security with RBAC, high availability (99.99% uptime), and efficient storage.

The design aims to balance these requirements using a combination of Operational Transformation (OT) for real-time collaboration and considerations for offline editing, ensuring a robust and scalable solution.

Data Model and Storage

Storage Strategy

The system uses ShareJS, which implements OT, for real-time collaboration. Documents are stored in a MongoDB cluster for scalability and fault tolerance. The storage strategy involves:

  • Document State: Stored as JSON objects in the documents collection, with each document having a type (e.g., “text”) and current data.
  • Operation History: Maintained in the operations collection for versioning and undo, logging each operation with details like document ID, operation data, timestamp, and user ID.
  • Snapshots: Considered for efficiency, where periodic snapshots of the document state are stored to reduce the need to replay long operation logs for historical versions.

Database Schema

The database design is as follows:

| Collection  | Fields                                                                                           | Description                                       |
|-------------|--------------------------------------------------------------------------------------------------|---------------------------------------------------|
| documents   | _id (string), type (string), data (object), v (integer)                                           | Stores current document state and version number  |
| permissions | _id (string), users (array of objects with user_id and role)                                      | Manages RBAC for each document                     |
| operations  | _id (string), document_id (string), operation_data (object), timestamp (date), user_id (string)   | Logs all operations for versioning and undo        |

This schema ensures efficient storage and retrieval, with indexing on document_id for quick access to operations and permissions.
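
As a sketch of how the schema above might be indexed with the official MongoDB Node.js driver; the database name and the index choices are assumptions aimed at fast history and permission lookups.

// Index setup with the MongoDB Node.js driver (npm install mongodb).
const { MongoClient } = require('mongodb');

async function ensureIndexes(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('collab'); // assumed database name

  // Fetch a document's operation log in chronological order quickly.
  await db.collection('operations').createIndex({ document_id: 1, timestamp: 1 });
  // Look up a user's role on a document without scanning the whole array.
  await db.collection('permissions').createIndex({ 'users.user_id': 1 });

  await client.close();
}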

SQL vs. NoSQL

MongoDB was chosen over SQL due to its flexibility with JSON-like documents and scalability for handling large volumes of concurrent writes and reads, essential for real-time collaboration.

CRDT-Based Storage

Initially, CRDTs like Automerge were considered for their offline-first capabilities and conflict-free merging. However, for real-time text editing, OT was preferred due to better handling of concurrent edits without manual conflict resolution, which Automerge might require for text overlaps. Automerge remains a viable option for offline editing, but the design leans toward OT for consistency.

Concurrency and Conflict Resolution

Technique Selection

  • Operational Transformation (OT): Chosen for real-time collaboration, as implemented by ShareJS. OT transforms concurrent operations to maintain document consistency, ensuring that when two users edit the same word simultaneously, the server adjusts operations to merge changes seamlessly.
  • Conflict Resolution: OT handles conflicts by transforming operations based on their order and position, preserving all changes without loss. For example, if User A inserts text at position 5 and User B deletes at position 5 concurrently, OT adjusts the operations to apply both changes correctly.

Comparison with CRDTs

CRDTs were evaluated, particularly Automerge, for their decentralized merging capabilities. However, for text editing, CRDTs might preserve conflicting edits (e.g., both insertions at the same position), requiring application-level resolution, which could disrupt real-time flow. OT’s centralized approach ensures a single consistent view, making it more suitable.
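
A toy transform function for the insert-versus-delete case discussed above (single-character deletes only); real OT libraries such as the ShareJS OT types handle many more cases, including insert-vs-insert tie-breaking and multi-character spans.

// Toy operational transform for plain text (a sketch, not ShareJS's actual types).
// Transforms `op` so it can be applied after `other` has already been applied.
function transform(op, other) {
  if (other.type === 'insert') {
    // Positions at or after the insertion point shift right by the inserted length.
    if (op.pos >= other.pos) return { ...op, pos: op.pos + other.text.length };
    return op;
  }
  if (other.type === 'delete') {
    // Positions after the deleted character shift left by one (single-char deletes).
    if (op.pos > other.pos) return { ...op, pos: op.pos - 1 };
    return op;
  }
  return op;
}

// User A inserts "x" at 5; User B concurrently deletes the character at 5.
const insertA = { type: 'insert', pos: 5, text: 'x' };
const deleteB = { type: 'delete', pos: 5 };
// Server applies A first, then transforms B against A before applying it.
console.log(transform(deleteB, insertA)); // { type: 'delete', pos: 6 }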

Real-Time Synchronization

Event Propagation System

  • WebSockets: Used for bidirectional communication, enabling clients to send edits and receive updates in real-time. Each document has a unique channel in the pub-sub system (Redis), ensuring updates are broadcast to all connected users.
  • Pub-Sub Implementation: Redis facilitates efficient message passing within each shard, with server instances subscribing to document channels to propagate changes, achieving sub-100ms latency.

Technology Choice

WebSockets were preferred over Server-Sent Events (SSE) or gRPC due to their bidirectional nature, essential for real-time collaboration. gRPC could be considered for high-performance backend communication, but WebSockets align better with browser-based clients.
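
A sketch of per-document channels using ioredis; the doc:<id> channel naming, the JSON payloads, and the broadcastToLocalClients callback are assumptions.

// Per-document fan-out with Redis pub/sub (npm install ioredis).
// Each server instance subscribes to the channels of the documents it hosts
// and relays published operations to its own WebSocket clients.
const Redis = require('ioredis');

const publisher = new Redis();
const subscriber = new Redis();

function publishOperation(documentId, operation) {
  // Assumed channel naming convention: doc:<id>
  publisher.publish(`doc:${documentId}`, JSON.stringify(operation));
}

async function subscribeToDocument(documentId, onOperation) {
  await subscriber.subscribe(`doc:${documentId}`);
  subscriber.on('message', (channel, message) => {
    if (channel === `doc:${documentId}`) onOperation(JSON.parse(message));
  });
}

// Usage: subscribeToDocument('abc', (op) => broadcastToLocalClients('abc', op));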

Offline Editing and Sync

Handling Offline Users

  • When offline, clients queue local operations using ShareJS’s client-side capabilities, storing them locally. Upon reconnection, these operations are sent to the server.
  • The server applies these operations, transforming them based on the current document state to handle any intervening changes, ensuring consistency.

Conflict Resolution for Offline Edits

  • The server uses OT to merge offline operations with the current state. If conflicts arise (e.g., offline user edited the same part as online users), OT transforms the operations to resolve them, maintaining document integrity.
  • This approach ensures that offline edits are not lost and are seamlessly integrated, with the server broadcasting the updated state to all connected clients.

Scalability and Performance

Scaling Strategy

  • Sharding: Documents are distributed across multiple shards based on document ID, with each shard handling a subset. This distributes load and ensures scalability for millions of users and documents.
  • Leader-Based Consensus: Each shard has a primary server instance for document updates, with secondary instances for failover, ensuring consistency and availability.
  • Event-Driven Architecture: The pub-sub system (Redis) enables event-driven updates, reducing server load by broadcasting changes efficiently.

Ensuring Fast Updates

  • Low latency is achieved by routing users to the nearest data center (geo-redundancy) and using WebSockets for real-time communication. Redis’s in-memory data structure ensures quick message passing, meeting the sub-100ms requirement.
  • For large documents, ShareJS’s OT implementation is optimized for frequent updates, with periodic snapshots reducing the need for full operation replays.

Security and Access Control

Preventing Unauthorized Access

  • All communications are encrypted using HTTPS for web traffic and secure WebSockets, ensuring data privacy.
  • Authentication is handled through an identity provider, with user sessions validated before allowing operations.

Implementing RBAC

  • The permissions collection stores user roles (read-only, edit, admin) for each document. Before applying operations, the server checks the user’s role, denying unauthorized actions. This ensures fine-grained access control, meeting security requirements.

Versioning and Undo Feature

Efficient Storage of History

  • Document history is maintained in the operations collection, logging each edit with timestamp and user ID. This allows replaying operations to reconstruct any version, supporting undo functionality.
  • To optimize storage, periodic snapshots are stored in the documents collection, reducing the need to process long operation logs for historical access.

Delta Changes vs. Full Snapshots

  • The system uses delta changes (operations) for real-time updates, stored in the operation log. Full snapshots are taken at intervals (e.g., every 1000 operations) to balance storage efficiency and quick access, ensuring users can revert to previous versions without excessive computation.

Hybrid Approach Consideration

While OT is central to real-time collaboration in this design, the earlier look at CRDTs such as Automerge points to a possible hybrid approach for offline editing, where CRDTs would simplify syncing but need extra conflict handling for text. That flexibility comes at the cost of added complexity, so the design ultimately favors OT for consistency.

Conclusion

This design leverages ShareJS for OT-based real-time collaboration, MongoDB for scalable storage, and Redis for efficient pub-sub, ensuring low latency, high availability, and support for offline editing. The sharding mechanism and RBAC implementation meet scalability and security needs, with operation logs and snapshots providing robust versioning and undo features.

Posted on

Time Series + Predictive Analytics

I have had some interesting back-end questions posted to me recently.

Implementing a time-series store and a sum_submerge method.
In that vein, solutions along the lines of ReductStore and PyStore seemed worth a look.
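
For what it’s worth, here is a rough sketch of how I imagine a minimal in-memory time-series store with a sum_submerge-style method might look. I’m treating sum_submerge as “collapse samples into fixed time buckets by summing them”, which is purely my own reading of the name.

// Sketch of an in-memory time-series store with a sum_submerge-style method.
// Assumption: sumSubmerge(intervalMs) collapses samples into fixed buckets,
// summing the values of every sample that falls in the same bucket.
class TimeSeriesStore {
  constructor() {
    this.samples = []; // { ts: epoch millis, value: number }
  }

  insert(ts, value) {
    this.samples.push({ ts, value });
  }

  // Merge samples into intervalMs-wide buckets, summing values per bucket.
  sumSubmerge(intervalMs) {
    const buckets = new Map();
    for (const { ts, value } of this.samples) {
      const bucketStart = Math.floor(ts / intervalMs) * intervalMs;
      buckets.set(bucketStart, (buckets.get(bucketStart) || 0) + value);
    }
    this.samples = [...buckets.entries()]
      .sort((a, b) => a[0] - b[0])
      .map(([ts, value]) => ({ ts, value }));
    return this.samples;
  }
}

// Example: three samples inside the same 60s bucket collapse to one summed point.
const store = new TimeSeriesStore();
store.insert(0, 1);
store.insert(20000, 2);
store.insert(45000, 4);
console.log(store.sumSubmerge(60000)); // [ { ts: 0, value: 7 } ]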

But I felt at a loss for the overall theory of time-series data versus the more traditional relational data I have used to model and build SaaS for most of my life.
I can definitely see how, with a fleet of GPUs, one would want to collect telemetry data and then use that data to quantify their performance and lifespan.

Using predictive analytics to anticipate the failure of a device, and preemptively removing it from a top tier where the best clients are paying top dollar for said fleet of devices.
It would seem like a good idea to build a dataset of devices, with optimal telemetry versus thresholds for failure.
Also worth tracking are deltas on metrics which could signify performance degradation.

Another crushing boulder has been dropped on me, with all the stuff that I don’t know tacked on. Feels like Atlas has become a splatter.

Posted on

Best Practices for Writing Unit Tests in Node.js

When writing unit tests in Node.js, following best practices ensures your tests are effective, maintainable, and reliable. Additionally, choosing the right testing framework can streamline the process. Below, I’ll outline key best practices for writing unit tests and share the testing frameworks I’ve used.


  1. Isolate Tests
    Ensure each test is independent and doesn’t depend on the state or outcome of other tests. This allows tests to run in any order and makes debugging easier. Use setup and teardown methods (like beforeEach and afterEach in Jest) to reset the environment before and after each test; a short sketch illustrating this appears after this list.
  2. Test Small Units
    Focus on testing individual functions or modules in isolation rather than entire workflows. Mock dependencies—such as database calls or external APIs—to keep the test focused on the specific logic being tested.
  3. Use Descriptive Test Names
    Write clear, descriptive test names that explain what’s being tested without needing to dive into the code. For example, prefer shouldReturnSumOfTwoNumbers over a vague testFunction.
  4. Cover Edge Cases
    Test not just the typical “happy path” but also edge cases, invalid inputs, and error conditions. This helps uncover bugs in less common scenarios.
  5. Avoid Testing Implementation Details
    Test the behavior and output of a function, not its internal workings. This keeps tests flexible and reduces maintenance when refactoring code.
  6. Keep Tests Fast
    Unit tests should execute quickly to support frequent runs and smooth development workflows. Avoid slow operations like network calls by mocking dependencies.
  7. Use Assertions Wisely
    Choose the right assertions for the job (e.g., toBe for primitives, toEqual for objects in Jest) and avoid over-asserting. Ideally, each test should verify one specific behavior.
  8. Maintain Test Coverage
    Aim for high coverage of critical paths and complex logic, but don’t chase 100% coverage for its own sake. Tools like Istanbul can help measure coverage effectively.
  9. Automate Test Execution
    Integrate tests into your CI/CD pipeline to run automatically on every code change. This catches regressions early and keeps the codebase stable.
  10. Write Tests First (TDD)
    Consider Test-Driven Development (TDD), where you write tests before the code. This approach can improve code design and testability, though writing tests early is valuable even without strict TDD.
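
To tie points 1 and 2 together, here is a small Jest sketch where beforeEach/afterEach keep tests independent and an external dependency is mocked; the cart module and price service are hypothetical.

// Hypothetical unit under test: a tiny in-memory cart that uses a price lookup.
const priceService = { getPrice: (sku) => { throw new Error('network call'); } };

function createCart() {
  const items = [];
  return {
    add(sku) { items.push({ sku, price: priceService.getPrice(sku) }); },
    total() { return items.reduce((sum, item) => sum + item.price, 0); },
  };
}

describe('cart', () => {
  let cart;

  beforeEach(() => {
    // Fresh cart per test, and the external price lookup is mocked out.
    cart = createCart();
    jest.spyOn(priceService, 'getPrice').mockReturnValue(10);
  });

  afterEach(() => {
    jest.restoreAllMocks(); // keep tests independent of each other
  });

  test('sums the prices of added items', () => {
    cart.add('sku-1');
    cart.add('sku-2');
    expect(cart.total()).toBe(20);
  });

  test('starts empty for every test', () => {
    expect(cart.total()).toBe(0);
  });
});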

Testing Frameworks I’ve Used

I’ve worked with several testing frameworks in the Node.js ecosystem, each with its strengths. Here’s an overview:

  1. Jest
    • What It Is: A popular, all-in-one testing framework known for simplicity and ease of use, especially with Node.js and React projects.
    • Key Features: Zero-config setup, built-in mocking, assertions, and coverage reporting, plus snapshot testing.
    • Why I Like It: Jest’s comprehensive features and parallel test execution make it fast and developer-friendly.
  2. Mocha
    • What It Is: A flexible testing framework often paired with assertion libraries like Chai.
    • Key Features: Supports synchronous and asynchronous testing, extensible with plugins, and offers custom reporting.
    • Why I Like It: Its flexibility gives me fine-grained control, making it ideal for complex testing needs.
  3. Jasmine
    • What It Is: A behavior-driven development (BDD) framework with a clean syntax.
    • Key Features: Built-in assertions and mocking, plus spies for tracking function calls—no external dependencies needed.
    • Why I Like It: The intuitive syntax suits teams who prefer a BDD approach.
  4. AVA
    • What It Is: A test runner focused on speed and simplicity, with strong support for modern JavaScript.
    • Key Features: Concurrent test execution, async/await support, and a minimalistic API.
    • Why I Like It: Its performance shines when testing asynchronous code.
  5. Tape
    • What It Is: A lightweight, minimalistic framework that outputs TAP (Test Anything Protocol) results.
    • Key Features: Simple, no-config setup, and easy integration with other tools.
    • Why I Like It: Perfect for small projects needing a straightforward testing solution.

To test the add function using Jest, we need to verify that it correctly adds two numbers. Below is a simple Jest test suite that covers basic scenarios, including positive numbers, negative numbers, zero, and floating-point numbers.

// Define the function to be tested
function add(a, b) {
    return a + b;
}

// Test suite for the add function
describe('add function', () => {
    test('adds two positive numbers', () => {
        expect(add(2, 3)).toBe(5);
    });

    test('adds a positive and a negative number', () => {
        expect(add(2, -3)).toBe(-1);
    });

    test('adds two negative numbers', () => {
        expect(add(-2, -3)).toBe(-5);
    });

    test('adds a number and zero', () => {
        expect(add(2, 0)).toBe(2);
    });

    test('adds floating-point numbers', () => {
        expect(add(0.1, 0.2)).toBeCloseTo(0.3);
    });
});

Explanation

  • Purpose: The add function takes two parameters, a and b, and returns their sum. The test suite ensures this behavior works correctly across different types of numeric inputs.
  • Test Cases:
    • Two positive numbers: 2 + 3 should equal 5.
    • Positive and negative number: 2 + (-3) should equal -1.
    • Two negative numbers: (-2) + (-3) should equal -5.
    • Number and zero: 2 + 0 should equal 2.
    • Floating-point numbers: 0.1 + 0.2 should be approximately 0.3. We use toBeCloseTo instead of toBe due to JavaScript’s floating-point precision limitations.
  • Structure:
    • describe block: Groups all tests related to the add function for better organization.
    • test functions: Each test case is defined with a clear description and uses Jest’s expect function to assert the output matches the expected result.
  • Assumptions: The function assumes numeric inputs. Non-numeric inputs (e.g., strings) are not tested here, as the function’s purpose is basic numeric addition.

This test suite provides a simple yet comprehensive check of the add function’s functionality in Jest.

How to Mock External Services in Unit Tests with Jest

When writing unit tests in Jest, mocking external services—like APIs, databases, or third-party libraries—is essential to ensure your tests are fast, reliable, and isolated from real dependencies. Jest provides powerful tools to create mock implementations of these services. Below is a step-by-step guide to mocking external services in Jest, complete with examples.


Why Mock External Services?

Mocking replaces real external services with fake versions, allowing you to:

  • Avoid slow or unreliable network calls.
  • Prevent side effects (e.g., modifying a real database).
  • Simulate specific responses or errors without depending on live systems.

Steps to Mock External Services in Jest

1. Identify the External Service

Determine which external dependency you need to mock. For example:

  • An HTTP request to an API.
  • A database query.
  • A third-party library like Axios.

2. Use Jest’s Mocking Tools

Jest offers several methods to mock external services:

Mock Entire Modules with jest.mock()

Use jest.mock() to replace an entire module with a mock version. This is ideal for mocking libraries or custom modules that interact with external services.

Mock Specific Functions with jest.fn()

Create mock functions using jest.fn() and customize their behavior (e.g., return values or promise resolutions).

Spy on Methods with jest.spyOn()

Mock specific methods of an object while preserving the rest of the module’s functionality.

3. Handle Asynchronous Behavior

Since external services often involve asynchronous operations (e.g., API calls returning promises), Jest provides utilities like:

  • mockResolvedValue() for successful promise resolutions.
  • mockRejectedValue() for promise rejections.
  • mockImplementation() for custom async logic.

4. Reset or Restore Mocks

To maintain test isolation, reset mocks between tests using jest.resetAllMocks() or restore original implementations with jest.restoreAllMocks().


Example: Mocking an API Call

Let’s walk through an example of mocking an external API call in Jest.

Code to Test

Imagine you have a module that fetches user data from an API:

javascript

// api.js
const axios = require('axios');

async function getUserData(userId) {
  const response = await axios.get(`https://api.example.com/users/${userId}`);
  return response.data;
}

module.exports = { getUserData };

javascript

// userService.js
const { getUserData } = require('./api');

async function fetchUser(userId) {
  const userData = await getUserData(userId);
  return `User: ${userData.name}`;
}

module.exports = { fetchUser };

Test File

Here’s how to mock the getUserData function in Jest:

javascript

// userService.test.js
const { fetchUser } = require('./userService');
const api = require('./api');

jest.mock('./api'); // Mock the entire api.js module

describe('fetchUser', () => {
  afterEach(() => {
    jest.resetAllMocks(); // Reset mocks after each test
  });

  test('fetches user data successfully', async () => {
    // Mock getUserData to return a resolved promise
    api.getUserData.mockResolvedValue({ name: 'John Doe', age: 30 });

    const result = await fetchUser(1);
    expect(result).toBe('User: John Doe');
    expect(api.getUserData).toHaveBeenCalledWith(1);
  });

  test('handles error when fetching user data', async () => {
    // Mock getUserData to return a rejected promise
    api.getUserData.mockRejectedValue(new Error('Network Error'));

    await expect(fetchUser(1)).rejects.toThrow('Network Error');
  });
});

Explanation

  • jest.mock('./api'): Mocks the entire api.js module, replacing getUserData with a mock function.
  • mockResolvedValue(): Simulates a successful API response with fake data.
  • mockRejectedValue(): Simulates an API failure with an error.
  • jest.resetAllMocks(): Ensures mocks don’t persist between tests, maintaining isolation.
  • Async Testing: async/await handles the asynchronous nature of fetchUser.

Mocking Other External Services

Mocking a Third-Party Library (e.g., Axios)

If your code uses Axios directly, you can mock it like this:

javascript

const axios = require('axios');
jest.mock('axios');

test('fetches user data with Axios', async () => {
  axios.get.mockResolvedValue({ data: { name: 'John Doe' } });
  const response = await axios.get('https://api.example.com/users/1');
  expect(response.data).toEqual({ name: 'John Doe' });
});

Mocking a Database (e.g., Mongoose)

For a MongoDB interaction using Mongoose:

javascript

const mongoose = require('mongoose');
jest.mock('mongoose', () => {
  const mockModel = {
    find: jest.fn().mockResolvedValue([{ name: 'John Doe' }]),
  };
  return { model: jest.fn().mockReturnValue(mockModel) };
});

test('fetches data from database', async () => {
  const User = mongoose.model('User');
  const users = await User.find();
  expect(users).toEqual([{ name: 'John Doe' }]);
});

Advanced Mocking Techniques

Custom Mock Implementation

Simulate complex behavior, like a delayed API response:

javascript

api.getUserData.mockImplementation(() =>
  new Promise((resolve) => setTimeout(() => resolve({ name: 'John Doe' }), 1000))
);

Spying on Methods

Mock only a specific method:

javascript

jest.spyOn(api, 'getUserData').mockResolvedValue({ name: 'John Doe' });

Best Practices

  • Isolate Tests: Always reset or restore mocks to prevent test interference.
  • Match Real Behavior: Ensure mocks mimic the real service’s interface (e.g., return promises if the service is async).
  • Keep It Simple: Use the minimal mocking needed to test your logic.

By using jest.mock(), jest.fn(), and jest.spyOn(), along with utilities for handling async code, you can effectively mock external services in Jest unit tests. This approach keeps your tests fast, predictable, and independent of external systems.

Final Thoughts

By following best practices like isolating tests, using descriptive names, and covering edge cases, you can write unit tests that improve the reliability of your Node.js applications. As for frameworks, I’ve used Jest for its ease and features, Mocha for its flexibility, AVA for async performance, Jasmine for BDD, and Tape for simplicity. The right choice depends on your project’s needs and team preferences, but any of these can support a robust testing strategy.

Posted on

ACID properties in relational databases and How they ensure data consistency

ACID properties are fundamental concepts in relational databases that ensure reliable transaction processing and maintain data consistency, even in the presence of errors, system failures, or concurrent access. The acronym ACID stands for Atomicity, Consistency, Isolation, and Durability. Below, I will explain each property and how they work together to ensure data consistency.


1. Atomicity

  • Definition: Atomicity ensures that a transaction is treated as a single, indivisible unit of work. This means that either all the operations within the transaction are executed successfully, or none of them are applied. There is no partial execution.
  • How it ensures consistency:
    • Consider a transaction that involves multiple steps, such as transferring money from one account to another (debiting one account and crediting another).
    • Atomicity guarantees that if any part of the transaction fails (e.g., the credit operation fails due to an error), the entire transaction is rolled back to its original state.
    • This prevents partial updates, such as debiting one account without crediting the other, which would leave the database in an inconsistent state (e.g., account balances would not match).
    • By ensuring all-or-nothing execution, atomicity maintains the integrity of the data.
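
As a sketch, here is the transfer example above written as an explicit transaction with node-postgres; the accounts table and balance column are illustrative.

// The transfer example as an explicit transaction with node-postgres (npm install pg).
// Table and column names (accounts, balance) are illustrative.
const { Pool } = require('pg');
const pool = new Pool();

async function transfer(fromId, toId, amount) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      'UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
    await client.query(
      'UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
    await client.query('COMMIT'); // both updates become visible together
  } catch (err) {
    await client.query('ROLLBACK'); // neither update is applied
    throw err;
  } finally {
    client.release();
  }
}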

2. Consistency

  • Definition: Consistency ensures that the database remains in a valid state before and after a transaction. It enforces all rules and constraints defined in the database schema, such as primary key uniqueness, foreign key relationships, data types, and check constraints.
  • How it ensures consistency:
    • Before committing a transaction, the database verifies that the transaction adheres to all defined rules.
    • For example, if a transaction tries to insert a duplicate primary key or violate a foreign key constraint, the transaction is not allowed to commit, and the database remains unchanged.
    • This ensures that only valid data is stored, preserving the overall consistency of the database.
    • Consistency prevents invalid or corrupted data from being committed, maintaining the integrity of the database schema.

3. Isolation

  • Definition: Isolation ensures that concurrent transactions do not interfere with each other. Each transaction is executed as if it were the only transaction running on the database, even when multiple transactions are processed simultaneously.
  • How it ensures consistency:
    • Isolation prevents issues that can arise when multiple transactions access and modify the same data concurrently, such as:
      • Dirty reads: Reading data from an uncommitted transaction that may later be rolled back.
      • Non-repeatable reads: Seeing different values for the same data within the same transaction due to changes by other transactions.
      • Phantom reads: Seeing changes in the number of rows (e.g., new rows inserted by another transaction) during a transaction.
    • Isolation is typically achieved through mechanisms like locking or multi-version concurrency control (MVCC), which ensure that transactions see a consistent view of the data.
    • By isolating transactions, the database ensures that concurrent operations do not compromise data integrity, maintaining consistency in multi-user environments.
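
As a small illustration, here is how a stricter isolation level can be requested in PostgreSQL from node-postgres; the helper name is illustrative, and serialization failures are surfaced to the caller so the transaction can be retried.

// Requesting a stricter isolation level in PostgreSQL via node-postgres.
async function withSerializableTransaction(client, work) {
  await client.query('BEGIN ISOLATION LEVEL SERIALIZABLE');
  try {
    const result = await work(client); // reads/writes see one consistent snapshot
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK'); // e.g., on a serialization failure, the caller can retry
    throw err;
  }
}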

4. Durability

  • Definition: Durability ensures that once a transaction is committed, its changes are permanent and will survive any subsequent failures, such as power outages, system crashes, or hardware malfunctions.
  • How it ensures consistency:
    • After a transaction is committed, the changes are written to non-volatile storage (e.g., disk), ensuring that the data is not lost even if the system fails immediately after the commit.
    • This guarantees that the database can recover to a consistent state after a failure, preserving the integrity of the committed transactions.
    • Durability ensures that once a transaction is successfully completed, its effects are permanently stored, maintaining long-term data consistency.

How ACID Properties Work Together to Ensure Data Consistency

The ACID properties collectively provide a robust framework for managing transactions and maintaining data consistency in relational databases:

  • Atomicity ensures that transactions are all-or-nothing, preventing partial updates that could lead to inconsistencies.
  • Consistency enforces the database’s rules and constraints, ensuring that only valid data is committed.
  • Isolation manages concurrent access, preventing transactions from interfering with each other and maintaining a consistent view of the data.
  • Durability guarantees that once a transaction is committed, its changes are permanent, even in the event of a system failure.

Together, these properties ensure that the database remains consistent, reliable, and resilient, even in complex, multi-user environments or during unexpected failures. By adhering to ACID principles, relational databases provide a trustworthy foundation for applications that require data integrity and consistency.

Posted on

What strategies would you use to optimize database queries and improve performance?

To optimize database queries and improve performance, I recommend a structured approach that addresses both the queries themselves and the broader database environment. Below are the key strategies:

1. Analyze Query Performance

Start by evaluating how your current queries perform to pinpoint inefficiencies:

  • Use Diagnostic Tools: Leverage tools like EXPLAIN in SQL to examine query execution plans. This reveals how the database processes your queries.
  • Identify Bottlenecks: Look for issues such as full table scans (where the database reads every row), unnecessary joins, or missing indexes that slow things down.
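
As a quick sketch, a query plan can be pulled from Node.js with node-postgres as shown below; the orders table and the query are placeholders, and note that EXPLAIN ANALYZE actually executes the query, so prefer non-production data.

// Inspecting a query plan from Node.js with node-postgres.
const { Pool } = require('pg');
const pool = new Pool();

async function explainQuery() {
  const { rows } = await pool.query(
    'EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42');
  // Each row is one line of the plan; a 'Seq Scan' on a large table usually
  // signals a missing index.
  rows.forEach((row) => console.log(row['QUERY PLAN']));
}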

2. Review Database Schema

The structure of your database plays a critical role in query efficiency:

  • Normalization: Ensure the schema is normalized to eliminate redundancy and maintain data integrity, which can streamline queries.
  • Denormalization (When Needed): For applications with heavy read demands, consider denormalizing parts of the schema to reduce complex joins and speed up data retrieval.

3. Implement Indexing

Indexes are a powerful way to accelerate query execution:

  • Target Key Columns: Add indexes to columns frequently used in WHERE, JOIN, and ORDER BY clauses to allow faster data lookups.
  • Balance Indexing: Be cautious not to over-index, as too many indexes can slow down write operations like inserts and updates.

4. Use Caching Mechanisms

Reduce database load by storing frequently accessed data elsewhere:

  • Caching Tools: Implement solutions like Redis or Memcached to keep commonly used query results in memory.
  • Minimize Queries: Serve repeated requests from the cache instead of hitting the database every time.
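
A cache-aside sketch using ioredis; the key convention, the five-minute TTL, and the loadFromDatabase loader are assumptions.

// Cache-aside pattern with Redis (npm install ioredis): check the cache first,
// fall back to the database, then populate the cache with a TTL.
const Redis = require('ioredis');
const redis = new Redis();

async function getProduct(productId, loadFromDatabase) {
  const key = `product:${productId}`; // assumed key convention
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const product = await loadFromDatabase(productId); // hits the database once
  await redis.set(key, JSON.stringify(product), 'EX', 300); // cache for 5 minutes
  return product;
}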

5. Optimize Queries

Refine the queries themselves for maximum efficiency:

  • Rewrite for Efficiency: Avoid SELECT * (which retrieves all columns) and specify only the needed columns. Use appropriate JOIN types to match your data needs.
  • Batch Operations: Combine multiple operations into a single query where possible to cut down on database round trips.

6. Monitor and Tune the Database Server

Keep the database engine running smoothly:

  • Adjust Configuration: Fine-tune settings like buffer pool size or query cache to match your workload.
  • Regular Maintenance: Perform tasks like updating table statistics and rebuilding indexes to ensure optimal performance over time.

Conclusion

By applying these strategies—analyzing performance, refining the schema, indexing wisely, caching effectively, optimizing queries, and tuning the server—you can significantly boost database query performance and enhance the efficiency of your application. Start with the biggest bottlenecks and iterate as needed for the best results.

Posted on

How would you decide between using MongoDB (NoSQL) and PostgreSQL (relational database) for a new application?

Deciding between MongoDB (NoSQL) and PostgreSQL (relational database) for a new application depends on several factors, including the application’s data structure, scalability needs, transaction requirements, development speed, and team expertise. Below, I’ll outline the key considerations to help you make an informed decision.


1. Understand the Data Structure and Relationships

The nature of your data is one of the most critical factors in choosing between MongoDB and PostgreSQL.

  • Relational Data:
    • If your application involves complex relationships between entities (e.g., customers, orders, products) that require joins, foreign keys, and strict data integrity, PostgreSQL is the better choice.
    • PostgreSQL excels at maintaining data consistency across related tables and supports ACID (Atomicity, Consistency, Isolation, Durability) compliance, which is essential for applications like financial systems or e-commerce platforms.
  • Unstructured or Semi-Structured Data:
    • If your data is hierarchical, nested, or doesn’t fit neatly into tables (e.g., JSON-like documents, logs, or user profiles with varying fields), MongoDB is more suitable.
    • MongoDB’s document-based model allows you to store data in flexible, schemaless documents, making it ideal for applications where data structures evolve frequently.
  • Schema Flexibility:
    • MongoDB allows for dynamic schemas, meaning documents in the same collection can have different fields without a predefined structure. This is useful for rapid prototyping or applications with evolving requirements.
    • PostgreSQL requires a predefined schema, which is beneficial for structured data but can be restrictive if the schema changes frequently.
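
A small sketch of the contrast, with illustrative names: MongoDB accepts documents of different shapes in one collection, while PostgreSQL needs the columns (or an explicit JSONB column) declared up front.

// The same idea both ways (database, collection, and table names are illustrative).
const { MongoClient } = require('mongodb');
const { Pool } = require('pg');

async function demo() {
  // MongoDB: two profiles with different shapes in the same collection.
  const mongo = await new MongoClient('mongodb://localhost:27017').connect();
  await mongo.db('app').collection('profiles').insertMany([
    { name: 'Ada', languages: ['en'] },
    { name: 'Grace', employer: 'Navy', awards: 12 },
  ]);

  // PostgreSQL: the schema is fixed; flexible attributes go into a JSONB column.
  const pg = new Pool();
  await pg.query(`CREATE TABLE IF NOT EXISTS profiles (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    attributes JSONB
  )`);
  await pg.query('INSERT INTO profiles (name, attributes) VALUES ($1, $2)',
    ['Ada', { languages: ['en'] }]);
}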

2. Consider Scalability and Performance Needs

Scalability and performance requirements can also guide your decision.

  • Horizontal Scaling:
    • MongoDB is designed for horizontal scaling, making it easier to distribute data across multiple servers or clusters. This is ideal for applications expecting rapid growth or handling large amounts of data (e.g., social media platforms, real-time analytics).
    • PostgreSQL typically scales vertically (by adding more resources to a single server), though it supports read replicas for scaling reads. If your application requires massive write loads, MongoDB might be more suitable.
  • Read/Write Patterns:
    • For read-heavy applications with complex queries, PostgreSQL’s advanced indexing and query optimization capabilities can provide better performance.
    • For write-heavy applications or those requiring high throughput, MongoDB’s document model can offer faster write operations, especially in distributed setups.

3. Evaluate Transaction Requirements

Transactional integrity is crucial for certain applications.

  • ACID Compliance:
    • If your application requires strict transactional integrity (e.g., financial systems, e-commerce platforms), PostgreSQL’s full ACID compliance is essential. It ensures that transactions are processed reliably and consistently.
    • MongoDB supports ACID transactions, but with some limitations, especially in distributed setups. If strict consistency is not critical, MongoDB’s flexible consistency models might be acceptable.
  • Eventual Consistency:
    • If your application can tolerate eventual consistency (e.g., social media feeds, analytics), MongoDB’s flexible consistency models can work well, offering better performance for distributed systems.

4. Assess Development Speed and Flexibility

The development process and long-term maintenance requirements are also important.

  • Rapid Prototyping:
    • MongoDB’s schemaless nature allows for faster development cycles, especially in the early stages of a project when requirements are evolving. Developers can iterate quickly without worrying about schema migrations.
    • PostgreSQL’s strict schema enforcement can slow down initial development if frequent schema changes are needed.
  • Long-Term Maintenance:
    • PostgreSQL’s strict schema enforcement can lead to better data quality and easier maintenance in the long run, especially for applications with stable, well-defined requirements.
    • MongoDB’s flexibility can sometimes lead to data inconsistencies if not carefully managed, which might complicate maintenance.

5. Consider Team Expertise and Ecosystem

Your team’s familiarity with the technologies and the available ecosystem can influence your choice.

  • Familiarity:
    • If your development team is more experienced with SQL and relational databases, PostgreSQL might be a better choice to leverage existing skills.
    • If your team is comfortable with NoSQL databases or JavaScript (given MongoDB’s JSON-like documents), MongoDB could be preferable.
  • Tooling and Community:
    • PostgreSQL has a longer history and a vast array of tools for administration, monitoring, and optimization, making it a mature choice for complex applications.
    • MongoDB’s ecosystem is also robust, with a focus on cloud-native and distributed systems. Its managed services (e.g., MongoDB Atlas) are designed for ease of use in cloud environments.

6. Evaluate Cost and Operational Complexity

Operational overhead and cost considerations can also play a role.

  • Operational Overhead:
    • MongoDB’s distributed architecture can introduce complexity in terms of managing clusters, sharding, and replication. If your team lacks experience with distributed systems, this could increase operational costs.
    • PostgreSQL is simpler to manage in smaller setups but may require more effort to scale horizontally.
  • Cloud Integration:
    • Both databases are supported by major cloud providers, but MongoDB’s managed services (e.g., MongoDB Atlas) are designed for ease of use in cloud environments, potentially reducing operational burden.

7. Consider Use Case Specifics

Certain use cases may favor one database over the other.

  • Geospatial Data:
    • If your application heavily relies on geospatial queries (e.g., location-based services), both databases have geospatial capabilities. However, MongoDB’s GeoJSON support and 2dsphere indexes are often more straightforward.
  • Full-Text Search:
    • PostgreSQL has robust full-text search capabilities, making it a strong choice for applications requiring advanced search features.
  • Time-Series Data:
    • For time-series data (e.g., IoT sensor data), MongoDB’s document model can handle large volumes of time-stamped data efficiently. PostgreSQL also has extensions like TimescaleDB for this purpose.

Decision Framework

  • Choose PostgreSQL if:
    • Your application requires complex relationships and joins between entities.
    • Strict ACID compliance is necessary for transactional integrity.
    • Your team is more comfortable with SQL and relational databases.
    • The data schema is well-defined and unlikely to change frequently.
    • Advanced querying, indexing, and full-text search are critical.

  • Choose MongoDB if:
    • Your data is unstructured or semi-structured (e.g., JSON-like documents).
    • Your application needs to scale horizontally with ease.
    • Rapid development and schema flexibility are priorities.
    • Your team is experienced with NoSQL databases or JavaScript.
    • Your application involves large volumes of write-heavy operations or distributed systems.

Conclusion

The decision between MongoDB and PostgreSQL should be based on the specific needs of your application. If your application demands strict data integrity, complex relationships, and a stable schema, PostgreSQL is the better choice. Conversely, if flexibility, scalability, and rapid development are more important, MongoDB is likely a better fit. In some cases, a hybrid approach using both databases for different parts of the application can also be effective, but this introduces additional complexity.

Posted on

are you a logger?

Some people are debuggers.
Stepping their way through the binary jungle, one hack at a time.

For those of you who are loggers, staring at the console for interesting events:
I had some time to write a small PHP script that inserts a console.log for every method in a CanJS controller.

It should save me loads of monotony when reverse engineering OPC (other people’s code).

Hope you find it useful:

<?php

// Default to Storage.js if no input file is given on the command line.
if( !isset( $argv[1] ) )
    $argv[1] = 'Storage.js';

// Write the instrumented copy next to the original as <name>_debug.<ext>.
$fileInfo = pathinfo( $argv[1] );
$outFile = $fileInfo['dirname'] . '/' . $fileInfo['filename'] . '_debug.' . $fileInfo['extension'];

$in = fopen( $argv[1], 'r' );
$out = fopen( $outFile, 'w' );

while( !feof( $in ) ){
    $line = fgets( $in );

    // Match lines that declare a method, e.g. "init: function( el, options ){".
    if( preg_match( '/:\W+function/', $line ) ){
        // Grab the argument list so it can be echoed in the log call.
        preg_match( "/\((.*)\)/", $line, $matches );
        $function = explode( ':', $line );
        $functionName = trim( $function[0] );

        // Append a console.log right after the method declaration,
        // logging the file name, the method name, and any arguments.
        if( isset( $matches[1] ) && strlen( $matches[1] ) > 0 )
            $line .= "\nconsole.log( '$fileInfo[filename]', '$functionName', $matches[1] )\n";
        else
            $line .= "\nconsole.log( '$fileInfo[filename]', '$functionName' )\n";
    }

    fputs( $out, $line );
}

fclose( $in );
fclose( $out );

Posted on

Real Web Developers don’t do “Builds”

I don’t want to wait for some Maven command to execute.
I just want to refresh the browser!

So this is where Charles Proxy comes to the rescue!
The “Map to Local” feature allows you to quickly map your resources to a live implementation on a production, or development environment.

It is a really great feature that saves a lot of time. Even the upload-to-generic-hosting step can be skipped when making changes to static resources, so this tool is a good fit for enterprises and SMBs alike.

Usually I would completely avoid extraneous non-open-source solutions. The KISS principle is something I apply not just to my coding, but to my workflow as well. Unfortunately, I don’t always get to decide what platform and workflow structure I have to interact with, and this is where tools like Fiddler and Charles become indispensable in preserving my sanity.

While I am currently using Charles Proxy on the Mac, Fiddler2 also provides this feature, hidden in the AutoResponder tab: simply activate “Map URLs to local files”. Fiddler should also be able to run on Linux and Mac, although I haven’t tried it yet.

Check it out http://www.fiddler2.com/fiddler/help/video/default.asp

Posted on

Mobile Web

Today I had the pleasure of meeting Maximiliano Firtman. An amazing speaker who kept my attention the entire time.
He really validated my perspective on the mobile web: even though it’s a complete fuster cluck when it comes to devices, screens, and features, and may be an exercise in complete futility because as soon as you’re done coding everything is going to change anyway, you still can’t idly sit back and do nothing.

Though many of my co-workers were joking that his talk made them depressed about the current state of the mobile web, I really found it enjoyable, because he echoed back many of my own viewpoints regarding Responsive Web Design, system architecture, and even how job duties or roles should be defined. And since great minds think alike, I recommend you check out his books. They will reveal lots of great resources, illuminate niches in capturing users, and point to a better ROI when it comes to creating a mobile version of your website.

Spoiler alert: there is no easy way to go about creating a mobile version of your site. As craftsmen, we first have to painstakingly measure, and remeasure, before we put our tools to work. Keep in mind, this gentleman has been doing web programming since 1995 and has been a subject matter expert in mobile web design since 2000, back in the days of WML and WAP.