Ashish Kumar | Full-stack & ML Engineer

Building a high-performance centralized cryptocurrency exchange (CEX) matching engine is one of the most demanding software engineering challenges. It requires a relentless focus on latency, throughput, and determinism. While most modern web applications are built with managed, garbage-collected languages, the core of a competitive trading system demands bare-metal control.

This case study delves into the architecture, design decisions, and implementation details of my CEX project—a proprietary limit order book and matching engine written entirely in C from scratch.

1. The Core Architecture

The architecture of the exchange is divided into three primary subsystems:

The Networking Layer: Custom-built HTTP and WebSocket servers for ingestion and broadcating.
The Matching Engine: The algorithmic core that maintains the limit order book and executes trades based on price-time priority.
The Memory Manager: A specialized pre-allocated memory pool system to eliminate OS-level allocation overhead during runtime.

Data Structures: The Limit Order Book (LOB)

At the heart of any exchange is the Limit Order Book. It must support incredibly fast insertions, deletions, and executions.

To achieve $O(1)$ time complexity for order execution and $O(\log N)$ for order insertion, the LOB is structured using a hybrid approach:

Price Levels: Managed using a self-balancing Red-Black Tree. This allows for fast querying of the best bid and ask prices.
Order Queues: Each price level node contains a pointer to a doubly linked list of individual orders. This enforces the "Time" aspect of "Price-Time Priority."

C17 lines|

// Simplified representation of the order book structures
typedef struct Order {
    uint64_t order_id;
    uint64_t user_id;
    uint64_t quantity;
    uint64_t price;
    struct Order* next;
    struct Order* prev;
} Order;

typedef struct PriceLevel {
    uint64_t price;
    uint64_t total_volume;
    Order* head;
    Order* tail;
    struct PriceLevel* left;
    struct PriceLevel* right;
} PriceLevel;

When an aggressive order crosses the spread, the engine traverses the Red-Black tree to find the best price, then walks the doubly linked list to match against resting orders until the incoming quantity is filled.

2. Eliminating Allocation Overhead

Standard library memory allocation (malloc and free) requires context switches to the OS kernel and involves complex bookkeeping. In a highly volatile market, an influx of thousands of orders per second can cause severe heap fragmentation and unpredictable latency spikes.

Custom Memory Pools

To guarantee deterministic latency, the CEX engine pre-allocates all necessary memory at startup. We use custom memory pools (slab allocators) for the Order and PriceLevel structs.

C15 lines|

typedef struct OrderPool {
    Order* blocks;
    Order** free_list;
    size_t capacity;
    size_t free_count;
} OrderPool;

// O(1) Allocation
Order* pool_alloc(OrderPool* pool) {
    if (pool->free_count == 0) return NULL; // Pool exhausted
    return pool->free_list[--pool->free_count];
}

// O(1) Deallocation
void pool_free(OrderPool* pool, Order* order) {
    pool->free_list[pool->free_count++] = order;
}

This ensures that under peak load, the engine is simply shifting pointers around rather than requesting memory from the operating system.

3. Precision and Math

One of the most dangerous pitfalls in financial engineering is the use of IEEE 754 floating-point numbers (float or double). Floating-point arithmetic introduces rounding errors that are unacceptable when handling user funds.

To solve this, the engine strictly uses Fixed-Point Arithmetic scaled by a factor (e.g., $10^8$ for Bitcoin satoshis). All prices and quantities are represented as unsigned 64-bit integers (uint64_t).

A price of $50,000.50 is stored as 5000050000000.
Mathematical operations are reduced to simple, clock-cycle-efficient integer arithmetic.

4. The Networking Layer

A matching engine is useless if it cannot communicate with the outside world efficiently. The src/net directory contains a completely bespoke networking stack.

Avoiding Bloated Frameworks

Instead of relying on heavy event loops like libuv or external web servers like Nginx, the system utilizes Linux epoll directly. This edge-triggered I/O event notification facility allows the engine to handle thousands of concurrent WebSocket connections for real-time market data broadcasting.

The WebSocket Implementation:

Framing: Custom implementation of RFC 6455 framing to parse incoming trade requests and format outgoing order book deltas.
Zero-Copy Broadcasts: When a trade occurs, the JSON string representing the market delta is constructed exactly once in a shared memory buffer and written directly to the file descriptors of all subscribed WebSocket clients using send() or writev().

5. Performance and Profiling

During stress testing, the goal was to push the throughput to the absolute maximum on a single CPU thread, avoiding the massive overhead of lock contention (mutexes).

Single-Threaded by Design

Most traditional exchanges use a single-threaded matching engine per trading pair (e.g., one thread for BTC/USD, one for ETH/USD). This architecture avoids the need for locks (pthread_mutex_t), which are notoriously slow. By pinning the matching thread to an isolated CPU core using sched_setaffinity, we prevent the OS from context-switching the thread away, keeping the L1 cache hot.

Benchmarks

On standard consumer hardware (Intel Core i7):

Order Insertion (Passive): ~45 nanoseconds.
Order Execution (Aggressive): ~65 nanoseconds per matched resting order.
Throughput: Sustained processing of>1.5 million orders per second on a single core.

6. Challenges and Learnings

Building a financial engine from scratch in C brought several difficult challenges:

JSON Serialization Overhead: Parsing incoming JSON orders and formatting outgoing JSON updates was initially the biggest bottleneck, taking longer than the actual trade matching. I had to write a highly optimized, allocation-free string formatting routine specific to our data structures.
Handling Network Backpressure: When broadcasting market data to slow WebSocket clients, the TCP send buffers would fill up. Implementing a robust backpressure mechanism to drop stale market data updates for lagging clients without blocking the main engine thread was a complex state-machine challenge.
Strict Memory Safety: Without a garbage collector, tracking memory leaks required rigorous use of tools like Valgrind and AddressSanitizer during the testing phase. Every possible execution branch had to correctly return Order structs to the memory pool.

Conclusion

This CEX project represents a deep dive into extreme systems optimization. By stripping away modern abstractions, abandoning generic data structures, and writing close to the metal, it achieves performance metrics that rival institutional trading platforms. It stands as a testament to the power of C in domains where determinism and speed are absolute requirements.

This case study delves into the architecture, design decisions, and implementation details of my CEX project—a proprietary limit order book and matching engine written entirely in C from scratch.

1. The Core Architecture

The architecture of the exchange is divided into three primary subsystems:

The Networking Layer: Custom-built HTTP and WebSocket servers for ingestion and broadcating.
The Matching Engine: The algorithmic core that maintains the limit order book and executes trades based on price-time priority.
The Memory Manager: A specialized pre-allocated memory pool system to eliminate OS-level allocation overhead during runtime.

Data Structures: The Limit Order Book (LOB)

At the heart of any exchange is the Limit Order Book. It must support incredibly fast insertions, deletions, and executions.

To achieve $O(1)$ time complexity for order execution and $O(\log N)$ for order insertion, the LOB is structured using a hybrid approach:

Price Levels: Managed using a self-balancing Red-Black Tree. This allows for fast querying of the best bid and ask prices.
Order Queues: Each price level node contains a pointer to a doubly linked list of individual orders. This enforces the "Time" aspect of "Price-Time Priority."

C17 lines|

// Simplified representation of the order book structures
typedef struct Order {
    uint64_t order_id;
    uint64_t user_id;
    uint64_t quantity;
    uint64_t price;
    struct Order* next;
    struct Order* prev;
} Order;

typedef struct PriceLevel {
    uint64_t price;
    uint64_t total_volume;
    Order* head;
    Order* tail;
    struct PriceLevel* left;
    struct PriceLevel* right;
} PriceLevel;

2. Eliminating Allocation Overhead

Custom Memory Pools

To guarantee deterministic latency, the CEX engine pre-allocates all necessary memory at startup. We use custom memory pools (slab allocators) for the Order and PriceLevel structs.

C15 lines|

typedef struct OrderPool {
    Order* blocks;
    Order** free_list;
    size_t capacity;
    size_t free_count;
} OrderPool;

// O(1) Allocation
Order* pool_alloc(OrderPool* pool) {
    if (pool->free_count == 0) return NULL; // Pool exhausted
    return pool->free_list[--pool->free_count];
}

// O(1) Deallocation
void pool_free(OrderPool* pool, Order* order) {
    pool->free_list[pool->free_count++] = order;
}

This ensures that under peak load, the engine is simply shifting pointers around rather than requesting memory from the operating system.

3. Precision and Math

A price of $50,000.50 is stored as 5000050000000.
Mathematical operations are reduced to simple, clock-cycle-efficient integer arithmetic.

4. The Networking Layer

A matching engine is useless if it cannot communicate with the outside world efficiently. The src/net directory contains a completely bespoke networking stack.

Avoiding Bloated Frameworks

The WebSocket Implementation:

Framing: Custom implementation of RFC 6455 framing to parse incoming trade requests and format outgoing order book deltas.
Zero-Copy Broadcasts: When a trade occurs, the JSON string representing the market delta is constructed exactly once in a shared memory buffer and written directly to the file descriptors of all subscribed WebSocket clients using send() or writev().

5. Performance and Profiling

During stress testing, the goal was to push the throughput to the absolute maximum on a single CPU thread, avoiding the massive overhead of lock contention (mutexes).

Single-Threaded by Design

Benchmarks

On standard consumer hardware (Intel Core i7):

Order Insertion (Passive): ~45 nanoseconds.
Order Execution (Aggressive): ~65 nanoseconds per matched resting order.
Throughput: Sustained processing of>1.5 million orders per second on a single core.

6. Challenges and Learnings

Building a financial engine from scratch in C brought several difficult challenges:

JSON Serialization Overhead: Parsing incoming JSON orders and formatting outgoing JSON updates was initially the biggest bottleneck, taking longer than the actual trade matching. I had to write a highly optimized, allocation-free string formatting routine specific to our data structures.
Handling Network Backpressure: When broadcasting market data to slow WebSocket clients, the TCP send buffers would fill up. Implementing a robust backpressure mechanism to drop stale market data updates for lagging clients without blocking the main engine thread was a complex state-machine challenge.
Strict Memory Safety: Without a garbage collector, tracking memory leaks required rigorous use of tools like Valgrind and AddressSanitizer during the testing phase. Every possible execution branch had to correctly return Order structs to the memory pool.

CEX (Centralized Exchange)

1. The Core Architecture

Data Structures: The Limit Order Book (LOB)

2. Eliminating Allocation Overhead

Custom Memory Pools

3. Precision and Math

4. The Networking Layer

Avoiding Bloated Frameworks

5. Performance and Profiling

Single-Threaded by Design

Benchmarks

6. Challenges and Learnings

Conclusion

CEX (Centralized Exchange)

1. The Core Architecture

Data Structures: The Limit Order Book (LOB)

2. Eliminating Allocation Overhead

Custom Memory Pools

3. Precision and Math

4. The Networking Layer

Avoiding Bloated Frameworks

5. Performance and Profiling

Single-Threaded by Design

Benchmarks

6. Challenges and Learnings

Conclusion