Performance Tuning Guide

This guide covers advanced performance tuning techniques to achieve sub-microsecond latency and maximum throughput with FIX-FastTrade.

Performance Overview

FIX-FastTrade is designed for ultra-low latency trading with the following performance characteristics:

  • Latency: Sub-microsecond message processing (< 1μs)

  • Throughput: 1M+ messages/second sustained

  • Memory: Zero-copy operations with custom allocators

  • CPU: Multi-core scaling with 90%+ efficiency

Platform-Specific Performance

Performance by Platform

Platform        Latency (μs)   Throughput (msg/s)   Memory Usage   CPU Efficiency
--------------  -------------  -------------------  -------------  ---------------
macOS ARM64     0.8            2.5M                 Excellent      Native NEON
macOS x86_64    0.6            4.0M                 Excellent      AVX2 optimized
Linux x86_64    0.5            5.0M                 Excellent      Full SIMD
Linux ARM64     0.9            2.2M                 Excellent      NEON optimized

System-Level Optimizations

CPU Configuration

CPU Affinity and Isolation

Bind critical threads to specific CPU cores:

# Isolate CPUs for FIX-FastTrade (add to kernel boot parameters)
isolcpus=1,2,3,4

# Set CPU governor to performance mode
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable Turbo Boost for a steadier clock frequency (Intel pstate driver only)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
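After a reboot it is worth confirming that the kernel actually received the isolcpus parameter. The following is a minimal sketch, not part of FIX-FastTrade: the function name isolcpus_from_cmdline is hypothetical, and it only parses a command line string that you pass in.

```shell
# Hypothetical helper: print the value given to isolcpus= in a kernel
# command line string, or nothing if the parameter is absent.
isolcpus_from_cmdline() (
    set -f                      # no filename expansion while splitting words
    for word in $1; do
        case "$word" in
            isolcpus=*) printf '%s\n' "${word#isolcpus=}" ;;
        esac
    done
)

# Typical use on a running Linux system:
#   isolcpus_from_cmdline "$(cat /proc/cmdline)"
# The kernel also reports the effective list in
#   /sys/devices/system/cpu/isolated
```

An empty result means the boot parameter was not applied, in which case the pinned threads still share their cores with the general scheduler.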

NUMA Optimization

For multi-socket systems:

# Check NUMA topology
numactl --hardware

# Run FIX-FastTrade on specific NUMA node
numactl --cpunodebind=0 --membind=0 ./bin/fix-fasttrade --config config/fix-config.xml

Memory Optimization

Huge Pages Configuration

Enable huge pages for better memory performance:

# Configure huge pages (add to /etc/sysctl.conf)
vm.nr_hugepages = 1024

# Apply settings
sudo sysctl -p

# Verify huge pages
grep Huge /proc/meminfo
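How much memory a given vm.nr_hugepages value reserves depends on the huge page size, which the Hugepagesize line of /proc/meminfo reports. The sketch below works through that arithmetic. It assumes a 2048 kB page size, which is common on x86_64 but should be checked on the target machine, and the function name hugepages_needed is hypothetical.

```shell
# Hypothetical helper: number of huge pages needed to back a region of
# $1 bytes when each huge page is $2 kB (Hugepagesize in /proc/meminfo).
hugepages_needed() (
    bytes=$1
    page_bytes=$(( $2 * 1024 ))
    # Round up so a partial page still gets a whole huge page.
    echo $(( (bytes + page_bytes - 1) / page_bytes ))
)

# 2 GiB backed by 2048 kB pages:
hugepages_needed $(( 2 * 1024 * 1024 * 1024 )) 2048   # prints 1024
```

Read the other way round, vm.nr_hugepages = 1024 with 2048 kB pages sets aside 2 GiB that ordinary allocations can no longer use.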

Memory Locking

Lock memory pages to prevent swapping:

# Increase memory lock limits (add to /etc/security/limits.conf)
* soft memlock unlimited
* hard memlock unlimited

# Or run with sudo for memory locking
sudo ./bin/fix-fasttrade --config config/fix-config.xml --memory-lock

Network Optimization

Network Interface Tuning

Optimize network interface for low latency:

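# Increase network buffer sizes
# (tee runs under sudo; a plain ">>" redirect would be opened by the unprivileged shell)
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf > /dev/null
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf > /dev/null
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf > /dev/null
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf > /dev/null

# Disable TCP timestamps and SACK
echo 'net.ipv4.tcp_timestamps = 0' | sudo tee -a /etc/sysctl.conf > /dev/null
echo 'net.ipv4.tcp_sack = 0' | sudo tee -a /etc/sysctl.conf > /dev/null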

# Apply settings
sudo sysctl -p
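Appending lines to /etc/sysctl.conf adds a duplicate entry every time the commands are re-run. One way to avoid that is sketched below: a helper that replaces an existing key in a sysctl-style file or appends it if missing. The function name set_sysctl and the drop-in file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: set "key = value" in a sysctl-style file, replacing
# any existing line for the same key instead of appending a duplicate.
set_sysctl() (
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || exit 1
    # Keep every line that does not assign this key, then add the new one.
    if [ -f "$file" ]; then
        awk -v k="$key" '{
            line = $0
            sub(/^[ \t]+/, "", line)          # ignore leading whitespace
            split(line, parts, /[ \t]*=/)     # text before "=" is the key
            if (parts[1] != k) print $0
        }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"                      # keeps the target file's permissions
    rm -f "$tmp"
)

# Example (as root), followed by "sysctl --system" to load all drop-in files:
#   set_sysctl /etc/sysctl.d/99-fix-fasttrade.conf net.core.rmem_max 134217728
#   set_sysctl /etc/sysctl.d/99-fix-fasttrade.conf net.ipv4.tcp_rmem "4096 87380 134217728"
```

Keeping the settings in a dedicated file under /etc/sysctl.d also makes them easy to remove when comparing tuned and untuned runs.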

Network Interface Card (NIC) Tuning

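# Set interrupt affinity (IRQ 24 is an example; look up the NIC's IRQ in /proc/interrupts)
echo 2 | sudo tee /proc/irq/24/smp_affinity  # Mask 0x2 binds the interrupt to CPU 1

# Increase ring buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096

# Show hardware timestamping capabilities (this only queries; it does not enable anything)
ethtool -T eth0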
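The value written to smp_affinity is a hexadecimal bitmask in which bit N selects CPU N, so 2 (binary 10) means CPU 1. The sketch below converts a CPU list into that format. The function name cpu_list_to_mask is hypothetical, and the helper is intended for CPU numbers 0 to 31 only, since larger masks use a comma-separated multi-word format.

```shell
# Hypothetical helper: convert a comma-separated CPU list such as "0,1,2,3"
# into the hexadecimal bitmask format used by /proc/irq/<N>/smp_affinity.
cpu_list_to_mask() (
    mask=0
    IFS=,
    for cpu in $1; do
        mask=$(( mask | (1 << cpu) ))   # set bit number $cpu
    done
    printf '%x\n' "$mask"
)

cpu_list_to_mask 1          # prints 2
cpu_list_to_mask 0,1,2,3    # prints f
```

Where the kernel provides /proc/irq/<N>/smp_affinity_list, the CPU list can be written there directly and no conversion is needed.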

Application-Level Optimizations

Configuration Tuning

High-Performance Configuration

<performance>
    <!-- CPU affinity for critical threads -->
    <cpuAffinity>true</cpuAffinity>
    <cpuMain>0</cpuMain>
    <cpuFix>1</cpuFix>
    <cpuOrder>2</cpuOrder>
    <cpuMarket>3</cpuMarket>

    <!-- Memory optimization -->
    <memoryLock>true</memoryLock>
    <memoryPoolSize>2097152</memoryPoolSize>
    <preallocateMemory>true</preallocateMemory>

    <!-- Thread priority -->
    <threadPriority>-15</threadPriority>
    <useRealTimeScheduling>true</useRealTimeScheduling>

    <!-- Message processing -->
    <messageBufferSize>8192</messageBufferSize>
    <maxConcurrentMessages>50000</maxConcurrentMessages>
    <zeroCopyEnabled>true</zeroCopyEnabled>
</performance>
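Pinning a thread to a core only pays off if that core is also isolated from the general scheduler, so it can help to cross-check the configured cores against the isolcpus list. The sketch below does this with plain pattern matching rather than a real XML parser, and assumes each setting sits on its own line as in the example above. Both function names are hypothetical.

```shell
# Hypothetical helper: print the text of a single-line <tag>value</tag>
# element from a config file ($1 = file, $2 = tag name).
config_value() (
    sed -n "s|^[[:space:]]*<$2>\(.*\)</$2>[[:space:]]*\$|\1|p" "$1" | head -n 1
)

# Hypothetical helper: succeed if CPU number $1 appears in the
# comma-separated list $2.
cpu_in_list() (
    case ",$2," in
        *",$1,"*) exit 0 ;;
        *)        exit 1 ;;
    esac
)

# Example: warn about pinned threads whose core is not in isolcpus=1,2,3,4
#   for tag in cpuFix cpuOrder cpuMarket; do
#       cpu=$(config_value config/high-performance.xml "$tag")
#       cpu_in_list "$cpu" "1,2,3,4" || echo "warning: $tag=$cpu is not isolated"
#   done
```

With the sample values, cpuFix, cpuOrder and cpuMarket land on isolated cores, while cpuMain on core 0 does not; whether the main thread needs isolation is a deployment decision.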

Network Configuration for Low Latency

<session>
    <!-- TCP optimization -->
    <tcpNoDelay>true</tcpNoDelay>
    <socketBufferSize>131072</socketBufferSize>

    <!-- Reduce timeouts -->
    <connectTimeout>5</connectTimeout>
    <heartBtInt>15</heartBtInt>

    <!-- Disable unnecessary features -->
    <validateMessages>false</validateMessages>
    <logIncomingMessages>false</logIncomingMessages>
    <logOutgoingMessages>false</logOutgoingMessages>
</session>

Compiler Optimizations

Build with Maximum Optimization

# Build with aggressive optimizations
mkdir build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -flto -funroll-loops" \
  -DENABLE_SIMD=ON \
  -DENABLE_PROFILE_GUIDED_OPTIMIZATION=ON

make -j$(nproc)

Profile-Guided Optimization (PGO)

# Build with PGO instrumentation
cmake .. -DCMAKE_CXX_FLAGS="-fprofile-generate"
make -j$(nproc)

# Run training workload
./bin/fix-fasttrade --config config/training-config.xml

# Rebuild with PGO optimization
cmake .. -DCMAKE_CXX_FLAGS="-fprofile-use"
make -j$(nproc)

Runtime Optimizations

Command Line Tuning

# Maximum performance configuration
sudo ./bin/fix-fasttrade \
  --config config/high-performance.xml \
  --cpu-main 0 \
  --cpu-fix 1 \
  --cpu-order 2 \
  --cpu-market 3 \
  --memory-lock \
  --priority -15 \
  --stats-interval 300

Environment Variables

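# Disable address space randomization
# Note: the kernel does not read this variable, so it only has an effect if
# FIX-FastTrade checks it. For OS-level control, launch the binary through
#   setarch "$(uname -m)" -R ./bin/fix-fasttrade --config config/fix-config.xml
export ADDR_NO_RANDOMIZE=1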

# Set CPU affinity mask
export CPU_AFFINITY_MASK=0x0F

# Optimize memory allocation
export MALLOC_ARENA_MAX=1
export MALLOC_MMAP_THRESHOLD_=131072

Monitoring and Measurement

Performance Metrics

Built-in Statistics

FIX-FastTrade provides real-time performance metrics:

=== Performance Statistics ===
Uptime: 3600 seconds
Messages processed: 18,000,000
Orders processed: 1,200,000
Message rate: 5,000 msg/sec
Order rate: 333 orders/sec

Latency Statistics:
- Mean: 0.45μs
- P50: 0.42μs
- P95: 0.68μs
- P99: 1.23μs
- P99.9: 2.45μs

Queue Statistics:
- Order queue size: 0
- Message queue size: 0
- Memory pool utilization: 23%
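The rate lines in this report are averages over the whole uptime, which is easy to check by hand: 18,000,000 messages over 3600 seconds is 5,000 msg/sec. The sketch below repeats that arithmetic; the function name rate_per_sec is hypothetical and the result is rounded down to a whole number.

```shell
# Hypothetical helper: average events per second from a (possibly
# comma-grouped) counter and an uptime in seconds, rounded down.
rate_per_sec() (
    count=$(printf '%s' "$1" | tr -d ',')
    echo $(( count / $2 ))
)

rate_per_sec 18,000,000 3600   # prints 5000
rate_per_sec 1,200,000 3600    # prints 333
```

Because these are long-run averages, a short burst can be far above the reported rate; the latency percentiles are the better indicator of behavior under load.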

External Monitoring Tools

# CPU usage monitoring
top -p $(pgrep fix-fasttrade)

# Memory usage
pmap -x $(pgrep fix-fasttrade)

# Network statistics
ss -i dst :9878

# System-wide latency
cyclictest -p 80 -t5 -m -n

Latency Measurement

Timestamping

Enable hardware timestamping for accurate measurements:

# Check hardware timestamping support
ethtool -T eth0

# Enable hardware timestamping in application
export ENABLE_HW_TIMESTAMPING=1

Latency Testing

# Run latency benchmark
./bin/fix-fasttrade-benchmark \
  --config config/benchmark.xml \
  --test-duration 300 \
  --message-rate 10000
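Tail percentiles such as P99 and P99.9 can also be computed independently from raw samples. The output format of the benchmark is not described in this guide, so the sketch below assumes a file with one latency value per line; the function name percentile and the file name latencies.txt are hypothetical. It uses the nearest-rank method.

```shell
# Hypothetical helper: nearest-rank percentile of numeric samples read from
# stdin, one value per line. $1 is the percentile, e.g. 50, 99 or 99.9.
percentile() (
    LC_ALL=C sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = p * NR / 100
            idx = int(rank)
            if (idx < rank) idx++      # round up to the next whole rank
            if (idx < 1) idx = 1
            print v[idx]
        }'
)

# Example:
#   percentile 99   < latencies.txt
#   percentile 99.9 < latencies.txt
```

A P99.9 value is only meaningful with well over a thousand samples; with fewer, the nearest-rank result is simply the maximum.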

Troubleshooting Performance Issues

Common Performance Problems

High Latency Symptoms

  • P99 latency > 10μs

  • Message processing rate < 100,000 msg/sec

  • High CPU wait time

Diagnostic Steps

# Check CPU frequency scaling
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

# Monitor context switches
vmstat 1

# Check for memory swapping
free -h && cat /proc/swaps

# Network interface statistics
cat /proc/net/dev

Memory Issues

# Check for memory leaks (diagnostic runs only; valgrind slows the process heavily)
valgrind --tool=memcheck --leak-check=full ./bin/fix-fasttrade --config config/test.xml

# Monitor memory allocation
strace -e trace=mmap,munmap,brk ./bin/fix-fasttrade --config config/test.xml

Network Issues

# Check network latency
ping -c 10 fix.exchange.com

# Monitor network drops
netstat -i

# Check TCP retransmissions
ss -i dst :9878

Performance Tuning Checklist

System Level

  • [ ] CPU governor set to performance mode

  • [ ] CPU frequency scaling disabled

  • [ ] Huge pages configured and enabled

  • [ ] Memory locking limits increased

  • [ ] Network buffers optimized

  • [ ] Interrupt affinity configured

  • [ ] NUMA topology optimized

Application Level

  • [ ] CPU affinity configured for all threads

  • [ ] Memory locking enabled

  • [ ] Real-time thread priorities set

  • [ ] Zero-copy message processing enabled

  • [ ] Message validation disabled (if safe)

  • [ ] Logging minimized for production

  • [ ] Connection pooling optimized

Build Configuration

  • [ ] Release build with maximum optimization

  • [ ] Link-time optimization (LTO) enabled

  • [ ] Profile-guided optimization applied

  • [ ] SIMD instructions enabled

  • [ ] Native CPU architecture targeting

Monitoring Setup

  • [ ] Real-time performance metrics enabled

  • [ ] Latency percentiles tracked

  • [ ] System resource monitoring active

  • [ ] Network performance monitoring

  • [ ] Alert thresholds configured

Expected Performance Targets

After applying these optimizations, you should achieve:

  • Latency: P99 < 2μs, P99.9 < 5μs

  • Throughput: > 1M messages/second

  • CPU Usage: < 50% on dedicated cores

  • Memory: < 1GB resident set size

  • Network: < 0.1ms round-trip time to exchange
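The latency and throughput targets above lend themselves to an automated pass/fail check, for example at the end of a benchmark run. The sketch below is one possible shape for such a check; the function name meets_targets is hypothetical, and the measured values have to be supplied by the caller.

```shell
# Hypothetical helper: succeed only if measured values meet the latency and
# throughput targets. Arguments: P99 (us), P99.9 (us), throughput (msg/s).
meets_targets() (
    awk -v p99="$1" -v p999="$2" -v rate="$3" 'BEGIN {
        ok = (p99 + 0 < 2) && (p999 + 0 < 5) && (rate + 0 > 1000000)
        exit !ok
    }'
)

# Example:
#   meets_targets 1.23 2.45 1500000 && echo "targets met"
```

The CPU, memory and network targets are left out here because they come from different tools and are usually tracked by system monitoring rather than by the benchmark itself.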

For specific performance requirements or advanced tuning, consult the troubleshooting reference (:doc:`../reference/troubleshooting/index`) or contact support.