Performance Tuning - Timeplus Proton

Timeplus Proton is designed for high-performance stream processing. This guide covers optimization techniques to achieve maximum throughput and minimum latency.

Performance Characteristics

Key performance figures for Timeplus Proton:

Throughput: Up to 90 million events per second (EPS) on modern hardware
Latency: As low as 4 milliseconds end-to-end latency
Cardinality: Handles 1 million unique keys in aggregations
Resource Efficient: Runs on as little as 0.5 GB RAM (AWS t2.nano)

Benchmark hardware: Apple MacBook Pro with an M2 Max processor

Memory Optimization

Memory Configuration

Configure memory limits based on available RAM:

# config.yaml

# Use up to 90% of total RAM
max_server_memory_usage_to_ram_ratio: 0.9

# Cache up to 50% of RAM
cache_size_to_ram_max_ratio: 0.5

# Mark cache (index cache)
mark_cache_size: 5368709120  # 5 GB

# Uncompressed block cache
uncompressed_cache_size: 8589934592  # 8 GB

# Primary key cache
primary_key_cache_size: 5368709120  # 5 GB

Memory Limit Recommendations

RAM Available	max_server_memory_usage_to_ram_ratio	mark_cache_size	uncompressed_cache_size
4 GB	0.7	512 MB	1 GB
8 GB	0.8	1 GB	2 GB
16 GB	0.85	2 GB	4 GB
32 GB	0.9	5 GB	8 GB
64 GB+	0.9	10 GB	16 GB
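The table can be expressed as a lookup that rounds a host down to the nearest listed tier and prints config.yaml lines. This is a minimal sketch. The function names are hypothetical, and rounding down between tiers is an assumption. Cache sizes use binary units (1 GB = 1024³ bytes) so that they match the byte values in the config example.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

# Rows of the recommendations table, largest tier first:
# (minimum RAM in GiB, max_server_memory_usage_to_ram_ratio,
#  mark_cache_size in bytes, uncompressed_cache_size in bytes)
MEMORY_TIERS = [
    (64, 0.9, 10 * GiB, 16 * GiB),
    (32, 0.9, 5 * GiB, 8 * GiB),
    (16, 0.85, 2 * GiB, 4 * GiB),
    (8, 0.8, 1 * GiB, 2 * GiB),
    (4, 0.7, 512 * MiB, 1 * GiB),
]


def recommend_memory_settings(ram_gib: float) -> dict:
    """Hypothetical helper: pick the largest tier that does not exceed ram_gib."""
    for min_ram, ratio, mark_cache, uncompressed_cache in MEMORY_TIERS:
        if ram_gib >= min_ram:
            return {
                "max_server_memory_usage_to_ram_ratio": ratio,
                "mark_cache_size": mark_cache,
                "uncompressed_cache_size": uncompressed_cache,
            }
    raise ValueError("the table does not cover hosts with less than 4 GB of RAM")


def to_config_lines(settings: dict) -> str:
    """Render settings as flat config.yaml lines."""
    return "\n".join(f"{name}: {value}" for name, value in settings.items())


if __name__ == "__main__":
    print(to_config_lines(recommend_memory_settings(32)))
```

For a 32 GB host, this prints mark_cache_size: 5368709120 and uncompressed_cache_size: 8589934592, which are the same values used in the config example.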

Per-Query Memory Limits

Configure in users.yaml:

profiles:
  default:
    max_memory_usage: 10000000000  # 10 GB per query
    max_memory_usage_for_user: 20000000000  # 20 GB per user
    
  low_memory:
    max_memory_usage: 1000000000   # 1 GB per query
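These profile values are decimal: 10000000000 bytes is 10 GB, or about 9.31 GiB. The cache sizes in config.yaml, by contrast, are binary. A small sanity check over a profile can catch a per-query limit set above the per-user limit and show how many maximum-size queries one user can run at once. The sketch below uses hypothetical helpers that take the profile as a Python dict. It assumes, as in ClickHouse, that 0 means unlimited.

```python
GB = 1000 ** 3  # decimal gigabyte, as in the users.yaml comments


def check_profile(profile: dict) -> list:
    """Return warnings for a users.yaml profile (0 or missing = unlimited)."""
    warnings = []
    per_query = profile.get("max_memory_usage", 0)
    per_user = profile.get("max_memory_usage_for_user", 0)
    if per_query and per_user and per_query > per_user:
        warnings.append(
            "max_memory_usage is larger than max_memory_usage_for_user; "
            "the per-user limit will stop a single query first"
        )
    return warnings


def max_full_size_queries(profile: dict):
    """How many queries at the per-query limit fit under the per-user limit."""
    per_query = profile.get("max_memory_usage", 0)
    per_user = profile.get("max_memory_usage_for_user", 0)
    if not per_query or not per_user:
        return None  # at least one side is unlimited
    return per_user // per_query


default_profile = {"max_memory_usage": 10 * GB, "max_memory_usage_for_user": 20 * GB}
print(check_profile(default_profile), max_full_size_queries(default_profile))
```

For the default profile above, this reports no warnings and room for two concurrent queries at the 10 GB cap.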

Monitor Memory Usage

-- Current memory usage
SELECT 
    formatReadableSize(value) AS memory_used
FROM system.asynchronous_metrics
WHERE metric = 'jemalloc.allocated';

-- Memory by query
SELECT 
    query_id,
    user,
    formatReadableSize(memory_usage) AS memory,
    query
FROM system.processes
ORDER BY memory_usage DESC;

CPU Optimization

Thread Pool Configuration

Optimize thread pools for your CPU core count:

# config.yaml

# Background processing (set to CPU cores)
background_pool_size: 16
background_merge_pool_size: 16

# Data fetching threads
background_fetches_pool_size: 8

# Data movement between disks
background_move_pool_size: 8

# Scheduled tasks
background_schedule_pool_size: 128

# Streaming query processing
streaming_processing_pool_size: 100

Thread Pool Sizing Guidelines

CPU Cores	background_pool_size	background_merge_pool_size	streaming_processing_pool_size
2-4	4	4	50
8	8	8	100
16	16	16	200
32+	32	32	300
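The sizing table can also be applied automatically from the detected core count. In this sketch the function name is hypothetical, and core counts between rows are rounded down to the nearest row, which is an assumption. Note that os.cpu_count() counts logical CPUs, including hyperthreads. Inside a container, it may also report the host's CPUs rather than the container's limit.

```python
import os

# Rows of the sizing table, largest tier first:
# (minimum cores, background_pool_size, background_merge_pool_size,
#  streaming_processing_pool_size)
THREAD_POOL_TIERS = [
    (32, 32, 32, 300),
    (16, 16, 16, 200),
    (8, 8, 8, 100),
    (2, 4, 4, 50),
]


def thread_pool_settings(cores=None) -> dict:
    """Hypothetical helper: map a core count to the closest lower table row."""
    if cores is None:
        cores = os.cpu_count() or 1
    row = THREAD_POOL_TIERS[-1]  # fall back to the smallest row below 2 cores
    for tier in THREAD_POOL_TIERS:
        if cores >= tier[0]:
            row = tier
            break
    _, background, merge, streaming = row
    return {
        "background_pool_size": background,
        "background_merge_pool_size": merge,
        "streaming_processing_pool_size": streaming,
    }


if __name__ == "__main__":
    for name, value in thread_pool_settings().items():
        print(f"{name}: {value}")
```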

CPU Affinity

For dedicated servers, pin Proton to specific CPU cores:

# Linux: Use taskset
taskset -c 0-15 proton server

# Or, in the Proton systemd unit file:
[Service]
CPUAffinity=0-15
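Both taskset -c and systemd's CPUAffinity= take a CPU list such as 0-15. The sketch below parses the comma-separated form of that list so you can check which cores a value covers. It can also pin the current Python process with os.sched_setaffinity, which is Linux-only. It is illustrative only: it does not configure Proton, and it does not handle the space-separated form that systemd also accepts.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Parse a CPU list like "0-15" or "0-3,8,10-11" into a set of core IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid CPU range: {part!r}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> set:
    """Pin this process to the listed cores (Linux only) and return the new mask."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, parse_cpu_list(spec))
    return os.sched_getaffinity(0)
```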

SIMD Optimization

Proton uses SIMD instructions to accelerate query execution. Check that your CPU supports:

x86_64: AVX2 (minimum), AVX-512 (optimal)
ARM: NEON (standard on all ARM64/AArch64 CPUs)

Verify SIMD support:

SELECT * FROM system.build_options 
WHERE name LIKE '%SIMD%' OR name LIKE '%AVX%';
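The query above reports how Proton was built. On Linux, you can also check what the host CPU offers by reading /proc/cpuinfo. x86_64 kernels list features on flags lines (avx2, avx512f). ARM64 kernels list them on Features lines, where NEON appears as asimd. This is a minimal sketch, and the set of flags it looks for is an assumption based on the requirements listed above.

```python
import os

SIMD_FLAGS = {"sse4_2", "avx2", "avx512f", "asimd", "neon"}


def simd_flags(cpuinfo_text: str) -> set:
    """Collect SIMD-related CPU flags from the text of /proc/cpuinfo."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("flags", "Features"):
            found.update(flag for flag in value.split() if flag in SIMD_FLAGS)
    return found


def host_simd_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the local CPU's flags (Linux only)."""
    with open(path, encoding="utf-8") as cpuinfo:
        return simd_flags(cpuinfo.read())


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    flags = host_simd_flags()
    print("AVX2:", "avx2" in flags, "AVX-512:", "avx512f" in flags, "NEON:", "asimd" in flags)
```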

Query Concurrency Tuning

Concurrent Query Limits

# config.yaml

# Overall limits
max_concurrent_queries: 100

# By query type
max_concurrent_select_queries: 100
max_concurrent_insert_queries: 100

# Streaming queries
streaming_processing_pool_size: 100

Environment Variable Override

docker run -d \
  -e MAX_CONCURRENT_QUERIES=200 \
  -e MAX_CONCURRENT_STREAMING_QUERIES=150 \
  d.timeplus.com/timeplus-io/proton:latest

Adjust Based on Workload

High read throughput: Increase max_concurrent_select_queries
High write throughput: Increase max_concurrent_insert_queries
Many streaming queries: Increase streaming_processing_pool_size
Limited resources: Reduce all limits to prevent overload
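One general way to turn "increase" or "reduce" into a number is Little's law, from standard queueing theory rather than a Proton-specific rule. It states that the average number of queries in flight equals the arrival rate times the average duration. The sketch below adds an assumed safety margin for bursts. The helper name and the 1.5 default are illustrative choices, not Proton defaults.

```python
import math


def concurrency_limit(queries_per_second: float, avg_duration_s: float,
                      headroom: float = 1.5) -> int:
    """Little's law (L = arrival rate x time in system) plus a burst margin.

    headroom is an assumed safety factor, not a Proton default.
    """
    if queries_per_second < 0 or avg_duration_s < 0:
        raise ValueError("rate and duration must be non-negative")
    if headroom < 1:
        raise ValueError("headroom should be at least 1")
    return math.ceil(queries_per_second * avg_duration_s * headroom)


# 40 short SELECTs per second averaging 0.5 s each keep about 20 in flight;
# with 1.5x headroom this suggests a limit of 30.
print(concurrency_limit(40, 0.5))
```

This estimate fits short SELECT and INSERT queries. Streaming queries run until cancelled, so each one occupies capacity for its whole lifetime. When sizing for them, count them directly rather than estimating from a rate.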

Storage Optimization

Disk Selection

Best: NVMe SSD for data and checkpoints
Good: SATA SSD for data, separate disk for checkpoints
Minimum: SSD (avoid HDD for production)

Data Path Configuration

# Separate data and temporary paths
path: /data/proton/
tmp_path: /tmp/proton/

# Use fast storage for checkpoints
query_state_checkpoint:
  path: /nvme/proton/checkpoint/

# Separate disk for logs
logger:
  log: /logs/proton/proton-server.log
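To confirm that the data, checkpoint and log paths actually sit on different filesystems, you can compare their device IDs. This is a minimal sketch using os.stat. Note that st_dev identifies the mounted filesystem, so two partitions of one physical disk still report different IDs. This check cannot tell that they share hardware.

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True if two existing paths are on the same mounted filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Paths from the example config; adjust to your layout.
    data_dir, checkpoint_dir = "/data/proton/", "/nvme/proton/checkpoint/"
    if os.path.exists(data_dir) and os.path.exists(checkpoint_dir):
        print("shared filesystem:", same_filesystem(data_dir, checkpoint_dir))
```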

Compression Settings

Balance compression ratio vs. CPU usage:

-- Create a stream with per-column compression
CREATE STREAM events (
  timestamp datetime64(3),
  user_id string,
  event_type string,
  payload string CODEC(ZSTD(1))  -- Fast compression (ZSTD levels 1-22)
)
SETTINGS 
  min_compress_block_size = 65536;

Compression levels:

ZSTD(1): Fastest, lower compression
ZSTD(3): Balanced (recommended)
ZSTD(9): Higher compression, slower
LZ4: Faster than ZSTD, lower ratio
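The reliable way to pick a level is to measure on your own data. The sketch below illustrates the ratio-versus-CPU trade-off using zlib from the Python standard library as a stand-in for ZSTD. The sample payload is synthetic, and the numbers will not match Proton's codecs. Only the shape of the trade-off carries over.

```python
import time
import zlib


def compare_levels(data: bytes, levels=(1, 6, 9)) -> list:
    """Compress data at each zlib level; return (level, ratio, milliseconds)."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if zlib.decompress(compressed) != data:
            raise RuntimeError("round trip failed")
        results.append((level, len(data) / len(compressed), elapsed_ms))
    return results


# Synthetic JSON-like payloads with repeating keys, similar in spirit to a payload column.
sample = "".join(
    f'{{"user_id": "u{i % 1000}", "event_type": "click", "seq": {i}}}\n'
    for i in range(20_000)
).encode()

for level, ratio, elapsed_ms in compare_levels(sample):
    print(f"zlib level {level}: compression ratio {ratio:.1f}:1, {elapsed_ms:.1f} ms")
```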

Network Optimization

TCP Settings

# config.yaml

# Connection limits
max_connections: 4096
keep_alive_timeout: 3         # HTTP keep-alive, in seconds

# TCP keep-alive
tcp_keep_alive_timeout: 30    # seconds

# Bandwidth caps in bytes per second (not buffer sizes)
max_network_bandwidth: 1000000000  # 1 GB/s, i.e. 8 Gbit/s (1 Gbit/s = 125000000)
max_network_bandwidth_for_user: 1000000000
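Bandwidth limits are set in bytes per second, while link speeds are usually quoted in bits per second. A small conversion sketch with hypothetical helper names avoids the factor-of-eight mistake:

```python
BITS_PER_BYTE = 8


def gbps_to_bytes_per_second(gbps: float) -> int:
    """Convert a link speed in gigabits per second to whole bytes per second."""
    return int(gbps * 1_000_000_000 // BITS_PER_BYTE)


def bytes_per_second_to_gbps(bytes_per_second: int) -> float:
    """Convert a bytes-per-second setting back to gigabits per second."""
    return bytes_per_second * BITS_PER_BYTE / 1_000_000_000


print(gbps_to_bytes_per_second(1))              # -> 125000000 (a 1 Gbit/s link)
print(bytes_per_second_to_gbps(1_000_000_000))  # -> 8.0 (the value in the example config)
```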

Kafka/Redpanda Tuning

When Kafka or Redpanda is used as stream storage, tune its producer and consumer settings:

stream_storage:
  kafka:
    enabled: true
    brokers: kafka:9092
    
    # Producer latency control
    queue_buffering_max_ms: 50    # Lower = lower latency
    batch_size: 1048576           # 1 MB batches
    
    # Consumer settings
    fetch_wait_max_ms: 500        # Max wait for data
    fetch_min_bytes: 1            # Fetch immediately
    fetch_max_bytes: 52428800     # 50 MB max fetch
    
    # Parallelism
    num_consumers: 8              # Consumer threads

Latency vs. Throughput Tradeoff

For minimum latency (producer batches wait at most 10 ms):

queue_buffering_max_ms: 10
fetch_wait_max_ms: 100
batch_size: 16384  # Smaller batches

For maximum throughput:

queue_buffering_max_ms: 100
fetch_wait_max_ms: 500
batch_size: 1048576  # Larger batches
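The trade-off comes from how a Kafka producer batches. A batch is sent when it fills to batch_size bytes or when it has waited queue_buffering_max_ms, whichever comes first. The simplified model below estimates that wait from an assumed event rate and event size, which shows which limit dominates at a given rate. It ignores compression, broker round trips and per-partition batching.

```python
def producer_batch_wait_ms(events_per_second: float, event_bytes: int,
                           batch_size: int, queue_buffering_max_ms: float) -> float:
    """Approximate wait of the first event in a batch before the batch is sent.

    Simplified model: one partition, no compression; the batch ships when it
    reaches batch_size bytes or after queue_buffering_max_ms, whichever is first.
    """
    bytes_per_ms = events_per_second * event_bytes / 1000
    if bytes_per_ms <= 0:
        return float(queue_buffering_max_ms)
    fill_ms = batch_size / bytes_per_ms
    return min(fill_ms, float(queue_buffering_max_ms))


for label, batch_size, linger_ms in [("latency", 16_384, 10), ("throughput", 1_048_576, 100)]:
    for eps in (1_000, 1_000_000):
        wait = producer_batch_wait_ms(eps, 200, batch_size, linger_ms)
        print(f"{label:10} profile, {eps:>9,} events/s of 200 B: ~{wait:.2f} ms")
```

At 1,000 events/s, both profiles hit their time cap (10 ms versus 100 ms). At 1,000,000 events/s, batches fill first (about 0.08 ms versus 5.24 ms). On the consumer side, fetch_min_bytes: 1 lets the broker answer as soon as any data is available, so fetch_wait_max_ms mainly bounds how long an idle fetch waits.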

Checkpoint Optimization

For stateful streaming queries:

query_state_checkpoint:
  path: /nvme/proton/checkpoint/
  
  # Auto-tune checkpoint intervals
  interval: 0  # Auto mode
  
  # Lightweight state (ETL)
  light_state_interval: 5          # 5 seconds
  
  # Heavy state (large aggregations)
  heavy_state_interval: 900        # 15 minutes
  heavy_state_size_threshold: 524288000  # 500 MB
  
  # Minimize checkpoint overhead
  log_flush_interval_entries: 10   # Batch log writes
  log_segment_size: 2147483648     # 2 GB segments
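The configuration does not spell out how auto mode (interval: 0) chooses between the two intervals. The sketch below assumes the simplest reading: a switch on heavy_state_size_threshold. It also estimates checkpoint write bandwidth under the further assumption that every checkpoint writes the full state. Under that assumption, the estimate shows why large states get the long interval: writing 500 MB every 5 seconds would need about 100 MB/s of sustained disk throughput.

```python
MiB = 1024 ** 2


def checkpoint_interval_s(state_bytes: int,
                          light_state_interval: int = 5,
                          heavy_state_interval: int = 900,
                          heavy_state_size_threshold: int = 524_288_000) -> int:
    """Assumed auto-mode rule: heavy interval once state reaches the threshold."""
    if state_bytes >= heavy_state_size_threshold:
        return heavy_state_interval
    return light_state_interval


def checkpoint_write_rate(state_bytes: int, interval_s: int) -> float:
    """Bytes per second written if each checkpoint stores the full state (assumption)."""
    return state_bytes / interval_s


for state in (10 * MiB, 500 * MiB, 4096 * MiB):
    interval = checkpoint_interval_s(state)
    rate = checkpoint_write_rate(state, interval) / MiB
    print(f"state {state // MiB:>5} MiB -> every {interval:>3} s, ~{rate:.2f} MiB/s")
```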

Checkpoint Best Practices

Use fast storage (NVMe) for checkpoint directory
Tune intervals based on state size
Monitor checkpoint latency in query logs
Clean up old checkpoints via TTL settings
Separate checkpoint disk from data disk if possible

Streaming Query Optimization

Window Function Tuning

Optimize tumble/hop/session windows:

-- Use smaller window intervals for lower latency
SELECT 
  window_start,
  count(*) AS event_count
FROM tumble(events, 5s)  -- 5-second windows
GROUP BY window_start;

-- For high cardinality aggregations
SELECT 
  window_start,
  user_id,
  count(*) AS events
FROM tumble(events, 1m)
GROUP BY window_start, user_id
SETTINGS 
  max_memory_usage = 20000000000;  -- Raise the per-query memory cap to 20 GB
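A tumble window assigns each event to the fixed, non-overlapping interval that contains its timestamp. A window's result can only be emitted after that interval ends, which is why 5s windows report sooner than 1m windows. The sketch below reproduces that assignment in plain Python with millisecond timestamps, modelling windows as epoch-aligned. The event dicts and field names are hypothetical. Grouping by (window, user_id) also shows that state grows with the number of distinct keys per window, and that growth is the memory cost of the high-cardinality example.

```python
from collections import Counter


def window_start_ms(ts_ms: int, size_ms: int) -> int:
    """Start of the epoch-aligned tumbling window that contains ts_ms."""
    return ts_ms - ts_ms % size_ms


def tumble_count(events, size_ms: int, key=None) -> Counter:
    """Count events per window, or per (window, key) when key is given."""
    counts = Counter()
    for event in events:
        start = window_start_ms(event["ts_ms"], size_ms)
        counts[(start, event[key]) if key else start] += 1
    return counts


# Hypothetical events: millisecond timestamps plus a user_id.
events = [
    {"ts_ms": 1_000, "user_id": "a"},
    {"ts_ms": 4_999, "user_id": "b"},
    {"ts_ms": 5_000, "user_id": "a"},
    {"ts_ms": 61_000, "user_id": "a"},
]

print(tumble_count(events, 5_000))                  # like tumble(events, 5s)
print(tumble_count(events, 60_000, key="user_id"))  # like tumble(events, 1m) grouped by user_id
```

With 5-second windows, the four events fall into windows starting at 0, 5000 and 60000 ms. With 1-minute windows grouped by user, each distinct user in a window adds one entry to the state.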

Materialized View Optimization

-- Materialized view with tuned storage settings
CREATE MATERIALIZED VIEW user_stats
ENGINE = SummingMergeTree()
ORDER BY (user_id, date)
SETTINGS 
  index_granularity = 8192,        -- Default, good for most
  merge_max_block_size = 8192,     -- Merge block size
  min_bytes_for_wide_part = 10485760  -- 10 MB for wide format
AS SELECT 
  user_id,
  to_date(timestamp) AS date,
  count(*) AS event_count
FROM events
GROUP BY user_id, date;

Avoid Common Pitfalls

Don’t use SELECT * - specify only needed columns
Avoid unbounded state - use time windows
Limit JOIN complexity - pre-aggregate when possible
Use appropriate data types - smaller types = better performance
Partition large tables - by date or other key

Scaling Strategies

Vertical Scaling

Add more RAM - improves caching and query parallelism
Upgrade CPU - faster cores with AVX-512
Use NVMe storage - reduces I/O bottlenecks
Increase network bandwidth - for Kafka integration

Horizontal Scaling Patterns

Pattern 1: Dedicated Compute Nodes

┌──────────────┐
│ Proton Node 1├──┐
│ (Compute)    │  │     ┌─────────┐
└──────────────┘  ├────►│  Kafka  │
┌──────────────┐  │     └─────────┘
│ Proton Node 2├──┘
│ (Compute)    │
└──────────────┘

Each node processes different streams independently.

Pattern 2: Topic Partitioning

Partition Kafka topics and assign partitions to different Proton instances:

-- Node 1: Partitions 0-3
-- (raw string): each Kafka message is read into a single string column
CREATE EXTERNAL STREAM events_node1 (raw string)
SETTINGS 
  type='kafka',
  brokers='kafka:9092',
  topic='events',
  partitions='0,1,2,3';

-- Node 2: Partitions 4-7
CREATE EXTERNAL STREAM events_node2 (raw string)
SETTINGS 
  type='kafka',
  brokers='kafka:9092',
  topic='events',
  partitions='4,5,6,7';
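With more partitions or nodes, a helper can split the partition IDs into contiguous ranges and render the comma-separated value used in the partitions setting. This is a minimal sketch with hypothetical function names. Contiguous ranges are only one choice; a round-robin split would also balance the load.

```python
def assign_partitions(num_partitions: int, num_nodes: int) -> dict:
    """Split partitions 0..num_partitions-1 into contiguous, near-equal ranges."""
    if num_nodes < 1 or num_partitions < num_nodes:
        raise ValueError("need at least one node and one partition per node")
    base, extra = divmod(num_partitions, num_nodes)
    assignment, start = {}, 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)  # spread the remainder
        assignment[node] = list(range(start, start + count))
        start += count
    return assignment


def partitions_value(partitions: list) -> str:
    """Render partition IDs as a comma-separated settings value, e.g. '0,1,2,3'."""
    return ",".join(str(p) for p in partitions)


for node, parts in assign_partitions(8, 2).items():
    print(f"node {node + 1}: partitions='{partitions_value(parts)}'")
```

For 8 partitions and 2 nodes, this reproduces the split in the example. For 10 partitions and 3 nodes, it gives 0-3, 4-6 and 7-9.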

Load Balancing

Use a load balancer for query distribution:

         ┌─────────────┐
         │Load Balancer│
         └──────┬──────┘
         ┌──────┴──────┐
    ┌────▼───┐    ┌────▼───┐
    │Proton 1│    │Proton 2│
    └────────┘    └────────┘

Performance Monitoring

Key Metrics to Track

-- Query throughput (finished queries per second over the last minute)
SELECT 
  count(*) / 60 AS queries_per_second
FROM system.query_log
WHERE query_start_time >= now() - INTERVAL 1 MINUTE
  AND type = 'QueryFinish';

-- Average query latency
SELECT 
  avg(query_duration_ms) AS avg_latency_ms,
  quantile(query_duration_ms, 0.95) AS p95_latency_ms
FROM system.query_log
WHERE query_start_time >= now() - INTERVAL 5 MINUTE
  AND type = 'QueryFinish';

-- Memory efficiency
SELECT 
  formatReadableSize(value) AS memory_used,
  round(value / (SELECT value FROM system.asynchronous_metrics 
                 WHERE metric = 'OSMemoryTotal') * 100, 2) AS memory_pct
FROM system.asynchronous_metrics
WHERE metric = 'jemalloc.allocated';
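Proton's quantile function is approximate. When you need an exact reference value, for example for latencies measured on the client side, the nearest-rank percentile is easy to compute. The sketch below is a generic implementation, not Proton's algorithm, and the latency samples are made up for illustration.

```python
import math


def percentile_nearest_rank(samples, p: float):
    """Exact nearest-rank percentile: the smallest value with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 12, 95]
print(sum(latencies_ms) / len(latencies_ms), percentile_nearest_rank(latencies_ms, 95))
```

Here the mean is 44.6 ms and the median is 14 ms, while the p95 is 240 ms. Tracking p95 alongside the average therefore exposes slow outliers that the mean blurs.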

Benchmark Your Configuration

-- Create test stream
CREATE RANDOM STREAM perf_test (
  id uint64 DEFAULT rand64(),
  timestamp datetime64(3) DEFAULT now64(3),
  value float64 DEFAULT rand() / 1000
);

-- Events per 1-second window approximate the generation rate (events per second)
SELECT 
  window_start,
  count(*) AS events,
  avg(value) AS avg_value
FROM tumble(perf_test, 1s)
GROUP BY window_start;

-- Running total of generated events (a streaming query that keeps emitting updates until cancelled)
SELECT count(*) FROM perf_test;

Performance Troubleshooting

Slow Queries

-- Find slow queries
SELECT 
  query,
  query_duration_ms,
  memory_usage,
  read_rows
FROM system.query_log
WHERE query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 10;

Common fixes:

Add indexes on filter columns
Reduce data scanned (use time filters)
Increase max_memory_usage
Optimize JOIN order

High Memory Usage

-- Identify memory-intensive queries
SELECT 
  query_id,
  user,
  formatReadableSize(memory_usage) AS memory,
  query
FROM system.processes
ORDER BY memory_usage DESC;

Solutions:

Reduce query concurrency
Decrease cache sizes
Lower max_server_memory_usage_to_ram_ratio
Add more RAM

Low Throughput

Diagnose:

Check CPU utilization
Monitor disk I/O wait
Verify network bandwidth
Review query complexity

Optimize:

Increase thread pool sizes
Use faster storage
Batch smaller queries
Scale horizontally

Performance Best Practices

Right-size memory allocation - 70-90% of available RAM
Use fast storage - NVMe SSD for production
Optimize thread pools - match CPU core count
Monitor query performance - track p95/p99 latency
Tune Kafka settings - balance latency vs. throughput
Checkpoint on fast disks - reduce state overhead
Use appropriate data types - smaller = faster
Partition large tables - improve query pruning
Limit query complexity - simpler queries perform better
Scale horizontally - when vertical limits reached

Next Steps

Set up comprehensive Monitoring
Review Configuration settings
Plan your Deployment architecture
