Overview
Optimizing AI agent performance involves balancing multiple factors: response latency, token usage, memory consumption, and cost. This guide covers techniques for building high-performance agents that scale efficiently.
Performance Bottlenecks
Common Performance Issues
- LLM Latency (typically 1-10 seconds)
  - Network round-trip time
  - Model inference time
  - Token generation speed
- Token Usage (impacts cost and speed)
  - Large prompts increase latency
  - Context window limitations
  - Redundant information
- Tool Execution (varies by tool)
  - Database queries
  - API calls
  - File I/O operations
- Memory Usage
  - Large conversation histories
  - Caching overhead
  - Accumulated state
Response Caching
Prompt-Level Caching
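One way to implement prompt-level caching is a concurrent map keyed by the exact prompt text. The sketch below is illustrative, not any specific framework's API; a plain `Function<String, String>` stands in for the real LLM client:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical prompt-level cache: identical prompts hit the LLM only once.
public class PromptCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> llmCall; // stand-in for the real LLM client

    public PromptCache(Function<String, String> llmCall) {
        this.llmCall = llmCall;
    }

    public String complete(String prompt) {
        // the LLM is invoked only on a cache miss
        return cache.computeIfAbsent(prompt, llmCall);
    }
}
```

`computeIfAbsent` also deduplicates concurrent misses on the same key, so parallel agents asking the same question trigger a single upstream call.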
Cache LLM responses to avoid redundant calls.
File-Based Cache
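To survive restarts, cache entries can be written to disk instead of held in memory. A hedged sketch (hypothetical layout: one file per prompt, named by the prompt's SHA-256 hash):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

// Hypothetical file-backed cache: one file per prompt, named by its SHA-256 hash.
public class FileCache {
    private final Path dir;

    public FileCache(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path pathFor(String prompt) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(prompt.getBytes(StandardCharsets.UTF_8));
            return dir.resolve(HexFormat.of().formatHex(hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public Optional<String> get(String prompt) {
        try {
            Path p = pathFor(prompt);
            return Files.exists(p) ? Optional.of(Files.readString(p)) : Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void put(String prompt, String response) {
        try {
            Files.writeString(pathFor(prompt), response);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Hashing the prompt keeps file names valid regardless of prompt length or content.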
Persist cached responses across restarts (JVM/Android/iOS).
Redis Cache
Shared cache across multiple instances (JVM only).
Cache Configuration
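Typical cache knobs are a time-to-live and a maximum entry count. The sketch below is a hypothetical configuration, not a specific framework's settings; the clock is injected so expiry is deterministic in tests:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical TTL + max-size cache; evicts the least recently used entry
// once maxEntries is exceeded, and drops entries older than ttlMillis.
public class TtlCache {
    private record Entry(String value, long storedAtMillis) {}

    private final long ttlMillis;
    private final LongSupplier clock; // injected for testability
    private final Map<String, Entry> entries;

    public TtlCache(long ttlMillis, int maxEntries, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // access-order LinkedHashMap gives simple LRU eviction
        this.entries = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public void put(String key, String value) {
        entries.put(key, new Entry(value, clock.getAsLong()));
    }

    public String get(String key) {
        Entry e = entries.get(key);
        if (e == null) return null;
        if (clock.getAsLong() - e.storedAtMillis() > ttlMillis) {
            entries.remove(key); // expired
            return null;
        }
        return e.value();
    }
}
```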
Streaming with Cache
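On a cache hit, the stored text can be replayed to the consumer immediately; on a miss, chunks are forwarded as they arrive while being accumulated into the cache. A sketch, assuming a `Function<String, List<String>>` as a stand-in for a real token stream:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical streaming wrapper: replays cached text, or tees a live stream into the cache.
public class StreamingCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> streamCall; // stand-in for a real token stream

    public StreamingCache(Function<String, List<String>> streamCall) {
        this.streamCall = streamCall;
    }

    public void stream(String prompt, Consumer<String> onChunk) {
        String cached = cache.get(prompt);
        if (cached != null) {
            onChunk.accept(cached); // replay the whole cached response at once
            return;
        }
        StringBuilder full = new StringBuilder();
        for (String chunk : streamCall.apply(prompt)) {
            full.append(chunk);
            onChunk.accept(chunk); // forward to the consumer as it arrives
        }
        cache.put(prompt, full.toString());
    }
}
```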
Cached responses work with streaming.
Token Optimization
Minimize Prompt Size
Use Structured Outputs
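Asking the model for compact JSON instead of free-form prose usually shortens the response. The illustrative comparison below (example strings are made up) shows the same facts in both forms; token counts scale roughly with character counts, so the shorter output costs fewer tokens:

```java
// Illustrative comparison: the same facts as free-form prose vs. compact JSON.
public class StructuredOutputSavings {
    public static final String PROSE =
            "The user's name is Ada Lovelace, she is 36 years old, "
            + "and her preferred programming language is Kotlin.";
    public static final String JSON =
            "{\"name\":\"Ada Lovelace\",\"age\":36,\"language\":\"Kotlin\"}";

    public static void main(String[] args) {
        // the JSON form carries the same information in roughly half the characters
        System.out.println("prose: " + PROSE.length() + " chars, json: " + JSON.length() + " chars");
    }
}
```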
Structured outputs reduce token usage.
Truncate Long Contexts
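A common truncation scheme is a sliding window: always keep system messages, plus the last N turns. A minimal sketch with a hypothetical `Message` record:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sliding-window truncation: keep system messages plus the last N messages.
public class HistoryTruncator {
    public record Message(String role, String content) {}

    public static List<Message> truncate(List<Message> history, int keepLast) {
        List<Message> result = new ArrayList<>();
        for (Message m : history) {
            if (m.role().equals("system")) result.add(m); // always keep system prompts
        }
        int from = Math.max(0, history.size() - keepLast);
        for (Message m : history.subList(from, history.size())) {
            if (!m.role().equals("system")) result.add(m); // avoid duplicating system messages
        }
        return result;
    }
}
```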
Limit conversation history.
Summarize Old Messages
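Instead of dropping old turns outright, they can be collapsed into a single summary message. In the sketch below the summarizer is a pluggable function; in practice it would typically be a call to a cheap, fast model:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical compaction: replace everything but the last N messages with one summary.
public class HistorySummarizer {
    public record Message(String role, String content) {}

    public static List<Message> compact(List<Message> history, int keepLast,
                                        Function<String, String> summarize) {
        if (history.size() <= keepLast) return history;
        int cut = history.size() - keepLast;
        String old = history.subList(0, cut).stream()
                .map(m -> m.role() + ": " + m.content())
                .collect(Collectors.joining("\n"));
        List<Message> result = new ArrayList<>();
        result.add(new Message("system",
                "Summary of earlier conversation: " + summarize.apply(old)));
        result.addAll(history.subList(cut, history.size()));
        return result;
    }
}
```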
Token Counting
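Exact counts require the model's own tokenizer; for budgeting, a common rough heuristic is about four characters per token for English text. A small sketch using that heuristic:

```java
// Rough token estimation (~4 characters per token for English text);
// use the model's own tokenizer when exact counts matter.
public class TokenBudget {
    static int estimate(String text) {
        return (int) Math.ceil(text.length() / 4.0);
    }

    // Returns true if appending `next` would blow the token budget.
    static boolean wouldExceed(int usedTokens, String next, int budget) {
        return usedTokens + estimate(next) > budget;
    }
}
```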
Monitor token usage.
Parallel Execution
Parallel Nodes
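Independent branches of an agent workflow can run concurrently. A sketch using plain `CompletableFuture` (generic Java concurrency, not a framework-specific node API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Run independent operations concurrently and join the results in order.
public class ParallelNodes {
    public static List<String> runAll(List<Function<String, String>> ops, String input) {
        List<CompletableFuture<String>> futures = ops.stream()
                .map(op -> CompletableFuture.supplyAsync(() -> op.apply(input)))
                .toList();
        // allOf waits for every branch; total latency is the slowest branch, not the sum
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
        return futures.stream().map(CompletableFuture::join).toList();
    }
}
```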
Execute multiple operations concurrently.
Parallel Tool Execution
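When the model requests several independent tool calls in one turn, they can be dispatched to a thread pool together. A sketch with a hypothetical `ToolCall` record and a stubbed tool runner:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical parallel dispatch of independent tool calls requested by the model.
public class ParallelTools {
    public record ToolCall(String tool, String args) {}

    public static List<String> execute(List<ToolCall> calls) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, calls.size()));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (ToolCall c : calls) tasks.add(() -> runTool(c));
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task finishes and preserves request order
            for (Future<String> f : pool.invokeAll(tasks)) results.add(f.get());
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Stub tool runner; a real implementation would dispatch on c.tool().
    static String runTool(ToolCall c) {
        return c.tool() + "(" + c.args() + ")";
    }
}
```

Only parallelize tools that are truly independent; calls with ordering dependencies must still run sequentially.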
Performance Impact
Model Selection
Choose Appropriate Models
Dynamic Model Selection
Model Routing
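A simple router sends short, simple prompts to a cheap, fast model and escalates to a more capable one otherwise. The heuristics and model names below are placeholders, not recommendations:

```java
// Hypothetical router: cheap/fast model for simple requests,
// a more capable model for complex ones. Model names are placeholders.
public class ModelRouter {
    static final String FAST_MODEL = "fast-small-model";
    static final String POWERFUL_MODEL = "powerful-large-model";

    public static String pickModel(String prompt, boolean needsTools) {
        boolean complex = needsTools
                || prompt.length() > 2000            // long context
                || prompt.contains("step by step");  // explicit reasoning request
        return complex ? POWERFUL_MODEL : FAST_MODEL;
    }
}
```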
Connection Pooling
HTTP Client Configuration
Reuse LLM Clients
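Creating a new HTTP client per request defeats connection pooling and pays a TCP + TLS handshake on every LLM call. With the JDK's `java.net.http.HttpClient`, sharing one instance is enough, since it pools connections internally:

```java
import java.net.http.HttpClient;
import java.time.Duration;

// Share one HttpClient: it reuses pooled connections (and HTTP/2 multiplexing)
// across requests instead of handshaking per LLM call.
public class SharedHttpClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    public static HttpClient get() {
        return CLIENT;
    }
}
```

The same principle applies to LLM client objects: construct them once and reuse them across agent runs.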
Memory Management
Limit State Size
Use Weak References for Cache
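Wrapping cached values in references the garbage collector may clear lets the cache shrink under memory pressure instead of causing out-of-memory errors. A sketch; note that `SoftReference` is usually the better fit for caches than `WeakReference`, which can be cleared at any GC:

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache whose values the GC may reclaim under memory pressure.
public class SoftCache {
    private final Map<String, SoftReference<String>> map = new ConcurrentHashMap<>();

    public void put(String key, String value) {
        map.put(key, new SoftReference<>(value));
    }

    public String get(String key) {
        SoftReference<String> ref = map.get(key);
        if (ref == null) return null;
        String value = ref.get();
        if (value == null) map.remove(key); // value was reclaimed; drop the stale entry
        return value;
    }
}
```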
Stream Large Responses
Benchmarking
Measure Agent Performance
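A minimal harness wraps repeated agent runs, records wall-clock latency, and reports percentiles (p95 matters more than the mean for user-facing latency). A generic sketch with the agent run passed in as a `Runnable`:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Measure wall-clock latency over repeated runs and report percentiles.
public class AgentBenchmark {
    public static List<Long> measure(Runnable agentRun, int iterations) {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            agentRun.run();
            millis.add((System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    // Nearest-rank percentile, e.g. p = 95 for p95.
    public static long percentile(List<Long> latencies, double p) {
        List<Long> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }
}
```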
Load Testing
Best Practices
1. Cache Aggressively
2. Minimize LLM Calls
3. Use Streaming for Long Responses
4. Profile in Production
5. Set Appropriate Timeouts
Performance Targets
Typical Latencies
| Operation | Target | Notes |
|---|---|---|
| Simple LLM call | < 1s | With fast model (Haiku, GPT-3.5) |
| Complex LLM call | < 5s | With powerful model (GPT-4, Claude) |
| Tool execution | < 500ms | For local/cached operations |
| Agent execution | < 10s | End-to-end for typical workflow |
| Streaming first token | < 500ms | Time to first response |
Resource Usage
| Metric | Target | Notes |
|---|---|---|
| Memory per agent | < 50MB | Without large caches |
| Cache memory | < 500MB | With aggressive caching |
| Concurrent agents | 100+ | Per server instance |
| RPS | 10-50 | Requests per second per instance |