Overview
HH-suite performance can be significantly improved through SIMD optimizations, proper compilation, and efficient hardware utilization. This guide covers optimization strategies to maximize search speed.
SIMD Optimization
Supported Instruction Sets
HH-suite supports multiple SIMD (Single Instruction, Multiple Data) instruction sets for parallel computation.
Source: src/CMakeLists.txt:14-38
SSE2
Baseline: Required minimum for x86-64 systems
- Available on all modern Intel/AMD CPUs
- ~2x speedup over scalar code
SSE4.1
Recommended: Better performance than SSE2
- Available since Intel Penryn-based Core 2 CPUs (late 2007)
- ~2.5x speedup over scalar code
AVX2
Best Performance: ~2x faster than SSE2
- Available since Intel Haswell (2013), AMD Excavator (2015)
- 256-bit vector operations
- Recommended for production use
ARM64/POWER
Alternative Architectures
- ARM: NEON instructions (Armv8-a+simd)
- POWER8/9: VSX instructions
Check Your CPU Support
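As a rough sketch of how such a check could be scripted, the snippet below reads the kernel's CPU feature flags on Linux and reports the best SIMD level from the list above. It assumes an x86 Linux system where `/proc/cpuinfo` contains a `flags` line; the function names `best_simd_level` and `read_cpu_flags` are hypothetical helpers and not part of HH-suite.

```python
from pathlib import Path

# Ordered from fastest to slowest; the keys use the Linux /proc/cpuinfo spelling.
SIMD_LEVELS = [("avx2", "AVX2"), ("sse4_1", "SSE4.1"), ("sse2", "SSE2")]


def best_simd_level(cpu_flags):
    """Return the best SIMD level from this guide found in the flag set."""
    for flag, label in SIMD_LEVELS:
        if flag in cpu_flags:
            return label
    return "none"


def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Collect the CPU feature flags reported by the Linux kernel (x86 only)."""
    path = Path(cpuinfo_path)
    if not path.exists():
        return set()  # not Linux, or /proc is unavailable
    for line in path.read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


if __name__ == "__main__":
    print("Best supported SIMD level:", best_simd_level(read_cpu_flags()))
```

On systems without a `flags` line (for example ARM or macOS) the sketch reports `none`, so a different detection method would be needed there.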
Pre-compiled Binaries
HH-suite provides pre-compiled binaries optimized for different SIMD levels.
Compilation Optimization
Compile for Native Architecture
For maximum performance, compile HH-suite for your specific CPU. A native build uses the -march=native flag, which optimizes for your CPU’s exact instruction set.
Manual SIMD Selection
HH-suite can also be compiled explicitly for a specific SIMD level.
Compiler Requirements
Source: CMakeLists.txt:43-58
Minimum compiler versions:
- GCC ≥ 4.8.0
- Clang ≥ 3.6
- Intel ICC (any recent version)
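The minimums above can be checked against a compiler's version banner with a small helper. This is only a sketch: `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical names, and the parser simply takes the first dotted number it finds in the text.

```python
import re

# Minimum versions listed in this guide (no minimum is stated for ICC).
MINIMUM_VERSIONS = {"gcc": (4, 8, 0), "clang": (3, 6, 0)}


def parse_version(text):
    """Extract the first dotted version number from text as a 3-tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(compiler, version_text):
    """True if version_text satisfies this guide's minimum for the compiler."""
    return parse_version(version_text) >= MINIMUM_VERSIONS[compiler]
```

For example, `meets_minimum("gcc", "gcc (GCC) 4.8.5")` evaluates to `True`. Note that Apple's clang uses its own version numbering, so its banner is not directly comparable to upstream Clang versions.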
macOS Compilation
The default macOS clang doesn’t support OpenMP. Install GCC via Homebrew and compile with it instead.
OpenMP Parallelization
Source: src/CMakeLists.txt:90-96
HH-suite uses OpenMP for multi-core parallelization.
Enable OpenMP
OpenMP is automatically enabled if detected during compilation.
Control Thread Count
Set the number of threads with the -cpu flag.
OpenMP Performance Tips
Thread Scaling Best Practices
Optimal thread count:
- Use the number of physical cores (not hyperthreads)
- For large databases: threads = physical_cores
- For small databases: threads = physical_cores / 2
Memory considerations:
- Each thread uses additional memory
- BFD database: Limit to 8-16 threads on 128 GB RAM
- Uniclust30: Can use all available cores
Hyperthreading:
- Limited benefit for HH-suite (usually <20% improvement)
- May reduce per-thread performance
- Recommended: Use physical cores only
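The thread-count rule of thumb above can be written down as a short helper. This is a sketch under the assumption that you already know the number of physical cores (Python's `os.cpu_count()` reports logical CPUs, so it would overcount on hyperthreaded machines); `recommended_threads` and `cpu_arguments` are hypothetical names.

```python
def recommended_threads(physical_cores, large_database=True):
    """All physical cores for large databases, half of them (at least one) otherwise."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    if large_database:
        return physical_cores
    return max(1, physical_cores // 2)


def cpu_arguments(physical_cores, large_database=True):
    """Build the -cpu argument pair for an HH-suite command line."""
    return ["-cpu", str(recommended_threads(physical_cores, large_database))]
```

The result still needs to be capped by available memory, since each thread adds to the footprint.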
OpenMP Variants
Source: src/CMakeLists.txt:251-267
HH-suite provides OpenMP-specific versions of its main tools:
- hhblits_omp - OpenMP parallel version of hhblits
- hhsearch_omp - OpenMP parallel version of hhsearch
- hhalign_omp - OpenMP parallel version of hhalign
These OpenMP variants are automatically compiled when OpenMP is detected. Regular versions (hhblits, hhsearch) also use OpenMP if available.
MPI Parallelization
MPI supports distributed computing across multiple nodes.
Compile with MPI
Source: src/CMakeLists.txt:269-296
An MPI-enabled build compiles these tools:
- hhblits_mpi
- hhsearch_mpi
- hhalign_mpi
- cstranslate_mpi
Run with MPI
MPI vs OpenMP
| Feature | OpenMP | MPI |
|---|---|---|
| Scope | Single node, shared memory | Multiple nodes, distributed |
| Setup | Automatic if available | Requires MPI installation |
| Use case | Multi-core workstations | HPC clusters |
| Scalability | Up to ~64 cores | Hundreds of cores |
| Pre-compiled | Yes | No (compile from source) |
Performance Benchmarks
SIMD Performance Comparison
Relative search speed for the same query on Uniclust30:
| SIMD Level | Relative Speed | Availability |
|---|---|---|
| Scalar (no SIMD) | 1.0x | All CPUs |
| SSE2 | 2.0x | All modern x86-64 |
| SSE4.1 | 2.5x | CPUs since ~2007 |
| AVX2 | 4.0x | Intel Haswell+ (2013), AMD Excavator+ (2015) |
| ARM NEON | 2.5x | ARM v8+ |
| POWER VSX | 3.0x | POWER8+ |
Scaling Performance
Typical speedup with OpenMP on a 16-core system:
| Threads | Speedup | Efficiency |
|---|---|---|
| 1 | 1.0x | 100% |
| 2 | 1.9x | 95% |
| 4 | 3.7x | 93% |
| 8 | 7.1x | 89% |
| 16 | 13.2x | 83% |
| 32 (HT) | 15.8x | 49% |
Efficiency decreases with hyperthreading due to resource contention.
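The efficiency column is the speedup divided by the number of threads. The sketch below recomputes it from the (threads, speedup) pairs in the table; `parallel_efficiency` is a hypothetical helper name, and the figures are the ones quoted in this guide rather than new measurements.

```python
def parallel_efficiency(speedup, threads):
    """Efficiency is the measured speedup divided by the thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup / threads


# (threads, speedup) pairs copied from the table above.
MEASUREMENTS = [(1, 1.0), (2, 1.9), (4, 3.7), (8, 7.1), (16, 13.2), (32, 15.8)]

for threads, speedup in MEASUREMENTS:
    efficiency = parallel_efficiency(speedup, threads)
    print(f"{threads:>2} threads: {speedup:>4.1f}x speedup, {efficiency:.1%} efficiency")
```

Running the same calculation on your own timings shows where adding threads stops paying off on your hardware.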
Memory Optimization
Memory Requirements
Per-thread memory usage:
| Database | Base Memory | Per Thread | Total (8 Threads) |
|---|---|---|---|
| PDB70 | 1 GB | +0.1 GB | ~2 GB |
| Uniclust30 | 8 GB | +0.5 GB | ~12 GB |
| BFD | 32 GB | +2 GB | ~48 GB |
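The totals in the table follow from a simple linear model: base memory plus per-thread memory times the thread count. The sketch below applies that model and inverts it to find how many threads fit in a given amount of RAM. Both function names are hypothetical, and the model is only the approximation implied by the table, not a guarantee about actual usage.

```python
def estimated_memory_gb(base_gb, per_thread_gb, threads):
    """Total memory = base footprint + per-thread overhead * thread count."""
    return base_gb + per_thread_gb * threads


def max_threads_for_ram(ram_gb, base_gb, per_thread_gb):
    """Largest thread count whose estimated footprint fits in ram_gb (0 if none)."""
    if ram_gb < base_gb:
        return 0
    return int((ram_gb - base_gb) // per_thread_gb)
```

For example, `estimated_memory_gb(32, 2, 8)` reproduces the 48 GB figure for BFD with 8 threads. In practice you would leave some headroom for the operating system and file cache.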
Reduce Memory Usage
Memory-Constrained Systems
For systems with limited RAM:
- Use smaller databases: Uniclust30 instead of BFD
- Reduce parallelization: -cpu 2 or -cpu 4
- Process in batches: Split query files into smaller chunks
- Increase swap space: Not recommended for production (very slow)
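Splitting queries into smaller chunks could look like the sketch below. It assumes the queries are stored together as multi-FASTA text (one `>` header line per record); `split_fasta` is a hypothetical helper and not an HH-suite tool.

```python
def split_fasta(text, batch_size):
    """Split multi-FASTA text into batches of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    records, current = [], []
    for line in text.splitlines():
        # A header line closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
```

Each batch is a list of complete records, so it can be written to its own file and searched separately, keeping peak memory bounded by the batch rather than the full query set.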
I/O Optimization
Database Storage
FFindex Optimization
Optimize FFindex databases for sequential access.
Search Parameter Optimization
Speed vs Sensitivity Trade-offs
Parameter Impact on Speed
| Parameter | Impact | Speed |
|---|---|---|
| -n 1 | Single iteration | Fastest |
| -n 3 | 3 iterations | Moderate |
| -n 8 | 8 iterations | Slow |
| -e 0.001 | Strict E-value | Faster |
| -e 1.0 | Relaxed E-value | Slower |
| -maxfilt 1000 | Small prefilter | Faster |
| -maxfilt 100000 | Large prefilter | Slower |
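To make the trade-off concrete, the sketch below assembles an hhblits argument list from two illustrative presets built from the table rows. The preset names and the `hhblits_command` helper are hypothetical, the file and database paths are placeholders, and the command is only constructed, not executed. The -i, -d, and -o options are assumed to be the usual hhblits input, database, and output options.

```python
# Illustrative presets built from the rows of the table above.
PRESETS = {
    "fast": {"-n": "1", "-e": "0.001", "-maxfilt": "1000"},
    "sensitive": {"-n": "3", "-e": "1.0", "-maxfilt": "100000"},
}


def hhblits_command(query, database, output, preset="fast", threads=1):
    """Assemble an hhblits argument list without running it."""
    command = ["hhblits", "-i", query, "-d", database, "-o", output,
               "-cpu", str(threads)]
    for flag, value in PRESETS[preset].items():
        command += [flag, value]
    return command


if __name__ == "__main__":
    # Placeholder paths; substitute your own query and database prefix.
    print(" ".join(hhblits_command("query.fasta", "db/uniclust30", "query.hhr",
                                   preset="sensitive", threads=8)))
```

Keeping the presets in one place makes it easy to benchmark both settings on the same query before choosing one for a large run.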
Batch Processing
For processing many queries efficiently, run them in batches so the CPU stays fully utilized.
Profiling and Monitoring
Check SIMD Usage
Monitor Resource Usage
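One way to record wall time and peak memory for a single run is to launch it from a small wrapper. The sketch below uses Python's standard `resource` module, which is available on Unix-like systems only; `run_and_measure` is a hypothetical helper, and the example workload is a stand-in for a real HH-suite command.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run command to completion; return (exit code, wall seconds, peak RSS of children)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest peak among all waited-for child processes so far.
    # It is reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, elapsed, peak_rss


if __name__ == "__main__":
    # Stand-in workload; replace with a real HH-suite command line.
    code, seconds, peak = run_and_measure([sys.executable, "-c", "pass"])
    print(f"exit={code} wall={seconds:.2f}s peak_rss={peak}")
```

Because the peak covers every child waited for by the wrapper, measure one command per wrapper process if you want per-run figures.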
Best Practices Summary
Performance Optimization Checklist
✓ Use AVX2-compiled binaries if your CPU supports it (2x faster than SSE2)
✓ Compile from source with -march=native for maximum performance
✓ Use all physical cores with -cpu flag
✓ Store databases on SSD for better I/O performance
✓ Optimize thread count based on available memory
✓ Use MPI for cluster computing with many nodes
✓ Adjust search parameters based on sensitivity needs
✓ Process queries in batches for better CPU utilization
✓ Monitor memory usage especially with large databases
✓ Optimize FFindex databases for sequential access
See Also
- Parallel Computing Guide - Detailed MPI and OpenMP usage
- Building Custom Databases - Database optimization
- Available Databases - Database selection for performance