Overview
HH-suite supports parallel computing through two mechanisms: OpenMP for shared-memory parallelization (single node, multiple cores) and MPI for distributed computing across multiple nodes. This guide covers both approaches.

OpenMP Parallelization
What is OpenMP?
OpenMP enables multi-core parallelization on a single machine using shared memory. It’s the simplest way to speed up HH-suite searches.

Key Features:
- Automatically enabled in pre-compiled binaries
- Works on single workstations or compute nodes
- Scales to ~64 cores efficiently
- No special runtime configuration needed
Check OpenMP Support
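One way to check, as a sketch assuming a Linux system with hhblits on the PATH: an OpenMP-enabled build normally links an OpenMP runtime such as libgomp, which shows up in the binary’s shared-library list.

```shell
# Look for an OpenMP runtime among the libraries hhblits links against.
ldd "$(command -v hhblits)" | grep -qi 'gomp' \
  && echo "OpenMP runtime linked" \
  || echo "no OpenMP runtime found"
```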
Using OpenMP
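A minimal usage sketch (query file and database path are placeholders): the -cpu flag sets the number of threads.

```shell
# Run a search with 8 OpenMP threads; -cpu controls the thread count.
hhblits -cpu 8 -i query.fasta -d /data/db/UniRef30_2020_06 -o result.hhr
```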
Source: src/CMakeLists.txt:90-96
OpenMP Best Practices
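As a sketch, the standard OpenMP environment variables (not HH-suite specific) can match threads to physical cores and pin them there:

```shell
# Match the thread count to physical cores and pin threads to avoid migration.
export OMP_NUM_THREADS=16     # number of physical cores
export OMP_PROC_BIND=true     # pin threads to cores
export OMP_PLACES=cores       # one thread per physical core
```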
Thread Scaling Performance
Typical speedup on a 16-core workstation:

| Threads | Speedup | Efficiency | Use Case |
|---|---|---|---|
| 1 | 1.0x | 100% | Baseline |
| 2 | 1.9x | 95% | Testing |
| 4 | 3.7x | 93% | Memory-limited |
| 8 | 7.1x | 89% | Recommended |
| 16 | 13.2x | 83% | Maximum throughput |
| 32 (HT) | 15.8x | 49% | Diminishing returns |
Efficiency drops with hyperthreading due to resource contention. Use physical cores for optimal performance.
Specialized OpenMP Tools
When compiling with OpenMP support, HH-suite provides specialized parallel executables:

hhblits_omp / hhsearch_omp / hhalign_omp
Standard OpenMP-parallelized versions of the main tools. These offer better thread efficiency for batch processing than the regular versions with the -cpu flag.

hhblits_ca3m

Specialized OpenMP version optimized for compressed CA3M databases in FFindex format. It provides better I/O performance for large compressed alignment databases.

When to use:
- Working with compressed CA3M database formats
- Processing large batches of queries from FFindex files
- Need to minimize disk I/O on large databases
MPI Parallelization
What is MPI?
MPI (Message Passing Interface) enables distributed computing across multiple compute nodes. It’s designed for HPC clusters and large-scale processing.

Key Features:
- Scales to hundreds of cores across many nodes
- Requires MPI library installation
- Only available when compiling from source
- Ideal for processing many queries or large databases
Compile with MPI Support
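A typical out-of-source build sketch, assuming an MPI library (e.g. Open MPI or MPICH) with development headers is already installed; -DCHECK_MPI=1 enables the MPI tools.

```shell
# Build HH-suite from source with the MPI tools enabled.
git clone https://github.com/soedinglab/hh-suite.git
cd hh-suite && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=.. -DCHECK_MPI=1 ..
make -j 4 && make install
```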
Source: src/CMakeLists.txt:269-296
MPI Tools Available
- hhblits_mpi - Parallel iterative search
- hhsearch_mpi - Parallel database search
- hhalign_mpi - Parallel pairwise alignment
- cstranslate_mpi - Parallel context-specific translation
Running MPI Jobs
Single Node, Multiple Processes
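A sketch, assuming hhblits_mpi reads queries from an FFindex database like the _omp tools (the queries prefix and database path are placeholders):

```shell
# 8 MPI ranks on the local machine; each rank works on a share of the queries.
mpirun -np 8 hhblits_mpi -i queries -d /data/db/UniRef30_2020_06 -oa3m results_a3m
```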
Multiple Nodes
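The same run spread over two hosts might look like this (hostnames and paths are placeholders; the database should live on a filesystem visible to both nodes):

```shell
# 16 ranks across two hosts listed with -H.
mpirun -np 16 -H node1,node2 hhblits_mpi -i queries -d /shared/db/UniRef30_2020_06 -oa3m results_a3m
```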
SLURM Integration
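A minimal sbatch script sketch (partition/account options omitted; resource numbers are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=hhblits-mpi
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=04:00:00

# srun starts one hhblits_mpi rank per task (4 x 4 = 16 ranks).
srun hhblits_mpi -i queries -d /shared/db/UniRef30_2020_06 -oa3m results_a3m
```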
PBS/Torque Integration
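The equivalent PBS/Torque script sketch (resource numbers are illustrative):

```shell
#!/bin/bash
#PBS -N hhblits-mpi
#PBS -l nodes=4:ppn=4
#PBS -l walltime=04:00:00

# PBS starts jobs in $HOME; change to the submission directory first.
cd "$PBS_O_WORKDIR"
mpirun -np 16 hhblits_mpi -i queries -d /shared/db/UniRef30_2020_06 -oa3m results_a3m
```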
Hybrid OpenMP + MPI
Combine MPI (inter-node) with OpenMP (intra-node) for maximum efficiency:

- 4 nodes × 2 MPI ranks = 8 total MPI processes
- Each MPI rank uses 8 OpenMP threads
- Total: 64 cores (8 × 8)
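The layout above could be launched like this (a sketch; --map-by ppr is Open MPI syntax and differs in other MPI implementations):

```shell
# 8 MPI ranks, 2 per node across 4 nodes, each rank running 8 OpenMP threads.
export OMP_NUM_THREADS=8
mpirun -np 8 --map-by ppr:2:node hhblits_mpi -i queries -d /shared/db/UniRef30_2020_06 -oa3m results_a3m
```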
Batch Processing Strategies
GNU Parallel
For workstations without MPI, GNU parallel distributes independent queries across local cores.

Job Arrays

For HPC systems, scheduler job arrays run one task per query batch.

Split-Apply-Combine
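A split-apply-combine sketch using GNU parallel (file names are placeholders; the awk split assumes one FASTA record per “>” header):

```shell
mkdir -p chunks out
# Split: one file per sequence in the multi-FASTA input.
awk '/^>/{n++} {print > ("chunks/query_" n ".fasta")}' all_queries.fasta
# Apply: one single-threaded hhblits per chunk, 8 at a time.
parallel -j 8 'hhblits -cpu 1 -i {} -d /data/db/UniRef30_2020_06 -o out/{/.}.hhr' ::: chunks/*.fasta
# Combine: concatenate the per-query results.
cat out/*.hhr > all_results.hhr
```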
Performance Optimization
Choosing the Right Parallelization
Use OpenMP When
- Single workstation/node
- ≤64 cores
- Shared memory available
- Simple setup needed
Use MPI When
- Multiple compute nodes
- >64 cores
- HPC cluster available
- Maximum scalability needed
Load Balancing
MPI automatically distributes work across processes:

- Query-level parallelism: Each query processed by one MPI rank
- Database-level parallelism: Database split across MPI ranks
- Dynamic load balancing: Idle processes pick up new work
Load balancing works best when:
- Number of queries >> number of processes
- Query sizes are similar
- Database is evenly distributed
Memory Considerations
Per-process memory:

| Parallelization | Database | Memory per Process | 16 Processes |
|---|---|---|---|
| OpenMP | Uniclust30 | Shared: ~10 GB | ~12 GB total |
| MPI | Uniclust30 | Independent: ~10 GB | ~160 GB total |
| OpenMP | BFD | Shared: ~35 GB | ~50 GB total |
| MPI | BFD | Independent: ~35 GB | ~560 GB total |
Troubleshooting
MPI Not Found
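If cmake cannot locate MPI, the development packages are usually missing; on Debian/Ubuntu, for instance (package names vary by distribution):

```shell
# Install the Open MPI runtime and development headers.
sudo apt-get install libopenmpi-dev openmpi-bin
```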
MPI Binaries Not Created
- Ensure -DCHECK_MPI=1 was set during cmake
- Check cmake output for “Found MPI”
- Verify MPI development files are installed
Network Issues
- Check SSH keys are configured
- Verify firewall allows MPI communication
- Test: mpirun -np 2 -H node1,node2 hostname
Slow Performance
- Check network bandwidth (use InfiniBand if available)
- Ensure database is on shared filesystem (not copied per node)
- Verify no swap usage: free -h
- Use hybrid OpenMP+MPI for better core utilization
Benchmarking
Test Scaling
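A scaling-test sketch (query and database paths are placeholders; GNU time’s -f flag formats the wall-clock report):

```shell
# Time the same query at increasing thread counts.
for t in 1 2 4 8 16; do
  /usr/bin/time -f "threads=$t wall=%es" \
    hhblits -cpu "$t" -i query.fasta -d /data/db/UniRef30_2020_06 -o "result_${t}.hhr"
done
```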
Measure Efficiency
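Parallel efficiency is speedup divided by core count, E = (t1/tN)/N, where t1 is the single-thread wall time and tN the wall time with N threads. A small sketch with illustrative timings:

```shell
# t1: single-thread wall time, tN: wall time with N threads (seconds).
t1=1000; tN=76; N=16
awk -v t1="$t1" -v tN="$tN" -v n="$N" \
  'BEGIN { s = t1/tN; printf "speedup=%.1fx efficiency=%.0f%%\n", s, 100*s/n }'
# prints: speedup=13.2x efficiency=82%
```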
Expected Scalability
| Cores | MPI Speedup | OpenMP Speedup |
|---|---|---|
| 1 | 1.0x | 1.0x |
| 2 | 1.95x | 1.9x |
| 4 | 3.8x | 3.7x |
| 8 | 7.3x | 7.1x |
| 16 | 13.8x | 13.2x |
| 32 | 25.5x | 15.8x (HT) |
| 64 | 47.2x | N/A |
| 128 | 85.1x | N/A |
Best Practices
Parallel Computing Guidelines
✓ Start with OpenMP for simplicity on single nodes
✓ Use MPI for large-scale processing on clusters
✓ Combine MPI + OpenMP for hybrid parallelism on large systems
✓ Monitor memory usage - MPI uses more RAM than OpenMP
✓ Use physical cores, not hyperthreads, for best performance
✓ Balance load by having more queries than processes
✓ Test scaling before large production runs
✓ Use a fast interconnect (InfiniBand) for MPI on clusters
✓ Keep the database on shared storage to avoid duplication
✓ Process in batches for very large query sets
Example Workflows
Small Workstation (8 cores)
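A sketch for a single query on 8 cores (paths are placeholders):

```shell
# One process, all 8 cores via OpenMP, 3 search iterations.
hhblits -cpu 8 -n 3 -i query.fasta -d /data/db/UniRef30_2020_06 -o result.hhr
```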
Large Workstation (64 cores)
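On a 64-core machine, the _omp tools batch-process a query set; a sketch assuming an FFindex input database (the queries prefix and paths are placeholders):

```shell
# All 64 cores in one OpenMP process; -oa3m writes FFindex output.
export OMP_NUM_THREADS=64
hhblits_omp -i queries -d /data/db/UniRef30_2020_06 -oa3m results_a3m
```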
HPC Cluster (128 cores, 8 nodes)
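A hybrid sketch for 8 nodes × 16 cores (Open MPI mapping syntax assumed; paths are placeholders):

```shell
# 8 MPI ranks, one per node, 16 OpenMP threads each = 128 cores.
export OMP_NUM_THREADS=16
mpirun -np 8 --map-by ppr:1:node hhblits_mpi -i queries -d /shared/db/UniRef30_2020_06 -oa3m results_a3m
```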
Many Short Queries
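For many short queries on a workstation, running several single-threaded searches at once often beats one multi-threaded run; a GNU parallel sketch (the queries directory is a placeholder):

```shell
# One single-threaded hhblits per query file, 8 at a time.
parallel -j 8 'hhblits -cpu 1 -i {} -d /data/db/UniRef30_2020_06 -o {.}.hhr' ::: queries/*.fasta
```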
See Also
- Performance Optimization - SIMD and compiler optimizations
- Building Custom Databases - Database preparation for parallel searches
- Available Databases - Database selection and sizing