
Overview

HH-suite supports parallel computing through two mechanisms: OpenMP for shared-memory parallelization (single node, multiple cores) and MPI for distributed computing (multiple nodes). This guide covers both approaches for maximum performance.

OpenMP Parallelization

What is OpenMP?

OpenMP enables multi-core parallelization on a single machine using shared memory. It's the simplest way to speed up HH-suite searches.

Key Features:
  • Automatically enabled in pre-compiled binaries
  • Works on single workstations or compute nodes
  • Scales to ~64 cores efficiently
  • No special runtime configuration needed

Check OpenMP Support

# Check if binary has OpenMP support
ldd $(which hhblits) | grep -i gomp
# If found, OpenMP is supported

# Check during compilation
cmake -DCMAKE_INSTALL_PREFIX=. ..
# Look for: "-- Found OpenMP"

Using OpenMP

Source: src/CMakeLists.txt:90-96

Use the -cpu flag to set the number of OpenMP threads:
hhblits -i query.fasta -d database -cpu 8

OpenMP Best Practices

1. Determine Core Count

Use physical cores, not hyperthreads:
# Linux: Physical cores
lscpu | grep "^Core(s) per socket"

# macOS: Physical cores
sysctl -n hw.physicalcpu
2. Set Thread Count

Use the -cpu flag or the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=8
3. Monitor Performance

Check CPU utilization during searches:
htop  # Or top
All cores should show ~100% usage.
4. Adjust for Memory

If you run out of memory, reduce thread count:
hhblits -i query.fasta -d bfd -cpu 4  # Instead of 16

Thread Scaling Performance

Typical speedup on a 16-core workstation:
Threads    Speedup   Efficiency   Use Case
1          1.0x      100%         Baseline
2          1.9x      95%          Testing
4          3.7x      93%          Memory-limited
8          7.1x      89%          Recommended
16         13.2x     83%          Maximum throughput
32 (HT)    15.8x     49%          Diminishing returns
Efficiency drops with hyperthreading due to resource contention. Use physical cores for optimal performance.
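The sub-linear scaling above is roughly what Amdahl's law predicts. A quick sketch; the parallel fraction p = 0.985 is fitted to the table here for illustration and is not a documented HH-suite figure:

```shell
# Amdahl's law: predicted speedup on n threads, given the fraction p of
# the runtime that parallelizes: S(n) = 1 / ((1 - p) + p / n)
amdahl() {  # usage: amdahl <p> <n>
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 1 / ((1 - p) + p / n) }'
}

amdahl 0.985 8    # close to the 7.1x measured at 8 threads
amdahl 0.985 16   # close to the 13.2x measured at 16 threads
```

The serial fraction (1 - p) bounds the achievable speedup no matter how many threads you add, which is why efficiency falls as the core count grows.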

Specialized OpenMP Tools

When compiling with OpenMP support, HH-suite provides specialized parallel executables:
hhblits_omp, hhsearch_omp, and hhalign_omp are standard OpenMP-parallelized versions of the main tools. They offer better thread efficiency for batch processing than the regular versions with the -cpu flag.
hhblits_omp -i query.fasta -d database
# Automatically uses OMP_NUM_THREADS cores
hhblits_ca3m is a specialized OpenMP version optimized for compressed CA3M databases in FFindex format. It provides better I/O performance for large compressed alignment databases.
# Use with CA3M-compressed databases
hhblits_ca3m -i queries.ffindex -d database_ca3m -oa3m results.ffindex
When to use:
  • Working with compressed CA3M database formats
  • Processing large batches of queries from FFindex files
  • Need to minimize disk I/O on large databases
For most use cases, the standard tools with -cpu flag are sufficient. Use specialized versions for advanced workflows or when working with compressed database formats.

MPI Parallelization

What is MPI?

MPI (Message Passing Interface) enables distributed computing across multiple compute nodes. It's designed for HPC clusters and large-scale processing.

Key Features:
  • Scales to hundreds of cores across many nodes
  • Requires MPI library installation
  • Only available when compiling from source
  • Ideal for processing many queries or large databases

Compile with MPI Support

Source: src/CMakeLists.txt:269-296
1. Install MPI Library

sudo apt-get install libopenmpi-dev openmpi-bin
2. Compile HH-suite

git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build

cmake -DCMAKE_INSTALL_PREFIX=. -DCHECK_MPI=1 ..
make -j $(nproc) && make install

export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
Check for MPI binaries:
ls bin/*_mpi
# Should show: hhblits_mpi, hhsearch_mpi, hhalign_mpi, cstranslate_mpi
3. Verify MPI Installation

mpirun --version
which mpirun
MPI versions are NOT included in pre-compiled binaries because MPI configuration is system-specific. You must compile from source to use MPI.

MPI Tools Available

  • hhblits_mpi - Parallel iterative search
  • hhsearch_mpi - Parallel database search
  • hhalign_mpi - Parallel pairwise alignment
  • cstranslate_mpi - Parallel context-specific translation

Running MPI Jobs

Single Node, Multiple Processes

# Use 8 MPI processes on one node
mpirun -np 8 hhblits_mpi -i queries.fasta -d database -o results.hhr

Multiple Nodes

# Create hostfile
cat > hosts.txt << EOF
node01 slots=16
node02 slots=16
node03 slots=16
node04 slots=16
EOF

# Run across nodes
mpirun -np 64 --hostfile hosts.txt \
  hhblits_mpi -i queries.fasta -d database -o results.hhr

SLURM Integration

#!/bin/bash
#SBATCH --job-name=hhblits_mpi
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --mem=64G

module load openmpi

mpirun hhblits_mpi \
  -i queries.fasta \
  -d /path/to/database \
  -o results.hhr \
  -oa3m results.a3m

PBS/Torque Integration

#!/bin/bash
#PBS -N hhblits_mpi
#PBS -l nodes=4:ppn=16
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

mpirun -np 64 hhblits_mpi \
  -i queries.fasta \
  -d database \
  -o results.hhr

Hybrid OpenMP + MPI

Combine MPI (inter-node) with OpenMP (intra-node) for maximum efficiency:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2      # 2 MPI ranks per node
#SBATCH --cpus-per-task=8         # 8 OpenMP threads per rank
#SBATCH --time=24:00:00

export OMP_NUM_THREADS=8

mpirun hhblits_mpi -i queries.fasta -d database -cpu 8
Explanation:
  • 4 nodes × 2 MPI ranks = 8 total MPI processes
  • Each MPI rank uses 8 OpenMP threads
  • Total: 64 cores (8 × 8)
Hybrid parallelism is most efficient for large clusters. Use 2-4 MPI ranks per node with OpenMP filling the remaining cores.
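Before submitting, it is worth sanity-checking that ranks × threads matches the cores the allocation actually provides. A small sketch using the example values from this page:

```shell
# Hybrid layout check: total cores used = nodes * ranks/node * threads/rank.
# Values below are the example configuration above, not queried from the
# scheduler.
nodes=4
ranks_per_node=2
threads_per_rank=8

total_ranks=$(( nodes * ranks_per_node ))
total_cores=$(( total_ranks * threads_per_rank ))

echo "MPI ranks:   $total_ranks"     # 4 nodes x 2 ranks = 8
echo "Total cores: $total_cores"     # 8 ranks x 8 threads = 64
```

If total_cores exceeds what the nodes provide, the job will oversubscribe and slow down rather than speed up.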

Batch Processing Strategies

GNU Parallel

For workstations without MPI:
# Process 8 queries in parallel
cat queries.list | parallel -j 8 \
  "hhblits -i {}.fasta -d database -o {}.hhr -cpu 2"

Job Arrays

For HPC systems:
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --cpus-per-task=8

query=$(sed -n "${SLURM_ARRAY_TASK_ID}p" queries.list)

hhblits -i "$query" -d database -cpu 8 -o "results/${SLURM_ARRAY_TASK_ID}.hhr"

Split-Apply-Combine

1. Split Queries

split -l 100 all_queries.fasta query_batch_
# Creates: query_batch_aa, query_batch_ab, ...
2. Process Batches in Parallel

for batch in query_batch_*; do
  mpirun -np 16 hhblits_mpi -i "$batch" -d database -o "${batch}.hhr" &
done
wait
3. Combine Results

cat query_batch_*.hhr > all_results.hhr
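Each result in a .hhr file begins with a "Query" header line, so (assuming the default .hhr output format) a quick count can confirm that every query produced a report. check_results is an illustrative helper, not an HH-suite tool:

```shell
# Compare the number of submitted FASTA records against the number of
# "Query" headers in the combined .hhr output.
check_results() {  # usage: check_results <queries.fasta> <combined.hhr>
  local nq nr
  nq=$(grep -c '^>' "$1")      # FASTA records submitted
  nr=$(grep -c '^Query' "$2")  # reports in the combined output
  if [ "$nq" -eq "$nr" ]; then
    echo "ok: $nr reports for $nq queries"
  else
    echo "warning: $nq queries but $nr reports"
  fi
}
```

A mismatch usually means a batch failed partway through and should be rerun before downstream analysis.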

Performance Optimization

Choosing the Right Parallelization

Use OpenMP When

  • Single workstation/node
  • ≤64 cores
  • Shared memory available
  • Simple setup needed

Use MPI When

  • Multiple compute nodes
  • >64 cores
  • HPC cluster available
  • Maximum scalability needed

Load Balancing

MPI automatically distributes work across processes:
  • Query-level parallelism: Each query processed by one MPI rank
  • Database-level parallelism: Database split across MPI ranks
  • Dynamic load balancing: Idle processes pick up new work
Load balancing works best when:
  • Number of queries >> number of processes
  • Query sizes are similar
  • Database is evenly distributed
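A quick heuristic check before choosing a rank count; the 10x ratio is a rule of thumb for giving the dynamic scheduler room to balance, not an HH-suite requirement:

```shell
# Aim for at least ~10 queries per MPI rank so dynamic load balancing
# can even out per-query runtime differences.
check_balance() {  # usage: check_balance <num_queries> <num_ranks>
  local per_rank=$(( $1 / $2 ))
  if [ "$per_rank" -ge 10 ]; then
    echo "ok: $per_rank queries/rank"
  else
    echo "warning: $per_rank queries/rank; consider fewer ranks"
  fi
}

check_balance 1000 64
```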

Memory Considerations

Per-process memory:
Parallelization   Database     Memory per Process    16 Processes
OpenMP            Uniclust30   Shared: ~10 GB        ~12 GB total
MPI               Uniclust30   Independent: ~10 GB   ~160 GB total
OpenMP            BFD          Shared: ~35 GB        ~50 GB total
MPI               BFD          Independent: ~35 GB   ~560 GB total
MPI uses more memory because each process loads its own copy of the database. OpenMP shares memory across threads.
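A back-of-the-envelope estimate using the per-process figures from the table (approximate values from this page; measure on your own system):

```shell
# MPI memory grows linearly with process count because each process
# loads its own copy of the database; OpenMP keeps one shared copy.
db_gb=10     # per-process database footprint (Uniclust30, ~10 GB)
nproc=16

mpi_total=$(( db_gb * nproc ))   # one copy per MPI process
echo "MPI total:    ~${mpi_total} GB"
echo "OpenMP total: ~${db_gb} GB (one shared copy, plus per-thread buffers)"
```

If the MPI estimate exceeds the RAM per node, switch to hybrid OpenMP+MPI so fewer database copies are loaded per node.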

Troubleshooting

MPI Not Found

CMake Error: Could not find MPI
Solution:
# Ensure MPI is in PATH
which mpirun
export PATH="/usr/lib64/openmpi/bin:$PATH"

# Or specify MPI explicitly
cmake -DCMAKE_INSTALL_PREFIX=. -DMPI_HOME=/usr/lib64/openmpi ..

MPI Binaries Not Created

ls bin/*_mpi
# No such file
Solution:
  • Ensure -DCHECK_MPI=1 was set during cmake
  • Check cmake output for “Found MPI”
  • Verify MPI development files are installed

Network Issues

Error: Unable to connect to remote nodes
Solution:
  • Check SSH keys are configured
  • Verify firewall allows MPI communication
  • Test: mpirun -np 2 -H node1,node2 hostname

Slow Performance

MPI job slower than expected
Solutions:
  • Check network bandwidth (use InfiniBand if available)
  • Ensure database is on shared filesystem (not copied per node)
  • Verify no swap usage: free -h
  • Use hybrid OpenMP+MPI for better core utilization

Benchmarking

Test Scaling

#!/bin/bash

for np in 1 2 4 8 16 32; do
  echo "Testing with $np processes..."
  time mpirun -np $np hhblits_mpi -i test.fasta -d database
done

Measure Efficiency

# Strong scaling: Fixed problem size, varying cores
# Ideal: Time(n cores) = Time(1 core) / n

for cores in 1 2 4 8 16; do
  /usr/bin/time -f "%e" -o time_${cores}.txt \
    mpirun -np $cores hhblits_mpi -i query.fasta -d database
done
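The timing files can then be turned into speedup and efficiency numbers. A small sketch (efficiency is a hypothetical helper, not part of HH-suite):

```shell
# speedup    = baseline_time / measured_time
# efficiency = speedup / cores
efficiency() {  # usage: efficiency <baseline_seconds> <measured_seconds> <cores>
  awk -v b="$1" -v t="$2" -v n="$3" \
    'BEGIN { printf "speedup %.2fx, efficiency %.0f%%\n", b / t, 100 * b / (t * n) }'
}

# e.g. with the files written above:
# efficiency "$(cat time_1.txt)" "$(cat time_8.txt)" 8
```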

Expected Scalability

Cores   MPI Speedup   OpenMP Speedup
1       1.0x          1.0x
2       1.95x         1.9x
4       3.8x          3.7x
8       7.3x          7.1x
16      13.8x         13.2x
32      25.5x         15.8x (HT)
64      47.2x         N/A
128     85.1x         N/A
MPI scales better than OpenMP beyond 16-32 cores due to reduced memory contention and better cache locality.

Best Practices

  • Start with OpenMP for simplicity on single nodes
  • Use MPI for large-scale processing on clusters
  • Combine MPI + OpenMP for hybrid parallelism on large systems
  • Monitor memory usage: MPI uses more RAM than OpenMP
  • Use physical cores, not hyperthreads, for best performance
  • Balance load by having more queries than processes
  • Test scaling before large production runs
  • Use a fast interconnect (InfiniBand) for MPI on clusters
  • Keep the database on shared storage to avoid duplication
  • Process in batches for very large query sets

Example Workflows

Small Workstation (8 cores)

# Use OpenMP
hhblits -i query.fasta -d uniclust30 -cpu 8

Large Workstation (64 cores)

# Use OpenMP with all cores
hhblits -i query.fasta -d uniclust30 -cpu 64

HPC Cluster (128 cores, 8 nodes)

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4

# 8 nodes × 4 MPI ranks × 4 OpenMP threads = 128 cores
mpirun hhblits_mpi -i queries.fasta -d database -cpu 4

Many Short Queries

# Use job arrays instead of MPI
#SBATCH --array=1-10000
#SBATCH --cpus-per-task=4

query=$(sed -n "${SLURM_ARRAY_TASK_ID}p" queries.list)
hhblits -i "$query" -d database -cpu 4
