
Overview

HH-suite supports parallel computing through two mechanisms: OpenMP for shared-memory parallelization (single node, multiple cores) and MPI for distributed computing (multiple nodes). This guide covers both approaches for maximum performance.

OpenMP Parallelization

What is OpenMP?

OpenMP enables multi-core parallelization on a single machine using shared memory. It's the simplest way to speed up HH-suite searches.

Key Features:
  • Automatically enabled in pre-compiled binaries
  • Works on single workstations or compute nodes
  • Scales to ~64 cores efficiently
  • No special runtime configuration needed

Check OpenMP Support

# Check if binary has OpenMP support
ldd $(which hhblits) | grep -i gomp
# If found, OpenMP is supported

# Check during compilation
cmake -DCMAKE_INSTALL_PREFIX=. ..
# Look for: "-- Found OpenMP"

Using OpenMP

Source: src/CMakeLists.txt:90-96

Use the -cpu flag to set the number of OpenMP threads:
hhblits -i query.fasta -d database -cpu 8

OpenMP Best Practices

1. Determine Core Count

Use physical cores, not hyperthreads:
# Linux: Physical cores
lscpu | grep "^Core(s) per socket"

# macOS: Physical cores
sysctl -n hw.physicalcpu
2. Set Thread Count

Use the -cpu flag or the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=8
3. Monitor Performance

Check CPU utilization during searches:
htop  # Or top
All cores should show ~100% usage.
4. Adjust for Memory

If you run out of memory, reduce thread count:
hhblits -i query.fasta -d bfd -cpu 4  # Instead of 16

Thread Scaling Performance

Typical speedup on a 16-core workstation:
Threads    Speedup   Efficiency   Use Case
1          1.0x      100%         Baseline
2          1.9x      95%          Testing
4          3.7x      93%          Memory-limited
8          7.1x      89%          Recommended
16         13.2x     83%          Maximum throughput
32 (HT)    15.8x     49%          Diminishing returns
Efficiency drops with hyperthreading due to resource contention. Use physical cores for optimal performance.
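The sub-linear scaling above is roughly what Amdahl's law predicts. A quick sketch; the parallel fraction p = 0.985 is fitted to the table here for illustration and is not a documented HH-suite figure:

```shell
# Amdahl's law: predicted speedup on n threads, given the fraction p of
# the runtime that parallelizes: S(n) = 1 / ((1 - p) + p / n)
amdahl() {  # usage: amdahl <p> <n>
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 1 / ((1 - p) + p / n) }'
}

amdahl 0.985 8    # close to the 7.1x measured at 8 threads
amdahl 0.985 16   # close to the 13.2x measured at 16 threads
```

The serial fraction (1 - p) bounds the achievable speedup no matter how many threads you add, which is why efficiency falls as the core count grows.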

Specialized OpenMP Tools

When compiling with OpenMP support, HH-suite provides specialized parallel executables:
hhblits_omp, hhsearch_omp, and hhalign_omp are standard OpenMP-parallelized versions of the main tools. They offer better thread efficiency for batch processing than the regular versions with the -cpu flag.
hhblits_omp -i query.fasta -d database
# Automatically uses OMP_NUM_THREADS cores
hhblits_ca3m is a specialized OpenMP version optimized for compressed CA3M databases in FFindex format. It provides better I/O performance for large compressed alignment databases.
# Use with CA3M-compressed databases
hhblits_ca3m -i queries.ffindex -d database_ca3m -oa3m results.ffindex
When to use:
  • Working with compressed CA3M database formats
  • Processing large batches of queries from FFindex files
  • Need to minimize disk I/O on large databases
For most use cases, the standard tools with -cpu flag are sufficient. Use specialized versions for advanced workflows or when working with compressed database formats.

MPI Parallelization

What is MPI?

MPI (Message Passing Interface) enables distributed computing across multiple compute nodes. It's designed for HPC clusters and large-scale processing.

Key Features:
  • Scales to hundreds of cores across many nodes
  • Requires MPI library installation
  • Only available when compiling from source
  • Ideal for processing many queries or large databases

Compile with MPI Support

Source: src/CMakeLists.txt:269-296
1. Install MPI Library

sudo apt-get install libopenmpi-dev openmpi-bin
2. Compile HH-suite

git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build

cmake -DCMAKE_INSTALL_PREFIX=. -DCHECK_MPI=1 ..
make -j $(nproc) && make install

export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
Check for MPI binaries:
ls bin/*_mpi
# Should show: hhblits_mpi, hhsearch_mpi, hhalign_mpi, cstranslate_mpi
3. Verify MPI Installation

mpirun --version
which mpirun
MPI versions are NOT included in pre-compiled binaries because MPI configuration is system-specific. You must compile from source to use MPI.

MPI Tools Available

  • hhblits_mpi - Parallel iterative search
  • hhsearch_mpi - Parallel database search
  • hhalign_mpi - Parallel pairwise alignment
  • cstranslate_mpi - Parallel context-specific translation

Running MPI Jobs

Single Node, Multiple Processes

# Use 8 MPI processes on one node
mpirun -np 8 hhblits_mpi -i queries.fasta -d database -o results.hhr

Multiple Nodes

# Create hostfile
cat > hosts.txt << EOF
node01 slots=16
node02 slots=16
node03 slots=16
node04 slots=16
EOF

# Run across nodes
mpirun -np 64 --hostfile hosts.txt \
  hhblits_mpi -i queries.fasta -d database -o results.hhr

SLURM Integration

#!/bin/bash
#SBATCH --job-name=hhblits_mpi
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --mem=64G

module load openmpi

mpirun hhblits_mpi \
  -i queries.fasta \
  -d /path/to/database \
  -o results.hhr \
  -oa3m results.a3m

PBS/Torque Integration

#!/bin/bash
#PBS -N hhblits_mpi
#PBS -l nodes=4:ppn=16
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

mpirun -np 64 hhblits_mpi \
  -i queries.fasta \
  -d database \
  -o results.hhr

Hybrid OpenMP + MPI

Combine MPI (inter-node) with OpenMP (intra-node) for maximum efficiency:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2      # 2 MPI ranks per node
#SBATCH --cpus-per-task=8         # 8 OpenMP threads per rank
#SBATCH --time=24:00:00

export OMP_NUM_THREADS=8

mpirun hhblits_mpi -i queries.fasta -d database -cpu 8
Explanation:
  • 4 nodes × 2 MPI ranks = 8 total MPI processes
  • Each MPI rank uses 8 OpenMP threads
  • Total: 64 cores (8 × 8)
Hybrid parallelism is most efficient for large clusters. Use 2-4 MPI ranks per node with OpenMP filling the remaining cores.
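Before submitting, it is worth sanity-checking that ranks × threads matches the cores the allocation actually provides. A small sketch using the example values from this page:

```shell
# Hybrid layout check: total cores used = nodes * ranks/node * threads/rank.
# Values below are the example configuration above, not queried from the
# scheduler.
nodes=4
ranks_per_node=2
threads_per_rank=8

total_ranks=$(( nodes * ranks_per_node ))
total_cores=$(( total_ranks * threads_per_rank ))

echo "MPI ranks:   $total_ranks"     # 4 nodes x 2 ranks = 8
echo "Total cores: $total_cores"     # 8 ranks x 8 threads = 64
```

If total_cores exceeds what the nodes provide, the job will oversubscribe and slow down rather than speed up.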

Batch Processing Strategies

GNU Parallel

For workstations without MPI:
# Process 8 queries in parallel
cat queries.list | parallel -j 8 \
  "hhblits -i {}.fasta -d database -o {}.hhr -cpu 2"

Job Arrays

For HPC systems:
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --cpus-per-task=8

query=$(sed -n "${SLURM_ARRAY_TASK_ID}p" queries.list)

hhblits -i "$query" -d database -cpu 8 -o "results/${SLURM_ARRAY_TASK_ID}.hhr"

Split-Apply-Combine

1. Split Queries

split -l 100 all_queries.fasta query_batch_
# Creates: query_batch_aa, query_batch_ab, ...
2. Process Batches in Parallel

for batch in query_batch_*; do
  mpirun -np 16 hhblits_mpi -i "$batch" -d database -o "${batch}.hhr" &
done
wait
3. Combine Results

cat query_batch_*.hhr > all_results.hhr
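Each result in a .hhr file begins with a "Query" header line, so (assuming the default .hhr output format) a quick count can confirm that every query produced a report. check_results is an illustrative helper, not an HH-suite tool:

```shell
# Compare the number of submitted FASTA records against the number of
# "Query" headers in the combined .hhr output.
check_results() {  # usage: check_results <queries.fasta> <combined.hhr>
  local nq nr
  nq=$(grep -c '^>' "$1")      # FASTA records submitted
  nr=$(grep -c '^Query' "$2")  # reports in the combined output
  if [ "$nq" -eq "$nr" ]; then
    echo "ok: $nr reports for $nq queries"
  else
    echo "warning: $nq queries but $nr reports"
  fi
}
```

A mismatch usually means a batch failed partway through and should be rerun before downstream analysis.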

Performance Optimization

Choosing the Right Parallelization

Use OpenMP When

  • Single workstation/node
  • ≤64 cores
  • Shared memory available
  • Simple setup needed

Use MPI When

  • Multiple compute nodes
  • >64 cores
  • HPC cluster available
  • Maximum scalability needed

Load Balancing

MPI automatically distributes work across processes:
  • Query-level parallelism: Each query processed by one MPI rank
  • Database-level parallelism: Database split across MPI ranks
  • Dynamic load balancing: Idle processes pick up new work
Load balancing works best when:
  • Number of queries >> number of processes
  • Query sizes are similar
  • Database is evenly distributed
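A quick heuristic check before choosing a rank count; the 10x ratio is a rule of thumb for giving the dynamic scheduler room to balance, not an HH-suite requirement:

```shell
# Aim for at least ~10 queries per MPI rank so dynamic load balancing
# can even out per-query runtime differences.
check_balance() {  # usage: check_balance <num_queries> <num_ranks>
  local per_rank=$(( $1 / $2 ))
  if [ "$per_rank" -ge 10 ]; then
    echo "ok: $per_rank queries/rank"
  else
    echo "warning: $per_rank queries/rank; consider fewer ranks"
  fi
}

check_balance 1000 64
```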

Memory Considerations

Per-process memory:
Parallelization   Database     Memory per Process    16 Processes
OpenMP            Uniclust30   Shared: ~10 GB        ~12 GB total
MPI               Uniclust30   Independent: ~10 GB   ~160 GB total
OpenMP            BFD          Shared: ~35 GB        ~50 GB total
MPI               BFD          Independent: ~35 GB   ~560 GB total
MPI uses more memory because each process loads its own copy of the database. OpenMP shares memory across threads.
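A back-of-the-envelope estimate using the per-process figures from the table (approximate values from this page; measure on your own system):

```shell
# MPI memory grows linearly with process count because each process
# loads its own copy of the database; OpenMP keeps one shared copy.
db_gb=10     # per-process database footprint (Uniclust30, ~10 GB)
nproc=16

mpi_total=$(( db_gb * nproc ))   # one copy per MPI process
echo "MPI total:    ~${mpi_total} GB"
echo "OpenMP total: ~${db_gb} GB (one shared copy, plus per-thread buffers)"
```

If the MPI estimate exceeds the RAM per node, switch to hybrid OpenMP+MPI so fewer database copies are loaded per node.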

Troubleshooting

MPI Not Found

CMake Error: Could not find MPI
Solution:
# Ensure MPI is in PATH
which mpirun
export PATH="/usr/lib64/openmpi/bin:$PATH"

# Or specify MPI explicitly
cmake -DCMAKE_INSTALL_PREFIX=. -DMPI_HOME=/usr/lib64/openmpi ..

MPI Binaries Not Created

ls bin/*_mpi
# No such file
Solution:
  • Ensure -DCHECK_MPI=1 was set during cmake
  • Check cmake output for “Found MPI”
  • Verify MPI development files are installed

Network Issues

Error: Unable to connect to remote nodes
Solution:
  • Check SSH keys are configured
  • Verify firewall allows MPI communication
  • Test: mpirun -np 2 -H node1,node2 hostname

Slow Performance

MPI job slower than expected
Solutions:
  • Check network bandwidth (use InfiniBand if available)
  • Ensure database is on shared filesystem (not copied per node)
  • Verify no swap usage: free -h
  • Use hybrid OpenMP+MPI for better core utilization

Benchmarking

Test Scaling

#!/bin/bash

for np in 1 2 4 8 16 32; do
  echo "Testing with $np processes..."
  time mpirun -np $np hhblits_mpi -i test.fasta -d database
done

Measure Efficiency

# Strong scaling: Fixed problem size, varying cores
# Ideal: Time(n cores) = Time(1 core) / n

for cores in 1 2 4 8 16; do
  /usr/bin/time -f "%e" -o time_${cores}.txt \
    mpirun -np $cores hhblits_mpi -i query.fasta -d database
done
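The timing files can then be turned into speedup and efficiency numbers. A small sketch (efficiency is a hypothetical helper, not part of HH-suite):

```shell
# speedup    = baseline_time / measured_time
# efficiency = speedup / cores
efficiency() {  # usage: efficiency <baseline_seconds> <measured_seconds> <cores>
  awk -v b="$1" -v t="$2" -v n="$3" \
    'BEGIN { printf "speedup %.2fx, efficiency %.0f%%\n", b / t, 100 * b / (t * n) }'
}

# e.g. with the files written above:
# efficiency "$(cat time_1.txt)" "$(cat time_8.txt)" 8
```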

Expected Scalability

Cores   MPI Speedup   OpenMP Speedup
1       1.0x          1.0x
2       1.95x         1.9x
4       3.8x          3.7x
8       7.3x          7.1x
16      13.8x         13.2x
32      25.5x         15.8x (HT)
64      47.2x         N/A
128     85.1x         N/A
MPI scales better than OpenMP beyond 16-32 cores due to reduced memory contention and better cache locality.

Best Practices

  • Start with OpenMP for simplicity on single nodes
  • Use MPI for large-scale processing on clusters
  • Combine MPI + OpenMP for hybrid parallelism on large systems
  • Monitor memory usage: MPI uses more RAM than OpenMP
  • Use physical cores, not hyperthreads, for best performance
  • Balance load by having more queries than processes
  • Test scaling before large production runs
  • Use a fast interconnect (InfiniBand) for MPI on clusters
  • Keep the database on shared storage to avoid duplication
  • Process in batches for very large query sets

Example Workflows

Small Workstation (8 cores)

# Use OpenMP
hhblits -i query.fasta -d uniclust30 -cpu 8

Large Workstation (64 cores)

# Use OpenMP with all cores
hhblits -i query.fasta -d uniclust30 -cpu 64

HPC Cluster (128 cores, 8 nodes)

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4

# 8 nodes × 4 MPI ranks × 4 OpenMP threads = 128 cores
mpirun hhblits_mpi -i queries.fasta -d database -cpu 4

Many Short Queries

# Use job arrays instead of MPI
#SBATCH --array=1-10000
#SBATCH --cpus-per-task=4

query=$(sed -n "${SLURM_ARRAY_TASK_ID}p" queries.list)
hhblits -i "$query" -d database -cpu 4
