Overview

HH-suite performance can be significantly improved through SIMD optimizations, proper compilation, and efficient hardware utilization. This guide covers optimization strategies to maximize search speed.

SIMD Optimization

Supported Instruction Sets

HH-suite supports multiple SIMD (Single Instruction, Multiple Data) instruction sets for parallel computation. Source: src/CMakeLists.txt:14-38

SSE2

Baseline: Required minimum for x86-64 systems
  • Available on all modern Intel/AMD CPUs
  • ~2x speedup over scalar code

SSE4.1

Recommended: Better performance than SSE2
  • Available since Intel Penryn (45 nm Core 2, 2007-2008)
  • ~2.5x speedup over scalar code

AVX2

Best Performance: ~2x faster than SSE2
  • Available since Intel Haswell (2013), AMD Excavator (2015)
  • 256-bit vector operations
  • Recommended for production use

ARM64/POWER

Alternative Architectures
  • ARM: NEON instructions (Armv8-a+simd)
  • POWER8/9: VSX instructions

Check Your CPU Support

cat /proc/cpuinfo | grep flags | head -1
# Look for: sse2, sse4_1, avx2
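The check above can be scripted. The following is a minimal sketch, assuming a Linux system with /proc/cpuinfo; `best_simd_level` is a hypothetical helper name, not something shipped with HH-suite. It prints the highest of the three x86 SIMD levels discussed here that appears in a flags string.

```shell
#!/usr/bin/env bash
# best_simd_level: hypothetical helper, not part of HH-suite.
# Takes a CPU flags string and prints the highest SIMD level named in this guide.
best_simd_level() {
  local flags=" $1 "   # pad with spaces so whole-word matches work at the edges
  case "$flags" in
    *" avx2 "*)   echo "AVX2" ;;
    *" sse4_1 "*) echo "SSE4.1" ;;
    *" sse2 "*)   echo "SSE2" ;;
    *)            echo "none" ;;
  esac
}

# Usage on Linux: feed it the first "flags" line from /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
  best_simd_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

On non-x86 systems the flags line is absent, so the sketch prints "none"; it does not attempt to detect NEON or VSX.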

Pre-compiled Binaries

HH-suite provides pre-compiled binaries optimized for different SIMD levels:
wget https://github.com/soedinglab/hh-suite/releases/download/v3.3.0/hhsuite-3.3.0-SSE2-Linux.tar.gz
tar xzvf hhsuite-3.3.0-SSE2-Linux.tar.gz
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
AVX2 binaries will not run on older CPUs. If you get “Illegal instruction” errors, use the SSE2 version.
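To switch between SIMD builds without editing the URL by hand, a small helper can assemble it. This is a sketch only: `hhsuite_release_url` is a hypothetical name, and it assumes that other builds (for example AVX2) follow the same file-name pattern as the SSE2 archive shown above. Check the releases page before relying on it.

```shell
# hhsuite_release_url: hypothetical helper that builds a release URL for a SIMD level.
# Assumption: other SIMD builds follow the SSE2 file-name pattern shown above.
hhsuite_release_url() {
  local simd="$1" version="${2:-3.3.0}"
  echo "https://github.com/soedinglab/hh-suite/releases/download/v${version}/hhsuite-${version}-${simd}-Linux.tar.gz"
}

# Print the URL instead of downloading, so it can be verified first.
hhsuite_release_url AVX2
```

Once the URL is confirmed, it can be passed to wget, for example `wget "$(hhsuite_release_url SSE2)"`.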

Compilation Optimization

Compile for Native Architecture

For maximum performance, compile HH-suite for your specific CPU:
git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build

# Let compiler auto-detect best SIMD instructions
cmake -DCMAKE_INSTALL_PREFIX=. ..
make -j $(nproc) && make install

export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
This enables the -march=native flag, which optimizes for your CPU’s exact instruction set.

Manual SIMD Selection

To explicitly compile for a specific SIMD level:
cmake -DCMAKE_INSTALL_PREFIX=. -DHAVE_AVX2=1 ..
make -j $(nproc) && make install

Compiler Requirements

Source: CMakeLists.txt:43-58
Minimum compiler versions:
  • GCC ≥ 4.8.0
  • Clang ≥ 3.6
  • Intel ICC (any recent version)
Newer compilers (GCC 9+, Clang 10+) generate faster code. Consider using the latest compiler version available.

macOS Compilation

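Apple's default clang does not ship with OpenMP support. Install GCC via Homebrew:
# Install GCC with OpenMP support
brew install gcc

# Compile with GCC. Homebrew installs versioned binaries (gcc-13 here);
# adjust the suffix to match the version you installed.
CC="$(brew --prefix)/bin/gcc-13" \
CXX="$(brew --prefix)/bin/g++-13" \
cmake -DCMAKE_INSTALL_PREFIX=. ..
# macOS has no nproc by default; sysctl reports the logical CPU count
make -j "$(sysctl -n hw.ncpu)" && make install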

OpenMP Parallelization

Source: src/CMakeLists.txt:90-96
HH-suite uses OpenMP for multi-core parallelization.

Enable OpenMP

OpenMP is automatically enabled if detected during compilation:
cmake -DCMAKE_INSTALL_PREFIX=. ..
# Look for: "-- Found OpenMP"
make -j $(nproc) && make install

Control Thread Count

hhblits -i query.fasta -d database -cpu 8

OpenMP Performance Tips

Optimal thread count:
  • Use number of physical cores (not hyperthreads)
  • For large databases: threads = physical_cores
  • For small databases: threads = physical_cores / 2
Memory considerations:
  • Each thread uses additional memory
  • BFD database: Limit to 8-16 threads on 128GB RAM
  • Uniclust30: Can use all available cores
Hyperthreading:
  • Limited benefit for HH-suite (usually <20% improvement)
  • May reduce per-thread performance
  • Recommended: Use physical cores only
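The thread-count rules of thumb above can be expressed as a small function. This is a minimal sketch; `recommend_threads` and its "large"/"small" labels are hypothetical, and the result is only a starting point to tune against available memory.

```shell
# recommend_threads: hypothetical helper applying the rules of thumb above.
#   $1 = number of physical cores, $2 = "large" or "small" database
recommend_threads() {
  local cores="$1" db_size="$2" threads
  if [ "$db_size" = "small" ]; then
    threads=$(( cores / 2 ))   # small databases: half the physical cores
  else
    threads=$cores             # large databases: all physical cores
  fi
  # Never go below one thread.
  if [ "$threads" -lt 1 ]; then
    threads=1
  fi
  echo "$threads"
}

# Example: a 16-core machine searching a large database.
recommend_threads 16 large
```

The output can be passed straight to the -cpu option, for example `hhblits -i query.fasta -d database -cpu "$(recommend_threads 16 large)"`.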

OpenMP Variants

Source: src/CMakeLists.txt:251-267
HH-suite provides OpenMP-specific versions of main tools:
  • hhblits_omp - OpenMP parallel version of hhblits
  • hhsearch_omp - OpenMP parallel version of hhsearch
  • hhalign_omp - OpenMP parallel version of hhalign
These OpenMP variants are automatically compiled when OpenMP is detected. Regular versions (hhblits, hhsearch) also use OpenMP if available.

MPI Parallelization

For distributed computing across multiple nodes.

Compile with MPI

Source: src/CMakeLists.txt:269-296
# Install MPI (if not already installed)
sudo apt-get install libopenmpi-dev  # Debian/Ubuntu
# or
sudo yum install openmpi-devel       # RedHat/CentOS

# Compile with MPI support
cmake -DCMAKE_INSTALL_PREFIX=. -DCHECK_MPI=1 ..
make -j $(nproc) && make install
This creates MPI versions:
  • hhblits_mpi
  • hhsearch_mpi
  • hhalign_mpi
  • cstranslate_mpi

Run with MPI

# Single node, 8 processes
mpirun -np 8 hhblits_mpi -i queries.fasta -d database -o results.hhr

# Multiple nodes
mpirun -np 32 --hostfile hosts.txt hhblits_mpi -i queries.fasta -d database
MPI is only available when compiling from source. Pre-compiled binaries do not include MPI support due to system-specific MPI configurations.

MPI vs OpenMP

| Feature | OpenMP | MPI |
| --- | --- | --- |
| Scope | Single node, shared memory | Multiple nodes, distributed |
| Setup | Automatic if available | Requires MPI installation |
| Use case | Multi-core workstations | HPC clusters |
| Scalability | Up to ~64 cores | Hundreds of cores |
| Pre-compiled | Yes | No (compile from source) |

Performance Benchmarks

SIMD Performance Comparison

Relative search speed for the same query on Uniclust30:
| SIMD Level | Relative Speed | Availability |
| --- | --- | --- |
| Scalar (no SIMD) | 1.0x | All CPUs |
| SSE2 | 2.0x | All modern x86-64 |
| SSE4.1 | 2.5x | CPUs since ~2008 |
| AVX2 | 4.0x | Intel Haswell+ (2013), AMD Excavator+ (2015) |
| ARM NEON | 2.5x | Armv8+ |
| POWER VSX | 3.0x | POWER8+ |
AVX2 provides approximately 2x speedup over SSE2, making it the recommended choice for modern systems.

Scaling Performance

Typical speedup with OpenMP on a 16-core system:
| Threads | Speedup | Efficiency |
| --- | --- | --- |
| 1 | 1.0x | 100% |
| 2 | 1.9x | 95% |
| 4 | 3.7x | 93% |
| 8 | 7.1x | 89% |
| 16 | 13.2x | 83% |
| 32 (HT) | 15.8x | 49% |
Efficiency decreases with hyperthreading due to resource contention.
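The efficiency figures are the speedup divided by the thread count. A minimal sketch of that arithmetic, using a hypothetical `parallel_efficiency` helper, is handy for checking your own timing runs:

```shell
# parallel_efficiency: speedup divided by thread count, as a percentage.
#   $1 = measured speedup over one thread, $2 = number of threads
parallel_efficiency() {
  awk -v s="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * s / t }'
}

# Example: the 16-thread figure quoted above.
parallel_efficiency 13.2 16
```

The speedup itself is the single-thread wall-clock time divided by the multi-thread wall-clock time for the same query and database.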

Memory Optimization

Memory Requirements

Per-thread memory usage:
| Database | Base Memory | Per Thread | 8 Threads |
| --- | --- | --- | --- |
| PDB70 | 1 GB | +0.1 GB | ~2 GB |
| Uniclust30 | 8 GB | +0.5 GB | ~12 GB |
| BFD | 32 GB | +2 GB | ~48 GB |
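The last column follows from base memory plus per-thread memory times the thread count. The sketch below applies that formula for other thread counts; `estimate_memory_gb` is a hypothetical helper, and the inputs are the approximate figures from this guide rather than measured values.

```shell
# estimate_memory_gb: base + per_thread * threads, all in GB.
#   $1 = base memory, $2 = extra memory per thread, $3 = thread count
estimate_memory_gb() {
  awk -v base="$1" -v per="$2" -v n="$3" 'BEGIN { printf "%.1f\n", base + per * n }'
}

# Example: BFD figures (32 GB base, 2 GB per thread) with 8 threads.
estimate_memory_gb 32 2 8
```

Comparing the estimate with installed RAM before a run gives a quick sanity check on the chosen -cpu value.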

Reduce Memory Usage

hhblits -i query.fasta -d bfd -cpu 4  # Use fewer threads

Memory-Constrained Systems

For systems with limited RAM:
  1. Use smaller databases: Uniclust30 instead of BFD
  2. Reduce parallelization: -cpu 2 or -cpu 4
  3. Process in batches: Split query files into smaller chunks
  4. Increase swap space: Not recommended for production (very slow)
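For step 3, one way to split a multi-sequence FASTA file is to write each record to its own file. This is a minimal sketch, assuming well-formed FASTA input and that each query should live in a separate file; `split_fasta` and the output naming scheme are hypothetical.

```shell
# split_fasta: hypothetical helper that writes each record of a multi-FASTA
# file to its own file: <prefix>_0001.fasta, <prefix>_0002.fasta, ...
#   $1 = input FASTA file, $2 = output path prefix
split_fasta() {
  local input="$1" prefix="$2"
  awk -v prefix="$prefix" '
    /^>/ {                        # a header line starts a new record
      if (out != "") close(out)   # close the previous file to limit open handles
      out = sprintf("%s_%04d.fasta", prefix, ++n)
    }
    out != "" { print > out }     # ignore anything before the first header
  ' "$input"
}

# Usage:
#   split_fasta queries.fasta chunks/query
```

The resulting files can then be processed a few at a time, keeping the number of concurrent searches within the memory budget.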

I/O Optimization

Database Storage

Use SSDs for databases when possible. Database I/O is a significant bottleneck, especially for:
  • First search after system boot (cold cache)
  • Searching different databases
  • Parallel searches by multiple users

FFindex Optimization

Optimize FFindex databases for sequential access:
ffindex_build -as database.ffdata.opt database.ffindex.opt \
  -d database.ffdata -i database.ffindex
mv database.ffdata.opt database.ffdata
mv database.ffindex.opt database.ffindex
This reorganizes data for better cache locality.

Search Parameter Optimization

Speed vs Sensitivity Trade-offs

hhblits -i query.fasta -d database -n 1 -cpu 8
# Single iteration, fastest

Parameter Impact on Speed

| Parameter | Impact | Speed |
| --- | --- | --- |
| -n 1 | Single iteration | Fastest |
| -n 3 | 3 iterations | Moderate |
| -n 8 | 8 iterations | Slow |
| -e 0.001 | Strict E-value | Faster |
| -e 1.0 | Relaxed E-value | Slower |
| -maxfilt 1000 | Small prefilter | Faster |
| -maxfilt 100000 | Large prefilter | Slower |

Batch Processing

For processing many queries efficiently:
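# queries.list: one query basename per line (no .fasta extension).
# -cpu 1 keeps 8 parallel jobs at 8 threads in total.
parallel -j 8 \
  "hhblits -i {}.fasta -d database -o {}.hhr -oa3m {}.a3m -cpu 1" < queries.list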

Profiling and Monitoring

Check SIMD Usage

# Count instructions that use 256-bit ymm registers; a non-zero count indicates AVX/AVX2 code
objdump -d "$(command -v hhblits)" | grep -c 'ymm'

Monitor Resource Usage

# Measure wall-clock and CPU time for a search
time hhblits -i query.fasta -d database -cpu 8 -v 2

# Detailed report including peak memory (GNU time, Linux)
/usr/bin/time -v hhblits -i query.fasta -d database -cpu 8
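GNU time reports peak memory as "Maximum resident set size (kbytes)". The sketch below extracts that line and converts it to GiB; `peak_rss_gb` is a hypothetical helper, and it assumes the GNU time output format.

```shell
# peak_rss_gb: read GNU time -v output on stdin and print peak memory in GiB.
peak_rss_gb() {
  awk -F': ' '/Maximum resident set size/ { printf "%.2f\n", $2 / 1048576 }'
}

# Usage (GNU time writes its report to stderr):
#   /usr/bin/time -v hhblits -i query.fasta -d database -cpu 8 2> time.log
#   peak_rss_gb < time.log
```

Running this for a few -cpu values shows how peak memory grows with thread count on your own database.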

Best Practices Summary

  • Use AVX2-compiled binaries if your CPU supports it (2x faster than SSE2)
  • Compile from source with -march=native for maximum performance
  • Use all physical cores with the -cpu flag
  • Store databases on SSD for better I/O performance
  • Optimize thread count based on available memory
  • Use MPI for cluster computing with many nodes
  • Adjust search parameters based on sensitivity needs
  • Process queries in batches for better CPU utilization
  • Monitor memory usage, especially with large databases
  • Optimize FFindex databases for sequential access
