Overview
HH-suite can search against various pre-built protein databases optimized for remote homology detection. These databases range from comprehensive sequence collections to specialized structure and domain databases.
Recommended Databases
- Uniclust30 - Comprehensive clustered protein sequences at 30% identity
- BFD - Big Fantastic Database with 2.5 billion environmental sequences
- PDB70 - Representative protein structures from the PDB
- Pfam - Curated protein family database
Uniclust30
Description
Uniclust30 is a comprehensive protein sequence database clustered at 30% sequence identity. It provides excellent coverage for homology detection while maintaining reasonable database size.
Key Features:
- Clustered from UniProt at 30% sequence identity
- Updated regularly
- Optimized for HHblits iterative searches
- Good balance of sensitivity and speed
Download
Usage
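An iterative search against Uniclust30 is a single hhblits call. The following is a minimal sketch, assuming the database was unpacked under a hypothetical /data/hhsuite directory and the query is in query.fasta. The helper name hhblits_cmd and the choice of 3 iterations and 4 threads are illustrative. The function only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Hypothetical locations: adjust to your own setup.
DB="${DB:-/data/hhsuite/uniclust30/uniclust30}"  # database basename, no extension
QUERY="${QUERY:-query.fasta}"

# Print an hhblits command for a 3-iteration search.
#   -o     hit list (.hhr)      -oa3m  final MSA in A3M format
#   -n     iterations           -cpu   number of threads
hhblits_cmd() {
    query=$1
    db=$2
    stem=${query%.*}            # strip the file extension for output names
    printf 'hhblits -i %s -d %s -o %s.hhr -oa3m %s.a3m -n 3 -cpu 4\n' \
        "$query" "$db" "$stem" "$stem"
}

hhblits_cmd "$QUERY" "$DB"
```

Piping the printed line to sh runs the search, provided hhblits is on the PATH.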
Reference
Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):D170-D176. doi: 10.1093/nar/gkw1081
BFD
Description
The Big Fantastic Database (BFD) contains approximately 2.5 billion protein sequences, mostly from environmental samples. It provides maximum sensitivity for detecting remote homologs.
Key Features:
- 2.5+ billion sequences
- Mostly environmental (metagenomic) sequences
- Highest sensitivity for remote homology detection
- Significantly larger and slower than Uniclust30
- Used by AlphaFold for MSA generation
Download
Usage
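BFD is often searched together with a smaller database, since hhblits accepts the -d option more than once. The sketch below assembles such a command. The helper name multi_db_cmd, the paths, and the iteration and thread counts are assumptions for illustration, and the command is printed rather than executed.

```shell
#!/bin/sh
# Print an hhblits command that searches several databases in one run.
# Usage: multi_db_cmd QUERY DB_BASENAME...
multi_db_cmd() {
    query=$1
    shift
    cmd="hhblits -i $query"
    for db in "$@"; do
        cmd="$cmd -d $db"       # one -d per database basename
    done
    printf '%s -oa3m %s.a3m -n 3 -cpu 8\n' "$cmd" "${query%.*}"
}

# Hypothetical basenames under /data/hhsuite.
multi_db_cmd query.fasta /data/hhsuite/bfd/bfd /data/hhsuite/uniclust30/uniclust30
```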
Reference
Steinegger M, Söding J (2018) Clustering huge protein sequence sets in linear time. Nature Communications, 9:2542. doi: 10.1038/s41467-018-04964-5
PDB70
Description
PDB70 is a filtered subset of protein structures from the Protein Data Bank, clustered at 70% maximum sequence identity. It’s ideal for structure-based searches and homology modeling.
Key Features:
- Representative protein structures from PDB
- Clustered at 70% sequence identity
- Includes secondary structure information
- Updated weekly
- Essential for structure prediction and modeling
Download
Usage
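A template search can be run in two steps: build a query MSA with hhblits, then search PDB70 with hhsearch. The sketch below prints both commands. The helper name template_search_cmds, the database paths, and the output file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical database basenames.
MSA_DB="${MSA_DB:-/data/hhsuite/uniclust30/uniclust30}"
PDB70="${PDB70:-/data/hhsuite/pdb70/pdb70}"

# Print the two commands of a template search for one query.
template_search_cmds() {
    stem=${1%.*}
    # Step 1: build a query MSA with hhblits (-oa3m writes it out).
    printf 'hhblits -i %s -d %s -oa3m %s.a3m -n 3\n' "$1" "$MSA_DB" "$stem"
    # Step 2: search PDB70 with that MSA using hhsearch.
    printf 'hhsearch -i %s.a3m -d %s -o %s_pdb70.hhr\n' "$stem" "$PDB70" "$stem"
}

template_search_cmds query.fasta
```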
PDB70 searches are typically performed with hhsearch after building an HMM profile from a multiple alignment, not directly with hhblits.
Pfam
Description
Pfam is a curated database of protein families, each represented by multiple sequence alignments and HMMs. It’s useful for domain annotation and functional classification.
Key Features:
- Manually curated protein families
- High-quality seed alignments
- Comprehensive functional annotation
- Domain architecture information
- Standard for protein family classification
Download
Usage
Reference
Mistry J, et al. (2021) Pfam: The protein families database in 2021. Nucleic Acids Research, 49(D1):D412-D419. doi: 10.1093/nar/gkaa913
SCOP
Description
The Structural Classification of Proteins (SCOP) database is organized hierarchically by fold, superfamily, and family.
Download
Usage
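Domain and fold databases such as Pfam and SCOP can be searched with the same query MSA, one hhsearch run per database. The sketch below prints one command per database. The helper name annotate_cmds and the basenames (including scop70) are assumptions, so substitute whatever prefix the downloaded files share.

```shell
#!/bin/sh
# Print one hhsearch command per database for an existing query MSA.
# Usage: annotate_cmds QUERY_A3M DB_BASENAME...
annotate_cmds() {
    msa=$1
    shift
    for db in "$@"; do
        name=${db##*/}          # last path component, used to label the output
        printf 'hhsearch -i %s -d %s -o %s_%s.hhr\n' \
            "$msa" "$db" "${msa%.*}" "$name"
    done
}

# Hypothetical basenames.
annotate_cmds query.a3m /data/hhsuite/pfam/pfam /data/hhsuite/scop/scop70
```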
Additional Databases
MPI Bioinformatics Toolkit Databases
The MPI Bioinformatics Toolkit maintains additional specialized databases:
- COG - Clusters of Orthologous Groups
- ECOG - Evolutionary Genealogy of Genes
- CDD - Conserved Domain Database
- dbCAN - Carbohydrate-Active Enzymes database
- SMART - Simple Modular Architecture Research Tool
Database Selection Guide
Which database should I use?
For general homology searches:
- Start with Uniclust30, which offers the best balance of speed and sensitivity
- Use BFD if you need maximum sensitivity and have the computational resources
For structural template searches:
- Use PDB70 to find structural templates
- Search it after generating an MSA with Uniclust30 or BFD
For domain and family annotation:
- Use Pfam for standard family classification
- Use SCOP for structural classification
- Use domain-specific databases (COG, dbCAN, etc.) from the MPI Toolkit
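The guide above can be written down as a simple lookup, which is handy when a pipeline has to pick a database from a configuration value. This is only a sketch: the function name suggest_db and the goal keywords are made up for illustration.

```shell
#!/bin/sh
# Map a search goal to the database suggested in the selection guide.
# The goal keywords are invented for this sketch.
suggest_db() {
    case $1 in
        homology)        echo "Uniclust30" ;;
        max-sensitivity) echo "BFD" ;;
        templates)       echo "PDB70" ;;
        families)        echo "Pfam" ;;
        fold-class)      echo "SCOP" ;;
        *)               echo "unknown goal: $1" >&2; return 1 ;;
    esac
}

suggest_db templates
```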
Database Formats
HH-suite databases consist of several files. All files with the same prefix must be present in the same directory for the database to work properly.
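A quick completeness check can catch a partial download before a search fails. This is a minimal sketch, assuming the usual HH-suite 3 layout of three file pairs sharing one prefix (_a3m, _hhm, and _cs219, each with .ffdata and .ffindex). The function name check_hhsuite_db and the example path are hypothetical.

```shell
#!/bin/sh
# Report any missing file of an HH-suite database.
# Usage: check_hhsuite_db BASENAME   (e.g. /data/hhsuite/uniclust30/uniclust30)
check_hhsuite_db() {
    missing=0
    for part in a3m hhm cs219; do
        for ext in ffdata ffindex; do
            f="$1_${part}.${ext}"
            if [ ! -f "$f" ]; then
                echo "missing: $f"
                missing=1
            fi
        done
    done
    return "$missing"           # 0 if all six files exist, 1 otherwise
}

# Hypothetical path; lists every file that is not found there.
check_hhsuite_db "${DB:-/data/hhsuite/uniclust30/uniclust30}" || echo "database incomplete"
```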
Performance Considerations
Database Size vs. Speed
| Database | Sequences | Disk Space | Search Time | Sensitivity |
|---|---|---|---|---|
| Uniclust30 | ~100M | ~100 GB | Fast | Good |
| BFD | ~2.5B | >1 TB | Slow | Excellent |
| PDB70 | ~50K | <5 GB | Very Fast | Structure-specific |
| Pfam | ~20K | <2 GB | Very Fast | Family-specific |
Memory Requirements
Building Custom Databases
You can create custom databases from your own sequences. See the Building Custom Databases guide for detailed instructions.
Database Updates
Databases are typically updated on the following schedules:
- Uniclust30: Every 2-3 months
- BFD: Annually
- PDB70: Weekly (follows PDB releases)
- Pfam: Every 6-12 months
Troubleshooting
Database Not Found
Make sure all database files (.ffdata, .ffindex) are in the specified directory, and pass the correct basename without extensions.
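A common cause is passing the path of one database file instead of the shared basename. The sketch below strips the suffixes to recover the basename, assuming file names of the form basename_{a3m,hhm,cs219}.{ffdata,ffindex}. The function name db_basename is hypothetical.

```shell
#!/bin/sh
# Turn a path to any database file into the basename that -d expects.
# Assumes file names of the form <basename>_{a3m,hhm,cs219}.{ffdata,ffindex}.
db_basename() {
    p=$1
    p=${p%.ffdata}              # drop the file extension, if present
    p=${p%.ffindex}
    p=${p%_a3m}                 # drop the per-file suffix, if present
    p=${p%_hhm}
    p=${p%_cs219}
    printf '%s\n' "$p"
}

db_basename /data/hhsuite/pdb70/pdb70_hhm.ffindex
```

A path that is already a bare basename passes through unchanged.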
Out of Memory
- Reduce the number of CPUs with the -cpu option
- Use a smaller database (Uniclust30 instead of BFD)
- Increase system swap space