Overview
HH-suite uses specialized database formats optimized for fast HMM-HMM comparisons. Understanding these formats is essential for creating custom databases and optimizing search performance.
Database Types
HHM Database
Standard HH-suite database format containing:
- HMM profiles (.hhm files)
- Index files for quick access
- Optional secondary structure information
A3M Database
Database of multiple sequence alignments:
- Stored in A3M format
- Can be converted to HHM database
- Used by hhblits for iterative searches
CA3M Database (Compressed A3M)
Compressed format for large databases:
- Reduces storage requirements
- Faster I/O operations
- Used with FFindex for efficient access
CS219 Database
Context-specific database using AS219 alphabet:
- Used for fast prefiltering in hhblits
- Compressed sequence representation
- Enables rapid database scanning
Database Components
FFindex Structure
HH-suite databases use FFindex for efficient random access:
For A3M Databases
For HHM Databases
For CA3M Databases
For CS219 Databases
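As a rough illustration of the layout shared by the database types above, the following Python sketch writes and reads an FFindex-style pair of files. It assumes the commonly described layout: a data file holding null-terminated entries back to back, and an index file with one tab-separated line per entry (name, byte offset, length including the terminator). The function names `write_ffindex` and `read_entry` are hypothetical helpers for this example, not part of HH-suite.

```python
def write_ffindex(entries, data_path, index_path):
    """Write {name: bytes} as an FFindex-style data/index pair (assumed layout)."""
    offset = 0
    with open(data_path, "wb") as data, open(index_path, "w") as index:
        for name in sorted(entries):            # keep the index sorted by entry name
            payload = entries[name] + b"\0"     # each entry ends with a null byte
            data.write(payload)
            # index line: name, byte offset, length (terminator included)
            index.write(f"{name}\t{offset}\t{len(payload)}\n")
            offset += len(payload)


def read_entry(name, data_path, index_path):
    """Return the bytes of one entry, without its null terminator."""
    with open(index_path) as index:
        for line in index:
            entry_name, offset, length = line.rstrip("\n").split("\t")
            if entry_name == name:
                with open(data_path, "rb") as data:
                    data.seek(int(offset))      # random access: jump straight to the entry
                    return data.read(int(length) - 1)
    raise KeyError(name)
```

The linear scan in `read_entry` is only for clarity; because the index is sorted, a real reader can locate an entry by binary search.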
Creating Databases
From FASTA File
Building HHM Database
Creating CS219 Index
Database Formats
Standard Database
Minimal database for hhsearch:
HHblits Database
Complete database for hhblits:
Compressed Database
For very large databases:
Using Databases
With hhblits
With hhsearch
Multiple Databases
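The cases above can be sketched as a small Python helper that assembles the command line without running it. It assumes the HH-suite 3 convention that `-d` takes a database prefix (the path without the `_a3m.ffdata` style suffix) and that several databases are given by repeating `-d`. The helper name `build_search_command` and all paths are hypothetical; check the flags against `hhblits -h` and `hhsearch -h` for the installed version.

```python
def build_search_command(tool, query, databases, output, iterations=None, cpus=None):
    """Return an argv list for hhblits or hhsearch (flags assumed from HH-suite 3)."""
    if tool not in ("hhblits", "hhsearch"):
        raise ValueError(f"unsupported tool: {tool}")
    cmd = [tool, "-i", query]
    for db in databases:                 # one -d per database prefix
        cmd += ["-d", db]
    cmd += ["-o", output]
    if iterations is not None:
        if tool != "hhblits":            # iterative search is an hhblits feature
            raise ValueError("iterations only apply to hhblits")
        cmd += ["-n", str(iterations)]
    if cpus is not None:
        cmd += ["-cpu", str(cpus)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.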
Database Statistics
Database Size
Estimate storage requirements:
- A3M: ~500 bytes per sequence (varies with alignment size)
- HHM: ~5-10 KB per profile (for ~100 residue protein)
- CS219: ~100 bytes per sequence
- CA3M: ~60% of uncompressed A3M size
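For a quick back-of-the-envelope figure, the per-entry numbers above can be turned into a small calculator. This is only a sketch: it takes 7.5 KB as the midpoint of the 5-10 KB HHM range and treats 1 KB as 1000 bytes, both of which are assumptions made for this example, and real sizes depend heavily on alignment depth and protein length.

```python
A3M_BYTES = 500        # ~500 bytes per sequence
CS219_BYTES = 100      # ~100 bytes per sequence
CA3M_RATIO = 0.60      # CA3M is ~60% of the uncompressed A3M size


def estimate_storage(n_entries, hhm_kb=7.5):
    """Return rough storage estimates in bytes for each database component."""
    a3m = n_entries * A3M_BYTES
    return {
        "a3m": a3m,
        "hhm": round(n_entries * hhm_kb * 1000),   # assumed: 1 KB = 1000 bytes
        "cs219": n_entries * CS219_BYTES,
        "ca3m": round(a3m * CA3M_RATIO),
    }
```

For example, `estimate_storage(1_000_000)` puts one million entries at roughly 0.5 GB of A3M data, 7.5 GB of HHM profiles, and 0.1 GB for the CS219 index.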
Database Diversity
Key metrics:
- Number of entries: Total sequences/profiles
- Average Neff: Sequence diversity (aim for >4)
- Coverage: Proteome or domain coverage
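One way to obtain the first two metrics is to scan the HHM profiles for their header fields. The sketch below assumes each profile header carries a line of the form `NEFF <value>`, as HHM files written by hhmake normally do; the function names are hypothetical and the parser deliberately ignores everything else in the file.

```python
def neff_values(hhm_text):
    """Collect the NEFF header value of every profile found in the given text."""
    values = []
    for line in hhm_text.splitlines():
        fields = line.split()
        # header line assumed to look like: "NEFF  5.2"
        if len(fields) == 2 and fields[0] == "NEFF":
            values.append(float(fields[1]))
    return values


def diversity_summary(hhm_text):
    """Return (number of profiles with a NEFF line, average Neff or None)."""
    values = neff_values(hhm_text)
    if not values:
        return 0, None
    return len(values), sum(values) / len(values)
```

An average well below 4 would suggest that the underlying alignments are too shallow to give sensitive profiles.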
Prebuilt Databases
UniProt Databases
- UniProt20: Clustered at 20% identity
- UniProt30: Clustered at 30% identity
- UniRef30: Representative sequences
Domain Databases
- PDB70: Representative PDB structures
- Pfam: Protein families
- SCOP: Structural classification
Database Naming
Common naming convention:
Database Maintenance
Updating Databases
Merging Databases
Database Validation
Performance Optimization
SSD vs HDD
- SSD: 5-10x faster for random access
- HDD: Acceptable for sequential scans
- Network: Can be slow, consider local copies
Memory Considerations
- Prefilter: Loads CS219 index into memory
- HMM Search: Random access to HHM database
- Large databases: May require substantial RAM for prefilter
Database Location
Custom Database Creation
From Protein Sequences
- Collect sequences
- Generate MSAs (optional but recommended)
- Build FFindex database
- Create CS219 index
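The steps above can be sketched as a list of command lines, here generated in Python so they can be reviewed before anything is executed. The commands and flags are modeled on the database-building recipe in the HH-suite user guide and should be treated as assumptions to verify against each tool's help output. With `ffindex_apply`, the alignments are written straight into an FFindex database, so the second and third steps collapse into one command. The function name, file prefixes, and reference database path are hypothetical.

```python
def database_build_commands(fasta, db, reference_db):
    """Return argv lists for building a custom database (flags are assumptions)."""
    return [
        # Pack the collected sequences into an FFindex database, one entry per sequence.
        ["ffindex_from_fasta", "-s", f"{db}_fas.ffdata", f"{db}_fas.ffindex", fasta],
        # Generate an MSA per sequence with hhblits and store the results as <db>_a3m.
        ["ffindex_apply", "-d", f"{db}_a3m.ffdata", "-i", f"{db}_a3m.ffindex",
         f"{db}_fas.ffdata", f"{db}_fas.ffindex", "--",
         "hhblits", "-d", reference_db, "-i", "stdin", "-oa3m", "stdout",
         "-n", "2", "-cpu", "1", "-v", "0"],
        # Translate the alignments into the CS219 prefilter index.
        ["cstranslate", "-f", "-x", "0.3", "-c", "4", "-I", "a3m",
         "-i", f"{db}_a3m", "-o", f"{db}_cs219"],
    ]
```

Printing each list with `" ".join(cmd)` gives a script that can be inspected and then run by hand.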
Quality Control
- Remove redundancy: Use cd-hit or hhfilter
- Check coverage: Ensure diverse representation
- Validate entries: Ensure no corrupted sequences
- Test search: Run sample searches
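The entry validation step can be approximated with a short FASTA check before any database is built. The sketch below flags empty sequences, characters outside an assumed amino-acid alphabet (the 20 standard residues plus B, Z, X, U, and O), and duplicate entry names; the alphabet and the function names are choices made for this example.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXUO")   # assumed alphabet


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_problems(text):
    """Return a list of (entry name, problem description) tuples."""
    problems, seen = [], set()
    for header, seq in parse_fasta(text):
        parts = header.split()
        name = parts[0] if parts else ""
        if not seq:
            problems.append((name, "empty sequence"))
        elif set(seq.upper()) - VALID_RESIDUES:
            problems.append((name, "unexpected characters"))
        if name in seen:                 # entry names must be unique in an index
            problems.append((name, "duplicate name"))
        seen.add(name)
    return problems
```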
FFindex Tools
Building Index
The -s option of ffindex_build sorts the index by entry name.
Extracting Entries
Modifying Database
Troubleshooting
Database Not Found
Error: “Could not open database”
Solution:
- Check database path
- Verify all required files exist
- Check file permissions
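These checks can be automated with a short script. The sketch assumes the usual HH-suite 3 file naming, in which a database prefix is extended with `_a3m`, `_hhm`, and `_cs219` followed by `.ffdata` and `.ffindex`; adjust `HHBLITS_SUFFIXES` if the database in question uses a different set of files. The function name is hypothetical.

```python
import os

# Assumed file set for an hhblits-ready database with prefix <db>.
HHBLITS_SUFFIXES = [
    "_a3m.ffdata", "_a3m.ffindex",
    "_hhm.ffdata", "_hhm.ffindex",
    "_cs219.ffdata", "_cs219.ffindex",
]


def missing_or_unreadable(prefix, suffixes=HHBLITS_SUFFIXES):
    """Return the expected files that do not exist or cannot be read."""
    problems = []
    for suffix in suffixes:
        path = prefix + suffix
        # covers both a wrong path and insufficient permissions
        if not os.path.isfile(path) or not os.access(path, os.R_OK):
            problems.append(path)
    return problems
```

An empty result means every expected file is present and readable, so the cause of the error lies elsewhere, for example in the prefix passed to `-d`.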
Corrupted Database
Error: “Invalid FFindex format”
Solution:
- Rebuild index: ffindex_build
- Validate entries
- Check disk space
Slow Performance
Solution:
- Move database to faster storage (SSD)
- Increase RAM for prefilter
- Use compressed format (CA3M)
- Update to latest HH-suite version
Best Practices
Database Organization
Version Control
- Include date in database name
- Keep old versions temporarily
- Document database contents and source
- Track database statistics
Documentation
Create README for each database:
See Also
- hhblits - Database searching tool
- A3M Format - Alignment format
- HHM Format - HMM format
- cstranslate - Create CS219 indices