Overview
HHfilter reads an alignment in A2M, A3M, or FASTA format and writes a filtered alignment in A3M format. It can filter by maximum pairwise identity, minimum coverage, sequence identity with the master sequence, and alignment diversity.
Key Features
- Redundancy removal: Filter sequences above a sequence identity threshold
- Coverage filtering: Remove sequences with low coverage
- Diversity control: Select the most diverse set of sequences
- Fast processing: Efficiently handles large alignments
When to Use HHfilter
Use HHfilter when you need to:
- Reduce alignment size: Remove redundant sequences before HMM building
- Control memory usage: Smaller alignments require less RAM
- Speed up searches: Fewer sequences mean faster HMM comparisons
- Improve alignment quality: Remove partial or low-quality sequences
- Prepare alignments for downstream tools: Many tools benefit from filtered input
HHfilter is often used as a preprocessing step before hhmake or as part of an alignment pipeline.
Basic Usage
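As a minimal sketch, an HHfilter call can be assembled from a script using only the options documented on this page. The helper name `build_hhfilter_cmd` and the file names are hypothetical, and the sketch assumes the `hhfilter` binary is on your PATH when you actually run the command.

```python
import shlex


def build_hhfilter_cmd(infile, outfile, max_id=90, min_cov=0, min_qid=0, diff=0):
    """Assemble an hhfilter argument list (hypothetical helper, not part of HH-suite)."""
    cmd = ["hhfilter", "-i", infile, "-o", outfile, "-id", str(max_id)]
    if min_cov > 0:
        cmd += ["-cov", str(min_cov)]   # minimum coverage with query (%)
    if min_qid > 0:
        cmd += ["-qid", str(min_qid)]   # minimum identity with query (%)
    if diff > 0:
        cmd += ["-diff", str(diff)]     # diversity filtering
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_hhfilter_cmd("input.a3m", "filtered.a3m", max_id=90, min_cov=50)
print(shlex.join(cmd))

# To execute the command (requires HHfilter to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the call as a list rather than a single string avoids shell quoting problems when file names contain spaces.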
Common Use Cases
Standard Redundancy Removal
Remove highly similar sequences with -id (default use case from help).
Prepare Alignment for HMM Building
Filter before creating an HMM.
Maximum Diversity Selection
Select the most diverse set of sequences with -diff.
Quality Filtering
Remove low-quality sequences with -cov, -qid, and -qsc.
Pipeline Integration
Use in a sequence processing pipeline.
Key Parameters
Input/Output Options
- -i <file> - Input alignment (A2M, A3M, or FASTA format)
- -o <file> - Output file in A3M format (overwrites existing)
- -a <file> - Append to output file instead of overwriting
- -v <int> - Verbose mode (0=silent, 1=warnings, 2=verbose)
Sequence Identity Filtering
- -id [0,100] - Maximum pairwise sequence identity % (default: 90)
  - Remove the shorter of two sequences if identity exceeds this threshold
  - Lower values = more aggressive filtering
  - Common values: 50 (stringent), 70 (moderate), 90 (light)
Coverage and Quality
- -cov [0,100] - Minimum coverage with query % (default: 0)
  - Sequences must align to at least this % of query length
- -qid [0,100] - Minimum sequence identity with query % (default: 0)
  - Sequences must have at least this % identity to the first sequence
- -qsc [-inf,100] - Minimum score per column with query (default: -20.0)
  - Higher values = stricter quality requirements
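The coverage and query-identity criteria can be illustrated with a simplified sketch. It assumes equal-length aligned rows with `-` as the gap character, counts coverage over the query's residue columns, and counts identity over columns where both sequences have a residue. The function names are hypothetical, and HHfilter's internal definitions may differ in detail.

```python
def coverage_pct(query, seq):
    """Percent of the query's residue columns in which seq also has a residue."""
    q_cols = [i for i, c in enumerate(query) if c != "-"]
    covered = sum(1 for i in q_cols if seq[i] != "-")
    return 100.0 * covered / len(q_cols)


def query_identity_pct(query, seq):
    """Percent identical residues over columns where both rows have a residue."""
    pairs = [(q, s) for q, s in zip(query, seq) if q != "-" and s != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(1 for q, s in pairs if q == s) / len(pairs)


def passes(query, seq, min_cov=0.0, min_qid=0.0):
    """Mimic the -cov and -qid thresholds: both minimums must be met."""
    return (coverage_pct(query, seq) >= min_cov
            and query_identity_pct(query, seq) >= min_qid)


query = "MKTAYIAKQR"
fragment = "MKTA----QR"    # covers 6 of 10 query columns
homolog = "MRTAYLAKQ-"     # covers 9 of 10 query columns, 7 of 9 identical

print(coverage_pct(query, fragment))            # 60.0
print(passes(query, fragment, min_cov=75))      # False
print(passes(query, homolog, min_cov=75, min_qid=70))  # True
```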
Diversity Filtering
- -diff [0,inf] - Filter for diversity (default: 0 = off)
  - Select most diverse set keeping at least this many sequences per 50-column block
  - Higher values = maintain more diversity
- -neff [1,inf] - Target diversity (effective number of sequences)
  - Iteratively adjust -qsc to reach target Neff value
Input Format
- -M a2m - A2M/A3M format (default): upper=match, lower=insert
- -M first - FASTA format: first sequence defines match states
- -M [0,100] - FASTA format: columns with <X% gaps are match states
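The two FASTA match-state rules can be sketched as follows. This is an illustration under stated assumptions, not HHfilter's implementation: it uses plain unweighted gap counts, the function names are hypothetical, and the rows are toy data.

```python
def match_columns_first(rows):
    """-M first: a column is a match state if the first sequence has a residue there."""
    return [c != "-" for c in rows[0]]


def match_columns_gap(rows, max_gap_pct):
    """-M X: a column is a match state if fewer than X% of its rows are gaps."""
    flags = []
    for j in range(len(rows[0])):
        gaps = sum(1 for r in rows if r[j] == "-")
        flags.append(100.0 * gaps / len(rows) < max_gap_pct)
    return flags


def to_a3m_row(row, flags):
    """Write one aligned FASTA row in A3M style: upper=match, lower=insert."""
    out = []
    for c, is_match in zip(row, flags):
        if is_match:
            out.append(c.upper())    # residue, or '-' for a deletion
        elif c != "-":
            out.append(c.lower())    # residue in an insert column
        # gaps in insert columns are simply omitted in A3M
    return "".join(out)


rows = ["MK-TA", "MKLTA", "M--TA", "MK-T-"]
flags = match_columns_gap(rows, 50)
print(flags)                              # [True, True, False, True, True]
print([to_a3m_row(r, flags) for r in rows])  # ['MKTA', 'MKlTA', 'M-TA', 'MKT-']
```

In this toy alignment the third column is 75% gaps, so it becomes an insert column under both rules, and only the second row keeps a lowercase residue there.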
Understanding Filtering Behavior
Identity Filtering (-id)
When two sequences exceed the identity threshold:
- The shorter sequence is removed
- This preserves longer, more complete sequences
- Applied pairwise across all sequences
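The behavior described above can be approximated with a greedy sketch: visit sequences from longest to shortest and keep each one only if it does not exceed the identity threshold against any sequence already kept. The assumptions here are mine, not taken from the HHfilter source: identity is counted over columns where both rows have a residue, the master (first) sequence is always kept, and the function names are hypothetical.

```python
def n_residues(seq):
    """Number of non-gap characters in an aligned row."""
    return sum(1 for c in seq if c != "-")


def pairwise_identity_pct(a, b):
    """Percent identical residues over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(1 for x, y in pairs if x == y) / len(pairs)


def filter_by_identity(rows, max_id=90.0):
    """Return indices of kept rows; rows[0] is the master and is always kept."""
    # Longer sequences are visited first, so the shorter of a similar pair is dropped.
    order = [0] + sorted(range(1, len(rows)), key=lambda i: -n_residues(rows[i]))
    kept = []
    for i in order:
        if all(pairwise_identity_pct(rows[i], rows[k]) <= max_id for k in kept):
            kept.append(i)
    return sorted(kept)


rows = [
    "MKTAYIAKQR",   # master
    "MKTAYIAKQK",   # 90% identical to the master
    "MKTAYIAK--",   # shorter, 100% identical over its aligned region
    "MRSGYLVKHR",   # divergent
]
print(filter_by_identity(rows, max_id=90))   # [0, 1, 3]
print(filter_by_identity(rows, max_id=80))   # [0, 3]
```

Lowering the threshold from 90 to 80 also removes the second row, which matches the rule that lower -id values filter more aggressively.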
Coverage Filtering (-cov)
Sequences must align to at least X% of the master (first) sequence.
Diversity Filtering (-diff)
Selects the most diverse set of sequences.
Output Format
HHfilter always outputs in A3M format.
Tips and Best Practices
Advanced Options
Target Effective Sequence Count
Use -neff to adjust filtering automatically toward a target Neff. With a target of 7.0, for example, HHfilter iteratively adjusts the -qsc parameter until the effective number of sequences ≈ 7.0.
Maximum Sequence Limits
Control memory usage for very large alignments with -maxseq (limits input sequences) and -maxres (limits HMM columns).
Combined Filtering Strategy
A combined filter that suits most use cases.
Workflow Integration
Preprocessing for Database Building
MSA Quality Control
Troubleshooting
Too many sequences removed
If filtering is too aggressive:
- Increase -id threshold (e.g., 70 → 90)
- Lower -cov requirement (e.g., 75 → 50)
- Remove -qid and -qsc filters
- Check if input sequences are highly diverse
Not enough filtering
If alignment is still too large:
- Decrease -id threshold (e.g., 90 → 70)
- Increase -cov requirement
- Add -qid filter
- Use -diff to keep only the most diverse subset of sequences
Memory errors with large alignments
If HHfilter runs out of memory:
- Use -maxseq to limit input sequences
- Use -maxres to limit HMM columns
- Split alignment into chunks
- Filter in multiple passes with different criteria
Comparison with Other Filtering Tools
| Feature | HHfilter | CD-HIT | USEARCH |
|---|---|---|---|
| Input | MSA (A3M/A2M) | FASTA | FASTA |
| Output | A3M | FASTA | FASTA |
| Method | Pairwise identity | Clustering | Clustering |
| Speed | Fast | Very fast | Very fast |
| Diversity | Yes (-diff) | No | No |
| Coverage | Yes | Limited | Yes |
Related Tools
- hhmake - Build HMMs from filtered alignments
- hhconsensus - Generate consensus sequences
- hhblits - Includes built-in filtering during search