
Overview

HH-suite provides several utilities for working with A3M (FASTA-like) multiple sequence alignments. These tools enable compression, extraction, filtering, and reduction of A3M files, which is essential for managing large alignment databases.

Available Tools

a3m_compress

Compress A3M alignments by storing sequences as references to a sequence database

a3m_extract

Extract compressed A3M alignments back to standard A3M format

a3m_reduce

Reduce redundancy in A3M alignments by filtering similar sequences

a3m_database_reduce

Reduce A3M databases in FFindex format

a3m_database_extract

Extract A3M alignments from FFindex databases

a3m_database_filter

Filter A3M databases based on various criteria

a3m_extract

Overview

Extracts compressed A3M alignments back to standard A3M format. The compressed format stores sequences as indices and block encodings that reference a sequence database.

Usage

a3m_extract -i <input> -o <output> \
  -d <sequence_db_prefix> \
  -q <header_db_prefix>

Options

Option        Description
-i <file>     Input compressed A3M file or stdin
-o <file>     Output A3M file or stdout
-d <prefix>   FFindex sequence database prefix (without .ffdata/.ffindex)
-q <prefix>   FFindex header database prefix
-h            Display help message

Example

a3m_extract -i compressed.ca3m -o output.a3m \
  -d sequence_db \
  -q header_db

Compression Algorithm

The compression algorithm works by:
  1. Storing the consensus sequence in plain text
  2. Referencing sequences by index from a pre-built sequence database
  3. Encoding alignments as blocks of matches, insertions, and deletions:
    • Matches: Upper-case letters aligned to consensus
    • Insertions: Lower-case letters (not in consensus)
    • Deletions: Gaps in the sequence relative to consensus
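
The three character classes above can be told apart with simple case checks. The following is a minimal sketch, not code from HH-suite; the function name `classify_a3m_columns` and the sample row are hypothetical.

```python
def classify_a3m_columns(aligned_row: str) -> dict:
    """Count match, insertion, and deletion characters in one A3M row."""
    counts = {"match": 0, "insertion": 0, "deletion": 0}
    for ch in aligned_row:
        if ch == "-":
            counts["deletion"] += 1      # gap relative to the consensus
        elif ch.isupper():
            counts["match"] += 1         # residue aligned to the consensus
        elif ch.islower():
            counts["insertion"] += 1     # residue not present in the consensus
        # any other character (for example '.') is ignored in this sketch
    return counts


print(classify_a3m_columns("MK-LVaaTG--E"))
# {'match': 7, 'insertion': 2, 'deletion': 3}
```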

1. Identify Sequence

Extract sequence ID from header and look up in sequence database (source: a3m_compress.cpp:356-382)
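
As a rough illustration of this step, the sketch below assumes the sequence ID is the first whitespace-delimited token of the header and uses a plain dictionary as a stand-in for the sequence database index. Both are assumptions; the actual parsing rules are in a3m_compress.cpp, and the header and IDs shown are made up.

```python
def sequence_id_from_header(header: str) -> str:
    """Return the first token of a FASTA-style header, without the '>'."""
    parts = header.lstrip(">").split(maxsplit=1)
    return parts[0] if parts else ""


# Hypothetical stand-in for the sequence database: ID -> entry index
sequence_index = {"UniRef100_P12345": 0, "UniRef100_Q67890": 1}

seq_id = sequence_id_from_header(">UniRef100_P12345 Example protein n=1")
print(seq_id, sequence_index.get(seq_id))
# UniRef100_P12345 0
```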

2. Find Start Position

Determine where the aligned sequence starts in the full sequence (source: a3m_compress.cpp:477-498)
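
One simple way to picture this step: strip the gaps from the aligned row, upper-case the insertions, and search for the result in the full sequence. This is a sketch under that assumption and is not necessarily how a3m_compress.cpp locates the start; `find_start_position` and the sequences are hypothetical.

```python
def find_start_position(full_sequence: str, aligned_row: str) -> int:
    """Return the 0-based offset of the aligned residues, or -1 if absent."""
    # Keep residues only: drop gaps, upper-case insertions
    residues = "".join(ch.upper() for ch in aligned_row if ch.isalpha())
    return full_sequence.find(residues)


print(find_start_position("MSTAKLVEGHQ", "AK-LvEG"))
# 3
```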

3. Encode Blocks

Store sequence as blocks of:
  • Number of matches (upper-case residues)
  • Number of insertions (lower-case residues) or deletions (gaps)
Each block uses 2 bytes: 1 for matches, 1 for insertions/deletions (source: a3m_compress.cpp:396-473)
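
The block idea can be sketched as a run-length encoding of the aligned row. The sign convention below (positive for insertions, negative for deletions) is an assumption made for illustration, and the sketch does not split runs that would overflow a single byte; see a3m_compress.cpp for the real encoding.

```python
import re


def encode_blocks(aligned_row: str) -> list:
    """Encode an A3M row as (matches, indel) pairs.

    indel > 0 counts insertions, indel < 0 counts deletions
    (assumed convention, not taken from the HH-suite source).
    """
    blocks = []
    # Each token is a run of matches, optionally followed by one run of
    # insertions (lower-case) or one run of deletions ('-').
    for token in re.finditer(r"([A-Z]*)([a-z]+|-+)?", aligned_row):
        matches = token.group(1)
        indel_run = token.group(2) or ""
        if not matches and not indel_run:
            continue  # zero-width match, nothing to encode
        if indel_run.islower():
            indel = len(indel_run)
        else:
            indel = -len(indel_run)
        blocks.append((len(matches), indel))
    return blocks


print(encode_blocks("MK-LVaaTG--E"))
# [(2, -1), (2, 2), (2, -2), (1, 0)]
```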

4. Write Compressed Data

Write:
  • 4-byte sequence database index
  • 2-byte start position
  • 2-byte number of blocks
  • Block data
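
The record layout listed above can be mocked up with Python's `struct` module. The field widths follow the list; the little-endian byte order and the signed insertion/deletion byte are assumptions of this sketch, and `pack_record` is a hypothetical helper.

```python
import struct


def pack_record(db_index: int, start: int, blocks: list) -> bytes:
    """Pack one compressed sequence record.

    Layout: 4-byte index, 2-byte start, 2-byte block count, then
    2 bytes per block (unsigned matches, signed indel; assumed).
    """
    header = struct.pack("<IHH", db_index, start, len(blocks))
    body = b"".join(struct.pack("<Bb", m, indel) for m, indel in blocks)
    return header + body


record = pack_record(7, 3, [(2, -1), (2, 2)])
print(len(record))                     # 8 header bytes + 2 blocks * 2 bytes
print(struct.unpack("<IHH", record[:8]))
```

With two blocks the record is 12 bytes long, which shows why the format is compact: the residues themselves are never stored, only their location in the sequence database.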

A3M Format Validation

Source: scripts/a3m.py

The A3M format assigns specific meanings to characters:

Valid Characters

VALID_MATCH_STATES = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Upper-case amino acids aligned to consensus

Special Sequences

A3M files can include annotation sequences:
Header         Description                      Valid Characters
>ss_pred       Predicted secondary structure    E (extended), C (coil), H (helix)
>ss_conf       Secondary structure confidence   0-9 (confidence levels)
>ss_dssp       DSSP secondary structure         CHBEGITS-
>*_consensus   Consensus sequence               Standard amino acids
Each A3M file should contain exactly one consensus sequence. Multiple consensus sequences will cause an error.
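
A quick pre-check for the consensus rule might look like the sketch below. It assumes a consensus entry is any header whose first token ends in `_consensus`; the helper name and sample alignment are hypothetical and this is not the check performed by a3m.py itself.

```python
def count_consensus_headers(a3m_text: str) -> int:
    """Count header lines whose first token ends in '_consensus'."""
    count = 0
    for line in a3m_text.splitlines():
        if line.startswith(">") and line.split()[0].endswith("_consensus"):
            count += 1
    return count


sample = ">query_consensus\nMKLV\n>query\nMKLV\n>hit1 some description\nMK-V\n"
print(count_consensus_headers(sample))
# 1
```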

Working with FFindex Databases

a3m_database_extract

Extract specific entries from an A3M FFindex database:
a3m_database_extract -i db_a3m -o output.a3m \
  -d sequence_db \
  -q header_db \
  -e entry_name

a3m_database_reduce

Reduce redundancy in an entire A3M database:
a3m_database_reduce -i db_a3m \
  -o db_a3m_reduced \
  -d sequence_db \
  -q header_db \
  --max-seqid 90

a3m_database_filter

Filter database entries by various criteria:
a3m_database_filter -i db_a3m \
  -o db_a3m_filtered \
  -d sequence_db \
  -q header_db \
  --min-sequences 10

Python Utilities

Source: scripts/a3m.py

The a3m.py module provides Python utilities for A3M manipulation:
from a3m import A3M_Container

# Read A3M file
a3m = A3M_Container()
with open('input.a3m', 'r') as fh:
    a3m.read_a3m(fh)

print(f"Number of sequences: {a3m.number_sequences}")
print(f"Match states: {a3m.nr_match_states}")

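# Check sequence validity (sequence_string is an illustrative placeholder)
sequence_string = "MKLVaaTG--E"
a3m.check_sequence(sequence_string)

# Extract a sub-alignment; start and end are placeholder limits,
# choose values that lie within your alignment
start, end = 1, 50
sub_a3m = a3m.split_a3m([(start, end)])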

Key Methods

  • check_and_add_sequence(): Validate and add a sequence to the container
  • check_match_states(): Verify all sequences have the same number of match states
  • split_a3m(): Extract a subsequence range from the alignment
  • get_sub_sequence(): Get a specific region of a sequence

Performance Tips

OpenMP Support: The A3M compression tools support OpenMP for parallel processing when compiled with OpenMP support (source: a3m_compress.cpp:11-13).
  • Compress alignments when storing large databases to save disk space
  • Use FFindex format for databases with many alignments
  • Validate A3M files with check_a3m.py before processing
  • Keep the sequence and header databases alongside compressed alignments; a3m_extract needs them to restore the full A3M
  • Build separate header and sequence databases for optimal compression

Common Workflows

Create Compressed Database

1. Build Sequence Database

Extract all sequences and create FFindex databases:
hhsuitedb.py -i alignments/ -o database

2. Compress Alignments

Compress A3M files referencing the sequence database:
a3m_compress -i alignment.a3m -o compressed.ca3m \
  -d database_sequence -q database_header

3. Build FFindex

Create FFindex from compressed files:
ffindex_build -s db.ffdata db.ffindex compressed_dir/
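
When many alignments need compressing, the a3m_compress call from step 2 can be generated per file. The sketch below only builds the argument lists, using the flags shown above; the helper name, file names, and the `.ca3m` output naming are assumptions. Each list can be passed to `subprocess.run` once the tools are installed.

```python
from pathlib import Path


def compress_commands(a3m_files, out_dir, seq_db, header_db):
    """Build one a3m_compress argument list per input alignment."""
    commands = []
    for path in map(Path, a3m_files):
        # Output keeps the input's base name with a .ca3m suffix (assumed)
        output = Path(out_dir) / (path.stem + ".ca3m")
        commands.append([
            "a3m_compress",
            "-i", str(path),
            "-o", str(output),
            "-d", seq_db,
            "-q", header_db,
        ])
    return commands


cmds = compress_commands(
    ["alignments/a.a3m", "alignments/b.a3m"],
    "compressed_dir", "database_sequence", "database_header",
)
for cmd in cmds:
    print(" ".join(cmd))
```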

Error Handling

Common errors and solutions:
Error                                        Cause                                               Solution
"More than one consensus sequence"           Multiple sequences ending in _consensus             Ensure only one consensus per A3M
"No protein sequences could be compressed"   No matching sequences in database                   Check that sequence IDs match the database
"Sequence with zero match states"            Empty or invalid sequence                           Validate A3M format
"Diverging number of match states"           Sequences differ in their number of match states    Check alignment integrity
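
The last two errors can be caught before running the tools by counting match states per row. The sketch below assumes a match state is an upper-case letter or a gap and takes the first row as the reference; the function names and sample rows are hypothetical.

```python
def match_state_count(aligned_row: str) -> int:
    """Count match-state columns: upper-case residues plus gaps."""
    return sum(1 for ch in aligned_row if ch.isupper() or ch == "-")


def find_diverging_rows(rows: dict) -> list:
    """Return names of rows whose match-state count differs from the first."""
    names = list(rows)
    if not names:
        return []
    expected = match_state_count(rows[names[0]])
    return [n for n in names if match_state_count(rows[n]) != expected]


rows = {"query": "MK-LVaaTG", "hit1": "MKALV-G", "hit2": "MKLV"}
print(find_diverging_rows(rows))
# ['hit2']
```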
