Overview
cstranslate converts protein sequences or alignments into an abstract state alphabet (AS219) for improved sequence comparison. This context-specific transformation can enhance homology detection by mapping each position, together with its local sequence context, to one of 219 representative states instead of one of the 20 amino acid letters.
Basic Usage
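The usage snippet for this section is not reproduced here. As a minimal sketch, the invocation below uses only the -i and -o options documented in the tables that follow. The file names query.a3m and query.as219 are hypothetical, and the Python wrapper is just one convenient way to assemble and run the command, assuming a cstranslate binary is on PATH.

```python
import os
import shutil
import subprocess


def basic_command(infile, outfile):
    """Return the argv list for a plain translation run (-i and -o only)."""
    return ["cstranslate", "-i", infile, "-o", outfile]


# Hypothetical file names, for illustration only.
cmd = basic_command("query.a3m", "query.as219")
print(" ".join(cmd))

# Run only when the binary and the input file are actually present.
if shutil.which("cstranslate") is not None and os.path.exists("query.a3m"):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.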
Command-Line Options
Input/Output
| Option | Description | Default |
|---|---|---|
| -i, --infile <file> | Input file with alignment or sequence | Required |
| -o, --outfile <file> | Output file for abstract state sequence | <infile>.as |
| -a, --append <file> | Append output to this file | None |
| -I, --informat | Input format: prf, seq, fas, a2m, a3m, ca3m | auto |
| -O, --outformat | Output format: seq (sequence) or prf (profile) | seq |
Pseudocount Options
| Option | Description | Default |
|---|---|---|
| -x, --pc-admix [0,1] | Pseudocount admixture for context-specific pseudocounts | 0.90 |
| -c, --pc-ali [0,inf] | Constant in pseudocount calculation for alignments | 12.0 |
| -D, --context-data <file> | Context-data file for pseudocounts | internal |
| -A, --alphabet <file> | Abstract state alphabet (219 states) | internal |
| -w, --weight [0,inf] | Weight of abstract state column in emission | 1000.0 |
Advanced Options
| Option | Description |
|---|---|
| -M, --match-assign [0:100] | Make FASTA columns with <X% gaps match columns |
| -f, --ffindex | Read/write from FFindex databases (enables OpenMP) |
| -v, --verbose | Verbose output mode |
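To make the -M rule concrete, the sketch below marks a column as a match column when its percentage of gaps is strictly below the threshold. This is an illustration of the rule as worded in the table, not cstranslate's own code: the function name match_columns, the treatment of both '-' and '.' as gaps, and the unweighted gap count are assumptions.

```python
def match_columns(alignment, max_gap_percent):
    """Return indices of columns whose gap percentage is below the threshold."""
    if not alignment:
        return []
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    matches = []
    for col in range(length):
        # Count gap characters in this column (assumed gap symbols: '-' and '.').
        gaps = sum(1 for seq in alignment if seq[col] in "-.")
        if 100.0 * gaps / n_seqs < max_gap_percent:
            matches.append(col)
    return matches


# Toy alignment: column 2 has 75% gaps, so it is not a match column at X = 50.
toy = ["MK-LV", "MKALV", "M--LV", "MK--V"]
print(match_columns(toy, 50))  # [0, 1, 3, 4]
```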
How It Works
Read Input
cstranslate reads protein sequences or multiple alignments in various formats (A3M, FASTA, etc.)
Apply Context-Specific Pseudocounts
If enabled, adds context-specific pseudocounts using a context library or CRF model to improve profile quality
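Pseudocount admixture is commonly written as p = (1 - tau) * f + tau * g, where f is the observed frequency vector of a column, g the pseudocount distribution, and tau the admixture weight (the role played by -x). The sketch below shows only this mixing step on a toy three-letter alphabet. How cstranslate derives g from the context library, and how the -c constant adjusts the weight for alignments, is not shown here; the function name admix is hypothetical.

```python
def admix(observed, pseudocounts, tau):
    """Mix observed frequencies with pseudocounts using admixture weight tau."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(observed) != len(pseudocounts):
        raise ValueError("both distributions must have the same length")
    mixed = [(1.0 - tau) * f + tau * g for f, g in zip(observed, pseudocounts)]
    total = sum(mixed)
    # Renormalize so that rounding error does not leave the sum off by a hair.
    return [p / total for p in mixed]


# Toy example: a column that only ever shows the first letter.
observed = [1.0, 0.0, 0.0]
context = [0.5, 0.3, 0.2]  # assumed context-specific pseudocount distribution
result = admix(observed, context, 0.9)
print([round(p, 2) for p in result])
```

With tau = 0 the observed frequencies are returned unchanged, and with tau = 1 the column is replaced entirely by the pseudocount distribution.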
Translate to AS219
Converts the amino acid profile to a 219-state abstract alphabet based on posterior probabilities from the context library
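As a rough illustration of the translation step, the sketch below takes a posterior distribution over states for each column and emits the index of the most probable state. Picking the single best state per column is an assumption made for this example, and computing the posteriors from the context library is outside its scope. Three toy states stand in for the 219 real ones.

```python
def translate(posteriors):
    """Map each column's posterior over states to its most probable state index."""
    states = []
    for column in posteriors:
        if not column:
            raise ValueError("each column needs at least one state probability")
        best = max(range(len(column)), key=lambda k: column[k])
        states.append(best)
    return states


# Toy posteriors: three columns, three states each.
toy_posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
print(translate(toy_posteriors))  # [0, 2, 1]
```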
Input Formats
cstranslate supports multiple input formats:
- prf: Profile format with amino acid frequencies
- seq: Single sequence
- fas/a2m/a3m: Multiple sequence alignments
- ca3m: Compressed A3M format (requires FFindex)
When using auto format detection, the file extension determines the input format.
Examples
Translate A3M Alignment
Translate with Custom Pseudocounts
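The original command for this example is not present here. As a hedged sketch, the helper below assembles an invocation that overrides the pseudocount admixture (-x) and the alignment constant (-c), checking the ranges given in the Pseudocount Options table. The helper name pseudocount_command, the file names, and the values 0.3 and 4.0 are illustrative assumptions, not recommended settings.

```python
def pseudocount_command(infile, outfile, admix=0.90, pc_ali=12.0):
    """Build a cstranslate argv list with explicit pseudocount settings."""
    if not 0.0 <= admix <= 1.0:
        raise ValueError("-x must lie in [0, 1]")
    if pc_ali < 0.0:
        raise ValueError("-c must be non-negative")
    return [
        "cstranslate",
        "-i", infile,
        "-o", outfile,
        "-x", str(admix),
        "-c", str(pc_ali),
    ]


# Hypothetical file names and values, for illustration only.
custom_cmd = pseudocount_command("query.a3m", "query.as219", admix=0.3, pc_ali=4.0)
print(" ".join(custom_cmd))
```

The resulting list can be passed to subprocess.run once cstranslate is installed.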
Batch Processing with FFindex
Output Profile Instead of Sequence
MPI Version
Use the MPI version for distributed processing of large database conversions.
Performance Considerations
Optimization Tips
- Use -f (FFindex mode) with OpenMP for parallel processing
- The internal context library is embedded in the binary for fast access
- Pseudocount calculation is the most computationally intensive step
- Consider using the MPI version for very large databases
Technical Details
Abstract State Alphabet (AS219)
The AS219 alphabet consists of 219 representative states derived from local sequence profiles. Each state represents a characteristic amino acid distribution pattern:
- Condenses each profile column into a single state letter while preserving sequence context
- Improves remote homology detection
- Based on context-specific profile libraries
Context-Specific Pseudocounts
Source: src/cs/cstranslate_app.h:64-72
The pseudocount calculation uses either:
- Library-based approach (default): Uses a pre-computed context library
- CRF approach: Uses a Conditional Random Field model
Related Tools
- hhblits - Can use CS219 profiles for searching
- hhmake - Creates HMM profiles from alignments
- reformat.pl - Convert between alignment formats
See Also
- File Formats Guide - Details on A3M, HHM, and other formats
- Building Custom Databases - Using cstranslate in database pipelines
- Performance Optimization - Tips for large-scale processing