Overview
DataRecord is Twitter’s standard machine learning data format used to represent feature data, labels, and predictions during both model training and serving. It provides a flexible, efficient representation of the sparse and dense features commonly used in recommendation systems. DataRecords are used throughout the recommendation pipeline:
- Feature Hydration - Storing features fetched from various sources
- Model Input - Passing features to ML prediction services
- Model Output - Returning predictions and scores
- Training Data - Serializing examples for offline model training
Architecture
The DataRecord format is implemented in the twml (Twitter Machine Learning) library and supports:
- Sparse Features - Efficient representation of high-dimensional sparse data
- Dense Features - Continuous numerical features
- Binary Features - Boolean indicators
- Discrete Features - Categorical values
- String Features - Text and string data
- Sparse Binary Features - Set-valued binary features
- Sparse Continuous Features - Sparse floating-point features
DataRecord Structure
Feature Types
DataRecord supports multiple feature field types optimized for different data patterns:
- binaryFeatures - Set of binary feature IDs that are present (value = 1). Efficient for sparse binary indicators. Example: user has verified badge, user is following author
- continuousFeatures - Map from feature ID to continuous floating-point value. Used for numerical features. Example: user’s follower count, tweet engagement rate, predicted CTR
- discreteFeatures - Map from feature ID to discrete/categorical value. Used for enumerated features. Example: user’s country ID, tweet language ID, time of day bucket
- stringFeatures - Map from feature ID to string value. Used for text and identifiers. Example: tweet text, user bio, hashtags
- sparseBinaryFeatures - Map from feature ID to set of string keys. Used for multi-valued categorical features. Example: set of topics user is interested in, set of hashtags in tweet
- sparseContinuousFeatures - Map from feature ID to map of string keys to continuous values. Used for sparse weighted features. Example: topic affinity scores (topic_id -> score), author engagement rates by type
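The field shapes described above can be sketched with a plain Python dataclass. This is an illustrative stand-in only, not the twml class: the name `DataRecordSketch`, the snake_case field names, and all feature IDs are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DataRecordSketch:
    """Illustrative stand-in for a DataRecord; not the twml implementation."""
    binary_features: Set[int] = field(default_factory=set)
    continuous_features: Dict[int, float] = field(default_factory=dict)
    discrete_features: Dict[int, int] = field(default_factory=dict)
    string_features: Dict[int, str] = field(default_factory=dict)
    sparse_binary_features: Dict[int, Set[str]] = field(default_factory=dict)
    sparse_continuous_features: Dict[int, Dict[str, float]] = field(default_factory=dict)


# Feature IDs below are made up for illustration.
IS_VERIFIED, FOLLOWER_COUNT, COUNTRY_ID, BIO, TOPICS, TOPIC_AFFINITY = range(1, 7)

record = DataRecordSketch()
record.binary_features.add(IS_VERIFIED)  # presence in the set means value = 1
record.continuous_features[FOLLOWER_COUNT] = 1520.0
record.discrete_features[COUNTRY_ID] = 44
record.string_features[BIO] = "ML engineer"
record.sparse_binary_features[TOPICS] = {"sports", "music"}
record.sparse_continuous_features[TOPIC_AFFINITY] = {"sports": 0.8, "music": 0.3}
```

Every field is keyed by feature ID, so a record only pays for the features that are actually present.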
Feature IDs
Features are identified by 64-bit integer IDs. Feature IDs are typically:
- Defined in feature configuration files
- Hashed from feature names for dynamic features
- Organized into ranges by feature group
- Mapped to human-readable names in feature metadata
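As a rough illustration of hashing a feature name to a 64-bit ID and keeping a reverse mapping for readability, the sketch below truncates a SHA-256 digest to 8 bytes. The hash function, the helper `feature_id_from_name`, and the feature names are assumptions; the source does not say which hashing scheme is actually used.

```python
import hashlib


def feature_id_from_name(name: str) -> int:
    """Hash a feature name to a signed 64-bit integer (hypothetical scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits in a signed 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Hypothetical metadata table mapping IDs back to human-readable names.
feature_names = ["user.follower_count", "user.is_verified"]
id_to_name = {feature_id_from_name(name): name for name in feature_names}
```

The hash is deterministic, so training and serving code that share the same function derive the same ID for a given name.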
Usage in Follow Recommendations Service
As described in the FRS README, the ranking pipeline uses DataRecords extensively:
Feature Hydration
- For each candidate account, FRS fetches features from multiple sources:
  - User features (follower count, account age, verification status)
  - Candidate features (popularity metrics, content quality signals)
  - Relationship features (mutual follows, interaction history)
  - Context features (time of day, user location, device type)
- Features are aggregated into a DataRecord for each (user, candidate) pair
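The aggregation step can be pictured as merging the output of several feature sources into one record per (user, candidate) pair. The sketch below uses plain dictionaries with string keys for readability; the source functions, feature names, and values are all hypothetical and do not reflect the actual FRS hydration adapters.

```python
from typing import Callable, Dict, List

FeatureMap = Dict[str, float]
# A source takes (user_id, candidate_id) and returns the features it knows about.
FeatureSource = Callable[[int, int], FeatureMap]


def user_features(user_id: int, candidate_id: int) -> FeatureMap:
    # Hypothetical fixed values; a real source would query a feature store.
    return {"user.follower_count": 120.0, "user.account_age_days": 900.0}


def candidate_features(user_id: int, candidate_id: int) -> FeatureMap:
    return {"candidate.follower_count": 50000.0}


def relationship_features(user_id: int, candidate_id: int) -> FeatureMap:
    return {"pair.mutual_follows": 7.0}


def hydrate(user_id: int, candidate_id: int, sources: List[FeatureSource]) -> FeatureMap:
    """Merge the output of every source into one record for the pair."""
    record: FeatureMap = {}
    for source in sources:
        record.update(source(user_id, candidate_id))
    return record


SOURCES: List[FeatureSource] = [user_features, candidate_features, relationship_features]
# One record per (user, candidate) pair.
records = {(1, cid): hydrate(1, cid, SOURCES) for cid in (10, 11)}
```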
ML Ranking
- DataRecord Construction - Build a DataRecord containing the hydrated features for the (user, candidate) pair
- Prediction - Send the DataRecord to the ML prediction service:
  - Service loads the trained model
  - Model reads features from the DataRecord
  - Model computes prediction scores
  - Scores are written back to an output DataRecord
- Scoring - Extract predictions from the response DataRecord:
  - p(follow|recommendation) - Probability the user will follow the candidate
  - p(positive_engagement|follow) - Probability of engagement given a follow
  - Final score is a weighted combination of these probabilities
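A minimal sketch of the scoring step, assuming the weighted combination is a simple linear one. The weights `W_FOLLOW` and `W_ENGAGEMENT`, the helper names, and the example predictions are hypothetical; the source does not state the actual weights or the exact form of the combination.

```python
from typing import Dict, List, Tuple

# Hypothetical weights; the values used in FRS are not given in the source.
W_FOLLOW = 0.7
W_ENGAGEMENT = 0.3


def final_score(p_follow: float, p_engagement_given_follow: float) -> float:
    """Linear weighted combination of the two predicted probabilities."""
    return W_FOLLOW * p_follow + W_ENGAGEMENT * p_engagement_given_follow


def rank(predictions: Dict[int, Tuple[float, float]]) -> List[int]:
    """Return candidate IDs ordered from highest to lowest final score."""
    return sorted(
        predictions,
        key=lambda cid: final_score(*predictions[cid]),
        reverse=True,
    )


# candidate_id -> (p(follow|recommendation), p(positive_engagement|follow))
predictions = {10: (0.20, 0.50), 11: (0.35, 0.10), 12: (0.05, 0.90)}
ranked = rank(predictions)
```

With these made-up weights, a candidate with a low follow probability can still rank first if engagement after a follow is very likely.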
Python API
The DataRecord Python API is defined in twml/twml/readers/data_record.py.
Scala API
DataRecord is also used in Scala services via the ml.api package.
Feature Engineering
Common patterns for DataRecord features:
- Normalization
- Ratios and Derived Features
- Crossed Features
- Temporal Features
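The four patterns above can be illustrated with small standalone helpers. These are generic sketches, not code from twml: the function names, the log transform, the cross separator, and the six-hour bucket size are all assumptions chosen for the example.

```python
import math
from datetime import datetime, timezone


def log_normalize(count: float) -> float:
    """Normalization: compress heavy-tailed counts with log(1 + x)."""
    return math.log1p(max(count, 0.0))


def safe_ratio(numerator: float, denominator: float) -> float:
    """Ratios and derived features: avoid division by zero."""
    return numerator / denominator if denominator else 0.0


def cross(a: str, b: str) -> str:
    """Crossed features: combine two categorical values into one key."""
    return f"{a}_x_{b}"


def hour_bucket(ts: datetime, bucket_hours: int = 6) -> int:
    """Temporal features: map a timestamp to a coarse time-of-day bucket."""
    return ts.hour // bucket_hours


normalized_followers = log_normalize(1520.0)
follow_ratio = safe_ratio(300.0, 400.0)
country_language = cross("US", "en")
bucket = hour_bucket(datetime(2023, 1, 1, 13, 0, tzinfo=timezone.utc))
```

A normalized count or ratio would naturally be stored as a continuous feature, a cross as a string or sparse binary key, and a time bucket as a discrete feature.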
Serialization
DataRecords are serialized for:
- RPC Communication - Sending features to prediction services
- Training Data - Writing examples to training datasets
- Caching - Storing computed features
Serialization formats:
- Protocol Buffers - Efficient binary format
- Thrift - Compatible with Thrift services
- Avro - For Hadoop/big data pipelines
Performance Considerations
Sparse vs Dense
Choose appropriate feature types based on sparsity:
- Use the binaryFeatures set for sparse binary indicators (saves space vs. a map)
- Use sparseContinuousFeatures when most values are zero
- Use regular continuousFeatures for dense features that are usually present
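To make the space argument concrete, the sketch below stores only the binary indicators that are present and drops zero entries from a weighted map. The feature IDs, topic names, and the helper `drop_zeros` are hypothetical.

```python
from typing import Dict, Set

# Binary indicators as a map carry an entry for every feature, present or not.
flags: Dict[int, bool] = {101: True, 102: False, 103: False, 104: True}
# As a set, only the IDs that are present are stored.
binary_features: Set[int] = {fid for fid, present in flags.items() if present}


def drop_zeros(values: Dict[str, float]) -> Dict[str, float]:
    """Keep only non-zero entries, as a sparse representation would."""
    return {key: value for key, value in values.items() if value != 0.0}


topic_scores = {"sports": 0.8, "music": 0.0, "news": 0.0, "tech": 0.4}
sparse_topic_scores = drop_zeros(topic_scores)
```

Readers of the sparse form treat a missing key as zero (or absent), so the omitted entries lose no information.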
Feature Limits
Recommendation services may have limits on:
- Total number of features per DataRecord
- Maximum feature vector size
- Serialized DataRecord size
Batching
Prediction services often batch DataRecords:
- Reduces RPC overhead
- Enables GPU batch processing
- Amortizes model loading costs
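Batching can be sketched as splitting a list of records into fixed-size chunks so that each chunk goes out in one request. The helper `batched` and the batch size are assumptions for illustration; the source does not describe how the prediction service forms its batches.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(records: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


# Stand-ins for DataRecords; one RPC per batch instead of one per record.
candidate_records = ["rec0", "rec1", "rec2", "rec3", "rec4"]
batches = list(batched(candidate_records, batch_size=2))
```

Five records with a batch size of two produce three requests instead of five; the last batch is simply smaller.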
Training vs Serving
Training Data
DataRecords for training include:
- All features used during serving
- Additional features for analysis
- Labels (e.g., did user follow? did user engage?)
- Weights for importance sampling
Serving Data
DataRecords for serving:
- Only features available at inference time
- No labels (model produces predictions)
- Optimized for low latency
- May use cached feature values
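The difference between the two record shapes can be sketched by deriving a serving record from a training record. The key naming convention (`label.` prefix, `meta.weight`), the feature names, and the helpers are hypothetical and only illustrate which parts are kept or dropped.

```python
from typing import Dict, Set, Tuple

LABEL_PREFIX = "label."     # hypothetical naming convention for labels
WEIGHT_KEY = "meta.weight"  # hypothetical key for the example weight

training_record: Dict[str, float] = {
    "user.follower_count": 120.0,
    "pair.mutual_follows": 7.0,
    "analysis.experiment_bucket": 3.0,  # analysis-only feature
    "label.followed": 1.0,
    "label.engaged": 0.0,
    "meta.weight": 2.5,
}

# Features assumed to be available at inference time.
SERVING_FEATURES: Set[str] = {"user.follower_count", "pair.mutual_follows"}


def to_serving_record(record: Dict[str, float], serving_features: Set[str]) -> Dict[str, float]:
    """Keep only the features the model can see at inference time."""
    return {key: value for key, value in record.items() if key in serving_features}


def split_labels(record: Dict[str, float]) -> Tuple[Dict[str, float], float]:
    """Return the labels and the example weight (default 1.0) for training."""
    labels = {key: value for key, value in record.items() if key.startswith(LABEL_PREFIX)}
    return labels, record.get(WEIGHT_KEY, 1.0)


serving_record = to_serving_record(training_record, SERVING_FEATURES)
labels, weight = split_labels(training_record)
```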
Related Documentation
- Follow Recommendations Service API - Uses DataRecords for ranking
- Thrift Definitions - Service API types
- CR Mixer API - Candidate generation
Source Code References
DataRecord implementation files:
- twml/twml/readers/data_record.py - Python DataRecord reader
- follow-recommendations-service/common/src/main/scala/com/twitter/follow_recommendations/common/feature_hydration/ - Feature hydration adapters
- FRS README - follow-recommendations-service/README.md:26 - DataRecord usage description