Overview
DataRecord is Twitter’s standard machine learning data format, used to represent feature data, labels, and predictions during both model training and serving. It provides a flexible, efficient representation of the sparse and dense features commonly used in recommendation systems. DataRecords are used throughout the recommendation pipeline:
- Feature Hydration - Storing features fetched from various sources
- Model Input - Passing features to ML prediction services
- Model Output - Returning predictions and scores
- Training Data - Serializing examples for offline model training
Architecture
The DataRecord format is implemented in the twml (Twitter Machine Learning) library and supports:
- Sparse Features - Efficient representation of high-dimensional sparse data
- Dense Features - Continuous numerical features
- Binary Features - Boolean indicators
- Discrete Features - Categorical values
- String Features - Text and string data
- Sparse Binary Features - Set-valued binary features
- Sparse Continuous Features - Sparse floating-point features
DataRecord Structure
Feature Types
DataRecord supports multiple feature field types optimized for different data patterns:

Binary Features - Set of binary feature IDs that are present (value = 1). Efficient for sparse binary indicators. Example: user has a verified badge, user is following the author.

Continuous Features - Map from feature ID to continuous floating-point value. Used for numerical features. Example: user’s follower count, tweet engagement rate, predicted CTR.

Discrete Features - Map from feature ID to discrete/categorical value. Used for enumerated features. Example: user’s country ID, tweet language ID, time-of-day bucket.

String Features - Map from feature ID to string value. Used for text and identifiers. Example: tweet text, user bio, hashtags.

Sparse Binary Features - Map from feature ID to set of string keys. Used for multi-valued categorical features. Example: set of topics the user is interested in, set of hashtags in a tweet.

Sparse Continuous Features - Map from feature ID to map of string keys to continuous values. Used for sparse weighted features. Example: topic affinity scores (topic_id -> score), author engagement rates by type.
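As a concrete illustration of the six field types, here is how one record's contents might look using plain Python containers. All feature IDs and values below are hypothetical; the real format is a serialized Thrift structure, not Python dicts.

```python
# Illustrative only: hypothetical feature IDs and values, not real twml data.
record = {
    # Set of binary feature IDs that are present (value = 1)
    "binaryFeatures": {1001, 1002},                # e.g. verified badge, follows author
    # Map from feature ID to continuous floating-point value
    "continuousFeatures": {2001: 15432.0,          # follower count
                           2002: 0.034},           # engagement rate
    # Map from feature ID to discrete/categorical value
    "discreteFeatures": {3001: 840,                # country ID
                         3002: 2},                 # time-of-day bucket
    # Map from feature ID to string value
    "stringFeatures": {4001: "en"},                # language
    # Map from feature ID to set of string keys
    "sparseBinaryFeatures": {5001: {"sports", "music"}},          # interest topics
    # Map from feature ID to map of string keys to continuous values
    "sparseContinuousFeatures": {6001: {"42": 0.8, "77": 0.3}},   # topic affinities
}

# A feature is "present" if its ID appears in the relevant container:
assert 1001 in record["binaryFeatures"]
assert record["continuousFeatures"][2001] == 15432.0
```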
Feature IDs
Features are identified by 64-bit integer IDs. Feature IDs are typically:
- Defined in feature configuration files
- Hashed from feature names for dynamic features
- Organized into ranges by feature group
- Mapped to human-readable names in feature metadata
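Hashing a feature name down to a stable 64-bit integer can be sketched as below. The choice of SHA-256 truncated to 64 bits is illustrative; the document does not specify which hash function Twitter actually uses.

```python
import hashlib

def feature_id(name: str) -> int:
    """Hash a feature name to a stable 64-bit integer ID (illustrative)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 8 bytes = 64 bits

# IDs are deterministic, so training and serving agree on the same mapping.
assert feature_id("user.follower_count") == feature_id("user.follower_count")
assert 0 <= feature_id("user.follower_count") < 2**64
```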
Usage in Follow Recommendations Service
As described in the FRS README, the ranking pipeline uses DataRecords extensively:

Feature Hydration

1. For each candidate account, FRS fetches features from multiple sources:
   - User features (follower count, account age, verification status)
   - Candidate features (popularity metrics, content quality signals)
   - Relationship features (mutual follows, interaction history)
   - Context features (time of day, user location, device type)
2. Features are aggregated into a DataRecord for each (user, candidate) pair
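The hydration step above can be sketched as merging per-source feature maps into one record per (user, candidate) pair. The source names and feature IDs below are hypothetical.

```python
def hydrate(user_features, candidate_features,
            relationship_features, context_features):
    """Aggregate feature maps from all sources into one record (illustrative)."""
    record = {}
    for source in (user_features, candidate_features,
                   relationship_features, context_features):
        record.update(source)  # assumes feature ID ranges do not collide
    return record

record = hydrate(
    {101: 5000.0},   # user: follower count
    {201: 0.92},     # candidate: content quality score
    {301: 12.0},     # relationship: past interaction count
    {401: 14.0},     # context: hour of day
)
assert record == {101: 5000.0, 201: 0.92, 301: 12.0, 401: 14.0}
```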
ML Ranking

1. DataRecord Construction - Build a DataRecord containing the hydrated features
2. Prediction - Send the DataRecord to the ML prediction service:
   - Service loads the trained model
   - Model reads features from the DataRecord
   - Model computes prediction scores
   - Scores are written back to the output DataRecord
3. Scoring - Extract predictions from the response DataRecord:
   - p(follow|recommendation) - Probability the user will follow the candidate
   - p(positive_engagement|follow) - Probability of engagement given a follow
   - Final score is a weighted combination of these probabilities
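The final scoring step can be sketched as a weighted combination of the two predicted probabilities. The weights below are hypothetical; the document does not state FRS's actual weighting.

```python
def final_score(p_follow: float, p_engagement_given_follow: float,
                w_follow: float = 0.5, w_engagement: float = 0.5) -> float:
    """Weighted combination of p(follow|rec) and p(positive_engagement|follow)."""
    return w_follow * p_follow + w_engagement * p_engagement_given_follow

# With equal weights, the final score is the mean of the two probabilities.
assert final_score(0.2, 0.8) == 0.5
```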
Python API
The DataRecord Python API is defined in `twml/twml/readers/data_record.py`.
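The twml source is not reproduced here. As a hypothetical sketch of the kind of read access such a reader provides (class and method names are illustrative, not the actual twml API):

```python
class DataRecordView:
    """Hypothetical read-only view over a decoded DataRecord (illustrative)."""

    def __init__(self, binary=None, continuous=None, discrete=None):
        self.binary = set(binary or ())
        self.continuous = dict(continuous or {})
        self.discrete = dict(discrete or {})

    def has_binary(self, feature_id: int) -> bool:
        """True if the binary indicator with this ID is present."""
        return feature_id in self.binary

    def get_continuous(self, feature_id: int, default: float = 0.0) -> float:
        """Return a continuous feature value, defaulting when absent."""
        return self.continuous.get(feature_id, default)

view = DataRecordView(binary={1001}, continuous={2001: 0.25})
assert view.has_binary(1001)
assert view.get_continuous(2001) == 0.25
assert view.get_continuous(9999) == 0.0  # missing features default to 0.0
```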
Scala API
DataRecord is also used in Scala services via the `ml.api` package.
Feature Engineering
Common patterns for DataRecord features:
- Normalization
- Ratios and Derived Features
- Crossed Features
- Temporal Features
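The four patterns above can be sketched as follows; all feature names and bucket choices are hypothetical.

```python
import math

def engineer(raw: dict) -> dict:
    """Derive engineered features from raw values (illustrative patterns)."""
    out = {}
    # Normalization: log-transform heavy-tailed counts
    out["log_follower_count"] = math.log1p(raw["follower_count"])
    # Ratios and derived features: guard against division by zero
    out["follow_back_rate"] = raw["follows"] / max(raw["followers_gained"], 1)
    # Crossed features: combine two categorical values into one key
    out["country_x_language"] = f'{raw["country"]}_{raw["language"]}'
    # Temporal features: bucket the hour of day into four 6-hour ranges
    out["hour_bucket"] = raw["hour"] // 6
    return out

features = engineer({"follower_count": 999, "follows": 50,
                     "followers_gained": 100, "country": "US",
                     "language": "en", "hour": 14})
assert features["country_x_language"] == "US_en"
assert features["hour_bucket"] == 2
assert features["follow_back_rate"] == 0.5
```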
Serialization
DataRecords are serialized for:
- RPC Communication - Sending features to prediction services
- Training Data - Writing examples to training datasets
- Caching - Storing computed features

Supported serialization formats:
- Protocol Buffers - Efficient binary format
- Thrift - Compatible with Thrift services
- Avro - For Hadoop/big data pipelines
Performance Considerations
Sparse vs Dense
Choose appropriate feature types based on sparsity:
- Use the `binaryFeatures` set for sparse binary indicators (saves space vs. a map)
- Use `sparseContinuousFeatures` when most values are zero
- Use regular `continuousFeatures` for dense features that are usually present
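The space trade-off can be illustrated by comparing a set-based sparse representation against a dense map holding the same binary indicators (entry counts stand in for serialized size here):

```python
NUM_BINARY_FEATURES = 10_000
active = {17, 1024, 9999}  # only three indicators are on

# Sparse: store only the IDs that are present (value is implicitly 1)
sparse_repr = set(active)

# Dense: store an explicit value for every possible feature ID
dense_repr = {fid: (1.0 if fid in active else 0.0)
              for fid in range(NUM_BINARY_FEATURES)}

assert len(sparse_repr) == 3
assert len(dense_repr) == NUM_BINARY_FEATURES  # 10,000 entries for 3 signals
```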
Feature Limits
Recommendation services may have limits on:
- Total number of features per DataRecord
- Maximum feature vector size
- Serialized DataRecord size
Batching
Prediction services often batch DataRecords:
- Reduces RPC overhead
- Enables GPU batch processing
- Amortizes model loading costs
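Batching can be sketched as chunking records before issuing RPCs, so ten records cost three calls instead of ten (batch size chosen arbitrarily here):

```python
def batched(records, batch_size):
    """Yield successive fixed-size batches of DataRecords."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

records = [{"id": n} for n in range(10)]
batches = list(batched(records, batch_size=4))

# 10 records -> 3 RPC calls instead of 10
assert [len(b) for b in batches] == [4, 4, 2]
```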
Training vs Serving
Training Data
DataRecords for training include:
- All features used during serving
- Additional features for analysis
- Labels (e.g., did user follow? did user engage?)
- Weights for importance sampling
Serving Data
DataRecords for serving:
- Only features available at inference time
- No labels (model produces predictions)
- Optimized for low latency
- May use cached feature values
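The training/serving distinction can be sketched as deriving a serving record from a training record by dropping labels and weights. The field names below are hypothetical.

```python
LABEL_KEYS = {"label.followed", "label.engaged", "weight"}  # hypothetical names

def to_serving_record(training_record: dict) -> dict:
    """Keep only features available at inference time; drop labels and weights."""
    return {k: v for k, v in training_record.items() if k not in LABEL_KEYS}

training = {"follower_count": 5000.0, "label.followed": 1.0, "weight": 0.8}
serving = to_serving_record(training)
assert serving == {"follower_count": 5000.0}
```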
Related Documentation
- Follow Recommendations Service API - Uses DataRecords for ranking
- Thrift Definitions - Service API types
- CR Mixer API - Candidate generation
Source Code References
DataRecord implementation files:
- `twml/twml/readers/data_record.py` - Python DataRecord reader
- `follow-recommendations-service/common/src/main/scala/com/twitter/follow_recommendations/common/feature_hydration/` - Feature hydration adapters
- FRS README - `follow-recommendations-service/README.md:26` - DataRecord usage description