Overview

DataRecord is Twitter’s standard machine learning data format used to represent feature data, labels, and predictions during both model training and serving. It provides a flexible, efficient representation of sparse and dense features commonly used in recommendation systems. DataRecords are used throughout the recommendation pipeline:
  • Feature Hydration - Storing features fetched from various sources
  • Model Input - Passing features to ML prediction services
  • Model Output - Returning predictions and scores
  • Training Data - Serializing examples for offline model training

Architecture

The DataRecord format is implemented in the twml (Twitter Machine Learning) library and supports:
  • Sparse Features - Efficient representation of high-dimensional sparse data
  • Dense Features - Continuous numerical features
  • Binary Features - Boolean indicators
  • Discrete Features - Categorical values
  • String Features - Text and string data
  • Sparse Binary Features - Set-valued binary features
  • Sparse Continuous Features - Sparse floating-point features

DataRecord Structure

Feature Types

DataRecord supports multiple feature field types optimized for different data patterns:
  • binaryFeatures (set<int64>) - Set of binary feature IDs that are present (value = 1). Efficient for sparse binary indicators. Example: user has verified badge, user is following author.
  • continuousFeatures (map<int64, double>) - Map from feature ID to continuous floating-point value. Used for numerical features. Example: user’s follower count, tweet engagement rate, predicted CTR.
  • discreteFeatures (map<int64, int64>) - Map from feature ID to discrete/categorical value. Used for enumerated features. Example: user’s country ID, tweet language ID, time-of-day bucket.
  • stringFeatures (map<int64, string>) - Map from feature ID to string value. Used for text and identifiers. Example: tweet text, user bio, hashtags.
  • sparseBinaryFeatures (map<int64, set<string>>) - Map from feature ID to set of string keys. Used for multi-valued categorical features. Example: set of topics a user is interested in, set of hashtags in a tweet.
  • sparseContinuousFeatures (map<int64, map<string, double>>) - Map from feature ID to map of string keys to continuous values. Used for sparse weighted features. Example: topic affinity scores (topic_id -> score), author engagement rates by type.

Feature IDs

Features are identified by 64-bit integer IDs. Feature IDs are typically:
  • Defined in feature configuration files
  • Hashed from feature names for dynamic features
  • Organized into ranges by feature group
  • Mapped to human-readable names in feature metadata
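For hashed dynamic features, the idea is to derive a stable 64-bit ID from the feature name. The exact hash function twml uses is not specified here; this sketch uses SHA-256 purely for illustration:

```python
import hashlib

def feature_id(name: str) -> int:
    """Derive a stable 64-bit feature ID from a feature name.

    Illustrative only -- the hash actually used by twml is not
    documented in this page, so SHA-256 stands in for it here.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # first 8 bytes -> unsigned 64-bit range

uid = feature_id("user.follower_count")
```

The important properties are the ones any scheme must have: the same name always maps to the same ID, and the ID fits in an int64 field.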

Usage in Follow Recommendations Service

As described in the FRS README, the ranking pipeline uses DataRecords extensively:

Feature Hydration

  1. For each candidate account, FRS fetches features from multiple sources:
    • User features (follower count, account age, verification status)
    • Candidate features (popularity metrics, content quality signals)
    • Relationship features (mutual follows, interaction history)
    • Context features (time of day, user location, device type)
  2. Features are aggregated into a DataRecord for each (user, candidate) pair
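The two steps above can be sketched as follows. The fetcher names and feature IDs are hypothetical; in production the per-source fetchers are the feature hydration adapters in the FRS Scala code:

```python
# Hypothetical per-source fetchers; each returns {feature_id: value}.
def fetch_user_features(user_id):            return {1001: 5200.0, 1002: 730.0}  # follower count, account age (days)
def fetch_candidate_features(candidate_id):  return {2001: 98000.0}              # popularity metric
def fetch_relationship_features(u, c):       return {3001: 5.0}                  # mutual follows

def hydrate(user_id, candidate_id):
    """Aggregate features from every source into one continuous-feature
    map for the (user, candidate) pair."""
    features = {}
    for source in (fetch_user_features(user_id),
                   fetch_candidate_features(candidate_id),
                   fetch_relationship_features(user_id, candidate_id)):
        features.update(source)
    return features

record = hydrate(12, 34)
```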

ML Ranking

  1. DataRecord Construction - Build a DataRecord containing:
    # Example structure (pseudocode)
    data_record = DataRecord()
    data_record.continuous_features[USER_FOLLOWER_COUNT] = user.follower_count
    data_record.continuous_features[CANDIDATE_FOLLOWER_COUNT] = candidate.follower_count
    data_record.binary_features.add(USER_IS_VERIFIED)
    data_record.discrete_features[USER_COUNTRY] = user.country_id
    data_record.sparse_continuous_features[MUTUAL_FOLLOWS] = {
      "mutual_follow_count": 5.0,
      "mutual_follow_ratio": 0.023
    }
    
  2. Prediction - Send DataRecord to ML prediction service:
    • Service loads the trained model
    • Model reads features from DataRecord
    • Model computes prediction scores
    • Scores are written back to output DataRecord
  3. Scoring - Extract predictions from response DataRecord:
    • p(follow|recommendation) - Probability user will follow the candidate
    • p(positive_engagement|follow) - Probability of engagement given follow
    • Final score is weighted combination of these probabilities
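A weighted combination of the two probabilities might look like the sketch below. The weights are hypothetical placeholders; the production values are not part of this document:

```python
def final_score(p_follow: float, p_engage_given_follow: float,
                w_follow: float = 1.0, w_engage: float = 10.0) -> float:
    """Combine p(follow) and p(positive_engagement|follow) into one score.

    p_follow * p_engage_given_follow is p(follow AND engagement) by the
    chain rule. The weights here are illustrative, not production values.
    """
    return w_follow * p_follow + w_engage * (p_follow * p_engage_given_follow)

score = final_score(p_follow=0.05, p_engage_given_follow=0.3)
```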

Python API

The DataRecord Python API is defined in twml/twml/readers/data_record.py:
from twml.readers import DataRecord

# Create a new DataRecord
record = DataRecord()

# Add features
record.add_continuous_feature(feature_id=1001, value=3.14)
record.add_binary_feature(feature_id=2001)
record.add_discrete_feature(feature_id=3001, value=42)
record.add_string_feature(feature_id=4001, value="example")

# Serialize to bytes
bytes_data = record.serialize()

# Deserialize from bytes
record2 = DataRecord.deserialize(bytes_data)

# Access features
value = record2.get_continuous_feature(1001)  # Returns 3.14
has_feature = record2.has_binary_feature(2001)  # Returns True

Scala API

DataRecord is also used in Scala services via the ml.api package:
import com.twitter.ml.api.DataRecord
import com.twitter.ml.api.util.SRichDataRecord

// Create DataRecord
val dr = new DataRecord()

// Add features using SRichDataRecord wrapper
val richDr = SRichDataRecord(dr)
richDr.setFeatureValue(USER_FOLLOWER_COUNT, 1000.0)
richDr.setFeatureValue(USER_IS_VERIFIED, true)

// Read features
val followerCount = richDr.getFeatureValue(USER_FOLLOWER_COUNT)

Feature Engineering

Common patterns for DataRecord features:

Normalization

# Log transformation for skewed distributions
import math
raw_value = user.follower_count
normalized = math.log1p(raw_value)  # log(1 + x)
record.add_continuous_feature(LOG_FOLLOWER_COUNT, normalized)

Ratios and Derived Features

# Engagement rate
if candidate.tweet_count > 0:
  engagement_rate = candidate.total_likes / candidate.tweet_count
  record.add_continuous_feature(ENGAGEMENT_RATE, engagement_rate)

Crossed Features

# User-candidate interaction features
key = f"user_{user.id}_candidate_{candidate.id}"
record.add_sparse_continuous_feature(
  INTERACTION_HISTORY,
  {key: recent_interaction_score}
)

Temporal Features

# Time-based features
record.add_discrete_feature(HOUR_OF_DAY, current_hour)
record.add_discrete_feature(DAY_OF_WEEK, current_day)
record.add_binary_feature(IS_WEEKEND)

Serialization

DataRecords are serialized for:
  • RPC Communication - Sending features to prediction services
  • Training Data - Writing examples to training datasets
  • Caching - Storing computed features
Serialization formats:
  • Protocol Buffers - Efficient binary format
  • Thrift - Compatible with Thrift services
  • Avro - For Hadoop/big data pipelines
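Whichever wire format is used, the contract is the same: deserializing a serialized record must reproduce the original. This sketch demonstrates that round-trip contract using JSON as a dependency-free stand-in for Thrift or protobuf (note that JSON object keys must be strings, so the int64 feature IDs are stringified on the way out):

```python
import json

def serialize(record: dict) -> bytes:
    """Stand-in for Thrift/protobuf serialization, using JSON so the
    sketch needs no external dependencies."""
    return json.dumps({str(k): v for k, v in record.items()}, sort_keys=True).encode()

def deserialize(data: bytes) -> dict:
    """Inverse of serialize: restore int64 feature IDs from string keys."""
    return {int(k): v for k, v in json.loads(data).items()}

original = {1001: 3.14, 2001: 1.0}
restored = deserialize(serialize(original))
```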

Performance Considerations

Sparse vs Dense

Choose appropriate feature types based on sparsity:
  • Use binaryFeatures set for sparse binary indicators (saves space vs. map)
  • Use sparseContinuousFeatures when most values are zero
  • Use regular continuousFeatures for dense features that are usually present
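The space savings from the set representation can be estimated with a back-of-envelope calculation. Ignoring Thrift framing overhead, a set<int64> costs 8 bytes per present feature, while encoding the same indicators as a map<int64, double> (ID -> 1.0) costs 16:

```python
import struct

def set_encoding_bytes(feature_ids):
    """binaryFeatures as set<int64>: 8 bytes per present feature."""
    return len(struct.pack(f"{len(feature_ids)}q", *feature_ids))

def map_encoding_bytes(feature_ids):
    """Same indicators as map<int64, double> (id -> 1.0): 16 bytes per feature."""
    return len(b"".join(struct.pack("qd", fid, 1.0) for fid in feature_ids))

ids = [2001, 2002, 2003]
# The set form is half the size of the map form for pure indicators.
```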

Feature Limits

Recommendation services may have limits on:
  • Total number of features per DataRecord
  • Maximum feature vector size
  • Serialized DataRecord size
For scale, the Home Mixer ranking pipeline handles roughly 6,000 features per candidate.

Batching

Prediction services often batch DataRecords:
  • Reduces RPC overhead
  • Enables GPU batch processing
  • Amortizes model loading costs
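A minimal batching helper looks like the following (the records are shown as plain dicts for brevity; in practice they would be serialized DataRecords):

```python
from typing import Iterable, Iterator, List

def batch_records(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group records into fixed-size batches, one batch per prediction RPC."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

batches = list(batch_records([{"id": i} for i in range(10)], batch_size=4))
```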

Training vs Serving

Training Data

DataRecords for training include:
  • All features used during serving
  • Additional features for analysis
  • Labels (e.g., did user follow? did user engage?)
  • Weights for importance sampling

Serving Data

DataRecords for serving:
  • Only features available at inference time
  • No labels (model produces predictions)
  • Optimized for low latency
  • May use cached feature values
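Converting a training record to a serving record is then just a matter of dropping the label and weight features. The feature IDs below are hypothetical; real IDs come from feature configuration:

```python
# Hypothetical label/weight feature IDs -- real IDs live in feature configuration.
LABEL_IDS = {9001, 9002}   # e.g. did-follow, did-engage labels
WEIGHT_ID = 9100           # importance-sampling weight

def to_serving_record(training_record: dict) -> dict:
    """Drop labels and weights so only serving-time features remain."""
    return {fid: v for fid, v in training_record.items()
            if fid not in LABEL_IDS and fid != WEIGHT_ID}

serving = to_serving_record({1001: 3.0, 9001: 1.0, 9100: 0.5})
```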

Source Code References

DataRecord implementation files:
  • twml/twml/readers/data_record.py - Python DataRecord reader
  • follow-recommendations-service/common/src/main/scala/com/twitter/follow_recommendations/common/feature_hydration/ - Feature hydration adapters
  • FRS README - follow-recommendations-service/README.md:26 - DataRecord usage description
