Overview

X’s Trust and Safety models detect problematic content to maintain platform quality and user safety. These models filter content that violates policies or degrades user experience, ranging from NSFW media to abusive behavior.
Several additional models and rules remain proprietary due to the adversarial nature of trust and safety work. The team continues evaluating what can be safely open-sourced.

Open Source Models

X has open-sourced the training code for four key models:

pNSFWMedia

NSFW Media Detection

Detects tweets containing NSFW (Not Safe For Work) images, including adult and pornographic content.
What it detects:
  • Adult content
  • Pornographic images
  • Sexually explicit media
Use cases:
  • Content filtering for sensitive media settings
  • Protecting users who haven’t opted into adult content
  • Compliance with app store policies
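The use cases above all reduce to threshold-based filtering keyed to a user's sensitivity settings. The sketch below is illustrative only: the `should_filter_media` function, the threshold value, and the user-setting field are assumptions for demonstration, not X's actual API or production thresholds.

```python
# Hypothetical sketch: applying a pNSFWMedia score to per-user media filtering.
# The threshold (0.9) and setting name are illustrative assumptions.

def should_filter_media(pnsfw_score: float,
                        user_opted_into_sensitive: bool,
                        threshold: float = 0.9) -> bool:
    """Filter NSFW media for users who have not opted in to sensitive content."""
    if user_opted_into_sensitive:
        return False  # user explicitly allows adult content
    return pnsfw_score >= threshold

# An image scored 0.97 is filtered for a default-settings user,
# but shown to a user who opted in to sensitive media.
assert should_filter_media(0.97, user_opted_into_sensitive=False) is True
assert should_filter_media(0.97, user_opted_into_sensitive=True) is False
```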

pNSFWText

NSFW Text Detection

Identifies tweets with NSFW text content covering adult and sexual topics.
What it detects:
  • Adult/sexual language
  • Explicit text content
  • Sexually suggestive descriptions
Use cases:
  • Text-based content filtering
  • Search result filtering
  • Recommendation filtering for sensitive users

pToxicity

Toxic Content Detection

Detects toxic content, including insults and certain types of harassment. In this taxonomy, toxicity refers to marginal content that degrades the user experience but does not violate X’s Terms of Service.
What it detects:
  • Insults and derogatory language
  • Certain types of harassment
  • Low-quality or divisive content
  • Marginal content (does NOT violate TOS)
Toxic content may be downranked or filtered but does not result in enforcement action since it doesn’t violate Terms of Service.
Use cases:
  • Downranking in recommendations
  • Reducing exposure to low-quality content
  • Improving timeline quality
  • User experience optimization
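Because toxic content is not a policy violation, the response is reduced distribution rather than removal. A minimal sketch of score-based downranking follows; the threshold and penalty multiplier are illustrative assumptions, not X's production values.

```python
# Hypothetical sketch of downranking: toxic-but-not-violating content stays
# on the platform but receives a reduced ranking score in recommendations.

def downrank(ranking_score: float,
             toxicity_score: float,
             toxicity_threshold: float = 0.8,
             penalty: float = 0.5) -> float:
    """Scale a candidate's ranking score down when toxicity exceeds a threshold."""
    if toxicity_score >= toxicity_threshold:
        return ranking_score * penalty  # reduced distribution, not removal
    return ranking_score

# A toxic post keeps circulating, but at half its original ranking score.
assert downrank(1.0, toxicity_score=0.9) == 0.5
assert downrank(1.0, toxicity_score=0.3) == 1.0
```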

pAbuse

Abusive Content Detection

Detects abusive content that violates X’s Terms of Service, including hate speech, targeted harassment, and abusive behavior.
What it detects:
  • Hate speech
  • Targeted harassment
  • Abusive behavior
  • Terms of Service violations
Unlike pToxicity, content flagged by pAbuse represents actual policy violations that may result in enforcement actions.
Use cases:
  • Content moderation queue prioritization
  • Automated enforcement actions
  • User reporting systems
  • Safety event detection
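One use case above is moderation queue prioritization: reports most likely to be actual violations should reach human reviewers first. The sketch below is a generic priority-queue pattern assumed for illustration, not X's review tooling.

```python
# Hypothetical sketch: ordering a human-review queue by pAbuse score,
# so the likeliest Terms of Service violations are reviewed first.
import heapq

def prioritize_reports(reports: list[tuple[str, float]]) -> list[str]:
    """reports: (tweet_id, p_abuse_score) pairs. Returns ids, highest score first."""
    heap = [(-score, tweet_id) for tweet_id, score in reports]  # max-heap via negation
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# The report scored 0.95 jumps to the front of the queue.
assert prioritize_reports([("a", 0.2), ("b", 0.95), ("c", 0.6)]) == ["b", "c", "a"]
```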

Model Architecture

Location: trust_and_safety_models/

Training Code

The repository includes:
  • Model training scripts
  • Feature extraction pipelines
  • Evaluation frameworks
  • Dataset preparation code
While the training code is open-sourced, production model weights and some preprocessing steps remain proprietary to prevent adversarial gaming.

How They’re Used

Content Filtering Pipeline

1. Prediction: Models score tweets and media at creation time or during processing.
2. Thresholding: Scores above certain thresholds trigger filtering or moderation actions.
3. Action: Content may be filtered, downranked, sent to moderation, or removed based on severity.
4. User Controls: Users can adjust sensitivity settings to control what content is filtered.
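The four steps above can be sketched as a single routing function. Everything here is an illustrative assumption: the score names echo the open-sourced models, but the thresholds, action names, and ordering by severity are not X's actual (proprietary) logic.

```python
# A minimal end-to-end sketch of the pipeline: predict -> threshold -> act,
# with user controls applied to the sensitive-media branch.

def route_content(scores: dict[str, float], user_sensitive_ok: bool = False) -> str:
    """Map per-model scores to a single action, checked in descending severity."""
    if scores.get("pAbuse", 0.0) >= 0.9:
        return "remove"        # ToS violation -> enforcement action
    if scores.get("pToxicity", 0.0) >= 0.8:
        return "downrank"      # marginal content -> reduced distribution
    nsfw = max(scores.get("pNSFWMedia", 0.0), scores.get("pNSFWText", 0.0))
    if nsfw >= 0.9 and not user_sensitive_ok:
        return "filter"        # user controls: sensitive-media setting
    return "allow"

assert route_content({"pAbuse": 0.95}) == "remove"
assert route_content({"pToxicity": 0.85}) == "downrank"
assert route_content({"pNSFWMedia": 0.95}) == "filter"
assert route_content({"pNSFWMedia": 0.95}, user_sensitive_ok=True) == "allow"
```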

Integration Points

Trust and Safety scores feed into Visibility Filters which:
  • Hard-filter violating content
  • Apply visible product treatments (interstitials)
  • Downrank marginal content
  • Support legal compliance
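The treatments listed above differ in kind, not just degree: the same scoring pipeline can drop content entirely, show it behind an interstitial, or merely reduce its reach. A hypothetical mapping, with all category and treatment names assumed for illustration:

```python
# Hypothetical sketch of visibility-filter treatments by content category.
from enum import Enum

class Treatment(Enum):
    DROP = "drop"                  # hard-filter violating content
    INTERSTITIAL = "interstitial"  # visible product treatment over sensitive media
    DOWNRANK = "downrank"          # reduced distribution for marginal content
    NONE = "none"

def visibility_treatment(category: str) -> Treatment:
    """Map a content category to its visibility treatment (illustrative names)."""
    return {
        "violation": Treatment.DROP,
        "sensitive_media": Treatment.INTERSTITIAL,
        "marginal": Treatment.DOWNRANK,
    }.get(category, Treatment.NONE)

assert visibility_treatment("violation") is Treatment.DROP
assert visibility_treatment("unrecognized") is Treatment.NONE
```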

Safety vs. Quality

Safety (violations)
  • Severity: High - violates Terms of Service
  • Examples: Hate speech, targeted harassment, threats
  • Action: Enforcement actions including removal, account suspension

Quality (toxicity)
  • Severity: Medium - degrades experience but doesn’t violate TOS
  • Examples: Insults, divisive content, low-quality posts
  • Action: Downranking, reduced distribution, user filtering

Sensitive media (NSFW)
  • Severity: Varies - not inherently violating, but requires user controls
  • Examples: Adult content, explicit imagery/text
  • Action: Sensitive media treatments, user setting-based filtering

Adversarial Considerations

X does not open-source all trust and safety systems because:
  • Bad actors could reverse-engineer detection methods
  • Adversaries could train content to evade detection
  • Some techniques rely on keeping detection patterns confidential
What remains proprietary:
  • Many additional safety models and rules
  • Production model weights
  • Detection thresholds and logic
  • Certain preprocessing and feature extraction methods
  • Ensemble architectures and combinations

Performance Characteristics

  • Latency: Real-time scoring during tweet creation and processing
  • Coverage: All public tweets scored by relevant models
  • Accuracy: Continuously evaluated and improved based on human review
  • Update Frequency: Models retrained regularly with new data and adversarial examples