Overview

X’s Trust and Safety models detect problematic content to maintain platform quality and user safety. These models filter content that violates policies or degrades user experience, ranging from NSFW media to abusive behavior.
Several additional models and rules remain proprietary due to the adversarial nature of trust and safety work. The team continues evaluating what can be safely open-sourced.

Open Source Models

X has open-sourced the training code for four key models:

pNSFWMedia

NSFW Media Detection

Detects tweets containing NSFW (Not Safe For Work) images, including adult and pornographic content.
What it detects:
  • Adult content
  • Pornographic images
  • Sexually explicit media
Use cases:
  • Content filtering for sensitive media settings
  • Protecting users who haven’t opted into adult content
  • Compliance with app store policies
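The use cases above all reduce to threshold-based filtering keyed to a user's sensitivity settings. The sketch below is illustrative only: the `should_filter_media` function, the threshold value, and the user-setting field are assumptions for demonstration, not X's actual API or production thresholds.

```python
# Hypothetical sketch: applying a pNSFWMedia score to per-user media filtering.
# The threshold (0.9) and setting name are illustrative assumptions.

def should_filter_media(pnsfw_score: float,
                        user_opted_into_sensitive: bool,
                        threshold: float = 0.9) -> bool:
    """Filter NSFW media for users who have not opted in to sensitive content."""
    if user_opted_into_sensitive:
        return False  # user explicitly allows adult content
    return pnsfw_score >= threshold

# An image scored 0.97 is filtered for a default-settings user,
# but shown to a user who opted in to sensitive media.
assert should_filter_media(0.97, user_opted_into_sensitive=False) is True
assert should_filter_media(0.97, user_opted_into_sensitive=True) is False
```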

pNSFWText

NSFW Text Detection

Identifies tweets with NSFW text content covering adult and sexual topics.
What it detects:
  • Adult/sexual language
  • Explicit text content
  • Sexually suggestive descriptions
Use cases:
  • Text-based content filtering
  • Search result filtering
  • Recommendation filtering for sensitive users

pToxicity

Toxic Content Detection

Detects toxic content, including insults and certain types of harassment. In this taxonomy, toxicity refers to marginal content that degrades the user experience but does not violate X’s Terms of Service.
What it detects:
  • Insults and derogatory language
  • Certain types of harassment
  • Low-quality or divisive content
  • Marginal content (does NOT violate TOS)
Toxic content may be downranked or filtered but does not result in enforcement action since it doesn’t violate Terms of Service.
Use cases:
  • Downranking in recommendations
  • Reducing exposure to low-quality content
  • Improving timeline quality
  • User experience optimization
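Because toxic content is not a policy violation, the response is reduced distribution rather than removal. A minimal sketch of score-based downranking follows; the threshold and penalty multiplier are illustrative assumptions, not X's production values.

```python
# Hypothetical sketch of downranking: toxic-but-not-violating content stays
# on the platform but receives a reduced ranking score in recommendations.

def downrank(ranking_score: float,
             toxicity_score: float,
             toxicity_threshold: float = 0.8,
             penalty: float = 0.5) -> float:
    """Scale a candidate's ranking score down when toxicity exceeds a threshold."""
    if toxicity_score >= toxicity_threshold:
        return ranking_score * penalty  # reduced distribution, not removal
    return ranking_score

# A toxic post keeps circulating, but at half its original ranking score.
assert downrank(1.0, toxicity_score=0.9) == 0.5
assert downrank(1.0, toxicity_score=0.3) == 1.0
```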

pAbuse

Abusive Content Detection

Detects abusive content that violates X’s Terms of Service, including hate speech, targeted harassment, and abusive behavior.
What it detects:
  • Hate speech
  • Targeted harassment
  • Abusive behavior
  • Terms of Service violations
Unlike pToxicity, content flagged by pAbuse represents actual policy violations that may result in enforcement actions.
Use cases:
  • Content moderation queue prioritization
  • Automated enforcement actions
  • User reporting systems
  • Safety event detection
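One use case above is moderation queue prioritization: reports most likely to be actual violations should reach human reviewers first. The sketch below is a generic priority-queue pattern assumed for illustration, not X's review tooling.

```python
# Hypothetical sketch: ordering a human-review queue by pAbuse score,
# so the likeliest Terms of Service violations are reviewed first.
import heapq

def prioritize_reports(reports: list[tuple[str, float]]) -> list[str]:
    """reports: (tweet_id, p_abuse_score) pairs. Returns ids, highest score first."""
    heap = [(-score, tweet_id) for tweet_id, score in reports]  # max-heap via negation
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# The report scored 0.95 jumps to the front of the queue.
assert prioritize_reports([("a", 0.2), ("b", 0.95), ("c", 0.6)]) == ["b", "c", "a"]
```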

Model Architecture

Location: trust_and_safety_models/

Training Code

The repository includes:
  • Model training scripts
  • Feature extraction pipelines
  • Evaluation frameworks
  • Dataset preparation code
While the training code is open-sourced, production model weights and some preprocessing steps remain proprietary to prevent adversarial gaming.

How They’re Used

Content Filtering Pipeline

1. Prediction: Models score tweets and media at creation time or during processing.
2. Thresholding: Scores above certain thresholds trigger filtering or moderation actions.
3. Action: Content may be filtered, downranked, sent to moderation, or removed based on severity.
4. User Controls: Users can adjust sensitivity settings to control what content is filtered.
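The four steps above can be sketched as a single routing function. Everything here is an illustrative assumption: the score names echo the open-sourced models, but the thresholds, action names, and ordering by severity are not X's actual (proprietary) logic.

```python
# A minimal end-to-end sketch of the pipeline: predict -> threshold -> act,
# with user controls applied to the sensitive-media branch.

def route_content(scores: dict[str, float], user_sensitive_ok: bool = False) -> str:
    """Map per-model scores to a single action, checked in descending severity."""
    if scores.get("pAbuse", 0.0) >= 0.9:
        return "remove"        # ToS violation -> enforcement action
    if scores.get("pToxicity", 0.0) >= 0.8:
        return "downrank"      # marginal content -> reduced distribution
    nsfw = max(scores.get("pNSFWMedia", 0.0), scores.get("pNSFWText", 0.0))
    if nsfw >= 0.9 and not user_sensitive_ok:
        return "filter"        # user controls: sensitive-media setting
    return "allow"

assert route_content({"pAbuse": 0.95}) == "remove"
assert route_content({"pToxicity": 0.85}) == "downrank"
assert route_content({"pNSFWMedia": 0.95}) == "filter"
assert route_content({"pNSFWMedia": 0.95}, user_sensitive_ok=True) == "allow"
```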

Integration Points

Trust and Safety scores feed into Visibility Filters which:
  • Hard-filter violating content
  • Apply visible product treatments (interstitials)
  • Downrank marginal content
  • Support legal compliance
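The treatments listed above differ in kind, not just degree: the same scoring pipeline can drop content entirely, show it behind an interstitial, or merely reduce its reach. A hypothetical mapping, with all category and treatment names assumed for illustration:

```python
# Hypothetical sketch of visibility-filter treatments by content category.
from enum import Enum

class Treatment(Enum):
    DROP = "drop"                  # hard-filter violating content
    INTERSTITIAL = "interstitial"  # visible product treatment over sensitive media
    DOWNRANK = "downrank"          # reduced distribution for marginal content
    NONE = "none"

def visibility_treatment(category: str) -> Treatment:
    """Map a content category to its visibility treatment (illustrative names)."""
    return {
        "violation": Treatment.DROP,
        "sensitive_media": Treatment.INTERSTITIAL,
        "marginal": Treatment.DOWNRANK,
    }.get(category, Treatment.NONE)

assert visibility_treatment("violation") is Treatment.DROP
assert visibility_treatment("unrecognized") is Treatment.NONE
```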

Safety vs. Quality

Safety (violations)
  • Severity: High - violates Terms of Service
  • Examples: Hate speech, targeted harassment, threats
  • Action: Enforcement actions including removal, account suspension

Quality (toxicity)
  • Severity: Medium - degrades experience but doesn’t violate TOS
  • Examples: Insults, divisive content, low-quality posts
  • Action: Downranking, reduced distribution, user filtering

Sensitive media (NSFW)
  • Severity: Varies - not inherently violating, but requires user controls
  • Examples: Adult content, explicit imagery/text
  • Action: Sensitive media treatments, user setting-based filtering

Adversarial Considerations

X does not open-source all trust and safety systems because:
  • Bad actors could reverse-engineer detection methods
  • Adversaries could train content to evade detection
  • Some techniques rely on keeping detection patterns confidential
What remains proprietary:
  • Many additional safety models and rules
  • Production model weights
  • Detection thresholds and logic
  • Certain preprocessing and feature extraction methods
  • Ensemble architectures and combinations

Performance Characteristics

  • Latency: Real-time scoring during tweet creation and processing
  • Coverage: All public tweets scored by relevant models
  • Accuracy: Continuously evaluated and improved based on human review
  • Update Frequency: Models retrained regularly with new data and adversarial examples