Overview
X’s Trust and Safety models detect problematic content to maintain platform quality and user safety. These models filter content that violates policies or degrades the user experience, ranging from NSFW media to abusive behavior.

Open Source Models
X has open-sourced the training code for four key models:

pNSFWMedia
NSFW Media Detection
Detects tweets containing NSFW (Not Safe For Work) images, including adult and pornographic content.

Detects:
- Adult content
- Pornographic images
- Sexually explicit media

Use cases:
- Content filtering for sensitive media settings
- Protecting users who haven’t opted into adult content
- Compliance with app store policies
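As a concrete illustration of the first two use cases, media display could be gated on the model's score together with the user's sensitive-media setting. This is only a sketch: the function name, threshold, and return values are assumptions for the example, not X's production logic.

```python
SENSITIVE_MEDIA_THRESHOLD = 0.9  # illustrative cutoff, not a production value

def media_treatment(p_nsfw_media: float, user_opted_in: bool) -> str:
    """Decide how to display media given its NSFW probability."""
    if p_nsfw_media < SENSITIVE_MEDIA_THRESHOLD:
        return "show"
    # Above the threshold: respect the user's sensitive-media setting.
    return "show_with_interstitial" if user_opted_in else "hide"
```

Users who have opted in still see an interstitial rather than the raw media, matching the visible product treatments described later in this page.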
pNSFWText
NSFW Text Detection
Identifies tweets with NSFW text content covering adult and sexual topics.

Detects:
- Adult/sexual language
- Explicit text content
- Sexually suggestive descriptions

Use cases:
- Text-based content filtering
- Search result filtering
- Recommendation filtering for sensitive users
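Search result filtering with this score might look like the following sketch, which drops high-scoring tweets for users who have not opted into sensitive content. The field name `p_nsfw_text` and the 0.8 threshold are assumptions for illustration.

```python
NSFW_TEXT_THRESHOLD = 0.8  # illustrative cutoff

def filter_search_results(results, allow_sensitive: bool):
    """Remove tweets whose NSFW-text score exceeds the threshold.

    results: list of dicts, each carrying a hypothetical 'p_nsfw_text' score.
    """
    if allow_sensitive:
        return results
    return [r for r in results if r["p_nsfw_text"] < NSFW_TEXT_THRESHOLD]
```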
pToxicity
Toxic Content Detection
Detects toxic content, including insults and certain types of harassment. Toxicity here refers to marginal content that does not violate X’s Terms of Service.

Detects:
- Insults and derogatory language
- Certain types of harassment
- Low-quality or divisive content
- Marginal content (does NOT violate the Terms of Service)

Toxic content may be downranked or filtered but does not result in enforcement action, since it doesn’t violate the Terms of Service.

Use cases:
- Downranking in recommendations
- Reducing exposure to low-quality content
- Improving timeline quality
- User experience optimization
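Downranking can be pictured as scaling a tweet's ranking score by a penalty derived from its toxicity probability, so toxic content loses reach without being removed. The linear penalty and the halving floor below are illustrative choices, not production logic.

```python
def downrank(base_score: float, p_toxicity: float) -> float:
    """Apply a toxicity penalty without removing the tweet entirely."""
    penalty = 1.0 - 0.5 * p_toxicity  # at worst, halve the ranking score
    return base_score * penalty
```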
pAbuse
Abusive Content Detection
Detects abusive content that violates X’s Terms of Service, including hate speech, targeted harassment, and abusive behavior.

Detects:
- Hate speech
- Targeted harassment
- Abusive behavior
- Terms of Service violations

Use cases:
- Content moderation queue prioritization
- Automated enforcement actions
- User reporting systems
- Safety event detection
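Moderation queue prioritization could use the pAbuse score as a priority key, so the most likely violations reach human reviewers first. A minimal sketch with Python's min-heap (scores negated to pop the highest first); the data shape is an assumption for the example.

```python
import heapq

def build_review_queue(reports):
    """reports: iterable of (tweet_id, p_abuse). Returns a priority queue."""
    queue = []
    for tweet_id, p_abuse in reports:
        # heapq is a min-heap, so negate the score to review high scores first.
        heapq.heappush(queue, (-p_abuse, tweet_id))
    return queue

def next_for_review(queue):
    """Pop the report with the highest abuse probability."""
    neg_score, tweet_id = heapq.heappop(queue)
    return tweet_id, -neg_score
```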
Model Architecture
Location: trust_and_safety_models/
Training Code
The repository includes:
- Model training scripts
- Feature extraction pipelines
- Evaluation frameworks
- Dataset preparation code
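To make the train-then-score shape of such a pipeline concrete, here is a deliberately tiny binary classifier: logistic regression on toy feature vectors via stochastic gradient descent. The actual repository's models and features are far richer; nothing here reflects its real architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, lr=0.5, epochs=200):
    """examples: list of feature vectors; labels: 0/1. Returns weights."""
    w = [0.0] * len(examples[0])
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def score(w, x):
    """Probability that the example is in the positive (violating) class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
```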
How They’re Used
Content Filtering Pipeline
Integration Points
- Visibility Filters
- Ranking Systems
- Moderation Queue
Trust and Safety scores feed into Visibility Filters, which:
- Hard-filter violating content
- Apply visible product treatments (interstitials)
- Downrank marginal content
- Support legal compliance
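The first three filter behaviors above can be sketched as one decision function over per-model scores, checked in order of severity. All thresholds and the dict-based interface are assumptions for illustration.

```python
def visibility_decision(scores: dict) -> str:
    """scores: probabilities keyed by model, e.g. 'p_abuse', 'p_nsfw_media',
    'p_toxicity'. Thresholds are illustrative, not production values."""
    if scores.get("p_abuse", 0.0) > 0.95:
        return "drop"          # hard-filter violating content
    if scores.get("p_nsfw_media", 0.0) > 0.9:
        return "interstitial"  # visible product treatment
    if scores.get("p_toxicity", 0.0) > 0.8:
        return "downrank"      # reduced distribution of marginal content
    return "allow"
```

Checking pAbuse first reflects the severity ordering described in the next section: policy violations take precedence over marginal or sensitive content treatments.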
Safety vs. Quality
Policy Violations (pAbuse)

Action: Enforcement actions including removal and account suspension
Examples: Hate speech, targeted harassment, threats
Severity: High - violates the Terms of Service
Marginal Content (pToxicity)

Action: Downranking, reduced distribution, user filtering
Examples: Insults, divisive content, low-quality posts
Severity: Medium - degrades the user experience but doesn’t violate the Terms of Service
NSFW Content (pNSFWMedia/pNSFWText)

Action: Sensitive media treatments, filtering based on user settings
Examples: Adult content, explicit imagery or text
Severity: Varies - not inherently violating, but requires user controls
Adversarial Considerations
What remains proprietary:
- Many additional safety models and rules
- Production model weights
- Detection thresholds and logic
- Certain preprocessing and feature extraction methods
- Ensemble architectures and combinations
Performance Characteristics
- Latency: Real-time scoring during tweet creation and processing
- Coverage: All public tweets scored by relevant models
- Accuracy: Continuously evaluated and improved based on human review
- Update Frequency: Models retrained regularly with new data and adversarial examples
Related Components
- Ranking Systems - incorporate safety scores into ranking
- Home Mixer - applies visibility filtering using these models