Platform Scraping

Overview

Genie Helper uses Stagehand (Playwright + vision LLM) to scrape creator profiles from OnlyFans, Fansly, and other platforms. The scraper supports cookie-based authentication, username/password login, and Twitter/X OAuth flows.

Operation: scrape_profile (media-worker)
Browser: Local Playwright (headless Chrome)
Vision LLM: ollama/qwen-2.5 for page understanding

Architecture

Scraping Flow

Dashboard "Let's Go" button
  ↓
Create media_jobs record (operation: scrape_profile)
  ↓
Media worker picks job
  ↓
Stagehand session starts
  ↓
Inject cookies from platform_sessions (if available)
  ↓
Navigate to creator profile URL
  ↓
Check if login wall detected (LLM: "Is this login or profile?")
  ↓ (if login)
Try credential-based login (email/password or X OAuth)
  ↓ (if no auth available)
Create hitl_sessions record → Show yellow banner in Dashboard
  ↓
User completes login via browser extension
  ↓
Extension captures cookies → platform_sessions
  ↓
Retry scrape → Cookie injection → Success
  ↓
Extract profile stats (followers, posts, bio, subscription price)
  ↓
Extract recent posts (captions, likes, comments, dates)
  ↓
Write to scraped_media + platform_connections
  ↓
Update scrape_status: idle
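
The branching in the flow above can be summarised as a small pure function. This is only a sketch: `chooseAuthStep`, its input object, and the returned step names are hypothetical and are not taken from media-worker/index.js. Only the `auth_type` values come from this page.

```javascript
// Hypothetical helper: decide what to do once the profile URL has loaded.
// The input fields are assumptions that mirror the flow described above.
function chooseAuthStep({ loginWallDetected, authType, creds }) {
  // Injected cookies were accepted, so the profile can be scraped directly.
  if (!loginWallDetected) return "extract";
  // Stored platform credentials: try the email/password form.
  if (authType === "email_password" && creds?.password) return "login_email_password";
  // Stored X credentials: try the "Sign in with X" flow.
  if (authType === "twitter_oauth" && creds?.x_password) return "login_twitter_oauth";
  // Nothing usable: hand over to the user (HITL).
  return "hitl";
}
```

Keeping the decision separate from the browser calls makes each branch easy to unit-test without a Stagehand session.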

Authentication Methods

The scraper supports three authentication types (stored in platform_connections.auth_type):

1. Cookie-Only (Recommended)

Flow: User logs in via browser extension → Extension captures cookies → Scraper injects cookies on next run

Pros:

Most reliable (no credential validation)
Works with 2FA, Google SSO, magic links
Bypasses bot detection

Cons:

Requires manual login once every 30-90 days (cookie expiry)

Dashboard Setup: Select “Cookie-only (most reliable)” during platform connection
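
Cookie injection usually needs a small conversion step, because cookies captured through the WebExtension `cookies` API use different field names from the ones Playwright's `browserContext.addCookies()` accepts. The sketch below shows one way to do that mapping. `toPlaywrightCookies` is a hypothetical helper, and how media-worker actually passes cookies to Stagehand is not shown on this page.

```javascript
// Map WebExtension sameSite values to the ones Playwright accepts.
// "unspecified" has no Playwright equivalent, so it is left out.
const SAME_SITE = { strict: "Strict", lax: "Lax", no_restriction: "None" };

// Hypothetical helper: convert extension-captured cookies into the shape
// expected by Playwright's browserContext.addCookies().
// Expired cookies are dropped so a stale session is noticed early.
function toPlaywrightCookies(extensionCookies, nowSeconds = Date.now() / 1000) {
  return extensionCookies
    .filter((c) => c.session || c.expirationDate === undefined || c.expirationDate > nowSeconds)
    .map((c) => {
      const cookie = {
        name: c.name,
        value: c.value,
        domain: c.domain,
        path: c.path || "/",
        httpOnly: Boolean(c.httpOnly),
        secure: Boolean(c.secure),
      };
      // Session cookies carry no expiry; persistent ones keep theirs (Unix seconds).
      if (!c.session && c.expirationDate !== undefined) cookie.expires = c.expirationDate;
      if (SAME_SITE[c.sameSite]) cookie.sameSite = SAME_SITE[c.sameSite];
      return cookie;
    });
}
```

If the function returns an empty array for a platform that previously had cookies, the session has most likely expired and a fresh capture through the extension is needed.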

2. Email/Password

Flow: Scraper navigates to login page → Fills email + password fields → Clicks sign-in button

Pros:

Fully automated (no user interaction)
Works for platforms without 2FA

Cons:

May trigger CAPTCHA or bot detection
Fails if 2FA is enabled

Implementation: media-worker/index.js:731-739

if (authType === "email_password" && creds?.password) {
  await sPost(`/v1/sessions/${sid}/navigate`, { url: urls.login });
  await sPost(`/v1/sessions/${sid}/act`, {
    action: `Fill the email/username field with "${creds.username}" and the password field with the stored password, then click the sign-in button`,
    modelName: STAGEHAND_MODEL,
  });
  await new Promise(r => setTimeout(r, 4000));
  loggedIn = true;
}

3. Twitter/X OAuth

Flow: Scraper clicks “Sign in with X” → Fills X credentials → OnlyFans/Fansly redirects back

Supported platforms: OnlyFans, Fansly
Credentials: x_username + x_password (separate from platform credentials)

Implementation: media-worker/index.js:708-728

if (authType === "twitter_oauth" && creds?.x_password) {
  await sPost(`/v1/sessions/${sid}/navigate`, { url: urls.login });
  
  // Click "Sign in with X" button
  await sPost(`/v1/sessions/${sid}/act`, {
    action: 'Find and click the "Sign in with X" or "Continue with Twitter" button',
    modelName: STAGEHAND_MODEL,
  });
  await new Promise(r => setTimeout(r, 3500));
  
  // Fill X credentials
  await sPost(`/v1/sessions/${sid}/act`, {
    action: `Fill the username field with "${creds.x_username}", press Next, then fill the password field and click Sign in`,
    modelName: STAGEHAND_MODEL,
  });
  await new Promise(r => setTimeout(r, 5000));
  
  loggedIn = true;
}
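
Both login excerpts wait a fixed few seconds and then set `loggedIn = true` without checking the result. One possible refinement, shown here only as a sketch, is to poll a check until it passes or a timeout expires. `waitUntil` is a hypothetical helper, and whatever check is passed to it (for example, asking the vision LLM whether the page is still a login form) is an assumption as well.

```javascript
// Hypothetical helper: poll an async check until it returns true
// or the timeout expires. The check always runs at least once.
async function waitUntil(check, { timeoutMs = 15000, intervalMs = 500 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return true;           // condition met
    if (Date.now() >= deadline) return false; // give up after the timeout
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

With a helper like this, `loggedIn` could be assigned the result of the poll instead of an unconditional `true`, so a CAPTCHA or 2FA prompt falls through to HITL rather than to a failed extraction.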

HITL (Human-in-the-Loop) System

HITL is triggered when:

No cookies available in platform_sessions
No credentials stored (auth_type: cookie_only)
Login fails (CAPTCHA, 2FA, or expired cookies)

Flow

Scraper creates hitl_sessions record:

{
  "status": "pending",
  "platform": "onlyfans",
  "login_url": "https://onlyfans.com/login",
  "reason": "Login required to scrape @username's profile",
  "creator_profile_id": "abc123"
}

Dashboard shows yellow banner:

⚠️ Login Required
We need your help to access OnlyFans. Install the browser extension and log in.

User clicks “Download Extension” → Installs from public/extension/
User navigates to platform and logs in normally
Extension captures cookies → Sends to /api/credentials/store-platform-session
Backend encrypts cookies → Stores in platform_sessions.encrypted_cookies
Dashboard shows green checkmark → User clicks “Let’s Go” again
Scraper injects cookies → Bypass login wall → Success

Implementation: dashboard/src/pages/Dashboard/index.jsx (banner) + browser extension
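
The "Backend encrypts cookies" step can be illustrated with Node's built-in `crypto` module. This page does not say which cipher or storage layout the backend uses, so the sketch below makes several assumptions: AES-256-GCM (the cipher named for the extension), a 32-byte key supplied by the caller, and a hypothetical `iv + tag + ciphertext` base64 layout. `encryptCookies` and `decryptCookies` are illustrative names.

```javascript
const nodeCrypto = require("node:crypto");

// Encrypt a cookie array into one base64 string laid out as:
// iv (12 bytes) | auth tag (16 bytes) | ciphertext.
function encryptCookies(cookies, key) {
  const iv = nodeCrypto.randomBytes(12); // fresh nonce for every encryption
  const cipher = nodeCrypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(cookies), "utf8"),
    cipher.final(),
  ]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

// Reverse of encryptCookies. Throws if the payload was modified
// or the key is wrong, because the GCM auth tag no longer verifies.
function decryptCookies(payload, key) {
  const raw = Buffer.from(payload, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = nodeCrypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString("utf8"));
}
```

GCM is an authenticated mode, so a tampered `encrypted_cookies` value fails loudly at decryption time instead of yielding corrupted cookies.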

Data Extraction

Profile Stats

Extracted fields:

follower_count — Total subscribers/followers
post_count — Total posts published
subscription_price — Monthly price (e.g., “$9.99” or “Free”)
bio_text — Profile biography

LLM Instruction: media-worker/index.js:762-770

const statsEx = await sPost(`/v1/sessions/${sid}/extract`, {
  instruction: `Extract this ${platform} creator's statistics: total follower/subscriber count, total post count, monthly subscription price, and profile bio text.`,
  schema: {
    follower_count: "number: total followers (integer, 0 if not visible)",
    post_count: "number: total posts (integer, 0 if not visible)",
    subscription_price: "string: subscription price like $9.99 or Free",
    bio_text: "string: profile biography text",
  },
});
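
Although the schema asks for integers, a vision LLM may still return display strings such as "12.3K" or "1,204". The following is a sketch of a defensive normalisation step that could run before the values are stored. `parseCount`, `parsePrice`, and `normalizeStats` are hypothetical helpers, and treating "Free" as a price of 0 is an assumption.

```javascript
// Hypothetical helper: turn 1204, "1,204", "12.3K" or "2M" into an integer.
// Anything unparseable becomes 0, matching the "0 if not visible" schema hint.
function parseCount(value) {
  if (typeof value === "number") {
    return Number.isFinite(value) ? Math.max(0, Math.round(value)) : 0;
  }
  const text = String(value ?? "").replace(/,/g, "").trim();
  const m = text.match(/^(\d+(?:\.\d+)?)\s*([kKmM])?$/);
  if (!m) return 0;
  const scale = { k: 1e3, m: 1e6 }[(m[2] || "").toLowerCase()] ?? 1;
  return Math.round(parseFloat(m[1]) * scale);
}

// Hypothetical helper: "$9.99" -> 9.99, "Free" -> 0, unknown -> null.
function parsePrice(value) {
  const text = String(value ?? "").trim();
  if (/^free$/i.test(text)) return 0;
  const m = text.match(/(\d+(?:\.\d+)?)/);
  return m ? parseFloat(m[1]) : null;
}

// Apply both parsers to the object returned by the extract call.
function normalizeStats(raw) {
  return {
    follower_count: parseCount(raw.follower_count),
    post_count: parseCount(raw.post_count),
    subscription_price: parsePrice(raw.subscription_price),
    bio_text: String(raw.bio_text ?? "").trim(),
  };
}
```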

Supported Platforms

Platform	Status	Auth Methods	Notes
OnlyFans	✅ Full	Cookie, Email, X OAuth	Main platform
Fansly	✅ Full	Cookie, Email, X OAuth	Similar to OF
Instagram	🚧 Partial	Cookie only	High bot detection
TikTok	🚧 Partial	Cookie only	Requires mobile user-agent
X/Twitter	🚧 Partial	Cookie only	Rate limits apply
Reddit	🚧 Partial	Cookie, Password	Subreddit-specific
Patreon	📅 Planned	Cookie	Roadmap
ManyVids	📅 Planned	Cookie	Roadmap

Platform URLs: media-worker/index.js:641-645

const PLATFORM_URLS = {
  onlyfans: { profile: `https://onlyfans.com/${username}`, login: "https://onlyfans.com/login" },
  fansly:   { profile: `https://fansly.com/${username}`,   login: "https://fansly.com/login" },
};
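
Because the excerpt interpolates `username` directly, it presumably lives inside a function where that variable is in scope. A standalone variant might look like the sketch below. `platformUrls` is a hypothetical name, and stripping a leading "@" and URL-encoding the handle are assumptions added for safety rather than behaviour confirmed by the source.

```javascript
// Base URLs for the two fully supported platforms listed above.
const PLATFORM_BASES = new Map([
  ["onlyfans", "https://onlyfans.com"],
  ["fansly", "https://fansly.com"],
]);

// Hypothetical helper: build the profile and login URLs for a creator.
function platformUrls(platform, username) {
  const base = PLATFORM_BASES.get(platform);
  if (!base) throw new Error(`Unsupported platform: ${platform}`);
  // Accept "@name" as well as "name", and encode anything unusual.
  const handle = encodeURIComponent(String(username).replace(/^@/, ""));
  return { profile: `${base}/${handle}`, login: `${base}/login` };
}
```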

Scrape Status States

Stored in platform_connections.scrape_status:

Status	Meaning	Next Action
`idle`	Ready to scrape	Click “Scrape Now”
`scraping`	In progress	Wait (auto-updates)
`hitl_required`	Login needed	Install extension + log in
`failed`	Error occurred	Check error message + retry

Status Updates: media-worker/index.js:629,754,811,849
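
The four states can be read as a small state machine. The transition table below is an assumption inferred from the "Next Action" column, not something confirmed by media-worker/index.js, and `canTransition` is a hypothetical helper.

```javascript
// Assumed transitions between scrape_status values.
const TRANSITIONS = {
  idle: ["scraping"],                            // user starts a scrape
  scraping: ["idle", "hitl_required", "failed"], // success, login needed, or error
  hitl_required: ["scraping"],                   // user logged in and retried
  failed: ["scraping"],                          // user retried after an error
};

// Hypothetical guard: true only for a transition listed above.
function canTransition(from, to) {
  return (
    Object.prototype.hasOwnProperty.call(TRANSITIONS, from) &&
    TRANSITIONS[from].includes(to)
  );
}
```

A guard like this could reject a second scrape request while one is already in the `scraping` state.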

Browser Extension

Path: public/extension/ (Firefox + Chrome manifest)
Size: ~15KB (no external dependencies)

Features

Captures cookies on command (user clicks extension icon)
Encrypts cookies client-side (AES-256-GCM)
Sends to /api/credentials/store-platform-session
Auto-detects platform from current URL
Works on all 18 supported platforms
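
Platform auto-detection can be as simple as matching the hostname of the current tab. The sketch below covers only the two platforms whose domains appear on this page. `detectPlatform` and the host table are hypothetical, and the real extension may work differently.

```javascript
// Known hostnames mapped to platform identifiers (assumed subset).
const PLATFORM_HOSTS = {
  "onlyfans.com": "onlyfans",
  "fansly.com": "fansly",
};

// Hypothetical helper: return the platform for a page URL, or null.
function detectPlatform(pageUrl) {
  let host;
  try {
    host = new URL(pageUrl).hostname.toLowerCase();
  } catch {
    return null; // not a valid absolute URL
  }
  for (const [domain, platform] of Object.entries(PLATFORM_HOSTS)) {
    // Match the domain itself or any subdomain, but not look-alike hosts.
    if (host === domain || host.endsWith(`.${domain}`)) return platform;
  }
  return null;
}
```

Matching on the exact domain or a dot-prefixed suffix avoids sending cookies from a look-alike host such as `evil-onlyfans.com`.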

Installation

Firefox:

Download extension.zip from Dashboard
Open about:debugging#/runtime/this-firefox
Click “Load Temporary Add-on”
Select manifest.json

Chrome:

Download extension.zip
Open chrome://extensions
Enable “Developer mode”
Click “Load unpacked” → Select extension folder

Download Link: Dashboard → Platforms → “Download Browser Extension”

Metadata Stripping

All scraped images are auto-stripped of EXIF/GPS metadata before upload to Directus.

Implementation: media-worker/index.js:817-835

try {
  const tmpFiles = fs.readdirSync(workDir);
  const imageExts = new Set([".jpg", ".jpeg", ".png", ".webp"]);
  
  for (const fname of tmpFiles) {
    const fext = path.extname(fname).toLowerCase();
    if (!imageExts.has(fext)) continue;
    
    const fPath = path.join(workDir, fname);
    const stripped = path.join(workDir, `stripped_${fname}`);
    
    await stripImageMetadata(fPath, stripped);
    
    // Replace original with stripped version
    if (fs.existsSync(stripped)) {
      fs.renameSync(stripped, fPath);
    }
  }
} catch (autoStripErr) {
  console.warn(`[scrape_profile] auto-strip error: ${autoStripErr.message}`);
}

Logs & Debugging

# Show the last 100 log lines, then exit
pm2 logs media-worker --lines 100 --nostream | grep scrape_profile

# Watch in real-time (pm2 logs streams by default)
pm2 logs media-worker | grep --line-buffered scrape_profile

Common Issues

Error	Cause	Fix
`HITL_REQUIRED`	No cookies + no credentials	Install extension + log in
`Stagehand timeout`	Page load >30s	Check internet connection
`Login wall detected`	Cookies expired	Re-capture cookies via extension
`screenshot failed`	Playwright crash	Restart `stagehand-server`

Related

Media Processing: Media worker operations
Dashboard: Scrape trigger UI
AI Agent: Stagehand MCP tools
