Enterprise web scraping and data collection with Bright Data

Plain HTTP requests often fail against modern anti-bot systems. Rate limits, CAPTCHAs, and geo-restrictions block naive scrapers before they collect meaningful data. Bright Data’s infrastructure handles these obstacles: its global proxy network and built-in bypass mechanisms give your agent reliable access to public web sources. This tutorial shows you two integration paths: the langchain-brightdata package for quick setup, and the Bright Data MCP server for access to 60+ tools, including platform-specific extractors.

Global proxy network

Route requests through Bright Data’s residential and datacenter IPs to avoid blocks.

CAPTCHA bypass

Bright Data’s Unlocker handles bot detection automatically, including JS rendering.

Structured extraction

Platform-specific parsers for Amazon, LinkedIn, and more return clean JSON.

Choose your integration path

LangChain integration
MCP server

The langchain-brightdata package provides a BrightDataSERP tool that slots directly into any LangChain or LangGraph agent. Use this path when you want quick setup and standard web search.

Best for: search-first workflows, rapid prototyping, Google or Bing SERP data.

Path 1: LangChain integration

Installation

pip install langchain-brightdata langchain-google-genai langgraph python-dotenv

Configure API keys

# Write to .env (replace with your actual keys)
# BRIGHT_DATA_API_TOKEN=<your-brightdata-api-key>
# GOOGLE_API_KEY=<your-google-api-key>

from langchain_brightdata import BrightDataSERP
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
from dotenv import load_dotenv
import os

load_dotenv()

print(f"Bright Data API Key loaded: {'Yes' if os.getenv('BRIGHT_DATA_API_TOKEN') else 'No'}")
print(f"Google API Key loaded: {'Yes' if os.getenv('GOOGLE_API_KEY') else 'No'}")
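A Yes/No line is easy to miss in logs, so it can help to fail fast instead. The following is a minimal sketch, assuming the two variable names used above; missing_keys and require_keys are hypothetical helpers, not part of any Bright Data or LangChain package.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("BRIGHT_DATA_API_TOKEN", "GOOGLE_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


def require_keys(env=None, required=REQUIRED_KEYS):
    """Raise one clear error that lists every missing variable."""
    missing = missing_keys(env, required)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling require_keys() right after load_dotenv() would stop the script before any agent is built with an empty credential.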

Initialize the language model

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.1,  # low temperature for consistent agent decisions
)

Configure the SERP tool

serp_tool = BrightDataSERP(
    search_engine="google",
    country="us",
    language="en",
    results_count=10,
    parse_results=True,  # convert raw HTML to structured data
)

print(f"Search Engine: {serp_tool.search_engine}")
print(f"Country: {serp_tool.country}")
print(f"Language: {serp_tool.language}")
print(f"Results Count: {serp_tool.results_count}")

parse_results=True instructs Bright Data’s parser to convert raw search engine HTML into structured JSON the LLM can process directly.
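If you post-process the structured output yourself, a small flattening step keeps the text passed to the LLM short. This sketch assumes a payload shape that this tutorial does not specify: a top-level "organic" list whose items carry "title" and "link" keys. Treat both the shape and the summarize_serp helper as hypothetical, and inspect a real response before relying on them.

```python
import json


def summarize_serp(payload, limit=5):
    """Flatten a parsed SERP payload into short 'rank. title | link' lines.

    ASSUMPTION: results live under an "organic" list whose items have
    "title" and "link" keys. Adjust the key names to the real payload.
    """
    data = json.loads(payload) if isinstance(payload, str) else payload
    lines = []
    for rank, item in enumerate(data.get("organic", [])[:limit], start=1):
        title = item.get("title", "(no title)")
        link = item.get("link", "")
        lines.append(f"{rank}. {title} | {link}")
    return lines
```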

Create the ReAct agent

agent = create_react_agent(
    model=llm,
    tools=[serp_tool],
    prompt=(
        "You are a web researcher agent with access to a SERP tool. "
        "You MUST use the tool to answer user queries. If no specific country, "
        "language, search engine, or vertical is specified, choose what best fits "
        "the user's question."
    ),
)

Run a search query

user_query = "What are the latest developments and news in AI technology in the US?"

for step in agent.stream(
    {"messages": [("human", user_query)]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

The streaming output lets you observe the agent’s reasoning process: query analysis, tool invocation, result processing, and final synthesis.
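When pretty_print() is too verbose, a one-line summary per step can make the same phases easier to scan. This is a sketch only: describe_message is a hypothetical helper, and it assumes the streamed messages expose type, tool_calls, name, and content attributes. The stand-in objects below mimic that shape so the example runs without any agent.

```python
from types import SimpleNamespace


def describe_message(message):
    """Return a one-line summary of a streamed message.

    Attributes are read with getattr, so any object exposing `type`,
    `tool_calls`, `name`, and `content` works (an assumption about the
    message objects, checked here only against stand-ins).
    """
    kind = getattr(message, "type", "unknown")
    tool_calls = getattr(message, "tool_calls", None) or []
    if tool_calls:
        names = ", ".join(call["name"] for call in tool_calls)
        return f"{kind}: calls {names}"
    content = str(getattr(message, "content", ""))
    if kind == "tool":
        return f"tool: {getattr(message, 'name', '?')} returned {len(content)} chars"
    return f"{kind}: {content[:60]}"


# Stand-in messages for illustration; a real run would pass step["messages"][-1].
demo = [
    SimpleNamespace(type="human", content="latest AI news"),
    SimpleNamespace(type="ai", content="", tool_calls=[{"name": "search", "args": {}}]),
    SimpleNamespace(type="tool", name="search", content="abcd"),
]
summaries = [describe_message(m) for m in demo]
```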

Build a reusable research assistant factory

Wrap the setup logic so you can create localized agents on demand:

def create_research_assistant(
    search_engine: str = "google",
    country: str = "us",
    language: str = "en",
):
    """
    Create a research agent configured for a specific locale.

    Args:
        search_engine: "google" or "bing"
        country: ISO country code for localized results
        language: ISO language code
    """
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.1)

    serp_tool = BrightDataSERP(
        search_engine=search_engine,
        country=country,
        language=language,
        results_count=15,
        parse_results=True,
    )

    agent = create_react_agent(llm, [serp_tool])

    print("Research Assistant created!")
    print(f"  Engine: {search_engine.title()}")
    print(f"  Location: {country.upper()}")
    print(f"  Language: {language.upper()}")

    return agent

research_assistant = create_research_assistant()
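If the factory is called repeatedly for the same locale, you may want to build each agent once and reuse it. The sketch below shows one way to do that with functools.lru_cache. build_agent is a stand-in that returns a plain dict so the example runs on its own; in practice it would be replaced by a call to create_research_assistant(). get_assistant is a hypothetical name.

```python
from functools import lru_cache


def build_agent(search_engine, country, language):
    """Stand-in for create_research_assistant(); returns a plain dict here."""
    return {"engine": search_engine, "country": country, "language": language}


@lru_cache(maxsize=32)
def get_assistant(search_engine="google", country="us", language="en"):
    """Build the agent for a locale on first use and reuse it afterwards."""
    return build_agent(search_engine, country, language)
```

Note that lru_cache keys on the arguments as written, so get_assistant("google", "es", "es") and get_assistant(country="es", language="es") are cached separately. Calling it in one consistent form avoids duplicate agents.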

Advanced configuration patterns

spanish_serp = BrightDataSERP(
    search_engine="google",
    country="es",
    language="es",
    results_count=15,
    parse_results=True,
)

spanish_agent = create_react_agent(llm, [spanish_serp])

Run a multi-part research query

research_query = """
Please research the renewable energy market trends for 2024-2025.
I need information about:
1. Market growth predictions
2. Leading companies and their strategies
3. Recent technological breakthroughs
4. Government policies affecting the sector
"""

for step in research_assistant.stream(
    {"messages": [("human", research_query)]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

Path 2: MCP server integration

The MCP path gives you access to Bright Data’s full tool suite, including platform-specific extractors and browser automation. It requires Node.js to run the @brightdata/mcp package.

Installation

pip install langgraph langchain-openai mcp-use python-dotenv

# .env
# BRIGHT_DATA_API_TOKEN=<your-brightdata-api-key>
# OPENROUTER_API_KEY=<your-openrouter-api-key>

Configure and connect the MCP server

import asyncio
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from mcp_use.client import MCPClient
from mcp_use.adapters.langchain_adapter import LangChainAdapter
from dotenv import load_dotenv
import os

load_dotenv()

async def setup_bright_data_tools():
    """Configure the Bright Data MCP client and convert tools to LangChain format."""
    bright_data_config = {
        "mcpServers": {
            "Bright Data": {
                "command": "npx",
                "args": ["@brightdata/mcp"],
                "env": {
                    "API_TOKEN": os.getenv("BRIGHT_DATA_API_TOKEN"),
                },
            }
        }
    }

    client = MCPClient.from_dict(bright_data_config)
    adapter = LangChainAdapter()

    tools = await adapter.create_tools(client)

    print("Connected to Bright Data MCP server")
    print(f"Available tools: {len(tools)}")

    return tools

The npx @brightdata/mcp command downloads and runs the Bright Data MCP server. You need Node.js installed on the machine running the agent. The server exposes 60+ tools including search engines, platform-specific scrapers, and a universal web unlocker.
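Because the server depends on both Node.js and a valid token, a preflight check can surface setup problems before the client tries to spawn the process. This is a minimal sketch using only the standard library; build_mcp_config and npx_available are hypothetical helpers, and the config keys mirror the dictionary shown above.

```python
import shutil


def build_mcp_config(token, server_name="Bright Data"):
    """Return the MCP client config, refusing to build it without a token."""
    if not token:
        raise ValueError("BRIGHT_DATA_API_TOKEN is not set")
    return {
        "mcpServers": {
            server_name: {
                "command": "npx",
                "args": ["@brightdata/mcp"],
                "env": {"API_TOKEN": token},
            }
        }
    }


def npx_available():
    """True when an `npx` executable can be found on PATH."""
    return shutil.which("npx") is not None
```

Passing os.getenv("BRIGHT_DATA_API_TOKEN") to build_mcp_config() turns a missing variable into an immediate, readable error instead of a None value inside the server environment.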

Create the agent with MCP tools

import datetime

async def create_web_scraper_agent():
    """Create a ReAct agent with full Bright Data MCP tool access."""
    tools = await setup_bright_data_tools()

    current_date = datetime.datetime.now().strftime("%B %d, %Y")

    llm = ChatOpenAI(
        openai_api_key=os.getenv("OPENROUTER_API_KEY"),
        openai_api_base="https://openrouter.ai/api/v1",
        model_name="google/gemini-2.5-flash-lite-preview-06-17",
        temperature=0.1,
    )

    agent = create_react_agent(
        model=llm,
        tools=tools,
        prompt=(
            f"You are a web data extraction specialist. Today is {current_date}. "
            f"You have access to {len(tools)} Bright Data tools including search engines, "
            "platform-specific extractors, and a universal web unlocker. "
            "Always use a tool to answer user requests—do not rely on training data. "
            "Follow this process: 1) Understand the request. 2) Select the best tool. "
            "3) Execute and review results. 4) Return a structured response with sources."
        ),
    )

    return agent

Test basic search

async def test_basic_search(agent):
    print("Testing basic search...")
    print("=" * 50)

    result = await agent.ainvoke({
        "messages": [("human", "Give me the latest AI news from this week. Include full URLs to sources.")],
    })

    print("\nSearch results:")
    print(result["messages"][-1].content)
    return result

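# Top-level await works in a notebook or async REPL. In a plain script,
# put these two lines inside an async def main() and call asyncio.run(main()).
agent = await create_web_scraper_agent()
basic_result = await test_basic_search(agent)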

Available MCP tool categories

Search engines

Google, Bing, and Yandex with configurable location, language, and result count.

Universal web scraper

Extract content from any URL in Markdown or HTML with built-in bot detection bypass and JS rendering.

Platform extractors

Structured data from Amazon, LinkedIn, Instagram, Facebook, X, TikTok, YouTube, Reddit, and Zillow.

Browser automation

Navigate interactive pages, click elements, and scrape content that requires JavaScript execution.

How the agent selects tools

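The ReAct agent follows a systematic decision loop for each query, mirroring the process laid out in its prompt: understand the request, select the best-fitting tool, execute it and review the results, then return a structured response with sources.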

For competitive intelligence workflows, combine the SERP tool to discover relevant URLs with the universal scraper to extract full page content from the top results. The agent handles this chaining automatically when given a research-style prompt.
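The search-then-scrape chain can also be written out explicitly, which is useful when you want deterministic behavior instead of leaving the sequencing to the agent. This sketch takes two injected callables, search and scrape, as hypothetical stand-ins for the SERP tool and the universal scraper. Their signatures are assumptions for illustration, not the tools' real interfaces.

```python
from typing import Callable, Dict, List


def research(
    query: str,
    search: Callable[[str], List[str]],
    scrape: Callable[[str], str],
    top_n: int = 3,
) -> Dict[str, str]:
    """Search for URLs, then scrape the first `top_n` of them.

    `search` and `scrape` are hypothetical callables. A page that fails
    to scrape is recorded as an "ERROR: ..." string rather than aborting
    the whole run.
    """
    pages: Dict[str, str] = {}
    for url in search(query)[:top_n]:
        try:
            pages[url] = scrape(url)
        except Exception as exc:  # keep going when a single page fails
            pages[url] = f"ERROR: {exc}"
    return pages
```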

Production considerations

Monitor your Bright Data API usage. The free tier provides 5,000 unlocker requests per month. Each BrightDataSERP call with results_count=10 consumes one request. High-volume research agents can exhaust the free tier quickly.
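A local counter is one simple way to keep an eye on that allowance. The sketch below tracks requests against the 5,000-per-month figure quoted above. RequestBudget and runs_per_month are hypothetical helpers, and the counter only knows about calls you report to it, so it is no substitute for the usage figures in your Bright Data account.

```python
class RequestBudget:
    """Count requests against a fixed monthly allowance."""

    def __init__(self, monthly_limit=5000):
        self.monthly_limit = monthly_limit
        self.used = 0

    @property
    def remaining(self):
        return self.monthly_limit - self.used

    def spend(self, count=1):
        """Record `count` requests, or raise if that would exceed the limit."""
        if self.used + count > self.monthly_limit:
            raise RuntimeError("monthly request budget exhausted")
        self.used += count
        return self.remaining


def runs_per_month(monthly_limit, calls_per_run):
    """How many agent runs fit if each run makes `calls_per_run` requests."""
    return monthly_limit // calls_per_run
```

As an illustration, if a research run averaged 8 tool calls (an assumed figure), runs_per_month(5000, 8) gives 625 runs before the allowance is used up.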

Consider these patterns when moving to production:

Rate limiting: Add delays between agent runs that trigger many tool calls in rapid succession.
Result caching: Cache SERP results with a short TTL (minutes to hours) for queries that repeat across users.
Error handling: Wrap agent invocations in try/except to handle network failures from the proxy layer gracefully.
Monitoring: Log which tools the agent selects and how often to identify optimization opportunities.
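The caching, error-handling, and monitoring patterns above can be sketched with the standard library alone. Everything here is illustrative: TTLCache, call_with_retry, and record_tool_call are hypothetical helpers, the retry catches every exception for brevity, and a real deployment would likely narrow that to network errors and persist the counters somewhere durable.

```python
import time
from collections import Counter


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


tool_usage = Counter()  # tool name -> number of invocations


def record_tool_call(name):
    """Increment the usage count for one tool."""
    tool_usage[name] += 1
```

The injected clock and sleep parameters exist so the helpers can be tested without waiting. The backoff delays also double as a crude form of rate limiting between retries.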

What you built

LangChain SERP agent

Localized Google or Bing searches routed through Bright Data’s proxy network with structured result parsing.

Reusable assistant factory

A create_research_assistant() function parameterized by search engine, country, and language.

MCP agent with 60+ tools

Full Bright Data tool suite including platform extractors and browser automation via the MCP server.

Production architecture

Rate limiting, caching, error handling, and monitoring patterns for deploying at scale.
