LangExtract’s plugin system enables you to add support for any LLM provider without modifying the core library. Create custom providers for proprietary APIs, local models, or specialized backends.

Why Create a Custom Provider?

  • Support new LLMs: Add providers for Claude, Cohere, Hugging Face, or any API
  • Private deployments: Connect to internal model endpoints
  • Independent distribution: Publish as a separate Python package
  • Zero configuration: Auto-discovery via Python entry points
  • Isolated dependencies: Keep provider-specific packages separate

Quick Start

Use the provider plugin generator to create a new provider in minutes:
python scripts/create_provider_plugin.py MyProvider --with-schema
This generates a complete plugin structure with:
  • Provider implementation
  • Schema support (optional)
  • Testing setup
  • Package configuration
  • Documentation

Architecture Overview

A provider plugin consists of:
  1. Provider Class - Implements BaseLanguageModel interface
  2. Registry Decorator - Registers model ID patterns
  3. Entry Point - Enables auto-discovery
  4. Schema Class (optional) - Enables structured output

Creating a Provider Plugin

Step 1: Package Structure

Create this directory structure:
langextract-yourprovider/
├── pyproject.toml              # Package configuration
├── README.md                    # Documentation
├── LICENSE                      # License file
└── langextract_yourprovider/   # Package directory
    ├── __init__.py             # Exports provider class
    ├── provider.py             # Provider implementation
    └── schema.py               # (Optional) Schema support
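
If you skip the generator script, the same layout can be scaffolded by hand (all names here are the placeholders from the tree above):

```shell
# Create the plugin package skeleton shown above
mkdir -p langextract-yourprovider/langextract_yourprovider
touch langextract-yourprovider/pyproject.toml \
      langextract-yourprovider/README.md \
      langextract-yourprovider/LICENSE
touch langextract-yourprovider/langextract_yourprovider/__init__.py \
      langextract-yourprovider/langextract_yourprovider/provider.py \
      langextract-yourprovider/langextract_yourprovider/schema.py  # schema.py is optional
```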

Step 2: Configure Entry Point

In pyproject.toml:
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "langextract-yourprovider"
version = "0.1.0"
description = "YourProvider integration for LangExtract"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "langextract>=1.0.0",
    "your-sdk>=1.0.0",  # Your provider's SDK
]

[project.entry-points."langextract.providers"]
yourprovider = "langextract_yourprovider:YourProviderLanguageModel"

Step 3: Implement Provider

In langextract_yourprovider/provider.py:
import os
from typing import Iterator, Sequence
import langextract as lx
from langextract.core import types as core_types

# Register patterns that your provider handles
@lx.providers.registry.register(
    r'^yourmodel',           # Matches: yourmodel-3b, yourmodel-7b
    r'^custom-',             # Matches: custom-base, custom-large
    r'^YourProviderLanguageModel$',  # Explicit: model_id="YourProviderLanguageModel"
    priority=10  # Optional: higher priority = checked first
)
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    """Language model provider for YourProvider API."""

    def __init__(
        self,
        model_id: str = "yourmodel-3b",
        api_key: str | None = None,
        base_url: str | None = None,
        temperature: float = 0.7,
        **kwargs
    ):
        """Initialize the provider.
        
        Args:
            model_id: Model identifier
            api_key: API key (falls back to YOURPROVIDER_API_KEY env var)
            base_url: API base URL
            temperature: Sampling temperature
            **kwargs: Additional parameters
        """
        super().__init__()
        
        self.model_id = model_id
        self.api_key = api_key or os.environ.get('YOURPROVIDER_API_KEY')
        self.base_url = base_url or "https://api.yourprovider.com"
        self.temperature = temperature
        
        if not self.api_key:
            raise lx.exceptions.InferenceConfigError(
                'API key required. Set YOURPROVIDER_API_KEY or pass api_key parameter.'
            )
        
        # Initialize your client
        from your_sdk import Client
        self.client = Client(
            api_key=self.api_key,
            base_url=self.base_url
        )
    
    def infer(
        self,
        batch_prompts: Sequence[str],
        **kwargs
    ) -> Iterator[Sequence[core_types.ScoredOutput]]:
        """Run inference on prompts.
        
        Args:
            batch_prompts: List of prompts to process
            **kwargs: Additional generation parameters
            
        Yields:
            Lists of ScoredOutput objects
        """
        # Merge constructor and call-time kwargs; pop temperature so it
        # is not passed twice in the API call below
        merged_kwargs = self.merge_kwargs(kwargs)
        temp = merged_kwargs.pop('temperature', self.temperature)
        
        for prompt in batch_prompts:
            try:
                # Call your API
                response = self.client.generate(
                    model=self.model_id,
                    prompt=prompt,
                    temperature=temp,
                    **merged_kwargs
                )
                
                # Yield result as ScoredOutput
                yield [core_types.ScoredOutput(
                    score=1.0,
                    output=response.text
                )]
                
            except Exception as e:
                raise lx.exceptions.InferenceRuntimeError(
                    f'YourProvider API error: {str(e)}',
                    original=e
                ) from e

Step 4: Export Provider

In langextract_yourprovider/__init__.py:
"""YourProvider integration for LangExtract."""

from langextract_yourprovider.provider import YourProviderLanguageModel

__all__ = ['YourProviderLanguageModel']
__version__ = '0.1.0'

Pattern Registration

The @register decorator defines which model IDs your provider handles:
@lx.providers.registry.register(
    r'^yourmodel',    # Prefix match: yourmodel-3b, yourmodel-large
    r'^custom-\w+',   # Regex: custom-base, custom-large
    r'.*-yourprovider$',  # Suffix: gpt-yourprovider, llama-yourprovider
    priority=10       # Higher priority checked first (default: 0)
)
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    pass

Pattern Priority

  • Default priority: 0; among equal priorities, the first registered pattern wins
  • Higher priority: Checked first (useful for overriding built-in providers)
  • Explicit selection: Users can force a provider: provider="YourProviderLanguageModel"
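
Conceptually, resolution is a priority-ordered first-match scan over the registered patterns. This standalone sketch illustrates the behavior (it is not LangExtract's actual registry code; the entries are hypothetical):

```python
import re

# Hypothetical registry entries: (priority, compiled pattern, provider name)
REGISTRY = [
    (10, re.compile(r'^yourmodel'), 'YourProviderLanguageModel'),
    (0, re.compile(r'^gemini'), 'GeminiLanguageModel'),
]

def resolve(model_id: str) -> str:
    # Higher-priority entries are checked first; the first match wins.
    for _, pattern, provider in sorted(REGISTRY, key=lambda entry: -entry[0]):
        if pattern.match(model_id):
            return provider
    raise ValueError(f'No provider registered for {model_id!r}')

print(resolve('yourmodel-3b'))  # YourProviderLanguageModel
```

This is why a higher priority lets a plugin override a built-in provider that matches the same model ID.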

Adding Schema Support

Schema support enables structured output with strict JSON constraints.

Step 1: Create Schema Class

In langextract_yourprovider/schema.py:
import langextract as lx

class YourProviderSchema(lx.schema.BaseSchema):
    """Schema for structured output."""
    
    def __init__(self, schema_dict: dict):
        self._schema_dict = schema_dict
    
    @property
    def schema_dict(self) -> dict:
        """Return the JSON schema dictionary."""
        return self._schema_dict
    
    @classmethod
    def from_examples(cls, examples_data, attribute_suffix="_attributes"):
        """Build schema from example extractions.
        
        Args:
            examples_data: List of ExampleData with extractions
            attribute_suffix: Suffix for attribute fields
            
        Returns:
            Schema instance
        """
        # Analyze examples to determine structure
        extraction_types = {}
        for example in examples_data:
            for extraction in example.extractions:
                class_name = extraction.extraction_class
                if class_name not in extraction_types:
                    extraction_types[class_name] = set()
                if extraction.attributes:
                    extraction_types[class_name].update(
                        extraction.attributes.keys()
                    )
        
        # Build JSON schema
        properties = {}
        for class_name, attrs in extraction_types.items():
            properties[class_name] = {
                "type": "object",
                "properties": {
                    "extraction_text": {"type": "string"},
                    "attributes": {
                        "type": "object",
                        "properties": {
                            attr: {"type": "string"}
                            for attr in attrs
                        }
                    }
                },
                "required": ["extraction_text"]
            }
        
        schema_dict = {
            "type": "object",
            "properties": {
                "extractions": {
                    "type": "array",
                    "items": {
                        "oneOf": [
                            # Each item is an object keyed by its extraction class
                            {"type": "object", "properties": {name: cls_schema}}
                            for name, cls_schema in properties.items()
                        ]
                    }
                }
            },
            "required": ["extractions"]
        }
        
        return cls(schema_dict)
    
    def to_provider_config(self) -> dict:
        """Convert to provider-specific configuration.
        
        Returns:
            Dictionary with provider-specific schema config
        """
        return {
            "response_schema": self._schema_dict,
            "structured_output": True
        }
    
    @property
    def supports_strict_mode(self) -> bool:
        """Return True if provider enforces valid JSON output."""
        return True  # Set False if your provider doesn't guarantee valid JSON

Step 2: Update Provider

Add schema support to your provider:
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    
    def __init__(self, model_id: str, **kwargs):
        super().__init__()
        self.model_id = model_id
        # Schema config will be in kwargs when use_schema_constraints=True
        self.response_schema = kwargs.get('response_schema')
        self.structured_output = kwargs.get('structured_output', False)
        # ... rest of init
    
    @classmethod
    def get_schema_class(cls):
        """Tell LangExtract about our schema support."""
        from langextract_yourprovider.schema import YourProviderSchema
        return YourProviderSchema
    
    def apply_schema(self, schema_instance):
        """Apply or clear schema configuration."""
        super().apply_schema(schema_instance)
        if schema_instance:
            config = schema_instance.to_provider_config()
            self.response_schema = config.get('response_schema')
            self.structured_output = config.get('structured_output', False)
        else:
            self.response_schema = None
            self.structured_output = False
    
    def infer(self, batch_prompts, **kwargs):
        for prompt in batch_prompts:
            # Use schema in API call if available
            api_params = {}
            if self.response_schema:
                api_params['response_schema'] = self.response_schema
            
            response = self.client.generate(prompt, **api_params)
            yield [lx.core.types.ScoredOutput(score=1.0, output=response.text)]

Testing Your Provider

Create tests/test_provider.py:
import pytest
import langextract as lx
from langextract_yourprovider import YourProviderLanguageModel

def test_provider_registration():
    """Test that provider is registered."""
    config = lx.factory.ModelConfig(model_id="yourmodel-3b")
    model = lx.factory.create_model(config)
    assert isinstance(model, YourProviderLanguageModel)

def test_basic_extraction():
    """Test basic extraction."""
    example = lx.data.ExampleData(
        text="Test text",
        extractions=[
            lx.data.Extraction(
                extraction_class="entity",
                extraction_text="Test",
                attributes={"type": "example"}
            )
        ]
    )
    
    result = lx.extract(
        text="Your test text",
        model_id="yourmodel-3b",
        api_key="test-key",
        prompt_description="Extract entities",
        examples=[example]
    )
    
    assert len(result.extractions) > 0

def test_schema_support():
    """Test schema constraints."""
    result = lx.extract(
        text="Your test text",
        model_id="yourmodel-3b",
        api_key="test-key",
        prompt_description="Extract entities",
        examples=[...],
        use_schema_constraints=True
    )
    
    assert result is not None

Installation & Usage

Install in Development Mode

cd langextract-yourprovider
pip install -e .

Use Your Provider

import langextract as lx

# Auto-detected by model_id pattern
result = lx.extract(
    text="Your document",
    model_id="yourmodel-3b",
    api_key="your-api-key",
    prompt_description="Extract information",
    examples=[...]
)

# Explicit provider selection
config = lx.factory.ModelConfig(
    model_id="any-model-id",
    provider="YourProviderLanguageModel",
    provider_kwargs={"api_key": "your-key"}
)
model = lx.factory.create_model(config)

Publishing Your Provider

Build Package

pip install build twine
python -m build

Publish to PyPI

twine upload dist/*

Share with Community

  1. Test installation in clean environment:
    pip install langextract-yourprovider
    
  2. Create documentation with:
    • Supported model IDs and patterns
    • Required environment variables
    • Usage examples
    • Provider-specific parameters
  3. Share on GitHub so others can discover and install your provider

Real-World Examples

See the built-in Gemini, OpenAI, and Ollama providers (linked under Next Steps below) for complete, production implementations.

Checklist

  • Create package structure
  • Configure pyproject.toml with entry point
  • Implement provider class
  • Add @lx.providers.registry.register() decorator
  • Implement __init__() method
  • Implement infer() method
  • Export class from __init__.py
  • Create schema class inheriting BaseSchema
  • Implement from_examples() class method
  • Implement to_provider_config() method
  • Add get_schema_class() to provider
  • Handle schema in provider’s infer()
  • Install with pip install -e .
  • Test basic inference
  • Verify auto-discovery works
  • Test schema support (if implemented)
  • Test error handling
  • Document supported model IDs
  • List environment variables
  • Provide usage examples
  • Document provider-specific parameters
  • Add installation instructions
  • Test in clean environment
  • Build package: python -m build
  • Publish to PyPI: twine upload dist/*
  • Share with community

Common Patterns

Environment Variable Fallback

def __init__(self, api_key: str | None = None, **kwargs):
    self.api_key = (
        api_key
        or os.environ.get('YOURPROVIDER_API_KEY')
        or os.environ.get('LANGEXTRACT_API_KEY')
    )
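
The precedence implied by the chain above (explicit argument first, then the provider-specific variable, then a generic one) can be checked in isolation. The helper name below is illustrative, not part of LangExtract:

```python
import os

def resolve_api_key(api_key=None):
    # Mirrors the fallback chain: explicit arg wins, then the
    # provider-specific variable, then the generic variable.
    return (
        api_key
        or os.environ.get('YOURPROVIDER_API_KEY')
        or os.environ.get('LANGEXTRACT_API_KEY')
    )

os.environ['LANGEXTRACT_API_KEY'] = 'generic'
os.environ['YOURPROVIDER_API_KEY'] = 'specific'
print(resolve_api_key())            # specific
print(resolve_api_key('explicit'))  # explicit
```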

Parallel Processing

import concurrent.futures

def infer(self, batch_prompts, **kwargs):
    max_workers = kwargs.get('max_workers', 10)
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(self._process, p) for p in batch_prompts]
        # Iterate in submission order (not as_completed) so each yielded
        # output stays aligned with its prompt in batch_prompts
        for future in futures:
            yield [future.result()]

Retry Logic

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def _call_api(self, prompt, **kwargs):
    return self.client.generate(prompt, **kwargs)

Troubleshooting

Plugin Not Loading

# Manually trigger plugin discovery
import langextract as lx
lx.providers.load_plugins_once()

# Verify entry points (pkg_resources is deprecated; use importlib.metadata)
from importlib.metadata import entry_points
for ep in entry_points(group='langextract.providers'):
    print(ep.name, ep.load())

Pattern Not Matching

# Test pattern matching
import re
pattern = r'^yourmodel'
model_id = "yourmodel-3b"
assert re.match(pattern, model_id)

Check Registration

import langextract as lx

# List all registered providers
for pattern, provider in lx.providers.registry.list_entries():
    print(f"{pattern} -> {provider}")

Next Steps

Provider Overview

Learn about the provider architecture

Gemini Provider

Study a production provider implementation

OpenAI Provider

See optional dependency handling

Ollama Provider

Learn about local provider patterns
