LangExtract’s plugin system enables you to add support for any LLM provider without modifying the core library. Create custom providers for proprietary APIs, local models, or specialized backends.

Why Create a Custom Provider?

  • Support new LLMs: Add providers for Claude, Cohere, Hugging Face, or any API
  • Private deployments: Connect to internal model endpoints
  • Independent distribution: Publish as a separate Python package
  • Zero configuration: Auto-discovery via Python entry points
  • Isolated dependencies: Keep provider-specific packages separate

Quick Start

Use the provider plugin generator to create a new provider in minutes:
python scripts/create_provider_plugin.py MyProvider --with-schema
This generates a complete plugin structure with:
  • Provider implementation
  • Schema support (optional)
  • Testing setup
  • Package configuration
  • Documentation

Architecture Overview

A provider plugin consists of:
  1. Provider Class - Implements BaseLanguageModel interface
  2. Registry Decorator - Registers model ID patterns
  3. Entry Point - Enables auto-discovery
  4. Schema Class (optional) - Enables structured output

Creating a Provider Plugin

Step 1: Package Structure

Create this directory structure:
langextract-yourprovider/
├── pyproject.toml              # Package configuration
├── README.md                    # Documentation
├── LICENSE                      # License file
└── langextract_yourprovider/   # Package directory
    ├── __init__.py             # Exports provider class
    ├── provider.py             # Provider implementation
    └── schema.py               # (Optional) Schema support
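
If you skip the generator script, the same layout can be scaffolded by hand (all names here are the placeholders from the tree above):

```shell
# Create the plugin package skeleton shown above
mkdir -p langextract-yourprovider/langextract_yourprovider
touch langextract-yourprovider/pyproject.toml \
      langextract-yourprovider/README.md \
      langextract-yourprovider/LICENSE
touch langextract-yourprovider/langextract_yourprovider/__init__.py \
      langextract-yourprovider/langextract_yourprovider/provider.py \
      langextract-yourprovider/langextract_yourprovider/schema.py  # schema.py is optional
```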

Step 2: Configure Entry Point

In pyproject.toml:
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "langextract-yourprovider"
version = "0.1.0"
description = "YourProvider integration for LangExtract"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "langextract>=1.0.0",
    "your-sdk>=1.0.0",  # Your provider's SDK
]

[project.entry-points."langextract.providers"]
yourprovider = "langextract_yourprovider:YourProviderLanguageModel"

Step 3: Implement Provider

In langextract_yourprovider/provider.py:
import os
from typing import Iterator, Sequence
import langextract as lx
from langextract.core import types as core_types

# Register patterns that your provider handles
@lx.providers.registry.register(
    r'^yourmodel',           # Matches: yourmodel-3b, yourmodel-7b
    r'^custom-',             # Matches: custom-base, custom-large
    r'^YourProviderLanguageModel$',  # Explicit: model_id="YourProviderLanguageModel"
    priority=10  # Optional: higher priority = checked first
)
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    """Language model provider for YourProvider API."""

    def __init__(
        self,
        model_id: str = "yourmodel-3b",
        api_key: str | None = None,
        base_url: str | None = None,
        temperature: float = 0.7,
        **kwargs
    ):
        """Initialize the provider.
        
        Args:
            model_id: Model identifier
            api_key: API key (falls back to YOURPROVIDER_API_KEY env var)
            base_url: API base URL
            temperature: Sampling temperature
            **kwargs: Additional parameters
        """
        super().__init__()
        
        self.model_id = model_id
        self.api_key = api_key or os.environ.get('YOURPROVIDER_API_KEY')
        self.base_url = base_url or "https://api.yourprovider.com"
        self.temperature = temperature
        
        if not self.api_key:
            raise lx.exceptions.InferenceConfigError(
                'API key required. Set YOURPROVIDER_API_KEY or pass api_key parameter.'
            )
        
        # Initialize your client
        from your_sdk import Client
        self.client = Client(
            api_key=self.api_key,
            base_url=self.base_url
        )
    
    def infer(
        self,
        batch_prompts: Sequence[str],
        **kwargs
    ) -> Iterator[Sequence[core_types.ScoredOutput]]:
        """Run inference on prompts.
        
        Args:
            batch_prompts: List of prompts to process
            **kwargs: Additional generation parameters
            
        Yields:
            Lists of ScoredOutput objects
        """
        # Merge constructor and call-time kwargs; pop temperature so it
        # is not passed twice in the API call below
        merged_kwargs = self.merge_kwargs(kwargs)
        temp = merged_kwargs.pop('temperature', self.temperature)
        
        for prompt in batch_prompts:
            try:
                # Call your API
                response = self.client.generate(
                    model=self.model_id,
                    prompt=prompt,
                    temperature=temp,
                    **merged_kwargs
                )
                
                # Yield result as ScoredOutput
                yield [core_types.ScoredOutput(
                    score=1.0,
                    output=response.text
                )]
                
            except Exception as e:
                raise lx.exceptions.InferenceRuntimeError(
                    f'YourProvider API error: {str(e)}',
                    original=e
                ) from e

Step 4: Export Provider

In langextract_yourprovider/__init__.py:
"""YourProvider integration for LangExtract."""

from langextract_yourprovider.provider import YourProviderLanguageModel

__all__ = ['YourProviderLanguageModel']
__version__ = '0.1.0'

Pattern Registration

The @register decorator defines which model IDs your provider handles:
@lx.providers.registry.register(
    r'^yourmodel',    # Prefix match: yourmodel-3b, yourmodel-large
    r'^custom-\w+',   # Regex: custom-base, custom-large
    r'.*-yourprovider$',  # Suffix: gpt-yourprovider, llama-yourprovider
    priority=10       # Higher priority checked first (default: 0)
)
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    pass

Pattern Priority

  • Default priority: 0; among equal priorities, the first registered pattern wins
  • Higher priority: Checked first (useful for overriding built-in providers)
  • Explicit selection: Users can force a provider: provider="YourProviderLanguageModel"
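
Conceptually, resolution is a priority-ordered first-match scan over the registered patterns. This standalone sketch illustrates the behavior (it is not LangExtract's actual registry code; the entries are hypothetical):

```python
import re

# Hypothetical registry entries: (priority, compiled pattern, provider name)
REGISTRY = [
    (10, re.compile(r'^yourmodel'), 'YourProviderLanguageModel'),
    (0, re.compile(r'^gemini'), 'GeminiLanguageModel'),
]

def resolve(model_id: str) -> str:
    # Higher-priority entries are checked first; the first match wins.
    for _, pattern, provider in sorted(REGISTRY, key=lambda entry: -entry[0]):
        if pattern.match(model_id):
            return provider
    raise ValueError(f'No provider registered for {model_id!r}')

print(resolve('yourmodel-3b'))  # YourProviderLanguageModel
```

This is why a higher priority lets a plugin override a built-in provider that matches the same model ID.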

Adding Schema Support

Schema support enables structured output with strict JSON constraints.

Step 1: Create Schema Class

In langextract_yourprovider/schema.py:
import langextract as lx

class YourProviderSchema(lx.schema.BaseSchema):
    """Schema for structured output."""
    
    def __init__(self, schema_dict: dict):
        self._schema_dict = schema_dict
    
    @property
    def schema_dict(self) -> dict:
        """Return the JSON schema dictionary."""
        return self._schema_dict
    
    @classmethod
    def from_examples(cls, examples_data, attribute_suffix="_attributes"):
        """Build schema from example extractions.
        
        Args:
            examples_data: List of ExampleData with extractions
            attribute_suffix: Suffix for attribute fields
            
        Returns:
            Schema instance
        """
        # Analyze examples to determine structure
        extraction_types = {}
        for example in examples_data:
            for extraction in example.extractions:
                class_name = extraction.extraction_class
                if class_name not in extraction_types:
                    extraction_types[class_name] = set()
                if extraction.attributes:
                    extraction_types[class_name].update(
                        extraction.attributes.keys()
                    )
        
        # Build JSON schema
        properties = {}
        for class_name, attrs in extraction_types.items():
            properties[class_name] = {
                "type": "object",
                "properties": {
                    "extraction_text": {"type": "string"},
                    "attributes": {
                        "type": "object",
                        "properties": {
                            attr: {"type": "string"}
                            for attr in attrs
                        }
                    }
                },
                "required": ["extraction_text"]
            }
        
        schema_dict = {
            "type": "object",
            "properties": {
                "extractions": {
                    "type": "array",
                    "items": {
                        "oneOf": [
                            # Each item is an object keyed by its extraction class
                            {"type": "object", "properties": {name: cls_schema}}
                            for name, cls_schema in properties.items()
                        ]
                    }
                }
            },
            "required": ["extractions"]
        }
        
        return cls(schema_dict)
    
    def to_provider_config(self) -> dict:
        """Convert to provider-specific configuration.
        
        Returns:
            Dictionary with provider-specific schema config
        """
        return {
            "response_schema": self._schema_dict,
            "structured_output": True
        }
    
    @property
    def supports_strict_mode(self) -> bool:
        """Return True if provider enforces valid JSON output."""
        return True  # Set False if your provider doesn't guarantee valid JSON

Step 2: Update Provider

Add schema support to your provider:
class YourProviderLanguageModel(lx.inference.BaseLanguageModel):
    
    def __init__(self, model_id: str, **kwargs):
        super().__init__()
        self.model_id = model_id
        # Schema config will be in kwargs when use_schema_constraints=True
        self.response_schema = kwargs.get('response_schema')
        self.structured_output = kwargs.get('structured_output', False)
        # ... rest of init
    
    @classmethod
    def get_schema_class(cls):
        """Tell LangExtract about our schema support."""
        from langextract_yourprovider.schema import YourProviderSchema
        return YourProviderSchema
    
    def apply_schema(self, schema_instance):
        """Apply or clear schema configuration."""
        super().apply_schema(schema_instance)
        if schema_instance:
            config = schema_instance.to_provider_config()
            self.response_schema = config.get('response_schema')
            self.structured_output = config.get('structured_output', False)
        else:
            self.response_schema = None
            self.structured_output = False
    
    def infer(self, batch_prompts, **kwargs):
        for prompt in batch_prompts:
            # Use schema in API call if available
            api_params = {}
            if self.response_schema:
                api_params['response_schema'] = self.response_schema
            
            response = self.client.generate(prompt, **api_params)
            yield [lx.core.types.ScoredOutput(score=1.0, output=response.text)]

Testing Your Provider

Create tests/test_provider.py:
import pytest
import langextract as lx
from langextract_yourprovider import YourProviderLanguageModel

def test_provider_registration():
    """Test that provider is registered."""
    config = lx.factory.ModelConfig(model_id="yourmodel-3b")
    model = lx.factory.create_model(config)
    assert isinstance(model, YourProviderLanguageModel)

def test_basic_extraction():
    """Test basic extraction."""
    example = lx.data.ExampleData(
        text="Test text",
        extractions=[
            lx.data.Extraction(
                extraction_class="entity",
                extraction_text="Test",
                attributes={"type": "example"}
            )
        ]
    )
    
    result = lx.extract(
        text="Your test text",
        model_id="yourmodel-3b",
        api_key="test-key",
        prompt_description="Extract entities",
        examples=[example]
    )
    
    assert len(result.extractions) > 0

def test_schema_support():
    """Test schema constraints."""
    result = lx.extract(
        text="Your test text",
        model_id="yourmodel-3b",
        api_key="test-key",
        prompt_description="Extract entities",
        examples=[...],
        use_schema_constraints=True
    )
    
    assert result is not None

Installation & Usage

Install in Development Mode

cd langextract-yourprovider
pip install -e .

Use Your Provider

import langextract as lx

# Auto-detected by model_id pattern
result = lx.extract(
    text="Your document",
    model_id="yourmodel-3b",
    api_key="your-api-key",
    prompt_description="Extract information",
    examples=[...]
)

# Explicit provider selection
config = lx.factory.ModelConfig(
    model_id="any-model-id",
    provider="YourProviderLanguageModel",
    provider_kwargs={"api_key": "your-key"}
)
model = lx.factory.create_model(config)

Publishing Your Provider

Build Package

pip install build twine
python -m build

Publish to PyPI

twine upload dist/*

Share with Community

  1. Test installation in clean environment:
    pip install langextract-yourprovider
    
  2. Create documentation with:
    • Supported model IDs and patterns
    • Required environment variables
    • Usage examples
    • Provider-specific parameters
  3. Share on GitHub so others can discover and install your provider

Real-World Examples

See the built-in Gemini, OpenAI, and Ollama providers (linked under Next Steps below) for complete, production implementations.

Checklist

  • Create package structure
  • Configure pyproject.toml with entry point
  • Implement provider class
  • Add @lx.providers.registry.register() decorator
  • Implement __init__() method
  • Implement infer() method
  • Export class from __init__.py
  • Create schema class inheriting BaseSchema
  • Implement from_examples() class method
  • Implement to_provider_config() method
  • Add get_schema_class() to provider
  • Handle schema in provider’s infer()
  • Install with pip install -e .
  • Test basic inference
  • Verify auto-discovery works
  • Test schema support (if implemented)
  • Test error handling
  • Document supported model IDs
  • List environment variables
  • Provide usage examples
  • Document provider-specific parameters
  • Add installation instructions
  • Test in clean environment
  • Build package: python -m build
  • Publish to PyPI: twine upload dist/*
  • Share with community

Common Patterns

Environment Variable Fallback

def __init__(self, api_key: str | None = None, **kwargs):
    self.api_key = (
        api_key
        or os.environ.get('YOURPROVIDER_API_KEY')
        or os.environ.get('LANGEXTRACT_API_KEY')
    )
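
The precedence implied by the chain above (explicit argument first, then the provider-specific variable, then a generic one) can be checked in isolation. The helper name below is illustrative, not part of LangExtract:

```python
import os

def resolve_api_key(api_key=None):
    # Mirrors the fallback chain: explicit arg wins, then the
    # provider-specific variable, then the generic variable.
    return (
        api_key
        or os.environ.get('YOURPROVIDER_API_KEY')
        or os.environ.get('LANGEXTRACT_API_KEY')
    )

os.environ['LANGEXTRACT_API_KEY'] = 'generic'
os.environ['YOURPROVIDER_API_KEY'] = 'specific'
print(resolve_api_key())            # specific
print(resolve_api_key('explicit'))  # explicit
```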

Parallel Processing

import concurrent.futures

def infer(self, batch_prompts, **kwargs):
    max_workers = kwargs.get('max_workers', 10)
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(self._process, p) for p in batch_prompts]
        # Iterate in submission order (not as_completed) so each yielded
        # output stays aligned with its prompt in batch_prompts
        for future in futures:
            yield [future.result()]

Retry Logic

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def _call_api(self, prompt, **kwargs):
    return self.client.generate(prompt, **kwargs)

Troubleshooting

Plugin Not Loading

# Manually trigger plugin discovery
import langextract as lx
lx.providers.load_plugins_once()

# Verify entry points (pkg_resources is deprecated; use importlib.metadata)
from importlib.metadata import entry_points
for ep in entry_points(group='langextract.providers'):
    print(ep.name, ep.load())

Pattern Not Matching

# Test pattern matching
import re
pattern = r'^yourmodel'
model_id = "yourmodel-3b"
assert re.match(pattern, model_id)

Check Registration

import langextract as lx

# List all registered providers
for pattern, provider in lx.providers.registry.list_entries():
    print(f"{pattern} -> {provider}")

Next Steps

Provider Overview

Learn about the provider architecture

Gemini Provider

Study a production provider implementation

OpenAI Provider

See optional dependency handling

Ollama Provider

Learn about local provider patterns
