Groq

Overview

Groq provides low-latency LLM inference for popular open-source models. LiteLLM integrates with Groq’s API and supports streaming, function calling, JSON mode, reasoning models, and audio transcription.

Quick Start

Install LiteLLM

pip install litellm

Set API Key

export GROQ_API_KEY="gsk_..."

Make Your First Call

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Supported Models

Llama Models

Meta’s Llama family on Groq’s infrastructure.

from litellm import completion

# Llama 3.3 70B - Best overall
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Llama 3.1 8B - Fast and efficient
response = completion(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Quick summary"}]
)

# Llama 4 405B - Most capable (if available)
response = completion(
    model="groq/llama-4-405b",
    messages=[{"role": "user", "content": "Complex analysis"}]
)

Mixtral Models

Mistral’s mixture-of-experts models.

# Mixtral 8x7B
response = completion(
    model="groq/mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "Analyze this..."}]
)

Gemma Models

Google’s Gemma models.

# Gemma 2 9B
response = completion(
    model="groq/gemma2-9b-it",
    messages=[{"role": "user", "content": "Help me with..."}]
)

# Gemma 7B
response = completion(
    model="groq/gemma-7b-it",
    messages=[{"role": "user", "content": "Quick task"}]
)

Authentication

Environment Variable

export GROQ_API_KEY="gsk_..."

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}]
)

Direct Parameter

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="gsk_..."
)

Custom Base URL

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.groq.com/openai/v1"
)
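If the key can come from either source, it can help to resolve it once and fail early with a clear message. The following is a minimal sketch, not LiteLLM's own lookup logic: `resolve_groq_key` is a hypothetical helper, and giving the explicit value precedence over the environment is an assumption of this example.

```python
import os
from typing import Mapping, Optional


def resolve_groq_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, otherwise GROQ_API_KEY from env."""
    # Assumption of this sketch: an explicitly passed key wins over the env var.
    key = explicit or env.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("Set GROQ_API_KEY or pass api_key explicitly")
    return key
```

The result can then be passed as `api_key=resolve_groq_key()` in any of the `completion` calls above.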

Streaming

Set stream=True to receive the response incrementally as chunks, which suits Groq's fast token generation.

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a story about AI"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
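When the full text is also needed after streaming (for logging or caching, say), the chunks can be accumulated while they are printed. This is a minimal sketch that assumes each chunk has the `choices[0].delta.content` shape used above; `collect_stream` is a hypothetical helper name, not part of LiteLLM.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any], echo: bool = True) -> str:
    """Print streamed text as it arrives and return the assembled reply."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # Some chunks (for example the final one) have no text content.
        if piece:
            parts.append(piece)
            if echo:
                print(piece, end="", flush=True)
    return "".join(parts)
```

With the streaming call above, `full_text = collect_stream(response)` would replace the `for` loop.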

Function Calling

Groq supports OpenAI-compatible function calling.

from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Stock symbol, e.g. AAPL"
                    }
                },
                "required": ["symbol"]
            }
        }
    }
]

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's AAPL stock price?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
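Printing the tool call is only the first half of the loop: the application normally runs the function and returns the result to the model as a `tool` message. The sketch below shows one way to do that, following the OpenAI-style message format. `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and `get_stock_price` is a stand-in that returns no real price.

```python
import json
from typing import Any, Callable, Dict


def get_stock_price(symbol: str) -> Dict[str, Any]:
    """Hypothetical stand-in; a real version would query a market-data source."""
    return {"symbol": symbol, "price": None}


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_stock_price": get_stock_price}


def run_tool_call(tool_call: Any) -> Dict[str, str]:
    """Execute one tool call and build the `tool` message to send back."""
    func = TOOL_REGISTRY[tool_call.function.name]
    # The model returns the arguments as a JSON-encoded string.
    kwargs = json.loads(tool_call.function.arguments)
    result = func(**kwargs)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "name": tool_call.function.name,
        "content": json.dumps(result),
    }
```

To get the final answer, append the assistant message and the returned `tool` message to `messages`, then call `completion` again with the same `tools`.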

JSON Mode

JSON Object

import json

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"}
)

# The reply is a JSON string; parse it before use
data = json.loads(response.choices[0].message.content)

JSON Schema

Structured outputs with JSON schema validation.

from litellm import completion

schema = {
    "type": "object",
    "properties": {
        "colors": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["colors"]
}

# Supported on models like gpt-oss-120b, llama-4, kimi-k2
response = completion(
    model="groq/llama-4-405b",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={
        "type": "json_schema",
        # OpenAI-style json_schema payloads carry a name alongside the schema
        "json_schema": {"name": "colors", "schema": schema}
    }
)

For models without native JSON schema support, LiteLLM uses function calling as a workaround.
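Whichever path produces the JSON, it can be worth checking the parsed reply before using it. The sketch below hand-checks the `colors` schema from the example above using only the standard library. `parse_colors` is a hypothetical helper, and a full JSON Schema validator would be the more general choice.

```python
import json
from typing import List


def parse_colors(content: str) -> List[str]:
    """Parse a JSON reply and check it matches the `colors` schema above."""
    data = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    colors = data.get("colors")
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError("'colors' must be a list of strings")
    return colors
```

It would be called as `parse_colors(response.choices[0].message.content)`.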

Reasoning Models

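Pass reasoning_effort to control how much reasoning a compatible model performs. When the model returns its reasoning, it is available as reasoning_content on the message.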

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
    reasoning_effort="high"  # low, medium, high
)

# Access reasoning content (only present when the model returns it)
message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)
if reasoning:
    print("Reasoning:", reasoning)
print("Answer:", message.content)

Audio Transcription

Groq supports Whisper for audio transcription.

from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    response = transcription(
        model="groq/whisper-large-v3",
        file=audio_file,
        language="en"
    )
    
print(response.text)

Configuration

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=1000,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["STOP"]
)

Supported Parameters

Parameter	Type	Description
`temperature`	float	Randomness (0-2)
`max_tokens`	int	Max output tokens
`max_completion_tokens`	int	Alternative to max_tokens
`top_p`	float	Nucleus sampling
`frequency_penalty`	float	Reduce repetition (-2 to 2)
`presence_penalty`	float	Encourage diversity (-2 to 2)
`stop`	list/str	Stop sequences
`n`	int	Number of completions
`response_format`	dict	JSON mode settings
`reasoning_effort`	str	Reasoning level (low/medium/high)
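The ranges in the table can be checked on the client before a request is sent. This is a small sketch under that assumption: `check_params` is a hypothetical helper, and it only covers the parameters for which the table lists a range.

```python
from typing import Any, Dict

# Ranges taken from the table above; other parameters are passed through unchecked.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def check_params(**params: Any) -> Dict[str, Any]:
    """Raise ValueError for out-of-range values, otherwise return the params."""
    for name, (low, high) in PARAM_RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return params
```

The checked values can be unpacked straight into the call, for example `completion(model=..., messages=..., **check_params(temperature=0.7, max_tokens=1000))`.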

Error Handling

from litellm import completion
from litellm.exceptions import APIError, RateLimitError, Timeout

try:
    response = completion(
        model="groq/llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello!"}],
        timeout=30
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except Timeout as e:
    print(f"Request timeout: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")

LiteLLM Proxy

Add the Groq model to the proxy's config.yaml:

model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY

Then call the proxy with the OpenAI client, using the model_name from the config:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

Speed Optimization

Groq is optimized for low latency, so stream responses to show output as soon as it is generated
Use smaller models (8B) for simple tasks
Use larger models (70B+) for complex reasoning

Model Selection

llama-3.3-70b-versatile for best overall performance
llama-3.1-8b-instant for fast, simple tasks
mixtral-8x7b-32768 for prompts that need its 32,768-token context window

Rate Limits

Groq enforces rate limits, so monitor your usage
Implement exponential backoff for retries
Use LiteLLM’s built-in retry logic
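Exponential backoff can be sketched as a small wrapper around any call. The version below is a generic illustration, not LiteLLM's built-in retry logic: `retry_with_backoff` and its parameters are hypothetical names, and the doubling schedule with 10% jitter is one common choice among several.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on the given exceptions with growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Double the wait on each attempt, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

With the imports from the Error Handling section, a call would look like `retry_with_backoff(lambda: completion(model=..., messages=...), retry_on=(RateLimitError,))`. For the built-in route, `completion` also accepts a `num_retries` argument.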
