Overview
Groq provides low-latency LLM inference for popular open-source models. LiteLLM integrates with Groq’s API and supports streaming, function calling, JSON mode, reasoning models, and audio transcription.
Quick Start
Set API Key
export GROQ_API_KEY="gsk_..."
Make Your First Call
from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Supported Models
Llama Models
Mixtral Models
Gemma Models
Meta’s Llama family on Groq’s infrastructure.
from litellm import completion

# Llama 3.3 70B - Best overall
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

# Llama 3.1 8B - Fast and efficient
response = completion(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Quick summary"}],
)
# Llama 4 Scout - Newer model generation (if available)
response = completion(
    model="groq/meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user", "content": "Complex analysis"}],
)
Mistral’s mixture-of-experts models.
# Mixtral 8x7B
response = completion(
    model="groq/mixtral-8x7b-32768",
    messages=[{"role": "user", "content": "Analyze this..."}],
)
Google’s Gemma models.
# Gemma 2 9B
response = completion(
    model="groq/gemma2-9b-it",
    messages=[{"role": "user", "content": "Help me with..."}],
)

# Gemma 7B
response = completion(
    model="groq/gemma-7b-it",
    messages=[{"role": "user", "content": "Quick task"}],
)
Authentication
Environment Variable
Direct Parameter
Custom Base URL
export GROQ_API_KEY="gsk_..."
from litellm import completion

# The key is read from the GROQ_API_KEY environment variable
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)

from litellm import completion

# Pass the key directly
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="gsk_...",
)

from litellm import completion

# Override the API base URL
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://api.groq.com/openai/v1",
)
Streaming
Groq excels at fast streaming responses.
from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a story about AI"}],
    stream=True,
)

# Print each text delta as it arrives
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Function Calling
Groq supports OpenAI-compatible function calling.
from litellm import completion

# Tool definition in the OpenAI function-calling format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Stock symbol, e.g. AAPL",
                    }
                },
                "required": ["symbol"],
            },
        },
    }
]

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's AAPL stock price?"}],
    tools=tools,
)

# tool_calls is empty or None when the model answers directly
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
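The example above stops at printing the tool call. A minimal sketch of the next step, running the requested function and packaging its result as a `tool` message, might look like the following. `run_tool_call`, `AVAILABLE_TOOLS`, and the placeholder `get_stock_price` body are hypothetical names for illustration, and the `SimpleNamespace` object only mimics the attribute shape of a returned tool call.

```python
import json
from types import SimpleNamespace


def get_stock_price(symbol):
    # Placeholder: a real implementation would query a market data source.
    return {"symbol": symbol, "price": None}


# Maps tool names declared in `tools` to local Python callables.
AVAILABLE_TOOLS = {"get_stock_price": get_stock_price}


def run_tool_call(tool_call):
    """Execute one tool call and return a `tool` role message for the follow-up request."""
    func = AVAILABLE_TOOLS[tool_call.function.name]
    # Arguments arrive as a JSON-encoded string, not a dict.
    kwargs = json.loads(tool_call.function.arguments)
    result = func(**kwargs)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Stand-in with the same attribute shape as a returned tool call (assumption).
demo_call = SimpleNamespace(
    id="call_demo",
    function=SimpleNamespace(name="get_stock_price", arguments='{"symbol": "AAPL"}'),
)
tool_message = run_tool_call(demo_call)
```

To continue the conversation, the assistant message and the returned tool message would be appended to `messages` before calling `completion` again with the same `tools`.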
JSON Mode
import json

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"},
)

# The message content is a JSON string; parse it into Python objects
data = json.loads(response.choices[0].message.content)
Structured outputs with JSON schema validation.
from litellm import completion

schema = {
    "type": "object",
    "properties": {
        "colors": {
            "type": "array",
            "items": {"type": "string"},
        }
    },
    "required": ["colors"],
}
# Supported on models like gpt-oss-120b, llama-4, kimi-k2
response = completion(
    model="groq/meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={
        "type": "json_schema",
        # The OpenAI-style format expects a name alongside the schema
        "json_schema": {"name": "colors", "schema": schema},
    },
)
For models without native JSON schema support, LiteLLM uses function calling as a workaround.
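Whichever path produces the output, the message content still arrives as a JSON string, so it can be worth checking it on the client before use. The sketch below hand-checks the `colors` schema from this section with the standard library only; `parse_colors` is a hypothetical helper, and a general-purpose validator library could replace the manual checks.

```python
import json

# Same schema as in the structured-output example.
schema = {
    "type": "object",
    "properties": {
        "colors": {
            "type": "array",
            "items": {"type": "string"},
        }
    },
    "required": ["colors"],
}


def parse_colors(raw):
    """Parse a JSON string and check it against the `colors` schema by hand.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a subclass
    of ValueError) or on a shape mismatch.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    colors = data["colors"]
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError("'colors' must be a list of strings")
    return colors


colors = parse_colors('{"colors": ["red", "green", "blue"]}')
```

In practice `raw` would be `response.choices[0].message.content` from the call above.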
Reasoning Models
Groq supports reasoning effort for compatible models.
from litellm import completion

# reasoning_effort only applies to reasoning-capable models such as gpt-oss-120b
response = completion(
    model="groq/openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
    reasoning_effort="high",  # low, medium, high
)
# Access reasoning content (the attribute may be absent when no reasoning is returned)
message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)
if reasoning:
    print("Reasoning:", reasoning)
print("Answer:", message.content)
Audio Transcription
Groq supports Whisper for audio transcription.
from litellm import transcription

# Open the audio file in binary mode and send it for transcription
with open("audio.mp3", "rb") as audio_file:
    response = transcription(
        model="groq/whisper-large-v3",
        file=audio_file,
        language="en",
    )
print(response.text)
Configuration
from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=1000,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["STOP"],
)
Supported Parameters
| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Randomness (0-2) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition (-2 to 2) |
| presence_penalty | float | Encourage diversity (-2 to 2) |
| stop | list/str | Stop sequences |
| n | int | Number of completions |
| response_format | dict | JSON mode settings |
| reasoning_effort | str | Reasoning level (low/medium/high) |
Error Handling
from litellm import completion
from litellm.exceptions import APIError, RateLimitError, Timeout

try:
    response = completion(
        model="groq/llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello!"}],
        timeout=30,  # seconds
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except Timeout as e:
    print(f"Request timeout: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
LiteLLM Proxy
model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY
import openai

# Point the OpenAI client at the LiteLLM proxy
client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000",
)

# Use the model_name defined in the proxy config
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
Best Practices
Groq is optimized for speed - use streaming for best UX
Use smaller models (8B) for simple tasks
Use larger models (70B+) for complex reasoning
llama-3.3-70b-versatile for best overall performance
llama-3.1-8b-instant for fast, simple tasks
mixtral-8x7b-32768 for a 32,768-token context window
Groq has generous rate limits but monitor usage
Implement exponential backoff for retries
Use LiteLLM’s built-in retry logic
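The retry advice above can be sketched as a small exponential backoff wrapper. This is an illustrative sketch rather than LiteLLM's own retry logic: `with_backoff` and its parameters are hypothetical, and the demo uses a stand-in function that fails twice instead of a real API call.

```python
import random
import time


def with_backoff(call, retry_on, max_attempts=5, base_delay=1.0,
                 max_delay=30.0, sleep=time.sleep):
    """Call `call()` and retry on `retry_on` exceptions.

    The wait doubles after each failure (capped at `max_delay`), with up to
    10% random jitter added. Hypothetical helper, not part of LiteLLM.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo with a stand-in that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, RuntimeError, sleep=delays.append)
```

With a real request, `call` could be `lambda: completion(model=..., messages=...)` and `retry_on` could be `RateLimitError`. For the built-in route, `completion` is understood to accept a `num_retries` argument, which avoids hand-rolled loops for the common case.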