Local AI Pack

The Local AI skill pack enables completely local AI inference without external API dependencies. Run language models and speech-to-text on your own hardware.

Included Services

Ollama

Local LLM inference for chat and embeddings

Whisper

Speech-to-text transcription

Skills Provided

Ollama Local LLM

Capabilities:

Chat completion
Text generation
Code generation
Text embeddings for RAG
JSON-structured output
Multi-turn conversations
Streaming responses

Example Usage:

# Chat completion
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "stream": false
  }'

# Text generation
curl -X POST "http://ollama:11434/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama",
    "prompt": "Write a Python function to calculate fibonacci numbers",
    "stream": false
  }'

# Generate embeddings
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Text to embed", "Another text"]
  }'

# JSON output mode
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "List 3 programming languages"}
    ],
    "format": "json",
    "stream": false
  }'
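
The examples above all set "stream": false. With streaming enabled, Ollama returns newline-delimited JSON objects, each carrying a fragment of the reply in message.content and a done flag on the last one. The sketch below shows one way to consume that stream; the names createChatStreamParser and streamChat are illustrative, and the fetch loop assumes Node 18 or later, where response.body can be iterated asynchronously.

```javascript
// Incrementally parses a newline-delimited JSON chat stream.
// A network chunk may end mid-line, so incomplete text is buffered
// until the rest of the line arrives.
function createChatStreamParser() {
  let buffer = "";
  let text = "";
  let done = false;

  return {
    // Feed one decoded text chunk; returns the content fragments it completed.
    push(chunk) {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // trailing partial line, or "" if chunk ended on "\n"
      const fragments = [];
      for (const line of lines) {
        if (line.trim() === "") continue;
        const event = JSON.parse(line);
        if (event.message && event.message.content) {
          fragments.push(event.message.content);
          text += event.message.content;
        }
        if (event.done) done = true;
      }
      return fragments;
    },
    get text() { return text; },
    get done() { return done; },
  };
}

// Hypothetical helper: streams a chat request and reports each fragment.
async function streamChat(body, onFragment) {
  const response = await fetch("http://ollama:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...body, stream: true }),
  });
  const parser = createChatStreamParser();
  const decoder = new TextDecoder();
  for await (const chunk of response.body) {
    const decoded = decoder.decode(chunk, { stream: true });
    for (const fragment of parser.push(decoded)) onFragment(fragment);
  }
  return parser.text;
}
```

Keeping the parser separate from the network call makes it easy to check against recorded output without a running Ollama instance.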

Whisper Transcribe

Capabilities:

Audio transcription
Multiple language support
Speaker diarization
Timestamp generation
Various audio formats
Subtitle generation (SRT, VTT)

Example Usage:

# Transcribe audio file
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=json" \
  -F "audio_file=@/data/audio/recording.mp3"

# Response:
{
  "text": "Hello, this is a test recording.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test recording."
    }
  ]
}

# Transcribe with timestamps (SRT format)
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=srt" \
  -F "audio_file=@/data/audio/recording.mp3"

# Translate to English
curl -X POST "http://whisper:9000/asr?task=translate&output=json" \
  -F "audio_file=@/data/audio/spanish.mp3"
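
If you already hold the JSON response shown above, its segments array carries enough information to produce subtitles without a second request. The following is a minimal sketch under that assumption; toSrtTimestamp and segmentsToSrt are illustrative names, not part of the Whisper service.

```javascript
// Formats a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Converts segments of the form { start, end, text } into SRT cues.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => {
      const range = `${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}`;
      return `${i + 1}\n${range}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}
```

For a single request, asking the service for output=srt directly is simpler; a conversion like this is mainly useful when the JSON is stored and subtitles are needed later.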

Use Cases

RAG (Retrieval-Augmented Generation)

Build a complete local RAG system:

# 1. Generate embeddings for documents
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Document content here..."]
  }'

# 2. Store in Qdrant (from Knowledge Base pack)
curl -X PUT "http://qdrant:6333/collections/docs/points" \
  -H "Content-Type: application/json" \
  -d '{
    "points": [{
      "id": 1,
      "vector": [...embedding...],
      "payload": {"text": "Document content"}
    }]
  }'

# 3. Query: Generate query embedding
QUERY_EMBEDDING=$(curl -s -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["What is quantum computing?"]
  }' | jq -r '.embeddings[0]')

# 4. Search Qdrant (with_payload returns the stored text with each hit)
CONTEXT=$(curl -s -X POST "http://qdrant:6333/collections/docs/points/search" \
  -H "Content-Type: application/json" \
  -d "{
    \"vector\": $QUERY_EMBEDDING,
    \"limit\": 5,
    \"with_payload\": true
  }" | jq -r '[.result[].payload.text] | join("\n\n")')

# 5. Generate answer with context (jq escapes the context so the JSON stays valid)
jq -n --arg context "$CONTEXT" '{
    model: "llama3.2",
    messages: [
      {role: "system", content: ("Answer based on this context: " + $context)},
      {role: "user", content: "What is quantum computing?"}
    ],
    stream: false
  }' | curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d @-
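
The hand-off between the search step and the chat step is where this pipeline is most fragile, because retrieved text has to be embedded in a JSON request. A sketch of that step in JavaScript follows. It assumes the search was made with with_payload enabled so that each hit carries the payload.text field stored in step 2; buildRagMessages is an illustrative name.

```javascript
// Builds the chat messages for step 5 from a Qdrant search response.
// Assumes hits look like { id, score, payload: { text } }.
function buildRagMessages(searchResponse, question) {
  const passages = (searchResponse.result || [])
    .map((hit) => hit.payload && hit.payload.text)
    .filter((text) => typeof text === "string" && text.length > 0);

  // Number the passages so the model can tell them apart.
  const context = passages.map((text, i) => `[${i + 1}] ${text}`).join("\n");

  return [
    { role: "system", content: `Answer based on this context:\n${context}` },
    { role: "user", content: question },
  ];
}
```

Passing the returned array as messages through JSON.stringify handles quoting and newlines in the retrieved text.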

Video Transcription Pipeline

Combine with the Video Creator pack:

# 1. Extract audio from video (FFmpeg)
ffmpeg -i /data/videos/lecture.mp4 \
  -vn -ar 16000 -ac 1 \
  /data/audio/lecture.wav

# 2. Transcribe with Whisper
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=srt" \
  -F "audio_file=@/data/audio/lecture.wav" \
  -o /data/subtitles/lecture.srt

# 3. Burn subtitles into video
ffmpeg -i /data/videos/lecture.mp4 \
  -vf "subtitles=/data/subtitles/lecture.srt" \
  /data/output/lecture_subtitled.mp4

# 4. Generate summary with Ollama (jq escapes quotes and newlines in the transcript)
TRANSCRIPT=$(cat /data/subtitles/lecture.srt)
jq -n --arg transcript "$TRANSCRIPT" '{
    model: "llama3.2",
    messages: [
      {role: "system", content: "Summarize this lecture transcript."},
      {role: "user", content: $transcript}
    ],
    stream: false
  }' | curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d @-
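
An SRT file spends much of its length on cue numbers and timestamps, which add tokens without helping a summary. The sketch below strips them before the transcript is sent to the model. srtToPlainText is an illustrative helper, and it makes one simplifying assumption: a line that consists only of digits is treated as a cue number, so a spoken line that is purely numeric would also be dropped.

```javascript
// Reduces SRT subtitle text to the spoken lines, joined by spaces.
function srtToPlainText(srt) {
  return srt
    .replace(/\r/g, "")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => {
      if (line === "") return false;           // blank separator
      if (/^\d+$/.test(line)) return false;    // cue number
      if (line.includes("-->")) return false;  // timestamp range
      return true;
    })
    .join(" ");
}
```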

Code Assistant

Local code generation and review:

# Generate code
curl -X POST "http://ollama:11434/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama",
    "prompt": "Write a REST API endpoint in Python using FastAPI for user registration",
    "stream": false
  }'

# Code review (jq JSON-escapes the file contents)
jq -n --arg code "$(cat app.py)" '{
    model: "codellama",
    messages: [
      {
        role: "system",
        content: "You are a code reviewer. Find bugs and suggest improvements."
      },
      {
        role: "user",
        content: ("Review this code: " + $code)
      }
    ],
    stream: false
  }' | curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d @-
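
Source files contain quotes, backslashes, and newlines, so the review request is easiest to build with a JSON serializer rather than string concatenation. A minimal sketch, where buildReviewRequest is an illustrative name and the default model is the one used above:

```javascript
// Returns the JSON request body for a code review of `source`.
function buildReviewRequest(source, model = "codellama") {
  return JSON.stringify({
    model,
    messages: [
      {
        role: "system",
        content: "You are a code reviewer. Find bugs and suggest improvements.",
      },
      { role: "user", content: `Review this code:\n${source}` },
    ],
    stream: false,
  });
}
```

The returned string can be passed directly as the body of a POST to http://ollama:11434/api/chat.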

Chatbot with Memory

Build a stateful chatbot:

// Store conversation in Redis (from DevOps pack)
const conversationKey = `chat:${userId}:history`;

// Add user message
await redis.rpush(conversationKey, JSON.stringify({
  role: 'user',
  content: userMessage
}));

// Get conversation history
const history = await redis.lrange(conversationKey, -10, -1);
const messages = history.map((entry) => JSON.parse(entry));

// Generate response
const response = await fetch('http://ollama:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: messages,
    stream: false
  })
});

// Store assistant response
const answer = await response.json();
await redis.rpush(conversationKey, JSON.stringify({
  role: 'assistant',
  content: answer.message.content
}));

// Set expiry (24 hours)
await redis.expire(conversationKey, 86400);
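
The snippet above sends only the stored turns, so any system prompt has to be added at request time, and a fixed window of ten entries can begin with an assistant reply whose question has already scrolled out. The sketch below handles both cases. buildMessages and its parameters are illustrative and assume history holds the JSON strings returned by lrange.

```javascript
// Builds the messages array from stored history entries (JSON strings).
function buildMessages(history, systemPrompt, maxMessages = 10) {
  const recent = history
    .slice(-maxMessages)
    .map((entry) => JSON.parse(entry));

  // Drop leading non-user messages so the window starts with a user turn.
  while (recent.length > 0 && recent[0].role !== "user") {
    recent.shift();
  }

  return [{ role: "system", content: systemPrompt }, ...recent];
}
```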

Recommended Models

General Purpose

Model	Size	Use Case	Memory
`llama3.2`	3B	Fast chat and reasoning	4 GB
`llama3.1:70b`	70B	Complex reasoning	40 GB
`mistral`	7B	Balanced performance	5 GB
`phi3`	3.8B	Efficient reasoning	4 GB

Code Generation

Model	Size	Use Case	Memory
`codellama`	7B	Code generation	5 GB
`codellama:13b`	13B	Advanced code tasks	8 GB
`deepseek-coder`	6.7B	Multi-language coding	5 GB

Embeddings

Model	Size	Dimensions	Memory
`nomic-embed-text`	137M	768	1 GB
`mxbai-embed-large`	335M	1024	2 GB
`all-minilm`	23M	384	512 MB
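
The Dimensions column matters when vectors are stored: a Qdrant collection is created with a fixed vector size, and it must match the embedding model. The sketch below derives a collection configuration from the table above. The function names are illustrative, and the Cosine distance is an assumption rather than something this pack prescribes.

```javascript
// Embedding dimensions, copied from the table above.
const EMBEDDING_DIMENSIONS = {
  "nomic-embed-text": 768,
  "mxbai-embed-large": 1024,
  "all-minilm": 384,
};

// Body for creating a collection, e.g. PUT http://qdrant:6333/collections/docs
function collectionConfig(model) {
  const size = EMBEDDING_DIMENSIONS[model];
  if (size === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  return { vectors: { size, distance: "Cosine" } }; // distance is an assumption
}

// Returns true when a vector has the length expected for the model.
function hasExpectedDimensions(model, vector) {
  return Array.isArray(vector) && vector.length === EMBEDDING_DIMENSIONS[model];
}
```

Switching embedding models later means creating a new collection and re-embedding the documents, since vectors of different sizes cannot share one.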

Managing Models

# List installed models
curl "http://ollama:11434/api/tags"

# Pull a new model
curl -X POST "http://ollama:11434/api/pull" \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'

# Delete a model
curl -X DELETE "http://ollama:11434/api/delete" \
  -H "Content-Type: application/json" \
  -d '{"name": "old-model"}'

# Show model info
curl -X POST "http://ollama:11434/api/show" \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'
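
A common use of these endpoints is to pull a model only when it is missing. The sketch below checks a parsed /api/tags response, which lists installed models under models with names such as llama3.2:latest. The helper names are illustrative, and the tag handling assumes plain library names without a registry host.

```javascript
// Adds the default tag when a model name has none.
function normalizeModelName(name) {
  return name.includes(":") ? name : `${name}:latest`;
}

// Returns true if `name` appears in a parsed /api/tags response.
function isModelInstalled(tagsResponse, name) {
  const wanted = normalizeModelName(name);
  return (tagsResponse.models || []).some(
    (model) => normalizeModelName(model.name) === wanted
  );
}
```

When the check returns false, the pull request shown above can be issued before the first chat call.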

Configuration

Environment Variables

# Ollama
OLLAMA_HOST=ollama
OLLAMA_PORT=11434
OLLAMA_MODELS=/data/ollama/models  # Model storage

# Whisper
WHISPER_HOST=whisper
WHISPER_PORT=9000
WHISPER_MODEL=base  # tiny, base, small, medium, large
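
Client code can read these variables instead of hard-coding the service addresses. A small sketch, assuming the variables are visible to the client process and falling back to the values listed above; serviceUrls is an illustrative name.

```javascript
// Builds base URLs from an environment-like object (e.g. process.env).
function serviceUrls(env) {
  const ollamaHost = env.OLLAMA_HOST || "ollama";
  const ollamaPort = env.OLLAMA_PORT || "11434";
  const whisperHost = env.WHISPER_HOST || "whisper";
  const whisperPort = env.WHISPER_PORT || "9000";
  return {
    ollama: `http://${ollamaHost}:${ollamaPort}`,
    whisper: `http://${whisperHost}:${whisperPort}`,
  };
}
```

For example, serviceUrls(process.env).ollama + "/api/chat" yields the chat endpoint used throughout this page when the defaults apply.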

Volume Mounts

Models persist across restarts:

services:
  ollama:
    volumes:
      - ollama_models:/root/.ollama
  
  whisper:
    volumes:
      - whisper_models:/root/.cache/whisper

volumes:
  ollama_models:
  whisper_models:

Memory Requirements

Ollama

Memory depends on model size:

Small models (3B-7B): 4-6 GB
Medium models (13B-30B): 10-20 GB
Large models (70B+): 40+ GB

GPU acceleration is recommended for larger models.
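
To turn these figures into a quick capacity check, the sketch below filters a subset of the Recommended Models tables by available memory. The memory values are the approximate ones listed on this page, not measurements, and modelsThatFit and reservedGb are illustrative names.

```javascript
// Approximate memory needs in GB, taken from the Recommended Models tables.
const MODEL_MEMORY_GB = [
  { name: "llama3.2", memoryGb: 4 },
  { name: "phi3", memoryGb: 4 },
  { name: "mistral", memoryGb: 5 },
  { name: "codellama", memoryGb: 5 },
  { name: "deepseek-coder", memoryGb: 5 },
  { name: "codellama:13b", memoryGb: 8 },
];

// Lists models that fit in availableGb after setting aside reservedGb
// (for example, for a Whisper model running on the same host).
function modelsThatFit(availableGb, reservedGb = 0) {
  const budget = availableGb - reservedGb;
  return MODEL_MEMORY_GB
    .filter((model) => model.memoryGb <= budget)
    .map((model) => model.name);
}
```

With 8 GB available and 1 GB reserved for the Whisper base model, the 13B model is excluded while the 3B to 7B models remain.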

Whisper

Memory depends on model variant:

tiny: ~1 GB
base: ~1 GB
small: ~2 GB
medium: ~5 GB
large: ~10 GB

Total Pack: ~4-8 GB minimum (with small models)

Performance Tips

Ollama

Use GPU if available: docker run --gpus all
Set the num_gpu option to control how many layers are offloaded to the GPU
Lower temperature for consistent output
Set a fixed seed for reproducible results
Set stream: false to receive the full response as a single JSON object
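
In the Ollama API, temperature, seed, and num_gpu are passed per request in an options object. The sketch below adds them to an existing request body. withReproducibleOptions is an illustrative name, and the default values are placeholders rather than recommendations from this pack.

```javascript
// Returns a copy of `body` with sampling options set for repeatable output.
function withReproducibleOptions(body, settings = {}) {
  const { seed = 42, temperature = 0, numGpu } = settings;
  const options = { ...(body.options || {}), seed, temperature };
  if (numGpu !== undefined) {
    options.num_gpu = numGpu; // number of layers to offload to the GPU
  }
  return { ...body, options };
}
```

The original body is left untouched, so one base request can be reused with different settings.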

Whisper

Use the base or small model for near-real-time transcription
Convert audio to 16kHz mono WAV for best performance
Use tiny model for quick drafts, medium for accuracy
Enable GPU acceleration for large models

Embedding Generation

Batch embeddings for efficiency:

# Single request with multiple inputs
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": [
      "Document 1 text",
      "Document 2 text",
      "Document 3 text"
    ]
  }'
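
For a large corpus, one request holding every document can become unwieldy, so a middle ground is to send fixed-size batches. The sketch below splits a list of texts into request bodies for /api/embed. The batch size of 32 is an arbitrary assumption to tune for your hardware, and the function names are illustrative.

```javascript
// Splits `items` into consecutive arrays of at most `batchSize` elements.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Builds one /api/embed request body per batch.
function embedRequestBodies(texts, batchSize = 32, model = "nomic-embed-text") {
  return toBatches(texts, batchSize).map((input) => ({ model, input }));
}
```

Each response returns its embeddings in the same order as its input array, so concatenating the results batch by batch keeps them aligned with the original texts.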

GPU Acceleration

NVIDIA GPU

Enable GPU support in docker-compose:

services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Verify GPU is detected:

docker exec ollama nvidia-smi

Next Steps

Knowledge Base Pack

Video Creator Pack
