Step 4: Bring Your Own Data

Overview

This step lets you plug in any supported data file and start querying it with a local AI model within minutes. You can load a single file, or several files at once for cross-file exploration, with no code changes and no configuration.

Duration: ~3-5 minutes (interactive)

What you’ll do:

Drop files into the userdata/ directory or provide a file path
Analyze file metadata (type, size, word count, content preview)
Build a vector index and get an AI summary
Ask questions about your data in an interactive Q&A loop
See cost comparison on every query

Prerequisites

Complete Step 2: RAG with Civic Data first, then install the file reader:

pip install llama-index-readers-file

Supported file types

Extension	Type	Notes
`.txt`	Plain text	Simplest option, works everywhere
`.pdf`	PDF document	Requires `llama-index-readers-file` (already installed)
`.csv`	CSV spreadsheet	Read as text content
`.docx`	Word document	Requires `llama-index-readers-file`

For PDFs: Text-based PDFs work. Image-only or encrypted PDFs may not extract text successfully.

Running the script

Auto-discovery from userdata/

Drop files into the userdata/ directory before running:

python scripts/demo_step4_byod.py

What happens next depends on how many supported files the script finds:

0 files found: Prompt for a file path
1 file found: Automatically use that file
2+ files found: Show a numbered list to pick from (or type a to load all)
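The three cases above can be sketched as a small selection function. This is a minimal illustration, not the script's actual code: choose_files and its ask parameter are hypothetical names, and the prompt wording is invented.

```python
from pathlib import Path
from typing import Callable, List


def choose_files(found: List[Path], ask: Callable[[str], str] = input) -> List[Path]:
    """Pick which files to load, based on how many were discovered."""
    if not found:
        # 0 files: fall back to asking for an explicit path.
        return [Path(ask("Path to a data file: ").strip())]
    if len(found) == 1:
        # 1 file: use it without prompting.
        return list(found)
    # 2+ files: show a numbered list; 'a' loads everything.
    for number, path in enumerate(found, start=1):
        print(f"  {number}. {path.name}")
    while True:
        choice = ask("Pick a number (or 'a' for all): ").strip().lower()
        if choice == "a":
            return list(found)
        if choice.isdigit() and 1 <= int(choice) <= len(found):
            return [found[int(choice) - 1]]
        print("  Please enter a listed number or 'a'.")
```

Passing ask as a parameter keeps the function easy to try without typing at a prompt, for example choose_files(files, ask=lambda _: "a").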

Load all files at once

python scripts/demo_step4_byod.py --all

This loads every supported file in userdata/ into a single combined index, enabling cross-document questions like:

“Compare the findings across these reports”
“What themes are common across all the data?”
“Which document discusses budget constraints?”

With a specific file path

# With a text file
python scripts/demo_step4_byod.py path/to/your/file.txt

# With a PDF
python scripts/demo_step4_byod.py ~/Downloads/report.pdf

# With a CSV
python scripts/demo_step4_byod.py ~/Documents/data.csv

Use a different model

python scripts/demo_step4_byod.py myfile.txt --model phi3:mini

Command-line options

Option	Default	Description
`file`	(auto-discover)	Path to data file (positional, optional)
`--all`	off	Load ALL files in `userdata/` into a single index for cross-file exploration
`--model`	`llama3.1`	Ollama model to use (lets you try different models)

Use --help to see all options:

python scripts/demo_step4_byod.py --help
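For reference, the options in the table could be declared with argparse roughly as follows. This is a sketch under assumptions: the help strings are paraphrased from the table, and build_parser and the load_all attribute name are hypothetical, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented options: file, --all, --model."""
    parser = argparse.ArgumentParser(
        prog="demo_step4_byod.py",
        description="Bring Your Own Data: query your own files with a local model.",
    )
    # Optional positional argument; None means "auto-discover from userdata/".
    parser.add_argument("file", nargs="?", default=None, help="Path to data file")
    # Boolean flag, off by default. Stored as load_all (a hypothetical name).
    parser.add_argument(
        "--all",
        dest="load_all",
        action="store_true",
        help="Load all files in userdata/ into a single index",
    )
    parser.add_argument("--model", default="llama3.1", help="Ollama model to use")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.load_all, args.model)
```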

Expected output

File analysis

════════════════════════════════════════════════════════════
  CIVICHACKS 2026 — Bring Your Own Data
════════════════════════════════════════════════════════════

⚙️  Configuring local AI stack...
   Host: YOUR-HOSTNAME
   Time: February 21, 2026 at 02:15:30 PM
   Model: llama3.1 (via Ollama — running on YOUR-HOSTNAME)
   Embeddings: all-MiniLM-L6-v2 (runs on CPU)

────────────────────────────────────────────────────────────
  📄 File Analysis
────────────────────────────────────────────────────────────

   File:      boston_budget_2026.pdf
   Path:      /Users/attendee/Downloads/boston_budget_2026.pdf
   Type:      PDF document
   Size:      2.4 MB
   Modified:  February 18, 2026

   Content:   3 document(s), 45,230 characters, ~8,120 words

   Preview:
   "CITY OF BOSTON FISCAL YEAR 2026 OPERATING BUDGET..."

────────────────────────────────────────────────────────────

Index building

🔍 Building vector index (this is the 'RAG' magic)...
   Index built in 2.3s

AI summary

────────────────────────────────────────────────────────────
  🤖 AI Summary of: boston_budget_2026.pdf
────────────────────────────────────────────────────────────

[Streamed AI response covering:
 1. What the document is about (topic and scope)
 2. Key data points or findings (citing specific numbers)
 3. Three questions someone might want to ask]

⏱️  8.4s · ~185 tokens
⚡ Local: $0.000010 (0.035 Wh @ 15W) · GPT-4o: $0.0023 (230x more)

Interactive Q&A

════════════════════════════════════════════════════════════
  💬 Interactive Q&A — Ask anything about your data
     Type 'quit' to end | 'help' for commands
════════════════════════════════════════════════════════════

  [You] >> What are the biggest budget increases this year?

────────────────────────────────────────────────────────────
  💬 Question: What are the biggest budget increases this year?
────────────────────────────────────────────────────────────

  🤖 Answer:

[Streamed AI answer grounded in the document data]

⏱️  7.1s · ~142 tokens
⚡ Local: $0.000007 (0.029 Wh @ 15W) · GPT-4o: $0.0018 (257x more)

  [You] >> quit

════════════════════════════════════════════════════════════
  ✅ Session complete — 1 question answered
     All processing done locally on YOUR-HOSTNAME.
     Zero data sent to the cloud.
════════════════════════════════════════════════════════════

Interactive commands

Command	Action
(any question)	Query the AI about your data
`summary`	Re-generate the AI summary
`help`	Show available commands
`quit` / `exit` / `q`	End the session
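One way to route these commands is to classify each line of input before deciding what to do with it. The sketch below is illustrative only: classify_input is a hypothetical helper, and treating blank input as "skip" and matching commands case-insensitively are assumptions rather than documented behavior.

```python
def classify_input(raw: str) -> str:
    """Map a line typed at the prompt to one of the actions in the table."""
    text = raw.strip().lower()
    if not text:
        # Assumption: blank lines are ignored rather than sent to the model.
        return "skip"
    if text in ("quit", "exit", "q"):
        return "quit"
    if text in ("help", "summary"):
        return text
    # Anything else is treated as a question about the data.
    return "query"
```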

How it works

The script performs these steps:

File discovery and validation

find_userdata_files() scans the userdata/ directory for supported file types
validate_file() resolves the path, checks extension and file size, handles drag-and-drop quote stripping
Displays file metadata (type, size, modified date)
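The discovery and validation steps above might look something like this. The function names here (find_supported_files, check_file) are stand-ins for find_userdata_files() and validate_file(); their signatures, error messages, and the exact order of checks are assumptions, not the script's real implementation.

```python
from pathlib import Path
from typing import List

# Extensions listed under "Supported file types".
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".csv", ".docx"}


def find_supported_files(directory: Path = Path("userdata")) -> List[Path]:
    """Return supported files in the directory, sorted by name."""
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def check_file(raw_path: str) -> Path:
    """Resolve a user-supplied path and reject unusable files."""
    # Terminals often wrap dragged-and-dropped paths in quotes.
    cleaned = raw_path.strip().strip("'\"")
    path = Path(cleaned).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    if path.stat().st_size == 0:
        raise ValueError("File is empty")
    return path
```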

Load documents

Uses LlamaIndex’s SimpleDirectoryReader to load the file:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=[str(filepath)]).load_data()

For PDFs, this extracts the text content. CSVs are read as plain text.
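The "Content" line in the file analysis (documents, characters, words) can be derived from the loaded text. The sketch below works on plain strings so it runs without LlamaIndex; in the script you would likely pass something like [doc.text for doc in documents]. content_stats is a hypothetical helper, and counting words by splitting on whitespace is an assumption about how the approximate word count is produced.

```python
from typing import Dict, List


def content_stats(texts: List[str]) -> Dict[str, int]:
    """Count documents, characters, and whitespace-separated words."""
    return {
        "documents": len(texts),
        "characters": sum(len(t) for t in texts),
        "words": sum(len(t.split()) for t in texts),
    }


if __name__ == "__main__":
    stats = content_stats(["CITY OF BOSTON", "FISCAL YEAR 2026"])
    print(
        f"Content:   {stats['documents']} document(s), "
        f"{stats['characters']:,} characters, ~{stats['words']:,} words"
    )
```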

Build vector index

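As in Step 2, the script builds an in-memory vector index:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
# Stream tokens as they are generated; retrieve the 3 most similar chunks per query
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)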
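To make similarity_top_k concrete, here is a toy version of top-k retrieval using cosine similarity over hand-written two-dimensional vectors. It is a conceptual sketch only: real retrieval uses embeddings from the configured model and LlamaIndex's own retriever, and the cosine and top_k helpers below are illustrative names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy "embeddings": the first axis loosely means schools, the second parks.
    chunks = [
        ("schools", [1.0, 0.0]),
        ("parks", [0.0, 1.0]),
        ("school buses", [0.9, 0.1]),
        ("roads", [0.5, 0.5]),
    ]
    print(top_k([1.0, 0.0], chunks, k=2))
```

A larger k passes more chunks to the model as context, which can help with broad questions at the cost of a longer prompt.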

Generate AI summary

Queries the index with a summary prompt:

SUMMARY_PROMPT = (
    "You are analyzing a document that was just loaded. "
    "Provide a concise summary covering: "
    "1) What this document is about (topic and scope), "
    "2) Key data points or findings (cite specific numbers if present), "
    "3) Three questions someone might want to ask about this data. "
    "Keep it under 200 words."
)

response = query_engine.query(SUMMARY_PROMPT)
response.print_response_stream()

When multiple files are loaded, the script uses a MULTI_SUMMARY_PROMPT variant instead.

Interactive Q&A loop

Runs a REPL-style loop:

while True:
    user_input = input("  [You] >> ").strip()
    if user_input.lower() in ("quit", "exit", "q"):
        break

    response = query_engine.query(user_input)
    response.print_response_stream()

Each question is an independent RAG query with cost comparison.
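The energy figure in the cost line is consistent with watts × seconds ÷ 3600: 15 W for 8.4 s is about 0.035 Wh. The sketch below reproduces that arithmetic and the "Nx more" ratio. The helper names are hypothetical, and the electricity rate is a placeholder you would need to supply, since the rate the script uses is not stated here.

```python
def energy_wh(watts: float, seconds: float) -> float:
    """Energy in watt-hours for a device drawing `watts` for `seconds`."""
    return watts * seconds / 3600.0


def local_cost_usd(wh: float, usd_per_kwh: float) -> float:
    """Electricity cost of `wh` watt-hours at a given (assumed) rate."""
    return wh / 1000.0 * usd_per_kwh


def cost_ratio(cloud_usd: float, local_usd: float) -> int:
    """How many times more the cloud query costs, rounded to a whole number."""
    return round(cloud_usd / local_usd)


if __name__ == "__main__":
    wh = energy_wh(15, 8.4)  # power draw and duration from the sample output
    print(f"{wh:.3f} Wh @ 15W")
    # Dollar figures taken from the sample output above.
    print(f"{cost_ratio(0.0023, 0.000010)}x more")
```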

Cross-file exploration with --all

The --all flag loads every file into a single combined index:

python scripts/demo_step4_byod.py --all

Example output:

════════════════════════════════════════════════════════════
  📂 Loading 3 files from userdata/
════════════════════════════════════════════════════════════

   📄 report_2024.pdf  (1.8 MB, PDF document)
   📄 report_2025.pdf  (2.1 MB, PDF document)
   📄 budget_analysis.txt  (45 KB, Plain text)

   ──────────────────────────────────────────────────────────
   Total:     8 document(s), 123,456 characters, ~22,345 words
   Combined:  3.9 MB
════════════════════════════════════════════════════════════

Now you can ask cross-document questions:

“What changed between the 2024 and 2025 reports?”
“Which document discusses staffing shortages?”
“What themes appear across all three files?”

File size limits

Files larger than 10 MB trigger a warning: indexing takes longer, but it still works. Very large files (>100 MB) may run out of memory on machines with limited RAM.
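The two thresholds can be expressed as a simple size check. This is a sketch, not the script's code: size_note and human_size are hypothetical helpers, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
MB = 1024 * 1024          # assumption: binary megabytes
WARN_BYTES = 10 * MB      # above this, warn that indexing will be slower
RISKY_BYTES = 100 * MB    # above this, memory may run out on small machines


def size_note(size_bytes: int) -> str:
    """Classify a file size as 'ok', 'warn', or 'risky'."""
    if size_bytes > RISKY_BYTES:
        return "risky"
    if size_bytes > WARN_BYTES:
        return "warn"
    return "ok"


def human_size(size_bytes: int) -> str:
    """Format a byte count in the style of the file analysis output."""
    if size_bytes >= MB:
        return f"{size_bytes / MB:.1f} MB"
    if size_bytes >= 1024:
        return f"{size_bytes / 1024:.0f} KB"
    return f"{size_bytes} B"
```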

Troubleshooting

Error: Could not extract text from PDF

The PDF may be:

Image-based (scanned document) — use OCR first
Encrypted/password-protected — remove protection first
Corrupted — try re-downloading or exporting to a new PDF

Try converting to plain text first with pdftotext (part of the Poppler utilities):

pdftotext input.pdf output.txt
python scripts/demo_step4_byod.py output.txt

Error: File is empty

The file has 0 bytes. Check that the file actually contains content.

Error: Unsupported file type

Only .txt, .pdf, .csv, and .docx files are supported. Convert other formats to one of these first.

No files found in userdata/

Create the userdata/ directory and drop files there:

mkdir -p userdata
cp ~/Documents/myfile.pdf userdata/
python scripts/demo_step4_byod.py

Response doesn't match file content

Increase similarity_top_k (3 in the script) so each query retrieves more chunks:

# Edit scripts/demo_step4_byod.py
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5)

Real-world use cases

Budget analysis

Load city budget PDFs and ask:

“What are the biggest line items?”
“How does this year compare to last year?”
“Which departments saw cuts?”

Meeting notes

Load DOCX meeting notes and ask:

“What action items were assigned?”
“What decisions were made?”
“Who attended and what were the key topics?”

Data reports

Load CSV or TXT data files and ask:

“What are the key trends?”
“Which metrics are concerning?”
“What correlations exist?”

Research papers

Load academic PDFs and ask:

“What is the main finding?”
“What methodology was used?”
“What are the limitations?”

Next steps

Now that you’ve used BYOD in the terminal, move to Step 5: BYOD Web Application to wrap this in a web interface with drag-and-drop file upload.
