Overview
The document upload system:
- Accepts PDFs and text files via multipart/form-data
- Stores original files in S3 with unique keys
- Extracts text content using `pdf-parse` for PDFs
- Saves metadata and extracted text in SQLite (`topic_documents` table)
- Provides extracted text to agents for context-aware concept generation
Document uploads are optional: Sprout runs without AWS configuration, but upload features will fail until it is set up.
AWS Setup
1. Create an S3 Bucket
Sign in to the AWS Console and navigate to the S3 console. Click Create bucket and configure:
- Bucket name: Choose a unique name (e.g., `sprout-documents-prod`)
- Region: Select your preferred region (e.g., `us-east-1`)
- Block Public Access: Keep all settings enabled (documents should be private)
- Versioning: Optional (recommended for production)
- Encryption: Enable server-side encryption (recommended)
2. Create IAM User
Create a dedicated IAM user for the Sprout backend. Navigate to the IAM console, go to Users, click Add users, and configure:
- User name: `sprout-backend`
- Access type: Access key - Programmatic access
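A minimal policy sketch granting only what the upload features need; the bucket name is a placeholder for your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::sprout-documents-prod/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::sprout-documents-prod"
    }
  ]
}
```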
Backend Configuration
Add AWS credentials to your backend `.env` file:
Environment Variables
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
- AWS_S3_BUCKET
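For example, with placeholder values:

```
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_REGION=us-east-1
AWS_S3_BUCKET=sprout-documents-prod
```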
Your IAM user's access key ID starts with `AKIA` for long-term credentials.
Verify Configuration
Test your AWS credentials using the AWS CLI.
Upload Implementation
The backend uses `@aws-sdk/client-s3` for S3 operations:
Upload Endpoint
Route: `POST /api/nodes/:nodeId/documents`
Headers: `Content-Type: multipart/form-data`
Body: Form data with a `file` field
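As a usage sketch, assuming the backend listens locally on port 3000 (the port and node ID are placeholders; adjust to your setup):

```
curl -X POST "http://localhost:3000/api/nodes/<nodeId>/documents" \
  -F "file=@lecture-notes.pdf"
```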
Upload Flow
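The flow can be sketched as follows. This is an illustrative outline, not Sprout's actual code: the helper names (`putObject`, `extractText`, `saveDocument`) and field names are hypothetical, and the real implementation wires them to S3, pdf-parse, and SQLite.

```typescript
import { randomUUID } from "node:crypto";

// Injected dependencies stand in for the S3 client, text extractor, and DB layer.
interface UploadDeps {
  putObject(key: string, body: Buffer): Promise<void>;
  extractText(body: Buffer, mime: string): Promise<string>;
  saveDocument(row: {
    nodeId: string;
    s3Key: string;
    filename: string;
    extractedText: string | null;
    extractionStatus: string;
  }): Promise<void>;
}

export async function handleUpload(
  deps: UploadDeps,
  nodeId: string,
  filename: string,
  mime: string,
  body: Buffer,
): Promise<string> {
  // 1. Store the original file in S3 under a collision-free, UUID-based key
  const s3Key = `documents/${nodeId}/${randomUUID()}-${filename}`;
  await deps.putObject(s3Key, body);

  // 2. Attempt text extraction; record a failure instead of rejecting the upload
  let extractedText: string | null = null;
  let extractionStatus = "pending";
  try {
    extractedText = await deps.extractText(body, mime);
    extractionStatus = "completed";
  } catch {
    extractionStatus = "failed";
  }

  // 3. Persist metadata and extracted text in the topic_documents table
  await deps.saveDocument({ nodeId, s3Key, filename, extractedText, extractionStatus });
  return s3Key;
}
```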
S3 Key Format
Files are stored under keys that embed a UUID; UUIDs prevent filename collisions and ensure unique keys.
Text Extraction
The backend extracts text from uploaded documents for use by agents.
Supported Formats
- Text Files
- PDF (MIME type: `application/pdf`), extracted using the `pdf-parse` library. Limitations:
  - OCR not supported (scanned PDFs won't extract)
  - Complex layouts may not extract perfectly
  - Images and diagrams are ignored
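A minimal sketch of the extraction step. `extractText` is a hypothetical helper, not Sprout's actual function; the parser is injected so the sketch stays self-contained, and it assumes pdf-parse's documented default export, which takes a Buffer and resolves to an object with a `text` field.

```typescript
// Shape of pdf-parse's default export (Buffer in, { text } out)
type PdfParser = (body: Buffer) => Promise<{ text: string }>;

export async function extractText(
  body: Buffer,
  mime: string,
  pdfParse: PdfParser,
): Promise<string> {
  if (mime === "application/pdf") {
    const result = await pdfParse(body); // throws on malformed or encrypted PDFs
    return result.text;
  }
  // Plain-text uploads need no parsing
  return body.toString("utf-8");
}
```

In the backend, the caller would pass the real library, e.g. `extractText(buf, mime, require("pdf-parse"))` (an assumption about the wiring, not Sprout's code).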
Extraction Status
Document extraction status is tracked in the `topic_documents` table:
| Status | Description |
|---|---|
| `pending` | Extraction not yet attempted |
| `completed` | Text successfully extracted |
| `failed` | Extraction failed (error logged) |
Using Documents in Agents
Agents access document context via the `extract_all_concept_contexts` tool:
1. Topic Agent generates concepts
2. Agent calls `extract_all_concept_contexts` with the concept list
3. Tool searches document text for relevant sections
4. Extracted context is returned to the agent
5. Agent uses the context to refine concept descriptions
Agents receive only relevant excerpts, not full document text, to stay within token limits.
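The excerpt-retrieval idea can be illustrated with a simple sketch. This is not Sprout's actual tool implementation; it assumes a naive keyword match and returns a fixed-size window of text around each concept's first mention, which is one way to keep context within token limits.

```typescript
// For each concept, return a short window of document text around its first
// case-insensitive mention, or null if the concept is never mentioned.
export function extractConceptContexts(
  documentText: string,
  concepts: string[],
  windowChars = 300,
): Record<string, string | null> {
  const lower = documentText.toLowerCase();
  const contexts: Record<string, string | null> = {};
  for (const concept of concepts) {
    const idx = lower.indexOf(concept.toLowerCase());
    contexts[concept] =
      idx === -1
        ? null // concept not mentioned; agent falls back to its own knowledge
        : documentText.slice(
            Math.max(0, idx - windowChars / 2),
            idx + concept.length + windowChars / 2,
          );
  }
  return contexts;
}
```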
Database Schema
Documents are stored in the `topic_documents` table, defined in `schema.ts`.
The `extractedText` field can be large (up to several MB for long documents); SQLite TEXT fields support up to 1 GB.
Troubleshooting
Upload fails with 403 Forbidden
The IAM user doesn't have permission to upload to S3. Solution:
- Check the IAM policy includes `s3:PutObject`
- Verify the bucket name matches the policy
- Test with the AWS CLI:
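For example (the bucket name is a placeholder):

```
# Confirm which identity the CLI is using
aws sts get-caller-identity
# Try a test upload with the same credentials
aws s3 cp test.txt s3://sprout-documents-prod/test.txt
```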
Upload fails with 400 Bad Request
Incorrect AWS credentials or region. Solution:
- Verify `.env` variables are set correctly
- Check `AWS_REGION` matches the bucket region
- Restart the backend after changing `.env`
Text extraction fails for PDF
Error: `extraction_status = 'failed'`. Solution:
- Check the `extraction_error` field in the database
- Ensure the PDF is text-based (not a scanned image)
- Try uploading a different PDF
- Check backend logs for `pdf-parse` errors
Agents don't use document context
Generated concepts don't reference uploaded documents. Solution:
- Verify documents are uploaded before running agents
- Check `extraction_status = 'completed'`
- Ensure documents contain relevant text
- Review the agent SSE stream for `extract_all_concept_contexts` tool calls
AWS_S3_BUCKET not found error
Error: `The specified bucket does not exist`. Solution:
- Verify the bucket name in `.env` matches the AWS console
- Check the bucket region matches `AWS_REGION`
- Ensure the bucket wasn't deleted
- List buckets:
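The standard CLI command lists every bucket the credentials can see:

```
aws s3 ls
```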
Security Best Practices
Never commit credentials
Add `.env` to `.gitignore`:
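A minimal entry:

```
# local environment files containing AWS credentials
.env
```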
Use IAM roles in production
For EC2 or ECS deployments, use IAM roles instead of access keys:
- Create an IAM role with S3 permissions
- Attach the role to the EC2 instance or ECS task
- Remove `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from `.env`
Enable bucket versioning
Protect against accidental deletion:
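With the AWS CLI (bucket name is a placeholder):

```
aws s3api put-bucket-versioning \
  --bucket sprout-documents-prod \
  --versioning-configuration Status=Enabled
```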
Encrypt at rest
Enable server-side encryption (SSE-S3):
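With the AWS CLI (bucket name is a placeholder):

```
aws s3api put-bucket-encryption \
  --bucket sprout-documents-prod \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```

Note that S3 has applied SSE-S3 encryption by default to new buckets since January 2023, so this is mainly relevant for older buckets.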
Set lifecycle policies
Automatically delete old documents:
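For example, to expire documents after a year (bucket name, prefix, and retention period are placeholders; pick values that fit your retention needs):

```
aws s3api put-bucket-lifecycle-configuration \
  --bucket sprout-documents-prod \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-documents",
      "Status": "Enabled",
      "Filter": {"Prefix": "documents/"},
      "Expiration": {"Days": 365}
    }]
  }'
```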
Cost Optimization
S3 Storage Costs
Pricing (us-east-1, as of 2024):
- First 50 TB: $0.023 per GB/month
- PUT requests: $0.005 per 1,000 requests
- GET requests: $0.0004 per 1,000 requests
Example estimate for 5 GB of documents and 1,000 uploads:
- Storage: 5 GB × $0.023 ≈ $0.12/month
- Uploads: 1,000 PUTs ≈ $0.005 one-time
S3 costs are minimal for typical Sprout usage (hundreds to thousands of documents).
Reduce Costs
- Use S3 Intelligent-Tiering: Automatically moves objects to cheaper tiers
- Set lifecycle policies: Delete old/unused documents
- Compress PDFs: Reduce storage size before upload
- Use S3 Standard-IA: For infrequently accessed documents ($0.0125/GB)
Next Steps
- Running Locally: Start all services in development mode
- Database Migrations: Manage database schema changes