Overview
The document upload system:
- Accepts PDFs and text files via multipart/form-data
- Stores original files in S3 with unique keys
- Extracts text content using `pdf-parse` for PDFs
- Saves metadata and extracted text in SQLite (`topic_documents` table)
- Provides extracted text to agents for context-aware concept generation
Document uploads are optional: Sprout runs without AWS configuration, but upload features will fail until it is set up.
AWS Setup
1. Create an S3 Bucket
Sign in to the AWS Console and navigate to the S3 console. Click Create bucket and configure:
- Bucket name: Choose a unique name (e.g., `sprout-documents-prod`)
- Region: Select your preferred region (e.g., `us-east-1`)
- Block Public Access: Keep all settings enabled (documents should be private)
- Versioning: Optional (recommended for production)
- Encryption: Enable server-side encryption (recommended)
2. Create IAM User
Create a dedicated IAM user for the Sprout backend. Navigate to the IAM console, go to Users, click Add users, and configure:
- User name: `sprout-backend`
- Access type: Access key - Programmatic access
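A minimal policy sketch granting only what the upload features need; the bucket name is a placeholder for your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::sprout-documents-prod/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::sprout-documents-prod"
    }
  ]
}
```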
Backend Configuration
Add AWS credentials to your backend `.env` file:
Environment Variables
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
- AWS_S3_BUCKET
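For example, with placeholder values:

```
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_REGION=us-east-1
AWS_S3_BUCKET=sprout-documents-prod
```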
Your IAM user's access key ID starts with `AKIA` for long-term credentials.
Verify Configuration
Test your AWS credentials using the AWS CLI.
Upload Implementation
The backend uses `@aws-sdk/client-s3` for S3 operations:
Upload Endpoint
Route: `POST /api/nodes/:nodeId/documents`
Headers: `Content-Type: multipart/form-data`
Body: Form data with a `file` field
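As a usage sketch, assuming the backend listens locally on port 3000 (the port and node ID are placeholders; adjust to your setup):

```
curl -X POST "http://localhost:3000/api/nodes/<nodeId>/documents" \
  -F "file=@lecture-notes.pdf"
```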
Upload Flow
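The flow can be sketched as follows. This is an illustrative outline, not Sprout's actual code: the helper names (`putObject`, `extractText`, `saveDocument`) and field names are hypothetical, and the real implementation wires them to S3, pdf-parse, and SQLite.

```typescript
import { randomUUID } from "node:crypto";

// Injected dependencies stand in for the S3 client, text extractor, and DB layer.
interface UploadDeps {
  putObject(key: string, body: Buffer): Promise<void>;
  extractText(body: Buffer, mime: string): Promise<string>;
  saveDocument(row: {
    nodeId: string;
    s3Key: string;
    filename: string;
    extractedText: string | null;
    extractionStatus: string;
  }): Promise<void>;
}

export async function handleUpload(
  deps: UploadDeps,
  nodeId: string,
  filename: string,
  mime: string,
  body: Buffer,
): Promise<string> {
  // 1. Store the original file in S3 under a collision-free, UUID-based key
  const s3Key = `documents/${nodeId}/${randomUUID()}-${filename}`;
  await deps.putObject(s3Key, body);

  // 2. Attempt text extraction; record a failure instead of rejecting the upload
  let extractedText: string | null = null;
  let extractionStatus = "pending";
  try {
    extractedText = await deps.extractText(body, mime);
    extractionStatus = "completed";
  } catch {
    extractionStatus = "failed";
  }

  // 3. Persist metadata and extracted text in the topic_documents table
  await deps.saveDocument({ nodeId, s3Key, filename, extractedText, extractionStatus });
  return s3Key;
}
```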
S3 Key Format
Files are stored under keys that embed a UUID; UUIDs prevent filename collisions and ensure unique keys.
Text Extraction
The backend extracts text from uploaded documents for use by agents.
Supported Formats
- Text Files
- PDF (MIME type: `application/pdf`), extracted using the `pdf-parse` library. Limitations:
  - OCR not supported (scanned PDFs won't extract)
  - Complex layouts may not extract perfectly
  - Images and diagrams are ignored
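A minimal sketch of the extraction step. `extractText` is a hypothetical helper, not Sprout's actual function; the parser is injected so the sketch stays self-contained, and it assumes pdf-parse's documented default export, which takes a Buffer and resolves to an object with a `text` field.

```typescript
// Shape of pdf-parse's default export (Buffer in, { text } out)
type PdfParser = (body: Buffer) => Promise<{ text: string }>;

export async function extractText(
  body: Buffer,
  mime: string,
  pdfParse: PdfParser,
): Promise<string> {
  if (mime === "application/pdf") {
    const result = await pdfParse(body); // throws on malformed or encrypted PDFs
    return result.text;
  }
  // Plain-text uploads need no parsing
  return body.toString("utf-8");
}
```

In the backend, the caller would pass the real library, e.g. `extractText(buf, mime, require("pdf-parse"))` (an assumption about the wiring, not Sprout's code).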
Extraction Status
Document extraction status is tracked in the `topic_documents` table:
| Status | Description |
|---|---|
| `pending` | Extraction not yet attempted |
| `completed` | Text successfully extracted |
| `failed` | Extraction failed (error logged) |
Using Documents in Agents
Agents access document context via the `extract_all_concept_contexts` tool:
1. Topic Agent generates concepts
2. Agent calls `extract_all_concept_contexts` with the concept list
3. Tool searches document text for relevant sections
4. Extracted context is returned to the agent
5. Agent uses the context to refine concept descriptions
Agents receive only relevant excerpts, not full document text, to stay within token limits.
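The excerpt-retrieval idea can be illustrated with a simple sketch. This is not Sprout's actual tool implementation; it assumes a naive keyword match and returns a fixed-size window of text around each concept's first mention, which is one way to keep context within token limits.

```typescript
// For each concept, return a short window of document text around its first
// case-insensitive mention, or null if the concept is never mentioned.
export function extractConceptContexts(
  documentText: string,
  concepts: string[],
  windowChars = 300,
): Record<string, string | null> {
  const lower = documentText.toLowerCase();
  const contexts: Record<string, string | null> = {};
  for (const concept of concepts) {
    const idx = lower.indexOf(concept.toLowerCase());
    contexts[concept] =
      idx === -1
        ? null // concept not mentioned; agent falls back to its own knowledge
        : documentText.slice(
            Math.max(0, idx - windowChars / 2),
            idx + concept.length + windowChars / 2,
          );
  }
  return contexts;
}
```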
Database Schema
Documents are stored in the `topic_documents` table, defined in `schema.ts`.
The `extractedText` field can be large (up to several MB for long documents); SQLite TEXT fields support up to 1 GB.
Troubleshooting
Upload fails with 403 Forbidden
The IAM user doesn't have permission to upload to S3. Solution:
- Check the IAM policy includes `s3:PutObject`
- Verify the bucket name matches the policy
- Test with the AWS CLI:
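For example (the bucket name is a placeholder):

```
# Confirm which identity the CLI is using
aws sts get-caller-identity
# Try a test upload with the same credentials
aws s3 cp test.txt s3://sprout-documents-prod/test.txt
```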
Upload fails with 400 Bad Request
Incorrect AWS credentials or region. Solution:
- Verify `.env` variables are set correctly
- Check `AWS_REGION` matches the bucket region
- Restart the backend after changing `.env`
Text extraction fails for PDF
Error: `extraction_status = 'failed'`. Solution:
- Check the `extraction_error` field in the database
- Ensure the PDF is text-based (not a scanned image)
- Try uploading a different PDF
- Check backend logs for `pdf-parse` errors
Agents don't use document context
Generated concepts don't reference uploaded documents. Solution:
- Verify documents are uploaded before running agents
- Check `extraction_status = 'completed'`
- Ensure documents contain relevant text
- Review the agent SSE stream for `extract_all_concept_contexts` tool calls
AWS_S3_BUCKET not found error
Error: `The specified bucket does not exist`. Solution:
- Verify the bucket name in `.env` matches the AWS console
- Check the bucket region matches `AWS_REGION`
- Ensure the bucket wasn't deleted
- List buckets:
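The standard CLI command lists every bucket the credentials can see:

```
aws s3 ls
```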
Security Best Practices
Never commit credentials
Add `.env` to `.gitignore`:
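A minimal entry:

```
# local environment files containing AWS credentials
.env
```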
Use IAM roles in production
For EC2 or ECS deployments, use IAM roles instead of access keys:
- Create an IAM role with S3 permissions
- Attach the role to the EC2 instance or ECS task
- Remove `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from `.env`
Enable bucket versioning
Protect against accidental deletion:
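With the AWS CLI (bucket name is a placeholder):

```
aws s3api put-bucket-versioning \
  --bucket sprout-documents-prod \
  --versioning-configuration Status=Enabled
```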
Encrypt at rest
Enable server-side encryption (SSE-S3):
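With the AWS CLI (bucket name is a placeholder):

```
aws s3api put-bucket-encryption \
  --bucket sprout-documents-prod \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```

Note that S3 has applied SSE-S3 encryption by default to new buckets since January 2023, so this is mainly relevant for older buckets.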
Set lifecycle policies
Automatically delete old documents:
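For example, to expire documents after a year (bucket name, prefix, and retention period are placeholders; pick values that fit your retention needs):

```
aws s3api put-bucket-lifecycle-configuration \
  --bucket sprout-documents-prod \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-documents",
      "Status": "Enabled",
      "Filter": {"Prefix": "documents/"},
      "Expiration": {"Days": 365}
    }]
  }'
```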
Cost Optimization
S3 Storage Costs
Pricing (us-east-1, as of 2024):
- First 50 TB: $0.023 per GB/month
- PUT requests: $0.005 per 1,000 requests
- GET requests: $0.0004 per 1,000 requests
Example estimate for 5 GB of documents and 1,000 uploads:
- Storage: 5 GB × $0.023 ≈ $0.12/month
- Uploads: 1,000 PUTs ≈ $0.005 one-time
S3 costs are minimal for typical Sprout usage (hundreds to thousands of documents).
Reduce Costs
- Use S3 Intelligent-Tiering: Automatically moves objects to cheaper tiers
- Set lifecycle policies: Delete old/unused documents
- Compress PDFs: Reduce storage size before upload
- Use S3 Standard-IA: For infrequently accessed documents ($0.0125/GB)
Next Steps
- Running Locally: Start all services in development mode
- Database Migrations: Manage database schema changes