Innova Backend Data Model: Postgres, MongoDB, and S3

Innova uses a dual-database architecture. Supabase Postgres (relational, accessed via Prisma) holds structured application data: identities, curriculum, attempts, mastery, and guides. MongoDB Atlas holds high-volume, schema-flexible telemetry and AI job audit records. A third storage tier, AWS S3, holds binary assets: teacher worksheet PDFs, student photo submissions, and OCR upload staging. Each tier is chosen for its strengths, and no data is duplicated across them. The only cross-store links are identifiers such as attempt_id, which ties a Postgres Attempt row to its MongoDB AttemptEvent document.

Postgres Schema Overview

The Prisma schema (prisma/schema.prisma) targets PostgreSQL via Supabase. It is organized into six logical groups:

Identity & Roles

User (linked to Supabase Auth via supabaseUid), Teacher, Student, Parent, ParentLink

Org & Courses

Organization, School, Course, CourseTeacher, Enrollment, ClassroomInvite

Curriculum

Subject, Curriculum, OfficialOACode, Unit, Topic, TopicPrerequisite, Domain, Subdomain

Exercises & Work

Exercise, Assignment, AssignmentTarget, AssignmentExercise, Attempt, AttemptStep, AttemptErrorReport

Mastery & Alerts

StudentTopicMastery (BKT state), TeacherAlert, ErrorTag

Guides & Integrations

Guide, GuideQuestion, GuideSolution, GuideSubmission, SchoolIntegration, ExternalIdMap, CostEvent

Key Model Definitions

`Attempt`

The central work record, created on every POST /attempts call. It starts with status PENDING and is updated asynchronously by the LLM classifier or the OCR reprocess worker. classifierSource records which pipeline produced the final error tag.

model Attempt {
  id               String    @id @default(uuid())
  assignmentId     String?   @map("assignment_id")
  exerciseId       String?   @map("exercise_id")
  studentId        String    @map("student_id")
  courseId         String?   @map("course_id")
  inputMode        String    @default("DIGITAL") @map("input_mode")
  status           String    @default("PENDING")
  finalAnswer      String?   @map("final_answer")
  isCorrect        Boolean   @map("is_correct")
  classifierSource String    @default("RULE") @map("classifier_source")
  errorTagId       String?   @map("error_tag_id")
  confidence       Float?
  ocrConfidence    Float?    @map("ocr_confidence")
  ocrProvider      String?   @map("ocr_provider")
  llmJobId         String?   @map("llm_job_id")
  traceId          String    @default(uuid()) @map("trace_id")
  createdAt        DateTime  @default(now()) @map("created_at")
  classifiedAt     DateTime? @map("classified_at")
  updatedAt        DateTime  @default(now()) @updatedAt @map("updated_at")

  assignment      Assignment?      @relation(fields: [assignmentId], references: [id])
  exercise        Exercise?        @relation(fields: [exerciseId], references: [id])
  student         Student          @relation(fields: [studentId], references: [id])
  course          Course?          @relation(fields: [courseId], references: [id])
  errorTag        ErrorTag?        @relation(fields: [errorTagId], references: [id])
  steps           AttemptStep[]
  guideSubmission GuideSubmission?
  errorReports    AttemptErrorReport[]

  @@index([studentId])
  @@index([courseId])
  @@index([exerciseId, createdAt])
  @@index([errorTagId])
  @@map("attempts")
}

`StudentTopicMastery`

Stores the live BKT (Bayesian Knowledge Tracing) state per student-topic pair. pKnown is the probability that the student has mastered the skill. trend7d is a 7-day rolling delta written by the nightly calibration job in innova-ai-engine.

model StudentTopicMastery {
  studentId     String    @map("student_id")
  topicId       String    @map("topic_id")
  pKnown        Float     @default(0.3) @map("p_known")
  pSlip         Float     @default(0.1) @map("p_slip")
  pGuess        Float     @default(0.2) @map("p_guess")
  pTransit      Float     @default(0.1) @map("p_transit")
  attemptsCount Int       @default(0) @map("attempts_count")
  lastAttemptAt DateTime? @map("last_attempt_at")
  trend7d       Float?    @map("trend_7d")
  updatedAt     DateTime  @updatedAt @map("updated_at")

  student Student @relation(fields: [studentId], references: [id])
  topic   Topic   @relation(fields: [topicId], references: [id])

  @@id([studentId, topicId])
  @@index([topicId])
  @@index([lastAttemptAt])
  @@map("student_topic_mastery")
}
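The page does not spell out the update rule applied to these columns, so the following is only a minimal sketch of the textbook BKT update, assuming the four parameters are used in the standard way. The `BktState` interface and `updatePKnown` function are hypothetical names for illustration, not identifiers from the codebase.

```typescript
// Textbook BKT update. Field names mirror the StudentTopicMastery columns;
// BktState and updatePKnown are hypothetical names, not from the codebase.
interface BktState {
  pKnown: number;   // P(skill already mastered)
  pSlip: number;    // P(wrong answer despite mastery)
  pGuess: number;   // P(right answer without mastery)
  pTransit: number; // P(learning the skill on this opportunity)
}

// Returns the new pKnown after observing one attempt.
function updatePKnown(state: BktState, isCorrect: boolean): number {
  const { pKnown, pSlip, pGuess, pTransit } = state;
  // Bayes step: posterior probability of mastery given the observation.
  const posterior = isCorrect
    ? (pKnown * (1 - pSlip)) / (pKnown * (1 - pSlip) + (1 - pKnown) * pGuess)
    : (pKnown * pSlip) / (pKnown * pSlip + (1 - pKnown) * (1 - pGuess));
  // Learning step: the student may acquire the skill during this attempt.
  return posterior + (1 - posterior) * pTransit;
}

// Starting from the schema defaults shown above.
const defaults: BktState = { pKnown: 0.3, pSlip: 0.1, pGuess: 0.2, pTransit: 0.1 };
const afterCorrect = updatePKnown(defaults, true);  // about 0.693
const afterWrong = updatePKnown(defaults, false);   // about 0.146
```

With the default parameters, one correct answer moves pKnown from 0.3 to roughly 0.69, while one incorrect answer drops it to roughly 0.15.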

`Guide`

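Represents a teacher-uploaded worksheet PDF. It moves through the GuideStatus state machine; the happy path is UPLOADED → EXTRACTING → REVIEW → PUBLISHED, and the enum also defines GENERATING_SOLUTIONS, two failure states, and ARCHIVED. Once published, the guide is linked to an Assignment (kind=GUIDE) so that students can be assigned the worksheet.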

model Guide {
  id                     String      @id @default(uuid())
  courseId               String      @map("course_id")
  createdByTeacherId     String      @map("created_by_teacher_id")
  title                  String
  description            String?
  status                 GuideStatus @default(UPLOADED)
  sourcePdfKey           String      @map("source_pdf_key")   // S3 guides/uploads/
  sourcePdfPages         Int?        @map("source_pdf_pages")
  sourceKind             String?     @map("source_kind")      // SCANNED | DIGITAL | MIXED
  latexKey               String?     @map("latex_key")        // S3 .tex (ADR-117)
  extractionConfidence   Float?      @map("extraction_confidence")
  extractionModel        String?     @map("extraction_model")
  failureReason          String?     @map("failure_reason")
  questionCount          Int         @default(0) @map("question_count")
  maxResubmissions       Int         @default(2) @map("max_resubmissions")
  showSolutionAfterGrade Boolean     @default(false) @map("show_solution_after_grade")
  assignmentId           String?     @unique @map("assignment_id")
  dueAt                  DateTime?   @map("due_at")
  publishedAt            DateTime?   @map("published_at")
  traceId                String      @default(uuid()) @map("trace_id")
  createdAt              DateTime    @default(now()) @map("created_at")
  updatedAt              DateTime    @updatedAt @map("updated_at")
  archivedAt             DateTime?   @map("archived_at")

  course      Course            @relation(fields: [courseId], references: [id])
  teacher     Teacher           @relation(fields: [createdByTeacherId], references: [id])
  assignment  Assignment?       @relation(fields: [assignmentId], references: [id])
  questions   GuideQuestion[]
  submissions GuideSubmission[]

  @@index([courseId, status])
  @@index([createdByTeacherId])
  @@map("guides")
}
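One way to enforce the GuideStatus state machine in application code is a transition table. The sketch below is an illustration only: the status names come from the GuideStatus enum, but the allowed transitions (including the placement of GENERATING_SOLUTIONS between EXTRACTING and REVIEW, and treating the failure states as terminal) are assumptions, and `ALLOWED_TRANSITIONS`, `canTransition`, and `transition` are hypothetical names.

```typescript
type GuideStatus =
  | 'UPLOADED' | 'EXTRACTING' | 'EXTRACTION_FAILED'
  | 'GENERATING_SOLUTIONS' | 'GENERATION_FAILED'
  | 'REVIEW' | 'PUBLISHED' | 'ARCHIVED';

// ASSUMPTION: this table is illustrative and not taken from the codebase.
// The real pipeline may allow retries from failure states or other edges.
const ALLOWED_TRANSITIONS: Record<GuideStatus, GuideStatus[]> = {
  UPLOADED: ['EXTRACTING'],
  EXTRACTING: ['GENERATING_SOLUTIONS', 'EXTRACTION_FAILED'],
  EXTRACTION_FAILED: [],
  GENERATING_SOLUTIONS: ['REVIEW', 'GENERATION_FAILED'],
  GENERATION_FAILED: [],
  REVIEW: ['PUBLISHED'],
  PUBLISHED: ['ARCHIVED'],
  ARCHIVED: [],
};

// True when the table permits moving from one status to the other.
function canTransition(from: GuideStatus, to: GuideStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status, or throws if the move is not in the table.
function transition(from: GuideStatus, to: GuideStatus): GuideStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal guide transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table in one place means a service method can call `transition(guide.status, next)` before writing the new status, so an out-of-order worker message fails loudly instead of silently corrupting the guide.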

Key Enums

// Error taxonomy
enum ErrorSource  { CURATED  LLM_GENERATED  FIELD_REPORTED }
enum ErrorStatus  { ACTIVE   DRAFT          DEPRECATED }
enum ErrorSeverity { LOW  MED  HIGH  CRITICAL }

// Guides pipeline state machines
enum GuideStatus {
  UPLOADED  EXTRACTING  EXTRACTION_FAILED
  GENERATING_SOLUTIONS  GENERATION_FAILED
  REVIEW  PUBLISHED  ARCHIVED
}

enum GuideQuestionStatus {
  EXTRACTED      // pipeline output, not yet reviewed
  NEEDS_REVIEW   // low confidence (topic < 0.85 or solution mismatch)
  APPROVED       // teacher approved topic + solution key
  EXCLUDED       // teacher removed from published guide
}

enum SolutionSource {
  PDF_PROVIDED    // normalised from original PDF
  LLM_GENERATED  // generated by Claude Sonnet
  TEACHER_EDITED // edited by teacher in review wizard
}

enum SubmissionStatus {
  UPLOADED  TRANSCRIBING  GRADING  GRADED  FAILED
}
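The NEEDS_REVIEW comment above describes a simple rule: a question needs review when topic confidence is below 0.85 or the solution does not match. A minimal sketch of that rule follows; the function name `initialQuestionStatus` and its two parameters are hypothetical, and how the real pipeline detects a solution mismatch is not described here.

```typescript
type GuideQuestionStatus = 'EXTRACTED' | 'NEEDS_REVIEW' | 'APPROVED' | 'EXCLUDED';

// Threshold taken from the NEEDS_REVIEW enum comment.
const TOPIC_CONFIDENCE_THRESHOLD = 0.85;

// Decides the status a freshly extracted question starts in.
// ASSUMPTION: solutionMatches is computed elsewhere by the pipeline.
function initialQuestionStatus(
  topicConfidence: number,
  solutionMatches: boolean,
): GuideQuestionStatus {
  const lowConfidence = topicConfidence < TOPIC_CONFIDENCE_THRESHOLD;
  return lowConfidence || !solutionMatches ? 'NEEDS_REVIEW' : 'EXTRACTED';
}
```

APPROVED and EXCLUDED are never produced by this function because, per the enum comments, they result from teacher actions in review.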

MongoDB Collections

MongoDB Atlas stores three high-volume, append-heavy collections. All use Mongoose schemas registered via NestJS @Schema() / @Prop() decorators.

`attempt_events` — Keystroke Telemetry

Raw keystroke and interaction events for every attempt. Used for learning analytics, replay, and debugging. Documents are archived to S3 after processing, and archived_to_s3_at records when that happened.

// src/infrastructure/database/schemas/attempt-event.schema.ts

@Schema({ _id: false })
export class KeystrokeEvent {
  @Prop({ type: Number, required: true })
  timestamp_ms!: number;

  @Prop({ type: String, required: true,
    enum: ['key_down', 'paste', 'submit', 'hint_request', 'erase', 'undo'] })
  type!: string;

  @Prop({ type: String, required: true,
    enum: ['units', 'tens', 'hundreds', 'thousands', 'numerator', 'denominator'] })
  column!: string;

  @Prop({ type: String, required: true })
  value!: string;

  @Prop({ type: Number, required: true })
  cursor_pos!: number;
}

@Schema({ _id: false })
export class EventSummary {
  @Prop({ type: Number, required: true }) total_events!: number;
  @Prop({ type: Number, required: true }) duration_ms!: number;
  @Prop({ type: Number, required: true, default: 0 }) hints_used!: number;
  @Prop({ type: Number, required: true, default: 0 }) undo_count!: number;
  @Prop({ type: Number, required: true, default: 0 }) paste_count!: number;
}

@Schema({ timestamps: true, collection: 'attempt_events' })
export class AttemptEvent {
  @Prop({ type: String, required: true, index: true })
  attempt_id!: string;

  @Prop({ type: String, required: true, index: true })
  student_id!: string;

  @Prop({ type: String, required: true })
  classroom_id!: string;

  @Prop({ type: String, required: true, index: true })
  trace_id!: string;

  @Prop({ type: [KeystrokeEventSchema], required: true })
  events!: KeystrokeEvent[];

  @Prop({ type: EventSummarySchema, required: true })
  summary!: EventSummary;

  @Prop({ type: Date })
  archived_to_s3_at?: Date;

  createdAt?: Date;
  updatedAt?: Date;
}
// Compound index: AttemptEventSchema.index({ createdAt: 1, archived_to_s3_at: 1 });
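The summary subdocument can be derived from the events array. The sketch below shows one plausible derivation using plain interfaces rather than the Mongoose classes. It assumes duration_ms is the span between the earliest and latest timestamp_ms, which the schema does not confirm, and `summarizeEvents` is a hypothetical helper name.

```typescript
type KeystrokeType = 'key_down' | 'paste' | 'submit' | 'hint_request' | 'erase' | 'undo';

interface KeystrokeEvent {
  timestamp_ms: number;
  type: KeystrokeType;
  column: string;
  value: string;
  cursor_pos: number;
}

interface EventSummary {
  total_events: number;
  duration_ms: number;
  hints_used: number;
  undo_count: number;
  paste_count: number;
}

// Builds the summary from raw events.
// ASSUMPTION: duration is latest timestamp minus earliest timestamp.
function summarizeEvents(events: KeystrokeEvent[]): EventSummary {
  const count = (type: KeystrokeType): number =>
    events.filter((event) => event.type === type).length;
  const timestamps = events.map((event) => event.timestamp_ms);
  const duration_ms =
    events.length === 0 ? 0 : Math.max(...timestamps) - Math.min(...timestamps);
  return {
    total_events: events.length,
    duration_ms,
    hints_used: count('hint_request'),
    undo_count: count('undo'),
    paste_count: count('paste'),
  };
}

const summary = summarizeEvents([
  { timestamp_ms: 0, type: 'key_down', column: 'units', value: '7', cursor_pos: 0 },
  { timestamp_ms: 1200, type: 'hint_request', column: 'units', value: '', cursor_pos: 1 },
  { timestamp_ms: 2500, type: 'paste', column: 'tens', value: '4', cursor_pos: 1 },
  { timestamp_ms: 3100, type: 'undo', column: 'tens', value: '', cursor_pos: 0 },
  { timestamp_ms: 4000, type: 'submit', column: 'units', value: '47', cursor_pos: 2 },
]);
```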

`llm_classification_jobs` — LLM Audit Log

Full audit trail of every Claude classification call: request tokens, cache usage, per-attempt classifications with evidence and confidence, estimated cost, and latency. Used for cost accounting, debugging classification quality, and prompt optimization.

// src/infrastructure/database/schemas/llm-classification-job.schema.ts

@Schema({ timestamps: true, collection: 'llm_classification_jobs' })
export class LLMClassificationJob {
  @Prop({ type: [String], required: true })
  attempt_ids!: string[];

  @Prop({ type: String, required: true, index: true })
  trace_id!: string;

  @Prop({ type: String, required: true })
  model!: string;             // e.g. "claude-haiku-4-5-20251001"

  @Prop({ type: String, required: true,
    enum: ['pending', 'completed', 'failed', 'dlq'], index: true })
  status!: string;

  @Prop({ type: RequestMetaSchema })
  request_meta?: RequestMeta;   // cached_tokens, input_tokens, output_tokens, tool_choice, cache_hit

  @Prop({ type: ResponseMetaSchema })
  response_meta?: ResponseMeta; // classifications[], raw_response_id

  @Prop({ type: Number })
  cost_estimated_usd?: number;

  @Prop({ type: Number })
  duration_ms?: number;

  @Prop({ type: Number, default: 0 })
  retries?: number;

  @Prop({ type: String })
  error_message?: string;

  @Prop({ type: Date })
  completed_at?: Date;
}

Each response_meta.classifications entry contains:

// Embedded Classification subdocument
{
  attempt_id: string;
  error_type: string;      // ErrorTag.code
  evidence?: string;       // LLM explanation string
  confidence: number;      // 0–1
}
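For the cost accounting use case, job records can be rolled up per model. The following is a small in-memory sketch, assuming the records have already been loaded from the collection. `JobCostRecord` and `costByModel` are hypothetical names, and the second model name in the example is a placeholder.

```typescript
// Minimal projection of an llm_classification_jobs document.
interface JobCostRecord {
  model: string;
  status: 'pending' | 'completed' | 'failed' | 'dlq';
  cost_estimated_usd?: number;
}

interface ModelTotals {
  jobs: number;
  usd: number;
}

// Sums estimated cost and job count per model.
// Jobs without a cost estimate count as zero cost.
function costByModel(jobs: JobCostRecord[]): Map<string, ModelTotals> {
  const totals = new Map<string, ModelTotals>();
  for (const job of jobs) {
    const entry = totals.get(job.model) ?? { jobs: 0, usd: 0 };
    entry.jobs += 1;
    entry.usd += job.cost_estimated_usd ?? 0;
    totals.set(job.model, entry);
  }
  return totals;
}

const totals = costByModel([
  { model: 'claude-haiku-4-5-20251001', status: 'completed', cost_estimated_usd: 0.25 },
  { model: 'claude-haiku-4-5-20251001', status: 'failed', cost_estimated_usd: 0.5 },
  { model: 'placeholder-model', status: 'pending' }, // hypothetical model name
]);
```

At production volume the same grouping would more likely run as a MongoDB aggregation; the in-memory version only illustrates which fields feed the calculation.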

`ocr_jobs` — OCR Audit Log

Per-image OCR job record. Tracks the primary provider (Gemini) and optional fallback (Anthropic), bounding-box step positions, overall confidence, and cost. The s3_purge_at field mirrors the bucket’s 30-day lifecycle rule for COPPA-compliant data deletion tracking.

// src/infrastructure/database/schemas/ocr-job.schema.ts

@Schema({ timestamps: true, collection: 'ocr_jobs' })
export class OCRJob {
  @Prop({ type: String, required: true, index: true, unique: true })
  upload_id!: string;

  @Prop({ type: String, required: true, index: true })
  student_id!: string;

  @Prop({ type: String, required: true })
  trace_id!: string;

  @Prop({ type: String, required: true })
  s3_key!: string;

  @Prop({ type: Date })
  s3_purge_at?: Date;           // mirrors bucket lifecycle (30 days)

  @Prop({ type: String, required: true })
  primary_provider!: string;    // "gemini"

  @Prop({ type: Boolean, default: false })
  used_fallback?: boolean;

  @Prop({ type: String })
  fallback_provider?: string;   // "anthropic"

  @Prop({ type: OCRResultSchema })
  ocr_result?: OCRResult;       // topic_hint, steps[], final_answer, overall_confidence

  @Prop({ type: Number })
  cost_estimated_usd?: number;

  @Prop({ type: Number })
  duration_ms?: number;

  @Prop({ type: String, required: true,
    enum: ['pending', 'completed', 'failed', 'low_confidence_review'], index: true })
  status!: string;

  @Prop({ type: String })
  error_message?: string;

  @Prop({ type: String })
  attempt_id?: string;          // set after reprocess creates the Attempt

  @Prop({ type: Date })
  completed_at?: Date;
}

The OCRResult.steps array contains RecognizedStep objects with rawText, type, bounding-box position (x, y, w, h), and a confidence score between 0 and 1.
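Because s3_purge_at mirrors the 30-day bucket lifecycle, it can be computed at upload time. This is a minimal sketch under that assumption; `computePurgeAt` is a hypothetical helper, and the date S3 actually removes an object can lag the computed value because lifecycle expiration runs asynchronously.

```typescript
const RETENTION_DAYS = 30; // matches the bucket lifecycle rule
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the expected purge date for an object uploaded at uploadedAt.
function computePurgeAt(uploadedAt: Date, retentionDays: number = RETENTION_DAYS): Date {
  return new Date(uploadedAt.getTime() + retentionDays * MS_PER_DAY);
}

const purgeAt = computePurgeAt(new Date('2025-01-01T00:00:00.000Z'));
// purgeAt.toISOString() is '2025-01-31T00:00:00.000Z'
```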

S3 Buckets

Three S3 buckets are provisioned as CloudFormation resources in serverless.yml. Every bucket name includes the stage name for environment isolation.

`innova-backend-serverless-{stage}-guides`

Stores teacher worksheet PDFs and the AI-generated LaTeX keys extracted from them.

Prefix	Contents	Lifecycle
`guides/uploads/`	Raw teacher-uploaded PDFs	Deleted after 365 days
`guides/{id}/figures/`	Question figure crops (from extraction)	Inherits bucket default
`guides/{id}/latex/`	Generated `.tex` solution keys (ADR-117)	Inherits bucket default

CORS is configured for PUT and GET from any origin (*) to support presigned upload URLs from the browser. Presigned PUT TTL defaults to 600 seconds (GUIDES_PRESIGNED_PUT_TTL); presigned GET TTL defaults to 300 seconds (GUIDES_PRESIGNED_GET_TTL).
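The two TTL defaults could be resolved from the environment as sketched below. The variable names and default values come from the paragraph above; the helper names `readTtlSeconds` and `presignTtls`, and the fallback behaviour for invalid values, are assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Parses a positive integer TTL, falling back when unset or invalid.
function readTtlSeconds(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Resolves both presigned URL lifetimes in seconds.
function presignTtls(env: Env): { putSeconds: number; getSeconds: number } {
  return {
    putSeconds: readTtlSeconds(env, 'GUIDES_PRESIGNED_PUT_TTL', 600),
    getSeconds: readTtlSeconds(env, 'GUIDES_PRESIGNED_GET_TTL', 300),
  };
}

const ttlDefaults = presignTtls({});
const ttlOverride = presignTtls({ GUIDES_PRESIGNED_PUT_TTL: '900' });
```

In a Node service the argument would typically be `process.env`. If the backend uses AWS SDK v3, the resolved value would be passed as the `expiresIn` option of `getSignedUrl` from `@aws-sdk/s3-request-presigner`; this page does not confirm which SDK is used.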

`innova-submissions-{stage}`

Stores student handwritten photo submissions uploaded from mobile apps.

Prefix	Contents	Lifecycle
(root)	Student photos (UUID-named, no PII in key)	Deleted after 30 days

The 30-day lifecycle rule is a COPPA compliance requirement: images of minors must not be retained longer than necessary. The OCRJob.s3_purge_at field tracks the expected deletion date independently for auditing.

`innova-backend-serverless-{stage}-ocr-uploads`

Staging bucket for photos uploaded via the OCR pipeline. An s3:ObjectCreated:* notification on this bucket triggers the ocrWorker Lambda.

Prefix	Contents	Lifecycle
`uploads/`	Photo files awaiting OCR	Deleted after 30 days

# serverless.yml — OcrUploadsBucket resource
OcrUploadsBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: ${self:service}-${self:provider.stage}-ocr-uploads
    LifecycleConfiguration:
      Rules:
        - Id: PurgeAfter30Days
          Status: Enabled
          ExpirationInDays: 30

Prisma Migrations

Migration files live in prisma/migrations/. Each migration is a timestamped directory containing a migration.sql file generated by Prisma Migrate.

# Create a new migration during development
pnpm prisma migrate dev --name <descriptive-name>

# Apply pending migrations in CI / production
pnpm prisma migrate deploy

# Open Prisma Studio (GUI) against the local database
pnpm prisma studio

The seed script (prisma/seed.ts) creates a demo school, courses, students, and exercises. pnpm seed:full runs the seed and also imports the innova-ai-engine error taxonomy into the error_tags table. The auth seed (pnpm seed:auth) is guarded by ALLOW_SEED=1 and creates demo Supabase Auth users idempotently via the Admin REST API.

Connection pooling in serverless mode: Prisma is configured with @prisma/adapter-pg and the pg driver. In production, DATABASE_URL must point to the Supabase transaction pooler on port 6543, with ?pgbouncer=true&connection_limit=1 appended. This keeps each Lambda container from holding its own idle connection open; without it, a burst of concurrent invocations can exhaust Supabase’s connection limit. PrismaService implements a serverless-safe singleton so that warm Lambda containers reuse the existing PrismaClient instance instead of constructing a new one on every invocation.
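A startup check can catch a misconfigured DATABASE_URL before the first burst of traffic does. The sketch below validates the three requirements stated above. `checkPoolerUrl` is a hypothetical helper, not part of the codebase, and the hostnames in the usage lines are placeholders.

```typescript
// Returns a list of problems; an empty list means the URL looks pooler-ready.
function checkPoolerUrl(databaseUrl: string): string[] {
  const url = new URL(databaseUrl);
  const problems: string[] = [];
  if (url.port !== '6543') {
    problems.push(`expected port 6543, got ${url.port || '(none)'}`);
  }
  if (url.searchParams.get('pgbouncer') !== 'true') {
    problems.push('missing pgbouncer=true');
  }
  if (url.searchParams.get('connection_limit') !== '1') {
    problems.push('missing connection_limit=1');
  }
  return problems;
}

// Placeholder hosts and credentials, for illustration only.
const pooled = checkPoolerUrl(
  'postgresql://user:pw@pooler.example.com:6543/postgres?pgbouncer=true&connection_limit=1',
);
const direct = checkPoolerUrl('postgresql://user:pw@db.example.com:5432/postgres');
```

Here `pooled` is empty and `direct` lists three problems. Failing fast on a non-empty list during bootstrap is one option; logging a warning is a gentler alternative for non-production stages.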
