Purpose
The Grading Agent evaluates student answers to diagnostic questions, providing:
Correctness determination (true/false)
Numerical scores (0-1 scale)
Detailed feedback explaining why answers are correct or incorrect
Identification of misconceptions
Location: `sprout-backend/src/agents/grade-answers.ts`
Grading approaches
The agent uses different strategies based on question format:
MCQ grading
MCQ answers are graded by direct comparison (with partial credit for multi-select questions); open-ended answers are graded semantically.
MCQ questions are graded by direct string comparison:

```typescript
function gradeMCQ(question: Question, answer: Answer): GradingResult {
  const isCorrect = answer.selectedOption === question.correctAnswer;
  return {
    isCorrect,
    score: isCorrect ? 1.0 : 0.0,
    feedback: isCorrect
      ? `Correct! ${question.correctAnswer} is the right answer.`
      : `Incorrect. The correct answer is ${question.correctAnswer}.`
  };
}
```
Fast: no Claude API call is needed.

For MCQ questions with multiple correct answers, partial credit applies:

```typescript
function gradeMultiSelectMCQ(question: Question, answer: Answer): GradingResult {
  const selected = new Set(answer.selectedOptions);
  const correct = new Set(question.correctAnswers);

  const correctSelected = intersection(selected, correct).size;
  const incorrectSelected = selected.size - correctSelected;
  const missed = correct.size - correctSelected;

  const score = Math.max(0, correctSelected - incorrectSelected) / correct.size;

  return {
    isCorrect: score === 1.0,
    score,
    feedback: generatePartialCreditFeedback(correctSelected, incorrectSelected, missed)
  };
}
```
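Plugging a few selections into that formula shows how wrong picks cancel right ones. This standalone sketch (the helper name is illustrative, not from the codebase) mirrors the scoring line above:

```typescript
// Illustrative re-implementation of the partial-credit formula:
// score = max(0, correctSelected - incorrectSelected) / totalCorrect
function partialCreditScore(selected: string[], correct: string[]): number {
  const correctSet = new Set(correct);
  const correctSelected = selected.filter((o) => correctSet.has(o)).length;
  const incorrectSelected = selected.length - correctSelected;
  return Math.max(0, correctSelected - incorrectSelected) / correct.length;
}

// Correct answers are A and C:
partialCreditScore(["A", "C"], ["A", "C"]); // 1.0, full credit
partialCreditScore(["A"], ["A", "C"]);      // 0.5, one correct, none wrong
partialCreditScore(["A", "B"], ["A", "C"]); // 0.0, the wrong pick cancels the right one
```

A wrong selection costs as much as a correct one earns, which discourages indiscriminate guessing.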
Open-ended grading
Open-ended questions require semantic evaluation with Claude:
```typescript
async function gradeOpenEnded(
  question: Question,
  answer: Answer
): Promise<GradingResult> {
  const systemPrompt = `
You are an expert grader evaluating a student's answer.

Question: ${question.prompt}
Correct answer: ${question.correctAnswer}

Grading rubric:
${JSON.stringify(question.gradingRubric, null, 2)}

Student answer: ${answer.answerText}

Evaluate the answer:
1. Is it correct? (true/false)
2. Score from 0.0 to 1.0
3. Detailed feedback (2-3 sentences)
4. Identified misconceptions (if any)
`;

  const response = await anthropic.messages.create({
    model: "claude-opus-4-20250514",
    max_tokens: 2000,
    system: systemPrompt,
    messages: [{
      role: "user",
      content: "Grade this answer."
    }]
  });

  // Parse Claude's response
  const result = parseGradingResponse(response.content[0].text);
  return result;
}
```
Claude evaluates open-ended answers semantically, not just by keyword matching. This allows it to recognize correct answers expressed differently than expected.
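The `parseGradingResponse` helper referenced above is not shown. One plausible sketch, assuming the prompt is adjusted so Claude replies with a JSON object, might look like this (the real implementation may differ):

```typescript
interface GradingResult {
  isCorrect: boolean;
  score: number;
  feedback: string;
  misconceptions?: string[];
}

// Hedged sketch: assumes the model's reply contains a JSON object,
// possibly surrounded by prose.
function parseGradingResponse(text: string): GradingResult {
  const match = text.match(/\{[\s\S]*\}/); // grab the outermost {...} span
  if (!match) {
    throw new Error("No JSON object found in grading response");
  }
  const parsed = JSON.parse(match[0]);
  return {
    isCorrect: Boolean(parsed.isCorrect),
    // Clamp to the documented 0-1 range in case the model drifts.
    score: Math.min(1, Math.max(0, Number(parsed.score))),
    feedback: String(parsed.feedback ?? ""),
    misconceptions: Array.isArray(parsed.misconceptions)
      ? parsed.misconceptions
      : undefined
  };
}
```

Tolerating surrounding prose and clamping the score makes parsing resilient to minor formatting drift in model output.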
Grading rubrics
Open-ended questions include rubrics that guide the grading agent:
Key points rubric
```json
{
  "type": "key_points",
  "points": [
    "Binary search tree maintains sorted order",
    "Left subtree contains smaller values",
    "Right subtree contains larger values",
    "Property applies recursively to all subtrees"
  ],
  "scoring": {
    "all_points": 1.0,
    "most_points": 0.75,
    "some_points": 0.5,
    "few_points": 0.25,
    "no_points": 0.0
  },
  "min_points_for_passing": 3
}
```
Grading logic:
Count how many key points the student mentioned
Award score based on percentage of points covered
Provide feedback listing which points were missed
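A minimal sketch of that tier mapping (the ratio cutoffs here are an assumption; the rubric only names the tiers):

```typescript
// Assumed mapping from "fraction of key points mentioned" to the
// rubric's scoring tiers; the real cutoffs may differ.
function keyPointsScore(mentioned: number, total: number): number {
  const ratio = mentioned / total;
  if (ratio >= 1) return 1.0;     // all_points
  if (ratio >= 0.75) return 0.75; // most_points
  if (ratio >= 0.5) return 0.5;   // some_points
  if (ratio > 0) return 0.25;     // few_points
  return 0.0;                     // no_points
}

// A student who covers 3 of the 4 BST key points above:
keyPointsScore(3, 4); // 0.75
```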
Comparative rubric
```json
{
  "type": "comparative",
  "reference_answer": "In-order traversal visits nodes in sorted order: left subtree, root, right subtree. Pre-order visits root first, then left, then right.",
  "required_elements": [
    "in-order produces sorted sequence",
    "pre-order visits root first"
  ],
  "scoring": {
    "excellent": { "threshold": 0.9, "score": 1.0 },
    "good": { "threshold": 0.7, "score": 0.85 },
    "acceptable": { "threshold": 0.5, "score": 0.7 },
    "poor": { "threshold": 0.3, "score": 0.5 },
    "failing": { "threshold": 0.0, "score": 0.0 }
  }
}
```
Grading logic:
Compare student answer to reference answer using semantic similarity
Check for required elements
Award score based on similarity tier
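The tier lookup can be sketched directly from the rubric's scoring thresholds (the similarity value itself would come from Claude's semantic comparison, which is not modeled here):

```typescript
// Thresholds copied from the "comparative" rubric above, sorted
// descending so the first matching tier wins.
const scoringTiers = [
  { name: "excellent", threshold: 0.9, score: 1.0 },
  { name: "good", threshold: 0.7, score: 0.85 },
  { name: "acceptable", threshold: 0.5, score: 0.7 },
  { name: "poor", threshold: 0.3, score: 0.5 },
  { name: "failing", threshold: 0.0, score: 0.0 }
];

function tierScore(similarity: number): number {
  // The final 0.0 threshold always matches, so find() cannot miss
  // for any similarity in [0, 1].
  return scoringTiers.find((t) => similarity >= t.threshold)!.score;
}

tierScore(0.8); // 0.85, lands in the "good" tier
```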
Code execution rubric
For code questions, the rubric includes test cases:
```json
{
  "type": "code_execution",
  "language": "python",
  "test_cases": [
    {
      "input": "search(tree, 5)",
      "expected_output": "True",
      "weight": 0.2
    },
    {
      "input": "search(tree, 7)",
      "expected_output": "False",
      "weight": 0.2
    },
    {
      "input": "search(None, 5)",
      "expected_output": "False",
      "weight": 0.3,
      "description": "Edge case: empty tree"
    },
    {
      "input": "search(single_node, 1)",
      "expected_output": "True",
      "weight": 0.3,
      "description": "Edge case: single node"
    }
  ],
  "code_quality_weight": 0.2
}
```
Grading logic:
Execute student code against test cases
Calculate weighted score based on passed tests
Claude evaluates code quality (readability, efficiency) for remaining 20%
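Given per-test pass/fail results, the weighted portion of the score is a straightforward sum. Actual outputs would come from sandboxed execution, which this sketch omits:

```typescript
interface TestCase {
  input: string;
  expected_output: string;
  weight: number;
}

// Weighted sum over passed tests; `results[i]` is the observed output
// for test case i, produced elsewhere by sandboxed execution.
function weightedTestScore(cases: TestCase[], results: string[]): number {
  return cases.reduce(
    (sum, tc, i) => sum + (results[i] === tc.expected_output ? tc.weight : 0),
    0
  );
}

// The four BST search tests from the rubric above:
const bstTestCases: TestCase[] = [
  { input: "search(tree, 5)", expected_output: "True", weight: 0.2 },
  { input: "search(tree, 7)", expected_output: "False", weight: 0.2 },
  { input: "search(None, 5)", expected_output: "False", weight: 0.3 },
  { input: "search(single_node, 1)", expected_output: "True", weight: 0.3 }
];

// A submission passing tests 1 and 3 earns 0.2 + 0.3 = 0.5 on the
// execution portion, before the 20% code-quality weighting.
weightedTestScore(bstTestCases, ["True", "True", "False", "False"]);
```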
Feedback generation
The grading agent provides three types of feedback:
Correct answer feedback
```json
{
  "isCorrect": true,
  "score": 1.0,
  "feedback": "Excellent! You correctly identified that in-order traversal produces a sorted sequence. Your explanation of the recursive process was clear and accurate."
}
```
Partially correct feedback
```json
{
  "isCorrect": false,
  "score": 0.6,
  "feedback": "You're on the right track. You correctly mentioned that the left subtree contains smaller values, but you missed that the right subtree contains larger values. Also, remember that this property applies recursively to all subtrees, not just the root's children.",
  "misconceptions": [
    "Thinks BST property only applies to root level"
  ]
}
```
Incorrect answer feedback
```json
{
  "isCorrect": false,
  "score": 0.0,
  "feedback": "This isn't correct. A binary search tree has a specific ordering property: for every node, all values in the left subtree are smaller, and all values in the right subtree are larger. Review the BST properties and try again.",
  "misconceptions": [
    "Confuses BST with general binary tree",
    "Doesn't understand sorted order property"
  ],
  "suggested_review": ["BST Properties", "Tree Ordering"]
}
```
Good feedback explains why an answer is correct or incorrect, identifies specific misconceptions, and suggests what to review. This helps students learn from their mistakes.
Misconception detection
The grading agent identifies common misconceptions:
| Misconception | Detected when | Feedback |
| --- | --- | --- |
| Confuses in-order with pre-order | Student describes pre-order as producing a sorted sequence | "In-order traversal visits left-root-right and produces sorted output. You described pre-order (root-left-right)." |
| Thinks BST property is only at root | Student only checks the root's immediate children | "The BST property must hold for every node, not just the root. All nodes in the left subtree (not just the left child) must be smaller." |
| Forgets edge cases in deletion | Student doesn't handle the two-children case | "You handled leaf and one-child cases well, but deletion with two children requires finding the in-order successor or predecessor." |
| Confuses time complexity | Student claims O(log n) is the worst case | "O(log n) is the average case for balanced BSTs. Worst case is O(n) for a skewed tree (essentially a linked list)." |
Batch grading
The `grade_student_answers` tool grades all diagnostic questions at once:

```typescript
async function gradeAllDiagnostics(
  conceptNodeId: string,
  userId: string
): Promise<GradingSummary> {
  // Load the assessment with its questions
  const assessment = await db.query.assessments.findFirst({
    where: and(
      eq(assessments.targetNodeId, conceptNodeId),
      eq(assessments.userId, userId)
    ),
    with: {
      questions: true
    }
  });

  // Named studentAnswers to avoid shadowing the answers table
  const studentAnswers = await db.query.answers.findMany({
    where: eq(answers.assessmentId, assessment.id)
  });

  // Grade each answer
  const gradingResults = await Promise.all(
    studentAnswers.map(async (answer) => {
      const question = assessment.questions.find(q => q.id === answer.questionId);
      if (question.format === "mcq") {
        return gradeMCQ(question, answer);
      } else {
        return await gradeOpenEnded(question, answer);
      }
    })
  );

  // Update database with grades
  await db.transaction(async (tx) => {
    for (let i = 0; i < studentAnswers.length; i++) {
      await tx.update(answers)
        .set({
          isCorrect: gradingResults[i].isCorrect,
          score: gradingResults[i].score,
          feedback: gradingResults[i].feedback
        })
        .where(eq(answers.id, studentAnswers[i].id));
    }
  });

  // Generate summary
  return {
    total_questions: studentAnswers.length,
    correct: gradingResults.filter(r => r.isCorrect).length,
    incorrect: gradingResults.filter(r => !r.isCorrect).length,
    average_score: average(gradingResults.map(r => r.score)),
    weak_areas: identifyWeakAreas(gradingResults, assessment.questions),
    strong_areas: identifyStrongAreas(gradingResults, assessment.questions),
    misconceptions: gradingResults
      .flatMap(r => r.misconceptions || [])
      .filter((m, i, arr) => arr.indexOf(m) === i) // de-duplicate
  };
}
```
Batch grading is more efficient than grading one question at a time. It allows the Concept Refinement Agent to see the full picture of student performance.
Example grading flow
Student submits answers
Student completes diagnostic assessment with 8 questions (5 MCQ, 3 open-ended).
Concept Refinement Agent calls grade_student_answers
`grade_student_answers({ "conceptNodeId": "concept-uuid-123" })`
Grading agent evaluates each answer
MCQ questions: Direct comparison (instant)
Open-ended questions: Claude evaluation (2-5 seconds each)
Database updated with grades
Each answer record receives:
isCorrect: boolean
score: 0-1
feedback: string
Summary returned
```json
{
  "total_questions": 8,
  "correct": 5,
  "incorrect": 3,
  "average_score": 0.625,
  "weak_areas": ["deletion", "balancing"],
  "strong_areas": ["searching", "insertion"],
  "misconceptions": [
    "Confuses in-order with pre-order traversal"
  ]
}
```
Refinement agent adapts learning path
Uses summary to add bridge subconcepts for weak areas.
Performance
Token usage per answer:
MCQ: 0 tokens (no API call)
Open-ended (short): 500-1,000 tokens
Open-ended (long): 1,000-2,000 tokens
Typical diagnostic (8 questions):
5 MCQ: 0 tokens
3 open-ended: 3,000-6,000 tokens
Total: 3,000-6,000 tokens
Latency:
MCQ: under 10 ms per question
Open-ended: 2-5 seconds per question
Typical diagnostic: 6-15 seconds total
Grading happens in parallel for open-ended questions, so latency is determined by the slowest question, not the sum of all questions.
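That latency model can be stated precisely: with parallel grading, total time is roughly the maximum of the per-question times, whereas sequential grading pays their sum. A small illustration (timings are made up):

```typescript
// Parallel latency tracks the slowest question; sequential latency
// accumulates every question.
function parallelLatency(perQuestionMs: number[]): number {
  return Math.max(0, ...perQuestionMs);
}

function sequentialLatency(perQuestionMs: number[]): number {
  return perQuestionMs.reduce((total, ms) => total + ms, 0);
}

// Three open-ended questions taking 2 s, 3 s, and 5 s:
parallelLatency([2000, 3000, 5000]);   // 5000 ms
sequentialLatency([2000, 3000, 5000]); // 10000 ms
```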
API integration
The grading agent is called by the Concept Refinement Agent:
POST /api/agents/concepts/:conceptNodeId/run
When diagnostic answers exist, the refinement agent:
Calls grade_student_answers tool
Receives grading summary
Uses summary to adapt the learning path
Students don't interact with the grading agent directly; it runs behind the scenes.
Next steps
Refinement Agent: see how grading results drive path adaptation.
Diagnostics: learn about the diagnostic assessment flow.