Function
find_duplicate_questions(cards, threshold=0.85) → List[Tuple[int, int, float]]
cards: The list of Flashcard objects to check. Typically deck.cards.
threshold: Combined similarity score (0.0–1.0) at or above which a pair is reported as a duplicate. Higher values require a closer match; lower values catch fuzzier duplicates.
Returns: A list of (index_a, index_b, similarity_ratio) tuples. Only pairs whose combined ratio meets or exceeds threshold are included. Results are sorted by (index_a, index_b), that is, in deck order, so related duplicates appear together.
Algorithm
The function uses a two-stage approach that is roughly 10–50× faster than exhaustive pairwise comparison on decks with 500+ cards:
Build 3-gram shingles
Every normalised question is converted to a set of overlapping 3-character substrings (e.g. "abc def" → {"abc", "bc ", "c d", " de", "def"}).
Build inverted index
Each shingle is mapped to the list of card indices that contain it, allowing fast candidate lookup.
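The two steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the helper names `shingles` and `build_inverted_index` are hypothetical, and the normalisation shown (lowercasing and collapsing whitespace) is an assumption, since this page does not say how questions are normalised.

```python
from collections import defaultdict


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-character substrings of text."""
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    norm = " ".join(text.lower().split())
    if len(norm) < k:
        # Assumption: a very short string is kept as a single shingle.
        return {norm} if norm else set()
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def build_inverted_index(questions: list[str]) -> dict[str, list[int]]:
    """Map each shingle to the indices of the questions that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for i, question in enumerate(questions):
        # shingles() returns a set, so each index is appended once per shingle.
        for sh in shingles(question):
            index[sh].append(i)
    return index


questions = [
    "What is the capital of France?",
    "What's the capital of France?",
    "Define osmosis.",
]
index = build_inverted_index(questions)
# Questions 0 and 1 share the shingle "fra"; question 2 does not contain it.
shared = index["fra"]
```

Only cards that appear together under at least one shingle can have a non-zero Jaccard similarity, so the index limits later comparisons to pairs with some textual overlap.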
Pre-filter with Jaccard similarity
Card pairs whose shingle sets have a Jaccard similarity of at least max(threshold - 0.15, 0.5) are added to the candidate set. Pairs below this cutoff are skipped entirely.
Run SequenceMatcher on candidates
difflib.SequenceMatcher computes an exact character-level similarity ratio for the question text of each candidate pair.
Compute combined ratio
The final score weights question similarity more heavily than answer similarity:
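The weighting formula itself is not shown on this page. The sketch below illustrates the pre-filter cutoff and one combined score consistent with the text. The 0.4 answer weight comes from the note on this page; the 0.6 question weight is an assumption (the remainder), and the names `jaccard`, `prefilter_floor` and `combined_ratio` are hypothetical.

```python
from difflib import SequenceMatcher

ANSWER_WEIGHT = 0.4    # stated on this page
QUESTION_WEIGHT = 0.6  # assumption: the remaining share of the score


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def prefilter_floor(threshold: float) -> float:
    """Minimum Jaccard similarity a pair needs to become a candidate."""
    return max(threshold - 0.15, 0.5)


def combined_ratio(q1: str, a1: str, q2: str, a2: str) -> float:
    """Weighted mix of question and answer SequenceMatcher ratios."""
    q_ratio = SequenceMatcher(None, q1, q2).ratio()
    a_ratio = SequenceMatcher(None, a1, a2).ratio()
    return QUESTION_WEIGHT * q_ratio + ANSWER_WEIGHT * a_ratio


# Identical questions, answers with no characters in common:
# question ratio 1.0, answer ratio 0.0, so the score is 0.6 under the assumed weights.
score = combined_ratio(
    "What is the capital of France?", "Paris",
    "What is the capital of France?", "Lyon",
)
flagged = score >= 0.85
```

Under these assumed weights, a pair with identical questions reaches the default threshold of 0.85 only when the answer ratio is at least 0.625.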
Usage example
Answer similarity contributes 40% of the combined score. Two questions with identical wording but sufficiently different correct answers will therefore score lower than threshold and will not be flagged as duplicates.
ExportService
Export cleaned decks to Quizlet-compatible text files.
GeminiService
Extract flashcards from exam images using Gemini AI.