NaturalLexer (package tautoteacher2.nlp.lexer) converts a normalized Spanish string into an ordered list of TokenNatural objects. It works as a single-pass, left-to-right scanner: at every position it first tries to match one of the known keyword patterns; only when no keyword matches does it advance character by character to accumulate a LITERAL fragment. The keyword list is sorted longest-first so that multi-word connectives like "si y solo si " are always recognized before their shorter prefixes like "si ".
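The scanning strategy described above (keyword check first, character accumulation otherwise) can be sketched roughly as follows. This is a simplified illustration, not the project's actual source: the class name ScanLoopSketch, the helper volcarLiteral, the exact keyword spellings (including their leading or trailing spaces), and the word-boundary guard are all assumptions, and tokens are rendered as plain strings rather than TokenNatural objects to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

final class ScanLoopSketch {
    // Keyword table ordered longest-first. The exact spellings are assumed.
    static final String[][] CLAVES = {
        {"en caso de que ", "EN_CASO_DE_QUE"},
        {"si y solo si ", "SI_Y_SOLO_SI"},
        {"a menos que ", "A_MENOS_QUE"},
        {"siempre que ", "SIEMPRE_QUE"},
        {"entonces ", "ENTONCES"},
        {"solo si ", "SOLO_SI"},
        {"cuando ", "CUANDO"},
        {" y ", "Y"},
        {" o ", "O"},
        {"si ", "SI"},
    };

    // Returns the first (longest) keyword entry matching at pos, or null.
    static String[] encontrarPalabraClave(String s, int pos) {
        for (String[] c : CLAVES) {
            String clave = c[0];
            // Assumed guard: a keyword that starts with a letter must start a word,
            // so that "si " is not matched inside a word such as "casi ".
            boolean inicioPalabra =
                clave.charAt(0) == ' ' || pos == 0 || s.charAt(pos - 1) == ' ';
            if (inicioPalabra && s.regionMatches(pos, clave, 0, clave.length())) {
                return c;
            }
        }
        return null;
    }

    static List<String> tokenizar(String texto) {
        List<String> tokens = new ArrayList<>();
        if (texto == null || texto.isEmpty()) {
            return tokens; // empty list, no exception
        }
        StringBuilder literal = new StringBuilder();
        int pos = 0;
        while (pos < texto.length()) {
            // Keyword matching happens before any whitespace handling.
            String[] clave = encontrarPalabraClave(texto, pos);
            if (clave != null) {
                volcarLiteral(literal, tokens);
                tokens.add(clave[1] + "(\"" + clave[0].trim() + "\")");
                pos += clave[0].length();
            } else {
                literal.append(texto.charAt(pos));
                pos++;
            }
        }
        volcarLiteral(literal, tokens);
        return tokens;
    }

    // Emits the pending literal (trimmed) if it is non-empty, then resets the buffer.
    private static void volcarLiteral(StringBuilder literal, List<String> tokens) {
        String lexema = literal.toString().trim();
        if (!lexema.isEmpty()) {
            tokens.add("LITERAL(\"" + lexema + "\")");
        }
        literal.setLength(0);
    }
}
```

With this sketch, ScanLoopSketch.tokenizar("a y b") yields LITERAL("a"), Y("y"), LITERAL("b"), and a null or empty input yields an empty list.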
Token Types
The TipoTokenNatural enum defines every token the lexer can emit. The table below lists each value together with the Spanish keyword or phrase that triggers it.
| Token type | Spanish keyword(s) | Logical role |
|---|---|---|
| SI | si | Conditional antecedent marker |
| ENTONCES | entonces | Conditional consequent marker |
| Y | y | Conjunction |
| O | o | Disjunction |
| SIEMPRE_QUE | siempre que | Sufficient-condition variant of SI |
| CUANDO | cuando | Temporal conditional (read as implication) |
| SOLO_SI | solo si | Necessary-condition marker ("P solo si Q" reads as P → Q) |
| A_MENOS_QUE | a menos que | Exception marker ("P a menos que Q" reads as ¬Q → P) |
| SI_Y_SOLO_SI | si y solo si | Biconditional |
| EN_CASO_DE_QUE | en caso de que | Alternative conditional phrasing |
| LITERAL | (any non-keyword fragment) | Propositional text fragment |
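One way to picture the table in code is an enum whose constants carry their trigger phrase. The constant names below come from the table; the wrapper class TiposSketch, the frase field, and the desdeFrase lookup are hypothetical additions for illustration and are not documented members of the real TipoTokenNatural.

```java
final class TiposSketch {
    enum TipoTokenNatural {
        SI("si"),
        ENTONCES("entonces"),
        Y("y"),
        O("o"),
        SIEMPRE_QUE("siempre que"),
        CUANDO("cuando"),
        SOLO_SI("solo si"),
        A_MENOS_QUE("a menos que"),
        SI_Y_SOLO_SI("si y solo si"),
        EN_CASO_DE_QUE("en caso de que"),
        LITERAL(null); // no fixed phrase: any non-keyword fragment

        // Hypothetical field: the Spanish phrase that triggers this token type.
        private final String frase;

        TipoTokenNatural(String frase) {
            this.frase = frase;
        }

        String getFrase() {
            return frase;
        }

        // Hypothetical lookup: exact phrase -> token type, LITERAL when nothing matches.
        static TipoTokenNatural desdeFrase(String texto) {
            for (TipoTokenNatural t : values()) {
                if (t.frase != null && t.frase.equals(texto)) {
                    return t;
                }
            }
            return LITERAL;
        }
    }
}
```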
The TokenNatural Class
Each token emitted by the lexer is a TokenNatural instance carrying exactly two fields:
- TipoTokenNatural tipo: the token classification from the enum above.
- String lexema: the matched surface text (trimmed). For keyword tokens this is the keyword without surrounding spaces; for LITERAL tokens it is the raw proposition fragment.
TokenNatural is immutable: both fields are set in the constructor and exposed via getTipo() and getLexema(). Its toString() method produces readable output like SI("si") or LITERAL("llueve"), which is useful for debugging pattern matching.
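A minimal sketch of such an immutable token class is shown below. The field names, getters, and toString() format follow the description above; the wrapper class TokenNaturalSketch, the null check, and the decision to trim inside the constructor are assumptions (the real class may expect the lexer to trim before constructing the token).

```java
import java.util.Objects;

final class TokenNaturalSketch {
    // Copy of the enum constants so the example compiles on its own.
    enum TipoTokenNatural {
        SI, ENTONCES, Y, O, SIEMPRE_QUE, CUANDO, SOLO_SI,
        A_MENOS_QUE, SI_Y_SOLO_SI, EN_CASO_DE_QUE, LITERAL
    }

    static final class TokenNatural {
        private final TipoTokenNatural tipo;
        private final String lexema;

        TokenNatural(TipoTokenNatural tipo, String lexema) {
            this.tipo = Objects.requireNonNull(tipo, "tipo");
            // Assumption: trimming is done here rather than by the caller.
            this.lexema = Objects.requireNonNull(lexema, "lexema").trim();
        }

        TipoTokenNatural getTipo() {
            return tipo;
        }

        String getLexema() {
            return lexema;
        }

        // Renders the token as TYPE("lexeme"), for example SI("si").
        @Override
        public String toString() {
            return tipo + "(\"" + lexema + "\")";
        }
    }
}
```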
Keyword Matching Priority
The lexer keeps an internal array of keyword strings ordered from longest to shortest. Matching is performed by encontrarPalabraClave(), which iterates this array in order and checks for a match using String.regionMatches. The first match wins. This ordering has two important consequences:
- "si y solo si" is recognized as a single SI_Y_SOLO_SI token rather than three separate tokens SI, Y, SI.
- The conjunctions " y " and " o " include a leading space in their keyword string, which means they can only match between words. The lexer must attempt keyword matching before skipping whitespace, not after, so that "a y b" correctly emits LITERAL("a"), Y("y"), LITERAL("b").
Elipsis Splitting
When a LITERAL follows immediately after SI, SIEMPRE_QUE, or CUANDO and contains multiple words but no internal connective keywords, the lexer applies an elipsis-splitting rule. For three or more words it places the first word as one LITERAL and the remainder as a second LITERAL, modelling elliptical constructions like “si estudio apruebo” (no ENTONCES present). The "no " prefix is fused with the following word during splitting so that negated fragments like "no estudio" remain a single token.
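The splitting rule can be illustrated with a small helper. This is only a sketch under stated assumptions: the class ElipsisSketch and the method dividirElipsis are hypothetical names, and the two-word case is assumed to follow the same "first word, then remainder" pattern as the three-or-more-word case described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ElipsisSketch {
    // Hypothetical helper: splits a connective-free fragment into
    // an antecedent part and a consequent part.
    static List<String> dividirElipsis(String fragmento) {
        String[] palabras = fragmento.trim().split("\\s+");
        // The head is the first word, or "no" fused with the word after it.
        int cabeza = (palabras.length > 1 && palabras[0].equals("no")) ? 2 : 1;

        List<String> partes = new ArrayList<>();
        if (palabras.length <= cabeza) {
            // Nothing remains after the head, so the fragment stays whole.
            partes.add(String.join(" ", palabras));
            return partes;
        }
        partes.add(String.join(" ", Arrays.copyOfRange(palabras, 0, cabeza)));
        partes.add(String.join(" ", Arrays.copyOfRange(palabras, cabeza, palabras.length)));
        return partes;
    }
}
```

Under these assumptions, "estudio apruebo" splits into "estudio" and "apruebo", while "no estudio repruebo el examen" splits into "no estudio" and "repruebo el examen".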
Usage Example
The tokenizar method accepts the output of NormalizadorTexto directly. Passing null or an empty string returns an empty list without throwing.
Regression fix: keyword scanning before whitespace skip. An earlier version of the scan loop skipped whitespace before attempting keyword matching. As a result, " y " and " o ", whose patterns begin with a space, never matched after a literal, and the conjunction was silently swallowed into the preceding fragment. The fix was to call encontrarPalabraClave() at the current position before any whitespace-skipping logic. This is why the conjunction keywords are stored with a leading space and why the loop ordering matters.