The Hades lexer is the first stage of the interpreter pipeline. It reads a raw string of source code and converts it into a flat, ordered list of Token objects, each carrying a type tag, a concrete value, and the exact line/column where it appeared. No understanding of program structure happens here; the lexer only answers the question “what is this character sequence?”
Public API
Instantiate Lexer with the full source text, then call tokenize():
tokenize() repeatedly calls the internal get_next_token() method until a TT.EOF sentinel is produced, collecting every token into a list.
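The collect-until-EOF loop can be sketched as follows. This is a minimal stand-in, not the Hades source: SketchLexer, its tuple tokens, and its integers-only input are hypothetical, and only the names tokenize(), get_next_token(), and TT.EOF come from the description above.

```python
from enum import Enum, auto


class TT(Enum):
    INT = auto()
    EOF = auto()


class SketchLexer:
    """Hypothetical stand-in: recognises only whitespace-separated integers."""

    def __init__(self, source):
        self.words = source.split()
        self.pos = 0

    def get_next_token(self):
        # Once the input is exhausted, keep answering with the EOF sentinel.
        if self.pos >= len(self.words):
            return (TT.EOF, None)
        word = self.words[self.pos]
        self.pos += 1
        return (TT.INT, int(word))

    def tokenize(self):
        tokens = []
        while True:
            token = self.get_next_token()
            tokens.append(token)  # the EOF sentinel is collected too
            if token[0] is TT.EOF:
                return tokens


tokens = SketchLexer("1 22 333").tokenize()
```

Because the sentinel is appended before the loop exits, a consumer can always stop on TT.EOF instead of checking the list length.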
The lexer always appends a final Token(TT.EOF, None, line, column) so the parser never reads past the end of the token stream.
The Token Dataclass
Every token is an instance of the Token dataclass defined in modules/tokens.py:
__repr__ produces Token(TT.INT, 42, 0:5) — useful for debugging.
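A dataclass with this shape might look like the sketch below. The field names (type, value, line, column) are inferred from the Token(TT.EOF, None, line, column) call shown earlier, and the three-member TT enum is a cut-down placeholder; the actual definition in modules/tokens.py may differ.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class TT(Enum):
    # Placeholder subset of the real token types.
    INT = auto()
    STR = auto()
    EOF = auto()


@dataclass(repr=False)
class Token:
    type: TT
    value: Any
    line: int
    column: int

    def __repr__(self) -> str:
        # Compact form: type tag, value, then line:column.
        return f"Token(TT.{self.type.name}, {self.value}, {self.line}:{self.column})"


example = Token(TT.INT, 42, 0, 5)
```

Passing repr=False keeps the dataclass decorator from generating its own __repr__, so the compact line:column form is the one that appears in debug output.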
Token Categories
All token types are members of the TT enum. They fall into six broad categories.
Literals
INT, FLOAT, BOOL, STR, ID
Numeric literals carry a Python int or float value. BOOL tokens carry a Python bool: TRUE → True, FALSE → False. STR carries the already-unescaped string content. ID carries the raw identifier text.
Type Hints
INT_TYPE_HINT, FLOAT_TYPE_HINT, BOOL_TYPE_HINT, STR_TYPE_HINT, NOTHING_TYPE_HINT, LIST_TYPE_HINT
Emitted when the keywords int, float, bool, str, nothing, or list are seen. The parser uses these to enforce declared types on variables and function signatures.
Operators
Arithmetic: PLUS MINUS STAR SLASH PERCENT INCREMENT DECREMENT
Comparison: EQ NEQ TYPE_EQ TYPE_NEQ LT GT LTE GTE
Logical: AND OR XOR NOT
Assignment: ASSIGN PLUS_EQ MINUS_EQ STAR_EQ SLASH_EQ PERCENT_EQ AND_EQ OR_EQ XOR_EQ
Keywords
IF ELSE DO WHILE FOR IN FUNC CLASS CREATOR METHOD OPERATOR MY
All keywords are matched inside read_id() via a plain dictionary lookup after the full identifier text is accumulated.
Punctuation
SEMICOLON (;), COLON (:), COMMA (,), DOT (.)
Also: LPAREN RPAREN LBRACE RBRACE LBRACKET RBRACKET
Arrows & EOF
RIGHT_ARROW (->), RIGHT_DOUBLE_ARROW (=>)
-> is the list-index operator. => separates a function’s return type and doubles as the return statement keyword. EOF marks the end of input.
How Tokenization Works
Skip whitespace and comments
Before every token attempt, skip_whitespace() consumes spaces, tabs, carriage returns, and newlines. skip_comment() then discards everything from a // pair to the next newline. Both routines keep self.line and self.column up to date.
Read numbers
When self.current is a digit, read_number() accumulates digits and at most one decimal point. If exactly one . is seen the token type switches from TT.INT to TT.FLOAT; a second . raises SyntaxError. The final value is cast with int() or float() accordingly.
Read strings
An opening single-quote ' triggers read_str(). Characters are appended verbatim until either a closing ' or a backslash escape sequence is encountered. Supported escapes:
| Escape | Result |
|---|---|
| \n | newline |
| \t | tab |
| \\ | literal backslash |
| \' | literal single-quote |
Any other character after \ is passed through unchanged. An unterminated string (no closing ') raises SyntaxError.
Read identifiers and keywords
When self.current matches [A-Za-z_], read_id() accumulates alphanumeric characters and underscores, then checks the result against a keyword dictionary. Matches produce the corresponding keyword token type; everything else becomes TT.ID.
Match multi-character operator tokens
Operator tokens are matched in strict length order, longest first, to avoid ambiguity.
3-character tokens (checked first):
| Token | Symbol |
|---|---|
| TYPE_EQ | === |
| TYPE_NEQ | !== |
| AND_EQ | &&= |
| OR_EQ | ||= |
| XOR_EQ | ^^= |
2-character tokens (checked second):
| Token | Symbol | Token | Symbol |
|---|---|---|---|
| EQ | == | NEQ | != |
| GTE | >= | LTE | <= |
| PLUS_EQ | += | MINUS_EQ | -= |
| STAR_EQ | *= | SLASH_EQ | /= |
| PERCENT_EQ | %= | AND | && |
| OR | || | XOR | ^^ |
| INCREMENT | ++ | DECREMENT | -- |
| RIGHT_ARROW | -> | RIGHT_DOUBLE_ARROW | => |
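The steps in this section can be sketched together as one small standalone scanner. It is an illustration under stated assumptions, not the Hades implementation: scan(), the string type names, the (type, value) tuples, and the reduced KEYWORDS and OPERATORS tables are all hypothetical, and line/column tracking is left out for brevity. It also assumes that an unrecognised escape keeps only the character after the backslash, which the description above does not pin down.

```python
import string

# Illustrative subsets only; the real keyword and operator tables are larger.
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE", "func": "FUNC"}
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'"}
OPERATORS = {
    "===": "TYPE_EQ", "!==": "TYPE_NEQ",
    "==": "EQ", "!=": "NEQ", "+=": "PLUS_EQ", "++": "INCREMENT",
    "=": "ASSIGN", "+": "PLUS", ";": "SEMICOLON",
}
ID_START = string.ascii_letters + "_"
ID_BODY = ID_START + string.digits


def scan(src):
    """Return (type_name, value) pairs for src, ending with ("EOF", None)."""
    tokens, i, n = [], 0, len(src)
    while True:
        # Step 1: skip whitespace and // comments until real input is found.
        while i < n:
            if src[i] in " \t\r\n":
                i += 1
            elif src.startswith("//", i):
                while i < n and src[i] != "\n":
                    i += 1
            else:
                break
        if i >= n:
            tokens.append(("EOF", None))
            return tokens

        ch = src[i]
        if ch in string.digits:
            # Step 2: digits plus at most one decimal point.
            start, dots = i, 0
            while i < n and (src[i] in string.digits or src[i] == "."):
                if src[i] == ".":
                    dots += 1
                    if dots > 1:
                        raise SyntaxError("second '.' in numeric literal")
                i += 1
            text = src[start:i]
            tokens.append(("FLOAT", float(text)) if dots else ("INT", int(text)))
        elif ch == "'":
            # Step 3: copy characters up to the closing quote, decoding escapes.
            i += 1
            chars = []
            while i < n and src[i] != "'":
                if src[i] == "\\" and i + 1 < n:
                    i += 1
                    # Assumption: an unknown escape keeps only the escaped character.
                    chars.append(ESCAPES.get(src[i], src[i]))
                else:
                    chars.append(src[i])
                i += 1
            if i >= n:
                raise SyntaxError("unterminated string")
            i += 1  # consume the closing quote
            tokens.append(("STR", "".join(chars)))
        elif ch in ID_START:
            # Step 4: accumulate the identifier, then do a dictionary lookup.
            start = i
            while i < n and src[i] in ID_BODY:
                i += 1
            text = src[start:i]
            tokens.append((KEYWORDS.get(text, "ID"), text))
        else:
            # Step 5: try the longest operator first so '===' beats '=='.
            for length in (3, 2, 1):
                symbol = src[i:i + length]
                if symbol in OPERATORS:
                    tokens.append((OPERATORS[symbol], symbol))
                    i += len(symbol)
                    break
            else:
                raise SyntaxError(f"unexpected character {ch!r}")


tokens = scan("count += 1; // bump the counter")
```

Checking three-character candidates before two-character ones is what keeps === from being split into == followed by =.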
Debugging with pretty_print()
pretty_print() calls tokenize() internally and then formats the token stream line-by-line to standard output, grouping every token that shares the same source line:
Each token is printed as (type, value, line:col), and the word NEWLINE is inserted whenever a new source line begins. This makes it easy to spot tokenization problems without running the full parser.
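The grouping behaviour can be approximated with a few lines of Python. This sketch is only a guess at the layout: format_tokens() and its (type, value, line, column) tuples are hypothetical, and putting NEWLINE on a row of its own is an assumption, since the real output format is not shown here.

```python
def format_tokens(tokens):
    """Render (type, value, line, column) tuples, grouped by source line."""
    out = []        # finished output rows
    row = []        # tokens collected for the current source line
    current = None  # source line number of the row being built
    for type_name, value, line, column in tokens:
        if current is not None and line != current:
            # A new source line begins: flush the row and mark the break.
            out.append(" ".join(row))
            out.append("NEWLINE")
            row = []
        current = line
        row.append(f"({type_name}, {value}, {line}:{column})")
    if row:
        out.append(" ".join(row))
    return "\n".join(out)


sample = [
    ("ID", "x", 0, 0), ("ASSIGN", "=", 0, 2), ("INT", 1, 0, 4),
    ("ID", "y", 1, 0), ("EOF", None, 1, 1),
]
print(format_tokens(sample))
```

Returning a string and printing it at the call site keeps the formatting logic easy to check separately from standard output.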
Error reporting
All lexer errors are raised as Python SyntaxError via the RaiseError() helper, which includes the offending line and column. The parser and interpreter each define their own error types on top of this primitive.
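A helper of this kind could be as small as the sketch below. The name raise_error(), its parameters, and the message wording are assumptions made for illustration; the real RaiseError() signature and format are not documented here.

```python
def raise_error(message, line, column):
    """Hypothetical stand-in for RaiseError(): attach the position and raise."""
    raise SyntaxError(f"{message} at line {line}, column {column}")


try:
    raise_error("Unterminated string", 3, 14)
except SyntaxError as exc:
    # Callers higher up the pipeline can catch the error and wrap or re-report it.
    report = str(exc)
```

Routing every lexer failure through one helper keeps the position format consistent across all error sites.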