The Origin lexer (lexer.py) is the first stage of the compilation pipeline. It takes raw .or source text (provided as an iterable of strings, one per line) and produces a flat, ordered list of Token objects that the parser can consume sequentially. All pattern matching is done through a single ordered table of regular expressions called TOKEN_REGEX. Patterns are compiled once at module load time so that repeated calls to lex() on different files incur no re-compilation overhead.
lex(code_lines) -> list[Token]
The main public entry point of the lexer module.
lex iterates over code_lines one line at a time, advancing a column cursor across each line. At each position it tries every compiled pattern in TOKEN_REGEX_COMPILED in order; the first match wins. If a pattern’s token type is None (whitespace or comments), the matched text is consumed silently — no Token is appended. After exhausting each line, lex always appends a synthetic NEWLINE token whose col equals len(line) (the position immediately after the last character on that line). After exhausting all lines, it appends a final EOF token. The returned list therefore always ends with EOF, making it safe for the parser to read past the last real token without an index error.
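The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual lexer.py: the pattern table is a shortened, hypothetical subset, the Token field names other than col are assumed, and the SyntaxError raised on unmatched input is a placeholder for whatever error the real lex uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    # Field names are assumptions for this sketch; the text only names `col`.
    type: str
    value: str
    line: int
    col: int


# Shortened, hypothetical table; the real TOKEN_REGEX has 18 entries.
TOKEN_REGEX = [
    (r"[ \t]+", None),                       # whitespace, consumed silently
    (r"#.*", None),                          # comment, consumed silently
    (r"\d+", "INT"),
    (r"=", "ASSIGN"),
    (r"\b(?:let|def|class)\b", "KEYWORD"),   # listed before IDENT
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]
TOKEN_REGEX_COMPILED = [(re.compile(p), t) for p, t in TOKEN_REGEX]


def lex(code_lines):
    tokens = []
    line_no = 0
    for line_no, line in enumerate(code_lines, start=1):
        col = 0
        while col < len(line):
            # Try every pattern in table order; the first match wins.
            for pattern, tok_type in TOKEN_REGEX_COMPILED:
                m = pattern.match(line, col)
                if m:
                    if tok_type is not None:
                        tokens.append(Token(tok_type, m.group(), line_no, col))
                    col = m.end()
                    break
            else:
                # Placeholder error type; assumed, not taken from lexer.py.
                raise SyntaxError(f"unexpected {line[col]!r} at line {line_no}, col {col}")
        # Synthetic NEWLINE after every line, including empty ones.
        tokens.append(Token("NEWLINE", "\n", line_no, len(line)))
    tokens.append(Token("EOF", "", line_no + 1, 0))
    return tokens


tokens = lex(["let x = 42", ""])
print([t.type for t in tokens])
# ['KEYWORD', 'IDENT', 'ASSIGN', 'INT', 'NEWLINE', 'NEWLINE', 'EOF']
```

The sketch assumes each input string carries no trailing newline character, so len(line) is the column just past the last visible character.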
A NEWLINE token is synthesized after every line, including empty lines. This means the token stream always has a NEWLINE between logical statements regardless of whether the source line contained any other tokens.
lex raises:
Token Class
Every piece of recognized source text becomes a Token instance:
For the source line let x = 42:
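As a rough illustration of what a Token might carry and of the stream for let x = 42, the hand-written sketch below assumes four fields (type, value, line, col), 1-based line numbers, and 0-based columns. Only the token types and the NEWLINE column rule come from this page; the field names, the NEWLINE and EOF values, and the EOF position are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    type: str   # e.g. "KEYWORD", "IDENT", "INT"
    value: str  # the matched source text
    line: int   # assumed 1-based
    col: int    # assumed 0-based


source = "let x = 42"

# Hand-built stream, not output captured from the real lexer.
expected = [
    Token("KEYWORD", "let", 1, 0),
    Token("IDENT", "x", 1, 4),
    Token("ASSIGN", "=", 1, 6),
    Token("INT", "42", 1, 8),
    Token("NEWLINE", "\n", 1, len(source)),  # col == len(line)
    Token("EOF", "", 2, 0),                  # position assumed
]

for tok in expected:
    print(tok)
```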
return_token_type(TOKEN) -> str | None
A utility function for looking up what token type a given string would produce, without running a full lex pass:
It calls pattern.fullmatch(TOKEN) against each compiled pattern and returns the token type of the first full match, or None if nothing matches. This is useful for testing whether a string is a keyword before attempting to lex full source.
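A minimal sketch of that lookup, assuming a shortened, hypothetical pattern table rather than the real one:

```python
import re

# Hypothetical subset of the pattern table, in the documented order.
TOKEN_REGEX = [
    (r"[ \t]+", None),
    (r"\d+\.\d+", "FLOAT"),
    (r"\d+", "INT"),
    (r"\b(?:let|def|class)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]
TOKEN_REGEX_COMPILED = [(re.compile(p), t) for p, t in TOKEN_REGEX]


def return_token_type(TOKEN):
    # fullmatch requires the whole string to match, not just a prefix.
    for pattern, tok_type in TOKEN_REGEX_COMPILED:
        if pattern.fullmatch(TOKEN):
            return tok_type
    return None


print(return_token_type("let"))     # KEYWORD
print(return_token_type("letter"))  # IDENT
print(return_token_type("3.14"))    # FLOAT
print(return_token_type("@"))       # None
```

In this sketch a whitespace-only string also yields None, because the whitespace pattern's own type is None; a caller cannot tell that case apart from "nothing matched".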
TOKEN_REGEX — The Pattern Table
TOKEN_REGEX is the authoritative source for every token the lexer can recognize. It is declared as a module-level list of (pattern_str, token_type) tuples and contains 18 entries in total (2 with None type for silently-consumed input, 16 with named types):
The patterns are compiled once, at module load time, into TOKEN_REGEX_COMPILED:
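As a rough sketch of the table shape and the one-time compile step, assuming a hypothetical four-entry table in place of the real 18 entries:

```python
import re

# (pattern_str, token_type) tuples; a None type marks silently consumed input.
TOKEN_REGEX = [
    (r"[ \t]+", None),
    (r"#.*", None),
    (r"\d+", "INT"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]

# Evaluated once when the module is imported, so every later lex() call
# reuses the same compiled pattern objects.
TOKEN_REGEX_COMPILED = [
    (re.compile(pattern_str), token_type)
    for pattern_str, token_type in TOKEN_REGEX
]

print(len(TOKEN_REGEX_COMPILED))  # 4
```

Keeping the compiled list in the same order as TOKEN_REGEX preserves the first-match-wins priority of the original table.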
Order is semantically significant. Multi-character operators like **, //, ==, and **= appear before their single-character prefixes (*, /, =) so the longer form is matched first. Similarly, KEYWORD is listed before IDENT so that reserved words such as let, def, and class are never classified as identifiers.
Token Type Reference
| Token Type | What it matches | Emitted? |
|---|---|---|
| None | Whitespace ([ \t]+) and comments (#...) | No (silently consumed) |
| NEWLINE | \n within a line; also a synthetic token added after every line at col=len(line) | Yes |
| HEX | 0x followed by hex digits, e.g. 0xFF | Yes |
| FLOAT | Digits, dot, digits, e.g. 3.14 | Yes |
| INT | One or more decimal digits, e.g. 42 | Yes |
| STRING | Double- or single-quoted text, e.g. "hello", 'world' | Yes |
| COMP | ===, !==, ==, !=, <=, >=, <>, <, > | Yes |
| LOGIC | &&, \|\|, and, or, not, ! | Yes |
| UNARY | ++, -- | Yes |
| ASSIGN_OP | +=, -=, *=, /=, %=, **=, //=, &=, \|= | Yes |
| SPECIAL | ??, ->, =>, <=>, :: | Yes |
| ASSIGN | = (plain assignment, after compound forms) | Yes |
| ARITH | +, -, **, *, //, /, %, &, \|, ^, <<, >> | Yes |
| BRACKET | [, ], {, } | Yes |
| SYMBOL | (, ), :, ,, ., ;, ? | Yes |
| KEYWORD | Any reserved word (see list below) | Yes |
| IDENT | [A-Za-z_][A-Za-z0-9_]* not matched as a keyword | Yes |
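The ordering rule for TOKEN_REGEX can be demonstrated with a small experiment. The two tables below are hypothetical: they split the operators into one pattern per entry purely to show what happens when a shorter pattern is tried before a longer one.

```python
import re


def first_match(table, text):
    """Return (token_type, matched_text) for the first pattern matching at position 0."""
    for pattern_str, tok_type in table:
        m = re.match(pattern_str, text)
        if m:
            return tok_type, m.group()
    return None


# Longest form first, as the documented ordering requires.
longest_first = [(r"\*\*=", "ASSIGN_OP"), (r"\*\*", "ARITH"), (r"\*", "ARITH")]
# Same patterns in the wrong order.
shortest_first = list(reversed(longest_first))

print(first_match(longest_first, "**= 2"))   # ('ASSIGN_OP', '**=')
print(first_match(shortest_first, "**= 2"))  # ('ARITH', '*')
```

With the shorter pattern first, `**=` would be split into three separate tokens instead of being recognized as one compound assignment.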
Full Keyword List
The KEYWORD pattern matches any of the following reserved words (word-boundary anchored so that e.g. letter is not matched):
Any other word matching [A-Za-z_][A-Za-z0-9_]* receives the IDENT type.
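To see why the word-boundary anchoring matters, the sketch below compares an anchored and an unanchored keyword pattern. It uses only the three keywords named on this page (let, def, class); the exact regular expression in lexer.py is assumed, not quoted.

```python
import re

KEYWORDS = ("let", "def", "class")  # subset only, for illustration
alternation = "|".join(KEYWORDS)

anchored = re.compile(r"\b(?:" + alternation + r")\b")
unanchored = re.compile(r"(?:" + alternation + r")")
ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Without \b, the keyword pattern bites off the front of an identifier.
print(unanchored.match("letter").group())  # let
# With \b, the keyword pattern declines and IDENT takes the whole word.
print(anchored.match("letter"))            # None
print(ident.match("letter").group())       # letter
# A real keyword followed by a space still matches.
print(anchored.match("let x").group())     # let
```

Because the underscore counts as a word character, a name such as let_x is also rejected by the anchored pattern and falls through to IDENT.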