The Origin lexer (lexer.py) is the first stage of the compilation pipeline. It takes raw .or source text — provided as an iterable of strings, one per line — and produces a flat, ordered list of Token objects that the parser can consume sequentially. All pattern matching is done through a single ordered table of regular expressions called TOKEN_REGEX. Patterns are compiled once at module load time so that repeated calls to lex() on different files incur no re-compilation overhead.

lex(code_lines) -> list[Token]

lex is the main public entry point of the lexer module:
from lexer import lex

with open("main.or") as f:
    lines = [line.rstrip("\n") for line in f]

tokens = lex(lines)
# tokens[-1] is always the EOF token, shown as Token(EOF, '', N:0)
lex iterates over code_lines one line at a time, advancing a column cursor across each line. At each position it tries every compiled pattern in TOKEN_REGEX_COMPILED in order; the first match wins. If the matching pattern's token type is None (whitespace or comments), the matched text is consumed silently and no Token is appended.
After exhausting each line, lex appends a synthetic NEWLINE token whose col equals len(line) (the position immediately after the last character on that line). After exhausting all lines, it appends a final EOF token. The returned list therefore always ends with EOF, so a parser that reads one position past the last real token gets EOF instead of an index error.
A NEWLINE token is synthesized after every line, including empty lines. This means the token stream always has a NEWLINE between logical statements regardless of whether the source line contained any other tokens.
Error behavior: If no pattern matches the current character, lex raises:
SyntaxError(f"Illegal Character {char!r} at {line_num}:{col}")
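The loop described above can be sketched as follows. This is a minimal illustration and not the code in lexer.py: lex_sketch and SKETCH_PATTERNS are hypothetical names, the pattern table is cut down to six entries, and the NEWLINE value "\n" and the EOF line number (last line plus one) are assumptions inferred from the example output on this page.

```python
import re


class Token:
    def __init__(self, type_, value, line, col):
        self.type, self.value, self.line, self.col = type_, value, line, col

    def __repr__(self):
        return f"Token({self.type}, {self.value!r}, {self.line}:{self.col})"


# Reduced, illustrative table; the real TOKEN_REGEX has 18 entries.
SKETCH_PATTERNS = [(re.compile(p), t) for p, t in [
    (r"[ \t]+", None),                      # whitespace, consumed silently
    (r"#.*", None),                         # comments, consumed silently
    (r"\d+", "INT"),
    (r"=", "ASSIGN"),
    (r"\b(let)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]]


def lex_sketch(code_lines):
    tokens = []
    line_num = 0
    for line_num, line in enumerate(code_lines, start=1):
        col = 0
        while col < len(line):
            for pattern, token_type in SKETCH_PATTERNS:
                m = pattern.match(line, col)   # anchored at the cursor
                if m:
                    if token_type is not None:
                        tokens.append(Token(token_type, m.group(), line_num, col))
                    col = m.end()
                    break                      # first match wins
            else:
                # No pattern matched at this column.
                raise SyntaxError(
                    f"Illegal Character {line[col]!r} at {line_num}:{col}"
                )
        # Synthetic NEWLINE after every line, including empty ones.
        tokens.append(Token("NEWLINE", "\n", line_num, len(line)))
    # Assumption: EOF is reported on the line after the last one, at column 0.
    tokens.append(Token("EOF", "", line_num + 1, 0))
    return tokens


for tok in lex_sketch(["let x = 42", "", "# only a comment"]):
    print(tok)
```

Under these assumptions, the first line yields the same token types and positions as the let x = 42 example in the Token Class section, and the empty line and the comment-only line each contribute a single NEWLINE.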

Token Class

Every piece of recognized source text becomes a Token instance:
class Token:
    def __init__(self, type_, value, line, col):
        self.type  = type_   # str  — token type name, e.g. "INT", "IDENT", "KEYWORD"
        self.value = value   # str  — the original matched source text
        self.line  = line    # int  — 1-based source line number
        self.col   = col     # int  — 0-based column index within the line

    def __repr__(self):
        return f"Token({self.type}, {self.value!r}, {self.line}:{self.col})"
Example output for the source text let x = 42:
Token(KEYWORD, 'let', 1:0)
Token(IDENT,   'x',   1:4)
Token(ASSIGN,  '=',   1:6)
Token(INT,     '42',  1:8)
Token(NEWLINE, '\\n', 1:10)
Token(EOF,     '',    2:0)
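Because line is 1-based and col is 0-based, a token's position can be used directly to point at the offending text in a diagnostic. The snippet below is a small sketch of that idea; caret_line is a hypothetical helper, not part of the lexer module, and the Token class is repeated only so the example runs on its own.

```python
class Token:
    def __init__(self, type_, value, line, col):
        self.type = type_
        self.value = value
        self.line = line
        self.col = col

    def __repr__(self):
        return f"Token({self.type}, {self.value!r}, {self.line}:{self.col})"


def caret_line(token, source_line):
    """Return the source line with a caret under the token's first character."""
    # col is 0-based, so it doubles as the number of spaces before the caret.
    marker = " " * token.col + "^"
    return f"{source_line}\n{marker} {token.type} at {token.line}:{token.col}"


tok = Token("INT", "42", 1, 8)
print(caret_line(tok, "let x = 42"))
```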

return_token_type(TOKEN) -> str | None

A utility function for looking up what token type a given string would produce, without running a full lex pass:
from lexer import return_token_type

return_token_type("42")       # "INT"
return_token_type("let")      # "KEYWORD"
return_token_type("myVar")    # "IDENT"
return_token_type("???")      # None
It uses pattern.fullmatch(TOKEN) against each compiled pattern and returns the token type of the first full match, or None if nothing matches. This is useful for testing whether a string is a keyword before attempting to lex full source.
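A lookup of this kind might be implemented roughly as below. This is a sketch under the description above, not the module's source: return_token_type_sketch and SKETCH_COMPILED are hypothetical names, and the table is reduced to five patterns.

```python
import re

# Reduced, illustrative table in the same order as the real one.
SKETCH_COMPILED = [(re.compile(p), t) for p, t in [
    (r"[ \t]+", None),
    (r"\d+\.\d+", "FLOAT"),
    (r"\d+", "INT"),
    (r"\b(let|def|class)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]]


def return_token_type_sketch(text):
    for pattern, token_type in SKETCH_COMPILED:
        # fullmatch requires the pattern to cover the entire string.
        if pattern.fullmatch(text):
            return token_type
    return None


print(return_token_type_sketch("3.14"))
print(return_token_type_sketch("letter"))
```

One consequence of this reading: a string of pure whitespace fully matches a pattern whose type is None, so the result is None there too, and a caller cannot tell "ignored input" from "no match" by the return value alone.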

TOKEN_REGEX — The Pattern Table

TOKEN_REGEX is the authoritative source for every token the lexer can recognize. It is declared as a module-level list of (pattern_str, token_type) tuples and contains 18 entries in total (2 with None type for silently-consumed input, 16 with named types):
TOKEN_REGEX = [
    (r"[ \t]+",              None),       # Ignore whitespace
    (r"#.*",                 None),       # Ignore comments
    (r"\n",                  "NEWLINE"),
    (r"0x[0-9a-fA-F]+",      "HEX"),
    (r"\d+\.\d+",            "FLOAT"),
    (r"\d+",                 "INT"),
    (r"\".*?\"|'.*?'",       "STRING"),
    (r"===|!==|==|!=|<=|>=|<>|<|>", "COMP"),
    (r"\&\&|\|\||\b(and|or|not)\b|!", "LOGIC"),
    (r"\+\+|\-\-",           "UNARY"),
    (r"\+=|\-=|\*=|\/=|\%=|\*\*=|\/\/=|&=|\|=", "ASSIGN_OP"),
    (r"\?\?|->|=>|<=>|::",   "SPECIAL"),
    (r"=",                   "ASSIGN"),
    (r"\+|\-|\*\*|\*|\/\/|\/|\%|\&|\||\^|<<|>>", "ARITH"),
    (r"\[|\]|\{|\}",         "BRACKET"),
    (r"\(|\)|:|,|\.|;|\?",   "SYMBOL"),
    (r"\b(none|if|elif|open|else|check|for|get|while|return|py|int|len|str|sqrt|float|let|rand_num|const|in|print|true|exec|false|break|input|continue|def|import|from|class|try|call|except|raise|set|pass|yield|with|as|del|assert|global|nonlocal|async|await|match|case|macro|inline|parallel|when|range|unless|loop|until|do|struct|enum|type|bool|interface|pub|priv)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]
The list is immediately compiled into TOKEN_REGEX_COMPILED:
TOKEN_REGEX_COMPILED = [(re.compile(r), t) for r, t in TOKEN_REGEX]
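Order is semantically significant, because the first pattern that matches at the cursor wins. Multi-character operators such as **, //, ==, and **= are tried before their single-character prefixes (*, /, =), so the longer form is matched first. Similarly, KEYWORD is listed before IDENT so that reserved words such as let, def, and class are never classified as identifiers. The same rule means a few listed alternatives appear to be unreachable with the table as shown: COMP precedes both SPECIAL and ARITH, so <=> would lex as <= followed by >, and << and >> would each lex as two COMP tokens.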
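The effect of ordering can be checked with the operator patterns copied verbatim from the table. The helper first_match is hypothetical and only mimics the "first match wins" rule at position 0; the UNARY and LOGIC entries are left out because they do not match any of the inputs tried here.

```python
import re

# Operator patterns copied from TOKEN_REGEX, in their original order.
OPERATOR_PATTERNS = [(re.compile(p), t) for p, t in [
    (r"===|!==|==|!=|<=|>=|<>|<|>", "COMP"),
    (r"\+=|\-=|\*=|\/=|\%=|\*\*=|\/\/=|&=|\|=", "ASSIGN_OP"),
    (r"\?\?|->|=>|<=>|::", "SPECIAL"),
    (r"=", "ASSIGN"),
    (r"\+|\-|\*\*|\*|\/\/|\/|\%|\&|\||\^|<<|>>", "ARITH"),
]]


def first_match(text):
    """Return (token_type, matched_text) for the first pattern matching at 0."""
    for pattern, token_type in OPERATOR_PATTERNS:
        m = pattern.match(text)
        if m:
            return token_type, m.group()
    return None


for op in ["**=", "**", "==", "=", "=>", "<=>", "<<"]:
    print(op, first_match(op))
```

With this ordering, **= and ** resolve to ASSIGN_OP and ARITH as intended, while <=> and << are claimed by COMP as <= and < before SPECIAL or ARITH is consulted. If those operators are meant to be single tokens, moving SPECIAL and the shift operators ahead of COMP would be one way to achieve it; that is a suggestion, not a description of the current lexer.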

Token Type Reference

Token Type  What it matches                                                                   Emitted?
None        Whitespace ([ \t]+) and comments (#...)                                           No, silently consumed
NEWLINE     \n within a line; also a synthetic token added after every line at col=len(line)  Yes
HEX         0x followed by hex digits, e.g. 0xFF                                              Yes
FLOAT       Digits, dot, digits, e.g. 3.14                                                    Yes
INT         One or more decimal digits, e.g. 42                                               Yes
STRING      Double- or single-quoted text, e.g. "hello", 'world'                              Yes
COMP        ===, !==, ==, !=, <=, >=, <>, <, >                                                Yes
LOGIC       &&, ||, and, or, not, !                                                           Yes
UNARY       ++, --                                                                            Yes
ASSIGN_OP   +=, -=, *=, /=, %=, **=, //=, &=, |=                                              Yes
SPECIAL     ??, ->, =>, <=>, ::                                                               Yes
ASSIGN      = (plain assignment, after compound forms)                                        Yes
ARITH       +, -, **, *, //, /, %, &, |, ^, <<, >>                                            Yes
BRACKET     [, ], {, }                                                                        Yes
SYMBOL      (, ), :, ,, ., ;, ?                                                               Yes
KEYWORD     Any reserved word (see list below)                                                Yes
IDENT       [A-Za-z_][A-Za-z0-9_]* not matched as a keyword                                   Yes

Full Keyword List

The KEYWORD pattern matches any of the following reserved words (word-boundary anchored so that e.g. letter is not matched):
none       if         elif       open       else       check
for        get        while      return     py         int
len        str        sqrt       float      let        rand_num
const      in         print      true       exec       false
break      input      continue   def        import     from
class      try        call       except     raise      set
pass       yield      with       as         del        assert
global     nonlocal   async      await      match      case
macro      inline     parallel   when       range      unless
loop       until      do         struct     enum       type
bool       interface  pub        priv
Any other word that matches [A-Za-z_][A-Za-z0-9_]* receives the IDENT type, with the exception of and, or, and not, which the earlier LOGIC pattern claims first.
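The word-boundary behavior can be verified with the three word-like patterns copied verbatim from TOKEN_REGEX. The helper classify_word is hypothetical and applies them in table order to a single word, as a rough stand-in for what the lexer does when the cursor sits at the start of that word.

```python
import re

# LOGIC, KEYWORD and IDENT patterns copied from TOKEN_REGEX, in table order.
WORD_PATTERNS = [(re.compile(p), t) for p, t in [
    (r"\&\&|\|\||\b(and|or|not)\b|!", "LOGIC"),
    (r"\b(none|if|elif|open|else|check|for|get|while|return|py|int|len|str|sqrt|float|let|rand_num|const|in|print|true|exec|false|break|input|continue|def|import|from|class|try|call|except|raise|set|pass|yield|with|as|del|assert|global|nonlocal|async|await|match|case|macro|inline|parallel|when|range|unless|loop|until|do|struct|enum|type|bool|interface|pub|priv)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]]


def classify_word(word):
    """Return the type of the first pattern that matches at the start of word."""
    for pattern, token_type in WORD_PATTERNS:
        if pattern.match(word):
            return token_type
    return None


for word in ["let", "letter", "not", "nothing", "input", "integer"]:
    print(word, classify_word(word))
```

The trailing \b is what keeps letter, nothing, and integer out of KEYWORD and LOGIC: the reserved prefix matches, but the next character is still a word character, so the alternative is rejected and IDENT takes the whole word.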
