Origin Lexer: lex(), Token Fields, and TOKEN_REGEX

The Origin lexer (lexer.py) is the first stage of the compilation pipeline. It takes raw .or source text — provided as an iterable of strings, one per line — and produces a flat, ordered list of Token objects that the parser can consume sequentially. All pattern matching is done through a single ordered table of regular expressions called TOKEN_REGEX. Patterns are compiled once at module load time so that repeated calls to lex() on different files incur no re-compilation overhead.

`lex(code_lines) -> list[Token]`

The main public entry point of the lexer module.

from lexer import lex

with open("main.or") as f:
    lines = [line.rstrip("\n") for line in f]

tokens = lex(lines)
# tokens[-1] is always the EOF token; its repr is Token(EOF, '', N:0)

lex iterates over code_lines one line at a time, advancing a column cursor across each line. At each position it tries every compiled pattern in TOKEN_REGEX_COMPILED in order; the first match wins. If a pattern’s token type is None (whitespace or comments), the matched text is consumed silently — no Token is appended. After exhausting each line, lex always appends a synthetic NEWLINE token whose col equals len(line) (the position immediately after the last character on that line). After exhausting all lines, it appends a final EOF token. The returned list therefore always ends with EOF, making it safe for the parser to read past the last real token without an index error.

A NEWLINE token is synthesized after every line, including empty lines. This means the token stream always has a NEWLINE between logical statements regardless of whether the source line contained any other tokens.
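One consequence of this rule is that a consumer has to tolerate consecutive NEWLINE tokens, since an empty source line contributes a NEWLINE and nothing else. The following is a minimal sketch of one way to group the stream into statements. Tok and split_statements are hypothetical names used only for illustration; they are not part of lexer.py, and the "\n" value given to NEWLINE here is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for the lexer's Token class (same four fields).
Tok = namedtuple("Tok", "type value line col")


def split_statements(tokens):
    """Group tokens into per-line lists, dropping NEWLINE/EOF and empty lines."""
    statements, current = [], []
    for tok in tokens:
        if tok.type in ("NEWLINE", "EOF"):
            if current:  # skip lines that produced no real tokens
                statements.append(current)
            current = []
        else:
            current.append(tok)
    return statements


# Hand-built stream for the two source lines "x = 1" and "" (an empty line).
stream = [
    Tok("IDENT", "x", 1, 0), Tok("ASSIGN", "=", 1, 2), Tok("INT", "1", 1, 4),
    Tok("NEWLINE", "\n", 1, 5),
    Tok("NEWLINE", "\n", 2, 0),  # the empty line still yields a NEWLINE
    Tok("EOF", "", 3, 0),
]
groups = split_statements(stream)
```

Here groups holds a single statement (x, =, 1); the bare NEWLINE from the empty line is discarded rather than producing an empty statement.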

Error behavior: If no pattern matches the current character, lex raises:

SyntaxError(f"Illegal Character {char!r} at {line_num}:{col}")
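The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual source of lex: it uses a shortened pattern table (DEMO_REGEX), returns plain tuples instead of Token objects, assumes the cursor is advanced with pattern.match(line, col), and assumes the synthetic NEWLINE carries "\n" as its value and that EOF is placed one line past the last source line.

```python
import re

# Abridged stand-in for TOKEN_REGEX: only the entries this demo needs.
DEMO_REGEX = [
    (r"[ \t]+", None),                      # whitespace, consumed silently
    (r"#.*", None),                         # comments, consumed silently
    (r"\d+", "INT"),
    (r"=", "ASSIGN"),
    (r"\b(let|if|else)\b", "KEYWORD"),      # shortened keyword list
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]
DEMO_COMPILED = [(re.compile(p), t) for p, t in DEMO_REGEX]


def lex_sketch(code_lines):
    """Return (type, value, line, col) tuples; a stand-in for Token objects."""
    tokens = []
    line_num = 0
    for line_num, line in enumerate(code_lines, start=1):
        col = 0
        while col < len(line):
            for pattern, type_ in DEMO_COMPILED:
                m = pattern.match(line, col)  # anchored at the cursor
                if m:
                    if type_ is not None:     # None types emit no token
                        tokens.append((type_, m.group(), line_num, col))
                    col = m.end()
                    break                     # first match wins
            else:
                # No pattern matched at this position.
                raise SyntaxError(
                    f"Illegal Character {line[col]!r} at {line_num}:{col}"
                )
        # Synthetic NEWLINE after every line, including empty ones.
        tokens.append(("NEWLINE", "\n", line_num, len(line)))
    tokens.append(("EOF", "", line_num + 1, 0))
    return tokens
```

With this sketch, lex_sketch(["let x = 42"]) yields KEYWORD, IDENT, ASSIGN, INT, NEWLINE, EOF at columns 0, 4, 6, 8, 10, 0, and lex_sketch(["x @ 1"]) raises SyntaxError for '@' at 1:2.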

`Token` Class

Every piece of recognized source text becomes a Token instance:

class Token:
    def __init__(self, type_, value, line, col):
        self.type  = type_   # str  — token type name, e.g. "INT", "IDENT", "KEYWORD"
        self.value = value   # str  — the original matched source text
        self.line  = line    # int  — 1-based source line number
        self.col   = col     # int  — 0-based column index within the line

    def __repr__(self):
        return f"Token({self.type}, {self.value!r}, {self.line}:{self.col})"

Example output for the source text let x = 42:

Token(KEYWORD, 'let', 1:0)
Token(IDENT,   'x',   1:4)
Token(ASSIGN,  '=',   1:6)
Token(INT,     '42',  1:8)
Token(NEWLINE, '\\n', 1:10)
Token(EOF,     '',    2:0)
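Because line is 1-based and col is 0-based, the two fields can be used directly to point at a token in the original source, for example in a diagnostic. The helper below is a hypothetical sketch (point_at is not part of lexer.py) and assumes the source contains no tab characters, which would otherwise misalign the marker.

```python
def point_at(source_lines, line, col, width=1):
    """Return the source line plus a caret marker under the given position.

    `line` is 1-based and `col` is 0-based, matching the Token fields.
    """
    text = source_lines[line - 1]          # convert 1-based line to list index
    return f"{text}\n{' ' * col}{'^' * max(width, 1)}"


source = ["let x = 42"]
# Position of Token(INT, '42', 1:8); width would be len(token.value).
marked = point_at(source, line=1, col=8, width=2)
```

Printing marked shows the source line with two carets under 42.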

`return_token_type(TOKEN) -> str | None`

A utility function for looking up what token type a given string would produce, without running a full lex pass:

from lexer import return_token_type

return_token_type("42")       # "INT"
return_token_type("let")      # "KEYWORD"
return_token_type("myVar")    # "IDENT"
return_token_type("???")      # None

It uses pattern.fullmatch(TOKEN) against each compiled pattern and returns the token type of the first full match, or None if nothing matches. This is useful for testing whether a string is a keyword before attempting to lex full source.
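A lookup of this kind can be sketched as below, using a shortened table that keeps the same relative order as TOKEN_REGEX. The name token_type_of and the abridged DEMO_COMPILED table are assumptions made for the example. The sketch also surfaces three behaviors that follow from the full-match approach: the words and, or, and not come back as LOGIC rather than KEYWORD, a whitespace-only string returns None because its pattern's type is None (indistinguishable from "no match"), and the lazy STRING pattern can fully match text that contains more than one string literal.

```python
import re

# Abridged stand-in for TOKEN_REGEX_COMPILED; order mirrors the real table.
DEMO_COMPILED = [(re.compile(p), t) for p, t in [
    (r"[ \t]+", None),
    (r"\d+", "INT"),
    (r"\".*?\"|'.*?'", "STRING"),
    (r"\&\&|\|\||\b(and|or|not)\b|!", "LOGIC"),
    (r"\b(let|def|class)\b", "KEYWORD"),   # shortened keyword list
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]]


def token_type_of(text):
    """Return the type of the first pattern that fully matches `text`."""
    for pattern, type_ in DEMO_COMPILED:
        if pattern.fullmatch(text):
            return type_
    return None


results = {
    s: token_type_of(s)
    for s in ["let", "and", "myVar", "???", " ", '"a" + "b"']
}
```

In results, "let" maps to "KEYWORD", "and" to "LOGIC", "myVar" to "IDENT", "???" and " " both to None, and '"a" + "b"' to "STRING".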

`TOKEN_REGEX` — The Pattern Table

TOKEN_REGEX is the authoritative source for every token the lexer can recognize. It is declared as a module-level list of (pattern_str, token_type) tuples and contains 18 entries in total (2 with None type for silently-consumed input, 16 with named types):

TOKEN_REGEX = [
    (r"[ \t]+",              None),       # Ignore whitespace
    (r"#.*",                 None),       # Ignore comments
    (r"\n",                  "NEWLINE"),
    (r"0x[0-9a-fA-F]+",      "HEX"),
    (r"\d+\.\d+",            "FLOAT"),
    (r"\d+",                 "INT"),
    (r"\".*?\"|'.*?'",       "STRING"),
    (r"===|!==|==|!=|<=|>=|<>|<|>", "COMP"),
    (r"\&\&|\|\||\b(and|or|not)\b|!", "LOGIC"),
    (r"\+\+|\-\-",           "UNARY"),
    (r"\+=|\-=|\*=|\/=|\%=|\*\*=|\/\/=|&=|\|=", "ASSIGN_OP"),
    (r"\?\?|->|=>|<=>|::",   "SPECIAL"),
    (r"=",                   "ASSIGN"),
    (r"\+|\-|\*\*|\*|\/\/|\/|\%|\&|\||\^|<<|>>", "ARITH"),
    (r"\[|\]|\{|\}",         "BRACKET"),
    (r"\(|\)|:|,|\.|;|\?",   "SYMBOL"),
    (r"\b(none|if|elif|open|else|check|for|get|while|return|py|int|len|str|sqrt|float|let|rand_num|const|in|print|true|exec|false|break|input|continue|def|import|from|class|try|call|except|raise|set|pass|yield|with|as|del|assert|global|nonlocal|async|await|match|case|macro|inline|parallel|when|range|unless|loop|until|do|struct|enum|type|bool|interface|pub|priv)\b", "KEYWORD"),
    (r"[A-Za-z_][A-Za-z0-9_]*", "IDENT"),
]

The list is immediately compiled into TOKEN_REGEX_COMPILED:

# Requires `import re` at the top of the module.
TOKEN_REGEX_COMPILED = [(re.compile(r), t) for r, t in TOKEN_REGEX]

Order is semantically significant. Multi-character operators like **, //, ==, and **= appear before their single-character prefixes (*, /, =) so the longer form is matched first. Similarly, KEYWORD is listed before IDENT so that reserved words such as let, def, and class are never classified as identifiers.
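The effect of ordering can be checked by replaying the operator entries in their documented order with a first-match-wins helper. The sketch below copies those entries from TOKEN_REGEX and omits the rest; first_match is a hypothetical helper, not part of lexer.py. Under the assumption that the lexer applies the table exactly as described, the longer forms **, **=, //, and == resolve as expected, but <=>, <<, and >> are claimed by the earlier COMP entry before SPECIAL or ARITH is tried. This is what the table implies on paper and has not been verified against the real lexer.

```python
import re

# Operator entries copied from TOKEN_REGEX in their original relative order;
# all other entries are omitted for brevity.
OPERATOR_REGEX = [
    (r"===|!==|==|!=|<=|>=|<>|<|>", "COMP"),
    (r"\&\&|\|\||\b(and|or|not)\b|!", "LOGIC"),
    (r"\+\+|\-\-", "UNARY"),
    (r"\+=|\-=|\*=|\/=|\%=|\*\*=|\/\/=|&=|\|=", "ASSIGN_OP"),
    (r"\?\?|->|=>|<=>|::", "SPECIAL"),
    (r"=", "ASSIGN"),
    (r"\+|\-|\*\*|\*|\/\/|\/|\%|\&|\||\^|<<|>>", "ARITH"),
]
OPERATOR_COMPILED = [(re.compile(p), t) for p, t in OPERATOR_REGEX]


def first_match(text, pos=0):
    """Return (type, matched_text) for the first pattern matching at `pos`."""
    for pattern, type_ in OPERATOR_COMPILED:
        m = pattern.match(text, pos)
        if m:
            return type_, m.group()
    return None


outcomes = {op: first_match(op) for op in ["**", "**=", "//", "==", "<=>", "<<", ">>"]}
```

In outcomes, "<=>" maps to ("COMP", "<=") and "<<" to ("COMP", "<"), so a parser built on this table would see those operators as sequences of COMP tokens unless the entries are reordered.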

Token Type Reference

Token Type	What it matches	Emitted?
`None`	Whitespace (`[ \t]+`) and comments (`#...`)	No — silently consumed
`NEWLINE`	`\n` within a line; also a synthetic token added after every line at `col=len(line)`	Yes
`HEX`	`0x` followed by hex digits, e.g. `0xFF`	Yes
`FLOAT`	Digits, dot, digits — e.g. `3.14`	Yes
`INT`	One or more decimal digits — e.g. `42`	Yes
`STRING`	Double- or single-quoted text — e.g. `"hello"`, `'world'`	Yes
`COMP`	`===`, `!==`, `==`, `!=`, `<=`, `>=`, `<>`, `<`, `>`	Yes
`LOGIC`	`&&`, `||`, `and`, `or`, `not`, `!`	Yes
`UNARY`	`++`, `--`	Yes
`ASSIGN_OP`	`+=`, `-=`, `*=`, `/=`, `%=`, `**=`, `//=`, `&=`, `|=`	Yes
`SPECIAL`	`??`, `->`, `=>`, `<=>`, `::`	Yes
`ASSIGN`	`=` (plain assignment, after compound forms)	Yes
`ARITH`	`+`, `-`, `**`, `*`, `//`, `/`, `%`, `&`, `|`, `^`, `<<`, `>>`	Yes
`BRACKET`	`[`, `]`, `{`, `}`	Yes
`SYMBOL`	`(`, `)`, `:`, `,`, `.`, `;`, `?`	Yes
`KEYWORD`	Any reserved word (see list below)	Yes
`IDENT`	`[A-Za-z_][A-Za-z0-9_]*` not matched as a keyword	Yes

Full Keyword List

The KEYWORD pattern matches any of the following reserved words. The alternation is wrapped in \b word boundaries, so an identifier that merely starts with a keyword, such as letter, is not matched as let:

none       if         elif       open       else       check
for        get        while      return     py         int
len        str        sqrt       float      let        rand_num
const      in         print      true       exec       false
break      input      continue   def        import     from
class      try        call       except     raise      set
pass       yield      with       as         del        assert
global     nonlocal   async      await      match      case
macro      inline     parallel   when       range      unless
loop       until      do         struct     enum       type
bool       interface  pub        priv

Any word that matches [A-Za-z_][A-Za-z0-9_]* and is not in this list receives the IDENT type, with the exception of and, or, and not, which are claimed earlier by the LOGIC pattern.
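The word-boundary behavior can be illustrated with the sketch below, which rebuilds the KEYWORD pattern from the list above (same words, same order) and tries it before IDENT, anchored at a position. classify_at is a hypothetical helper introduced for this example only.

```python
import re

# Keyword list copied from the reference above, in the same order.
KEYWORDS = """
none if elif open else check for get while return py int
len str sqrt float let rand_num const in print true exec false
break input continue def import from class try call except raise set
pass yield with as del assert global nonlocal async await match case
macro inline parallel when range unless loop until do struct enum type
bool interface pub priv
""".split()

KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify_at(text, pos=0):
    """Try KEYWORD first, then IDENT, anchored at `pos` (first match wins)."""
    for type_, pattern in (("KEYWORD", KEYWORD_RE), ("IDENT", IDENT_RE)):
        m = pattern.match(text, pos)
        if m:
            return type_, m.group()
    return None


samples = {w: classify_at(w) for w in ["let", "letter", "interface", "int_value", "done"]}
```

In samples, let and interface are classified as KEYWORD, while letter, int_value, and done fall through to IDENT because the trailing \b fails inside a longer word. Note that the list also reserves names such as len, str, set, type, range, open, and input, so these cannot be used as identifiers in Origin source.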
