Origin Internals: Lexer, Parser, and Interpreter Stages

Origin is a language implemented entirely in Python: it compiles a program to Python source and then executes it. A .or source file passes through four sequential stages: the lexer tokenizes raw text into a flat Token list, the parser consumes that list and builds an Abstract Syntax Tree (AST), the interpreter walks the AST and emits a Python source string, and finally Python’s built-in exec() runs that string inside a controlled globals dictionary. Each stage is a self-contained Python module with a clear public interface, which makes it straightforward to extend or embed any individual part of the pipeline.

The Four Stages

Stage 1: Lexical analysis (lexer.py). The lex() function reads an iterable of source lines and returns a flat list[Token]. Each Token carries its type, original text value, 1-based line number, and 0-based column. Patterns are declared in TOKEN_REGEX as an ordered list of (regex, token_type) pairs and pre-compiled once at module load time. Whitespace and comments are consumed silently (they produce no tokens). The sequence always ends with an EOF token.

Stage 2: Recursive-descent parsing (parser.py). Parser(tokens).program() iterates over the token list, dispatches on token type and keyword value, and returns a ProgramNode whose .statements list contains every top-level AST node. Each syntactic form, from let assignments to parallel {} blocks, is handled by a dedicated parsing method. Expression precedence is enforced through a hand-written call chain, listed here from lowest to highest precedence: special_expr → logic → comparison → expr → term → unary → factor.

Stage 3: Code generation (interpreter.py). Interpreter().generate(ast) walks the AST and returns a single Python source string. Every node type maps to a specific Python code pattern: AssignNode becomes a simple assignment, FuncNode becomes a def, ClassNode becomes a class with an __init__, and so on. The interpreter also injects globals()['_origin_runtime_line'] = N markers into the emitted code so that runtime errors can be mapped back to their Origin source line.

Stage 4: Execution (runner.py). run_origin(file_path) orchestrates the full pipeline. After code generation, it builds a runtime_globals dictionary containing random, math, the hardware helpers (_execute_set_pin, _execute_i2c_read, _execute_i2c_write), and the _origin_runtime_line tracker, then calls exec(generated_python, runtime_globals). Any exception is caught, translated into a friendly message by errors.py, and displayed with the file path and source line.
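The table-driven approach described for Stage 1 can be illustrated with a minimal sketch. This is not the code in lexer.py: the five patterns, the token type names, the `#` comment syntax, and the line number given to the EOF token are all assumptions made for illustration (the real TOKEN_REGEX is not reproduced here). Only the overall shape (ordered (regex, token_type) pairs, compiled once, silent skipping, trailing EOF) follows the description above.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    type: str
    value: str
    line: int  # 1-based line number
    col: int   # 0-based column


# Ordered (regex, token_type) pairs: the first pattern that matches wins.
# A token_type of None means "consume silently". Hypothetical subset only.
TOKEN_REGEX = [
    (r"[ \t]+", None),                      # whitespace
    (r"#.*", None),                         # comment (syntax assumed)
    (r"\d+(?:\.\d+)?", "NUMBER"),
    (r"[A-Za-z_]\w*", "IDENT"),
    (r"==|!=|<=|>=|[-+*/=<>(){}]", "OP"),
]

# Compile every pattern once, at module load time.
_COMPILED = [(re.compile(pattern), tok_type) for pattern, tok_type in TOKEN_REGEX]


def lex(lines):
    """Turn an iterable of source lines into a flat list of Tokens."""
    tokens = []
    line_no = 0
    for line_no, text in enumerate(lines, start=1):
        col = 0
        while col < len(text):
            for regex, tok_type in _COMPILED:
                match = regex.match(text, col)
                if match:
                    if tok_type is not None:
                        tokens.append(Token(tok_type, match.group(), line_no, col))
                    col = match.end()
                    break
            else:
                # No pattern matched at this column.
                raise SyntaxError(
                    f"unexpected character {text[col]!r} at line {line_no}, column {col}"
                )
    # The stream always ends with EOF (its position here is an assumption).
    tokens.append(Token("EOF", "", line_no, 0))
    return tokens
```

Because the list is ordered, multi-character operators such as `==` have to appear before the single-character class that would otherwise match `=` first.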

Pipeline Architecture

  ┌─────────────────────────────────────────────────────────────┐
  │                    .or source file                          │
  └────────────────────────────┬────────────────────────────────┘
                               │  list[str]  (lines)
                               ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  lexer.lex()          TOKEN_REGEX (18 patterns, compiled)   │
  │                       → list[Token]  …  EOF                 │
  └────────────────────────────┬────────────────────────────────┘
                               │  list[Token]
                               ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  parser.Parser(tokens).program()                            │
  │  Recursive descent: factor → unary → term → expr →          │
  │  comparison → logic → special_expr → statement → block      │
  │                       → ProgramNode (AST)                   │
  └────────────────────────────┬────────────────────────────────┘
                               │  ProgramNode
                               ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  interpreter.Interpreter().generate(ast)                    │
  │  AST-node dispatch → Python source string                   │
  │  Injects _origin_runtime_line markers                       │
  └────────────────────────────┬────────────────────────────────┘
                               │  str  (Python source)
                               ▼
  ┌─────────────────────────────────────────────────────────────┐
  │  exec(generated_python, runtime_globals)                    │
  │  globals: random, math, _execute_set_pin,                   │
  │           _execute_i2c_read, _execute_i2c_write             │
  └─────────────────────────────────────────────────────────────┘
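The precedence chain named in the parser box can be sketched with a toy recursive-descent parser. The class below is hypothetical and much smaller than parser.py: it covers only the comparison, expr, term, unary, and factor levels, tokenizes with a single regular expression, and returns nested tuples rather than the node classes from classes.py.

```python
import re


class MiniParser:
    """Toy recursive-descent parser; each method handles one precedence level."""

    def __init__(self, text):
        # Tokenizing with one regex is an assumption made to keep this short.
        self.toks = re.findall(r"\d+|==|!=|<=|>=|[-+*/<>()]", text) + ["EOF"]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def advance(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def comparison(self):  # lowest precedence handled in this sketch
        node = self.expr()
        while self.peek() in ("==", "!=", "<", "<=", ">", ">="):
            op = self.advance()
            node = (op, node, self.expr())
        return node

    def expr(self):  # + and -
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):  # * and /
        node = self.unary()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.unary())
        return node

    def unary(self):  # prefix minus
        if self.peek() == "-":
            self.advance()
            return ("neg", self.unary())
        return self.factor()

    def factor(self):  # literals and parenthesized expressions
        tok = self.advance()
        if tok == "(":
            node = self.comparison()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


tree = MiniParser("1 + 2 * 3 < 10").comparison()
# tree == ("<", ("+", 1, ("*", 2, 3)), 10)
```

Each level calls the next-tighter level for its operands, so `*` binds tighter than `+` without any explicit precedence table.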

Optional Fifth Stage: Parallel Scheduler (`parallelInt.py`)

When a program contains parallel {} blocks, or when the alternative runner invokes parallelInt directly, a fifth stage sits between parsing and code generation. parallelInt.gen(ast) performs dependency analysis on the top-level ProgramNode.statements, classifying each statement’s read-set and write-set. It then groups statements into wavefronts, where no statement in a wavefront has a RAW (read-after-write), WAW (write-after-write), or WAR (write-after-read) dependency on any other statement in the same wavefront. Each wavefront is executed with one threading.Thread per independent statement; all threads in a wavefront must complete (.join()) before the next wavefront begins.
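The grouping and execution steps can be sketched as follows. This is one possible way to build such groups, not the code in parallelInt.py: the `Stmt` record, the level-assignment rule, and the function names are assumptions, and the read and write sets are supplied by hand rather than derived from an AST.

```python
import threading
from dataclasses import dataclass


@dataclass
class Stmt:
    name: str
    reads: frozenset
    writes: frozenset


def conflicts(earlier, later):
    """True if `later` must wait for `earlier`."""
    return bool(
        earlier.writes & later.reads      # RAW: later reads what earlier wrote
        or earlier.writes & later.writes  # WAW: both write the same name
        or earlier.reads & later.writes   # WAR: later overwrites what earlier read
    )


def wavefronts(stmts):
    """Place each statement one level after the latest statement it conflicts with."""
    level = []
    for i, stmt in enumerate(stmts):
        deps = [level[j] for j in range(i) if conflicts(stmts[j], stmt)]
        level.append(max(deps, default=-1) + 1)
    groups = [[] for _ in range(max(level, default=-1) + 1)]
    for stmt, lv in zip(stmts, level):
        groups[lv].append(stmt)
    return groups


def run_wavefronts(groups, action):
    """Run each group with one thread per statement; join before the next group."""
    for group in groups:
        threads = [threading.Thread(target=action, args=(stmt,)) for stmt in group]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


program = [
    Stmt("a = 1", frozenset(), frozenset({"a"})),
    Stmt("b = 2", frozenset(), frozenset({"b"})),
    Stmt("c = a + b", frozenset({"a", "b"}), frozenset({"c"})),
    Stmt("a = 5", frozenset(), frozenset({"a"})),
]
groups = wavefronts(program)
# groups: ["a = 1", "b = 2"], then ["c = a + b"], then ["a = 5"]
```

The last statement lands in its own group because it overwrites `a`, which the third statement reads (WAR) and the first statement writes (WAW).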

Experimental Bytecode Path (`bComp.py`)

An alternative compilation target exists in bComp.py but is not used by the default runner.py. The Compiler class walks the same AST produced by the parser and emits a flat list of integer opcodes into self.bytecode, with literal values stored separately in self.constants. The OpCode class enumerates 44 numeric opcodes (PUSH_CONST = 0x01 through FOR_ITER = 0x2C). A companion VM class executes the bytecode using a value stack and a call stack, implementing jump patching for if/while/for control flow and supporting break/continue via LOOP_START/LOOP_END markers.
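To make the jump-patching idea concrete, here is a toy compiler and stack VM. It is a sketch under stated assumptions and does not mirror bComp.py: apart from PUSH_CONST = 0x01, which is taken from the text above, every opcode name and number is invented, operands are stored inline after their opcode, and the call stack and loop markers are omitted.

```python
class ToyOp:
    PUSH_CONST = 0x01     # value taken from the text; all others are made up
    ADD = 0x02
    LESS = 0x03
    JUMP_IF_FALSE = 0x04
    JUMP = 0x05
    HALT = 0x06


class ToyCompiler:
    def __init__(self):
        self.bytecode = []   # flat list of ints
        self.constants = []  # literal values live outside the bytecode

    def emit(self, *ints):
        self.bytecode.extend(ints)
        return len(self.bytecode) - 1  # index of the last int emitted

    def push_const(self, value):
        self.constants.append(value)
        self.emit(ToyOp.PUSH_CONST, len(self.constants) - 1)

    def emit_jump(self, op):
        # Emit a placeholder target and return its index for later patching.
        return self.emit(op, -1)

    def patch(self, operand_index):
        # Point the earlier jump at the next instruction to be emitted.
        self.bytecode[operand_index] = len(self.bytecode)


def compile_if_demo():
    """Compile the equivalent of: if 1 < 2 then 10 + 5 else 0."""
    c = ToyCompiler()
    c.push_const(1)
    c.push_const(2)
    c.emit(ToyOp.LESS)
    else_jump = c.emit_jump(ToyOp.JUMP_IF_FALSE)
    c.push_const(10)
    c.push_const(5)
    c.emit(ToyOp.ADD)
    end_jump = c.emit_jump(ToyOp.JUMP)
    c.patch(else_jump)   # the else branch starts here
    c.push_const(0)
    c.patch(end_jump)    # both branches rejoin here
    c.emit(ToyOp.HALT)
    return c


def run(bytecode, constants):
    stack, pc = [], 0
    while True:
        op = bytecode[pc]
        pc += 1
        if op == ToyOp.PUSH_CONST:
            stack.append(constants[bytecode[pc]])
            pc += 1
        elif op == ToyOp.ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == ToyOp.LESS:
            b, a = stack.pop(), stack.pop()
            stack.append(a < b)
        elif op == ToyOp.JUMP_IF_FALSE:
            target = bytecode[pc]
            pc += 1
            if not stack.pop():
                pc = target
        elif op == ToyOp.JUMP:
            pc = bytecode[pc]
        elif op == ToyOp.HALT:
            return stack
        else:
            raise ValueError(f"unknown opcode {op:#x}")


demo = compile_if_demo()
result = run(demo.bytecode, demo.constants)  # result == [15]
```

The compiler cannot know a forward jump target when it emits the jump, so it writes a placeholder and fills it in once the target position is known.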

Programmatic API

You can drive the full pipeline from Python without going through runner.py:

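import math
import random

from lexer import lex
from parser import Parser
from interpreter import Interpreter

# Read the source as a list of lines without trailing newlines.
with open("main.or") as f:
    lines = [line.rstrip("\n") for line in f]

tokens = lex(lines)                      # list[Token], ending with EOF
ast = Parser(tokens).program()           # ProgramNode
py_source = Interpreter().generate(ast)  # Python source string

# Run in an explicit globals dictionary, as runner.py does. Programs that use
# the hardware helpers need those entries as well; custom built-ins go here too.
runtime_globals = {"random": random, "math": math, "_origin_runtime_line": 0}
exec(py_source, runtime_globals)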

To inspect intermediate representations, examine tokens (a list[Token]) or ast (a ProgramNode) before passing them to the next stage. To add custom built-ins, populate the globals dictionary passed to exec().
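As a sketch of the custom built-ins mechanism, the example below runs a hand-written string that stands in for the output of Interpreter().generate(ast), so it works without the Origin modules. The `clamp` helper, the shape of the generated code, the file name, and the error message format are assumptions for illustration; only the globals()['_origin_runtime_line'] marker and the random and math entries come from the description above.

```python
import math
import random

# Stand-in for generated code: one line marker before each statement.
generated_python = "\n".join([
    "globals()['_origin_runtime_line'] = 1",
    "radius = clamp(12, 0, 10)",
    "globals()['_origin_runtime_line'] = 2",
    "area = math.pi * radius ** 2",
    "globals()['_origin_runtime_line'] = 3",
    "boom = 1 / 0",
])


def clamp(value, low, high):
    """Hypothetical custom built-in exposed to the executed program."""
    return max(low, min(high, value))


runtime_globals = {
    "random": random,
    "math": math,
    "_origin_runtime_line": 0,
    "clamp": clamp,  # custom built-in, visible by this name in the program
}

message = None
try:
    exec(generated_python, runtime_globals)
except Exception as exc:
    # The last marker executed identifies the failing source line.
    failed_line = runtime_globals["_origin_runtime_line"]
    message = f"main.or:{failed_line}: {type(exc).__name__}: {exc}"

print(message)  # main.or:3: ZeroDivisionError: division by zero
```

Variables assigned by the program (here `radius` and `area`) remain in runtime_globals after exec() returns, which is also a convenient way to inspect results from the host side.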

Lexer

Token types, TOKEN_REGEX, lex(), and Token fields

Parser

Recursive-descent grammar, expression precedence, and all statement forms

Interpreter

AST-to-Python code generation, generate(), and runtime helpers

AST Nodes

Complete reference for every node class in classes.py
