
Overview

rfxJIT is a lightweight JIT compiler for robotics computation kernels. It provides:
  • Typed kernel IR for elementwise operations
  • Optimization passes (constant folding, dead-code elimination, fusion)
  • Multi-backend codegen for CPU, CUDA, and Metal
  • Functional transforms (autodiff, value_and_grad)
  • TinyJIT-style caching for Python expression capture
Inspired by PyTorch (ergonomics), JAX (functional transforms), and TVM (scheduling), rfxJIT stays intentionally tiny and hackable.
rfxJIT is experimental. It focuses on control-related kernels (PID, filters, kinematics) rather than large neural networks.

Architecture

┌─────────────────────────────────────────────┐
│         Python Front-End (rfx)              │
│   (tinygrad for NN, rfxJIT for control)     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│            Kernel IR Layer                  │
│  - Typed operations (Add, Mul, Sin, etc.)   │
│  - Function-level capture via trace API     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│         Optimization Passes                 │
│  - Constant folding                         │
│  - Dead-op elimination                      │
│  - Chain fusion                             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│            Lowering Layer                   │
│  - Slot-based executable form               │
│  - Backend compiler (cpu/cuda/metal)        │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│         Runtime Execution                   │
│  - Interpreter (reference)                  │
│  - Executor (lowered kernels)               │
│  - Dispatch queue (single-worker)           │
└─────────────────────────────────────────────┘

Kernel IR

The kernel IR is defined in rfxJIT/kernels/ir.py. Each operation is a typed node:
from rfxJIT.kernels.ir import Kernel, Input, Add, Mul, Sin, Const

# Define a kernel: f(x, y) = sin(x * 2) + y
x = Input("x", dtype="float32")
y = Input("y", dtype="float32")
two = Const(2.0, dtype="float32")
scaled = Mul(x, two)
wave = Sin(scaled)
out = Add(wave, y)

kernel = Kernel(
    inputs=[x, y],
    ops=[two, scaled, wave, out],
    output=out,
)
The IR is explicit and immutable. Each operation produces a new node rather than mutating state.
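For intuition, immutable typed nodes can be modeled as frozen dataclasses. This is a minimal sketch of the idea only; the actual definitions live in rfxJIT/kernels/ir.py and may differ:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for typed IR nodes (not the real ir.py classes).
@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

# Building an expression never mutates existing nodes: `a` is shared by
# reference, and each combinator returns a fresh node.
a = Const(2.0)
b = Const(3.0)
expr = Add(Mul(a, b), a)
print(expr)
# Add(lhs=Mul(lhs=Const(value=2.0), rhs=Const(value=3.0)), rhs=Const(value=2.0))
```

Because the nodes are frozen and value-equal, structurally identical subtrees compare equal, which is what makes passes like constant folding and dead-op elimination straightforward to write.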

Optimization Passes

rfxJIT applies optimization passes before lowering:

Constant Folding

from rfxJIT.kernels.optimize import constant_fold

# Before: Add(Const(1.0), Const(2.0))
# After:  Const(3.0)
optimized = constant_fold(kernel)

Dead-Code Elimination

from rfxJIT.kernels.optimize import eliminate_dead_ops

# Removes operations not reachable from output
cleaned = eliminate_dead_ops(kernel)

Chain Fusion

from rfxJIT.kernels.optimize import fuse_chains

# Fuses sequential elementwise operations
fused = fuse_chains(kernel)

Full Optimization Pipeline

from rfxJIT.kernels.optimize import optimize

# Apply all optimization passes
optimized_kernel = optimize(kernel)
print(f"Reduced from {len(kernel.ops)} to {len(optimized_kernel.ops)} ops")
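To make the passes concrete, here is a hedged sketch of how constant folding over a small expression tree can work. It is a generic illustration of the technique, not the optimize.py implementation:

```python
from dataclasses import dataclass

# Minimal IR stand-ins for the sketch (hypothetical, not rfxJIT's classes).
@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def constant_fold(node):
    """Bottom-up: fold any Add whose operands are both constants."""
    if isinstance(node, Add):
        lhs = constant_fold(node.lhs)
        rhs = constant_fold(node.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value + rhs.value)
        return Add(lhs, rhs)  # rebuild with folded children
    return node

folded = constant_fold(Add(Const(1.0), Add(Const(2.0), Const(3.0))))
print(folded)  # Const(value=6.0)
```

Dead-op elimination is the complementary walk: starting from the output node, mark everything reachable and drop the rest.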

Lowering and Execution

Lower kernels to slot-based executable form for fast execution:
from rfxJIT.kernels.lowering import lower_kernel
from rfxJIT.runtime.executor import execute_lowered

# Lower to executable form
lowered = lower_kernel(kernel)

# Execute with inputs
import numpy as np
x = np.array([0.0, 1.0, 2.0], dtype=np.float32)
y = np.array([1.0, 2.0, 3.0], dtype=np.float32)

result = execute_lowered(lowered, inputs={"x": x, "y": y}, backend="cpu")
print(result)
# Output: array([1.0, 2.9092975, 2.2431974], dtype=float32)
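"Slot-based" means the expression tree is flattened into a linear instruction list whose operands refer to numbered slots, so execution is a single loop with no tree traversal. A rough illustration of the idea, using a hypothetical instruction format rather than lowering.py's actual one:

```python
import numpy as np

# Slots 0 and 1 hold the inputs x and y; each instruction appends its
# result as the next slot. (Illustrative encoding, not rfxJIT's.)
program = [
    ("mul_const", 0, 2.0),  # slot 2 = x * 2
    ("sin", 2),             # slot 3 = sin(slot 2)
    ("add", 3, 1),          # slot 4 = slot 3 + y
]

def run(program, x, y):
    slots = [x, y]
    for instr in program:
        if instr[0] == "mul_const":
            slots.append(slots[instr[1]] * instr[2])
        elif instr[0] == "sin":
            slots.append(np.sin(slots[instr[1]]))
        elif instr[0] == "add":
            slots.append(slots[instr[1]] + slots[instr[2]])
    return slots[-1]  # last slot is the kernel output

x = np.array([0.0, 1.0], dtype=np.float32)
y = np.array([1.0, 2.0], dtype=np.float32)
print(run(program, x, y))  # same values as sin(x * 2) + y
```

The flat form is what a backend compiler can translate instruction-by-instruction into CPU, CUDA, or Metal code.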

Backend Selection

Supported backends:
  • cpu: NumPy-based reference implementation
  • cuda: CUDA kernels (requires GPU)
  • metal: Metal Performance Shaders (macOS/iOS)
# CPU backend (default)
result_cpu = execute_lowered(lowered, inputs, backend="cpu")

# CUDA backend (if available)
result_cuda = execute_lowered(lowered, inputs, backend="cuda")

# Metal backend (macOS)
result_metal = execute_lowered(lowered, inputs, backend="metal")
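When GPU availability is uncertain at runtime, one option is a small fallback helper. This sketch assumes that executing on an unavailable backend raises an exception; the helper itself is hypothetical and not part of rfxJIT:

```python
def execute_with_fallback(run, backends=("cuda", "metal", "cpu")):
    """Try each backend in order and return (backend, result) for the
    first that succeeds.

    `run` is a callable taking a backend name, e.g.
    lambda b: execute_lowered(lowered, inputs, backend=b).
    """
    last_err = None
    for backend in backends:
        try:
            return backend, run(backend)
        except Exception as err:  # assumes unavailable backends raise
            last_err = err
    raise RuntimeError("no backend succeeded") from last_err
```

For a hard-real-time control loop, prefer pinning the backend explicitly instead, so the first iteration does not pay for failed probes.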

TinyJIT Transform

Capture Python expressions and cache compiled kernels:
from rfxJIT.runtime.tinyjit import jit
import numpy as np

@jit
def compute(x, y):
    return np.sin(x * 2.0) + y

# First call: traces and compiles
result1 = compute(np.array([1.0, 2.0]), np.array([0.5, 1.0]))

# Subsequent calls: uses cached kernel
result2 = compute(np.array([3.0, 4.0]), np.array([1.5, 2.0]))
The @jit decorator caches kernels based on input shapes and dtypes. Recompilation only occurs when the trace changes.
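The caching scheme can be sketched generically: key compiled artifacts on the shapes and dtypes of the array arguments. This is an illustrative decorator only; the real tracing and compilation happen in rfxJIT/runtime/tinyjit.py:

```python
import numpy as np

def jit_sketch(fn):
    """Illustrative shape/dtype-keyed cache (not the real @jit)."""
    cache = {}

    def wrapper(*args):
        key = tuple((a.shape, str(a.dtype)) for a in args)
        if key not in cache:
            # In rfxJIT this is where tracing and compilation would run;
            # here we simply record the signature and reuse `fn` directly.
            cache[key] = fn
        return cache[key](*args)

    wrapper.cache = cache  # exposed so callers can inspect cached signatures
    return wrapper

@jit_sketch
def compute(x, y):
    return np.sin(x * 2.0) + y

compute(np.array([1.0, 2.0]), np.array([0.5, 1.0]))
compute(np.array([3.0, 4.0]), np.array([1.5, 2.0]))  # same signature: cache hit
print(len(compute.cache))  # 1
```

The practical consequence: calling a jitted function with a new shape or dtype triggers a fresh trace, so keep input signatures stable inside hot loops.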

Autodiff Transforms

Automatic differentiation over kernel IR:
from rfxJIT.transforms import grad, value_and_grad
import numpy as np

# Define a loss function
def loss_fn(x):
    return np.sum(x ** 2)

# Compute gradient
grad_fn = grad(loss_fn)
x = np.array([1.0, 2.0, 3.0])
dx = grad_fn(x)
print(dx)  # [2.0, 4.0, 6.0]

# Compute both value and gradient
value_and_grad_fn = value_and_grad(loss_fn)
value, dx = value_and_grad_fn(x)
print(value, dx)  # 14.0, [2.0, 4.0, 6.0]
Autodiff is implemented via IR-level reverse-mode differentiation. It supports elementwise operations and reductions; differentiation through control flow is planned alongside the Phase 4 control-flow primitives.
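Reverse mode can be illustrated with a tiny hand-written tape for the loss above. This is a self-contained sketch of the technique, not the transforms module:

```python
import numpy as np

def value_and_grad_sketch(x):
    """Reverse-mode sketch for f(x) = sum(x ** 2)."""
    # Forward pass: record intermediates (the 'tape').
    sq = x ** 2          # elementwise square
    value = np.sum(sq)   # reduction to a scalar

    # Backward pass: seed d(value)/d(value) = 1, walk the tape in reverse.
    d_value = 1.0
    d_sq = d_value * np.ones_like(sq)  # sum's adjoint broadcasts the seed
    d_x = d_sq * 2.0 * x               # d(x**2)/dx = 2x
    return value, d_x

value, dx = value_and_grad_sketch(np.array([1.0, 2.0, 3.0]))
print(value, dx)  # 14.0 [2. 4. 6.]
```

The IR version does the same thing mechanically: each primitive contributes an adjoint rule, and the backward kernel is itself IR that can be optimized and lowered.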

Benchmarking

Run performance benchmarks:
python -m rfxJIT.runtime.benchmark \
  --size 65536 \
  --iterations 200 \
  --backend cpu \
  --json-out /tmp/rfxjit-current.json
Compare against tracked baselines:
bash scripts/perf-check.sh \
  --baseline docs/perf/baselines/rfxjit_microkernels_cpu.json \
  --output /tmp/rfxjit-current.json \
  --backend cpu \
  --threshold-pct 10
Example output:
Benchmarking elementwise kernels (size=65536, iterations=200)
[cpu] add: 0.032ms avg, 6250 ops/s
[cpu] mul: 0.031ms avg, 6452 ops/s
[cpu] sin: 0.089ms avg, 2247 ops/s
[cpu] fused_chain: 0.045ms avg, 4444 ops/s

Optimization impact:
  - Reduced ops: 12 -> 8 (33% reduction)
  - Fused chains: 3
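For quick ad-hoc measurements outside the harness, a minimal timing loop along these lines is usually enough (a sketch; the harness's own methodology may differ, e.g. in warmup policy):

```python
import time
import numpy as np

def bench(fn, *args, iterations=200, warmup=10):
    """Average wall-clock time per call, in milliseconds."""
    for _ in range(warmup):            # warm caches / trigger JIT compilation
        fn(*args)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e3

x = np.zeros(65536, dtype=np.float32)
print(f"sin: {bench(np.sin, x):.3f}ms avg")
```

Warmup matters especially for jitted functions, since the first call includes tracing and compilation.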

Development Status

rfxJIT is under active development. Current phase status:

Phase 0: IR Foundation ✅

  • Typed elementwise kernel IR (rfxJIT/kernels/ir.py)
  • Reference interpreter (rfxJIT/runtime/interpreter.py)
  • Benchmark harness (rfxJIT/runtime/benchmark.py)
  • Tests (rfxJIT/tests/test_ir.py)

Phase 1: Lowering and Execution ✅

  • IR lowering to slot-based form (rfxJIT/kernels/lowering.py)
  • Lowered-kernel executor with backend compiler (rfxJIT/runtime/executor.py)
  • Single-worker dispatch queue (rfxJIT/runtime/queue.py)
  • Tests (rfxJIT/tests/test_lowering_queue.py)

Phase 2: Optimization ✅

  • Optimization passes (rfxJIT/kernels/optimize.py)
  • Constant folding, dead-op elimination, chain fusion
  • Tests (rfxJIT/tests/test_optimize.py)
  • Benchmark reports op count before/after optimization

Phase 3: Transforms ✅

  • Tracer API for Python expression capture (rfxJIT/kernels/trace.py)
  • TinyJIT-style cache+replay runtime (rfxJIT/runtime/tinyjit.py)
  • IR autodiff + functional transforms (grad, value_and_grad)
  • Tests (rfxJIT/tests/test_tinyjit.py, rfxJIT/tests/test_grad_transforms.py)

Phase 4: Advanced Features 🚧

  • Multi-dimensional array support (broadcasting, indexing)
  • Control flow primitives (if/while)
  • GPU kernel fusion and scheduling
  • Beam-search-style kernel exploration

Repository Layout

rfxJIT lives in the rfxJIT/ directory:
rfxJIT/
├── notes/              # Architecture notes and design records
├── runtime/
│   ├── interpreter.py  # Reference interpreter
│   ├── executor.py     # Lowered-kernel executor
│   ├── queue.py        # Dispatch queue
│   ├── tinyjit.py      # JIT cache+replay runtime
│   └── benchmark.py    # Benchmark harness
├── kernels/
│   ├── ir.py           # Kernel IR definitions
│   ├── lowering.py     # IR lowering to executable form
│   ├── optimize.py     # Optimization passes
│   └── trace.py        # Tracer API for Python capture
├── tests/
│   ├── test_ir.py
│   ├── test_lowering_queue.py
│   ├── test_optimize.py
│   ├── test_tinyjit.py
│   └── test_grad_transforms.py
├── ROADMAP.md          # Milestone plan
└── README.md           # This file

Comparison to Other Frameworks

PyTorch
  • Similar: eager tensor API, autograd, optim, basic datasets/layers
  • Different: rfxJIT’s full compiler and IR are visible and easy to modify; PyTorch’s backend is complex and opaque

JAX
  • Similar: IR-based autodiff over primitives, function-level JIT via TinyJIT-style capture
  • Different: fewer transforms today (no full vmap/pmap yet), but much smaller and easier to read

TVM
  • Similar: multiple lowering passes, scheduling, beam-search-style kernel exploration, device-graph batched execution
  • Different: rfxJIT is coupled with a front-end framework (rfx) rather than being only a compiler stack

Integration with rfx

Use rfxJIT for performance-critical control kernels:
import rfx
import numpy as np
from rfxJIT.runtime.tinyjit import jit

# Define PID control kernel
@jit
def pid_update(error, integral, derivative, kp, ki, kd):
    output = kp * error + ki * integral + kd * derivative
    return np.clip(output, -1.0, 1.0)

# Use in control loop
go2 = rfx.Go2.connect()
integral = 0.0
prev_error = 0.0

def control_callback(step, dt):
    global integral, prev_error
    
    state = go2.state()
    setpoint = 0.0  # Desired position
    measurement = state.imu.pitch
    
    error = setpoint - measurement
    integral += error * dt
    derivative = (error - prev_error) / dt
    
    output = pid_update(
        np.array([error]),
        np.array([integral]),
        np.array([derivative]),
        kp=0.5, ki=0.1, kd=0.05
    )
    
    go2.walk(output[0], 0, 0)
    prev_error = error
    return True

rfx.run_control_loop(rate_hz=100, callback=control_callback)

Best Practices

Profile First

Use the benchmark harness to identify hot kernels before optimizing.

Keep Kernels Small

rfxJIT excels at small, elementwise kernels. Use tinygrad for large neural networks.

Use Type Hints

Explicit dtypes improve IR generation and backend selection.

Test Numerics

Compare JIT results against reference implementations to catch bugs.
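A simple pattern: run the compiled function and a plain-NumPy reference on the same inputs and compare with a tolerance. The snippet below uses a stand-in for the jitted function so it is self-contained; in practice `jitted` would be the @jit-decorated version, and float32 kernels typically agree with a float64 reference to roughly 1e-5 relative error:

```python
import numpy as np

def reference(x, y):
    return np.sin(x * 2.0) + y

# Hypothetical stand-in for the @jit-compiled version from
# rfxJIT.runtime.tinyjit; swap in the decorated function in real tests.
jitted = reference

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
y = rng.standard_normal(1024).astype(np.float32)

np.testing.assert_allclose(
    jitted(x, y),
    reference(x.astype(np.float64), y.astype(np.float64)),
    rtol=1e-5, atol=1e-6,
)
print("numerics OK")
```

Randomized inputs with a fixed seed keep the check reproducible while still exercising a range of values, including near-zero ones where absolute tolerance matters.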
