ExecutionPlan to a Session. The session owns all GPU resources and replays the pre-compiled dispatch sequence on every training step.
## blade-graphics

Meganeura uses blade-graphics as its GPU abstraction layer. blade-graphics wraps Vulkan (Linux / Windows) and Metal (macOS) behind a single API. All buffer creation, shader compilation, and command encoding go through blade-graphics — you never write Vulkan or Metal code directly. The GPU context is created inside `Session::new`.
Set the `MEGANEURA_DEVICE_ID` environment variable to select a specific GPU on multi-adapter systems.
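As a sketch of how that variable might be consumed — the parsing and zero fallback are assumptions here; the real selection logic lives inside `Session::new`:

```rust
/// Parse a device index from the raw environment value, falling back to
/// adapter 0 when the variable is unset or malformed. (Sketch only; the
/// helper names are hypothetical, not Meganeura's actual API.)
fn parse_device_id(raw: Option<String>) -> u32 {
    raw.and_then(|s| s.parse().ok()).unwrap_or(0)
}

fn select_device_id() -> u32 {
    parse_device_id(std::env::var("MEGANEURA_DEVICE_ID").ok())
}

fn main() {
    assert_eq!(parse_device_id(Some("1".to_string())), 1);
    assert_eq!(parse_device_id(Some("not a number".to_string())), 0);
    assert_eq!(parse_device_id(None), 0);
    println!("selected device {}", select_device_id());
}
```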
## ExecutionPlan

`compile::compile()` produces an `ExecutionPlan` — a fully static description of everything the GPU needs to do.
`BufferRef` is a typed `u32` index into `buffers`. Every node in the graph gets one buffer; leaf nodes (`Input`, `Parameter`, `Constant`) have their buffers filled before dispatch execution begins.
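The typed-index pattern can be sketched as a `u32` newtype — a minimal illustration of the idea, assuming the newtype shape described above (the field layout and `index` helper are assumptions):

```rust
/// A typed index into the plan's flat buffer table. Wrapping the u32 in
/// a newtype prevents mixing buffer indices with other integer IDs.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BufferRef(u32);

impl BufferRef {
    fn index(self) -> usize {
        self.0 as usize
    }
}

fn main() {
    // `buffers` stands in for the session's per-node buffer table:
    // one entry per graph node, leaves filled before dispatch.
    let buffers = vec!["input", "weight", "output"];
    let out = BufferRef(2);
    assert_eq!(buffers[out.index()], "output");
}
```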
## Session

`Session::new(plan)` allocates all GPU buffers, compiles the required shaders, and reorders dispatches for optimal barrier grouping. The resulting session exposes the following methods:
| Method | Description |
|---|---|
| `set_parameter(name, data)` | Upload f32 data to a named parameter buffer |
| `set_input(name, data)` | Upload f32 data to a named input buffer |
| `set_input_u32(name, data)` | Upload u32 data (e.g. token IDs) |
| `set_learning_rate(lr)` | Schedule an SGD update on the next `step()` |
| `set_adam(lr, b1, b2, eps)` | Schedule an Adam update on the next `step()` |
| `step()` | Submit all dispatches (forward + backward + optimizer update) |
| `wait()` | Block until the GPU submission completes |
| `read_loss()` | Read back the scalar loss value |
| `read_output(len)` | Read back the primary output tensor |
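A typical training loop drives these methods in order. The sketch below uses a minimal stand-in `Session` so it compiles and runs without a GPU — only the method names mirror the table above; the buffer names (`"w"`, `"x"`), shapes, and loss behavior are invented:

```rust
/// Mock Session: records the call sequence from the table above.
/// The real meganeura Session owns GPU resources; this stub only
/// demonstrates the intended calling pattern.
struct Session {
    loss: f32,
}

impl Session {
    fn set_parameter(&mut self, _name: &str, _data: &[f32]) {}
    fn set_input(&mut self, _name: &str, _data: &[f32]) {}
    fn set_adam(&mut self, _lr: f32, _b1: f32, _b2: f32, _eps: f32) {}
    fn step(&mut self) { self.loss *= 0.9; } // pretend training converges
    fn wait(&self) {}
    fn read_loss(&self) -> f32 { self.loss }
}

fn main() {
    let mut session = Session { loss: 1.0 };
    session.set_parameter("w", &[0.1, 0.2]); // hypothetical parameter name
    session.set_adam(1e-3, 0.9, 0.999, 1e-8);
    for _ in 0..10 {
        session.set_input("x", &[1.0, 2.0]); // hypothetical input name
        session.step(); // submit forward + backward + optimizer update
        session.wait(); // block until the submission completes
    }
    assert!(session.read_loss() < 1.0);
    println!("final loss: {}", session.read_loss());
}
```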
## Cooperative matrix operations

Modern GPUs expose hardware matrix-multiply units that operate on small tiles (e.g. 16×16 or 32×32 f16 matrices) directly in shader registers. On Vulkan these are surfaced via the `VK_KHR_cooperative_matrix` extension; blade-graphics exposes them through `CooperativeMatrix` capabilities.
Meganeura queries the GPU at session creation time and selects the best tile configuration.
A validation dispatch (`test_coop_matmul`) is run immediately after selection — some drivers (e.g. AMD RADV) advertise the extension but reject specific matrix shapes at shader creation time.
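The selection step can be sketched as picking the largest square tile the driver advertises — an assumption about the "best tile" heuristic; the struct fields and the square-tile preference are illustrative, not Meganeura's actual types:

```rust
/// One cooperative-matrix shape advertised by the driver
/// (field names are assumptions for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct CoopShape {
    m: u32,
    n: u32,
    k: u32,
}

/// Prefer the largest square tile; a validation matmul would still be
/// needed afterwards, since some drivers reject shapes they advertise.
fn select_tile(shapes: &[CoopShape]) -> Option<CoopShape> {
    shapes
        .iter()
        .filter(|s| s.m == s.n && s.n == s.k)
        .max_by_key(|s| s.m)
        .copied()
}

fn main() {
    let advertised = [
        CoopShape { m: 16, n: 16, k: 16 },
        CoopShape { m: 32, n: 32, k: 32 },
        CoopShape { m: 16, n: 8, k: 16 }, // non-square: skipped
    ];
    let best = select_tile(&advertised).unwrap();
    assert_eq!(best.m, 32);
}
```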
## `MIN_COOP_WORKGROUPS`: scalar vs. cooperative path

The cooperative path is not unconditionally faster. It uses a 2×2 tile layout, so each wave covers a `2*tile_size × 2*tile_size` output region. For small matrices the workgroup count is too low to saturate the GPU, and the scalar tiled kernel runs faster.
Meganeura evaluates each dispatch individually. For each matmul with output shape `[M, N]`, it computes the cooperative workgroup count `coop_wgs`. If `coop_wgs >= MIN_COOP_WORKGROUPS` (or `MIN_COOP_WORKGROUPS_HIGH_K` when K ≥ 1024), the dispatch is marked `use_coop = true` and its workgroup count is recomputed for the coop tile size. Otherwise the scalar 64×64 tiled kernel is used.
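The heuristic above can be sketched as follows. The threshold constants are placeholders (the real values are not given here), but the workgroup-count arithmetic follows the 2×2-tile-per-wave rule described above:

```rust
// Placeholder thresholds — the actual values in Meganeura are not
// stated in this document.
const MIN_COOP_WORKGROUPS: u32 = 64;
const MIN_COOP_WORKGROUPS_HIGH_K: u32 = 32;

/// Decide scalar vs. cooperative path for an [M, N] output.
/// Each coop workgroup covers a 2*tile × 2*tile output region.
fn use_coop(m: u32, n: u32, k: u32, tile_size: u32) -> bool {
    let span = 2 * tile_size; // 2×2 tile layout per wave
    let coop_wgs = m.div_ceil(span) * n.div_ceil(span);
    let min = if k >= 1024 {
        MIN_COOP_WORKGROUPS_HIGH_K
    } else {
        MIN_COOP_WORKGROUPS
    };
    coop_wgs >= min
}

fn main() {
    // Large matmul: plenty of workgroups, coop path wins.
    assert!(use_coop(4096, 4096, 512, 16));
    // Tiny matmul: too few workgroups to saturate the GPU.
    assert!(!use_coop(64, 64, 512, 16));
}
```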
## Dispatch barrier groups

Within a single training step, many dispatches are independent (e.g. the Q, K, and V projection matmuls can all execute concurrently). Meganeura groups dispatches into barrier groups to maximise concurrency:

- Dispatches are sorted by dependency level (Kahn's algorithm over buffer write→read edges).
- All dispatches at the same level form one compute pass with no internal barriers.
- A pass boundary (an `ALL_COMMANDS` barrier in blade-graphics) separates levels.
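The levelling pass can be sketched as follows: a dispatch's level is one past the deepest level among the dispatches that wrote the buffers it reads. This assumes the dispatch list is already in recording (topological) order; the `Dispatch` type here is a stand-in for Meganeura's internal representation:

```rust
use std::collections::HashMap;

/// Stand-in dispatch: which buffer indices it reads and writes.
struct Dispatch {
    reads: Vec<u32>,
    writes: Vec<u32>,
}

/// Assign each dispatch a barrier level from buffer write→read edges.
/// Dispatches sharing a level form one compute pass with no internal
/// barriers; a pass boundary separates consecutive levels.
fn barrier_levels(dispatches: &[Dispatch]) -> Vec<usize> {
    let mut writer_level: HashMap<u32, usize> = HashMap::new();
    let mut levels = Vec::with_capacity(dispatches.len());
    for d in dispatches {
        // One past the deepest producer of any buffer we read.
        let level = d
            .reads
            .iter()
            .filter_map(|b| writer_level.get(b))
            .map(|l| l + 1)
            .max()
            .unwrap_or(0);
        for b in &d.writes {
            writer_level.insert(*b, level);
        }
        levels.push(level);
    }
    levels
}

fn main() {
    // Q, K, V projections all read buffer 0; attention reads their outputs.
    let ds = vec![
        Dispatch { reads: vec![0], writes: vec![1] },       // Q
        Dispatch { reads: vec![0], writes: vec![2] },       // K
        Dispatch { reads: vec![0], writes: vec![3] },       // V
        Dispatch { reads: vec![1, 2, 3], writes: vec![4] }, // attention
    ];
    // Q, K, V share level 0 (one pass); attention waits at level 1.
    assert_eq!(barrier_levels(&ds), vec![0, 0, 0, 1]);
}
```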
## MemorySummary

`Session::memory_summary()` returns a breakdown of GPU memory usage.
`adam_state_bytes` is the combined size of all first- and second-moment buffers (`m` and `v` in the Adam optimizer) — each parameter gets two copies of its own size for Adam state.
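That accounting is simple to verify: two f32 moment buffers per parameter means Adam state is exactly twice the total parameter size. A sketch (the function name and input shape are illustrative, not the `MemorySummary` API):

```rust
/// Adam keeps two f32 moment buffers (m and v) per parameter, so its
/// state is exactly twice the total parameter byte count.
fn adam_state_bytes(param_elem_counts: &[usize]) -> usize {
    let param_bytes: usize = param_elem_counts
        .iter()
        .map(|n| n * std::mem::size_of::<f32>())
        .sum();
    2 * param_bytes // one copy for m, one for v
}

fn main() {
    // e.g. a 768×768 weight matrix plus a 768-element bias
    let total = adam_state_bytes(&[768 * 768, 768]);
    assert_eq!(total, 2 * (768 * 768 + 768) * 4);
    println!("adam state: {} bytes", total);
}
```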
## Perfetto trace output

`Session` integrates with the `meganeura::profiler` module to produce Perfetto binary traces (`.pftrace`). CPU-side work appears as nested spans on the CPU track. GPU pass durations come from blade-graphics hardware timestamp queries and appear on a separate GPU track.
Enable profiling before calling `step()`, then open `trace.pftrace` in the Perfetto UI to see the full timeline.
## Next steps

- **Computation graph**: how to build and inspect the graph that the session executes.
- **Profiling**: how to capture and interpret Perfetto traces for performance analysis.