What is VLIW?
VLIW (Very Long Instruction Word) is a parallel execution model where multiple operations execute simultaneously within a single instruction cycle. In this architecture, each “instruction” is actually an instruction bundle containing operations for multiple engines. For example, a single bundle can issue:
- Two vector ALU operations in parallel
- One load operation in parallel
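As a rough illustration of such a bundle, the sketch below models one instruction as a Python dictionary that maps an engine name to the list of operations it issues in that cycle. The dictionary layout, the tuple format, and names such as `v_sum` are assumptions made for this example; they are not copied from problem.py.

```python
# Illustrative bundle layout (an assumption, not taken from problem.py):
# engine name -> list of operation tuples issued in the same cycle.
bundle = {
    "valu": [
        ("+", "v_sum", "v_a", "v_b"),   # first vector ALU operation
        ("*", "v_prod", "v_a", "v_b"),  # second vector ALU operation
    ],
    "load": [
        ("load", "s_x", "addr_x"),      # one load operation
    ],
}


def ops_per_engine(bundle):
    """Count how many slots each engine uses in one bundle."""
    return {engine: len(slots) for engine, slots in bundle.items()}


print(ops_per_engine(bundle))  # {'valu': 2, 'load': 1}
```

All three operations belong to one bundle, so in this model they would complete in a single cycle.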
Engine Slot Limits
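Each engine can execute multiple “slots” per cycle, up to a fixed per-engine limit defined in problem.py (for example, 12 scalar ALU slots, 6 vector ALU slots, and 2 load slots, as listed in the throughput table below).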
The simulator enforces these limits with assertions. Exceeding a slot limit will cause a runtime error.
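A minimal sketch of this kind of check follows. It assumes the dictionary-style bundle used elsewhere in these examples and takes its limits from the throughput table on this page; the names `SLOT_LIMITS` and `check_bundle` are hypothetical, and the real table in problem.py may cover more engines.

```python
# Limits copied from the throughput table on this page. The name SLOT_LIMITS
# and the set of engines listed here are assumptions for illustration.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2}


def check_bundle(bundle, limits=SLOT_LIMITS):
    """Raise AssertionError if any engine in the bundle exceeds its limit."""
    for engine, slots in bundle.items():
        limit = limits[engine]
        assert len(slots) <= limit, (
            f"{engine}: {len(slots)} slots exceeds limit of {limit}"
        )
    return True


ok = {"load": [("load", "a", 0), ("load", "b", 1)]}
too_many = {"load": [("load", "a", 0), ("load", "b", 1), ("load", "c", 2)]}

print(check_bundle(ok))  # True
try:
    check_bundle(too_many)
except AssertionError as err:
    print("rejected:", err)
```

Running a check like this before handing a program to the simulator can surface an over-full bundle earlier than the simulator's own assertion would.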
Maximizing Parallelism
To optimize performance, pack as many independent operations as possible into each instruction bundle:
- Poor packing (6 cycles): each operation occupies its own bundle.
- Good packing (1 cycle): the same operations share a single bundle.
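The difference between the two packings can be sketched with a small greedy packer. This is an illustration under stated assumptions: the operations are independent of one another, bundles are dictionaries keyed by engine name, and `pack_one_per_cycle` and `pack_greedy` are hypothetical helpers rather than functions from the repository.

```python
# Six independent scalar ALU operations, each tagged with its engine.
ops = [("alu", ("+", f"t{i}", f"a{i}", f"b{i}")) for i in range(6)]


def pack_one_per_cycle(ops):
    """Poor packing: every operation gets its own bundle (one cycle each)."""
    return [{engine: [slot]} for engine, slot in ops]


def pack_greedy(ops, limits):
    """Good packing: fill the current bundle until an engine's limit is hit.

    Only valid when the operations do not depend on each other's results.
    """
    bundles = [{}]
    for engine, slot in ops:
        current = bundles[-1]
        if len(current.get(engine, [])) >= limits[engine]:
            current = {}
            bundles.append(current)
        current.setdefault(engine, []).append(slot)
    return bundles


limits = {"alu": 12}  # scalar ALU limit from the throughput table
print(len(pack_one_per_cycle(ops)))   # 6
print(len(pack_greedy(ops, limits)))  # 1
```

A real scheduler would also need to respect data dependencies between operations, which this sketch deliberately ignores.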
What is SIMD?
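SIMD (Single Instruction, Multiple Data) allows one instruction to operate on multiple data elements simultaneously.
Vector Length
The vector length is defined in problem.py. Each vector operation processes 8 elements, as the throughput table on this page shows.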
Vector vs Scalar Operations
Scalar ALU
- Operations: Single 32-bit word operations
- Slots: 12 per cycle
- Use for: Control logic, addresses, loop counters, single values
Vector ALU
- Slots: 6 per cycle, each operating on 8 elements
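To make the scalar and vector cases concrete, the sketch below models both in plain Python. It assumes an 8-element vector (per the throughput table on this page) and 32-bit wraparound arithmetic; `scalar_add` and `vector_add` are illustrative names, not the simulator's operations.

```python
VLEN = 8           # elements per vector, per the throughput table
MASK = 0xFFFFFFFF  # assumption: values are 32-bit words that wrap on overflow


def scalar_add(a, b):
    """Model of one scalar ALU slot: a single 32-bit addition."""
    return (a + b) & MASK


def vector_add(va, vb):
    """Model of one vector ALU slot: VLEN additions performed together."""
    assert len(va) == len(vb) == VLEN
    return [(x + y) & MASK for x, y in zip(va, vb)]


a = list(range(VLEN))
b = [10] * VLEN

scalar_result = [scalar_add(x, y) for x, y in zip(a, b)]  # 8 scalar slots
vector_result = vector_add(a, b)                          # 1 vector slot

print(scalar_result == vector_result)  # True
print(vector_result)  # [10, 11, 12, 13, 14, 15, 16, 17]
```

Both paths produce the same values; the difference is how many slots the work would consume.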
One valu operation does the work of 8 scalar operations using just 1 slot!
Vector Operation Examples
Broadcasting
Copy a scalar value to all elements of a vector (defined in problem.py).
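The effect of a broadcast can be sketched in a few lines of Python. The function name `broadcast` and the list-based vector are assumptions for illustration; the actual operation name and operand format live in problem.py.

```python
VLEN = 8  # elements per vector, per the throughput table on this page


def broadcast(scalar, vlen=VLEN):
    """Return a vector whose every lane holds the same scalar value."""
    return [scalar] * vlen


print(broadcast(7))  # [7, 7, 7, 7, 7, 7, 7, 7]
```

Broadcasting is typically how a constant or loop-invariant scalar is made available to later vector operations.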
Vector Memory Operations
Load or store 8 contiguous elements with a single operation (defined in problem.py).
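A minimal model of contiguous vector loads and stores, assuming memory is a flat Python list of words. The helper names `vload` and `vstore` and their argument order are hypothetical and chosen only for this sketch.

```python
VLEN = 8  # elements moved per vector load or store


def vload(mem, addr, vlen=VLEN):
    """Read vlen contiguous words starting at addr."""
    assert 0 <= addr and addr + vlen <= len(mem), "vector load out of bounds"
    return mem[addr:addr + vlen]


def vstore(mem, addr, values):
    """Write the given words to contiguous addresses starting at addr."""
    assert 0 <= addr and addr + len(values) <= len(mem), "vector store out of bounds"
    mem[addr:addr + len(values)] = values


mem = list(range(32))
v = vload(mem, 4)
print(v)  # [4, 5, 6, 7, 8, 9, 10, 11]

vstore(mem, 16, v)
print(mem[16:24])  # [4, 5, 6, 7, 8, 9, 10, 11]
```

Because the elements must be contiguous, data that is scattered in memory would need separate scalar loads in this model.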
Fused Multiply-Add
Fused multiply-add computes a × b + c in a single operation rather than as a separate multiply and add (defined in problem.py).
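The arithmetic can be sketched per lane as follows. The function name `multiply_add`, the operand order, and the 32-bit wraparound are assumptions for this illustration; consult problem.py for the operation's actual definition.

```python
VLEN = 8
MASK = 0xFFFFFFFF  # assumption: lanes are 32-bit words that wrap on overflow


def multiply_add(va, vb, vc):
    """Return a * b + c for each lane, computed as one fused step."""
    assert len(va) == len(vb) == len(vc) == VLEN
    return [(a * b + c) & MASK for a, b, c in zip(va, vb, vc)]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2] * VLEN
c = [1] * VLEN
print(multiply_add(a, b, c))  # [3, 5, 7, 9, 11, 13, 15, 17]
```

Expressing a multiply followed by an add as one operation frees a slot that a separate multiply and add would otherwise consume.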
Instruction Packing Example
The reference kernel in perf_takehome.py builds each instruction from a list of slots.
Performance Implications
Throughput Comparison
| Operation Type | Slots/Cycle | Elements/Slot | Throughput |
|---|---|---|---|
| Scalar ALU | 12 | 1 | 12 ops/cycle |
| Vector ALU | 6 | 8 | 48 ops/cycle |
| Scalar Load | 2 | 1 | 2 loads/cycle |
| Vector Load | 2 | 8 | 16 loads/cycle |
For data-parallel work, vector operations provide 4× the ALU throughput (48 vs. 12 ops/cycle) and 8× the load throughput (16 vs. 2 loads/cycle).
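The throughput column is simply slots per cycle multiplied by elements per slot. The short calculation below reproduces it from the table's own numbers; the dictionary and function names are illustrative.

```python
# (slots per cycle, elements per slot), copied from the table above
ENGINES = {
    "Scalar ALU": (12, 1),
    "Vector ALU": (6, 8),
    "Scalar Load": (2, 1),
    "Vector Load": (2, 8),
}


def throughput(slots_per_cycle, elements_per_slot):
    """Peak elements processed per cycle when every slot is filled."""
    return slots_per_cycle * elements_per_slot


peak = {name: throughput(*row) for name, row in ENGINES.items()}
print(peak)
# {'Scalar ALU': 12, 'Vector ALU': 48, 'Scalar Load': 2, 'Vector Load': 16}

print(peak["Vector ALU"] / peak["Scalar ALU"])    # 4.0
print(peak["Vector Load"] / peak["Scalar Load"])  # 8.0
```

These are peak figures: they are reached only when every slot in every bundle is filled.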
When to Use Each
Use scalar operations for...
- Computing memory addresses
- Loop counters and conditions
- Branch decisions
- Single values that don’t have parallel analogs
- Operations that can’t be vectorized (dependencies)
Use vector operations for...
- Processing batches of independent data
- Element-wise array operations
- Loading/storing contiguous memory blocks
- Parallel hash computations
- Any operation that can be expressed as “do the same thing to N items”
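The two lists often apply to the same loop: scalar work steers the computation while vector work processes the data. The sketch below shows that split in plain Python, assuming a flat list as memory, 8-element vectors, and 32-bit wraparound; `add_constant` is a hypothetical helper, not part of the repository.

```python
VLEN = 8           # elements per vector operation, per this page
MASK = 0xFFFFFFFF  # assumption: 32-bit words that wrap on overflow


def add_constant(mem, base, count, constant):
    """Add `constant` to `count` contiguous words starting at `base`."""
    assert count % VLEN == 0, "this sketch handles whole vectors only"
    v_const = [constant] * VLEN            # vector work: broadcast once
    for offset in range(0, count, VLEN):   # scalar work: loop counter
        addr = base + offset               # scalar work: address arithmetic
        chunk = mem[addr:addr + VLEN]      # vector work: contiguous load
        result = [(x + c) & MASK for x, c in zip(chunk, v_const)]
        mem[addr:addr + VLEN] = result     # vector work: contiguous store
    return mem


mem = list(range(32))
add_constant(mem, base=8, count=16, constant=100)
print(mem[8:12])  # [108, 109, 110, 111]
```

Only the loop counter and the address calculation are scalar here; everything that touches the batch of data is expressed as whole-vector steps.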
Next Steps
Memory Model
Learn about memory layout and scratch space
Instruction Set
Complete reference for all operations