Available SPU Samples
- sputest: Basic SPU thread creation and execution
- spudma: DMA transfers between PPU and SPU
- spuchain: SPU thread chains with synchronization
- spuparallel: Parallel processing with multiple SPUs
- spumars: MARS task queuing system
- spurs: SPURS task scheduling framework
- sputhread: SPU thread group management
Cell SPU Architecture
The PS3’s Cell processor contains:
- 1 PPU (Power Processing Unit): Main CPU running your application
- 6 SPUs (Synergistic Processing Units): Specialized coprocessors for parallel work
- Local Store: Each SPU has 256KB of fast local memory
- DMA: Explicit data transfers between main memory and SPU local store
SPU Characteristics
- SIMD: Vector processing with 128-bit registers
- No cache: All data must be explicitly loaded via DMA
- Fast: Excellent for parallel, data-intensive operations
- Separate code: SPU programs are compiled separately and loaded by PPU
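To make the SIMD point concrete, here is a small host-side model of a 128-bit register holding four floats. The `vec4f` type and the `vec4f_madd` and `saxpy4` helpers are hypothetical names used only for illustration; on a real SPU the equivalent would be a `vector float` and the `spu_madd` intrinsic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one 128-bit SPU register: four 32-bit float lanes. */
typedef struct { float lane[4]; } vec4f;

/* Lane-wise a * b + c, the operation spu_madd performs on a real SPU. */
static vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* y[i] = k * x[i] + y[i], four elements per step; n must be a multiple of 4. */
static void saxpy4(float k, const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    vec4f vk = { { k, k, k, k } };
    for (size_t i = 0; i < n; i += 4) {
        vec4f vx = { { x[i], x[i + 1], x[i + 2], x[i + 3] } };
        vec4f vy = { { y[i], y[i + 1], y[i + 2], y[i + 3] } };
        vec4f r = vec4f_madd(vk, vx, vy);
        for (int j = 0; j < 4; j++)
            y[i + j] = r.lane[j];
    }
}
```

Each loop iteration handles four elements, which is where the 4x figure for 32-bit data comes from.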
sputest - Basic SPU Execution
Location: samples/spu/sputest/
Simplest example of loading and running SPU code.
What It Demonstrates
- SPU subsystem initialization
- Creating raw SPU threads
- Loading SPU program images
- Starting SPU execution
- Reading SPU mailbox output
- Proper cleanup
PPU Code
samples/spu/sputest/source/main.c
SPU Code
samples/spu/sputest/spu/source/main.c
Execution Flow
spuchain - SPU Thread Chains
Location: samples/spu/spuchain/
Demonstrates coordinating multiple SPUs in a chain.
What It Demonstrates
- SPU thread groups
- Signal notifications between SPUs
- DMA transfers between SPU local stores
- Thread synchronization
- Chain processing pattern
Concept
Creates 6 SPU threads in a chain:
- PPU signals SPU 0
- SPU 0 processes data, DMAs to SPU 1
- SPU 1 processes, DMAs to SPU 2
- … continues through SPU 5
- SPU 5 writes result to main memory
PPU Implementation
samples/spu/spuchain/source/main.c
Example input and result: {1, 2, 3, 4} → {64, 128, 192, 256}
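The mapping {1, 2, 3, 4} → {64, 128, 192, 256} is consistent with each of the six stages doubling its input (2^6 = 64). The sketch below models that on the host. The doubling step is an assumption, since the real per-stage work lives in the sample's SPU source, and `stage_process` and `chain_run` are hypothetical names. Each `memcpy` stands in for a DMA transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHAIN_LEN 6   /* one stage per SPU */
#define VALUES    4   /* capacity of each stage's buffer */

/* Assumed per-stage work: double every value. */
static void stage_process(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= 2;
}

/* Push n values through the chain. Each stage owns a private buffer
   (its "local store"); memcpy models the DMA between stages. */
static void chain_run(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t local[CHAIN_LEN][VALUES];

    assert(n <= VALUES);
    memcpy(local[0], in, n * sizeof *in);                    /* input -> SPU 0 */
    for (int s = 0; s < CHAIN_LEN; s++) {
        stage_process(local[s], n);
        if (s + 1 < CHAIN_LEN)
            memcpy(local[s + 1], local[s], n * sizeof *in);  /* SPU s -> SPU s+1 */
    }
    memcpy(out, local[CHAIN_LEN - 1], n * sizeof *in);       /* SPU 5 -> main memory */
}
```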
DMA Transfers
SPUs cannot directly access main memory - all data must be transferred via DMA.
DMA Patterns
Simple DMA Get
Simple DMA Put
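A get followed by a matching put can be sketched as below. When built with `spu-gcc` the block uses the MFC calls from `spu_mfcio.h`. On any other compiler it falls back to hypothetical host stand-ins that copy immediately, so the logic can be tried off-target. The buffer size, the tag number, and the `increment_block` helper are illustrative assumptions, not taken from the sample.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef __SPU__
#include <spu_mfcio.h>   /* real MFC calls when built with spu-gcc */
#else
/* Host stand-ins (assumption): each "DMA" is a memcpy that completes at once. */
static void mfc_get(volatile void *ls, uint64_t ea, uint32_t size,
                    uint32_t tag, uint32_t tid, uint32_t rid)
{
    (void)tag; (void)tid; (void)rid;
    memcpy((void *)ls, (const void *)(uintptr_t)ea, size);
}
static void mfc_put(volatile void *ls, uint64_t ea, uint32_t size,
                    uint32_t tag, uint32_t tid, uint32_t rid)
{
    (void)tag; (void)tid; (void)rid;
    memcpy((void *)(uintptr_t)ea, (const void *)ls, size);
}
static void mfc_write_tag_mask(uint32_t mask) { (void)mask; }
static uint32_t mfc_read_tag_status_all(void) { return 0; }
#endif

#define BUF_BYTES 128u
#define DMA_TAG   3u      /* any tag group in 0..31 */

static uint32_t ls_buf[BUF_BYTES / 4] __attribute__((aligned(128)));

/* Block until every transfer issued under `tag` has completed. */
static void dma_wait(uint32_t tag)
{
    mfc_write_tag_mask(1u << tag);
    mfc_read_tag_status_all();
}

/* Fetch BUF_BYTES from main memory at `ea`, add 1 to every word, write back. */
static void increment_block(uint64_t ea)
{
    assert((ea & 15) == 0);                          /* 16-byte alignment */
    mfc_get(ls_buf, ea, BUF_BYTES, DMA_TAG, 0, 0);   /* main memory -> local store */
    dma_wait(DMA_TAG);
    for (uint32_t i = 0; i < BUF_BYTES / 4; i++)
        ls_buf[i] += 1;
    mfc_put(ls_buf, ea, BUF_BYTES, DMA_TAG, 0, 0);   /* local store -> main memory */
    dma_wait(DMA_TAG);
}
```

Waiting on the tag after the put matters as much as after the get: until the put completes, reusing `ls_buf` could corrupt the data still being written out.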
Double buffering
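The idea is to start fetching the next chunk before processing the current one, so transfer time overlaps with compute. The following host-side model shows only the control flow. `dma_get_start` and `dma_wait` are hypothetical helpers implemented with `memcpy`; on an SPU they would map to `mfc_get` and a tag-status wait, with one tag per buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 32u   /* 128 bytes per chunk */

/* Hypothetical stand-ins: the copy completes immediately on the host. */
static void dma_get_start(uint32_t *ls, const uint32_t *ea, size_t words, int tag)
{
    (void)tag;
    memcpy(ls, ea, words * sizeof *ls);
}
static void dma_wait(int tag) { (void)tag; }

/* Sum `total` words from "main memory"; total must be a multiple of CHUNK_WORDS. */
static uint64_t stream_sum(const uint32_t *src, size_t total)
{
    static uint32_t buf[2][CHUNK_WORDS];
    uint64_t sum = 0;
    size_t chunks = total / CHUNK_WORDS;
    int cur = 0;

    assert(total % CHUNK_WORDS == 0);
    if (chunks == 0)
        return 0;

    dma_get_start(buf[cur], src, CHUNK_WORDS, cur);        /* prime first buffer */
    for (size_t i = 0; i < chunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < chunks)                                /* prefetch next chunk */
            dma_get_start(buf[next], src + (i + 1) * CHUNK_WORDS,
                          CHUNK_WORDS, next);
        dma_wait(cur);                                     /* current chunk ready */
        for (size_t j = 0; j < CHUNK_WORDS; j++)
            sum += buf[cur][j];
        cur = next;                                        /* swap buffers */
    }
    return sum;
}
```

The cost is a second buffer in local store, so chunk size has to be balanced against the 256 KB budget.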
List DMA
DMA Requirements
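As a rough guide, the Cell MFC accepts transfers of 1, 2, 4, or 8 bytes, or any multiple of 16 bytes up to 16 KB. Small transfers must be naturally aligned and share the same low four address bits on both sides; larger ones need 16-byte alignment at both ends. Tag groups run from 0 to 31. The checker below encodes those rules as a sketch; `dma_args_valid` is a hypothetical helper, and the SDK documentation remains the authority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u   /* 16 KB per single transfer */

/* Return true if (ls, ea, size, tag) describes a legal single DMA transfer. */
static bool dma_args_valid(uint32_t ls, uint64_t ea, uint32_t size, uint32_t tag)
{
    if (tag > 31)
        return false;
    /* A zero-length request moves nothing, so this sketch rejects it. */
    if (size == 0 || size > DMA_MAX_BYTES)
        return false;

    if (size < 16) {
        if (size != 1 && size != 2 && size != 4 && size != 8)
            return false;
        if (ls % size != 0 || ea % size != 0)      /* natural alignment */
            return false;
        return (ls & 15) == (ea & 15);             /* same offset in quadword */
    }

    if (size % 16 != 0)
        return false;
    return ls % 16 == 0 && ea % 16 == 0;
}
```

A check like this is cheap to run in debug builds before each transfer, since a misaligned request is a common cause of hard-to-diagnose SPU faults.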
SPU Programming Patterns
SPU Thread Groups
Mailboxes for Communication
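The SPU inbound mailbox is a small queue of 32-bit values, four entries deep, which is why it suits commands and status words rather than data. The host-side model below illustrates that behaviour. `mbox_t`, `mbox_try_write`, and `mbox_try_read` are hypothetical names; SPU code would read the real mailbox with `spu_read_in_mbox` and reply with `spu_write_out_mbox`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IN_MBOX_DEPTH 4   /* inbound mailbox: four 32-bit entries */

typedef struct {
    uint32_t slot[IN_MBOX_DEPTH];
    unsigned head;    /* index of the oldest entry */
    unsigned count;   /* number of queued entries */
} mbox_t;

/* Writer side: returns false when the mailbox is full, so the caller retries. */
static bool mbox_try_write(mbox_t *m, uint32_t value)
{
    if (m->count == IN_MBOX_DEPTH)
        return false;
    m->slot[(m->head + m->count) % IN_MBOX_DEPTH] = value;
    m->count++;
    return true;
}

/* Reader side: returns false when empty (a real SPU channel read would stall). */
static bool mbox_try_read(mbox_t *m, uint32_t *value)
{
    if (m->count == 0)
        return false;
    *value = m->slot[m->head];
    m->head = (m->head + 1) % IN_MBOX_DEPTH;
    m->count--;
    return true;
}
```

A typical use is to send a command code or the address of a parameter block through the mailbox and move the actual payload with DMA.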
SPU Signal Notifications
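A signal notification register holds a single 32-bit value. In OR mode, incoming writes are merged bit by bit, so several senders can each own one bit; in overwrite mode the latest write wins. Reading the register clears it. The model below sketches that behaviour on the host, with `signal_reg_t`, `signal_send`, and `signal_read` as hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t value;
    bool or_mode;     /* true: writes are OR-ed together; false: last write wins */
} signal_reg_t;

/* Sender side: deliver `bits` to the register. */
static void signal_send(signal_reg_t *r, uint32_t bits)
{
    if (r->or_mode)
        r->value |= bits;
    else
        r->value = bits;
}

/* Receiver side: return the pending value and clear the register.
   A real SPU read would stall until a signal is pending. */
static uint32_t signal_read(signal_reg_t *r)
{
    uint32_t v = r->value;
    r->value = 0;
    return v;
}
```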
Building SPU Samples
Build Process
SPU samples require two compilation passes, with an embedding step between them:
- Compile SPU code with SPU compiler (spu-gcc)
- Embed SPU binary in PPU code
- Compile PPU code with PPU compiler (ppu-gcc)
Build Commands
SPU Makefile Structure
A typical SPU sample has:
Performance Considerations
DMA Latency
DMA transfers have latency (~200 cycles). Use double buffering to hide it.
Alignment
Keep data 128-byte aligned for best performance
SIMD
Use vector intrinsics to process four 32-bit values per instruction within each SPU
Local Store
Keep working set small (256KB total, including code)
Branch Prediction
SPUs have no hardware branch predictor and rely on compiler-inserted branch hints - avoid complex branching in hot loops
Mailbox Limits
Mailboxes are slow - use for control, not bulk data
Common SPU Patterns
Data Parallel Processing
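In the data-parallel pattern every SPU runs the same code on its own slice of the input. One way to compute the slices is sketched below, keeping each slice start on a 16-byte boundary so it can be used directly as a DMA source. `slice_t` and `slice_for` are hypothetical names, and 4-byte elements are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN_ELEMS 4   /* 4 elements x 4 bytes = 16 bytes */

typedef struct {
    size_t start;   /* index of the first element */
    size_t count;   /* number of elements in the slice */
} slice_t;

/* Slice of `total` elements assigned to worker `idx` out of `workers`.
   Every slice starts on a multiple of ALIGN_ELEMS; the last worker
   also takes whatever is left over. */
static slice_t slice_for(size_t total, unsigned workers, unsigned idx)
{
    slice_t s;
    size_t per;

    assert(workers > 0 && idx < workers);
    per = total / workers;
    per -= per % ALIGN_ELEMS;            /* round down to the alignment unit */
    s.start = per * idx;
    s.count = (idx + 1 == workers) ? total - s.start : per;
    return s;
}
```

With 100 elements and 6 workers this gives five slices of 16 elements and a final slice of 20. The start indices are aligned only if the base array itself is 16-byte aligned.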
Pipeline Processing
Reduction
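In a reduction, each SPU computes a partial result over its slice and the partials are then combined into one value. The host-side sketch below uses a sum as the operation and a pairwise combine at the end. `partial_sum` and `reduce_sum` are hypothetical names, and the six workers run sequentially here rather than in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SPUS 6

/* Work one SPU would do on its slice: a local partial sum. */
static uint64_t partial_sum(const uint32_t *data, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* Split the input, collect one partial per worker, then combine them. */
static uint64_t reduce_sum(const uint32_t *data, size_t total)
{
    uint64_t partial[NUM_SPUS];
    size_t per = total / NUM_SPUS;

    for (unsigned w = 0; w < NUM_SPUS; w++) {
        size_t start = per * w;
        /* The last worker also takes the remainder. */
        size_t count = (w + 1 == NUM_SPUS) ? total - start : per;
        partial[w] = partial_sum(data + start, count);
    }

    /* Pairwise (tree) combine with strides 1, 2, 4. */
    for (unsigned stride = 1; stride < NUM_SPUS; stride *= 2)
        for (unsigned w = 0; w + stride < NUM_SPUS; w += 2 * stride)
            partial[w] += partial[w + stride];

    return partial[0];
}
```

With only six partials a plain loop on the PPU would do equally well; the tree form matters more when the combine step itself runs on the SPUs.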
Debugging SPU Code
Related Documentation
- SPU API Reference: Complete SPU API documentation
- SPU Programming Guide: In-depth SPU programming concepts
- DMA Guide: DMA transfer patterns and optimization
- SIMD Intrinsics: SPU vector intrinsics reference