Overview
The PlayStation 3’s Cell Broadband Engine makes six Synergistic Processing Elements (SPEs) available to user programs. This guide follows common usage and calls them SPUs. Each one consists of:
SPU Core : 128-bit SIMD processor
256 KB Local Store (LS) : Fast local memory
Memory Flow Controller (MFC) : DMA engine for data transfers
SPUs are optimized for vectorized data-parallel workloads. Think of them as powerful compute accelerators similar to modern GPU compute units.
SPU Architecture
Key Characteristics
Local Store : 256 KB of fast SRAM; code, data, and stack must all fit here
SIMD Engine : 128-bit wide; operates on 16x8-bit, 8x16-bit, 4x32-bit, or 2x64-bit vectors
No Cache : Loads and stores reach only the local store; main memory is accessed explicitly via DMA
In-Order Execution : Dual-issue, in-order pipeline
Memory Model
SPUs have a unique memory architecture:
┌─────────────────────────────────────┐
│          Main Memory (PPU)          │
└──────────────┬──────────────────────┘
               │ DMA Transfers
               │ (via MFC)
               ▼
┌─────────────────────────────────────┐
│      SPU Local Store (256 KB)       │
│  ┌──────────┐  ┌─────────────────┐  │
│  │   Code   │  │   Data/Stack    │  │
│  └──────────┘  └─────────────────┘  │
└─────────────────────────────────────┘
SPUs cannot directly access main memory. All data must be transferred via DMA using the MFC.
SPU Thread Management
Creating and Running SPU Threads
From the PPU side, an SPU program can run either on a raw SPU or inside a thread group. The sample below takes the raw SPU route:
samples/spu/sputest/source/main.c
int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;
    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);
    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);
    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);
    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);
    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);
    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));
    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));
    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);
    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);
    return 0;
}
SPU Thread Groups
For better management, use SPU thread groups:
sys_spu_group_t group;
sysSpuThreadGroupAttribute attr;
sysSpuThreadAttribute thread_attr;
sysSpuThreadArgument args;
// Initialize attributes
sysSpuThreadGroupAttributeInitialize(attr);
sysSpuThreadGroupAttributeName(attr, "MyGroup");
// Create a group with room for 2 threads at priority 100
sysSpuThreadGroupCreate(&group, 2, 100, &attr);
// Initialize the thread in slot 0 (slot 1 is set up the same way, not shown)
sys_spu_thread_t thread;
sysSpuThreadAttributeInitialize(thread_attr);
sysSpuThreadArgumentInitialize(args);
args.arg0 = (u64)data_ea; // Pass effective address to SPU
sysSpuThreadInitialize(&thread, group, 0, &image, &thread_attr, &args);
// Start the group
sysSpuThreadGroupStart(group);
// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group, &cause, &status);
// Cleanup
sysSpuThreadGroupDestroy(group);
Thread group types:
SPU_THREAD_GROUP_TYPE_NORMAL : Standard thread group
SPU_THREAD_GROUP_TYPE_SEQUENTIAL : Threads run sequentially
SPU_THREAD_GROUP_TYPE_SYSTEM : System-level priority
SPU_THREAD_GROUP_TYPE_MEMORY_FROM_CONTAINER : Use memory container
SPU Programming Basics
Simple SPU Program
Here’s the complete SPU-side code from the test sample:
samples/spu/sputest/spu/source/main.c
#include <spu_intrinsics.h>
int main()
{
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}
This minimal program writes a single value to the outbound mailbox. The PPU reads that value back, which doubles as a synchronization point: once the value arrives, the PPU knows the SPU program has run.
SPU Channels
SPUs communicate via channels:
// Write to outbound mailbox
spu_writech (SPU_WrOutMbox, value);
// Read from inbound mailbox
u32 data = spu_readch (SPU_RdInMbox);
// Read signal notification register 1
u32 signal = spu_readch (SPU_RdSigNotify1);
Mailboxes : 32-bit message passing between PPU and SPU
Signal Notifications : Fast 32-bit signaling mechanism
DMA Programming
Basic DMA Operations
The MFC provides DMA commands for transferring data:
samples/spu/spudma/spu/source/main.c
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <sys/spu_thread.h>
#define TAG 1
/* Wait for DMA transfer to be finished */
static void wait_for_completion(void) {
    mfc_write_tag_mask(1 << TAG);
    spu_mfcstat(MFC_TAG_UPDATE_ALL);
}
int main(uint64_t ea, uint64_t outptr, uint64_t arg3, uint64_t arg4)
{
    /* Memory-aligned buffer (vectors always are properly aligned) */
    volatile vec_uchar16 v;
    /* Fetch the 16 bytes using DMA */
    mfc_get(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();
    /* Compare all characters with the small 'a' character code */
    vec_uchar16 cmp = spu_cmpgt(v, spu_splats((unsigned char)('a' - 1)));
    /* For all small characters, we remove 0x20 to get the corresponding capital */
    vec_uchar16 sub = spu_splats((unsigned char)0x20) & cmp;
    /* Convert all small characters to capitals */
    v = v - sub;
    /* Send the updated vector to PPE */
    mfc_put(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();
    /* Send a message to inform the PPE program that the work is done */
    uint32_t ok __attribute__((aligned(16))) = 1;
    mfc_put(&ok, outptr, 4, TAG, 0, 0);
    wait_for_completion();
    /* Properly exit the thread */
    spu_thread_exit(0);
    return 0;
}
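The compare, mask, and subtract steps in the sample can be hard to picture without SPU hardware. The following is a scalar sketch of the same branch-free trick in plain C, one byte at a time instead of 16 lanes at once. The function name `to_upper_masked` is hypothetical and not part of PSL1GHT. Like the sample, it only tests the lower bound, so any byte greater than or equal to 'a' (including '{') has 0x20 subtracted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for the 16-lane compare/mask/subtract in the SPU sample.
 * Each byte is handled without a branch:
 *   mask = 0xFF where byte > 'a' - 1, else 0x00   (models spu_cmpgt)
 *   sub  = 0x20 & mask                            (models the vector AND)
 *   out  = byte - sub                             (models the vector subtract)
 */
void to_upper_masked(uint8_t *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t mask = (uint8_t)-(bytes[i] > (uint8_t)('a' - 1));
        uint8_t sub = (uint8_t)(0x20 & mask);
        bytes[i] = (uint8_t)(bytes[i] - sub);
    }
}
```

On the SPU the loop body runs once for all 16 bytes of the vector, which is why the sample needs no loop at all.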
DMA Helper Library
PSL1GHT provides a comprehensive DMA wrapper library:
spu/include/dma/spu_dma.h
#include <dma/spu_dma.h>
// Standard DMA (16-byte aligned, multiple of 16 bytes)
spu_dma_get (ls_addr, ea, size, tag, 0 , 0 );
spu_dma_put (ls_addr, ea, size, tag, 0 , 0 );
// Small DMA (1, 2, 4, or 8 bytes)
spu_dma_small_get (ls_addr, ea, size, tag, 0 , 0 );
spu_dma_small_put (ls_addr, ea, size, tag, 0 , 0 );
// Large DMA (any size, automatically splits if > 16KB)
spu_dma_large_get (ls_addr, ea, size, tag, 0 , 0 );
spu_dma_large_put (ls_addr, ea, size, tag, 0 , 0 );
// List DMA (scatter-gather)
spu_dma_list_element list [NUM_ELEMENTS];
spu_dma_list_get (ls_addr, ea, list, list_size, tag, 0 , 0 );
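The idea of splitting a large transfer into commands of at most 16 KB can be sketched on the host as follows. This is an illustration of the technique only, not the PSL1GHT implementation: `dma_chunk` and `dma_split` are hypothetical names, and the function just records what each command would look like instead of issuing DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_CHUNK (16u * 1024u) /* per-command limit: 16 KB */

/* Hypothetical descriptor for one DMA command. */
typedef struct {
    uint32_t ls_addr; /* local store address */
    uint64_t ea;      /* effective address in main memory */
    uint32_t size;    /* bytes, at most DMA_MAX_CHUNK */
} dma_chunk;

/* Split one transfer into commands of at most DMA_MAX_CHUNK bytes.
 * Writes up to max_out descriptors and returns how many were written;
 * a transfer needing more than max_out commands is truncated. */
size_t dma_split(uint32_t ls_addr, uint64_t ea, uint32_t size,
                 dma_chunk *out, size_t max_out)
{
    size_t count = 0;

    assert(out != NULL);
    while (size > 0 && count < max_out) {
        uint32_t chunk = size > DMA_MAX_CHUNK ? DMA_MAX_CHUNK : size;

        out[count].ls_addr = ls_addr;
        out[count].ea = ea;
        out[count].size = chunk;
        count++;

        ls_addr += chunk;
        ea += chunk;
        size -= chunk;
    }
    return count;
}
```

A 40 KB request, for example, becomes two 16 KB commands followed by one 8 KB command, with both addresses advancing by the chunk size each time.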
DMA Requirements
Alignment Requirements:
Standard DMA: 16-byte alignment for both LS and EA; size must be a multiple of 16
Small DMA: size must be 1, 2, 4, or 8 bytes; LS and EA must be naturally aligned to that size and share the same lower 4 bits
Maximum transfer size: 16 KB per DMA command
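These rules are easy to get wrong, so a debug-time check can help. Below is a minimal sketch that encodes them in plain C. The name `dma_args_valid` is hypothetical, and the check additionally assumes that small transfers must be naturally aligned to their size, as the Cell architecture requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u) /* 16 KB per command */

/* Return true if (ls_addr, ea, size) satisfies the DMA rules above. */
bool dma_args_valid(uint32_t ls_addr, uint64_t ea, uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small DMA: same low 4 bits, naturally aligned to the size. */
        bool same_offset = (ls_addr & 0xFu) == (uint32_t)(ea & 0xFu);
        bool natural = (ls_addr % size) == 0;
        return same_offset && natural;
    }
    /* Standard DMA: everything 16-byte aligned, at most 16 KB. */
    return size != 0 && size <= DMA_MAX_SIZE &&
           size % 16 == 0 && ls_addr % 16 == 0 && ea % 16 == 0;
}

/* A few example checks that double as a usage demo. */
void dma_args_valid_examples(void)
{
    assert(dma_args_valid(0x1000, 0x20000000ULL, 128));    /* aligned block */
    assert(!dma_args_valid(0x1004, 0x20000000ULL, 128));   /* LS misaligned */
    assert(dma_args_valid(0x100C, 0x2000000CULL, 4));      /* matching low bits */
    assert(!dma_args_valid(0x1000, 0x2000000CULL, 4));     /* low bits differ */
    assert(!dma_args_valid(0x1000, 0x20000000ULL, 32768)); /* over 16 KB */
}
```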
Tag Management
// Wait for a specific tag
mfc_write_tag_mask(1 << TAG);
mfc_read_tag_status_all();
// Wait for multiple tags
mfc_write_tag_mask((1 << TAG1) | (1 << TAG2));
mfc_read_tag_status_all();
// Check tag status without blocking
// (mfc_stat_tag_status() returns a channel count, not the status bits)
mfc_write_tag_mask(1 << TAG);
uint32_t status = mfc_read_tag_status_immediate();
if (status & (1 << TAG)) {
    // Transfer complete
}
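The mask arithmetic behind these calls can be tried on the host. The sketch below builds a mask from a list of tag IDs and tests a status word against it. All three function names are hypothetical. It assumes the MFC's 32 tag groups (IDs 0 to 31) and shifts an unsigned constant, which also stays well defined for tag 31, where `1 << 31` on a signed int would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MFC_NUM_TAGS 32u /* tag IDs run from 0 to 31 */

/* Build a tag mask with one bit set per tag ID in the list. */
uint32_t tag_mask_of(const unsigned *tags, size_t count)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < count; i++) {
        assert(tags[i] < MFC_NUM_TAGS);
        mask |= UINT32_C(1) << tags[i];
    }
    return mask;
}

/* "All" semantics: every tag in the mask has completed. */
int tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}

/* "Any" semantics: at least one tag in the mask has completed. */
int tags_any_done(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}
```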
PPU-SPU Communication
Memory-Mapped SPU Resources
The PPU can access SPU resources via memory-mapped addresses:
#define SPU_THREAD_BASE 0xF0000000ULL
#define SPU_THREAD_OFFSET 0x00100000ULL
// Get base address for SPU thread
#define SPU_THREAD_GET_BASE_OFFSET(spu) \
    (SPU_THREAD_BASE + (SPU_THREAD_OFFSET * (spu)))
// Access local storage
#define SPU_THREAD_GET_LOCAL_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_LOCAL_OFFSET + (reg))
// Access problem storage (registers)
#define SPU_THREAD_GET_PROBLEM_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_PROBLEM_OFFSET + (reg))
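To make the address arithmetic concrete, here is a host-side sketch that evaluates the same formula as functions. The base and stride values are copied from the macros above. The function names are hypothetical, and the region offset is passed in as a parameter because the values of SPU_LOCAL_OFFSET and SPU_PROBLEM_OFFSET are not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define SPU_THREAD_BASE   0xF0000000ULL
#define SPU_THREAD_OFFSET 0x00100000ULL

/* Base effective address of the window for SPU thread number `spu`. */
uint64_t spu_thread_base_ea(uint32_t spu)
{
    return SPU_THREAD_BASE + SPU_THREAD_OFFSET * spu;
}

/* Address of `reg` inside a region (local storage or problem storage)
 * that starts `region_offset` bytes into the thread's window. */
uint64_t spu_thread_region_ea(uint32_t spu, uint64_t region_offset,
                              uint64_t reg)
{
    return spu_thread_base_ea(spu) + region_offset + reg;
}

/* Example values that double as a usage demo. */
void spu_thread_ea_examples(void)
{
    assert(spu_thread_base_ea(0) == 0xF0000000ULL);
    assert(spu_thread_base_ea(3) == 0xF0300000ULL);
    /* 0x40000 is a made-up region offset used only for this demo. */
    assert(spu_thread_region_ea(1, 0x40000, 0x10) == 0xF0140010ULL);
}
```

Each thread gets a 1 MB window, so thread 3 starts at 0xF0300000.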
Direct Memory Access
PPU can read/write SPU local storage:
// Write to SPU local storage
sysSpuThreadWriteLocalStorage (thread, address, value, type);
// Read from SPU local storage
u64 value;
sysSpuThreadReadLocalStorage (thread, address, & value , type);
Signal Notifications
Fast signaling from PPU to SPU:
// From PPU: Write to SPU signal register
sysSpuThreadWriteSignal (thread, 0 , signal_value); // Signal register 1
sysSpuThreadWriteSignal (thread, 1 , signal_value); // Signal register 2
// From SPU: Read signal register
u32 signal = spu_readch (SPU_RdSigNotify1);
Signal Notification Modes
Overwrite mode : New value replaces old value
OR mode : New value is OR’ed with existing value
Configure with: sysSpuThreadSetConfiguration (thread, SPU_SIGNAL1_OR | SPU_SIGNAL2_OVERWRITE);
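The difference between the two modes shows up when several writes land before the SPU reads the register. The following host-side model is a sketch only: `signal_reg`, `signal_mode`, and `signal_write` are hypothetical names that imitate the behavior described above, not PSL1GHT types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one signal notification register. */
typedef enum { SIGNAL_MODE_OVERWRITE, SIGNAL_MODE_OR } signal_mode;

typedef struct {
    uint32_t value;
    signal_mode mode;
} signal_reg;

/* Apply one incoming write according to the register's mode. */
void signal_write(signal_reg *reg, uint32_t value)
{
    assert(reg != NULL);
    if (reg->mode == SIGNAL_MODE_OR)
        reg->value |= value; /* bits from earlier writes survive */
    else
        reg->value = value;  /* the last write wins */
}
```

With two writes of 0x1 and 0x4, an overwrite-mode register ends up holding 0x4, while an OR-mode register holds 0x5. OR mode therefore suits the case where several senders each own one bit.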
Mailbox Communication
// PPU writes to SPU inbound mailbox
sysSpuThreadWriteMb (thread, value);
// SPU reads from inbound mailbox
u32 msg = spu_readch (SPU_RdInMbox);
// SPU writes to outbound mailbox
spu_writech (SPU_WrOutMbox, value);
// PPU reads the outbound mailbox of a raw SPU (via problem storage)
u32 reply = sysSpuRawReadProblemStorage(spu, SPU_Out_MBox);
SPU-to-SPU Communication
SPUs can communicate directly:
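// SPU-to-SPU signal notification
u64 target_spu_signal_ea = SPU_THREAD_BASE +
    (target_spu * SPU_THREAD_OFFSET) +
    SPU_THREAD_Sig_Notify_1;
// A 4-byte DMA needs matching low 4 bits in the LS address and the EA, so
// use the word of an aligned quadword whose offset matches the EA.
u32 signal_buf[4] __attribute__((aligned(16)));
u32 *signal_value = &signal_buf[(target_spu_signal_ea & 0xF) >> 2];
*signal_value = 0x42;
mfc_put(signal_value, target_spu_signal_ea, 4, TAG, 0, 0);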
SPU-to-SPU local store DMA is also possible using the memory-mapped addresses.
SPU Thread API
SPU programs can control their execution:
spu/include/sys/spu_thread.h
// Exit current SPU thread
void spu_thread_exit ( int status );
// Exit entire SPU thread group
void spu_thread_group_exit ( int status );
// Yield to scheduler
void spu_thread_group_yield ( void );
Building SPU Programs
SPU programs use separate build rules:
include $(PSL1GHT)/spu_rules
CFLAGS = -O2 -Wall $(MACHDEP)
LDFLAGS = $(MACHDEP) -Wl,-Map,$(notdir $@).map
TARGET = spu
# Build SPU ELF
$(TARGET).elf: $(OFILES)
# Convert to binary for embedding
$(TARGET).bin: $(TARGET).elf
	$(OBJCOPY) -O binary $< $@
From spu_rules: MACHDEP = -mdual-nops -fmodulo-sched -ffunction-sections -fdata-sections
-mdual-nops : Generate dual-issue NOPs for better pipeline usage
-fmodulo-sched : Enable software pipelining
Best Practices
Minimize DMA Overhead
Transfer larger blocks instead of many small transfers
Use double-buffering: process one buffer while DMA transfers another
Overlap computation with DMA using multiple tags
Vectorize Your Code
SPUs are designed for SIMD. Use vector types:
vec_float4 a, b, c;
c = spu_add(a, b); // 4 additions in parallel
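For readers without an SPU toolchain, the loop shape that vectorization implies can be modeled in portable C. This is a scalar sketch under the assumption that the array length is a multiple of four. `add_by_lanes` is a hypothetical name, and the inner loop stands in for what a single `spu_add` on two `vec_float4` values would compute.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* a 128-bit register holds four 32-bit floats */

/* Add two float arrays four lanes at a time.
 * n must be a multiple of LANES, mirroring whole-vector processing. */
void add_by_lanes(const float *a, const float *b, float *out, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        /* On an SPU this inner loop would be one spu_add. */
        for (size_t lane = 0; lane < LANES; lane++)
            out[i + lane] = a[i + lane] + b[i + lane];
    }
}
```

Padding arrays to a multiple of four elements avoids a separate scalar tail loop.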
Mind the Local Store
Keep total code + data + stack under 256 KB
Use -ffunction-sections and -Wl,--gc-sections to remove unused code
Consider streaming data for large datasets
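The local store budget is simple arithmetic, and it helps to do it before sizing buffers. The sketch below is a hypothetical helper pair (`ls_free_bytes` and `ls_buffer_size` are not PSL1GHT functions). It computes what is left after code and stack, then the largest equal buffer size, rounded down to a multiple of 16 to match the standard DMA size rule.

```c
#include <assert.h>
#include <stdint.h>

#define LS_SIZE (256u * 1024u) /* total local store in bytes */

/* Bytes left for data buffers once code and stack are reserved. */
uint32_t ls_free_bytes(uint32_t code_bytes, uint32_t stack_bytes)
{
    assert(code_bytes <= LS_SIZE);
    assert(stack_bytes <= LS_SIZE - code_bytes);
    return LS_SIZE - code_bytes - stack_bytes;
}

/* Largest size for each of num_buffers equal buffers,
 * rounded down to a multiple of 16 bytes. */
uint32_t ls_buffer_size(uint32_t free_bytes, uint32_t num_buffers)
{
    assert(num_buffers > 0);
    return (free_bytes / num_buffers) & ~UINT32_C(15);
}
```

For example, with 64 KB of code and a 16 KB stack, 176 KB remain, which allows two buffers of 88 KB each. Buffers larger than 16 KB still have to be filled with several DMA commands.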
Use Thread Groups
Thread groups provide better management and synchronization than raw SPUs.
Proper Thread Termination
Always call spu_thread_exit() to properly terminate SPU threads.
Double-buffering example:
#define BUFFER_SIZE 1024
u8 buffer[2][BUFFER_SIZE] __attribute__((aligned(128)));
int current = 0;
// Start first transfer; each buffer uses its own tag (0 or 1)
mfc_get(buffer[current], ea, BUFFER_SIZE, current, 0, 0);
for (int i = 0; i < num_iterations; i++) {
    int next = 1 - current;
    // Start next DMA
    if (i < num_iterations - 1) {
        mfc_get(buffer[next], ea + (i + 1) * BUFFER_SIZE, BUFFER_SIZE, next, 0, 0);
    }
    // Wait for the current buffer only, so the next transfer keeps running
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();
    // Process current buffer
    process(buffer[current]);
    current = next;
}
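The control flow of double-buffering can be exercised on a host machine by replacing the DMA with `memcpy`. The following is a sketch under that assumption: `fake_dma_get` and `sum_double_buffered` are hypothetical names, and because the copy completes immediately there is no real overlap here, only the same buffer rotation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16u /* bytes per buffer; tiny so the demo is easy to follow */

/* Stand-in for mfc_get: on the host the "transfer" completes at once. */
static void fake_dma_get(uint8_t *ls, const uint8_t *main_mem, size_t size)
{
    memcpy(ls, main_mem, size);
}

/* Sum all bytes of src, streaming it through two CHUNK-sized buffers. */
uint32_t sum_double_buffered(const uint8_t *src, size_t num_chunks)
{
    uint8_t buffer[2][CHUNK];
    uint32_t total = 0;
    int current = 0;

    assert(src != NULL || num_chunks == 0);
    if (num_chunks == 0)
        return 0;

    fake_dma_get(buffer[current], src, CHUNK); /* prefetch chunk 0 */
    for (size_t i = 0; i < num_chunks; i++) {
        int next = 1 - current;

        if (i + 1 < num_chunks) /* prefetch chunk i + 1 */
            fake_dma_get(buffer[next], src + (i + 1) * CHUNK, CHUNK);

        /* On an SPU, the wait for buffer[current]'s tag would go here. */
        for (size_t j = 0; j < CHUNK; j++)
            total += buffer[current][j];

        current = next;
    }
    return total;
}
```

The key property is that a buffer is only read after its own transfer has been issued and waited on, and only overwritten after it has been processed.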
Multiple tags example:
#define TAG_READ 1
#define TAG_WRITE 2
// Issue read and write simultaneously
mfc_get(input_buffer, input_ea, size, TAG_READ, 0, 0);
mfc_put(output_buffer, output_ea, size, TAG_WRITE, 0, 0);
// Wait for both
mfc_write_tag_mask((1 << TAG_READ) | (1 << TAG_WRITE));
mfc_read_tag_status_all();
See Also
PPU Architecture : Learn about the PowerPC Processor Unit
Memory Management : Memory allocation and DMA best practices
Build System : Building SPU programs
SPU API Reference : Complete SPU API documentation