The SPU samples demonstrate programming the Cell Broadband Engine’s Synergistic Processing Units (SPUs), including thread creation, DMA transfers, and parallel processing patterns.
Available SPU Samples
sputest : Basic SPU thread creation and execution
spudma : DMA transfers between PPU and SPU
spuchain : SPU thread chains with synchronization
spuparallel : Parallel processing with multiple SPUs
spumars : MARS task queuing system
spurs : SPURS task scheduling framework
sputhread : SPU thread group management
Cell SPU Architecture
The PS3’s Cell processor contains:
1 PPU (Power Processing Unit) : Main CPU running your application
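6 SPUs (Synergistic Processing Units) : Specialized coprocessors for parallel work. The chip has 8 SPUs; on the PS3 one is disabled and one is reserved for the system, leaving 6 for applications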
Local Store : Each SPU has 256KB of fast local memory
DMA : Explicit data transfers between main memory and SPU local store
SPU Characteristics
SIMD : Vector processing with 128-bit registers
No cache : All data must be explicitly loaded via DMA
Fast : Excellent for parallel, data-intensive operations
Separate code : SPU programs are compiled separately and loaded by PPU
sputest - Basic SPU Execution
Location: samples/spu/sputest/
Simplest example of loading and running SPU code.
What It Demonstrates
SPU subsystem initialization
Creating raw SPU threads
Loading SPU program images
Starting SPU execution
Reading SPU mailbox output
Proper cleanup
PPU Code
samples/spu/sputest/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <sys/spu.h>
#include "spu_bin.h"

int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;

    printf("sputest starting....\n");

    // Initialize SPU runtime (6 SPUs, 5 raw SPUs)
    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);

    // Create a raw SPU
    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);

    // Import SPU binary image
    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);

    // Load image into SPU local store
    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);

    // Start SPU execution
    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);

    // Wait for SPU to write to outbound mailbox
    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));

    // Read mailbox value
    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));

    // Cleanup
    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);
    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);

    return 0;
}
SPU Code
samples/spu/sputest/spu/source/main.c
#include <spu_intrinsics.h>

int main()
{
    // Write a value to outbound mailbox
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}
Execution Flow
PPU initializes SPU subsystem
sysSpuInitialize(6, 5) - Use 6 SPUs total, 5 as raw SPUs
PPU creates SPU thread
sysSpuRawCreate(&spu_id, NULL) - Get an available SPU
PPU loads SPU program
SPU binary is loaded into the SPU’s local store
PPU starts SPU
Write to SPU_RunCtrl to begin execution
SPU executes
SPU runs its program (writes to mailbox)
PPU reads result
Polls mailbox status, then reads mailbox value
Cleanup
Destroy SPU thread and close image
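The poll-then-read step can be illustrated with a small host-side model. This is only a sketch: `mbox_t`, `mbox_write`, `mbox_status`, `mbox_read`, and `mbox_poll_read` are hypothetical names, not PSL1GHT functions, and the model assumes a one-entry outbound mailbox whose status word has its low bit set while a value is pending (which is what the `& 1` test in the sample relies on).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of a one-entry SPU outbound mailbox. */
typedef struct {
    uint32_t value; /* pending value, meaningful only when count == 1 */
    uint32_t count; /* number of unread entries: 0 or 1 */
} mbox_t;

/* SPU side: returns false if the mailbox is full (a real SPU would stall). */
static bool mbox_write(mbox_t *m, uint32_t v)
{
    if (m->count != 0)
        return false;
    m->value = v;
    m->count = 1;
    return true;
}

/* PPU side: status word, low bit set while a value is pending. */
static uint32_t mbox_status(const mbox_t *m)
{
    return m->count;
}

/* PPU side: reading consumes the pending entry. */
static uint32_t mbox_read(mbox_t *m)
{
    m->count = 0;
    return m->value;
}

/* Poll the status a bounded number of times, then read.
 * Returns true and stores the value in *out if one arrived. */
static bool mbox_poll_read(mbox_t *m, unsigned max_polls, uint32_t *out)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (mbox_status(m) & 1) {
            *out = mbox_read(m);
            return true;
        }
    }
    return false;
}
```

The sample spins without a limit; bounding the number of polls, as in this model, is one way to avoid hanging the PPU if the SPU program never writes to its mailbox.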
spuchain - SPU Thread Chains
Location: samples/spu/spuchain/
Demonstrates coordinating multiple SPUs in a chain.
What It Demonstrates
SPU thread groups
Signal notifications between SPUs
DMA transfers between SPU local stores
Thread synchronization
Chain processing pattern
Concept
Creates 6 SPU threads in a chain:
PPU signals SPU 0
SPU 0 processes data, DMAs to SPU 1
SPU 1 processes, DMAs to SPU 2
… continues through SPU 5
SPU 5 writes result to main memory
Each SPU multiplies the vector by 2, so after all six stages every element has been multiplied by 2^6 = 64.
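The arithmetic of the chain can be checked on a host machine without any SPU code. The sketch below is a plain C model, not the sample's implementation: `chain_stage` and `run_chain` are hypothetical names, and the DMA hand-off between stages is replaced by operating on the same array in place.

```c
#include <stddef.h>
#include <stdint.h>

#define CHAIN_STAGES 6 /* one stage per SPU in the chain */
#define VEC_LEN 4      /* four 32-bit values, one 128-bit vector */

/* One stage of the chain: double every element. */
static void chain_stage(uint32_t v[VEC_LEN])
{
    for (size_t i = 0; i < VEC_LEN; i++)
        v[i] *= 2;
}

/* Run the vector through all stages in rank order. */
static void run_chain(uint32_t v[VEC_LEN])
{
    for (int rank = 0; rank < CHAIN_STAGES; rank++)
        chain_stage(v);
}
```

Calling `run_chain` on `{1, 2, 3, 4}` leaves `{64, 128, 192, 256}` in the array, matching the factor of 64 described above.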
PPU Implementation
samples/spu/spuchain/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <sys/spu.h>
#include "spustr.h"
#include "spu_bin.h"

#define ptr2ea(x) ((u64)((void *)(x)))

int main(int argc, char *argv[])
{
    u32 *array;
    u32 group_id;
    spustr_t *spu;
    sysSpuImage image;
    u32 cause, status, i;
    sysSpuThreadArgument arg[6];
    sysSpuThreadGroupAttribute grpattr = {
        7 + 1, ptr2ea("mygroup"), 0, 0
    };
    sysSpuThreadAttribute attr = {
        ptr2ea("mythread"), 8 + 1, SPU_THREAD_ATTR_NONE
    };

    printf("spuchain starting....\n");

    // Initialize SPU subsystem
    sysSpuInitialize(6, 0);

    // Load SPU program
    sysSpuImageImport(&image, spu_bin, 0);

    // Create thread group
    sysSpuThreadGroupCreate(&group_id, 6, 100, &grpattr);

    // Allocate shared data
    spu = (spustr_t *)memalign(128, 6 * sizeof(spustr_t));
    array = (u32 *)memalign(128, 4 * sizeof(u32));

    // Initialize and create 6 SPU threads
    for (i = 0; i < 6; i++) {
        spu[i].rank = i;
        spu[i].count = 6;
        spu[i].sync = 0;
        spu[i].array_ea = ptr2ea(array);
        arg[i].arg0 = ptr2ea(&spu[i]);

        printf("Creating SPU thread... ");
        sysSpuThreadInitialize(&spu[i].id, group_id, i,
                               &image, &attr, &arg[i]);
        printf("%08x\n", spu[i].id);

        // Configure signal notification mode
        sysSpuThreadSetConfiguration(spu[i].id,
                                     (SPU_SIGNAL1_OVERWRITE | SPU_SIGNAL2_OVERWRITE));
    }

    // Start all SPU threads
    printf("Starting SPU thread group....\n");
    sysSpuThreadGroupStart(group_id);

    // Initialize array
    printf("Initial array:");
    for (i = 0; i < 4; i++) {
        array[i] = (i + 1);
        printf(" %d", array[i]);
    }
    printf("\n");

    // Trigger the chain by signaling SPU 0
    printf("sending signal....\n");
    sysSpuThreadWriteSignal(spu[0].id, 0, 1);

    // Wait for SPU 5 to complete
    while (spu[5].sync == 0);

    // Display results
    printf("Output array:");
    for (i = 0; i < 4; i++)
        printf(" %d", array[i]);
    printf("\n");

    // Cleanup
    printf("Joining SPU thread group....\n");
    sysSpuThreadGroupJoin(group_id, &cause, &status);
    sysSpuImageClose(&image);
    free(array);
    free(spu);

    return 0;
}
Expected output: {1, 2, 3, 4} → {64, 128, 192, 256}
DMA Transfers
SPUs cannot directly access main memory - all data must be transferred via DMA.
DMA Patterns
// SPU code - get data from main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_source = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;
mfc_get(local_buffer, ea_source, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();
// Now local_buffer contains the data

// SPU code - write data to main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_dest = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;
// Fill local_buffer with data
mfc_put(local_buffer, ea_dest, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

// SPU code - double buffering: fetch block i+1 while processing block i
u32 buffer[2][256] __attribute__((aligned(128)));
// Start the first transfer into buffer 0; the buffer index doubles as the DMA tag
mfc_get(buffer[0], ea, size, 0, 0, 0);
for (int i = 0; i < num_blocks; i++) {
    u32 current = i & 1;
    u32 next = current ^ 1;
    // Start the next DMA, if there is one, into the other buffer
    if (i + 1 < num_blocks)
        mfc_get(buffer[next], ea + (u64)(i + 1) * size, size, next, 0, 0);
    // Wait for the current DMA
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();
    // Process buffer[current]
    process_data(buffer[current]);
}
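
The buffer rotation used in double buffering is easy to get wrong, so it can help to test the schedule on a host first. The following is a minimal sketch under stated assumptions: `fake_get` is a hypothetical stand-in for `mfc_get` that copies immediately with `memcpy`, and `sum_double_buffered` and `BLOCK_WORDS` are illustrative names. It shows only the order of operations, not real DMA timing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 4 /* words per block; tiny so the model is easy to follow */

/* Hypothetical stand-in for mfc_get: on the host the copy completes at once. */
static void fake_get(uint32_t *ls, const uint32_t *src, size_t words)
{
    memcpy(ls, src, words * sizeof(uint32_t));
}

/* Sum num_blocks blocks using two buffers: the fetch of block i+1 is
 * issued before block i is processed, as in a double-buffered SPU loop. */
static uint64_t sum_double_buffered(const uint32_t *src, size_t num_blocks)
{
    uint32_t buffer[2][BLOCK_WORDS];
    uint64_t total = 0;

    if (num_blocks == 0)
        return 0;

    /* Prime the pipeline with block 0. */
    fake_get(buffer[0], src, BLOCK_WORDS);

    for (size_t i = 0; i < num_blocks; i++) {
        size_t current = i & 1;
        size_t next = current ^ 1;

        /* Prefetch the following block into the idle buffer, if any. */
        if (i + 1 < num_blocks)
            fake_get(buffer[next], src + (i + 1) * BLOCK_WORDS, BLOCK_WORDS);

        /* On an SPU, the wait for tag `current` would go here. */
        for (size_t w = 0; w < BLOCK_WORDS; w++)
            total += buffer[current][w];
    }
    return total;
}
```

The key invariant is that the buffer being refilled is always the one processed in the previous iteration, so no pending data is overwritten.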
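
// SPU code - transfer non-contiguous data with a DMA list
// mfc_list_element_t is provided by <spu_mfcio.h>: each 8-byte element holds
// a transfer size and the low 32 bits of an effective address (eal)
mfc_list_element_t list[8] __attribute__((aligned(8))) = { { 0 } };
// Set up list elements
list[0].size = 256;
list[0].eal = (u32)ea_addr1;
list[1].size = 512;
list[1].eal = (u32)ea_addr2;
// ...
// Only the upper 32 bits of the ea argument are used; they are shared by all
// elements. The list size is given in bytes, not in elements.
mfc_getl(local_buffer, ea_addr1, list,
         num_elements * sizeof(mfc_list_element_t), tag, 0, 0);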
DMA Requirements
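Alignment : Transfers of 16 bytes or more must be 16-byte aligned in both address and size (the size must be a multiple of 16). Smaller transfers of 1, 2, 4, or 8 bytes must be naturally aligned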
Size : Maximum 16KB per transfer
Local store : Buffers must be in SPU local store (not main memory)
Tags : Use DMA tags (0-31) to track multiple transfers
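These rules can be captured in a small checker that runs before a transfer is issued. This is a hedged sketch, not part of PSL1GHT: `dma_request_ok` and `dma_chunks_needed` are hypothetical helpers. The handling of 1, 2, 4, and 8 byte transfers (natural alignment, matching low four address bits) follows the general Cell architecture rules rather than anything stated in the list above, and rejecting a size of 0 is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u /* 16KB limit per transfer */
#define DMA_MAX_TAG 31u      /* tags run from 0 to 31 */

/* Returns true if a single DMA request satisfies the size, alignment,
 * and tag rules. ls_addr is the local store address, ea the main
 * memory effective address. */
static bool dma_request_ok(uint32_t ls_addr, uint64_t ea,
                           uint32_t size, uint32_t tag)
{
    if (tag > DMA_MAX_TAG)
        return false;
    if (size == 0 || size > DMA_MAX_BYTES)
        return false;

    /* Small transfers: naturally aligned, and both addresses must sit at
     * the same offset within a 16-byte quadword (assumption noted above). */
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return (ls_addr % size == 0) && (ea % size == 0) &&
               ((ls_addr & 15u) == (ea & 15u));

    /* Everything else: size and both addresses are multiples of 16. */
    if (size % 16 != 0)
        return false;
    return (ls_addr % 16 == 0) && (ea % 16 == 0);
}

/* Number of transfers needed to move total_bytes in 16KB chunks. */
static uint32_t dma_chunks_needed(uint64_t total_bytes)
{
    return (uint32_t)((total_bytes + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES);
}
```

A check like this is cheap to run in debug builds and turns a silent failure into an explicit one, for example by asserting `dma_request_ok(...)` just before each `mfc_get` or `mfc_put`.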
SPU Programming Patterns
SPU Thread Groups
// PPU code - creating thread group
u32 group_id;
sysSpuThreadGroupAttribute attr = {
    name_len, name_ea, priority, type
};
sysSpuThreadGroupCreate(&group_id, num_threads, priority, &attr);

// Add threads to group
for (int i = 0; i < num_threads; i++) {
    sysSpuThreadInitialize(&thread_id[i], group_id, i,
                           &image, &thread_attr, &arg[i]);
}

// Start all threads at once
sysSpuThreadGroupStart(group_id);

// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group_id, &cause, &status);
Mailboxes for Communication
// PPU sending a 32-bit value to an SPU thread
// Note: sysSpuThreadWriteSignal writes a signal notification register,
// not the inbound mailbox
u32 data = 0x12345678;
sysSpuThreadWriteSignal(thread_id, 0, data);

// PPU reading from a raw SPU's outbound mailbox
while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));
u32 value = sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox);
SPU Signal Notifications
// PPU sends signal to SPU
sysSpuThreadWriteSignal(thread_id, signal_number, value);

// SPU receives signal
// Configure signal mode first (in PPU before starting)
sysSpuThreadSetConfiguration(thread_id, SPU_SIGNAL1_OVERWRITE);

// In SPU code:
u32 signal = spu_readch(SPU_RdSigNotify1);
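
The signal mode matters when several writes arrive before the SPU reads. The sketch below is a host-side model, not PSL1GHT code: `sig_reg_t`, `sig_write`, and `sig_read` are hypothetical names. It assumes the two behaviours of a Cell signal notification register, overwrite (the last write wins) and logical OR (bits from all writes accumulate), and that a read returns the pending value and clears it. A real SPU read would block while no signal is pending, which this model does not reproduce.

```c
#include <stdint.h>

/* Hypothetical model of one signal notification register. */
typedef enum {
    SIG_MODE_OVERWRITE, /* each write replaces the pending value */
    SIG_MODE_OR         /* each write is OR-ed into the pending value */
} sig_mode_t;

typedef struct {
    sig_mode_t mode;
    uint32_t pending;
} sig_reg_t;

/* Sender side (PPU or another SPU). */
static void sig_write(sig_reg_t *r, uint32_t value)
{
    if (r->mode == SIG_MODE_OR)
        r->pending |= value;
    else
        r->pending = value;
}

/* Receiver side: returns the pending value and clears the register. */
static uint32_t sig_read(sig_reg_t *r)
{
    uint32_t v = r->pending;
    r->pending = 0;
    return v;
}
```

With two writes of `1` and `4` before a read, the overwrite model yields `4` and the OR model yields `5`. OR mode suits the case where each sender owns one bit; overwrite mode suits a single sender passing a value.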
Building SPU Samples
Build Process
SPU samples require three build steps:
Compile SPU code with SPU compiler (spu-gcc)
Embed SPU binary in PPU code
Compile PPU code with PPU compiler (ppu-gcc)
The Makefiles handle this automatically.
Build Commands
# Build all SPU samples
cd samples/spu
make
# Build individual sample
cd samples/spu/sputest
make
# Clean
make clean
SPU Makefile Structure
A typical SPU sample has the following layout:
sputest/
├── Makefile # Main makefile
├── source/ # PPU source code
│ └── main.c
├── spu/ # SPU source code
│ ├── Makefile
│ └── source/
│ └── main.c
└── data/ # Optional data files
DMA Latency : DMA transfers have latency (~200 cycles). Use double buffering to hide it.
Alignment : Keep data 128-byte aligned for best performance.
SIMD : Use vector intrinsics for 4x parallelism on 32-bit values within each SPU.
Local Store : Keep the working set small (256KB total, including code).
Branch Prediction : SPUs have no dynamic branch predictor and rely on compiler-inserted branch hints, so avoid unpredictable branches in hot loops.
Mailbox Limits : Mailboxes are slow. Use them for control, not bulk data.
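Several of these limits interact when sizing DMA buffers. As a rough sketch, the helper below estimates the largest per-buffer size for a double-buffered stream given how much local store is already taken. `max_block_bytes` is a hypothetical helper, and the inputs (code size and stack reservation) are values you would have to measure for your own SPU program.

```c
#include <stdint.h>

#define LS_BYTES (256u * 1024u)     /* total local store per SPU */
#define DMA_MAX_BYTES (16u * 1024u) /* largest single DMA transfer */
#define ALIGN_BYTES 128u            /* preferred buffer alignment */

/* Largest size of each of two stream buffers that fits in the local
 * store left over after code/data and stack, rounded down to a
 * multiple of 128 bytes and capped at one 16KB transfer. */
static uint32_t max_block_bytes(uint32_t code_bytes, uint32_t stack_bytes)
{
    uint32_t used = code_bytes + stack_bytes;
    uint32_t per_buffer;

    if (used >= LS_BYTES)
        return 0; /* nothing left for buffers */

    per_buffer = (LS_BYTES - used) / 2;   /* two buffers share the space */
    per_buffer -= per_buffer % ALIGN_BYTES;
    if (per_buffer > DMA_MAX_BYTES)
        per_buffer = DMA_MAX_BYTES;
    return per_buffer;
}
```

For example, with 64KB of code and a 16KB stack there is ample room, so the 16KB transfer limit is the binding constraint; with 240KB of code and an 8KB stack each buffer shrinks to 4KB.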
Common SPU Patterns
Data Parallel Processing
// PPU divides work among SPUs
for (int i = 0; i < num_spus; i++) {
    args[i].start_index = i * (total_items / num_spus);
    args[i].count = total_items / num_spus;
    sysSpuThreadInitialize(&thread[i], group, i, &image, &attr, &args[i]);
}

// Each SPU processes its chunk
// SPU code:
process_range(arg->start_index, arg->count);
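
Note that the integer division above drops any remainder when `total_items` is not a multiple of `num_spus`. One way to cover every item is to give the first few SPUs one extra element each. The sketch below assumes a hypothetical `work_range_t` type and `split_work` helper; neither is part of the samples.

```c
#include <stdint.h>

/* Hypothetical description of one SPU's share of the work. */
typedef struct {
    uint32_t start_index;
    uint32_t count;
} work_range_t;

/* Split total_items across num_spus so that every item is covered.
 * The first (total_items % num_spus) ranks get one extra item.
 * num_spus must be non-zero and rank must be less than num_spus. */
static work_range_t split_work(uint32_t total_items, uint32_t num_spus,
                               uint32_t rank)
{
    uint32_t base = total_items / num_spus;
    uint32_t extra = total_items % num_spus;
    work_range_t r;

    r.count = base + (rank < extra ? 1u : 0u);
    /* Ranks before this one contributed `base` items each, plus one
     * extra item for each of them that is below `extra`. */
    r.start_index = rank * base + (rank < extra ? rank : extra);
    return r;
}
```

With 20 items and 6 SPUs, ranks 0 and 1 receive 4 items each and ranks 2 to 5 receive 3, so the ranges cover indices 0 through 19 with no gaps or overlap.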
Pipeline Processing
// SPU 0: Read and preprocess
// SPU 1: Main processing
// SPU 2: Post-process and write
// Data flows through stages
Reduction
// Each SPU computes partial result
// PPU combines results
float total = 0;
for (int i = 0; i < num_spus; i++) {
    total += spu_results[i];
}
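
Both halves of the reduction can be modelled on a host to check the logic before moving the per-chunk work to SPU code. In this sketch `partial_sum` stands in for what each SPU would compute over its chunk and `combine` for the PPU loop above; both names are hypothetical.

```c
#include <stddef.h>

/* What each SPU would compute: the sum of its own chunk. */
static float partial_sum(const float *data, size_t start, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
        sum += data[start + i];
    return sum;
}

/* What the PPU does afterwards: add up the per-SPU partial results. */
static float combine(const float *partials, size_t num_partials)
{
    float total = 0.0f;
    for (size_t i = 0; i < num_partials; i++)
        total += partials[i];
    return total;
}
```

Because floating-point addition is not associative, the combined total can differ slightly from a single sequential sum when the inputs are not exactly representable; that is expected and not a sign of a DMA or synchronization bug.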
Debugging SPU Code
Use printf carefully
SPU printf is slow - use sparingly for debugging
Check alignment
Misaligned DMA transfers will fail silently or crash
Verify addresses
Ensure effective addresses are valid main memory addresses
Monitor DMA tags
Check that DMA transfers complete before accessing data
Use mailboxes
Send status/debug info back to PPU via mailboxes
SPU API Reference : Complete SPU API documentation
SPU Programming Guide : In-depth SPU programming concepts
DMA Guide : DMA transfer patterns and optimization
SIMD Intrinsics : SPU vector intrinsics reference