SPU Programming

Overview

The PlayStation 3’s Cell Broadband Engine makes six Synergistic Processing Elements (SPEs) available to user programs. This guide follows common usage and calls them SPUs. Each one consists of:

SPU Core: 128-bit SIMD processor
256 KB Local Store (LS): Fast local memory
Memory Flow Controller (MFC): DMA engine for data transfers

SPUs are optimized for vectorized, data-parallel workloads. Think of them as compute accelerators, loosely comparable to the compute units of a modern GPU.

SPU Architecture

Key Characteristics

Local Store

256 KB of fast SRAM - code and data must fit here

SIMD Engine

128-bit wide, operates on 16x8-bit, 8x16-bit, 4x32-bit, or 2x64-bit vectors

No Cache

No hardware cache; main memory is reached only through explicit DMA

In-Order Execution

Dual-issue, in-order pipeline

Memory Model

SPUs have a unique memory architecture:

┌─────────────────────────────────────┐
│          Main Memory (PPU)          │
└──────────────┬──────────────────────┘
               │ DMA Transfers
               │ (via MFC)
               ▼
┌─────────────────────────────────────┐
│      SPU Local Store (256 KB)       │
│  ┌──────────┐  ┌─────────────────┐  │
│  │   Code   │  │   Data/Stack    │  │
│  └──────────┘  └─────────────────┘  │
└─────────────────────────────────────┘

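SPU load and store instructions address only the local store, so SPUs cannot access main memory directly. All data must be transferred between main memory and the local store via DMA using the MFC.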

SPU Thread Management

Creating and Running SPU Threads

From the PPU side, the simplest way to run an SPU program is on a raw SPU, as in this sample:

samples/spu/sputest/source/main.c

int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;

    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);

    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);

    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);

    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);

    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);

    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));

    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));

    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);

    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);

    return 0;
}

SPU Thread Groups

For better management, use SPU thread groups:

sys_spu_group_t group;
sysSpuThreadGroupAttribute attr;
sysSpuThreadAttribute thread_attr;
sysSpuThreadArgument args;

// Initialize attributes
sysSpuThreadGroupAttributeInitialize(attr);
sysSpuThreadGroupAttributeName(attr, "MyGroup");

// Create a group with 2 thread slots at priority 100 (this excerpt initializes only slot 0)
sysSpuThreadGroupCreate(&group, 2, 100, &attr);

// Initialize threads
sys_spu_thread_t thread;
sysSpuThreadAttributeInitialize(thread_attr);
sysSpuThreadArgumentInitialize(args);

args.arg0 = (u64)data_ea;  // Pass effective address to SPU

sysSpuThreadInitialize(&thread, group, 0, &image, &thread_attr, &args);

// Start the group
sysSpuThreadGroupStart(group);

// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group, &cause, &status);

// Cleanup
sysSpuThreadGroupDestroy(group);

Thread Group Types

SPU_THREAD_GROUP_TYPE_NORMAL: Standard thread group
SPU_THREAD_GROUP_TYPE_SEQUENTIAL: Threads run sequentially
SPU_THREAD_GROUP_TYPE_SYSTEM: System-level priority
SPU_THREAD_GROUP_TYPE_MEMORY_FROM_CONTAINER: Use memory container

SPU Programming Basics

Simple SPU Program

Here’s the complete SPU-side code from the test sample:

samples/spu/sputest/spu/source/main.c

#include <spu_intrinsics.h>

int main()
{
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}

This minimal program:

Writes a value to the outbound mailbox
Returns, leaving the value for the PPU to read as a completion signal

SPU Channels

SPUs communicate via channels:

// Write to outbound mailbox
spu_writech(SPU_WrOutMbox, value);

// Read from inbound mailbox
u32 data = spu_readch(SPU_RdInMbox);

// Read signal notification register 1
u32 signal = spu_readch(SPU_RdSigNotify1);

Mailboxes

32-bit message passing between PPU and SPU

Signal Notifications

Fast 32-bit signaling mechanism

DMA Programming

Basic DMA Operations

The MFC provides DMA commands for transferring data:

samples/spu/spudma/spu/source/main.c

#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <sys/spu_thread.h>

#define TAG 1

/* Wait for DMA transfer to be finished */
static void wait_for_completion(void) {
    mfc_write_tag_mask(1<<TAG);
    spu_mfcstat(MFC_TAG_UPDATE_ALL);
}

int main(uint64_t ea, uint64_t outptr, uint64_t arg3, uint64_t arg4)
{
    /* Memory-aligned buffer (vectors always are properly aligned) */
    volatile vec_uchar16 v;

    /* Fetch the 16 bytes using DMA */
    mfc_get(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();

    /* Compare all characters with the small 'a' character code */
    vec_uchar16 cmp = spu_cmpgt(v, spu_splats((unsigned char)('a'-1)));

    /* For all small characters, we remove 0x20 to get the corresponding capital */
    vec_uchar16 sub = spu_splats((unsigned char)0x20) & cmp;

    /* Convert all small characters to capitals */
    v = v - sub;

    /* Send the updated vector to PPE */
    mfc_put(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();

    /* Send a message to inform the PPE program that the work is done */
    uint32_t ok __attribute__((aligned(16))) = 1;
    mfc_put(&ok, outptr, 4, TAG, 0, 0);
    wait_for_completion();

    /* Properly exit the thread */
    spu_thread_exit(0);
    return 0;
}
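The compare, mask, and subtract steps in the sample can be modeled on a host machine with plain scalar code. The sketch below is only an illustration: `to_upper_16` is a hypothetical name, and the loop stands in for the 16 byte lanes that the SPU processes in a single instruction.

```c
#include <stdint.h>

/* Scalar model of the sample's 16-lane compare, mask, and subtract.
 * Hypothetical helper for illustration; not part of the sample. */
static void to_upper_16(uint8_t v[16])
{
    for (int i = 0; i < 16; i++) {
        /* spu_cmpgt yields 0xFF in lanes where v > 'a' - 1, else 0x00. */
        uint8_t cmp = (v[i] > (uint8_t)('a' - 1)) ? 0xFF : 0x00;

        /* Mask 0x20 so that only the matching lanes are changed. */
        uint8_t sub = (uint8_t)(0x20 & cmp);

        v[i] = (uint8_t)(v[i] - sub);
    }
}
```

Because the comparison has no upper bound, bytes above `'z'` are shifted as well (for example `'{'` becomes `'['`). That is harmless for plain lowercase text, but worth knowing before reusing the pattern.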

DMA Helper Library

PSL1GHT provides a comprehensive DMA wrapper library:

spu/include/dma/spu_dma.h

#include <dma/spu_dma.h>

// Standard DMA (16-byte aligned, multiple of 16 bytes)
spu_dma_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_put(ls_addr, ea, size, tag, 0, 0);

// Small DMA (1, 2, 4, or 8 bytes)
spu_dma_small_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_small_put(ls_addr, ea, size, tag, 0, 0);

// Large DMA (any size, automatically splits if > 16KB)
spu_dma_large_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_large_put(ls_addr, ea, size, tag, 0, 0);

// List DMA (scatter-gather)
spu_dma_list_element list[NUM_ELEMENTS];
spu_dma_list_get(ls_addr, ea, list, list_size, tag, 0, 0);
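The library's internals are not shown here, but the idea behind splitting a large transfer can be sketched in portable C. The names `dma_chunk` and `split_transfer` are hypothetical and exist only for this illustration; the 16 KB limit comes from the DMA requirements described in this guide.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16 KB limit per DMA command */

/* One planned DMA command (hypothetical type, for illustration only). */
typedef struct {
    uint32_t ls;    /* local store address */
    uint64_t ea;    /* effective address in main memory */
    uint32_t size;  /* bytes in this chunk */
} dma_chunk;

/* Plans a transfer of `size` bytes as chunks of at most 16 KB.
 * Writes up to max_chunks entries to out and returns how many were written. */
static size_t split_transfer(uint32_t ls, uint64_t ea, uint32_t size,
                             dma_chunk *out, size_t max_chunks)
{
    size_t n = 0;

    while (size > 0 && n < max_chunks) {
        uint32_t len = size < DMA_MAX_SIZE ? size : DMA_MAX_SIZE;

        out[n].ls = ls;
        out[n].ea = ea;
        out[n].size = len;

        ls += len;
        ea += len;
        size -= len;
        n++;
    }
    return n;
}
```

A 40,000-byte request, for instance, is planned as two 16,384-byte chunks followed by one 7,232-byte chunk.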

DMA Requirements

Alignment Requirements:

Standard DMA: 16-byte alignment for both LS and EA, size must be multiple of 16
Small DMA: LS and EA must have the same lower 4 bits and be naturally aligned to the transfer size; size must be a power of 2 (1, 2, 4, or 8 bytes)
Maximum transfer size: 16 KB per DMA command
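These rules are easy to check before issuing a command. The function below is a minimal sketch of such a check in portable C; `dma_args_valid` is a hypothetical helper, not part of PSL1GHT, and it only encodes the size and alignment rules, not whether the addresses are actually mapped.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16 KB limit per DMA command */

/* Returns true when (ls, ea, size) satisfies the DMA size and alignment rules.
 * Hypothetical helper for illustration; not part of PSL1GHT. */
static bool dma_args_valid(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size > DMA_MAX_SIZE)
        return false;

    /* Small DMA: 1, 2, 4, or 8 bytes. LS and EA must share their low 4 bits,
     * and the MFC also requires natural alignment for these sizes. */
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return (ls & 0xF) == (uint32_t)(ea & 0xF) && (ls % size) == 0;

    /* Standard DMA: both addresses and the size are multiples of 16. */
    return (ls & 0xF) == 0 && (ea & 0xF) == 0 && (size & 0xF) == 0;
}
```

Calling a check like this from a debug build can catch a misaligned buffer early, since a rejected command is otherwise hard to trace back to its cause.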

Tag Management

// Wait for specific tag
mfc_write_tag_mask(1 << TAG);
mfc_read_tag_status_all();

// Wait for multiple tags
mfc_write_tag_mask((1 << TAG1) | (1 << TAG2));
mfc_read_tag_status_all();

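// Check tag status without blocking
// (mfc_stat_tag_status() only reports whether a status word is pending,
// so read the status itself with the immediate variant)
mfc_write_tag_mask(1 << TAG);
uint32_t status = mfc_read_tag_status_immediate();
if (status & (1 << TAG)) {
    // Transfer complete
}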
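The mask arithmetic itself can be factored into small helpers. The sketch below is portable C with hypothetical names (`tag_bit`, `tags_all_done`, `tags_any_done`); it models how a tag status word is interpreted and does not call the MFC. Using an unsigned constant also avoids the signed overflow that `1 << 31` would cause for the highest of the 32 tag groups.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit for one tag group. The MFC has 32 tag groups, numbered 0..31. */
static uint32_t tag_bit(unsigned tag)
{
    return UINT32_C(1) << (tag & 31);
}

/* "All" semantics: every tag group selected by mask has completed. */
static bool tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}

/* "Any" semantics: at least one selected tag group has completed. */
static bool tags_any_done(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}
```

With a mask covering tags 1 and 2 and a status word in which only tag 1 is set, `tags_any_done` is true while `tags_all_done` is still false.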

PPU-SPU Communication

Memory-Mapped SPU Resources

The PPU can access SPU resources via memory-mapped addresses:

ppu/include/sys/spu.h

#define SPU_THREAD_BASE          0xF0000000ULL
#define SPU_THREAD_OFFSET        0x00100000ULL

// Get base address for SPU thread
#define SPU_THREAD_GET_BASE_OFFSET(spu) \
    (SPU_THREAD_BASE + (SPU_THREAD_OFFSET * (spu)))

// Access local storage
#define SPU_THREAD_GET_LOCAL_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_LOCAL_OFFSET + (reg))

// Access problem storage (registers)
#define SPU_THREAD_GET_PROBLEM_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_PROBLEM_OFFSET + (reg))
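To make the address arithmetic concrete, the base calculation can be reproduced in portable C. This sketch copies the two constants shown above and wraps the macro in a hypothetical function, `spu_thread_base`; the local store and problem storage offsets are left out because their values are not listed here.

```c
#include <stdint.h>

#define SPU_THREAD_BASE   0xF0000000ULL
#define SPU_THREAD_OFFSET 0x00100000ULL  /* 1 MiB window per SPU thread */

/* Same arithmetic as SPU_THREAD_GET_BASE_OFFSET(spu).
 * Hypothetical wrapper for illustration only. */
static uint64_t spu_thread_base(unsigned spu)
{
    return SPU_THREAD_BASE + SPU_THREAD_OFFSET * spu;
}
```

Index 0 maps to 0xF0000000, index 1 to 0xF0100000, and so on, each window starting 1 MiB after the previous one.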

Direct Memory Access

PPU can read/write SPU local storage:

// Write to SPU local storage
sysSpuThreadWriteLocalStorage(thread, address, value, type);

// Read from SPU local storage
u64 value;
sysSpuThreadReadLocalStorage(thread, address, &value, type);

Signal Notifications

Fast signaling from PPU to SPU:

// From PPU: Write to SPU signal register
sysSpuThreadWriteSignal(thread, 0, signal_value);  // Signal register 1
sysSpuThreadWriteSignal(thread, 1, signal_value);  // Signal register 2

// From SPU: Read signal register
u32 signal = spu_readch(SPU_RdSigNotify1);

Signal Notification Modes

Overwrite mode: New value replaces old value
OR mode: New value is OR’ed with existing value

Configure with:

sysSpuThreadSetConfiguration(thread, SPU_SIGNAL1_OR | SPU_SIGNAL2_OVERWRITE);
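The difference between the two modes shows up when several writes arrive before the SPU reads the register. The following is a small host-side model, not PSL1GHT code: `signal_reg`, `signal_write`, and `signal_read` are hypothetical names, and the model assumes that a read returns the accumulated value and clears the register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of one signal notification register (illustration only). */
typedef struct {
    uint32_t value;
    bool     or_mode;  /* true: OR mode, false: overwrite mode */
} signal_reg;

/* A sender writes a value into the register. */
static void signal_write(signal_reg *r, uint32_t v)
{
    if (r->or_mode)
        r->value |= v;   /* bits from earlier writes are kept */
    else
        r->value = v;    /* earlier writes are lost */
}

/* The receiver reads the register; this model clears it on read. */
static uint32_t signal_read(signal_reg *r)
{
    uint32_t v = r->value;
    r->value = 0;
    return v;
}
```

In OR mode, two senders that write 0x1 and 0x4 leave 0x5 for the reader, so each sender can own one bit. In overwrite mode the reader would only see 0x4.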

Mailbox Communication

// PPU writes to SPU inbound mailbox
sysSpuThreadWriteMb(thread, value);

// SPU reads from inbound mailbox
u32 msg = spu_readch(SPU_RdInMbox);

// SPU writes to outbound mailbox
spu_writech(SPU_WrOutMbox, value);

// PPU reads a raw SPU's outbound mailbox (via problem storage)
u32 reply = sysSpuRawReadProblemStorage(spu, SPU_Out_MBox);

SPU-to-SPU Communication

SPUs can communicate directly:

// SPU-to-SPU signal notification
u64 target_spu_signal_ea = SPU_THREAD_BASE + 
                           (target_spu * SPU_THREAD_OFFSET) + 
                           SPU_THREAD_Sig_Notify_1;

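// The signal register offset ends in 0xC, and a 4-byte DMA requires the LS and
// EA addresses to share their low 4 bits, so send the last word of an aligned quadword
u32 signal_value[4] __attribute__((aligned(16))) = { 0, 0, 0, 0x42 };
mfc_put(&signal_value[3], target_spu_signal_ea, 4, TAG, 0, 0);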

SPU-to-SPU local store DMA is also possible using the memory-mapped addresses.

SPU Thread API

SPU programs can control their execution:

spu/include/sys/spu_thread.h

// Exit current SPU thread
void spu_thread_exit(int status);

// Exit entire SPU thread group
void spu_thread_group_exit(int status);

// Yield to scheduler
void spu_thread_group_yield(void);

Building SPU Programs

SPU programs use separate build rules:

include $(PSL1GHT)/spu_rules

CFLAGS = -O2 -Wall $(MACHDEP)
LDFLAGS = $(MACHDEP) -Wl,-Map,$(notdir $@).map

TARGET = spu

# Build SPU ELF
$(TARGET).elf: $(OFILES)

# Convert to binary for embedding
$(TARGET).bin: $(TARGET).elf
	$(OBJCOPY) -O binary $< $@

SPU Compiler Flags

From spu_rules:

MACHDEP = -mdual-nops -fmodulo-sched -ffunction-sections -fdata-sections

-mdual-nops: Generate dual-issue NOPs for better pipeline usage
-fmodulo-sched: Enable software pipelining

Best Practices

Minimize DMA Overhead

Transfer larger blocks instead of many small transfers
Use double-buffering: process one buffer while DMA transfers another
Overlap computation with DMA using multiple tags

Vectorize Your Code

SPUs are designed for SIMD. Use vector types:

vec_float4 a, b, c;
c = spu_add(a, b);  // 4 additions in parallel
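For readers new to SIMD, the scalar equivalent of that single instruction looks like this. The sketch is portable C for a host machine; `float4` and `float4_add` are hypothetical names that stand in for `vec_float4` and `spu_add`.

```c
/* Four 32-bit float lanes, standing in for vec_float4 (illustration only). */
typedef struct {
    float lane[4];
} float4;

/* Lane-wise addition: what spu_add(a, b) computes in one instruction. */
static float4 float4_add(float4 a, float4 b)
{
    float4 c;

    for (int i = 0; i < 4; i++)
        c.lane[i] = a.lane[i] + b.lane[i];

    return c;
}
```

On the SPU the four additions happen at once, which is why restructuring data into vectors usually pays off more than tuning scalar loops.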

Mind the Local Store

Keep total code + data + stack under 256 KB
Use -ffunction-sections and -Wl,--gc-sections to remove unused code
Consider streaming data for large datasets

Use Thread Groups

Thread groups provide better management and synchronization than raw SPUs.

Proper Thread Termination

Always call spu_thread_exit() to properly terminate SPU threads.

Performance Tips

Double Buffering Pattern

#define BUFFER_SIZE 1024

u8 buffer[2][BUFFER_SIZE] __attribute__((aligned(128)));
int current = 0;

// Start first transfer; each buffer uses its own tag (0 or 1)
mfc_get(buffer[current], ea, BUFFER_SIZE, current, 0, 0);

for (int i = 0; i < num_iterations; i++) {
    int next = 1 - current;

    // Start next DMA on the other buffer's tag
    if (i < num_iterations - 1) {
        mfc_get(buffer[next], ea + (i+1)*BUFFER_SIZE, BUFFER_SIZE, next, 0, 0);
    }

    // Wait only for the current buffer's tag, so the next transfer keeps running
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();

    // Process current buffer (process() is application-defined)
    process(buffer[current]);

    current = next;
}
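The buffer rotation can be checked on a host machine before any real DMA is involved. The sketch below is a simulation under stated assumptions: `fake_get` is a hypothetical stand-in for `mfc_get` that copies synchronously with `memcpy`, and `sum_stream` is a hypothetical consumer that adds up every byte. It exercises only the index logic, not the overlap of transfer and computation.

```c
#include <stdint.h>
#include <string.h>

#define BUFFER_SIZE 1024

/* Stand-in for mfc_get: copies immediately instead of queueing a DMA. */
static void fake_get(uint8_t *ls, const uint8_t *ea, size_t size)
{
    memcpy(ls, ea, size);
}

/* Streams num_buffers blocks of BUFFER_SIZE bytes through two buffers
 * and returns the sum of all bytes (illustration only). */
static uint32_t sum_stream(const uint8_t *src, int num_buffers)
{
    static uint8_t buffer[2][BUFFER_SIZE];
    uint32_t total = 0;
    int current = 0;

    if (num_buffers <= 0)
        return 0;

    /* Prime the pipeline with the first block. */
    fake_get(buffer[current], src, BUFFER_SIZE);

    for (int i = 0; i < num_buffers; i++) {
        int next = 1 - current;

        /* Fetch the following block into the idle buffer. */
        if (i < num_buffers - 1)
            fake_get(buffer[next], src + (size_t)(i + 1) * BUFFER_SIZE, BUFFER_SIZE);

        /* Process the block that is already resident. */
        for (int j = 0; j < BUFFER_SIZE; j++)
            total += buffer[current][j];

        current = next;
    }
    return total;
}
```

If the result matches a straightforward sum over the source array, every block was processed exactly once and in order.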

Multi-Tag DMA

#define TAG_READ  1
#define TAG_WRITE 2

// Issue read and write simultaneously
mfc_get(input_buffer, input_ea, size, TAG_READ, 0, 0);
mfc_put(output_buffer, output_ea, size, TAG_WRITE, 0, 0);

// Wait for both
mfc_write_tag_mask((1 << TAG_READ) | (1 << TAG_WRITE));
mfc_read_tag_status_all();

See Also

PPU Architecture

Learn about the PowerPC Processor Unit

Memory Management

Memory allocation and DMA best practices

Build System

Building SPU programs

SPU API Reference

Complete SPU API documentation
