Overview

The PlayStation 3’s Cell Broadband Engine makes six Synergistic Processing Units (SPUs) available to user programs. Each SPU consists of:
  • SPU Core: 128-bit SIMD processor
  • 256 KB Local Store (LS): Fast local memory
  • Memory Flow Controller (MFC): DMA engine for data transfers
SPUs are optimized for vectorized, data-parallel workloads. Think of them as compute accelerators, similar in spirit to modern GPU compute units, but with a small local memory that the program manages explicitly.

SPU Architecture

Key Characteristics

Local Store

256 KB of fast SRAM - code and data must fit here

SIMD Engine

128-bit wide, operates on 16x8-bit, 8x16-bit, 4x32-bit vectors

No Cache

All main-memory access is explicit via DMA

In-Order Execution

Dual-issue, in-order pipeline

Memory Model

SPUs have a unique memory architecture:
┌─────────────────────────────────────┐
│          Main Memory (PPU)          │
└──────────────┬──────────────────────┘
               │ DMA Transfers
               │ (via MFC)
               ▼
┌─────────────────────────────────────┐
│      SPU Local Store (256 KB)       │
│  ┌──────────┐  ┌─────────────────┐  │
│  │   Code   │  │   Data/Stack    │  │
│  └──────────┘  └─────────────────┘  │
└─────────────────────────────────────┘
SPUs cannot directly access main memory. All data must be transferred via DMA using the MFC.
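
The get, compute, put cycle this implies can be sketched on a host machine. The following is a plain C model, not SPU code: `memcpy` stands in for `mfc_get` / `mfc_put` plus the tag wait, and `increment_via_local_store` and `LS_BUFFER_SIZE` are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LS_BUFFER_SIZE 4096u  /* hypothetical slice of local store reserved for data */

/* Add 1 to every byte of a "main memory" array by staging it through a small
   local buffer, the way an SPU program has to. */
void increment_via_local_store(uint8_t *main_mem, size_t size)
{
    static uint8_t local[LS_BUFFER_SIZE];  /* models a buffer in local store */

    for (size_t off = 0; off < size; off += LS_BUFFER_SIZE) {
        size_t left = size - off;
        size_t n = left < LS_BUFFER_SIZE ? left : LS_BUFFER_SIZE;

        memcpy(local, main_mem + off, n);   /* stands in for a DMA get */
        for (size_t i = 0; i < n; i++)
            local[i] += 1;                  /* compute touches local data only */
        memcpy(main_mem + off, local, n);   /* stands in for a DMA put */
    }
}
```

The point of the model is the shape of the loop: the compute step never dereferences a main-memory pointer, and the working set is bounded by the local buffer rather than by the size of the input.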

SPU Thread Management

Creating and Running SPU Threads

From the PPU side, the simplest way to run an SPU program is on a raw SPU. The sample below loads an image, starts it, and polls the outbound mailbox for the result:
samples/spu/sputest/source/main.c
int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;

    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);

    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);

    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);

    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);

    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);

    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));

    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));

    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);

    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);

    return 0;
}

SPU Thread Groups

For better management, use SPU thread groups:
sys_spu_group_t group;
sysSpuThreadGroupAttribute attr;
sysSpuThreadAttribute thread_attr;
sysSpuThreadArgument args;

// Initialize attributes
sysSpuThreadGroupAttributeInitialize(attr);
sysSpuThreadGroupAttributeName(attr, "MyGroup");

// Create a group with 1 thread (only slot 0 is initialized below), priority 100
sysSpuThreadGroupCreate(&group, 1, 100, &attr);

// Initialize the thread in slot 0
sys_spu_thread_t thread;
sysSpuThreadAttributeInitialize(thread_attr);
sysSpuThreadArgumentInitialize(args);

args.arg0 = (u64)data_ea;  // Pass effective address to SPU

sysSpuThreadInitialize(&thread, group, 0, &image, &thread_attr, &args);

// Start the group
sysSpuThreadGroupStart(group);

// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group, &cause, &status);

// Cleanup
sysSpuThreadGroupDestroy(group);
Thread group types:
  • SPU_THREAD_GROUP_TYPE_NORMAL: Standard thread group
  • SPU_THREAD_GROUP_TYPE_SEQUENTIAL: Threads run sequentially
  • SPU_THREAD_GROUP_TYPE_SYSTEM: System-level priority
  • SPU_THREAD_GROUP_TYPE_MEMORY_FROM_CONTAINER: Use memory container

SPU Programming Basics

Simple SPU Program

Here’s the complete SPU-side code from the test sample:
samples/spu/sputest/spu/source/main.c
#include <spu_intrinsics.h>

int main()
{
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}
This minimal program:
  1. Writes a value to the outbound mailbox
  2. Returns, leaving that value for the PPU to read as a completion signal

SPU Channels

SPU code reaches its mailboxes and signal notification registers through channels:
// Write to outbound mailbox
spu_writech(SPU_WrOutMbox, value);

// Read from inbound mailbox
u32 data = spu_readch(SPU_RdInMbox);

// Read signal notification register 1
u32 signal = spu_readch(SPU_RdSigNotify1);

Mailboxes

32-bit message passing between PPU and SPU

Signal Notifications

Fast 32-bit signaling mechanism

DMA Programming

Basic DMA Operations

The MFC provides DMA commands for transferring data:
samples/spu/spudma/spu/source/main.c
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <sys/spu_thread.h>

#define TAG 1

/* Wait for DMA transfer to be finished */
static void wait_for_completion(void) {
    mfc_write_tag_mask(1<<TAG);
    spu_mfcstat(MFC_TAG_UPDATE_ALL);
}

int main(uint64_t ea, uint64_t outptr, uint64_t arg3, uint64_t arg4)
{
    /* Memory-aligned buffer (vectors always are properly aligned) */
    volatile vec_uchar16 v;

    /* Fetch the 16 bytes using DMA */
    mfc_get(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();

    /* Compare all characters with the small 'a' character code */
    vec_uchar16 cmp = spu_cmpgt(v, spu_splats((unsigned char)('a'-1)));

    /* For all small characters, we remove 0x20 to get the corresponding capital */
    vec_uchar16 sub = spu_splats((unsigned char)0x20) & cmp;

    /* Convert all small characters to capitals */
    v = v - sub;

    /* Send the updated vector to PPE */
    mfc_put(&v, ea, 16, TAG, 0, 0);
    wait_for_completion();

    /* Send a message to inform the PPE program that the work is done */
    uint32_t ok __attribute__((aligned(16))) = 1;
    mfc_put(&ok, outptr, 4, TAG, 0, 0);
    wait_for_completion();

    /* Properly exit the thread */
    spu_thread_exit(0);
    return 0;
}
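
The compare-and-mask steps in this sample can be hard to follow without the hardware at hand. The sketch below is a plain C, host-side model of what happens in each of the 16 byte lanes; `to_upper_lanes` is a hypothetical name, and the loop stands in for what the SPU does across all lanes at once. Like the sample, it only tests for bytes greater than `'a' - 1`, so it assumes the input holds ASCII letters, digits, spaces, or common punctuation; bytes such as `'{'` or values above 0x7F would also be shifted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of the sample's per-lane logic. */
void to_upper_lanes(uint8_t *bytes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* Like spu_cmpgt: all ones in a lane where the test holds, else zero */
        uint8_t cmp = (bytes[i] > (uint8_t)('a' - 1)) ? 0xFF : 0x00;

        /* Keep 0x20 only in lanes that passed the comparison */
        uint8_t sub = (uint8_t)(0x20 & cmp);

        /* A lowercase ASCII letter minus 0x20 is the matching capital */
        bytes[i] = (uint8_t)(bytes[i] - sub);
    }
}
```

Calling `to_upper_lanes` on the 16 bytes of `"hello, Cell spu"` (15 characters plus the terminator) leaves the comma, spaces, and the capital `C` untouched, because their mask is zero. No branch is needed to tell the cases apart, which is why the pattern suits SIMD.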

DMA Helper Library

PSL1GHT provides a comprehensive DMA wrapper library:
spu/include/dma/spu_dma.h
#include <dma/spu_dma.h>

// Standard DMA (16-byte aligned, multiple of 16 bytes)
spu_dma_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_put(ls_addr, ea, size, tag, 0, 0);

// Small DMA (1, 2, 4, or 8 bytes)
spu_dma_small_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_small_put(ls_addr, ea, size, tag, 0, 0);

// Large DMA (any size, automatically splits if > 16KB)
spu_dma_large_get(ls_addr, ea, size, tag, 0, 0);
spu_dma_large_put(ls_addr, ea, size, tag, 0, 0);

// List DMA (scatter-gather)
spu_dma_list_element list[NUM_ELEMENTS];
spu_dma_list_get(ls_addr, ea, list, list_size, tag, 0, 0);
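
How a large transfer breaks down into commands of at most 16 KB can be sketched as follows. This is a host-side model in plain C under stated assumptions, not the PSL1GHT implementation: `dma_chunk` and `dma_split` are hypothetical names, and the function only computes the chunk list instead of issuing MFC commands.

```c
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16 KB limit per DMA command */

/* One planned DMA command (hypothetical type for this sketch). */
typedef struct {
    uint32_t ls;    /* local store address */
    uint64_t ea;    /* effective address in main memory */
    uint32_t size;  /* bytes in this command */
} dma_chunk;

/* Plan a transfer of `size` bytes as chunks of at most DMA_MAX_SIZE.
   Writes up to max_chunks entries and returns how many were written. */
unsigned dma_split(uint32_t ls, uint64_t ea, uint32_t size,
                   dma_chunk *out, unsigned max_chunks)
{
    unsigned count = 0;

    while (size > 0 && count < max_chunks) {
        uint32_t n = size < DMA_MAX_SIZE ? size : DMA_MAX_SIZE;

        out[count].ls = ls;
        out[count].ea = ea;
        out[count].size = n;
        count++;

        ls += n;
        ea += n;
        size -= n;
    }
    return count;
}
```

A 40 KB request, for example, comes out as two 16 KB chunks followed by one 8 KB chunk, with both addresses advancing by the chunk size each time.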

DMA Requirements

Alignment Requirements:
  • Standard DMA: 16-byte alignment for both LS and EA, size must be multiple of 16
  • Small DMA: size must be 1, 2, 4, or 8 bytes; LS and EA must share the same lower 4 bits and be naturally aligned to the transfer size
  • Maximum transfer size: 16 KB per DMA command
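
These rules are easy to check before issuing a command. The helper below is a minimal sketch in portable C; `dma_params_valid` is a hypothetical name, not a PSL1GHT function, and it rejects zero-length transfers purely to keep the example simple. The natural-alignment test for small transfers reflects the Cell hardware rule for sub-quadword DMA.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16 KB limit per DMA command */

/* Hypothetical helper: true when (ls, ea, size) satisfies the rules above. */
bool dma_params_valid(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 0 || size > DMA_MAX_SIZE)
        return false;

    /* Small DMA: same low 4 address bits, naturally aligned to the size */
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return (ls & 0xF) == (ea & 0xF) && (ls % size) == 0;

    /* Standard DMA: both addresses 16-byte aligned, size a multiple of 16 */
    return (ls & 0xF) == 0 && (ea & 0xF) == 0 && (size & 0xF) == 0;
}
```

A check like this is cheap enough to leave in debug builds, where a misaligned transfer would otherwise surface only as a hardware exception or a hang.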

Tag Management

// Wait for specific tag
mfc_write_tag_mask(1 << TAG);
mfc_read_tag_status_all();

// Wait for multiple tags
mfc_write_tag_mask((1 << TAG1) | (1 << TAG2));
mfc_read_tag_status_all();

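// Check tag status without blocking
// (mfc_stat_tag_status() only returns the channel count, not the tag bits)
mfc_write_tag_mask(1 << TAG);
uint32_t status = mfc_read_tag_status_immediate();
if (status & (1 << TAG)) {
    // Transfer complete
}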
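
The mask arithmetic used with these calls can be factored into small helpers. The sketch below is portable C that runs on a host; `tag_mask_of` and `tags_all_done` are hypothetical names. It relies on the MFC providing 32 tag groups (IDs 0 to 31) and shifts an unsigned value so that tag 31 is handled without signed overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define MFC_TAG_COUNT 32u  /* tag IDs run from 0 to 31 */

/* OR together one bit per tag ID; IDs outside 0..31 are ignored. */
uint32_t tag_mask_of(const unsigned *tags, size_t count)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < count; i++) {
        if (tags[i] < MFC_TAG_COUNT)
            mask |= (uint32_t)1 << tags[i];
    }
    return mask;
}

/* True when every tag selected by `mask` is reported complete in `status`. */
int tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}
```

With tags 1 and 2 the mask is `0x6`; a status word of `0x2` then means tag 1 has finished while tag 2 is still in flight.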

PPU-SPU Communication

Memory-Mapped SPU Resources

The PPU can access SPU resources via memory-mapped addresses:
ppu/include/sys/spu.h
#define SPU_THREAD_BASE          0xF0000000ULL
#define SPU_THREAD_OFFSET        0x00100000ULL

// Get base address for SPU thread
#define SPU_THREAD_GET_BASE_OFFSET(spu) \
    (SPU_THREAD_BASE + (SPU_THREAD_OFFSET * (spu)))

// Access local storage
#define SPU_THREAD_GET_LOCAL_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_LOCAL_OFFSET + (reg))

// Access problem storage (registers)
#define SPU_THREAD_GET_PROBLEM_STORAGE(spu, reg) \
    (SPU_THREAD_GET_BASE_OFFSET(spu) + SPU_PROBLEM_OFFSET + (reg))
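
As a worked example of the base-address arithmetic, the sketch below evaluates the same formula as `SPU_THREAD_GET_BASE_OFFSET` in portable C. The two constants are copied from the excerpt; `spu_thread_base` is a hypothetical function name. `SPU_LOCAL_OFFSET` and `SPU_PROBLEM_OFFSET` are not shown in the excerpt, so they are left out rather than guessed.

```c
#include <stdint.h>

/* Values copied from the ppu/include/sys/spu.h excerpt */
#define SPU_THREAD_BASE    0xF0000000ULL
#define SPU_THREAD_OFFSET  0x00100000ULL

/* Same arithmetic as SPU_THREAD_GET_BASE_OFFSET(spu). */
uint64_t spu_thread_base(unsigned spu)
{
    return SPU_THREAD_BASE + SPU_THREAD_OFFSET * spu;
}
```

Each increment of `spu` moves the window by 0x00100000 bytes (1 MB), so index 0 maps to 0xF0000000, index 1 to 0xF0100000, and index 5 to 0xF0500000.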

Direct Memory Access

PPU can read/write SPU local storage:
// Write to SPU local storage
sysSpuThreadWriteLocalStorage(thread, address, value, type);

// Read from SPU local storage
u64 value;
sysSpuThreadReadLocalStorage(thread, address, &value, type);

Signal Notifications

Fast signaling from PPU to SPU:
// From PPU: Write to SPU signal register
sysSpuThreadWriteSignal(thread, 0, signal_value);  // Signal register 1
sysSpuThreadWriteSignal(thread, 1, signal_value);  // Signal register 2

// From SPU: Read signal register
u32 signal = spu_readch(SPU_RdSigNotify1);
Each signal register works in one of two modes:
  • Overwrite mode: New value replaces old value
  • OR mode: New value is OR’ed with existing value
Configure with:
sysSpuThreadSetConfiguration(thread, SPU_SIGNAL1_OR | SPU_SIGNAL2_OVERWRITE);
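
The difference between the two modes shows up when several writes arrive before the SPU reads the register. The following host-side model in plain C illustrates it; the type and function names are hypothetical and nothing here calls the PSL1GHT API. It assumes the hardware behavior that reading a signal notification channel returns the pending value and resets the register to zero, and it does not model the SPU stalling while no signal is pending.

```c
#include <stdint.h>

/* Hypothetical model of one signal notification register. */
typedef enum { SIGNAL_OVERWRITE, SIGNAL_OR } signal_mode;

typedef struct {
    signal_mode mode;
    uint32_t value;
} signal_reg;

/* A write from the PPU or from another SPU. */
void signal_write(signal_reg *reg, uint32_t v)
{
    if (reg->mode == SIGNAL_OR)
        reg->value |= v;  /* bits accumulate until the SPU reads them */
    else
        reg->value = v;   /* the last writer wins */
}

/* The SPU-side read: return the pending bits and clear the register. */
uint32_t signal_read(signal_reg *reg)
{
    uint32_t v = reg->value;
    reg->value = 0;
    return v;
}
```

With writes of `0x1` and then `0x4`, a register in OR mode reads back `0x5`, while one in overwrite mode reads back `0x4`. OR mode therefore suits many senders that each own one bit; overwrite mode suits a single sender passing a whole value.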

Mailbox Communication

// PPU writes to SPU inbound mailbox
sysSpuThreadWriteMb(thread, value);

// SPU reads from inbound mailbox
u32 msg = spu_readch(SPU_RdInMbox);

// SPU writes to outbound mailbox
spu_writech(SPU_WrOutMbox, value);

// PPU reads (raw SPU only, via problem storage)
u32 msg = sysSpuRawReadProblemStorage(spu, SPU_Out_MBox);

SPU-to-SPU Communication

SPUs can communicate directly:
// SPU-to-SPU signal notification
u64 target_spu_signal_ea = SPU_THREAD_BASE + 
                           (target_spu * SPU_THREAD_OFFSET) + 
                           SPU_THREAD_Sig_Notify_1;

u32 signal_value __attribute__((aligned(16))) = 0x42;
mfc_put(&signal_value, target_spu_signal_ea, 4, TAG, 0, 0);
SPU-to-SPU local store DMA is also possible using the memory-mapped addresses.

SPU Thread API

SPU programs can control their execution:
spu/include/sys/spu_thread.h
// Exit current SPU thread
void spu_thread_exit(int status);

// Exit entire SPU thread group
void spu_thread_group_exit(int status);

// Yield to scheduler
void spu_thread_group_yield(void);

Building SPU Programs

SPU programs use separate build rules:
include $(PSL1GHT)/spu_rules

CFLAGS = -O2 -Wall $(MACHDEP)
LDFLAGS = $(MACHDEP) -Wl,-Map,$(notdir $@).map

TARGET = spu

# Build SPU ELF
$(TARGET).elf: $(OFILES)

# Convert to binary for embedding
$(TARGET).bin: $(TARGET).elf
	$(OBJCOPY) -O binary $< $@
From spu_rules:
MACHDEP = -mdual-nops -fmodulo-sched -ffunction-sections -fdata-sections
  • -mdual-nops: Generate dual-issue NOPs for better pipeline usage
  • -fmodulo-sched: Enable software pipelining

Best Practices

1. Minimize DMA Overhead

  • Transfer larger blocks instead of many small transfers
  • Use double-buffering: process one buffer while DMA transfers another
  • Overlap computation with DMA using multiple tags

2. Vectorize Your Code

SPUs are designed for SIMD. Use vector types:
vec_float4 a, b, c;
c = spu_add(a, b);  // 4 additions in parallel

3. Mind the Local Store

  • Keep total code + data + stack under 256 KB
  • Use -ffunction-sections and -Wl,--gc-sections to remove unused code
  • Consider streaming data for large datasets

4. Use Thread Groups

Thread groups provide better management and synchronization than raw SPUs.

5. Proper Thread Termination

Always call spu_thread_exit() to properly terminate SPU threads.

Performance Tips

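Double buffering, with one tag per buffer so that each wait covers only the buffer about to be processed:
#define BUFFER_SIZE 1024

u8 buffer[2][BUFFER_SIZE] __attribute__((aligned(128)));
int current = 0;

// Start first transfer; the buffer index doubles as its DMA tag
mfc_get(buffer[current], ea, BUFFER_SIZE, current, 0, 0);

for (int i = 0; i < num_iterations; i++) {
    int next = 1 - current;

    // Start the next DMA before waiting, so it overlaps with processing
    if (i < num_iterations - 1) {
        mfc_get(buffer[next], ea + (i+1)*BUFFER_SIZE, BUFFER_SIZE, next, 0, 0);
    }

    // Wait for the current buffer only (a shared tag would also wait for next)
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();

    // Process current buffer
    process(buffer[current]);

    current = next;
}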
Separate tags let an independent read and write proceed at the same time:
#define TAG_READ  1
#define TAG_WRITE 2

// Issue read and write simultaneously
mfc_get(input_buffer, input_ea, size, TAG_READ, 0, 0);
mfc_put(output_buffer, output_ea, size, TAG_WRITE, 0, 0);

// Wait for both
mfc_write_tag_mask((1 << TAG_READ) | (1 << TAG_WRITE));
mfc_read_tag_status_all();

See Also

PPU Architecture

Learn about the PowerPC Processor Unit

Memory Management

Memory allocation and DMA best practices

Build System

Building SPU programs

SPU API Reference

Complete SPU API documentation
