The SPU samples demonstrate programming the Cell Broadband Engine’s Synergistic Processing Units (SPUs), including thread creation, DMA transfers, and parallel processing patterns.

Available SPU Samples

  • sputest: Basic SPU thread creation and execution
  • spudma: DMA transfers between PPU and SPU
  • spuchain: SPU thread chains with synchronization
  • spuparallel: Parallel processing with multiple SPUs
  • spumars: MARS task queuing system
  • spurs: SPURS task scheduling framework
  • sputhread: SPU thread group management

Cell SPU Architecture

The PS3’s Cell processor contains:
  • 1 PPU (Power Processing Unit): Main CPU running your application
  • 6 SPUs (Synergistic Processing Units): Specialized coprocessors for parallel work (the Cell has 8 physical SPUs; one is disabled for chip yield and one is reserved by the OS)
  • Local Store: Each SPU has 256KB of fast local memory
  • DMA: Explicit data transfers between main memory and SPU local store

SPU Characteristics

  • SIMD: Vector processing with 128-bit registers
  • No cache: All data must be explicitly loaded via DMA
  • Fast: Excellent for parallel, data-intensive operations
  • Separate code: SPU programs are compiled separately and loaded by PPU

sputest - Basic SPU Execution

Location: samples/spu/sputest/

The simplest example of loading and running SPU code.

What It Demonstrates

  • SPU subsystem initialization
  • Creating raw SPU threads
  • Loading SPU program images
  • Starting SPU execution
  • Reading SPU mailbox output
  • Proper cleanup

PPU Code

samples/spu/sputest/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <sys/spu.h>
#include "spu_bin.h"

int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;

    printf("sputest starting....\n");

    // Initialize SPU runtime (6 SPUs, 5 raw SPUs)
    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);

    // Create a raw SPU thread
    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);

    // Import SPU binary image
    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);

    // Load image into SPU local store
    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);

    // Start SPU execution
    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);

    // Wait for SPU to write to outbound mailbox
    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));

    // Read mailbox value
    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));

    // Cleanup
    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);

    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);

    return 0;
}

SPU Code

samples/spu/sputest/spu/source/main.c
#include <spu_intrinsics.h>

int main()
{
    // Write a value to outbound mailbox
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}

Execution Flow

  1. PPU initializes SPU subsystem: sysSpuInitialize(6, 5) requests 6 SPUs total, 5 as raw SPUs
  2. PPU creates SPU thread: sysSpuRawCreate(&spu_id, NULL) claims an available raw SPU
  3. PPU loads SPU program: the SPU binary is copied into the SPU's local store
  4. PPU starts SPU: writing 1 to SPU_RunCtrl begins execution
  5. SPU executes: the program runs and writes its result to the outbound mailbox
  6. PPU reads result: polls the mailbox status register, then reads the mailbox value
  7. Cleanup: destroy the SPU thread and close the image

spuchain - SPU Thread Chains

Location: samples/spu/spuchain/

Demonstrates coordinating multiple SPUs in a chain.

What It Demonstrates

  • SPU thread groups
  • Signal notifications between SPUs
  • DMA transfers between SPU local stores
  • Thread synchronization
  • Chain processing pattern

Concept

Creates 6 SPU threads in a chain:
  1. PPU signals SPU 0
  2. SPU 0 processes data, DMAs to SPU 1
  3. SPU 1 processes, DMAs to SPU 2
  4. … continues through SPU 5
  5. SPU 5 writes result to main memory
Each SPU multiplies a vector by 2, so the result is original × 2^6 = × 64.

PPU Implementation

samples/spu/spuchain/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <sys/spu.h>
#include "spustr.h"
#include "spu_bin.h"

#define ptr2ea(x) ((u64)((void*)(x)))

int main(int argc, char *argv[])
{
    u32 *array;
    u32 group_id;
    spustr_t *spu;
    sysSpuImage image;
    u32 cause, status, i;
    sysSpuThreadArgument arg[6];
    sysSpuThreadGroupAttribute grpattr = {
        7+1, ptr2ea("mygroup"), 0, 0
    };
    sysSpuThreadAttribute attr = {
        ptr2ea("mythread"), 8+1, SPU_THREAD_ATTR_NONE
    };

    printf("spuchain starting....\n");

    // Initialize SPU subsystem
    sysSpuInitialize(6, 0);
    
    // Load SPU program
    sysSpuImageImport(&image, spu_bin, 0);
    
    // Create thread group
    sysSpuThreadGroupCreate(&group_id, 6, 100, &grpattr);

    // Allocate shared data
    spu = (spustr_t*)memalign(128, 6*sizeof(spustr_t));
    array = (u32*)memalign(128, 4*sizeof(u32));

    // Initialize and create 6 SPU threads
    for(i = 0; i < 6; i++) {
        spu[i].rank = i;
        spu[i].count = 6;
        spu[i].sync = 0;
        spu[i].array_ea = ptr2ea(array);
        arg[i].arg0 = ptr2ea(&spu[i]);

        printf("Creating SPU thread... ");
        sysSpuThreadInitialize(&spu[i].id, group_id, i,
                               &image, &attr, &arg[i]);
        printf("%08x\n", spu[i].id);
        
        // Configure signal notification mode
        sysSpuThreadSetConfiguration(spu[i].id,
            (SPU_SIGNAL1_OVERWRITE | SPU_SIGNAL2_OVERWRITE));
    }

    // Start all SPU threads
    printf("Starting SPU thread group....\n");
    sysSpuThreadGroupStart(group_id);

    // Initialize array
    printf("Initial array:");
    for(i = 0; i < 4; i++) {
        array[i] = (i + 1);
        printf(" %d", array[i]);
    }
    printf("\n");

    // Trigger the chain by signaling SPU 0
    printf("sending signal.... \n");
    sysSpuThreadWriteSignal(spu[0].id, 0, 1);

    // Wait for SPU 5 to complete
    while(spu[5].sync == 0);

    // Display results
    printf("Output array:");
    for(i = 0; i < 4; i++)
        printf(" %d", array[i]);
    printf("\n");

    // Cleanup
    printf("Joining SPU thread group....\n");
    sysSpuThreadGroupJoin(group_id, &cause, &status);
    sysSpuImageClose(&image);

    free(array);
    free(spu);

    return 0;
}
Expected output: the initial array {1, 2, 3, 4} becomes {64, 128, 192, 256}.

DMA Transfers

SPUs cannot directly access main memory - all data must be transferred via DMA.

DMA Patterns

// SPU code - get data from main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_source = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;

mfc_get(local_buffer, ea_source, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

// Now local_buffer contains the data

// SPU code - write data to main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_dest = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;

// Fill local_buffer with data

mfc_put(local_buffer, ea_dest, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

// Double buffering - process data in pipeline fashion
u32 buffer[2][256] __attribute__((aligned(128)));

// Start the first transfer into buffer 0, using the buffer index as the tag
mfc_get(buffer[0], ea, size, 0, 0, 0);

for(int i = 0; i < num_blocks; i++) {
    u32 current = i & 1;
    u32 next = (i + 1) & 1;

    // Kick off the next DMA while the current one is still in flight
    if(i + 1 < num_blocks)
        mfc_get(buffer[next], ea + (u64)(i + 1)*size, size, next, 0, 0);

    // Wait for the current buffer's transfer to complete
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();

    // Process buffer[current] while the next transfer runs
    process_data(buffer[current]);
}

// List DMA - transfer non-contiguous data
// (simplified 8-byte list element: size plus the low 32 bits of the EA)
typedef struct {
    u32 size;   // transfer size in bytes
    u32 eal;    // low 32 bits of the effective address
} dma_list_t;

dma_list_t list[8] __attribute__((aligned(8)));

// Set up list elements
list[0].size = 256;
list[0].eal = (u32)ea_addr1;
list[1].size = 512;
list[1].eal = (u32)ea_addr2;
// ...

// The list-size argument is in bytes; the EA argument supplies the
// upper address bits shared by all list entries
mfc_getl(local_buffer, ea_high, list, num_elements*sizeof(dma_list_t), tag, 0, 0);

DMA Requirements

  • Alignment: transfers of 16 bytes or more must be 16-byte aligned with a size that is a multiple of 16; small transfers of 1, 2, 4, or 8 bytes must be naturally aligned
  • Size: Maximum 16KB per transfer
  • Local store: Buffers must be in SPU local store (not main memory)
  • Tags: Use DMA tags (0-31) to track multiple transfers

SPU Programming Patterns

SPU Thread Groups

// PPU code - creating thread group
u32 group_id;
sysSpuThreadGroupAttribute attr = {
    name_len, name_ea, group_type, mem_container  // priority is passed to the create call, not here
};

sysSpuThreadGroupCreate(&group_id, num_threads, priority, &attr);

// Add threads to group
for(int i = 0; i < num_threads; i++) {
    sysSpuThreadInitialize(&thread_id[i], group_id, i,
                           &image, &thread_attr, &arg[i]);
}

// Start all threads at once
sysSpuThreadGroupStart(group_id);

// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group_id, &cause, &status);

Mailboxes for Communication

// PPU writing to SPU inbound mailbox
u32 data = 0x12345678;
sysSpuThreadWriteSpuMb(thread_id, data);

// PPU reading from SPU outbound mailbox
while(!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));
u32 value = sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox);

SPU Signal Notifications

// PPU sends signal to SPU
sysSpuThreadWriteSignal(thread_id, signal_number, value);

// SPU receives signal
// Configure signal mode first (in PPU before starting)
sysSpuThreadSetConfiguration(thread_id, SPU_SIGNAL1_OVERWRITE);

// In SPU code:
u32 signal = spu_readch(SPU_RdSigNotify1);

Building SPU Samples

Build Process

SPU samples are built in three steps:
  1. Compile SPU code with the SPU compiler (spu-gcc)
  2. Embed the SPU binary in the PPU code
  3. Compile PPU code with the PPU compiler (ppu-gcc)
The Makefiles handle this automatically.

Build Commands

# Build all SPU samples
cd samples/spu
make

# Build individual sample
cd samples/spu/sputest
make

# Clean
make clean

SPU Makefile Structure

Typical SPU sample has:
sputest/
├── Makefile          # Main makefile
├── source/           # PPU source code
│   └── main.c
├── spu/              # SPU source code
│   ├── Makefile
│   └── source/
│       └── main.c
└── data/             # Optional data files

Performance Considerations

  • DMA latency: transfers have latency (~200 cycles); use double buffering to hide it
  • Alignment: keep data 128-byte aligned for best performance
  • SIMD: use vector intrinsics for 4x parallelism within each SPU
  • Local store: keep the working set small (256KB total, including code)
  • Branching: SPUs have no dynamic branch prediction, only static branch hints; avoid complex branching
  • Mailboxes: mailboxes are slow; use them for control, not bulk data

Common SPU Patterns

Data Parallel Processing

// PPU divides work among SPUs
for(int i = 0; i < num_spus; i++) {
    args[i].start_index = i * (total_items / num_spus);
    args[i].count = total_items / num_spus;
    sysSpuThreadInitialize(&thread[i], group, i, &image, &attr, &args[i]);
}

// Each SPU processes its chunk
// SPU code:
process_range(arg->start_index, arg->count);

Pipeline Processing

// SPU 0: Read and preprocess
// SPU 1: Main processing  
// SPU 2: Post-process and write

// Data flows through stages

Reduction

// Each SPU computes partial result
// PPU combines results
float total = 0;
for(int i = 0; i < num_spus; i++) {
    total += spu_results[i];
}

Debugging SPU Code

  1. Use printf carefully: SPU printf is slow; use it sparingly
  2. Check alignment: misaligned DMA transfers fail silently or crash
  3. Verify addresses: make sure effective addresses point at valid main memory
  4. Monitor DMA tags: confirm transfers have completed before touching the data
  5. Use mailboxes: send status and debug info back to the PPU via mailboxes

  • SPU API Reference: complete SPU API documentation
  • SPU Programming Guide: in-depth SPU programming concepts
  • DMA Guide: DMA transfer patterns and optimization
  • SIMD Intrinsics: SPU vector intrinsics reference
