The SPU samples demonstrate programming the Cell Broadband Engine’s Synergistic Processing Units (SPUs), including thread creation, DMA transfers, and parallel processing patterns.

Available SPU Samples

  • sputest: Basic SPU thread creation and execution
  • spudma: DMA transfers between PPU and SPU
  • spuchain: SPU thread chains with synchronization
  • spuparallel: Parallel processing with multiple SPUs
  • spumars: MARS task queuing system
  • spurs: SPURS task scheduling framework
  • sputhread: SPU thread group management

Cell SPU Architecture

The PS3’s Cell processor contains:
  • 1 PPU (Power Processing Unit): Main CPU running your application
  • 6 SPUs (Synergistic Processing Units): Specialized coprocessors for parallel work (the Cell has 8 physical SPUs; one is disabled for chip yield and one is reserved by the OS)
  • Local Store: Each SPU has 256KB of fast local memory
  • DMA: Explicit data transfers between main memory and SPU local store

SPU Characteristics

  • SIMD: Vector processing with 128-bit registers
  • No cache: All data must be explicitly loaded via DMA
  • Fast: Excellent for parallel, data-intensive operations
  • Separate code: SPU programs are compiled separately and loaded by PPU

sputest - Basic SPU Execution

Location: samples/spu/sputest/

The simplest example of loading and running SPU code.

What It Demonstrates

  • SPU subsystem initialization
  • Creating raw SPU threads
  • Loading SPU program images
  • Starting SPU execution
  • Reading SPU mailbox output
  • Proper cleanup

PPU Code

samples/spu/sputest/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <sys/spu.h>
#include "spu_bin.h"

int main(int argc, char *argv[])
{
    u32 spu_id = 0;
    sysSpuImage image;

    printf("sputest starting....\n");

    // Initialize SPU runtime (6 SPUs, 5 raw SPUs)
    printf("Initializing 6 SPUs...\n");
    sysSpuInitialize(6, 5);

    // Create a raw SPU thread
    printf("Initializing raw SPU...\n");
    sysSpuRawCreate(&spu_id, NULL);

    // Import SPU binary image
    printf("Importing spu image...\n");
    sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);

    // Load image into SPU local store
    printf("Loading spu image into SPU %d...\n", spu_id);
    sysSpuRawImageLoad(spu_id, &image);

    // Start SPU execution
    printf("Starting SPU %d...\n", spu_id);
    sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);

    // Wait for SPU to write to outbound mailbox
    printf("Waiting for SPU to return...\n");
    while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));

    // Read mailbox value
    printf("SPU Mailbox return value: %08x\n",
           sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox));

    // Cleanup
    printf("Destroying SPU %d...\n", spu_id);
    sysSpuRawDestroy(spu_id);

    printf("Closing SPU image...\n");
    sysSpuImageClose(&image);

    return 0;
}

SPU Code

samples/spu/sputest/spu/source/main.c
#include <spu_intrinsics.h>

int main()
{
    // Write a value to outbound mailbox
    spu_writech(SPU_WrOutMbox, 0x1337BAAD);
    return 0;
}

Execution Flow

  1. PPU initializes SPU subsystem: sysSpuInitialize(6, 5) requests 6 SPUs total, 5 as raw SPUs
  2. PPU creates SPU thread: sysSpuRawCreate(&spu_id, NULL) claims an available raw SPU
  3. PPU loads SPU program: the SPU binary is copied into the SPU's local store
  4. PPU starts SPU: writing 1 to SPU_RunCtrl begins execution
  5. SPU executes: the program runs and writes its result to the outbound mailbox
  6. PPU reads result: polls the mailbox status register, then reads the mailbox value
  7. Cleanup: destroy the SPU thread and close the image

spuchain - SPU Thread Chains

Location: samples/spu/spuchain/

Demonstrates coordinating multiple SPUs in a chain.

What It Demonstrates

  • SPU thread groups
  • Signal notifications between SPUs
  • DMA transfers between SPU local stores
  • Thread synchronization
  • Chain processing pattern

Concept

Creates 6 SPU threads in a chain:
  1. PPU signals SPU 0
  2. SPU 0 processes data, DMAs to SPU 1
  3. SPU 1 processes, DMAs to SPU 2
  4. … continues through SPU 5
  5. SPU 5 writes result to main memory
Each SPU multiplies a vector by 2, so the result is original × 2^6 = × 64.

PPU Implementation

samples/spu/spuchain/source/main.c
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <sys/spu.h>
#include "spustr.h"
#include "spu_bin.h"

#define ptr2ea(x) ((u64)((void*)(x)))

int main(int argc, char *argv[])
{
    u32 *array;
    u32 group_id;
    spustr_t *spu;
    sysSpuImage image;
    u32 cause, status, i;
    sysSpuThreadArgument arg[6];
    sysSpuThreadGroupAttribute grpattr = {
        7+1, ptr2ea("mygroup"), 0, 0
    };
    sysSpuThreadAttribute attr = {
        ptr2ea("mythread"), 8+1, SPU_THREAD_ATTR_NONE
    };

    printf("spuchain starting....\n");

    // Initialize SPU subsystem
    sysSpuInitialize(6, 0);
    
    // Load SPU program
    sysSpuImageImport(&image, spu_bin, 0);
    
    // Create thread group
    sysSpuThreadGroupCreate(&group_id, 6, 100, &grpattr);

    // Allocate shared data
    spu = (spustr_t*)memalign(128, 6*sizeof(spustr_t));
    array = (u32*)memalign(128, 4*sizeof(u32));

    // Initialize and create 6 SPU threads
    for(i = 0; i < 6; i++) {
        spu[i].rank = i;
        spu[i].count = 6;
        spu[i].sync = 0;
        spu[i].array_ea = ptr2ea(array);
        arg[i].arg0 = ptr2ea(&spu[i]);

        printf("Creating SPU thread... ");
        sysSpuThreadInitialize(&spu[i].id, group_id, i,
                               &image, &attr, &arg[i]);
        printf("%08x\n", spu[i].id);
        
        // Configure signal notification mode
        sysSpuThreadSetConfiguration(spu[i].id,
            (SPU_SIGNAL1_OVERWRITE | SPU_SIGNAL2_OVERWRITE));
    }

    // Start all SPU threads
    printf("Starting SPU thread group....\n");
    sysSpuThreadGroupStart(group_id);

    // Initialize array
    printf("Initial array:");
    for(i = 0; i < 4; i++) {
        array[i] = (i + 1);
        printf(" %d", array[i]);
    }
    printf("\n");

    // Trigger the chain by signaling SPU 0
    printf("sending signal.... \n");
    sysSpuThreadWriteSignal(spu[0].id, 0, 1);

    // Wait for SPU 5 to complete
    while(spu[5].sync == 0);

    // Display results
    printf("Output array:");
    for(i = 0; i < 4; i++)
        printf(" %d", array[i]);
    printf("\n");

    // Cleanup
    printf("Joining SPU thread group....\n");
    sysSpuThreadGroupJoin(group_id, &cause, &status);
    sysSpuImageClose(&image);

    free(array);
    free(spu);

    return 0;
}
Expected output: the initial array {1, 2, 3, 4} becomes {64, 128, 192, 256}.

DMA Transfers

SPUs cannot directly access main memory - all data must be transferred via DMA.

DMA Patterns

// SPU code - get data from main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_source = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;

mfc_get(local_buffer, ea_source, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

// Now local_buffer contains the data

// SPU code - write data to main memory
u32 local_buffer[256] __attribute__((aligned(128)));
u64 ea_dest = /* effective address in main memory */;
u32 size = sizeof(local_buffer);
u32 tag = 1;

// Fill local_buffer with data

mfc_put(local_buffer, ea_dest, size, tag, 0, 0);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();

// Double buffering - process data in pipeline fashion
u32 buffer[2][256] __attribute__((aligned(128)));

// Start the first transfer into buffer 0, using the buffer index as the tag
mfc_get(buffer[0], ea, size, 0, 0, 0);

for(int i = 0; i < num_blocks; i++) {
    u32 current = i & 1;
    u32 next = (i + 1) & 1;

    // Kick off the next DMA while the current one is still in flight
    if(i + 1 < num_blocks)
        mfc_get(buffer[next], ea + (u64)(i + 1)*size, size, next, 0, 0);

    // Wait for the current buffer's transfer to complete
    mfc_write_tag_mask(1 << current);
    mfc_read_tag_status_all();

    // Process buffer[current] while the next transfer runs
    process_data(buffer[current]);
}

// List DMA - transfer non-contiguous data
// (simplified 8-byte list element: size plus the low 32 bits of the EA)
typedef struct {
    u32 size;   // transfer size in bytes
    u32 eal;    // low 32 bits of the effective address
} dma_list_t;

dma_list_t list[8] __attribute__((aligned(8)));

// Set up list elements
list[0].size = 256;
list[0].eal = (u32)ea_addr1;
list[1].size = 512;
list[1].eal = (u32)ea_addr2;
// ...

// The list-size argument is in bytes; the EA argument supplies the
// upper address bits shared by all list entries
mfc_getl(local_buffer, ea_high, list, num_elements*sizeof(dma_list_t), tag, 0, 0);

DMA Requirements

  • Alignment: transfers of 16 bytes or more must be 16-byte aligned with a size that is a multiple of 16; small transfers of 1, 2, 4, or 8 bytes must be naturally aligned
  • Size: Maximum 16KB per transfer
  • Local store: Buffers must be in SPU local store (not main memory)
  • Tags: Use DMA tags (0-31) to track multiple transfers

SPU Programming Patterns

SPU Thread Groups

// PPU code - creating thread group
u32 group_id;
sysSpuThreadGroupAttribute attr = {
    name_len, name_ea, group_type, mem_container  // priority is passed to the create call, not here
};

sysSpuThreadGroupCreate(&group_id, num_threads, priority, &attr);

// Add threads to group
for(int i = 0; i < num_threads; i++) {
    sysSpuThreadInitialize(&thread_id[i], group_id, i,
                           &image, &thread_attr, &arg[i]);
}

// Start all threads at once
sysSpuThreadGroupStart(group_id);

// Wait for completion
u32 cause, status;
sysSpuThreadGroupJoin(group_id, &cause, &status);

Mailboxes for Communication

// PPU writing to SPU inbound mailbox
u32 data = 0x12345678;
sysSpuThreadWriteSpuMb(thread_id, data);

// PPU reading from SPU outbound mailbox
while(!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));
u32 value = sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox);

SPU Signal Notifications

// PPU sends signal to SPU
sysSpuThreadWriteSignal(thread_id, signal_number, value);

// SPU receives signal
// Configure signal mode first (in PPU before starting)
sysSpuThreadSetConfiguration(thread_id, SPU_SIGNAL1_OVERWRITE);

// In SPU code:
u32 signal = spu_readch(SPU_RdSigNotify1);

Building SPU Samples

Build Process

SPU samples are built in three steps:
  1. Compile SPU code with the SPU compiler (spu-gcc)
  2. Embed the SPU binary in the PPU code
  3. Compile PPU code with the PPU compiler (ppu-gcc)
The Makefiles handle this automatically.

Build Commands

# Build all SPU samples
cd samples/spu
make

# Build individual sample
cd samples/spu/sputest
make

# Clean
make clean

SPU Makefile Structure

Typical SPU sample has:
sputest/
├── Makefile          # Main makefile
├── source/           # PPU source code
│   └── main.c
├── spu/              # SPU source code
│   ├── Makefile
│   └── source/
│       └── main.c
└── data/             # Optional data files

Performance Considerations

  • DMA latency: transfers have latency (~200 cycles); use double buffering to hide it
  • Alignment: keep data 128-byte aligned for best performance
  • SIMD: use vector intrinsics for 4x parallelism within each SPU
  • Local store: keep the working set small (256KB total, including code)
  • Branching: SPUs have no dynamic branch prediction, only static branch hints; avoid complex branching
  • Mailboxes: mailboxes are slow; use them for control, not bulk data

Common SPU Patterns

Data Parallel Processing

// PPU divides work among SPUs
for(int i = 0; i < num_spus; i++) {
    args[i].start_index = i * (total_items / num_spus);
    args[i].count = total_items / num_spus;
    sysSpuThreadInitialize(&thread[i], group, i, &image, &attr, &args[i]);
}

// Each SPU processes its chunk
// SPU code:
process_range(arg->start_index, arg->count);

Pipeline Processing

// SPU 0: Read and preprocess
// SPU 1: Main processing  
// SPU 2: Post-process and write

// Data flows through stages

Reduction

// Each SPU computes partial result
// PPU combines results
float total = 0;
for(int i = 0; i < num_spus; i++) {
    total += spu_results[i];
}

Debugging SPU Code

  1. Use printf carefully: SPU printf is slow; use it sparingly
  2. Check alignment: misaligned DMA transfers fail silently or crash
  3. Verify addresses: make sure effective addresses point at valid main memory
  4. Monitor DMA tags: confirm transfers have completed before touching the data
  5. Use mailboxes: send status and debug info back to the PPU via mailboxes

  • SPU API Reference: complete SPU API documentation
  • SPU Programming Guide: in-depth SPU programming concepts
  • DMA Guide: DMA transfer patterns and optimization
  • SIMD Intrinsics: SPU vector intrinsics reference
