Overview
The PlayStation 3’s Cell Broadband Engine pairs one PowerPC Processing Unit (PPU) with six Synergistic Processing Units (SPUs) available to applications. This guide covers compiler optimization, memory alignment, SIMD operations, and parallel processing techniques for both.
PPU Compiler Optimizations
PSL1GHT’s build system includes PowerPC-specific optimizations in the compilation flags (source:~/workspace/source/ppu_rules:21).
Default PPU Flags
MACHDEP = -mhard-float -fmodulo-sched -ffunction-sections -fdata-sections
Compiler Flag Explanations
-mhard-float : Use hardware floating-point unit (FPU) instead of software emulation
-fmodulo-sched : Enable modulo scheduling for loops (improves instruction-level parallelism)
-ffunction-sections : Place each function in its own section (enables dead code elimination)
-fdata-sections : Place each data item in its own section (reduces binary size)
Optimization Levels
# Your Makefile
# Development build
CFLAGS += -O0 -g
# Release build with optimizations
CFLAGS += -O2
# Aggressive optimization (use with caution)
CFLAGS += -O3 -funroll-loops
-O3 can increase code size significantly and may not always improve performance. Profile before and after optimization changes.
Cell-Specific Optimizations
# Enable Cell-specific instructions
CFLAGS += -mcpu=cell -mtune=cell
# Use Altivec/VMX instructions
CFLAGS += -maltivec
# Emit symbolic register names in assembly output (readability only, no speed gain)
CFLAGS += -mregnames
SPU Compiler Optimizations
SPUs have different optimization characteristics than the PPU (source:~/workspace/source/spu_rules:16-38).
SPU Optimization Flags
PSL1GHT provides three SPU build modes:
WM_CFLAGS = -Os -mfixed-range=80-127 -funroll-loops -fschedule-insns
WM_STACK = 0x39e0
-Os : Optimize for size (SPU local store is only 256KB!)
-mfixed-range=80-127 : Reserve registers for work manager
-funroll-loops : Unroll loops for better pipelining
-fschedule-insns : Reorder instructions to avoid stalls
TASK_CFLAGS = -Os -ffast-math -ftree-vectorize -funroll-loops -fschedule-insns
-ffast-math : Enable aggressive floating-point optimizations
-ftree-vectorize : Auto-vectorize loops using SIMD instructions
JOB_CFLAGS = -Os -fpic -ffast-math -ftree-vectorize -funroll-loops -fschedule-insns
-fpic : Generate position-independent code for dynamic loading
-ffast-math : Fast math optimizations
SPU-Specific Optimizations
# Insert nops where GCC expects them to improve dual issue
# (GCC documents this as disabled under -Os)
MACHDEP += -mdual-nops
# Enable modulo scheduling
MACHDEP += -fmodulo-sched
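Each SPU can issue two instructions per cycle, but only when one targets the even pipeline (arithmetic and logic) and the other the odd pipeline (loads, stores, shuffles, branches). Interleave computation with memory and shuffle work to maximize dual-issue opportunities.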
Memory Alignment
The PS3 is extremely sensitive to memory alignment. Unaligned access can cause crashes or severe performance penalties.
Alignment Requirements
Type                      Alignment   Notes
char, u8                  1 byte      No alignment required
short, u16                2 bytes     Must be 2-byte aligned
int, u32, float           4 bytes     Must be 4-byte aligned
long long, u64, double    8 bytes     Must be 8-byte aligned
vector, vec_float4        16 bytes    Must be 16-byte aligned
DMA transfers (SPU)       16 bytes    Critical for SPU
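The alignment requirements above can be checked at run time with a pair of small helpers. This is a minimal sketch in portable C; `is_aligned` and `align_up` are illustrative names and are not part of PSL1GHT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of align. align must be a power of two. */
static int is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Rounds a byte count up to the next multiple of align (a power of two),
   for example to pad a DMA transfer size to 16 bytes. */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

Calling `is_aligned(ptr, 16)` in an assertion before a SIMD loop or DMA call is a cheap way to catch a misaligned buffer during development.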
Stack Alignment Macro
PSL1GHT provides a macro for stack-allocated aligned data (source:~/workspace/source/ppu/include/ppu-types.h):
#include <ppu-types.h>
void example(void) {
    // Allocate a 16-byte aligned array on the stack
    STACK_ALIGN(float, aligned_data, 64, 16);
    // aligned_data is now a pointer to 64 floats, 16-byte aligned
    for (int i = 0; i < 64; i++) {
        aligned_data[i] = i * 1.0f;
    }
}
Always use STACK_ALIGN for data that will be:
Used in SIMD operations
Transferred to/from SPUs via DMA
Accessed by hardware (GCM, RSX)
Heap Alignment
#include <malloc.h>
#include <stdio.h>
// Allocate 16-byte aligned memory
void *aligned_ptr = memalign(16, size);
// Or 128-byte aligned (cache line)
void *cache_aligned = memalign(128, size);
// Always check for NULL
if (aligned_ptr == NULL || cache_aligned == NULL) {
    printf("Allocation failed!\n");
    free(aligned_ptr);   // free(NULL) is a no-op
    free(cache_aligned);
    return -1;
}
// Use aligned memory...
// Free aligned memory
free(aligned_ptr);
free(cache_aligned);
Alignment Attributes
// Align struct to 16 bytes
typedef struct {
    float x, y, z, w;
} __attribute__((aligned(16))) Vector4;
// Align global variable
float matrix[16] __attribute__((aligned(16)));
Never cast unaligned pointers to aligned types:
// WRONG - may crash!
u8 buffer[100];
u64 *ptr = (u64 *)&buffer[1]; // Unaligned!
// CORRECT - use memcpy for unaligned access
u64 value;
memcpy(&value, &buffer[1], sizeof(u64));
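The memcpy pattern can be wrapped once and reused wherever packed or network data is parsed. The sketch below assumes nothing about PSL1GHT; `load_u64` and `store_u64` are hypothetical helper names, and `uint64_t` from `<stdint.h>` stands in for `u64` so the code compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 64-bit value from any byte address without an aligned load. */
static uint64_t load_u64(const void *src)
{
    uint64_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}

/* Writes a 64-bit value to any byte address. */
static void store_u64(void *dst, uint64_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}
```

Compilers usually turn a fixed-size memcpy like this into a few byte or word moves, so the safe version rarely costs much.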
SIMD Optimization
The PS3’s PowerPC core supports AltiVec (VMX) SIMD instructions for processing 128-bit vectors.
Vector Math Libraries
PSL1GHT includes optimized SIMD libraries (source:/workspace/source/common/vectormath and source:/workspace/source/common/libsimdmath):
#include <vectormath/c/vectormath_aos.h>
void optimize_with_vectormath(void) {
    // The C API prefixes its types with Vmath (VmathVector3)
    VmathVector3 a, b, sum, cross;
    vmathV3MakeFromElems(&a, 1.0f, 2.0f, 3.0f);
    vmathV3MakeFromElems(&b, 4.0f, 5.0f, 6.0f);
    // Hardware-accelerated vector operations
    vmathV3Add(&sum, &a, &b);
    vmathV3Cross(&cross, &a, &b);
    float dot = vmathV3Dot(&a, &b);
}
SIMD Math Functions
#include <simdmath.h>
void simd_math_example(void) {
    // Process 4 floats simultaneously
    vec_float4 angles = {0.0f, 0.5f, 1.0f, 1.5f};
    vec_float4 sines = sinf4(angles);
    vec_float4 cosines = cosf4(angles);
    // Other SIMD functions available:
    vec_float4 roots = sqrtf4(angles);
    vec_float4 powers = powf4(angles, sines);
}
SIMD math functions process 4 values in parallel using vector instructions. For suitable workloads this can approach a 4x speedup over scalar code.
Available SIMD Functions
From the libsimdmath library (source:~/workspace/source/common/libsimdmath/ppu/simdmath/):
sinf4, cosf4, tanf4
asinf4, acosf4, atanf4, atan2f4
Exponential & Logarithmic
expf4, exp2f4
logf4, log2f4, log10f4
powf4
sqrtf4, cbrtf4
fabsf4, copysignf4
floorf4, ceilf4
fminf4, fmaxf4
divf4 - Fast division
recipf4 - Reciprocal approximation
rsqrtf4 - Reciprocal square root
Writing SIMD Code
#include <altivec.h>
// a, b, and result must all be 16-byte aligned
void process_arrays_simd(float *a, float *b, float *result, int count) {
    // Process 4 floats at a time
    vector float *va = (vector float *)a;
    vector float *vb = (vector float *)b;
    vector float *vr = (vector float *)result;
    int vec_count = count / 4;
    for (int i = 0; i < vec_count; i++) {
        vr[i] = vec_add(va[i], vb[i]); // Add 4 floats in one instruction
    }
    // Handle remaining elements with scalar code
    int remainder = count % 4;
    for (int i = count - remainder; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
For best SIMD performance:
Ensure 16-byte alignment
Process data in multiples of 4 (or 16 for bytes)
Keep data contiguous in memory
Avoid branches inside SIMD loops
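The last point, avoiding branches inside SIMD loops, usually means computing a per-lane mask and blending with it. The sketch below uses GCC/Clang vector extensions as a portable stand-in so it can be built off-console; that choice is an assumption of this example, and `max4` is an illustrative name. On the PPU the same pattern is typically written with `vec_cmpgt` and `vec_sel`, and on the SPU with `spu_cmpgt` and `spu_sel`.

```c
#include <assert.h>

/* Four-lane vector types via GCC/Clang vector extensions. */
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* Lane-wise maximum without a branch: build a mask, then blend. */
static v4f max4(v4f a, v4f b)
{
    v4i mask = a > b;   /* each lane is -1 where a > b, otherwise 0 */
    return (v4f)(((v4i)a & mask) | ((v4i)b & ~mask));
}
```

Every lane follows the same instruction stream regardless of the comparison result, which is what keeps the loop free of mispredicted branches.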
SPU Optimization Strategies
The six SPUs are the PS3’s real performance powerhouse. Proper SPU utilization can provide massive speedups.
When to Use SPUs
Good candidates:
Vector/matrix mathematics
Image processing (blur, filters, scaling)
Physics calculations (collision, particle systems)
Audio processing (mixing, effects)
Data compression/decompression
Pathfinding algorithms
Poor candidates:
Heavy branching logic
Random memory access patterns
Code with many dependencies
Operations requiring large datasets (>256KB)
SPU Programming Example
// SPU code - runs on SPU
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#define MAX_ELEMS 256
typedef struct {
    u32 ea_src; // PPU address of source data
    u32 ea_dst; // PPU address of destination
    u32 count;  // Number of elements (multiple of 4, at most MAX_ELEMS)
    u32 pad;    // Pad to 16 bytes: DMA sizes must be 1, 2, 4, 8, or a multiple of 16
} WorkParams __attribute__((aligned(16)));
int main(u64 params_ea, u64 env) {
    WorkParams params __attribute__((aligned(16)));
    // DMA parameters from PPU to SPU local store
    mfc_get(&params, params_ea, sizeof(WorkParams), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
    // Reject counts that would overflow the local buffers
    if (params.count > MAX_ELEMS) return 1;
    // Allocate local store buffers (aligned!)
    float src[MAX_ELEMS] __attribute__((aligned(16)));
    float dst[MAX_ELEMS] __attribute__((aligned(16)));
    // DMA source data from main memory
    mfc_get(src, params.ea_src, params.count * sizeof(float), 1, 0, 0);
    mfc_write_tag_mask(1 << 1);
    mfc_read_tag_status_all();
    // Process data using SIMD
    vector float *vsrc = (vector float *)src;
    vector float *vdst = (vector float *)dst;
    vector float scale = {2.0f, 2.0f, 2.0f, 2.0f};
    for (u32 i = 0; i < params.count / 4; i++) {
        vdst[i] = spu_mul(vsrc[i], scale);
    }
    // DMA result back to main memory
    mfc_put(dst, params.ea_dst, params.count * sizeof(float), 2, 0, 0);
    mfc_write_tag_mask(1 << 2);
    mfc_read_tag_status_all();
    return 0;
}
SPU local store is only 256KB, and code, stack, and data all share it. Carefully manage memory and use DMA to stream data in and out as needed.
SPU DMA Best Practices
// Helper (not part of spu_mfcio.h): block until all DMA on one tag completes
static inline void wait_tag(u32 tag) {
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
// BAD - Sequential DMA (slow)
mfc_get(buffer1, ea1, size, 0, 0, 0);
wait_tag(0);
process(buffer1);
mfc_get(buffer2, ea2, size, 0, 0, 0);
wait_tag(0);
process(buffer2);
// GOOD - Overlapped DMA and processing (fast)
mfc_get(buffer1, ea1, size, 0, 0, 0); // Start DMA 1 on tag 0
mfc_get(buffer2, ea2, size, 1, 0, 0); // Start DMA 2 on tag 1
wait_tag(0);                          // Wait for DMA 1
process(buffer1);                     // Process while DMA 2 completes
wait_tag(1);                          // Wait for DMA 2
process(buffer2);
Use double buffering: while processing one buffer, DMA the next buffer in the background. This hides DMA latency.
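For inputs larger than two buffers, double buffering becomes a ping-pong loop. The sketch below shows only the control flow and runs on a host machine: `dma_get`, `dma_put`, and `dma_wait` are hypothetical stand-ins built on memcpy, where real SPU code would issue `mfc_get`/`mfc_put` and wait on a tag. Because memcpy completes immediately, no real overlap happens here. The chunk size and function names are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 1024  /* floats per buffer: 4 KB, under the 16 KB DMA limit */

/* Host stand-ins for MFC calls; the copies finish at once, so wait is a no-op. */
static void dma_get(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof(float));
}
static void dma_put(float *main_mem, const float *local, size_t n)
{
    memcpy(main_mem, local, n * sizeof(float));
}
static void dma_wait(int tag) { (void)tag; }

/* Scales count floats from src into dst through two alternating buffers. */
static void scale_double_buffered(float *dst, const float *src,
                                  size_t count, float factor)
{
    static float buf[2][CHUNK];
    size_t chunks = (count + CHUNK - 1) / CHUNK;
    int cur = 0;

    assert(dst != NULL && src != NULL);
    if (chunks == 0)
        return;

    /* Prime the pipeline with the first chunk. */
    dma_get(buf[cur], src, count < CHUNK ? count : CHUNK);

    for (size_t c = 0; c < chunks; c++) {
        size_t off = c * CHUNK;
        size_t n = count - off < CHUNK ? count - off : CHUNK;
        int next = cur ^ 1;

        /* Start fetching chunk c+1 before touching chunk c. */
        if (c + 1 < chunks) {
            size_t noff = off + CHUNK;
            size_t nn = count - noff < CHUNK ? count - noff : CHUNK;
            dma_wait(next);  /* the earlier put from this buffer must be done */
            dma_get(buf[next], src + noff, nn);
        }

        dma_wait(cur);       /* chunk c has arrived */
        for (size_t i = 0; i < n; i++)
            buf[cur][i] *= factor;
        dma_put(dst + off, buf[cur], n);

        cur = next;
    }

    /* Drain both tags before returning. */
    dma_wait(0);
    dma_wait(1);
}
```

On real hardware each buffer would use its own tag (0 and 1 here), and the final partial chunk would need padding so its size stays a multiple of 16 bytes.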
SPU Mailboxes for Communication
// PPU side
#include <sys/spu.h>
u32 spu_id;
sysSpuImage image;
// Load and start SPU
sysSpuRawCreate(&spu_id, NULL);
sysSpuImageImport(&image, spu_bin, SPU_IMAGE_PROTECT);
sysSpuRawImageLoad(spu_id, &image);
sysSpuRawWriteProblemStorage(spu_id, SPU_RunCtrl, 1);
// Wait for SPU completion signal
while (!(sysSpuRawReadProblemStorage(spu_id, SPU_MBox_Status) & 1));
u32 result = sysSpuRawReadProblemStorage(spu_id, SPU_Out_MBox);
printf("SPU returned: 0x%08x\n", result);
// SPU side
#include <spu_mfcio.h>
// Send completion signal to PPU
spu_write_out_mbox(0xDEADBEEF);
Profiling and Benchmarking
Timing Code Sections
#include <stdio.h>
#include <sys/systime.h>
u64 start = sysGetSystemTime();
// Code to profile
expensive_function();
u64 end = sysGetSystemTime();
u64 elapsed_us = end - start;
// Cast so the argument matches %llu whatever u64 is defined as
printf("Function took %llu microseconds\n", (unsigned long long)elapsed_us);
printf("Function took %.3f milliseconds\n", elapsed_us / 1000.0);
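A single measurement is easily skewed by interrupts or a cold cache, so it often helps to repeat the run and keep the fastest time. The sketch below takes the clock as a function pointer so it can be exercised off-console. `time_min_us`, `fake_now`, and `busy_work` are hypothetical names introduced for this example; on the PS3 a thin wrapper around `sysGetSystemTime()` would be passed instead of the fake clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*clock_us_fn)(void);  /* returns a microsecond timestamp */
typedef void (*work_fn)(void);

/* Runs work `runs` times and returns the fastest duration in microseconds. */
static uint64_t time_min_us(clock_us_fn now, work_fn work, int runs)
{
    uint64_t best = UINT64_MAX;
    assert(now != NULL && work != NULL && runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t start = now();
        work();
        uint64_t elapsed = now() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Host-side stand-in clock: advances 5 us on every call. */
static uint64_t fake_now(void)
{
    static uint64_t t = 0;
    t += 5;
    return t;
}

/* Host-side stand-in workload: counts how often it was called. */
static int work_calls = 0;
static void busy_work(void) { work_calls++; }
```

Reporting the minimum answers "how fast can this code go", while the average from the frame-time approach answers "how fast does it usually go"; both are useful.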
Frame Time Analysis
void measure_frame_performance(void) {
    static u64 last_time = 0;
    static u64 frame_times[60];
    static int frame_idx = 0;
    u64 now = sysGetSystemTime();
    if (last_time != 0) {
        frame_times[frame_idx] = now - last_time;
        frame_idx = (frame_idx + 1) % 60;
        // Every 60 frames, report stats
        if (frame_idx == 0) {
            u64 total = 0, min_us = ~0ULL, max_us = 0;
            for (int i = 0; i < 60; i++) {
                total += frame_times[i];
                if (frame_times[i] < min_us) min_us = frame_times[i];
                if (frame_times[i] > max_us) max_us = frame_times[i];
            }
            u64 avg = total / 60;
            // Guard against a zero average before dividing
            float fps = avg ? 1000000.0f / avg : 0.0f;
            printf("FPS: %.2f (avg: %llu us, min: %llu us, max: %llu us)\n",
                   fps, (unsigned long long)avg,
                   (unsigned long long)min_us, (unsigned long long)max_us);
        }
    }
    last_time = now;
}
// Cache line size on PS3 is 128 bytes
#define CACHE_LINE_SIZE 128
// Align structures to cache lines to avoid false sharing
typedef struct {
    u32 data[32]; // 128 bytes
} __attribute__((aligned(CACHE_LINE_SIZE))) CacheLineData;
The PPU’s L2 cache is only 512KB and is shared by its two hardware threads; the SPUs have no cache and work from local store instead. Design data structures to maximize cache locality.
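One common way to improve locality is to switch hot data from an array-of-structures layout to a structure-of-arrays layout, so a loop only pulls in the fields it reads. The sketch below is illustrative: the particle fields, `MAX_PARTICLES`, and `integrate` are assumptions made for this example, not PSL1GHT types.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTICLES 1024

/* Array-of-structures: a position update also drags color and flags
   through the cache, since they share cache lines with x, y, z. */
typedef struct {
    float x, y, z;
    float vx, vy, vz;
    unsigned color;
    unsigned flags;
} ParticleAoS;

/* Structure-of-arrays: each field is contiguous, so a position update
   touches only the cache lines it actually needs. */
typedef struct {
    float x[MAX_PARTICLES], y[MAX_PARTICLES], z[MAX_PARTICLES];
    float vx[MAX_PARTICLES], vy[MAX_PARTICLES], vz[MAX_PARTICLES];
    unsigned color[MAX_PARTICLES];
    unsigned flags[MAX_PARTICLES];
} ParticlesSoA;

/* Advances the first count particles by dt seconds. */
static void integrate(ParticlesSoA *p, size_t count, float dt)
{
    assert(p != NULL && count <= MAX_PARTICLES);
    for (size_t i = 0; i < count; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}
```

The contiguous float arrays also map directly onto 4-wide vector loads and onto DMA transfers, which is why this layout tends to suit both the PPU and the SPUs.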
Optimization Checklist
Advanced: Link-Time Optimization
# Enable LTO for whole-program optimization
CFLAGS += -flto
LDFLAGS += -flto -fuse-linker-plugin
# Or use separate LTO optimization level
LDFLAGS += -flto -O3
Link-Time Optimization (LTO) significantly increases build time but can provide better inlining and dead code elimination across translation units.
Problem: Crashes or slow performance
Solution: Use memalign(), STACK_ALIGN(), or alignment attributes
Scalar Math Instead of SIMD
Problem: Not utilizing vector units
Solution: Use libsimdmath functions and the vectormath library
Problem: Only 1/7th of CPU power used (PPU only)
Solution: Identify parallelizable work and move it to SPUs
Problem: SPU stalled waiting for DMA
Solution: Use double buffering and overlap DMA with computation
Problem: Poor memory access patterns
Solution: Process data sequentially, align to cache lines
See Also