
Overview

The SPU atomic operations API provides lock-free synchronization primitives for safe concurrent access to shared locations in main memory. These operations are built on the Cell's reservation-based lock-line mechanism (load-and-reserve / store-conditional).

Key Features

  • Lock-free: No OS-level locks required
  • 128-byte granularity: Operations work on cache-line sized blocks
  • Automatic retry: Built-in retry logic on contention
  • Multiple SPU safe: Coordinate between multiple SPUs
  • PPU compatible: Can synchronize with PPU threads

Atomic Integer Operations

All atomic operations take a local buffer (128-byte aligned) and an effective address (also 128-byte aligned). They return the previous value from memory.

spu_atomic_incr32

Atomically increment a 32-bit value in main memory.
uint32_t spu_atomic_incr32(uint32_t *ls, uint64_t ea)
ls (uint32_t*, required): Local store buffer (128-byte aligned)
ea (uint64_t, required): Effective address in main memory (128-byte aligned)
Returns (uint32_t): The value before incrementing

spu_atomic_incr64

Atomically increment a 64-bit value.
uint64_t spu_atomic_incr64(uint64_t *ls, uint64_t ea)

spu_atomic_decr32

Atomically decrement a 32-bit value.
uint32_t spu_atomic_decr32(uint32_t *ls, uint64_t ea)

spu_atomic_decr64

Atomically decrement a 64-bit value.
uint64_t spu_atomic_decr64(uint64_t *ls, uint64_t ea)

spu_atomic_test_and_decr32

Atomically decrement if value is greater than zero.
uint32_t spu_atomic_test_and_decr32(uint32_t *ls, uint64_t ea)
ls (uint32_t*, required): Local store buffer
ea (uint64_t, required): Effective address
Returns (uint32_t): Previous value (decrements only if it was > 0)

spu_atomic_test_and_decr64

Atomically test and decrement a 64-bit value.
uint64_t spu_atomic_test_and_decr64(uint64_t *ls, uint64_t ea)

Atomic Arithmetic

spu_atomic_add32

Atomically add a value to a 32-bit integer.
uint32_t spu_atomic_add32(uint32_t *ls, uint64_t ea, uint32_t value)
ls (uint32_t*, required): Local store buffer
ea (uint64_t, required): Effective address
value (uint32_t, required): Value to add
Returns (uint32_t): Previous value before the addition

spu_atomic_add64

Atomically add to a 64-bit integer.
uint64_t spu_atomic_add64(uint64_t *ls, uint64_t ea, uint64_t value)

spu_atomic_sub32

Atomically subtract from a 32-bit integer.
uint32_t spu_atomic_sub32(uint32_t *ls, uint64_t ea, uint32_t value)

spu_atomic_sub64

Atomically subtract from a 64-bit integer.
uint64_t spu_atomic_sub64(uint64_t *ls, uint64_t ea, uint64_t value)

Atomic Bitwise Operations

spu_atomic_or32

Atomically OR a value with a 32-bit integer.
uint32_t spu_atomic_or32(uint32_t *ls, uint64_t ea, uint32_t value)
value (uint32_t, required): Bitmask to OR
Returns (uint32_t): Previous value before the OR operation

spu_atomic_or64

Atomically OR with a 64-bit integer.
uint64_t spu_atomic_or64(uint64_t *ls, uint64_t ea, uint64_t value)

spu_atomic_and32

Atomically AND a value with a 32-bit integer.
uint32_t spu_atomic_and32(uint32_t *ls, uint64_t ea, uint32_t value)

spu_atomic_and64

Atomically AND with a 64-bit integer.
uint64_t spu_atomic_and64(uint64_t *ls, uint64_t ea, uint64_t value)

Atomic Store and Swap

spu_atomic_store32

Atomically store a new value.
uint32_t spu_atomic_store32(uint32_t *ls, uint64_t ea, uint32_t value)
value (uint32_t, required): New value to store
Returns (uint32_t): Previous value (effectively an atomic exchange)

spu_atomic_store64

Atomically store a 64-bit value.
uint64_t spu_atomic_store64(uint64_t *ls, uint64_t ea, uint64_t value)

Compare and Swap

spu_atomic_compare_and_swap32

Atomically compare and swap a 32-bit value.
uint32_t spu_atomic_compare_and_swap32(uint32_t *ls, uint64_t ea, uint32_t compare, uint32_t value)
ls (uint32_t*, required): Local store buffer
ea (uint64_t, required): Effective address
compare (uint32_t, required): Expected current value
value (uint32_t, required): New value to store if the comparison succeeds
Returns (uint32_t): Actual value found in memory; the swap succeeded if this equals compare

spu_atomic_compare_and_swap64

Atomically compare and swap a 64-bit value.
uint64_t spu_atomic_compare_and_swap64(uint64_t *ls, uint64_t ea, uint64_t compare, uint64_t value)

Manual Atomic Operations

spu_atomic_lock_line32

Load a cache line with reservation for manual atomic operation.
uint32_t spu_atomic_lock_line32(uint32_t *ls, uint64_t ea)
ls (uint32_t*, required): Local store buffer (128 bytes, 128-byte aligned)
ea (uint64_t, required): Effective address (will be aligned to a 128-byte boundary)
Returns (uint32_t): Current value at the specified address

spu_atomic_lock_line64

Load a cache line for 64-bit atomic operation.
uint64_t spu_atomic_lock_line64(uint64_t *ls, uint64_t ea)

spu_atomic_store_conditional32

Attempt to store with reservation check.
int spu_atomic_store_conditional32(uint32_t *ls, uint64_t ea, uint32_t value)
ls (uint32_t*, required): Local store buffer (the same buffer used in lock_line)
ea (uint64_t, required): Effective address
value (uint32_t, required): Value to store at the aligned offset
Returns (int): Nonzero if the store succeeded, zero if the reservation was lost

spu_atomic_store_conditional64

Conditional store for 64-bit values.
int spu_atomic_store_conditional64(uint64_t *ls, uint64_t ea, uint64_t value)

No-Op Operation

spu_atomic_nop32

Atomic no-op (reads and writes back unchanged).
uint32_t spu_atomic_nop32(uint32_t *ls, uint64_t ea)
Useful for testing atomic mechanism or forcing a cache line load.

spu_atomic_nop64

Atomic no-op for a 64-bit value.
uint64_t spu_atomic_nop64(uint64_t *ls, uint64_t ea)

Example Usage

Simple Atomic Counter

#include <stdint.h>
#include <stdio.h>
#include <sys/spu_atomic.h>

// Shared counter in main memory (128-byte aligned)
uint64_t shared_counter_addr = 0x20000000;

// Local buffer (must be 128-byte aligned)
uint32_t local_buf[32] __attribute__((aligned(128)));

// Atomically increment counter
uint32_t old_value = spu_atomic_incr32(local_buf, shared_counter_addr);
printf("Counter was %u, now %u\n", old_value, old_value + 1);

Atomic Flags (Bitwise Operations)

// Set multiple status bits atomically
uint64_t flags_addr = 0x20000080;  // Must be 128-byte aligned
uint32_t local_buf[32] __attribute__((aligned(128)));

#define FLAG_READY    0x01
#define FLAG_BUSY     0x02
#define FLAG_COMPLETE 0x04

// Set BUSY flag
spu_atomic_or32(local_buf, flags_addr, FLAG_BUSY);

// Clear BUSY, set COMPLETE
spu_atomic_and32(local_buf, flags_addr, ~FLAG_BUSY);
spu_atomic_or32(local_buf, flags_addr, FLAG_COMPLETE);

Spinlock Implementation

typedef struct {
    uint32_t locked;
    uint32_t padding[31];  // Pad to 128 bytes
} __attribute__((aligned(128))) spinlock_t;

// A spinlock_t resides at this 128-byte aligned main-memory address
uint64_t lock_addr = 0x20000000;
uint32_t local_buf[32] __attribute__((aligned(128)));

void spinlock_acquire() {
    uint32_t old;
    do {
        // Try to swap 0 (unlocked) with 1 (locked)
        old = spu_atomic_compare_and_swap32(local_buf, lock_addr, 0, 1);
    } while (old != 0);  // Retry if lock was already held
}

void spinlock_release() {
    // Store 0 (unlocked)
    spu_atomic_store32(local_buf, lock_addr, 0);
}

Semaphore (Test and Decrement)

// Semaphore in main memory
uint64_t sem_addr = 0x20000000;
uint32_t local_buf[32] __attribute__((aligned(128)));

void semaphore_wait() {
    uint32_t old;
    do {
        // Decrement only if > 0
        old = spu_atomic_test_and_decr32(local_buf, sem_addr);
        if (old == 0) {
            // Semaphore is zero, wait a bit
            for (volatile int i = 0; i < 1000; i++);
        }
    } while (old == 0);
}

void semaphore_post() {
    spu_atomic_incr32(local_buf, sem_addr);
}

Manual Atomic Operation

// Custom atomic operation: multiply by 2
uint64_t value_addr = 0x20000000;
uint32_t local_buf[32] __attribute__((aligned(128)));

int success;
do {
    // Load with reservation
    uint32_t current = spu_atomic_lock_line32(local_buf, value_addr);
    
    // Compute new value
    uint32_t new_value = current * 2;
    
    // Try to store
    success = spu_atomic_store_conditional32(local_buf, value_addr, new_value);
    
} while (!success);  // Retry if another SPU modified the value

Lock-Free Stack

Because a store conditional fails on any intervening write to the reserved line, the lock-line primitives are immune to the ABA problem that affects plain compare-and-swap, so no generation counter is needed. The new node must be flushed to main memory before it is published as the new head; the DMA calls below are the standard MFC intrinsics from <spu_mfcio.h>.

typedef struct node {
    uint64_t next;   // Effective address of the next node
    uint32_t data;
    uint32_t pad;
} node_t;

typedef struct {
    uint64_t head;          // Effective address of the head node
    uint32_t padding[30];   // Pad to 128 bytes
} __attribute__((aligned(128))) stack_t;

uint64_t stack_addr = 0x20000000;
uint64_t local_buf[16] __attribute__((aligned(128)));

// node_ls: the node's image in local store
// node_addr: the node's effective address in main memory
void push(node_t *node_ls, uint64_t node_addr) {
    uint64_t old_head;
    const int tag = 0;

    do {
        // Read the current head with a reservation
        old_head = spu_atomic_lock_line64(local_buf, stack_addr);

        // Link the node to the current head and flush it to main
        // memory before publishing it
        node_ls->next = old_head;
        mfc_put(node_ls, node_addr, sizeof(node_t), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        // Publish the node as the new head; retry if another SPU
        // modified the stack since the lock_line64
    } while (!spu_atomic_store_conditional64(local_buf, stack_addr, node_addr));
}

Performance Considerations

  1. Alignment: All atomic operations require 128-byte alignment
  2. Contention: High contention causes repeated retry loops; consider alternative designs
  3. Cache effects: Each atomic op loads 128 bytes even for small values
  4. False sharing: Separate frequently-updated atomics by 128 bytes
  5. Alternative patterns: Sometimes message passing is more efficient than atomics

Alignment Requirements

// Correct: 128-byte aligned
uint32_t atomic_value[32] __attribute__((aligned(128)));
uint64_t addr = 0x20000000;  // Must be 128-byte aligned

// The actual value can be anywhere within the 128-byte block
// The API handles the offset automatically

Memory Ordering

All atomic operations include:
  • spu_dsync() before store conditional (ensures local changes are visible)
  • Implicit memory barriers from the atomic mechanism itself
This ensures proper ordering for most use cases. For complex memory ordering requirements, additional synchronization may be needed.
