Overview
Neurenix supports ARM processors for AI inference and training on ARM-based devices. The framework includes:
- ARM NEON: SIMD instructions for accelerated vector operations
- ARM SVE: Scalable Vector Extension for flexible vectorization
- ARM Compute Library: Optimized neural network primitives
- ARM Ethos-U: Neural Processing Unit for edge AI
- CPU optimization: Multi-threading and cache-aware algorithms
Supported Processors
- ARM Cortex-A series (A53, A55, A57, A72, A76, A78, X1, X2)
- ARM Neoverse (N1, N2, V1, V2)
- Apple Silicon (M1, M2, M3 series)
- Qualcomm Snapdragon
- MediaTek Dimensity
- NVIDIA Jetson (ARM CPU components)
Requirements
- ARM processor with NEON support (ARMv7-A or later)
- ARM Compute Library (optional, recommended)
- GCC 9.0+ or Clang 10.0+ with ARM extensions
Installation
# Install Neurenix for ARM
pip install neurenix
# Build from source with ARM optimizations
export NEURENIX_WITH_ARM=1
export ARM_COMPUTE_LIB=/path/to/arm_compute
pip install -e .
# Install ARM Compute Library (optional)
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
scons Werror=0 -j8 debug=0 neon=1 opencl=0 embed_kernels=0 os=linux arch=arm64-v8a
Device Detection
Check ARM Availability
import neurenix as nx
# Check ARM acceleration
if nx.arm.is_available():
    print("ARM acceleration is available")
    print(f"ARM devices: {nx.arm.device_count()}")
else:
    print("ARM acceleration not available")
// C++ ARM detection
#include <iostream>
#include "phynexus/hardware/arm.h"
using namespace phynexus::hardware;
if (arm_is_available()) {
    int device_count = arm_get_device_count();
    std::cout << "ARM devices: " << device_count << std::endl;
}
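Independently of the Neurenix calls above, the host architecture can be checked with the Python standard library alone. The following is a minimal sketch, not part of the Neurenix API: `is_arm_machine` and `cpu_features` are hypothetical helper names, and the feature parsing assumes the Linux `/proc/cpuinfo` format, where AArch64 kernels report NEON as `asimd` and SVE as `sve`.

```python
import platform

# Machine-string prefixes that commonly identify ARM hosts (illustrative, not exhaustive).
ARM_MACHINE_PREFIXES = ("aarch64", "arm")


def is_arm_machine(machine: str) -> bool:
    """Return True if a platform.machine() string looks like an ARM host."""
    return machine.lower().startswith(ARM_MACHINE_PREFIXES)


def cpu_features(cpuinfo_text: str) -> set:
    """Collect the flags listed on 'Features' lines of Linux /proc/cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            features.update(value.split())
    return features


if __name__ == "__main__":
    print("ARM host:", is_arm_machine(platform.machine()))
```

On a Linux ARM device, passing the contents of `/proc/cpuinfo` to `cpu_features` gives a quick cross-check against the values Neurenix reports.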
Get Device Properties
props = nx.arm.get_device_properties(0)
print(f"Device: {props.device_name}")
print(f"NEON support: {props.has_neon}")
print(f"SVE support: {props.has_sve}")
if props.has_sve:
    print(f"SVE vector length: {props.sve_vector_length} bits")
print(f"Ethos-U NPU: {props.has_ethos_u}")
// C++ device properties
ARMDeviceProperties props;
if (arm_get_device_properties(0, &props)) {
    std::cout << "Device: " << props.device_name << std::endl;
    std::cout << "NEON: " << (props.has_neon ? "Yes" : "No") << std::endl;
    std::cout << "SVE: " << (props.has_sve ? "Yes" : "No") << std::endl;
    if (props.has_sve) {
        std::cout << "SVE vector length: " << props.sve_vector_length << std::endl;
    }
}
ARM NEON
Overview
NEON provides 128-bit SIMD instructions for parallel processing of data:
import neurenix as nx
# Operations automatically use NEON when available
a = nx.randn(1000, device='cpu')
b = nx.randn(1000, device='cpu')
c = a + b # Uses NEON acceleration
Explicit NEON Operations
import neurenix as nx
from neurenix.hardware import arm
# NEON-accelerated addition
a = nx.randn(1024)
b = nx.randn(1024)
c = arm.neon_add(a, b)
# NEON-accelerated multiplication
d = arm.neon_multiply(a, b)
// C++ NEON operations
#include "phynexus/hardware/arm.h"
using namespace phynexus::hardware;
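float* a = new float[1024];
float* b = new float[1024];
float* c = new float[1024];
// ... fill a and b with input data ...
// NEON-accelerated add
arm_neon_add(a, b, c, 1024);
// NEON-accelerated multiply (reuses c as the output buffer)
arm_neon_multiply(a, b, c, 1024);
// Release the buffers
delete[] a;
delete[] b;
delete[] c;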
NEON Data Types
NEON supports several vector data types, including:
- float32x4_t: 4 x 32-bit floats
- int32x4_t: 4 x 32-bit integers
- int16x8_t: 8 x 16-bit integers
- int8x16_t: 16 x 8-bit integers
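Each of the types above packs as many elements as fit into one 128-bit register, which is where the lane counts come from. A small sketch of that arithmetic follows; the helper names `neon_lanes` and `neon_steps` are hypothetical and only illustrate the calculation.

```python
NEON_REGISTER_BITS = 128  # width of one NEON quad-word register


def neon_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits <= 0 or NEON_REGISTER_BITS % element_bits != 0:
        raise ValueError(f"unsupported element width: {element_bits}")
    return NEON_REGISTER_BITS // element_bits


def neon_steps(num_elements: int, element_bits: int) -> tuple:
    """Return (full vector iterations, leftover elements) for an array."""
    return divmod(num_elements, neon_lanes(element_bits))
```

For example, `neon_steps(1003, 32)` splits a 1003-element float array into 250 full vector iterations plus 3 leftover elements that need separate handling.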
ARM SVE (Scalable Vector Extension)
Overview
SVE provides vector operations with runtime-determined vector lengths:
import neurenix as nx
from neurenix.hardware import arm
# Check SVE support
if arm.has_sve():
    vector_length = arm.get_sve_vector_length()
    print(f"SVE vector length: {vector_length} bits")
    # SVE-accelerated operations
    a = nx.randn(2048)
    b = nx.randn(2048)
    c = arm.sve_add(a, b)
// C++ SVE operations
#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#include "phynexus/hardware/arm.h"
using namespace phynexus::hardware;
float* a = new float[2048];
float* b = new float[2048];
float* c = new float[2048];
// SVE-accelerated operations
arm_sve_add(a, b, c, 2048);
arm_sve_multiply(a, b, c, 2048);
#endif
SVE Advantages
- Vector length agnostic code
- Future-proof for longer vectors
- Improved performance on Neoverse and future ARM CPUs
- Better handling of loop remainders
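The first and last points above can be illustrated with a pure-Python simulation of a vector-length-agnostic loop. This is a conceptual sketch only: `vla_add` is a hypothetical name, `vector_bits` stands in for the hardware vector length that real SVE code queries at run time, and the per-lane predicate imitates how SVE masks off inactive lanes on the final iteration.

```python
def vla_add(a, b, vector_bits, element_bits=32):
    """Add two equal-length lists the way a vector-length-agnostic loop would."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    lanes = vector_bits // element_bits
    if lanes < 1:
        raise ValueError("vector must hold at least one element")
    out = [0.0] * len(a)
    for start in range(0, len(a), lanes):
        # Predicate: True for lanes that still map to valid elements.
        # On the last iteration some lanes are inactive, so no separate
        # scalar remainder loop is needed.
        predicate = [start + lane < len(a) for lane in range(lanes)]
        for lane, active in enumerate(predicate):
            if active:
                out[start + lane] = a[start + lane] + b[start + lane]
    return out
```

The same function gives identical results for `vector_bits` of 128, 256, or 2048; only the number of loop iterations changes, which is the property that lets one binary run on cores with different vector lengths.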
ARM Compute Library
Overview
ARM Compute Library provides highly optimized functions for computer vision and machine learning:
import neurenix as nx
from neurenix.hardware import arm
# Enable ARM Compute Library
arm.set_compute_library_enabled(True)
# Operations use ACL when beneficial
conv = nx.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
input = nx.randn(1, 3, 224, 224)
output = conv(input) # Uses ACL optimized convolution
Convolution with ACL
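import neurenix as nx
from neurenix.hardware.arm import acl_conv2d
# Example tensors matching the parameters below
# (weight layout assumed to be out_channels, in_channels, kernel_height, kernel_width)
input = nx.randn(1, 3, 224, 224)
weights = nx.randn(64, 3, 3, 3)
bias = nx.randn(64)
# Explicit ACL convolution
params = {
    'batch_size': 1,
    'input_channels': 3,
    'input_height': 224,
    'input_width': 224,
    'output_channels': 64,
    'kernel_height': 3,
    'kernel_width': 3,
    'stride_height': 1,
    'stride_width': 1,
    'padding_height': 1,
    'padding_width': 1
}
output = acl_conv2d(input, weights, bias, params)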
// C++ ACL convolution
#include "phynexus/hardware/arm.h"
using namespace phynexus::hardware;
Conv2DParams params;
params.batch_size = 1;
params.input_channels = 3;
params.input_height = 224;
params.input_width = 224;
params.output_channels = 64;
params.kernel_height = 3;
params.kernel_width = 3;
params.stride_height = 1;
params.stride_width = 1;
params.padding_height = 1;
params.padding_width = 1;
arm_acl_conv2d(input_data, weights_data, bias_data, output_data, params);
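When allocating the output buffer for a call like the one above, it helps to derive the output dimensions from the same parameters. The sketch below uses the standard convolution size formula without dilation; `conv2d_output_shape` is a hypothetical helper, and the (batch, channels, height, width) ordering of the result is an assumption rather than something the Neurenix API specifies here.

```python
def conv2d_output_shape(params: dict) -> tuple:
    """Compute the (N, C, H, W) output shape of a 2D convolution without dilation."""

    def out_dim(size, kernel, stride, padding):
        # Standard formula: floor((size + 2 * padding - kernel) / stride) + 1
        return (size + 2 * padding - kernel) // stride + 1

    out_h = out_dim(params["input_height"], params["kernel_height"],
                    params["stride_height"], params["padding_height"])
    out_w = out_dim(params["input_width"], params["kernel_width"],
                    params["stride_width"], params["padding_width"])
    return (params["batch_size"], params["output_channels"], out_h, out_w)
```

With the 3x3 kernel, stride 1, and padding 1 used in this section, a 224x224 input keeps its spatial size, so the output holds 1 x 64 x 224 x 224 elements.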
ARM Ethos-U NPU
Overview
Ethos-U is ARM’s neural processing unit for edge AI, providing:
- Efficient inference for quantized models
- Low power consumption
- Integration with Cortex-M processors
from neurenix.hardware import arm
# Check Ethos-U availability
if arm.has_ethos_u():
    print("Ethos-U NPU available")
    # Compile model for Ethos-U
    model = MyModel()
    ethos_model = arm.compile_for_ethos_u(
        model,
        input_shape=(1, 3, 224, 224),
        quantization='int8'
    )
    # Run inference on Ethos-U
    output = ethos_model(input)
Quantization for Ethos-U
from neurenix.hardware import arm
from neurenix.quantization import quantize_model
# Quantize model for Ethos-U
quantized_model = quantize_model(
    model,
    calibration_data=calibration_loader,
    quantization_scheme='int8',
    target='ethos-u'
)
# Deploy to Ethos-U
ethos_model = arm.deploy_to_ethos_u(quantized_model)
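To show what int8 quantization with calibration data involves, here is a generic per-tensor affine quantization sketch. It is not a description of what `quantize_model` does internally; the function names are hypothetical, and the scheme shown (one scale and zero point per tensor, derived from the observed minimum and maximum) is only one common approach.

```python
def quant_params(min_val, max_val, qmin=-128, qmax=127):
    """Derive (scale, zero_point) mapping [min_val, max_val] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = int(round(qmin - min_val / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]
```

The calibration step corresponds to choosing `min_val` and `max_val` from representative inputs; values outside that range are clamped, which is why the calibration set should cover the inputs expected at inference time.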
Multi-Threading
import os
import neurenix as nx
# Set number of threads for ARM operations
nx.set_num_threads(4)
# Get current thread count
num_threads = nx.get_num_threads()
print(f"Using {num_threads} threads")
# Use all available cores (os.cpu_count() can return None, so fall back to 1)
nx.set_num_threads(os.cpu_count() or 1)
Thread Affinity
# Set thread affinity for better performance
nx.arm.set_thread_affinity([
    [0, 1],        # Threads 0-1 on cores 0-1 (efficiency cores)
    [2, 3, 4, 5],  # Threads 2-5 on cores 2-5 (performance cores)
])
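At the operating-system level, affinity on Linux can be inspected and set with `os.sched_getaffinity` and `os.sched_setaffinity`. The sketch below pins the caller as a whole rather than mapping individual worker threads to core groups, so it is a coarser mechanism than the call above and not a description of how Neurenix implements it; `pin_current_process` is a hypothetical helper name.

```python
import os


def pin_current_process(cores):
    """Restrict the caller to the given CPU cores, where the platform allows it.

    Returns the resulting affinity set, or None if os.sched_setaffinity is
    unavailable (for example on macOS or Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    requested = set(cores) & allowed  # drop cores the caller may not use
    if not requested:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, requested)
    return os.sched_getaffinity(0)
```

On big.LITTLE systems the core numbering of efficiency and performance clusters varies by SoC, so the core indices are best confirmed on the target device before pinning.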
Memory Management
Aligned Allocation
# Allocate aligned memory for SIMD
tensor = nx.empty(1024, dtype=nx.float32, aligned=64) # 64-byte alignment
// C++ aligned memory
void* ptr;
arm_malloc(&ptr, 1024 * sizeof(float)); // 64-byte aligned
// ... use memory ...
arm_free(ptr);
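The usual way to obtain an aligned address from an allocator that gives no alignment guarantee is to over-allocate and round the base address up. The following sketch demonstrates the idea with `ctypes`; `aligned_buffer` is a hypothetical helper and is not how `nx.empty` or `arm_malloc` are necessarily implemented.

```python
import ctypes


def aligned_buffer(num_bytes: int, alignment: int = 64):
    """Return (raw, address), where address is an aligned location inside raw.

    The raw buffer must be kept alive for as long as the address is in use.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Over-allocate so that an aligned address always falls inside the buffer.
    raw = ctypes.create_string_buffer(num_bytes + alignment - 1)
    base = ctypes.addressof(raw)
    offset = (-base) % alignment  # bytes to skip to reach the next boundary
    return raw, base + offset
```

A 64-byte boundary matches a common cache-line size and is a multiple of the 16 bytes a NEON register holds, so aligned vector loads never straddle two cache lines.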
Memory Copy
// Optimized memory copy
float* src = new float[1024];
float* dst = new float[1024];
arm_memcpy_device_to_device(dst, src, 1024 * sizeof(float));
Best Practices
- Use appropriate data types
# Use float16 where the CPU has native FP16 arithmetic (an optional extension from ARMv8.2-A onward)
model = model.half() # Convert to FP16
input = input.half()
output = model(input)
- Enable kernel fusion
nx.arm.set_fusion_enabled(True)
- Optimize tensor layout
# Use NHWC layout for better cache locality
tensor = tensor.to_nhwc() # Convert from NCHW to NHWC
- Use ARM Compute Library
nx.backends.arm_compute_lib.enabled = True
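The layout tip in the list above comes down to index arithmetic: in NHWC the channel values of one pixel are adjacent in memory, while in NCHW they are a full image plane apart. A minimal sketch with hypothetical helper names, operating on flat Python lists:

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in NCHW layout."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in NHWC layout."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(flat, N, C, H, W):
    """Reorder a flat NCHW buffer into NHWC order."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    out[nhwc_offset(n, c, h, w, C, H, W)] = flat[nchw_offset(n, c, h, w, C, H, W)]
    return out
```

For a 1 x 3 x 2 x 2 tensor, moving from channel 0 to channel 1 of the same pixel is a stride of 1 in NHWC but a stride of 4 (H x W) in NCHW, which is why per-pixel channel loops are more cache-friendly in NHWC.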
Profiling
# Profile ARM operations
with nx.arm.profiler.profile():
    output = model(input)
# Print profiling results
print(nx.arm.profiler.key_averages())
Mobile and Edge Deployment
Model Optimization
from neurenix.mobile import optimize_for_mobile
# Optimize model for ARM mobile devices
mobile_model = optimize_for_mobile(
    model,
    target='arm',
    quantization='int8',
    optimize_for_latency=True
)
# Export for deployment
mobile_model.save('model_arm.ptl')
Android Deployment
# Export for Android (ARM)
from neurenix.mobile import export_to_android
export_to_android(
    model,
    'model.pt',
    use_neon=True,
    use_arm_compute_lib=True
)
Environment Variables
# Set number of threads
export OMP_NUM_THREADS=4
export NEURENIX_NUM_THREADS=4
# Enable ARM Compute Library
export NEURENIX_ARM_USE_ACL=1
# Set thread affinity
export NEURENIX_ARM_THREAD_AFFINITY=0-3
# Enable NEON
export NEURENIX_ARM_USE_NEON=1
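The affinity value `0-3` above follows the familiar core-range notation. How Neurenix parses this variable is not documented here, so the sketch below is only an illustration of expanding such a string; `parse_core_list` is a hypothetical helper, and support for comma-separated entries is an assumption borrowed from the Linux CPU-list convention.

```python
def parse_core_list(spec: str) -> list:
    """Expand a core list such as '0-3' or '0,2,4-5' into sorted core indices."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid core range: {part!r}")
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)
```

Expanding the value before launch makes it easy to check that every listed core exists on the target device.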
Benchmarking
import neurenix as nx
from neurenix.benchmark import benchmark
# Benchmark ARM operations
results = benchmark(
    model,
    input_shape=(1, 3, 224, 224),
    device='cpu',
    num_iterations=100
)
print(f"Average latency: {results.mean_ms:.2f} ms")
print(f"Throughput: {results.throughput:.2f} fps")
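For a quick cross-check of the reported figures, mean latency and throughput can also be measured by hand with `time.perf_counter`. This is a minimal sketch rather than the implementation of `neurenix.benchmark`; `time_callable`, its `warmup` parameter, and the `BenchmarkResult` tuple are hypothetical names that mirror the fields printed above.

```python
import time
from collections import namedtuple

# mean_ms: average latency per call; throughput: calls per second
BenchmarkResult = namedtuple("BenchmarkResult", ["mean_ms", "throughput"])


def time_callable(fn, num_iterations=100, warmup=10):
    """Time repeated calls to fn() after a few untimed warm-up calls."""
    if num_iterations <= 0:
        raise ValueError("num_iterations must be positive")
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing
    start = time.perf_counter()
    for _ in range(num_iterations):
        fn()
    elapsed = time.perf_counter() - start
    mean_s = elapsed / num_iterations
    throughput = 1.0 / mean_s if mean_s > 0 else float("inf")
    return BenchmarkResult(mean_ms=mean_s * 1e3, throughput=throughput)
```

Wrapping the forward pass in a zero-argument function, for example `time_callable(lambda: model(input))`, gives numbers comparable to the ones printed above. On mobile SoCs, thermal throttling can shift results between runs, so repeated measurements are worthwhile.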