Profiling helps identify performance bottlenecks in native code, enabling you to optimize CPU usage, memory allocation, and overall application performance.
The NDK and Android platform provide several profiling tools:
- Simpleperf - CPU profiling tool for native code, part of the NDK
- Android Studio Profiler - Visual profiling with native support
- Perfetto/Systrace - System-wide performance tracing
- Heapprofd - Native memory profiling
Start with Android Studio Profiler for quick insights, then use Simpleperf for detailed CPU analysis.
Preparing for profiling
Enable profiling in your build
In build.gradle:
android {
    buildTypes {
        release {
            // Enable profiling in release builds
            debuggable false
            minifyEnabled true
            profileable true // Android 10+ (API level 29)
        }
    }
}
For CMake builds:
# Keep frame pointers for better stack traces
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -fno-omit-frame-pointer")
# Add debug symbols without optimization reduction
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -g")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -g")
Frame pointers slightly increase binary size but provide much better profiling data.
CPU profiling with Simpleperf
Simpleperf is a command-line profiling tool that uses the CPU’s performance monitoring unit (PMU).
Installing Simpleperf
# Simpleperf is included in the NDK
cd $NDK_PATH/simpleperf
# Or download standalone version
git clone https://android.googlesource.com/platform/system/extras
cd extras/simpleperf
Recording CPU profile
Push Simpleperf to device
adb push $NDK_PATH/simpleperf/bin/android/arm64/simpleperf /data/local/tmp/
adb shell chmod +x /data/local/tmp/simpleperf
Record profile data
# Profile the entire app
adb shell /data/local/tmp/simpleperf record -p $(adb shell pidof your.package.name) -o /data/local/tmp/perf.data
# Profile for specific duration
adb shell /data/local/tmp/simpleperf record --duration 10 -p $(adb shell pidof your.package.name) -o /data/local/tmp/perf.data
# Profile with call graph
adb shell /data/local/tmp/simpleperf record -g -p $(adb shell pidof your.package.name) -o /data/local/tmp/perf.data
Pull profile data
adb pull /data/local/tmp/perf.data .
Generate report
# Text report
$NDK_PATH/simpleperf/report.py -i perf.data
# Generate flamegraph (requires FlameGraph scripts)
$NDK_PATH/simpleperf/report.py -i perf.data -g | FlameGraph/flamegraph.pl > flame.svg
# Interactive HTML report
$NDK_PATH/simpleperf/report_html.py -i perf.data
Interpreting Simpleperf output
Text report shows function-level CPU usage:
Overhead  Command  Shared Object  Symbol
45.23%    myapp    libmyapp.so    [.] processData
23.45%    myapp    libmyapp.so    [.] calculateResult
12.34%    myapp    libc.so        [.] memcpy
8.90%     myapp    libmyapp.so    [.] render
- Overhead - Percentage of CPU time spent in this function
- Shared Object - The library or executable containing the function
- Symbol - Function name (symbolicated if debug symbols are available)
Focus optimization efforts on functions with high overhead percentages.
Advanced Simpleperf options
# Profile specific events
adb shell /data/local/tmp/simpleperf record -e cpu-cycles,cache-misses -p PID
# Sample at higher frequency (default: 4000 Hz)
adb shell /data/local/tmp/simpleperf record -f 8000 -p PID
# Profile only specific thread
adb shell /data/local/tmp/simpleperf record -t TID
# Record with symbols (copies libraries from device)
$NDK_PATH/simpleperf/app_profiler.py -p your.package.name
Profiling with Android Studio
CPU profiler
Open the Profiler
View > Tool Windows > Profiler
Start CPU recording
Click CPU timeline, then click Record. Choose:
- Java/Kotlin Method Trace - For Java/Kotlin profiling
- System Trace - For native and system profiling
- Sampled (Native) - For native code sampling
Perform operations
Interact with your app to trigger the code you want to profile.
Stop and analyze
Click Stop. The profiler displays:
- Flame chart - Visualize call stack over time
- Top Down/Bottom Up - Function call hierarchy
- Call Chart - Timeline of function calls
Memory profiler
Profile native memory allocations:
- Open Memory Profiler
- Click Record native allocations
- Perform operations
- Stop recording
- Analyze allocation call stacks
Native memory profiling requires Android 10+ (API level 29) and a profileable or debuggable app.
System-wide tracing with Perfetto
Perfetto (successor to systrace) provides system-wide performance traces.
Recording a trace
Using command line
# Record 10-second trace
adb shell perfetto -o /data/misc/perfetto-traces/trace.perfetto-trace \
-t 10s sched freq idle am wm gfx view binder_driver hal dalvik camera input res memory
# Pull trace
adb pull /data/misc/perfetto-traces/trace.perfetto-trace .
Using System Tracing app
- Open System Tracing from Developer options (it is built into Android 9 and higher; no separate install is needed)
- Tap Record trace (or use the Quick Settings tile)
- Select categories and duration
- Perform operations in your app
- Stop recording and share trace file
Analyzing traces
Open trace at ui.perfetto.dev:
- View thread activity over time
- Identify frame drops and jank
- Analyze scheduling and CPU usage
- Inspect native function calls
Use the search function to find specific events or thread names.
Adding custom trace points
Native tracing with ATrace
#include <android/trace.h>

void processData() {
    // Start trace section (ATrace APIs require API level 23+)
    ATrace_beginSection("ProcessData");
    for (int i = 0; i < size; i++) {
        ATrace_beginSection("ProcessItem");
        processItem(i);
        ATrace_endSection();
    }
    ATrace_endSection();
}
Add to CMakeLists.txt:
find_library(android-lib android)
target_link_libraries(your-app ${android-lib})
Scoped tracing helper
class ScopedTrace {
public:
    explicit ScopedTrace(const char* name) {
        ATrace_beginSection(name);
    }
    ~ScopedTrace() {
        ATrace_endSection();
    }
};

// Use with RAII
void myFunction() {
    ScopedTrace trace("myFunction");
    // Automatically ends when trace goes out of scope
}
Identifying performance bottlenecks
CPU bottlenecks
Look for:
- Functions with high overhead in Simpleperf
- Long-running operations blocking UI thread
- Inefficient algorithms (O(n²) when O(n log n) possible)
// Bad: O(n²) algorithm
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        if (array[i] == array[j] && i != j) {
            // Found duplicate
        }
    }
}

// Good: O(n) using hash set
std::unordered_set<int> seen;
for (int i = 0; i < n; i++) {
    if (seen.count(array[i])) {
        // Found duplicate
    }
    seen.insert(array[i]);
}
Memory bottlenecks
Look for:
- Frequent allocations in hot paths
- Memory leaks (growing memory usage)
- Cache misses
// Bad: Allocating in loop
for (int i = 0; i < iterations; i++) {
    float* temp = new float[size];
    process(temp);
    delete[] temp;
}

// Good: Reuse allocation
float* temp = new float[size];
for (int i = 0; i < iterations; i++) {
    process(temp);
}
delete[] temp;
I/O bottlenecks
Look for:
- File operations on main thread
- Synchronous network calls
- Excessive logging
Never perform I/O operations in audio or rendering callbacks - they must complete in microseconds.
Optimization techniques
Use NEON SIMD instructions
#include <arm_neon.h>

// Multiply arrays with NEON (processes 4 floats at once)
void multiplyArraysNEON(const float* a, const float* b, float* result, int count) {
    int i = 0;
    for (; i <= count - 4; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vresult = vmulq_f32(va, vb);
        vst1q_f32(result + i, vresult);
    }
    // Handle remaining elements
    for (; i < count; i++) {
        result[i] = a[i] * b[i];
    }
}
Enable compiler optimizations
# Use -O3 for maximum optimization
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3")
# Enable link-time optimization
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
Reduce memory allocations
// Use object pooling for frequently created objects
// Note: not thread-safe; add a mutex if the pool is shared across threads
class ObjectPool {
public:
    ~ObjectPool() {
        // The pool owns idle objects; free them on shutdown
        for (Object* obj : pool) {
            delete obj;
        }
    }
    Object* acquire() {
        if (!pool.empty()) {
            Object* obj = pool.back();
            pool.pop_back();
            return obj;
        }
        return new Object();
    }
    void release(Object* obj) {
        obj->reset();
        pool.push_back(obj);
    }
private:
    std::vector<Object*> pool;
};
Cache-friendly data structures
// Bad: Array of structures (poor cache locality)
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float r, g, b, a;
};
Particle particles[10000];

// Good: Structure of arrays (better cache locality)
struct ParticleSystem {
    float x[10000], y[10000], z[10000];
    float vx[10000], vy[10000], vz[10000];
    float r[10000], g[10000], b[10000], a[10000];
};
Structure of arrays (SoA) often improves performance when processing large amounts of data.
Benchmarking
Measure performance consistently:
#include <chrono>
class Benchmark {
public:
    void start() {
        startTime = std::chrono::steady_clock::now();
    }
    double elapsedMs() {
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(end - startTime).count();
    }
private:
    // steady_clock is monotonic; high_resolution_clock may alias a
    // non-monotonic clock on some platforms
    std::chrono::steady_clock::time_point startTime;
};
// Usage
Benchmark bench;
bench.start();
processData();
LOGD("Processing took %.2f ms", bench.elapsedMs());
Automated benchmarking
Use Google Benchmark library:
#include <benchmark/benchmark.h>

static void BM_ProcessData(benchmark::State& state) {
    // Setup
    std::vector<int> data(state.range(0));
    // Benchmark loop
    for (auto _ : state) {
        processData(data.data(), data.size());
    }
    // Report throughput
    state.SetItemsProcessed(state.iterations() * state.range(0));
}

BENCHMARK(BM_ProcessData)->Range(8, 8 << 10);
BENCHMARK_MAIN();
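The benchmark also needs to be linked into a binary. One possible CMake fragment, assuming Google Benchmark has been installed or pulled in via `FetchContent` (the target and file names are placeholders):

```cmake
# Assumes Google Benchmark is already available to the build
find_package(benchmark REQUIRED)

add_executable(my-benchmarks benchmarks.cc)
# BENCHMARK_MAIN() supplies main(), so only the library link is needed
target_link_libraries(my-benchmarks benchmark::benchmark)
```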
Best practices
- Profile on real devices - Emulator performance doesn’t match real hardware
- Profile release builds - Debug builds can be 10x slower
- Profile representative workloads - Test with realistic data and usage patterns
- Use frame pointers - Enable for better stack traces in profiling
- Focus on hot paths - Optimize code that runs frequently
- Measure before and after - Verify optimizations actually improve performance
- Consider battery impact - Balance performance with power consumption
- Test on low-end devices - Ensure acceptable performance on minimum-spec devices
Premature optimization is the root of all evil. Profile first, then optimize the actual bottlenecks.
Additional resources