Overview
This guide covers practical techniques to optimize bun-scikit performance for your specific use cases. From choosing the right backend to data preparation strategies, these tips will help you get the most out of native Zig acceleration.

Choose the Right Backend
Tree Models: Zig vs JavaScript
bun-scikit offers both native Zig and optimized JavaScript backends for tree-based models.

When to Use Zig Backend (Default)
Best for:
- Medium to large datasets (>1000 samples)
- Deep trees (maxDepth > 5)
- Random forests with many estimators (nEstimators > 50)
- Production workloads prioritizing throughput

Measured speedups:
- DecisionTree fit: 1.82x faster than JS
- RandomForest fit: 2.65x faster than JS
- RandomForest predict: 2.26x faster than JS
When to Use JavaScript Backend
Best for:
- Small datasets (<500 samples)
- Shallow trees (maxDepth ≤ 3)
- Single decision trees (lower overhead)
- Development without building native code
Advantages:
- DecisionTree predict: 1.6x faster than Zig on small data
- Lower FFI overhead for simple models
- Faster startup time
Benchmark both backends with your specific data. The optimal choice depends on dataset size, tree depth, and whether you’re optimizing for training or inference.
Linear Models: Native Required
LinearRegression and LogisticRegression require the native Zig kernels.
Data Preparation
Use Typed Arrays
Native kernels work directly with typed arrays for zero-copy data transfer. bun-scikit automatically converts regular arrays to typed arrays when needed, but pre-converting your data eliminates this overhead.
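A standalone sketch of the recommended conversion (plain TypeScript, no bun-scikit calls; `toContiguous` is an illustrative helper, not a library API):

```typescript
// Flatten a row-major number[][] into one contiguous Float64Array.
// Native kernels can read such a buffer without copying; nested
// plain arrays would otherwise be converted on every fit/predict.
function toContiguous(rows: number[][]): { data: Float64Array; nRows: number; nCols: number } {
  const nRows = rows.length;
  const nCols = nRows > 0 ? rows[0].length : 0;
  const data = new Float64Array(nRows * nCols);
  for (let i = 0; i < nRows; i++) {
    data.set(rows[i], i * nCols); // row i occupies [i*nCols, (i+1)*nCols)
  }
  return { data, nRows, nCols };
}

const X = toContiguous([[1, 2], [3, 4], [5, 6]]);
// X.data is one contiguous buffer: Float64Array [1, 2, 3, 4, 5, 6]
```

Doing this once up front, rather than per call, is where the saving comes from.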
Contiguous Memory Layout
Ensure data is contiguous for optimal native performance.

Batch Predictions
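The pattern can be sketched with stand-in functions (`predictBatch` here is illustrative, not a bun-scikit API; the point is one call over N samples instead of N calls):

```typescript
// Placeholder "model": returns the first feature of each row.
// With a native backend, each call crosses the FFI boundary, so
// one batched call pays the fixed overhead once.
function predictBatch(flat: Float64Array, nCols: number): Float64Array {
  const out = new Float64Array(flat.length / nCols);
  for (let i = 0; i < out.length; i++) {
    out[i] = flat[i * nCols];
  }
  return out;
}

const features = new Float64Array([1, 2, 3, 4, 5, 6]); // 3 samples x 2 features

// Preferred: one call over the whole batch.
const batched = predictBatch(features, 2);

// Avoid: one call per sample inside a loop.
const looped = new Float64Array(3);
for (let i = 0; i < 3; i++) {
  looped[i] = predictBatch(features.subarray(i * 2, i * 2 + 2), 2)[0];
}
// Both produce [1, 3, 5]; the batched form crosses the boundary once.
```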
Predict on multiple samples at once to amortize overhead.

Model Configuration
Random Forests: Tune Estimators
Balance accuracy and speed by adjusting nEstimators:
Small Datasets (<1000 samples)
- 50 estimators provide good accuracy
- Training time: ~15-30ms
- Diminishing returns beyond 100 estimators
Medium Datasets (1000-10000 samples)
- 100 estimators balance accuracy and speed
- Training time: ~50-200ms
- Consider parallel processing for larger values
Large Datasets (>10000 samples)
- 200+ estimators for maximum accuracy
- Native Zig backend essential (6.4x speedup)
- Training time: varies widely by data size
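The sizing guidance above can be condensed into a starting-point helper (thresholds taken from this guide; `pickNEstimators` is illustrative, not a bun-scikit API):

```typescript
// Map dataset size to a starting nEstimators value per the guidance
// above; tune from here with cross-validation.
function pickNEstimators(nSamples: number): number {
  if (nSamples < 1000) return 50;    // small: diminishing returns past 100
  if (nSamples <= 10000) return 100; // medium: accuracy/speed balance
  return 200;                        // large: favor accuracy, use Zig backend
}

pickNEstimators(500);   // → 50
pickNEstimators(5000);  // → 100
pickNEstimators(50000); // → 200
```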
Decision Trees: Limit Depth
Control tree complexity to prevent overfitting and improve speed:
- maxDepth: 5 - very fast, risk of underfitting
- maxDepth: 8 - good balance (recommended)
- maxDepth: 15 - high accuracy, slower, risk of overfitting
- maxDepth: null - unlimited (use with caution)
Logistic Regression: Solver Selection
Choose the appropriate solver for your data. Native Zig gradient descent (solver: "gd") provides a 2.5x speedup over Python scikit-learn. Use it for datasets with >1000 samples.

Preprocessing Optimization
Reuse Scalers
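The pattern looks like this (a self-contained sketch with a minimal single-feature standard scaler; bun-scikit's actual scaler API may differ):

```typescript
// Minimal standard scaler: fit computes mean/std once; transform
// reuses them, so train and test data share one consistent scaling.
class MiniStandardScaler {
  private mean = 0;
  private std = 1;
  fit(xs: number[]): this {
    this.mean = xs.reduce((a, b) => a + b, 0) / xs.length;
    const variance = xs.reduce((a, b) => a + (b - this.mean) ** 2, 0) / xs.length;
    this.std = Math.sqrt(variance) || 1;
    return this;
  }
  transform(xs: number[]): number[] {
    return xs.map((x) => (x - this.mean) / this.std);
  }
}

const scaler = new MiniStandardScaler().fit([1, 2, 3, 4, 5]); // fit once
const trainScaled = scaler.transform([1, 2, 3, 4, 5]); // reuse ...
const testScaled = scaler.transform([6, 7]);           // ... many times
```

Refitting on the test set would both waste work and leak test statistics into the transform.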
Fit scalers once, transform multiple times.

Pipeline for Efficiency
Combine preprocessing and modeling in pipelines:
- Automatic data flow between steps
- No intermediate array allocations
- Cleaner code, fewer errors
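The idea can be sketched as function composition (illustrative only; bun-scikit's actual pipeline API may differ):

```typescript
// A pipeline chains transforms into one call: each step feeds the
// next, with no named intermediate arrays in user code.
type Step = (xs: number[]) => number[];

const pipeline = (...steps: Step[]): Step =>
  (xs) => steps.reduce((acc, step) => step(acc), xs);

const scaleBy2: Step = (xs) => xs.map((x) => x * 2);
const shiftBy1: Step = (xs) => xs.map((x) => x + 1);

const model = pipeline(scaleBy2, shiftBy1);
model([1, 2, 3]); // → [3, 5, 7]
```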
Memory Management
Avoid Memory Leaks
Models with native handles are automatically cleaned up, but be aware of the model lifecycle.

Large Dataset Strategies
For datasets that strain memory:

Use Incremental Learning
Train in chunks with partialFit, supported by SGDClassifier and SGDRegressor.
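Chunked training can be sketched like this (`partialFit` is named in this guide; the chunk helper and the stand-in model are illustrative):

```typescript
// Yield fixed-size chunks of a large dataset so each partialFit
// call sees only a slice, keeping peak memory low.
function* chunks<T>(items: T[], size: number): Generator<T[]> {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Stand-in for an SGD-style model exposing partialFit.
const model = {
  seen: 0,
  partialFit(batch: number[]) { this.seen += batch.length; },
};

const data = Array.from({ length: 10_000 }, (_, i) => i);
for (const batch of chunks(data, 1_000)) {
  model.partialFit(batch); // 10 calls of 1,000 samples each
}
```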
Subsample for Prototyping
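During development, fitting on a random subsample keeps iteration fast; train on the full dataset once hyperparameters look right. A sketch (plain TypeScript; sampling without replacement via a partial Fisher-Yates shuffle):

```typescript
// Draw n distinct rows at random for fast prototype fits.
function subsample<T>(rows: T[], n: number): T[] {
  const copy = rows.slice();
  for (let i = 0; i < Math.min(n, copy.length); i++) {
    const j = i + Math.floor(Math.random() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]]; // partial Fisher-Yates
  }
  return copy.slice(0, n);
}

const full = Array.from({ length: 100_000 }, (_, i) => i);
const proto = subsample(full, 5_000); // fit on 5k rows while iterating
```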
Benchmarking Your Code
Measure What Matters
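A minimal timing harness (performance.now is available in Bun and Node; the workload below is a stand-in for a real fit or predict call):

```typescript
// Time a function over several runs and report the median, which is
// more robust to GC pauses and warm-up than a single measurement.
function timeMedianMs(fn: () => void, runs = 9): number {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(runs / 2)];
}

// Stand-in workload; replace with model.fit(X, y) or model.predict(X).
const ms = timeMedianMs(() => {
  let acc = 0;
  for (let i = 0; i < 100_000; i++) acc += Math.sqrt(i);
});
```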
Use precise timing for optimization decisions.

Compare Backends
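A generic harness for comparing two implementations of the same operation. How the backend is actually switched (e.g. the BUN_SCIKIT_TREE_BACKEND variable mentioned in this guide, which may require separate processes) is an assumption; the timing logic itself is library-agnostic:

```typescript
// Time two candidates on identical input; report the best-of-N time
// for each so you can compare backends fairly.
function compareMs(a: () => void, b: () => void, runs = 5): { aMs: number; bMs: number } {
  const time = (fn: () => void) => {
    let best = Infinity;
    for (let i = 0; i < runs; i++) {
      const t0 = performance.now();
      fn();
      best = Math.min(best, performance.now() - t0);
    }
    return best;
  };
  return { aMs: time(a), bMs: time(b) };
}

// Stand-in workloads; with real models, swap in zig- and js-backed fits.
const work = (n: number) => () => { let s = 0; for (let i = 0; i < n; i++) s += i; };
const { aMs, bMs } = compareMs(work(100_000), work(200_000));
```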
Test both Zig and JS backends with your data.

Run Official Benchmarks
Compare your results against CI benchmarks.

Production Deployment
Verify Native Acceleration
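Assuming fitted models expose the fitBackend_ field referenced in this guide's checklist, a startup guard might look like this (the field name and the "zig" value are assumptions to verify against your installed version):

```typescript
// Fail fast at startup if the native backend did not load, rather
// than silently serving slower JS predictions in production.
// `fitBackend_` and the value "zig" are assumptions from this guide.
interface FittedModel { fitBackend_?: string }

function assertNativeBackend(model: FittedModel): void {
  if (model.fitBackend_ !== "zig") {
    throw new Error(`expected zig backend, got ${model.fitBackend_ ?? "unknown"}`);
  }
}

assertNativeBackend({ fitBackend_: "zig" }); // passes with a stubbed model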
Confirm native kernels are active in production.

Environment Configuration
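A minimal configuration fragment (the variable name comes from this guide; treat the exact semantics as something to confirm against the library's docs):

```shell
# Force the native tree backend rather than relying on auto-detection.
export BUN_SCIKIT_TREE_BACKEND=zig
```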
Set production settings explicitly rather than relying on defaults.

Model Serialization
Save trained models to avoid refitting. Model serialization APIs are under development; the current approach requires manual state management.
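Until serialization lands, manual state management can be as simple as persisting hyperparameters and whatever learned state your model exposes as JSON. Everything below is illustrative, not a bun-scikit API:

```typescript
// Round-trip a plain record of hyperparameters and learned values.
// What counts as "state" depends on the model; write `json` to disk
// in practice and rebuild the model from `restored` at startup.
interface SavedModel {
  params: { nEstimators: number; maxDepth: number };
  coefficients: number[];
}

const saved: SavedModel = {
  params: { nEstimators: 100, maxDepth: 8 },
  coefficients: [0.5, -1.2],
};

const json = JSON.stringify(saved);
const restored: SavedModel = JSON.parse(json);
```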
Performance Checklist
Data Preparation
- Use Float64Array for features and targets
- Ensure contiguous memory layout
- Batch predictions instead of looping
- Reuse fitted scalers for multiple transforms
- Consider pipelines for multi-step workflows
Model Configuration
- Choose appropriate nEstimators for the dataset size
- Set a reasonable maxDepth to prevent overfitting
- Use the native Zig backend for medium/large data
- Select the optimal solver for linear models
- Tune hyperparameters with cross-validation
Runtime Environment
- Verify native kernels are active (fitBackend_)
- Use prebuilt binaries (Linux/Windows)
- Build locally on macOS (bun run native:build)
- Set BUN_SCIKIT_TREE_BACKEND=zig explicitly
- Benchmark both backends with your data
Production Deployment
- Confirm native acceleration in production logs
- Use appropriate environment variables
- Monitor fit/predict times
- Cache trained models when possible
- Test performance regression in CI
Next Steps
Benchmarks
Review detailed performance comparisons
Native Runtime
Deep dive into Zig acceleration