
Overview

Local mode runs all Lucille components (Runner, Worker, Indexer) inside a single JVM process. This deployment mode is ideal for:
  • Development and testing - Quick iteration without external dependencies
  • Small-scale ingestion - Processing datasets that fit within single-machine resources
  • Proof of concept - Evaluating Lucille before scaling to distributed mode
  • Simple use cases - When throughput requirements don’t demand horizontal scaling
Local mode uses in-memory queues for inter-component communication. No external message broker (such as Kafka) is required.

Architecture

In local mode, the Runner launches Worker and Indexer threads within the same JVM:
┌─────────────────────────────────────┐
│         Single JVM Process          │
│                                     │
│  ┌────────────┐                     │
│  │  Runner    │ (Main Thread)       │
│  │ + Connector│                     │
│  └─────┬──────┘                     │
│        │                            │
│        ├─→ In-Memory Queues         │
│        │                            │
│  ┌─────▼──────┐   ┌─────────────┐   │
│  │  Worker    │   │   Indexer   │   │
│  │  Thread(s) │   │   Thread    │   │
│  └────────────┘   └─────────────┘   │
└─────────────────────────────────────┘
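
The flow in the diagram can be sketched with standard java.util.concurrent primitives. This is only an illustration of a single-JVM, queue-connected design and is not Lucille's actual implementation: the class name LocalModeSketch, the use of plain strings as documents, the queue capacity of 100, and the end-of-stream marker are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class LocalModeSketch {
  // Hypothetical end-of-stream marker, compared by identity so no real document can match it.
  private static final String EOF = new String("__EOF__");

  /** Runs a connector -> worker -> indexer flow on threads in this JVM and returns what was "indexed". */
  static List<String> run(List<String> source) {
    BlockingQueue<String> toWorker = new LinkedBlockingQueue<>(100);   // bounded, so producers block when full
    BlockingQueue<String> toIndexer = new LinkedBlockingQueue<>(100);
    List<String> indexed = new ArrayList<>();

    Thread worker = new Thread(() -> {
      try {
        for (String doc = toWorker.take(); doc != EOF; doc = toWorker.take()) {
          toIndexer.put(doc.toUpperCase(Locale.ROOT));   // stand-in for pipeline stages
        }
        toIndexer.put(EOF);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "worker");

    Thread indexer = new Thread(() -> {
      try {
        for (String doc = toIndexer.take(); doc != EOF; doc = toIndexer.take()) {
          indexed.add(doc);   // stand-in for sending documents to the search backend
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "indexer");

    worker.start();
    indexer.start();
    try {
      for (String doc : source) {
        toWorker.put(doc);   // the calling thread plays the connector
      }
      toWorker.put(EOF);
      worker.join();
      indexer.join();        // join() makes the indexer's writes visible to this thread
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while running the sketch", e);
    }
    return indexed;
  }

  public static void main(String[] args) {
    System.out.println(run(List.of("a", "b", "c")));   // prints [A, B, C]
  }
}
```

Bounded queues are used here so that a fast producer blocks instead of filling the heap; whether and how Lucille bounds its own queues is not covered by this sketch.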

Step 1: Prepare Configuration

Create a configuration file (for example, application.conf) defining your connector, pipeline, and indexer:
connectors: [
  {
    class: "com.kmwllc.lucille.connector.FileConnector",
    paths: ["data/input.csv"],
    name: "file_connector",
    pipeline: "my_pipeline"
    fileHandlers: {
      csv: { }
    }
  }
]

pipelines: [
  {
    name: "my_pipeline",
    stages: [
      {
        class: "com.kmwllc.lucille.stage.RenameFields"
        fieldMapping {
          "old_name" : "new_name"
        }
      }
    ]
  }
]

indexer {
  type: "Solr"
  batchSize: 100
  batchTimeout: 100
}

solr {
  useCloudClient: true
  defaultCollection: "my_collection"
  url: ["http://localhost:8983/solr"]
}

Step 2: Run Lucille

Execute the Runner class from the command line:
java \
  -Dconfig.file=path/to/application.conf \
  -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
  com.kmwllc.lucille.core.Runner

Command Breakdown:
  • -Dconfig.file - Path to your configuration file
  • -cp - Classpath including Lucille JAR and dependencies
  • com.kmwllc.lucille.core.Runner - Main class; with no arguments it runs in local mode

    Step 3: Monitor Progress

    Lucille outputs real-time metrics to the console:
    25/10/31 13:40:21 6790d2e9-1079  INFO WorkerPool: 27017 docs processed. 
      One minute rate: 1787.10 docs/sec. Mean pipeline latency: 10.63 ms/doc.
    
    25/10/31 13:40:22 6790d2e9-1079  INFO Indexer: 17016 docs indexed. 
      One minute rate: 455.07 docs/sec. Mean backend latency: 6.90 ms/doc.
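
To track these numbers from a script, the log lines can be parsed with a regular expression. The sketch below rests on assumptions: it presumes each metrics entry arrives as a single line in exactly the format shown above (the sample is wrapped for display), and the class, method, and field names are hypothetical rather than part of Lucille.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetricsLineParser {
  /** Component name, cumulative document count, and one-minute rate taken from one log line. */
  static final class Metrics {
    final String component;
    final long docs;
    final double docsPerSec;

    Metrics(String component, long docs, double docsPerSec) {
      this.component = component;
      this.docs = docs;
      this.docsPerSec = docsPerSec;
    }

    @Override
    public String toString() {
      return component + ": " + docs + " docs, " + docsPerSec + " docs/sec";
    }
  }

  // Matches e.g. "INFO WorkerPool: 27017 docs processed. One minute rate: 1787.10 docs/sec"
  private static final Pattern LINE = Pattern.compile(
      "INFO\\s+(\\w+):\\s+(\\d+) docs (?:processed|indexed)\\.\\s+One minute rate: ([\\d.]+) docs/sec");

  /** Returns the parsed metrics, or an empty Optional if the line is not a metrics line. */
  static Optional<Metrics> parse(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new Metrics(
        m.group(1), Long.parseLong(m.group(2)), Double.parseDouble(m.group(3))));
  }

  public static void main(String[] args) {
    String sample = "25/10/31 13:40:21 6790d2e9-1079  INFO WorkerPool: 27017 docs processed. "
        + "One minute rate: 1787.10 docs/sec. Mean pipeline latency: 10.63 ms/doc.";
    parse(sample).ifPresent(System.out::println);
  }
}
```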
    
    Step 4: Verify Completion

    Upon completion, Lucille prints a run summary:
    25/10/31 13:46:47  INFO Runner: 
    RUN SUMMARY: Success. 1/1 connectors complete. 
      All published docs succeeded.
    connector1: complete. 200000 docs succeeded. 
      0 docs failed. 0 docs dropped. Time: 416.47 secs.
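
The summary figures also give a quick throughput estimate for capacity planning. As a worked example using the sample numbers above (200000 documents in 416.47 seconds), the average rate is about 480 docs/sec, so one million documents would take roughly 2082 seconds (about 35 minutes). This assumes throughput stays constant as the dataset grows, which real workloads may not satisfy; the class and method names below are hypothetical.

```java
final class ThroughputEstimate {
  /** Average end-to-end throughput in documents per second. */
  static double docsPerSecond(long docs, double seconds) {
    if (seconds <= 0) {
      throw new IllegalArgumentException("seconds must be positive");
    }
    return docs / seconds;
  }

  /** Projected run time in seconds for a document count at a steady rate. */
  static double projectedSeconds(long docs, double docsPerSecond) {
    if (docsPerSecond <= 0) {
      throw new IllegalArgumentException("docsPerSecond must be positive");
    }
    return docs / docsPerSecond;
  }

  public static void main(String[] args) {
    double rate = docsPerSecond(200_000, 416.47);   // figures from the sample run summary
    System.out.println("average rate (docs/sec): " + rate);
    System.out.println("projected secs for 1M docs: " + projectedSeconds(1_000_000, rate));
  }
}
```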
    

    Thread Configuration

    Local mode creates these threads:
    1. Main Thread - Launches components and monitors completion
    2. Connector Thread - Reads source data and publishes documents
    3. Worker Thread(s) - Process documents through pipeline stages
    4. Indexer Thread - Batches and sends documents to destination

    Configuring Worker Threads

    By default, Lucille creates one worker thread per CPU core. Override this in your config:
    worker {
      numThreads: 4  # Explicitly set worker thread count
    }
    
    Setting numThreads too high can cause memory pressure and thread contention. Start conservatively and tune based on profiling.
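
One conservative starting point is to derive the thread count from the number of cores the JVM reports and then cap it. The helper below is a hypothetical sketch, not a Lucille API; reserving one core for the connector and indexer threads and capping at 8 are assumptions to revisit after profiling.

```java
final class WorkerThreadHint {
  /** Suggests a worker thread count: available cores minus one reserved core, clamped to [1, cap]. */
  static int suggest(int availableCores, int cap) {
    if (cap < 1) {
      throw new IllegalArgumentException("cap must be at least 1");
    }
    int candidate = availableCores - 1;   // assumption: leave a core for connector and indexer threads
    return Math.max(1, Math.min(candidate, cap));
  }

  public static void main(String[] args) {
    int cores = Runtime.getRuntime().availableProcessors();
    // Prints a config fragment using the numThreads key shown above.
    System.out.println("worker { numThreads: " + suggest(cores, 8) + " }");
  }
}
```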

    Use Cases

    Development and Testing

    Best For:
    • Writing and debugging custom stages
    • Testing pipeline configurations
    • Validating connector behavior
    • Integration tests in CI/CD
    Example:
    # Quick test with small dataset
    java -Dconfig.file=test.conf \
      -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
      com.kmwllc.lucille.core.Runner
    

    Small-Scale Production Workloads

    Best For:
    • Periodic batch jobs (under 1M documents)
    • Non-time-critical ingestion
    • Single-source ETL pipelines
    • Resource-constrained environments
    Example:
    # Nightly batch job: crontab entry that runs daily at 02:00
    0 2 * * * /usr/local/bin/run_lucille_local.sh
    

    Limitations

    Local mode has important constraints that make it unsuitable for large-scale production deployments.

    Single Point of Failure

    If the JVM crashes or the process is killed, all in-flight work is lost. There is no recovery mechanism.

    Memory Constraints

    All components share the same heap:
    • In-memory queues hold documents between stages
    • Large documents or deep queues can cause OutOfMemoryErrors
    • Worker threads and indexer batches compete for heap space
    Mitigation:
    # Increase heap size for larger workloads
    java -Xmx8g -Xms4g \
      -Dconfig.file=application.conf \
      -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
      com.kmwllc.lucille.core.Runner
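
Before raising -Xmx, it can help to estimate how much heap the queues and batches might hold. The following back-of-the-envelope sketch is illustrative only: the queue depth, average document size, and 50% headroom threshold are made-up inputs, the names are hypothetical, and real per-document overhead in the JVM is typically higher than the raw payload size.

```java
final class HeapBudget {
  /** Rough bytes held in flight: documents waiting in queues plus documents in the indexer batch. */
  static long inFlightBytes(long queuedDocs, long batchSize, long avgDocBytes) {
    return Math.multiplyExact(Math.addExact(queuedDocs, batchSize), avgDocBytes);
  }

  /** True if the in-flight estimate stays within the given fraction of the maximum heap. */
  static boolean fits(long inFlightBytes, long maxHeapBytes, double fraction) {
    return inFlightBytes <= maxHeapBytes * fraction;
  }

  public static void main(String[] args) {
    // Hypothetical workload: 10,000 queued documents of about 50 KiB each, indexer batch of 100.
    long estimate = inFlightBytes(10_000, 100, 50 * 1024);
    long maxHeap = Runtime.getRuntime().maxMemory();   // reflects the -Xmx setting
    System.out.println("in-flight estimate: " + estimate / (1024 * 1024) + " MiB");
    System.out.println("fits in half the heap: " + fits(estimate, maxHeap, 0.5));
  }
}
```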
    

    No Horizontal Scaling

    You cannot add more machines to increase throughput. Performance is bounded by:
    • Single-machine CPU cores (limits worker parallelism)
    • Single-machine memory (limits queue depth and batch sizes)
    • Single-machine network I/O (limits indexing throughput)

    Limited Observability

    Metrics are logged to console only. There is no:
    • Centralized metrics collection
    • Distributed tracing
    • External monitoring integration

    Validation and Testing

    Lucille provides a validation mode to check configurations before running:
    java -Dconfig.file=application.conf \
      -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
      com.kmwllc.lucille.core.Runner \
      -validate
    
    Output:
    Pipeline Configuration is valid.
    Connector Configuration is valid.
    Indexer Configuration is valid.
    
    Always validate configurations in CI/CD pipelines to catch errors before deployment.
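
A CI job could wrap the validation command in a small launcher. The sketch below assembles the same command shown above and runs it with ProcessBuilder. The class name is hypothetical, and treating a non-zero exit code as failure is an assumption: this page does not say how the Runner reports an invalid configuration, so a real job may also need to inspect the printed output.

```java
import java.util.List;

final class ValidateInCi {
  /** Builds the validation command for a given config file and classpath. */
  static List<String> command(String configPath, String classpath) {
    return List.of(
        "java",
        "-Dconfig.file=" + configPath,
        "-cp", classpath,   // no shell needed: the java launcher expands the lib/* wildcard itself
        "com.kmwllc.lucille.core.Runner",
        "-validate");
  }

  public static void main(String[] args) throws Exception {
    List<String> cmd = command(
        "application.conf",
        "lucille-core/target/lucille.jar:lucille-core/target/lib/*");
    Process process = new ProcessBuilder(cmd).inheritIO().start();
    int exitCode = process.waitFor();
    // Assumption: non-zero means something went wrong; also review the output above.
    System.exit(exitCode);
  }
}
```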

    Graceful Shutdown

    Local mode handles SIGINT (Ctrl+C) gracefully:
    // Excerpt from Runner.java:212-218 (Signal here is sun.misc.Signal)
    Signal.handle(new Signal("INT"), signal -> {
      if (state != null) {
        log.info("Runner attempting clean shutdown after receiving INT signal");
        state.close();  // Stops connector, workers, indexer
      }
      SystemHelper.exit(0);
    });
    
    This ensures:
    1. Connector stops producing new documents
    2. Workers finish processing in-flight documents
    3. Indexer flushes final batch
    4. Connections are closed cleanly
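
The same ordering can be illustrated with plain JDK facilities. The sketch below is not Lucille's code: it uses a hypothetical ShutdownSketch class, with an ExecutorService standing in for the worker pool and a list standing in for the indexer batch, to show the sequence of stopping intake, draining in-flight work, and flushing. The 10-second drain timeout is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownSketch {
  private final AtomicBoolean accepting = new AtomicBoolean(true);
  private final ExecutorService workers = Executors.newFixedThreadPool(2);
  private final List<String> pendingBatch = new ArrayList<>();   // guarded by "this"

  /** Submits a document for processing; returns false once shutdown has begun. */
  boolean publish(String doc) {
    if (!accepting.get()) {
      return false;
    }
    try {
      workers.execute(() -> addToBatch(doc));
      return true;
    } catch (RejectedExecutionException e) {
      return false;   // lost the race with close()
    }
  }

  private synchronized void addToBatch(String doc) {
    pendingBatch.add(doc);
  }

  /** Stops intake, waits for in-flight work, then flushes and returns the final batch. */
  List<String> close() {
    accepting.set(false);   // 1. stop producing new documents
    workers.shutdown();     // 2. already-submitted tasks still run to completion
    try {
      if (!workers.awaitTermination(10, TimeUnit.SECONDS)) {
        workers.shutdownNow();
      }
    } catch (InterruptedException e) {
      workers.shutdownNow();
      Thread.currentThread().interrupt();
    }
    synchronized (this) {   // 3. flush the final batch
      List<String> flushed = new ArrayList<>(pendingBatch);
      pendingBatch.clear();
      return flushed;
    }
  }

  public static void main(String[] args) {
    ShutdownSketch sketch = new ShutdownSketch();
    sketch.publish("doc-1");
    sketch.publish("doc-2");
    System.out.println("flushed " + sketch.close().size() + " docs");   // prints "flushed 2 docs"
  }
}
```

In an application, close() would typically be invoked from a handler registered with Runtime.getRuntime().addShutdownHook, which is the supported JDK alternative to installing a signal handler directly.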

    When to Use Local Mode

    • Developing and testing pipelines locally
    • Processing small datasets (under 100K documents)
    • Running one-off batch jobs
    • Evaluating Lucille before production
    • Deploying on a single machine by necessity
    • Working without external dependencies such as Kafka

    Next Steps

    Distributed Mode

    Scale to distributed deployment with Kafka for production workloads

    Production Best Practices

    Learn monitoring, tuning, and troubleshooting for production systems
