Dashboard Guide

This guide provides detailed information about the pre-configured Grafana dashboards for monitoring Nokia SROS and SR Linux devices.

Dashboard Access

Access Grafana dashboards at:
http://localhost:3030
Navigation:
  1. Click Dashboards icon (four squares) in left sidebar
  2. Select Browse
  3. Choose a dashboard:
    • SROS Dashboard - BNG monitoring
    • SR Linux Telemetry - Switch/OLT monitoring
Anonymous access is enabled, so no login is required for viewing. Log in as admin/admin to edit dashboards.

SROS Dashboard

Comprehensive monitoring for Nokia SROS BNG routers (BNG1, BNG2). File: configs/grafana/dashboards/SROS-Dashboard.json
Purpose: Monitor BNG system health, interfaces, subscriber sessions, and routing protocols

Dashboard Layout

The SROS dashboard is organized into collapsible rows:
Overview of device health and resources.
Panels:
  1. CPU Utilization (Bar Gauge)
    • Shows CPU percentage per device
    • Color-coded: Green (under 70%), Yellow (70-85%), Red (over 85%)
    • Query: system_cpu
  2. Memory Usage (Gauge)
    • Displays memory utilization percentage
    • Calculated: (in_use / available) * 100
    • Threshold warnings at 80% and 90%
  3. System Uptime (Stat)
    • Time since last restart
    • Useful for tracking reboots
  4. Temperature Sensors (Graph)
    • Card and module temperatures
    • Query: card_hardware_data_temperature
    • Alert threshold at 70°C
  5. Fan Speeds (Graph)
    • Chassis fan RPM monitoring
    • Query: chassis_fan_speed
    • Detects fan failures (speed = 0)
Use Cases:
  • Quick health check before maintenance
  • Identify overheating or resource exhaustion
  • Monitor system stability over time
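The thresholds listed for these panels can also be checked ad hoc, for example in Grafana's Explore view. A minimal sketch, assuming the three metrics are gauges reported in °C, RPM, and percent respectively (the units are inferred from the panel descriptions, not confirmed by the dashboard JSON):

```promql
# Cards or modules above the 70 °C alert threshold
card_hardware_data_temperature > 70

# Fans reporting zero speed (possible fan failure)
chassis_fan_speed == 0

# Devices in the red CPU band (over 85%)
system_cpu > 85
```

Each expression returns only the series that currently violate the threshold, so an empty result means the check passes.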

SROS Dashboard Variables

The dashboard includes dynamic variables:
| Variable | Type | Values | Usage |
|---|---|---|---|
| $device | Query | bng1, bng2, All | Filter by BNG device |
| $port | Query | (port IDs) | Filter by specific port |
| $interval | Custom | 5m, 15m, 1h | Rate calculation window |
| $service | Query | (VPLS names) | Filter VPLS services |
Query Example:
# Uses $device variable (=~ so a multi-value or "All" selection still matches)
system_cpu{device=~"$device"}

# Uses $port variable
port_statistics_in_octets{device=~"$device",port_id=~"$port"}
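As a sketch of how the $interval variable might serve as a rate window, the query below converts the inbound octet counter into bits per second. It assumes port_statistics_in_octets is a monotonically increasing counter. The multiplication by 8 and the =~ matchers are illustrative choices, not necessarily what the shipped dashboard uses.

```promql
# Inbound traffic in bits per second, averaged over the selected $interval
rate(port_statistics_in_octets{device=~"$device",port_id=~"$port"}[$interval]) * 8
```

With a Prometheus data source, Grafana typically expands a multi-value or All selection into a regular expression, which is why a regex matcher (=~) is used here instead of an exact match (=).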

SROS Dashboard Time Ranges

Recommended time ranges:
  • Real-time monitoring: Last 5 minutes (5s refresh)
  • Active troubleshooting: Last 1 hour (10s refresh)
  • Performance review: Last 24 hours
  • Capacity planning: Last 7 days
  • Incident analysis: Custom range around event

SR Linux Dashboard

Monitoring for Nokia SR Linux switches and routers (Switch, OLT, TX). File: configs/grafana/dashboards/srlinux-telemetry-lite.json
Purpose: Monitor platform health, interface statistics, and network instances

Dashboard Layout

Control plane resources and system health.
Panels:
  1. CPU Usage per Slot (Time Series)
    • CPU percentage for each control module
    • Query: platform_control_cpu_total
    • Labels: slot, index
  2. Memory Utilization (Gauge)
    • Physical memory usage percentage
    • Query:
      ((platform_control_memory_physical - platform_control_memory_free) /
        platform_control_memory_physical) * 100
      
    • Thresholds: 70% (yellow), 85% (red)
  3. Free Memory (Graph)
    • Available memory in GiB over time
    • Query: platform_control_memory_free / 1073741824
    • Trend monitoring for memory leaks
  4. Application Resource Usage (Table)
    • Per-application CPU and memory
    • Query: metrics matching system_app_management_application_* (in PromQL, {__name__=~"system_app_management_application_.+"})
    • Columns: App Name, CPU%, Memory (MB)
    • Identifies resource-hungry applications
Use Cases:
  • Monitor control plane health
  • Detect memory leaks
  • Identify misbehaving applications
  • Resource capacity planning
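For the memory-leak use case, a trend query can complement the Free Memory graph. A minimal sketch, assuming platform_control_memory_free is a gauge in bytes (consistent with the division by 1073741824 above). The 1h and 6h windows and the 24-hour horizon are arbitrary illustrative values.

```promql
# Per-second slope of free memory over the last hour (negative = shrinking)
deriv(platform_control_memory_free[1h])

# Control modules whose free memory is projected to reach zero within 24 hours
predict_linear(platform_control_memory_free[6h], 24 * 3600) <= 0
```

A steadily negative slope across several hours is a stronger leak signal than a single dip, so the longer fitting window in the second query is deliberate.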

SR Linux Dashboard Variables

| Variable | Type | Values | Usage |
|---|---|---|---|
| $device | Query | switch, olt, tx, All | Filter by device |
| $interface | Query | (interface names) | Filter specific interface |
| $network_instance | Query | (VRF names) | Filter network instance |

Dashboard Usage Tips

Comparing Metrics Across Devices

  1. Set $device variable to All
  2. Use color coding to distinguish devices
  3. Enable legend for identification
  4. Example query:
    system_cpu
    
    Returns CPU for all devices, one series per device label value

Correlating Events

  1. Select custom time range around incident
  2. Open multiple dashboards in tabs
  3. Use same time range across all dashboards
  4. Look for correlations:
    • CPU spike + BGP route loss?
    • Session drops + interface errors?
    • Memory growth + application restart?

Zooming In on Issues

  1. Click and drag on graph to zoom time range
  2. Click legend entry to hide/show series
  3. Shift+click legend to isolate single series
  4. Double-click graph to reset zoom

Exporting Data

  1. Click panel title → Inspect → Data
  2. View raw data table
  3. Click Download CSV to export
  4. Use for reporting or external analysis

Sharing Dashboard Views

  1. Set time range and variables as desired
  2. Click Share dashboard icon (top-right)
  3. Options:
    • Link: Copy URL with current settings
    • Snapshot: Create static snapshot (if enabled)
    • Export: Download JSON

Creating Alert Annotations

Manually mark events on graphs:
  1. Ctrl+click on graph at event time
  2. Select Add annotation
  3. Enter description (e.g., “Config change deployed”)
  4. Annotation appears on all panels
Annotations are per-dashboard and not persistent in this lab configuration.

Performance Optimization

Dashboard Loading Slowly

Solutions:
  1. Reduce time range (e.g., 1h instead of 7d)
  2. Limit variable selections (specific device vs. All)
  3. Collapse unused rows
  4. Increase $interval variable value

Too Many Series in Graph

Solutions:
  1. Use topk() to limit to top N:
    topk(10, rate(port_statistics_in_octets[5m]))
    
  2. Filter by specific labels:
    interface_statistics_in_octets{name=~"ethernet-1/[1-5]"}
    
  3. Use aggregation:
    sum by (device) (system_cpu)
    

Query Taking Too Long

Solutions:
  1. Reduce query range (use [$interval] instead of hardcoded)
  2. Add more specific label filters
  3. Use irate() instead of rate() when only the most recent change matters (irate() looks at just the last two samples in the range)
  4. Consider creating recording rules in Prometheus

Common Dashboard Workflows

Daily Health Check

  1. Open SROS Dashboard
  2. Check System Status row:
    • CPU < 80%
    • Memory < 85%
    • No temperature alarms
  3. Verify Port Statistics:
    • All expected ports UP
    • No error counters increasing
  4. Review BNG Sessions:
    • Session count within normal range
    • No failed session spikes

Troubleshooting Subscriber Issue

  1. Open SROS Dashboard
  2. Set time range to incident window
  3. Check BNG Sessions row:
    • Session drops?
    • Failed authentications?
  4. Review Port Statistics:
    • Interface errors on subscriber-facing ports?
    • Traffic patterns abnormal?
  5. Check VPLS Services:
    • SAP down?
    • Service state issues?

Capacity Planning

  1. Set time range to Last 30 days
  2. SROS Dashboard → System Status:
    • CPU trend (linear regression)
    • Memory growth rate
  3. BNG Sessions:
    • Peak session count
    • Growth rate (sessions/day)
  4. Port Statistics:
    • Max interface utilization
    • 95th percentile traffic rates
  5. Export data for external analysis
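Several of these capacity figures can be approximated directly in PromQL before exporting anything. A sketch under stated assumptions: system_cpu is a gauge, port_statistics_in_octets is a counter, and Prometheus retention in the lab actually covers 30 days (the Prometheus default is shorter, so the windows may need reducing). The 7-day forecast horizon and 5m subquery step are illustrative.

```promql
# Linear extrapolation of CPU 7 days ahead, fitted over the last 30 days
predict_linear(system_cpu[30d], 7 * 86400)

# Peak inbound rate per port over 30 days, in bits per second (subquery)
max_over_time((rate(port_statistics_in_octets[5m]) * 8)[30d:5m])

# 95th percentile inbound rate over the same window
quantile_over_time(0.95, (rate(port_statistics_in_octets[5m]) * 8)[30d:5m])
```

Subqueries over long windows are expensive, so these are better run occasionally in Explore than placed on an auto-refreshing panel.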

Incident Analysis

  1. Identify incident time window
  2. Open all dashboards in separate tabs
  3. Set same custom time range on all
  4. Look for anomalies:
    • CPU/memory spikes
    • Route count changes
    • Interface state changes
    • Error rate increases
  5. Take screenshots for incident report
  6. Export relevant panel data

Next Steps

  • Available Metrics: Explore all metrics used in dashboards
  • Customize Grafana: Learn to create custom dashboards
