Monitoring

The POC cluster includes the Prometheus Operator stack (kube-prometheus-stack v80.4.2) for comprehensive monitoring and alerting. This provides Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notification routing.

Stack Components

| Component | Version | Purpose |
|---|---|---|
| Prometheus | Included in operator 80.4.2 | Metrics collection and storage |
| Grafana | Included in operator 80.4.2 | Dashboards and visualization |
| Alertmanager | Included in operator 80.4.2 | Alert routing and notification |
| Node Exporter | Included | Hardware and OS metrics |
| kube-state-metrics | Included | Kubernetes object metrics |

Accessing Grafana

Grafana is exposed via NGINX Ingress or can be accessed via port-forward:

Via kubectl Port-Forward

```bash
# Port-forward Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
```

Then open http://localhost:3000 in your browser.
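For the ingress path mentioned above, the Grafana subchart exposes an `ingress` block in the chart values. The fragment below is a minimal sketch, not the POC's actual configuration: the hostname `grafana.poc.internal` and the ingress class name `nginx` are assumptions and should be replaced with the values used in your environment.

```yaml
# Sketch of Grafana ingress settings in the kube-prometheus-stack values.
# Hostname and ingress class are assumptions, not taken from the POC.
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx      # assumed class name of the NGINX Ingress controller
    hosts:
      - grafana.poc.internal     # hypothetical internal hostname
    path: /
```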

Default Credentials

| Field | Value |
|---|---|
| Username | `admin` |
| Password | `spectro` |

Change Grafana Password

The default Grafana password should be changed after first login, especially if Grafana is exposed via ingress.
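One way to avoid keeping the default password in the values file is to let the Grafana subchart read the admin credentials from a Kubernetes Secret. The sketch below assumes a Secret named `grafana-admin-credentials` with keys `admin-user` and `admin-password`; these names are illustrative, not part of the POC.

```yaml
# Secret holding the Grafana admin credentials (name and keys are assumptions).
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: "<choose-a-strong-password>"   # placeholder, set your own
---
# Values fragment (kept in the pack values, not applied as a manifest):
# point the Grafana subchart at the Secret above.
grafana:
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```

If Grafana already runs with persistent storage, the password stored in its database may take precedence over the Secret. In that case, change the password in the Grafana UI instead.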

Pre-Built Dashboards

The kube-prometheus-stack chart ships several pre-built Grafana dashboards:

Kubernetes Dashboards

| Dashboard | Metrics Shown |
|---|---|
| Kubernetes / Compute Resources / Cluster | CPU, memory, network across all nodes |
| Kubernetes / Compute Resources / Node | Per-node CPU, memory, disk, network |
| Kubernetes / Compute Resources / Namespace | Per-namespace resource usage |
| Kubernetes / Compute Resources / Pod | Per-pod CPU and memory |
| Kubernetes / Networking / Cluster | Cluster-wide network throughput |
| Kubernetes / Persistent Volumes | PVC usage, IOPS, latency |

Node Dashboards

| Dashboard | Metrics Shown |
|---|---|
| Node Exporter / Nodes | CPU, memory, disk, network per node |
| Node Exporter / USE Method / Node | Utilization, Saturation, Errors per node |

Portworx Dashboards

If Portworx monitoring is enabled, additional dashboards are available:

| Dashboard | Metrics Shown |
|---|---|
| Portworx Cluster | Overall cluster storage health |
| Portworx Volume | Per-volume IOPS, latency, throughput |
| Portworx Node | Per-node storage capacity and usage |
| Portworx Alerts | Active storage alerts |

Key Metrics to Monitor

Cluster Health

| Metric | PromQL | Alert Threshold |
|---|---|---|
| Node Ready | `sum(kube_node_status_condition{condition="Ready",status="true"})` | < 3 nodes |
| Pod Restarts | `increase(kube_pod_container_status_restarts_total[1h])` | > 5 restarts/hour |
| etcd Health | `etcd_server_has_leader` | == 0 (no leader) |
| API Server Latency | `histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))` | p99 > 1s |
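The thresholds in this table can be turned into alerts with a `PrometheusRule` resource, which the Prometheus Operator loads into Prometheus. The manifest below is a sketch under stated assumptions: the resource name, the `for` durations, the severity labels, and the `release: prometheus` label (inferred from the `prometheus-grafana` service name) are not taken from the POC and may need adjusting so that your Prometheus instance selects the rule.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-cluster-health          # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the rule selector of your Prometheus
spec:
  groups:
    - name: poc.cluster-health
      rules:
        # Fires when fewer than 3 nodes report the Ready condition.
        - alert: NodeNotReady
          expr: sum(kube_node_status_condition{condition="Ready",status="true"}) < 3
          for: 5m                   # assumed hold time
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 3 nodes report Ready"
        # Fires per container that restarted more than 5 times in the last hour.
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1h"
        # Fires for any etcd member that reports no leader.
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m                   # assumed hold time
          labels:
            severity: critical
          annotations:
            summary: "etcd member {{ $labels.instance }} has no leader"
        # p99 request latency over all API server requests, above 1 second.
        - alert: ApiServerHighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) > 1
          for: 10m                  # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: "API server p99 request latency is above 1s"
```

Long-running requests such as watches can inflate the API server percentile, so a verb filter on the histogram may be worth adding once a baseline is known.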

Node Resources

| Metric | PromQL | Alert Threshold |
|---|---|---|
| CPU Usage | `100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | > 80% sustained |
| Memory Usage | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100` | > 85% |
| Disk Usage | `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100` | > 80% |
| Network Errors | `rate(node_network_receive_errs_total[5m])` | > 0 |
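A "sustained" threshold maps naturally onto the `for` field of an alerting rule, which keeps the alert pending until the condition has held for the given time. The sketch below also stores the CPU expression as a recording rule so that dashboards and alerts share one definition. The rule names, the `for` durations, the `fstype` filter, and the `release: prometheus` label are assumptions, not values from the POC.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-node-resources          # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the rule selector of your Prometheus
spec:
  groups:
    - name: poc.node-resources
      rules:
        # Recording rule: per-node CPU usage in percent (hypothetical series name).
        - record: instance:node_cpu_usage:percent
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
        # "Sustained" is expressed with `for`: the condition must hold for 15 minutes.
        - alert: NodeCpuHigh
          expr: instance:node_cpu_usage:percent > 80
          for: 15m
          labels:
            severity: warning
        - alert: NodeMemoryHigh
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
          for: 10m
          labels:
            severity: warning
        # The fstype filter is an addition that skips pseudo filesystems; adjust or remove as needed.
        - alert: NodeDiskHigh
          expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
          for: 10m
          labels:
            severity: warning
        - alert: NodeNetworkReceiveErrors
          expr: rate(node_network_receive_errs_total[5m]) > 0
          for: 10m
          labels:
            severity: warning
```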

Storage (Portworx)

| Metric | PromQL | Alert Threshold |
|---|---|---|
| Pool Capacity | `px_cluster_disk_utilized_bytes / px_cluster_disk_total_bytes * 100` | > 80% |
| Volume Read Throughput | `rate(px_volume_read_bytes_total[5m])` | Baseline dependent |
| Replication Status | `px_volume_replication_status` | != healthy |
| Drive Health | `px_disk_state` | != online |
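Of the rows above, pool capacity has a numeric threshold that translates directly into an alerting rule. The replication and drive rows compare against named states, and how those states are encoded as sample values is not stated here, so they are left out of this sketch. The rule name, `for` duration, and `release: prometheus` label are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-portworx-capacity       # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the rule selector of your Prometheus
spec:
  groups:
    - name: poc.portworx
      rules:
        # Fires when more than 80% of the Portworx cluster capacity is used.
        - alert: PortworxPoolCapacityHigh
          expr: px_cluster_disk_utilized_bytes / px_cluster_disk_total_bytes * 100 > 80
          for: 10m                  # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: "Portworx cluster storage utilization is above 80%"
```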

VMO

| Metric | PromQL | Alert Threshold |
|---|---|---|
| VM Running | `kubevirt_vm_running_status_last_transition_timestamp_seconds` | VM not running |
| VM CPU Usage | `kubevirt_vmi_cpu_usage_seconds_total` | Baseline dependent |
| VM Memory | `kubevirt_vmi_memory_available_bytes` | < 10% free |
| VM Network | `kubevirt_vmi_network_receive_bytes_total` | Baseline dependent |
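The "< 10% free" threshold needs a denominator, which the table does not name. The sketch below assumes that `kubevirt_vmi_memory_domain_bytes` (the memory allocated to the VM's domain) is exported by the KubeVirt version in use; verify the metric name in Prometheus before relying on the rule. The rule name, `for` duration, and `release: prometheus` label are also assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-vmo-memory              # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the rule selector of your Prometheus
spec:
  groups:
    - name: poc.vmo
      rules:
        # Fires when a VM has less than 10% of its allocated memory available.
        # Assumption: kubevirt_vmi_memory_domain_bytes exists in this KubeVirt version.
        - alert: VmMemoryLow
          expr: kubevirt_vmi_memory_available_bytes / kubevirt_vmi_memory_domain_bytes * 100 < 10
          for: 10m                  # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: "VM {{ $labels.name }} has less than 10% memory available"
```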

Alertmanager Configuration

Alertmanager handles alert routing and notification. For the POC, alerts can be sent to:

  • Email (SMTP relay required)
  • Webhook (if the external endpoint is reachable)

Non-critical alerts can be silenced or inhibited during testing.

Air-Gap Constraint

In the air-gapped environment, Alertmanager notification channels that require external network access (Slack, PagerDuty, etc.) will not work. Use email via an internal SMTP relay or a webhook to an internal endpoint.
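A configuration that uses only internal endpoints might look like the following values fragment. It is a sketch, not the POC's configuration: the SMTP relay address, the mail addresses, the webhook URL, and the receiver names are all hypothetical. It sends alerts to email by default, sends critical alerts to both email and an internal webhook, and inhibits warnings while a critical alert with the same name and namespace is firing.

```yaml
# Values fragment for kube-prometheus-stack; all hosts and addresses are hypothetical.
alertmanager:
  config:
    global:
      smtp_smarthost: "smtp-relay.poc.internal:25"   # assumed internal SMTP relay
      smtp_from: "alertmanager@poc.internal"
      smtp_require_tls: false                        # assumption: relay does not offer TLS
    route:
      receiver: email-ops                            # default receiver
      group_by: ["alertname", "namespace"]
      routes:
        # Drop the always-firing Watchdog alert.
        - receiver: "null"
          matchers:
            - alertname = "Watchdog"
        # Critical alerts go to email and the internal webhook.
        - receiver: critical
          matchers:
            - severity = "critical"
    receivers:
      - name: "null"
      - name: email-ops
        email_configs:
          - to: "ops@poc.internal"
      - name: critical
        email_configs:
          - to: "ops@poc.internal"
        webhook_configs:
          - url: "http://alert-receiver.monitoring.svc:8080/alerts"
    inhibit_rules:
      # Suppress warnings while a matching critical alert is active.
      - source_matchers:
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: ["alertname", "namespace"]
```

Silences are not part of this file. They are created at runtime, for example through the Alertmanager web UI, and expire on their own.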

Prometheus Data Retention

| Setting | Value |
|---|---|
| Retention Period | 15 days (default) |
| Storage | Portworx PVC |
| Recommended PVC Size | 50-100 GB for 3-node cluster |

To adjust retention:

```yaml
# Prometheus retention configuration (in pack values)
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "80GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: portworx-sc-repl2
          resources:
            requests:
              storage: 100Gi
```

Monitoring Architecture

```mermaid
graph TB
    subgraph Cluster["3-Node Cluster"]
        subgraph N1["Node 1"]
            NE1["Node Exporter"]
            KL1["Kubelet"]
        end
        subgraph N2["Node 2"]
            NE2["Node Exporter"]
            KL2["Kubelet"]
        end
        subgraph N3["Node 3"]
            NE3["Node Exporter"]
            KL3["Kubelet"]
        end

        PROM["Prometheus<br/>(Scrapes all targets)"]
        GRAF["Grafana<br/>(Dashboards)"]
        AM["Alertmanager<br/>(Notifications)"]
        KSM["kube-state-metrics"]
    end

    NE1 & NE2 & NE3 --> PROM
    KL1 & KL2 & KL3 --> PROM
    KSM --> PROM
    PROM --> GRAF
    PROM --> AM
```
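The scrape targets in the diagram are preconfigured by the stack. With the Prometheus Operator, a further target is typically declared through a `ServiceMonitor` resource rather than by editing the Prometheus configuration. The manifest below is an illustrative sketch only: the application `example-app`, its namespace `example`, the port name `metrics`, and the `release: prometheus` label are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # hypothetical application
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the ServiceMonitor selector of your Prometheus
spec:
  namespaceSelector:
    matchNames:
      - example                     # namespace where the Service lives (hypothetical)
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics                 # name of the Service port that serves metrics
      path: /metrics
      interval: 30s
```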