Monitoring¶

The POC cluster includes the Prometheus Operator stack (kube-prometheus-stack v80.4.2) for monitoring and alerting. The stack provides Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notification routing.

Stack Components¶

Component	Version	Purpose
Prometheus	Bundled with chart 80.4.2	Metrics collection and storage
Grafana	Bundled with chart 80.4.2	Dashboards and visualization
Alertmanager	Bundled with chart 80.4.2	Alert routing and notification
Node Exporter	Included	Hardware and OS metrics
kube-state-metrics	Included	Kubernetes object metrics

Accessing Grafana¶

Grafana is exposed through the NGINX Ingress and can also be reached with a kubectl port-forward:

Via kubectl Port-Forward¶

Port-forward Grafana

kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80

Then open http://localhost:3000 in your browser.
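
For the ingress path mentioned at the start of this section, the Grafana subchart values might look like the sketch below. It assumes the pack passes `grafana.ingress` through to the chart unchanged and that the ingress class is named `nginx`. The hostname is a placeholder, not a value taken from this POC.

```yaml
# Sketch only: Grafana ingress settings in the pack values.
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx        # assumed name of the NGINX ingress class
    hosts:
      - grafana.poc.example        # placeholder hostname; replace with a real internal DNS name
    path: /
```

With an ingress in place, the default credentials below become reachable by anyone who can resolve the hostname, which is the main reason to change them early.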

Default Credentials¶

Field	Value
Username	`admin`
Password	`spectro`

Change Grafana Password

The default Grafana password should be changed after first login, especially if Grafana is exposed via ingress.
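
Besides changing the password in the Grafana UI, the Grafana subchart can read the admin login from a Kubernetes Secret, so the password does not have to sit in the pack values. The following is a minimal sketch under assumptions: the Secret name `grafana-admin-credentials` and the password placeholder are hypothetical, and the pack is assumed to pass `grafana.admin` through to the chart.

```yaml
# 1) Hypothetical Secret holding the admin login. Create it before applying the values.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials   # hypothetical name
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: change-me         # placeholder; replace with a strong password
---
# 2) Pack values fragment (not a Kubernetes manifest): point Grafana at the Secret.
grafana:
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```

Grafana generally applies the admin password from its environment only when it first initializes its database. If Grafana's data is persisted, an existing admin password may still need to be changed in the UI.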

Pre-Built Dashboards¶

The kube-prometheus-stack chart ships with several pre-built Grafana dashboards:

Kubernetes Dashboards¶

Dashboard	Metrics Shown
Kubernetes / Compute Resources / Cluster	CPU, memory, network across all nodes
Kubernetes / Compute Resources / Node	Per-node CPU, memory, disk, network
Kubernetes / Compute Resources / Namespace	Per-namespace resource usage
Kubernetes / Compute Resources / Pod	Per-pod CPU and memory
Kubernetes / Networking / Cluster	Cluster-wide network throughput
Kubernetes / Persistent Volumes	PVC space and inode usage

Node Dashboards¶

Dashboard	Metrics Shown
Node Exporter / Nodes	CPU, memory, disk, network per node
Node Exporter / USE Method / Node	Utilization, Saturation, Errors per node

Portworx Dashboards¶

If Portworx monitoring is enabled, additional dashboards are available:

Dashboard	Metrics Shown
Portworx Cluster	Overall cluster storage health
Portworx Volume	Per-volume IOPS, latency, throughput
Portworx Node	Per-node storage capacity and usage
Portworx Alerts	Active storage alerts

Key Metrics to Monitor¶

Cluster Health¶

Metric	PromQL	Alert Threshold
Node Ready	`sum(kube_node_status_condition{condition="Ready",status="true"})`	< 3 nodes
Pod Restarts	`increase(kube_pod_container_status_restarts_total[1h])`	> 5 restarts/hour
etcd Health	`etcd_server_has_leader`	== 0 (no leader)
API Server Latency	`histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))`	p99 > 1s
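
The thresholds in the table above could be expressed as a `PrometheusRule` resource, roughly as sketched here. Several details are assumptions: the `release: prometheus` label is inferred from the `prometheus-grafana` service name and must match whatever rule selector the Prometheus instance actually uses, and the rule names, `for` durations, and severities are illustrative.

```yaml
# Sketch only: cluster-health alerts derived from the table above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-cluster-health          # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumed Helm release label used by the rule selector
spec:
  groups:
    - name: poc-cluster-health
      rules:
        - alert: NodeNotReady
          # Count nodes whose Ready condition is true; the POC expects 3.
          expr: sum(kube_node_status_condition{condition="Ready",status="true"}) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Fewer than 3 nodes report Ready
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          labels:
            severity: warning
          annotations:
            summary: Container restarted more than 5 times in the last hour
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: An etcd member reports no leader
        - alert: APIServerHighLatency
          # p99 request latency computed from the histogram buckets.
          expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m]))) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: API server p99 request latency above 1s
```

The latency rule aggregates all verbs together. Long-running requests such as `WATCH` can inflate the p99, so a production rule would likely exclude them.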

Node Resources¶

Metric	PromQL	Alert Threshold
CPU Usage	`100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`	> 80% sustained
Memory Usage	`(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`	> 85%
Disk Usage	`(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100`	> 80%
Network Errors	`rate(node_network_receive_errs_total[5m])`	> 0
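
One way to turn these thresholds into alerts is the chart's `additionalPrometheusRulesMap` value, sketched below. The sketch assumes the pack passes this key through to kube-prometheus-stack. The map key, alert names, and `for` durations are illustrative, with `for: 15m` standing in for "sustained".

```yaml
# Sketch only: node-resource alerts defined through the pack values.
additionalPrometheusRulesMap:
  poc-node-resources:               # hypothetical map key
    groups:
      - name: poc-node-resources
        rules:
          - alert: NodeCPUHigh
            expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: Node CPU usage above 80% for 15 minutes
          - alert: NodeMemoryHigh
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Node memory usage above 85%
          - alert: NodeDiskHigh
            expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Filesystem usage above 80%
          - alert: NodeNetworkReceiveErrors
            expr: rate(node_network_receive_errs_total[5m]) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Network interface is reporting receive errors
```

The disk expression matches every mounted filesystem, including pseudo filesystems such as tmpfs. Adding an `fstype` or `mountpoint` selector would reduce noise.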

Storage (Portworx)¶

Metric	PromQL	Alert Threshold
Pool Capacity	`px_cluster_disk_utilized_bytes / px_cluster_disk_total_bytes * 100`	> 80%
Volume Read Throughput	`rate(px_volume_read_bytes_total[5m])`	Baseline dependent
Replication Status	`px_volume_replication_status`	!= healthy
Drive Health	`px_disk_state`	!= online

VMO¶

Metric	PromQL	Alert Threshold
VM Running	`kubevirt_vm_running_status_last_transition_timestamp_seconds`	VM not running
VM CPU Usage	`kubevirt_vmi_cpu_usage_seconds_total`	Baseline dependent
VM Memory	`kubevirt_vmi_memory_available_bytes`	< 10% free
VM Network	`kubevirt_vmi_network_receive_bytes_total`	Baseline dependent

Alertmanager Configuration¶

Alertmanager handles alert routing and notification. For the POC, the following options are available:

Email (requires an SMTP relay)
Webhook (if the receiving endpoint is reachable from the cluster)
Silences and inhibition rules to suppress non-critical alerts during testing

Air-Gap Constraint

In the air-gapped environment, notification channels that need external network access (Slack, PagerDuty, etc.) will not work. Use email through an internal SMTP relay or a webhook to an internal endpoint.
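
A configuration that stays inside the air gap might look like the following sketch, set through the chart's `alertmanager.config` value. All hostnames, addresses, and the webhook URL are placeholders, and the routing by `severity` assumes alerts carry that label. Setting `alertmanager.config` replaces the chart's default Alertmanager configuration rather than merging with it.

```yaml
# Sketch only: internal email and webhook receivers, with low-severity alerts dropped.
alertmanager:
  config:
    global:
      smtp_smarthost: "smtp.internal.example:25"    # placeholder internal SMTP relay
      smtp_from: "alertmanager@poc.example"         # placeholder sender address
      smtp_require_tls: false                       # assumes the relay does not offer TLS
    route:
      receiver: internal-email                      # default receiver
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: "null"                          # discard informational alerts during testing
          matchers:
            - 'severity = "info"'
        - receiver: internal-webhook
          matchers:
            - 'severity = "critical"'
    inhibit_rules:
      # Suppress a warning when a critical alert with the same name and namespace is firing.
      - source_matchers:
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: ["alertname", "namespace"]
    receivers:
      - name: "null"
      - name: internal-email
        email_configs:
          - to: "ops@poc.example"                   # placeholder recipient
      - name: internal-webhook
        webhook_configs:
          - url: "http://alert-receiver.monitoring.svc:8080/hook"   # placeholder internal endpoint
```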

Prometheus Data Retention¶

Setting	Value
Retention Period	15 days (default)
Storage	Portworx PVC
Recommended PVC Size	50-100 GB for 3-node cluster

To adjust retention:

Prometheus retention configuration (in pack values)

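prometheus:
  prometheusSpec:
    retention: 30d                # time-based limit
    retentionSize: "80GB"         # size-based limit; whichever limit is reached first applies
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: portworx-sc-repl2
          resources:
            requests:
              storage: 100Gi      # keep the PVC larger than retentionSize to leave headroom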
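
As a rough check on the sizes above, the Prometheus documentation suggests estimating disk needs as retention time multiplied by ingestion rate and bytes per sample (typically 1 to 2 bytes). The ingestion rate of 15,000 samples per second used here is purely an assumed figure for illustration. The real value can be read from `rate(prometheus_tsdb_head_samples_appended_total[5m])` on the running cluster.

```latex
\begin{aligned}
\text{disk} &\approx \text{retention (s)} \times \text{samples/s} \times \text{bytes/sample} \\
            &= (30 \times 86{,}400) \times 15{,}000 \times 2 \\
            &= 2{,}592{,}000 \times 15{,}000 \times 2 \\
            &= 77{,}760{,}000{,}000 \ \text{bytes} \approx 77.8\ \text{GB} \ (\text{about } 72\ \text{GiB})
\end{aligned}
```

Under these assumptions, 30 days of data fits below the 80GB `retentionSize` and within the 100Gi claim. A higher ingestion rate would trigger the size limit before the 30-day limit.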

Monitoring Architecture¶

graph TB
    subgraph Cluster["3-Node Cluster"]
        subgraph N1["Node 1"]
            NE1["Node Exporter"]
            KL1["Kubelet"]
        end
        subgraph N2["Node 2"]
            NE2["Node Exporter"]
            KL2["Kubelet"]
        end
        subgraph N3["Node 3"]
            NE3["Node Exporter"]
            KL3["Kubelet"]
        end

        PROM["Prometheus<br/>(Scrapes all targets)"]
        GRAF["Grafana<br/>(Dashboards)"]
        AM["Alertmanager<br/>(Notifications)"]
        KSM["kube-state-metrics"]
    end

    NE1 & NE2 & NE3 --> PROM
    KL1 & KL2 & KL3 --> PROM
    KSM --> PROM
    PROM --> GRAF
    PROM --> AM