Monitoring¶
The POC cluster includes the Prometheus Operator stack (kube-prometheus-stack v80.4.2) for comprehensive monitoring and alerting. This provides Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notification routing.
Stack Components¶
| Component | Version | Purpose |
|---|---|---|
| Prometheus | Bundled with chart 80.4.2 | Metrics collection and storage |
| Grafana | Bundled with chart 80.4.2 | Dashboards and visualization |
| Alertmanager | Bundled with chart 80.4.2 | Alert routing and notification |
| Node Exporter | Bundled with chart 80.4.2 | Hardware and OS metrics |
| kube-state-metrics | Bundled with chart 80.4.2 | Kubernetes object metrics |
Accessing Grafana¶
Grafana is exposed via NGINX Ingress or can be accessed via port-forward:
Via kubectl Port-Forward¶
Start a `kubectl port-forward` session that maps local port 3000 to the Grafana service, then open http://localhost:3000 in your browser.
Default Credentials¶
| Field | Value |
|---|---|
| Username | admin |
| Password | spectro |
Change Grafana Password
The default Grafana password should be changed after first login, especially if Grafana is exposed via ingress.
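One way to avoid relying on the default password is to set the credentials through the chart values at install time. The fragment below is a sketch that assumes the Grafana subchart's `adminPassword` and `admin.existingSecret` keys; the Secret name `grafana-admin-credentials` is hypothetical and would have to exist in the same namespace as Grafana.

```yaml
grafana:
  # Option A: set the password inline (it is then stored in the Helm release values)
  # adminPassword: "<new-password>"

  # Option B: read the credentials from a pre-created Secret
  admin:
    existingSecret: grafana-admin-credentials  # hypothetical Secret name
    userKey: admin-user                        # Secret key holding the username
    passwordKey: admin-password                # Secret key holding the password
```

If Grafana has already initialized its database, changing these values may not reset the stored password, in which case the change has to be made from the Grafana UI or CLI.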
Pre-Built Dashboards¶
The kube-prometheus-stack chart ships several pre-built Grafana dashboards:
Kubernetes Dashboards¶
| Dashboard | Metrics Shown |
|---|---|
| Kubernetes / Compute Resources / Cluster | CPU, memory, network across all nodes |
| Kubernetes / Compute Resources / Node | Per-node CPU, memory, disk, network |
| Kubernetes / Compute Resources / Namespace | Per-namespace resource usage |
| Kubernetes / Compute Resources / Pod | Per-pod CPU and memory |
| Kubernetes / Networking / Cluster | Cluster-wide network throughput |
| Kubernetes / Persistent Volumes | PVC usage, IOPS, latency |
Node Dashboards¶
| Dashboard | Metrics Shown |
|---|---|
| Node Exporter / Nodes | CPU, memory, disk, network per node |
| Node Exporter / USE Method / Node | Utilization, Saturation, Errors per node |
Portworx Dashboards¶
If Portworx monitoring is enabled, additional dashboards are available:
| Dashboard | Metrics Shown |
|---|---|
| Portworx Cluster | Overall cluster storage health |
| Portworx Volume | Per-volume IOPS, latency, throughput |
| Portworx Node | Per-node storage capacity and usage |
| Portworx Alerts | Active storage alerts |
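For an Operator-based Portworx install, metric export is typically switched on in the `StorageCluster` resource. The fragment below is a sketch under that assumption; the resource name and namespace are hypothetical, and only the `monitoring` stanza is shown, to be merged into the existing spec rather than applied as a standalone manifest.

```yaml
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
  name: px-cluster      # hypothetical cluster name
  namespace: portworx   # assumed namespace
spec:
  monitoring:
    prometheus:
      exportMetrics: true   # expose Portworx metrics for Prometheus to scrape
```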
Key Metrics to Monitor¶
Cluster Health¶
| Metric | PromQL | Alert Threshold |
|---|---|---|
| Node Ready | `sum(kube_node_status_condition{condition="Ready",status="true"})` | < 3 nodes |
| Pod Restarts | `increase(kube_pod_container_status_restarts_total[1h])` | > 5 restarts/hour |
| etcd Health | `etcd_server_has_leader` | == 0 (no leader) |
| API Server Latency | `histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))` | p99 > 1s |
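The thresholds in this table can be turned into alerts with a `PrometheusRule` resource. The manifest below is a minimal sketch, not the POC's actual rule set: the resource name, the `monitoring` namespace, the `release` label value, and the `for` durations are all assumptions. The label has to match whatever rule selector the Prometheus instance uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: poc-cluster-health            # hypothetical name
  namespace: monitoring               # assumed namespace
  labels:
    release: kube-prometheus-stack    # assumed Helm release name
spec:
  groups:
    - name: poc-cluster-health
      rules:
        - alert: NodeNotReady
          # Count the nodes currently reporting Ready and compare against 3
          expr: sum(kube_node_status_condition{condition="Ready",status="true"}) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Fewer than 3 nodes report Ready"
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd member {{ $labels.instance }} has no leader"
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in an hour"
```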
Node Resources¶
| Metric | PromQL | Alert Threshold |
|---|---|---|
| CPU Usage | `100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | > 80% sustained |
| Memory Usage | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100` | > 85% |
| Disk Usage | `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100` | > 80% |
| Network Errors | `rate(node_network_receive_errs_total[5m])` | > 0 |
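Node-level thresholds like these can also be supplied through the chart values instead of a separate manifest. The sketch below assumes the chart's `additionalPrometheusRulesMap` key; the rule-set name, the `for` durations (one reading of "sustained"), and the `fstype` filter that skips temporary filesystems are assumptions, not settings taken from the POC.

```yaml
additionalPrometheusRulesMap:
  poc-node-resources:                 # hypothetical rule-set name
    groups:
      - name: poc-node-resources
        rules:
          - alert: NodeHighCpu
            expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 15m                  # assumed definition of "sustained"
            labels:
              severity: warning
            annotations:
              summary: "Node CPU usage above 80% for 15 minutes"
          - alert: NodeHighMemory
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Node memory usage above 85%"
          - alert: NodeDiskFilling
            # fstype filter is an assumption to ignore tmpfs and overlay mounts
            expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Node filesystem usage above 80%"
```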
Storage (Portworx)¶
| Metric | PromQL | Alert Threshold |
|---|---|---|
| Pool Capacity | `px_cluster_disk_utilized_bytes / px_cluster_disk_total_bytes * 100` | > 80% |
| Volume Read Throughput | `rate(px_volume_read_bytes_total[5m])` | Baseline dependent |
| Replication Status | `px_volume_replication_status` | != healthy |
| Drive Health | `px_disk_state` | != online |
VMO¶
| Metric | PromQL | Alert Threshold |
|---|---|---|
| VM Running | `kubevirt_vm_running_status_last_transition_timestamp_seconds` | VM not running |
| VM CPU Usage | `kubevirt_vmi_cpu_usage_seconds_total` | Baseline dependent |
| VM Memory | `kubevirt_vmi_memory_available_bytes` | < 10% free |
| VM Network | `kubevirt_vmi_network_receive_bytes_total` | Baseline dependent |
Alertmanager Configuration¶
Alertmanager handles alert routing and notification. For the POC, alerts can be routed to:
- Email (SMTP relay required)
- Webhook (if the endpoint is reachable from the cluster)

Non-critical alerts can be silenced or inhibited during testing.
Air-Gap Constraint
In the air-gapped environment, Alertmanager notification channels that require external network access (Slack, PagerDuty, etc.) will not work. Use email via internal SMTP relay or webhook to an internal endpoint.
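A routing setup along these lines can be expressed in the chart values under `alertmanager.config`. The sketch below is illustrative only: every hostname, address, and URL is hypothetical, and the severity-based routing is an assumption about how the POC labels its alerts.

```yaml
alertmanager:
  config:
    global:
      smtp_smarthost: "smtp.internal.example:25"   # hypothetical internal relay
      smtp_from: "alertmanager@poc.example"        # hypothetical sender address
      smtp_require_tls: false                      # assumes the relay does not offer TLS
    route:
      receiver: email-ops                          # default: everything goes to email
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: internal-webhook               # critical alerts go to the webhook
          matchers:
            - 'severity = "critical"'
        - receiver: "null"                         # drop informational alerts
          matchers:
            - 'severity = "info"'
    receivers:
      - name: "null"
      - name: email-ops
        email_configs:
          - to: "ops@poc.example"                  # hypothetical recipient
      - name: internal-webhook
        webhook_configs:
          - url: "http://alert-hook.internal.example:8080/alerts"  # hypothetical endpoint
    inhibit_rules:
      # Suppress warnings while a critical alert with the same name and namespace is firing
      - source_matchers:
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: ["alertname", "namespace"]
```

Silences are created at runtime, for example through the Alertmanager UI, rather than in this configuration.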
Prometheus Data Retention¶
| Setting | Value |
|---|---|
| Retention Period | 15 days (default) |
| Storage | Portworx PVC |
| Recommended PVC Size | 50-100 GB for 3-node cluster |
To adjust retention:
```yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "80GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: portworx-sc-repl2
          resources:
            requests:
              storage: 100Gi
```
Monitoring Architecture¶
```mermaid
graph TB
    subgraph Cluster["3-Node Cluster"]
        subgraph N1["Node 1"]
            NE1["Node Exporter"]
            KL1["Kubelet"]
        end
        subgraph N2["Node 2"]
            NE2["Node Exporter"]
            KL2["Kubelet"]
        end
        subgraph N3["Node 3"]
            NE3["Node Exporter"]
            KL3["Kubelet"]
        end
        PROM["Prometheus<br/>(Scrapes all targets)"]
        GRAF["Grafana<br/>(Dashboards)"]
        AM["Alertmanager<br/>(Notifications)"]
        KSM["kube-state-metrics"]
    end
    NE1 & NE2 & NE3 --> PROM
    KL1 & KL2 & KL3 --> PROM
    KSM --> PROM
    PROM --> GRAF
    PROM --> AM
```