# Cluster Management

Palette manages the full lifecycle of the 3-node bare-metal cluster through cluster profiles, declarative updates, and A/B partition-based OS upgrades.
## Cluster Profiles
The cluster is deployed with three layered profiles:
### Infrastructure Profile
Defines the base platform stack:
| Layer | Pack | Version | Purpose |
|---|---|---|---|
| OS | edge-native-byoi | 2.1.0 | Ubuntu 24.04 with Kairos |
| K8s | edge-k8s (PXKe) | 1.33.6 | Kubernetes (kubeadm) |
| CNI | cni-cilium-oss | 1.18.4 | eBPF networking |
| CSI | Portworx | -- | Software-defined storage |
### Core Add-on Profile
Platform services deployed on top of K8s:
| Pack | Version | Purpose |
|---|---|---|
| MetalLB | 0.15.2 | L2 load balancer for bare-metal |
| NGINX Ingress | 1.14.1 | Ingress controller |
| Prometheus Operator | 80.4.2 | Monitoring and alerting |
| VMO | 4.8.9 | Virtual Machine Orchestrator |
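
MetalLB from the table above runs in L2 mode on bare metal and needs an address pool to hand out service IPs. A minimal sketch using MetalLB's CRDs -- the pool name and address range here are placeholders, not values from this deployment:

```yaml
# Placeholder address pool for MetalLB L2 mode (range is illustrative)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: vlan111-pool            # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 10.0.111.240-10.0.111.250 # placeholder range on the cluster VLAN
---
# Announce the pool via L2 (ARP/NDP) on the node network
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: vlan111-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - vlan111-pool
```

Services of type `LoadBalancer` then receive an IP from this pool, which MetalLB announces from one of the three nodes.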
### VMA Profile
| Pack | Version | Purpose |
|---|---|---|
| VM Migration Assistant | 4.8.8 | VM migration tooling |
## Profile Updates
When a cluster profile is updated, Palette orchestrates the rollout:
```mermaid
graph LR
    EDIT["Edit Profile<br/>in Palette"] --> DIFF["Review Diff<br/>Pending Update"]
    DIFF --> APPLY["Apply Update"]
    APPLY --> ROLL["Rolling Rollout<br/>Node by Node"]
    ROLL --> DONE["Update Complete"]

    style EDIT fill:#1F7A78,color:#fff
    style DIFF fill:#005B5B,color:#fff
    style APPLY fill:#043736,color:#fff
    style ROLL fill:#005B5B,color:#fff
    style DONE fill:#9EB277,color:#fff
```
### Update Types
| Update Type | Behavior | Downtime |
|---|---|---|
| Pack value change | Rolling update, pod restart | Minimal (per-pod) |
| Pack version upgrade | Rolling update, new images pulled | Minimal (per-node) |
| K8s version upgrade | Rolling drain + upgrade per node | One node at a time |
| OS upgrade | A/B partition swap + reboot | One node at a time |
## A/B Partition OS Upgrades
Appliance mode uses an A/B partition layout for zero-downtime OS upgrades. This is a key POC success criterion for Toyota.
### How It Works
```mermaid
graph TB
    subgraph Boot["Boot Drive (/dev/sda)"]
        A["Partition A<br/>(Active OS)"]
        B["Partition B<br/>(Inactive)"]
        BOOT["Boot Loader"]
    end
    subgraph Upgrade["OS Upgrade Process"]
        direction TB
        S1["1. Write new OS to<br/>Partition B"]
        S2["2. Reboot node<br/>Boot from B"]
        S3["3. If success:<br/>B becomes active"]
        S4["4. If failure:<br/>Rollback to A"]
    end
    S1 --> S2 --> S3
    S2 --> S4
```
### Upgrade Process

1. **Trigger** -- Update the OS pack version in the cluster profile
2. **Download** -- New OS image is pulled from the PMA internal registry
3. **Write** -- New image is written to the inactive partition (B)
4. **Drain** -- Node is cordoned and workloads are drained to other nodes
5. **Reboot** -- Node reboots into the new partition
6. **Validate** -- Palette agent verifies the node is healthy
7. **Activate** -- If healthy, the new partition becomes the active boot target
8. **Rollback** -- If unhealthy, the node reverts to the previous partition automatically
9. **Next node** -- Process repeats for the remaining nodes (rolling)
**Zero-Downtime Upgrades**

With a 3-node cluster, one node can be upgraded at a time while the other two maintain workload availability. Portworx handles storage replication during the drain/reboot cycle, and VMO can live-migrate VMs to available nodes.
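The validate/activate/rollback steps above reduce to a single health gate after reboot: if the node comes up healthy on the candidate partition, that partition is promoted; otherwise the node falls back to the previous one. A simplified illustration of that decision -- this is not the actual Kairos or bootloader implementation, and `HEALTH_OK` stands in for the Palette agent's health check:

```shell
# Promote-or-rollback decision after booting the candidate partition.
ACTIVE="A"        # partition the node booted from before the upgrade
CANDIDATE="B"     # partition holding the new OS image
HEALTH_OK=true    # stand-in for the Palette agent's post-boot health check

if [ "$HEALTH_OK" = "true" ]; then
  ACTIVE="$CANDIDATE"   # candidate becomes the persistent boot target
  echo "promoted: partition $ACTIVE is now active"
else
  echo "rollback: partition $ACTIVE remains active"
fi
```

Because the previous OS stays intact on the other partition, a failed upgrade costs only a reboot, not a reinstall.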
## Node Operations
### Adding a Node

To add a fourth (or subsequent) node to the cluster:

1. Image the new bare-metal node using the install ISO + site-user-data ISO
2. The node registers with Palette as a new edge host
3. In Palette, navigate to the cluster --> **Machine Pools**
4. Edit the machine pool and add the new edge host
5. Palette joins the node to the cluster automatically
### Removing a Node

1. In Palette, navigate to the cluster --> **Machine Pools**
2. Select the node to remove
3. Palette cordons the node, drains workloads, and removes it from the cluster
4. Portworx rebalances storage replicas across the remaining nodes
**Minimum 3 Nodes**

The cluster requires a minimum of 3 nodes for K8s control-plane quorum, etcd HA, and Portworx replication. Dropping below 3 nodes compromises cluster availability.
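The 3-node minimum follows from etcd's majority rule: a cluster of n members needs floor(n/2)+1 healthy members to commit writes. A quick sketch of the arithmetic showing why 3 nodes tolerate one failure while 2 nodes tolerate none:

```shell
# etcd quorum: floor(n/2)+1 members must be healthy to serve writes.
for n in 3 2 1; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

A 2-node cluster has a quorum of 2, so losing either node stalls etcd; the same logic applies to Portworx quorum for its replicated volumes.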
### Node Replacement

For hardware failure scenarios:

1. Image a replacement bare-metal node using the same install ISO + a new site-user-data ISO
2. Register the replacement node with Palette
3. Remove the failed node from the machine pool
4. Add the replacement node to the machine pool
5. Portworx resyncs data to the new node
## Cluster Health Monitoring
Palette provides built-in cluster health monitoring:
| Health Check | Frequency | Alert Condition |
|---|---|---|
| Node heartbeat | 30 seconds | Node unreachable > 5 minutes |
| Pod status | Continuous | Pods in CrashLoopBackOff |
| Profile drift | Continuous | Cluster state differs from profile |
| etcd health | 30 seconds | etcd member unhealthy |
| Resource utilization | 1 minute | CPU/memory threshold exceeded |
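
Checks like the resource-utilization row above can also be expressed as alerting rules for the Prometheus Operator pack via its `PrometheusRule` CRD. A hedged sketch -- the rule name, namespace, and 90% CPU threshold are illustrative, not values from this deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-threshold      # hypothetical rule name
  namespace: monitoring         # assumes the pack's default namespace
spec:
  groups:
    - name: node-utilization
      rules:
        - alert: NodeCPUHigh
          # Percent of non-idle CPU per node, averaged over 5 minutes
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
          for: 5m
          labels:
            severity: warning
```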
The Palette console shows real-time cluster status with color-coded health indicators:
- **Green** -- All components healthy
- **Yellow** -- Warning condition (non-critical)
- **Red** -- Error condition requiring attention
## kubectl Access

For direct cluster access, download the kubeconfig from Palette:

1. Navigate to the cluster in Palette
2. Click **Download Kubeconfig**
3. Save it to your local machine

```shell
export KUBECONFIG=/path/to/kubeconfig.yaml
kubectl get nodes
kubectl get pods -A
```
**Air-Gap Constraint**

kubectl access requires network connectivity to the cluster's K8s API on port 6443. In the air-gapped Toyota environment, run kubectl from a machine on VLAN 111 that can reach the cluster nodes.