Cluster Management

Palette manages the full lifecycle of the 3-node bare-metal cluster through cluster profiles, declarative updates, and A/B partition-based OS upgrades.

Cluster Profiles

The cluster is deployed with three layered profiles:

Infrastructure Profile

Defines the base platform stack:

| Layer | Pack | Version | Purpose |
|-------|------|---------|---------|
| OS | edge-native-byoi | 2.1.0 | Ubuntu 24.04 with Kairos |
| K8s | edge-k8s (PXKe) | 1.33.6 | Kubernetes (kubeadm) |
| CNI | cni-cilium-oss | 1.18.4 | eBPF networking |
| CSI | Portworx | -- | Software-defined storage |

Core Add-on Profile

Platform services deployed on top of K8s:

| Pack | Version | Purpose |
|------|---------|---------|
| MetalLB | 0.15.2 | L2 load balancer for bare-metal |
| NGINX Ingress | 1.14.1 | Ingress controller |
| Prometheus Operator | 80.4.2 | Monitoring and alerting |
| VMO | 4.8.9 | Virtual Machine Orchestrator |

VMA Profile

| Pack | Version | Purpose |
|------|---------|---------|
| VM Migration Assistant | 4.8.8 | VM migration tooling |

Profile Updates

When a cluster profile is updated, Palette orchestrates the rollout:

```mermaid
graph LR
    EDIT["Edit Profile<br/>in Palette"] --> DIFF["Review Diff<br/>Pending Update"]
    DIFF --> APPLY["Apply Update"]
    APPLY --> ROLL["Rolling Rollout<br/>Node by Node"]
    ROLL --> DONE["Update Complete"]

    style EDIT fill:#1F7A78,color:#fff
    style DIFF fill:#005B5B,color:#fff
    style APPLY fill:#043736,color:#fff
    style ROLL fill:#005B5B,color:#fff
    style DONE fill:#9EB277,color:#fff
```

Update Types

| Update Type | Behavior | Downtime |
|-------------|----------|----------|
| Pack value change | Rolling update, pod restart | Minimal (per-pod) |
| Pack version upgrade | Rolling update, new images pulled | Minimal (per-node) |
| K8s version upgrade | Rolling drain + upgrade per node | One node at a time |
| OS upgrade | A/B partition swap + reboot | One node at a time |

A/B Partition OS Upgrades

Appliance mode uses an A/B partition layout for zero-downtime OS upgrades. This is a key POC success criterion for Toyota.

How It Works

```mermaid
graph TB
    subgraph Boot["Boot Drive (/dev/sda)"]
        A["Partition A<br/>(Active OS)"]
        B["Partition B<br/>(Inactive)"]
        BOOT["Boot Loader"]
    end

    subgraph Upgrade["OS Upgrade Process"]
        direction TB
        S1["1. Write new OS to<br/>Partition B"]
        S2["2. Reboot node<br/>Boot from B"]
        S3["3. If success:<br/>B becomes active"]
        S4["4. If failure:<br/>Rollback to A"]
    end

    S1 --> S2 --> S3
    S2 --> S4
```

Upgrade Process

  1. Trigger -- Update the OS pack version in the cluster profile
  2. Download -- New OS image is pulled from the PMA internal registry
  3. Write -- New image is written to the inactive partition (B)
  4. Drain -- Node is cordoned and workloads are drained to other nodes
  5. Reboot -- Node reboots into the new partition
  6. Validate -- Palette agent verifies the node is healthy
  7. Activate -- If healthy, the new partition becomes the active boot target
  8. Rollback -- If unhealthy, the node reverts to the previous partition automatically
  9. Next node -- Process repeats for the remaining nodes (rolling)
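Palette automates the drain and rolling sequence above, but the per-node drain in steps 4 and 9 is equivalent to the standard kubectl workflow. A sketch for a single node (the node name `mgmt-node-1` is a placeholder):

```shell
# Cordon the node so no new pods are scheduled onto it
kubectl cordon mgmt-node-1

# Evict workloads to the other nodes; DaemonSet pods are skipped,
# and emptyDir scratch data is discarded
kubectl drain mgmt-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=10m

# ... node reboots into the new partition and is validated ...

# Return the node to service once it reports Ready
kubectl uncordon mgmt-node-1
```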

Zero-Downtime Upgrades

With a 3-node cluster, one node can be upgraded at a time while the other two maintain workload availability. Portworx handles storage replication during the drain/reboot cycle, and VMO can live-migrate VMs to available nodes.
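VMO is based on KubeVirt, so a live migration ahead of a node drain can be driven with the standard KubeVirt tooling. A sketch, assuming `virtctl` is available and using placeholder VM and namespace names:

```shell
# List running VMs (KubeVirt VirtualMachineInstance objects) across namespaces
kubectl get vmi -A

# Live-migrate a running VM off the node being upgraded
# ("my-vm" and "my-namespace" are placeholders)
virtctl migrate my-vm -n my-namespace

# Watch the VirtualMachineInstanceMigration object until it completes
kubectl get vmim -n my-namespace -w
```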

Node Operations

Adding a Node

To add a fourth (or additional) node to the cluster:

  1. Image the new bare-metal node using the install ISO + site-user-data ISO
  2. Node registers with Palette as a new edge host
  3. In Palette, navigate to the cluster --> Machine Pools
  4. Edit the machine pool and add the new edge host
  5. Palette joins the node to the cluster automatically
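Once the machine pool is updated, the join can be verified directly with kubectl (the node name `edge-node-4` is a placeholder):

```shell
# Watch for the new node to register and reach Ready
kubectl get nodes -w

# Confirm system pods are scheduled and running on the new node
kubectl get pods -A --field-selector spec.nodeName=edge-node-4 -o wide
```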

Removing a Node

  1. In Palette, navigate to the cluster --> Machine Pools
  2. Select the node to remove
  3. Palette cordons the node, drains workloads, and removes it from the cluster
  4. Portworx rebalances storage replicas across remaining nodes
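The Portworx rebalance can be observed with `pxctl`, run from inside one of the Portworx pods. A sketch; the namespace varies by install (`portworx` is assumed here, `kube-system` is also common):

```shell
# Find a Portworx pod (Portworx pods carry the label name=portworx)
PX_POD=$(kubectl get pods -n portworx -l name=portworx \
  -o jsonpath='{.items[0].metadata.name}')

# Overall cluster status: node count, storage pool health
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl status

# Per-volume replica placement, to confirm replicas have moved
# off the removed node
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl volume list
```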

Minimum 3 Nodes

The cluster requires a minimum of 3 nodes for K8s control plane quorum, etcd HA, and Portworx replication. Scaling below 3 nodes compromises cluster availability.

Node Replacement

For hardware failure scenarios:

  1. Image a replacement bare-metal node using the same install ISO + a new site-user-data ISO
  2. Register the replacement node with Palette
  3. Remove the failed node from the machine pool
  4. Add the replacement node to the machine pool
  5. Portworx resyncs data to the new node

Cluster Health Monitoring

Palette provides built-in cluster health monitoring:

| Health Check | Frequency | Alert Condition |
|--------------|-----------|-----------------|
| Node heartbeat | 30 seconds | Node unreachable > 5 minutes |
| Pod status | Continuous | Pods in CrashLoopBackOff |
| Profile drift | Continuous | Cluster state differs from profile |
| etcd health | 30 seconds | etcd member unhealthy |
| Resource utilization | 1 minute | CPU/memory threshold exceeded |
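The same signals Palette watches can be spot-checked directly with kubectl:

```shell
# Node heartbeat: look for any node not in Ready state
kubectl get nodes

# Pod status: surface pods stuck in CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff

# etcd health: under kubeadm, etcd runs as static pods on control plane nodes
kubectl get pods -n kube-system -l component=etcd

# Resource utilization (requires a metrics source such as metrics-server)
kubectl top nodes
```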

The Palette console shows real-time cluster status with color-coded health indicators:

  • Green -- All components healthy
  • Yellow -- Warning condition (non-critical)
  • Red -- Error condition requiring attention

kubectl Access

For direct cluster access, download the kubeconfig from Palette:

  1. Navigate to the cluster in Palette
  2. Click Download Kubeconfig
  3. Save to your local machine
Then verify cluster access:

```shell
export KUBECONFIG=/path/to/kubeconfig.yaml
kubectl get nodes
kubectl get pods -A
```

Air-Gap Constraint

kubectl access requires network connectivity to the cluster K8s API on port 6443. In the air-gapped Toyota environment, kubectl must be run from a machine on VLAN 111 that can reach the cluster nodes.
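Reachability of the API endpoint can be confirmed from the VLAN 111 machine before using the kubeconfig. A sketch; the IP `10.0.111.10` is a placeholder for the cluster's API VIP or a node address:

```shell
# Placeholder: replace with the cluster API VIP or a node IP on VLAN 111
API_HOST=10.0.111.10

# TCP reachability of the Kubernetes API port
nc -vz "$API_HOST" 6443

# Unauthenticated probe: any HTTP response (version JSON or 401/403)
# proves the API server answered on 6443
curl -k "https://$API_HOST:6443/version"
```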