Cluster Management

Palette manages the full lifecycle of the 3-node bare-metal cluster through cluster profiles, declarative updates, and A/B partition-based OS upgrades.

Cluster Profiles

The cluster is deployed with three layered profiles:

Infrastructure Profile

Defines the base platform stack:

| Layer | Pack | Version | Purpose |
|-------|------|---------|---------|
| OS | edge-native-byoi | 2.1.0 | Ubuntu 24.04 with Kairos |
| K8s | edge-k8s (PXKe) | 1.33.6 | Kubernetes (kubeadm) |
| CNI | cni-cilium-oss | 1.18.4 | eBPF networking |
| CSI | Portworx | -- | Software-defined storage |

Core Add-on Profile

Platform services deployed on top of K8s:

| Pack | Version | Purpose |
|------|---------|---------|
| MetalLB | 0.15.2 | L2 load balancer for bare-metal |
| NGINX Ingress | 1.14.1 | Ingress controller |
| Prometheus Operator | 80.4.2 | Monitoring and alerting |
| VMO | 4.8.9 | Virtual Machine Orchestrator |

VMA Profile

| Pack | Version | Purpose |
|------|---------|---------|
| VM Migration Assistant | 4.8.8 | VM migration tooling |

Profile Updates

When a cluster profile is updated, Palette orchestrates the rollout:

```mermaid
graph LR
    EDIT["Edit Profile<br/>in Palette"] --> DIFF["Review Diff<br/>Pending Update"]
    DIFF --> APPLY["Apply Update"]
    APPLY --> ROLL["Rolling Rollout<br/>Node by Node"]
    ROLL --> DONE["Update Complete"]

    style EDIT fill:#1F7A78,color:#fff
    style DIFF fill:#005B5B,color:#fff
    style APPLY fill:#043736,color:#fff
    style ROLL fill:#005B5B,color:#fff
    style DONE fill:#9EB277,color:#fff
```

Update Types

| Update Type | Behavior | Downtime |
|-------------|----------|----------|
| Pack value change | Rolling update, pod restart | Minimal (per-pod) |
| Pack version upgrade | Rolling update, new images pulled | Minimal (per-node) |
| K8s version upgrade | Rolling drain + upgrade per node | One node at a time |
| OS upgrade | A/B partition swap + reboot | One node at a time |

A/B Partition OS Upgrades

Appliance mode uses an A/B partition layout for zero-downtime OS upgrades. This is a key POC success criterion for Toyota.

How It Works

```mermaid
graph TB
    subgraph Boot["Boot Drive (/dev/sda)"]
        A["Partition A<br/>(Active OS)"]
        B["Partition B<br/>(Inactive)"]
        BOOT["Boot Loader"]
    end

    subgraph Upgrade["OS Upgrade Process"]
        direction TB
        S1["1. Write new OS to<br/>Partition B"]
        S2["2. Reboot node<br/>Boot from B"]
        S3["3. If success:<br/>B becomes active"]
        S4["4. If failure:<br/>Rollback to A"]
    end

    S1 --> S2 --> S3
    S2 --> S4
```

Upgrade Process

  1. Trigger -- Update the OS pack version in the cluster profile
  2. Download -- New OS image is pulled from the PMA internal registry
  3. Write -- New image is written to the inactive partition (B)
  4. Drain -- Node is cordoned and workloads are drained to other nodes
  5. Reboot -- Node reboots into the new partition
  6. Validate -- Palette agent verifies the node is healthy
  7. Activate -- If healthy, the new partition becomes the active boot target
  8. Rollback -- If unhealthy, the node reverts to the previous partition automatically
  9. Next node -- Process repeats for the remaining nodes (rolling)
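Palette automates the drain and rolling sequence above, but the per-node drain in steps 4 and 9 is equivalent to the standard kubectl workflow. A sketch for a single node (the node name `mgmt-node-1` is a placeholder):

```shell
# Cordon the node so no new pods are scheduled onto it
kubectl cordon mgmt-node-1

# Evict workloads to the other nodes; DaemonSet pods are skipped,
# and emptyDir scratch data is discarded
kubectl drain mgmt-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=10m

# ... node reboots into the new partition and is validated ...

# Return the node to service once it reports Ready
kubectl uncordon mgmt-node-1
```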

Zero-Downtime Upgrades

With a 3-node cluster, one node can be upgraded at a time while the other two maintain workload availability. Portworx handles storage replication during the drain/reboot cycle, and VMO can live-migrate VMs to available nodes.
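VMO is based on KubeVirt, so a live migration ahead of a node drain can be driven with the standard KubeVirt tooling. A sketch, assuming `virtctl` is available and using placeholder VM and namespace names:

```shell
# List running VMs (KubeVirt VirtualMachineInstance objects) across namespaces
kubectl get vmi -A

# Live-migrate a running VM off the node being upgraded
# ("my-vm" and "my-namespace" are placeholders)
virtctl migrate my-vm -n my-namespace

# Watch the VirtualMachineInstanceMigration object until it completes
kubectl get vmim -n my-namespace -w
```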

Node Operations

Adding a Node

To add a fourth (or additional) node to the cluster:

  1. Image the new bare-metal node using the install ISO + site-user-data ISO
  2. Node registers with Palette as a new edge host
  3. In Palette, navigate to the cluster --> Machine Pools
  4. Edit the machine pool and add the new edge host
  5. Palette joins the node to the cluster automatically
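Once the machine pool is updated, the join can be verified directly with kubectl (the node name `edge-node-4` is a placeholder):

```shell
# Watch for the new node to register and reach Ready
kubectl get nodes -w

# Confirm system pods are scheduled and running on the new node
kubectl get pods -A --field-selector spec.nodeName=edge-node-4 -o wide
```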

Removing a Node

  1. In Palette, navigate to the cluster --> Machine Pools
  2. Select the node to remove
  3. Palette cordons the node, drains workloads, and removes it from the cluster
  4. Portworx rebalances storage replicas across remaining nodes
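The Portworx rebalance can be observed with `pxctl`, run from inside one of the Portworx pods. A sketch; the namespace varies by install (`portworx` is assumed here, `kube-system` is also common):

```shell
# Find a Portworx pod (Portworx pods carry the label name=portworx)
PX_POD=$(kubectl get pods -n portworx -l name=portworx \
  -o jsonpath='{.items[0].metadata.name}')

# Overall cluster status: node count, storage pool health
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl status

# Per-volume replica placement, to confirm replicas have moved
# off the removed node
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl volume list
```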

Minimum 3 Nodes

The cluster requires a minimum of 3 nodes for K8s control plane quorum, etcd HA, and Portworx replication. Scaling below 3 nodes compromises cluster availability.

Node Replacement

For hardware failure scenarios:

  1. Image a replacement bare-metal node using the same install ISO + a new site-user-data ISO
  2. Register the replacement node with Palette
  3. Remove the failed node from the machine pool
  4. Add the replacement node to the machine pool
  5. Portworx resyncs data to the new node

Cluster Health Monitoring

Palette provides built-in cluster health monitoring:

| Health Check | Frequency | Alert Condition |
|--------------|-----------|-----------------|
| Node heartbeat | 30 seconds | Node unreachable > 5 minutes |
| Pod status | Continuous | Pods in CrashLoopBackOff |
| Profile drift | Continuous | Cluster state differs from profile |
| etcd health | 30 seconds | etcd member unhealthy |
| Resource utilization | 1 minute | CPU/memory threshold exceeded |
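The same signals Palette watches can be spot-checked directly with kubectl:

```shell
# Node heartbeat: look for any node not in Ready state
kubectl get nodes

# Pod status: surface pods stuck in CrashLoopBackOff
kubectl get pods -A | grep CrashLoopBackOff

# etcd health: under kubeadm, etcd runs as static pods on control plane nodes
kubectl get pods -n kube-system -l component=etcd

# Resource utilization (requires a metrics source such as metrics-server)
kubectl top nodes
```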

The Palette console shows real-time cluster status with color-coded health indicators:

  • Green -- All components healthy
  • Yellow -- Warning condition (non-critical)
  • Red -- Error condition requiring attention

kubectl Access

For direct cluster access, download the kubeconfig from Palette:

  1. Navigate to the cluster in Palette
  2. Click Download Kubeconfig
  3. Save to your local machine
Then verify cluster access:

```shell
export KUBECONFIG=/path/to/kubeconfig.yaml
kubectl get nodes
kubectl get pods -A
```

Air-Gap Constraint

kubectl access requires network connectivity to the cluster K8s API on port 6443. In the air-gapped Toyota environment, kubectl must be run from a machine on VLAN 111 that can reach the cluster nodes.
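Reachability of the API endpoint can be confirmed from the VLAN 111 machine before using the kubeconfig. A sketch; the IP `10.0.111.10` is a placeholder for the cluster's API VIP or a node address:

```shell
# Placeholder: replace with the cluster API VIP or a node IP on VLAN 111
API_HOST=10.0.111.10

# TCP reachability of the Kubernetes API port
nc -vz "$API_HOST" 6443

# Unauthenticated probe: any HTTP response (version JSON or 401/403)
# proves the API server answered on 6443
curl -k "https://$API_HOST:6443/version"
```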