Known Issues and Troubleshooting¶
This page documents issues encountered during the Toyota TMNA POC deployment, their root causes, and the solutions applied. These are specific to the NX-8150-G7 (Supermicro) hardware and the air-gapped appliance mode deployment.
UEFI Boot Failure on Supermicro Hardware¶
Severity: Blocking (resolved with Legacy mode)
Symptoms¶
- Custom appliance ISO mounts via IPMI virtual media but does not appear in the UEFI boot menu
- ISO fails to boot in UEFI mode on all 3 NX-8150-G7 nodes
- Windows Server 64-bit ISO boots and installs in UEFI mode on the same hardware -- confirming UEFI is functional
Root Cause¶
The Spectro Cloud appliance ISO's EFI partition structure is not compatible with the Supermicro BIOS version on the NX-8150-G7 chassis. The IPMI firmware may be an older version that has limited UEFI support for non-Windows ISOs. The exact failure is related to the x86/x64 EFI loader handoff -- the ISO's EFI boot path is not recognized by the Supermicro UEFI implementation.
BIOS Settings Tested¶
| Setting | Values Tested | Result |
|---|---|---|
| CSM | Disabled (UEFI only) | ISO not visible in boot menu |
| CSM | Enabled (Legacy + UEFI) | ISO boots in Legacy mode only |
| Secure Boot | Disabled | No change |
| Boot Order | CD first, then hard disk | Correct but UEFI path fails |
| Legacy-to-UEFI Handoff | Enabled | Not confirmed effective |
Solution¶
Use Legacy boot mode with CSM enabled. The ISO boots and installs correctly in Legacy/CSM mode.
1. Enter BIOS setup (DEL or F2 during POST)
2. Navigate to Boot --> CSM Configuration
3. Set CSM Support to Enabled
4. Set Boot option filter to Legacy only or UEFI and Legacy
5. Set Launch Storage OpROM policy to Legacy
6. Save and exit
7. Boot from the virtual media ISO
All Nodes Must Match
Ensure all 3 bare-metal nodes have identical BIOS settings. Legacy boot mode must be consistent across the cluster.
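Once a node is up, you can confirm which firmware mode it actually booted in. A minimal check using standard Linux sysfs (no tooling assumptions beyond a booted node):

```shell
# If /sys/firmware/efi exists, the kernel was booted via UEFI;
# if it is absent, the node booted in Legacy/CSM mode.
if [ -d /sys/firmware/efi ]; then
  echo "Boot mode: UEFI"
else
  echo "Boot mode: Legacy"
fi
```

Run this on each of the 3 nodes after install to verify the boot mode is consistent across the cluster.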
Drive Wipe Errors from Prior Platform Metadata¶
Severity: Blocking (resolved with sgdisk wipe)
Symptoms¶
- Portworx fails to initialize storage pools on drives from the prior platform deployment
- `lsblk` shows unexpected partition tables on 7TB SSDs and NVMe drives
- Drive initialization errors referencing prior metadata or partition UUID conflicts
Root Cause¶
The NX-8150-G7 nodes were previously part of an existing cluster (STG-P-AHV001). The local SSDs and NVMe drives contain metadata, partition tables, and filesystem signatures from the prior deployment. Portworx requires clean, unpartitioned drives to build its storage pool.
Solution¶
Wipe all 7TB+ drives during the install stage of the appliance ISO using `sgdisk --zap-all`:
```shell
# Wipe all 7TB drives (SSDs and NVMe); the boot drive is smaller than 7T,
# so the size filter excludes it automatically
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
    echo "Wiping $disk..."
    sgdisk --zap-all "$disk"                    # Destroy GPT and MBR
    wipefs -a "$disk"                           # Clear filesystem signatures
    dd if=/dev/zero of="$disk" bs=1M count=100  # Zero first 100MB
done
```
This is baked into the install ISO user-data under `stages.install` and `stages.reset` so it runs automatically during both initial install and any future node reset.
Data Destruction
This wipe is destructive and irreversible. It is intentional for the POC -- these drives contained data from the decommissioned prior platform cluster.
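As a sketch, the wipe can be embedded in the ISO user-data roughly as follows. Stage and field names follow the Kairos-style cloud-init schema that Palette Edge appliances are built on; verify against your actual user-data format before rebuilding the ISO:

```yaml
#cloud-config
stages:
  install:
    - name: "Wipe prior-platform metadata from 7TB data drives"
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
            dd if=/dev/zero of="$disk" bs=1M count=100
          done
  reset:
    - name: "Wipe data drives again on node reset"
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
            dd if=/dev/zero of="$disk" bs=1M count=100
          done
```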
IPMI Virtual Media Mount Issues¶
Severity: Moderate
Symptoms¶
- ISO fails to mount via IPMI virtual media
- Virtual media shows "Connected" but the drive is not visible as a boot device
- Intermittent disconnection of virtual media during ISO boot/install
Root Cause¶
IPMI firmware on the NX-8150-G7 may have limitations with large ISO files or specific virtual media configurations. Network latency between the management station and the IPMI interface can cause disconnections.
Solutions¶
- Use a smaller ISO when possible -- The production install ISO (~1.8 GB) is much more reliable than large ISOs via virtual media
- Upload to local storage -- If virtual media is unreliable, upload the ISO to a local USB drive or the node's IPMI SD card slot
- Verify IPMI firmware -- Check if a firmware update is available for the Supermicro BMC
- Use HTML5 console -- The IPMI HTML5 console may handle virtual media more reliably than the Java-based console
Site User Data ISO
The site-user-data ISOs are very small (~1 MB each) and mount reliably via virtual media. Issues are primarily with the larger install ISO.
Cloudflared Tunnel Catch-All 404¶
Severity: Low (development/testing only)
Symptoms¶
- Cloudflared tunnel configured for remote access returns 404 for all routes
- Tunnel is connected but no traffic reaches the target service
Root Cause¶
The config.yml file at /etc/cloudflared/config.yml has a catch-all rule that intercepts all traffic before it reaches the intended ingress rules.
Solution¶
Move or rename the default config file:
```shell
# Move the default config out of the way
sudo mv /etc/cloudflared/config.yml /etc/cloudflared/config.yml.bak

# Restart cloudflared to pick up the tunnel-specific config
sudo systemctl restart cloudflared
```
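For reference, a tunnel-specific config should list specific routes first and keep the catch-all as the final ingress rule only. A minimal sketch (tunnel ID, hostname, and port below are placeholders, not values from this deployment):

```yaml
# Sketch of a tunnel-specific cloudflared config -- IDs and hostnames are placeholders
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json

ingress:
  # Specific routes must come before the catch-all
  - hostname: poc-access.example.com
    service: http://localhost:6443
  # The catch-all must be the LAST rule, or it intercepts all traffic
  - service: http_status:404
```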
POC Context
This issue only applies if cloudflared is used for remote access to the POC environment. In the fully air-gapped deployment, cloudflared is not used.
Bond Not Forming / NIC Not Detected¶
Severity: High (if encountered)
Symptoms¶
- `cat /proc/net/bonding/bond0` shows fewer than 4 slave interfaces
- Network connectivity is available but without full bandwidth/redundancy
- `ip link` shows some NICs in DOWN state
Possible Causes¶
- Switch-side LACP not configured on all 4 ports
- NIC firmware requires update
- Cable issue on one or more ports
- NIC naming mismatch (interface names differ from expected)
Diagnostic Steps¶
```shell
# Check bond status and slave interfaces
cat /proc/net/bonding/bond0

# Check all network interfaces
ip link show

# Check for link on each NIC
ethtool enp134s0f0np0 | grep "Link detected"
ethtool enp134s0f0np1 | grep "Link detected"
ethtool enp175s0f0np0 | grep "Link detected"
ethtool enp175s0f0np1 | grep "Link detected"

# Check systemd-networkd status
systemctl status systemd-networkd
networkctl status bond0
```
Solution¶
- Verify switch LACP configuration on all 4 ports per node
- Verify cable connections
- If NIC names differ, update the systemd-networkd config files and rebuild the ISO
- Check `journalctl -u systemd-networkd` for errors
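If the interface names need correcting, the bond is defined in the systemd-networkd units baked into the ISO. A sketch using this cluster's expected NIC names (the file layout and bond parameters here are assumptions; mirror the actual config files when rebuilding):

```ini
# 10-bond0.netdev -- define the LACP (802.3ad) bond
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
MIIMonitorSec=1s
LACPTransmitRate=fast

# 20-bond0-slaves.network -- enslave all 4 NICs (a separate file in practice)
[Match]
Name=enp134s0f0np0 enp134s0f0np1 enp175s0f0np0 enp175s0f0np1

[Network]
Bond=bond0
```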
Portworx Pool Initialization Failure¶
Severity: High (if encountered)
Symptoms¶
- Portworx pods are running but no storage pool is created
- `pxctl status` shows 0 pools or a pool in an error state
- Drive errors in Portworx logs
Diagnostic Steps¶
```shell
# Get Portworx pod name
PX_POD=$(kubectl get pods -n portworx -l name=portworx -o jsonpath='{.items[0].metadata.name}')

# Check PX status
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl status

# Check drives
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service drive show

# Check PX logs
kubectl logs -n portworx "$PX_POD" --tail=100
```
Common Fixes¶
- Drives not wiped -- Re-run `sgdisk --zap-all` on all data drives
- Boot drive in pool -- Verify Portworx is not trying to use `/dev/sda` (the boot drive)
- Insufficient metadata space -- Ensure at least 64GB is available for PX metadata on the fastest NVMe drive
- KVDB failure -- Check the internal KVDB membership: `kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service kvdb members`
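Before re-running the destructive wipe, a non-destructive check can confirm whether the data drives still carry old signatures. Running `wipefs` with no flags only lists signatures; the size filter matches the one in the wipe script:

```shell
# List (without erasing) any leftover partition/filesystem signatures
# on the 7TB data drives
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
  echo "== $disk =="
  wipefs "$disk"   # no -a flag: read-only listing of signatures
done
```

Clean drives produce no `wipefs` output; any listed signature means the wipe did not complete on that drive.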
General Troubleshooting Commands¶
```shell
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A -o wide

# Events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Palette agent status (SSH to node)
systemctl status stylus

# Palette agent logs (SSH to node)
journalctl -u stylus --since "1 hour ago"

# etcd health
kubectl exec -n kube-system etcd-<node-name> -- etcdctl endpoint health

# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status

# VMO/KubeVirt status
kubectl get kubevirt -A
kubectl get vm -A
kubectl get vmi -A
```