
Known Issues and Troubleshooting

This page documents issues encountered during the Toyota TMNA POC deployment, their root causes, and the solutions applied. These are specific to the NX-8150-G7 (Supermicro) hardware and the air-gapped appliance mode deployment.

UEFI Boot Failure on Supermicro Hardware

Severity: Blocking (resolved with Legacy mode)

Symptoms

  • Custom appliance ISO mounts via IPMI virtual media but does not appear in the UEFI boot menu
  • ISO fails to boot in UEFI mode on all 3 NX-8150-G7 nodes
  • Windows Server 64-bit ISO boots and installs in UEFI mode on the same hardware -- confirming UEFI is functional

Root Cause

The Spectro Cloud appliance ISO's EFI partition structure is not recognized by the Supermicro UEFI implementation on the NX-8150-G7 chassis. The IPMI firmware may also be an older version with limited UEFI support for non-Windows ISOs. The failure appears to occur at the x86/x64 EFI loader handoff: the Supermicro firmware does not recognize the ISO's EFI boot path, so the ISO never appears as a UEFI boot option.

BIOS Settings Tested

| Setting | Values Tested | Result |
| --- | --- | --- |
| CSM | Disabled (UEFI only) | ISO not visible in boot menu |
| CSM | Enabled (Legacy + UEFI) | ISO boots in Legacy mode only |
| Secure Boot | Disabled | No change |
| Boot Order | CD first, then hard disk | Order correct, but UEFI path fails |
| Legacy-to-UEFI Handoff | Enabled | Not confirmed effective |

Solution

Use Legacy boot mode with CSM enabled. The ISO boots and installs correctly in Legacy/CSM mode.

  1. Enter BIOS setup (DEL or F2 during POST)
  2. Navigate to Boot --> CSM Configuration
  3. Set CSM Support to Enabled
  4. Set Boot option filter to Legacy only or UEFI and Legacy
  5. Set Launch Storage OpROM policy to Legacy
  6. Save and exit
  7. Boot from the virtual media ISO

All Nodes Must Match

Ensure all 3 bare-metal nodes have identical BIOS settings. Legacy boot mode must be consistent across the cluster.
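A node's active boot mode can also be confirmed from the OS after install, since the kernel exposes /sys/firmware/efi only when booted via UEFI. A minimal sketch (the path argument exists only so the check can be exercised off-node):

```shell
# Report the firmware boot mode of the running node.
# /sys/firmware/efi exists only when the kernel was booted via UEFI,
# so its absence indicates a Legacy/CSM (BIOS) boot.
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo "UEFI"
    else
        echo "Legacy"
    fi
}

boot_mode   # should print "Legacy" on every node after the CSM change
```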


Drive Wipe Errors from Prior Platform Metadata

Severity: Blocking (resolved with sgdisk wipe)

Symptoms

  • Portworx fails to initialize storage pools on drives from the prior platform deployment
  • lsblk shows unexpected partition tables on 7TB SSDs and NVMe drives
  • Drive initialization errors referencing prior metadata or partition UUID conflicts

Root Cause

The NX-8150-G7 nodes were previously part of an existing cluster (STG-P-AHV001). The local SSDs and NVMe drives contain metadata, partition tables, and filesystem signatures from the prior deployment. Portworx requires clean, unpartitioned drives to build its storage pool.

Solution

Wipe all 7TB+ drives during the install stage of the appliance ISO using sgdisk --zap-all:

Drive wipe commands
# Wipe all 7TB data drives (SSDs and NVMe); the size filter
# implicitly excludes the smaller boot drive
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
    echo "Wiping $disk..."
    sgdisk --zap-all "$disk"                      # Destroy GPT and MBR structures
    wipefs -a "$disk"                             # Clear filesystem signatures
    dd if=/dev/zero of="$disk" bs=1M count=100    # Zero the first 100 MB
done

This is baked into the install ISO user-data under stages.install and stages.reset so it runs automatically during both initial install and any future node reset.
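As a sketch of how that wiring might look -- assuming the Kairos-style cloud-config stage format used by Palette Edge ISOs, with hypothetical stage names:

```yaml
#cloud-config
stages:
  install:
    - name: "wipe-prior-platform-drives"   # hypothetical stage name
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
            dd if=/dev/zero of="$disk" bs=1M count=100
          done
  reset:
    - name: "wipe-drives-on-reset"         # hypothetical stage name
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
          done
```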

Data Destruction

This wipe is destructive and irreversible. It is intentional for the POC -- these drives contained data from the decommissioned prior platform cluster.


IPMI Virtual Media Mount Issues

Severity: Moderate

Symptoms

  • ISO fails to mount via IPMI virtual media
  • Virtual media shows "Connected" but the drive is not visible as a boot device
  • Intermittent disconnection of virtual media during ISO boot/install

Root Cause

IPMI firmware on the NX-8150-G7 may have limitations with large ISO files or specific virtual media configurations. Network latency between the management station and the IPMI interface can cause disconnections.

Solutions

  1. Use a smaller ISO when possible -- The ~1.8 GB production install ISO mounts far more reliably over virtual media than larger ISOs
  2. Upload to local storage -- If virtual media is unreliable, upload the ISO to a local USB drive or the node's IPMI SD card slot
  3. Verify IPMI firmware -- Check if a firmware update is available for the Supermicro BMC
  4. Use HTML5 console -- The IPMI HTML5 console may handle virtual media more reliably than the Java-based console

Check IPMI firmware version
ipmitool -I lanplus -H <IPMI-IP> -U admin -P <password> mc info
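The firmware check can be scripted across all three nodes by filtering the mc info output for the revision field. The IPMI addresses and the IPMI_PASSWORD variable below are placeholders; the loop is skipped if ipmitool is not installed:

```shell
# Print the "Firmware Revision" field from `ipmitool mc info` output (stdin)
fw_revision() {
    awk -F': *' '/^Firmware Revision/ {print $2}'
}

# Placeholder IPMI addresses -- substitute the real BMC IPs
if command -v ipmitool >/dev/null 2>&1; then
    for ip in 10.0.0.11 10.0.0.12 10.0.0.13; do
        printf '%s: ' "$ip"
        ipmitool -I lanplus -H "$ip" -U admin -P "$IPMI_PASSWORD" mc info | fw_revision
    done
fi
```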

Site User Data ISO

The site-user-data ISOs are very small (~1 MB each) and mount reliably via virtual media. Issues are primarily with the larger install ISO.


Cloudflared Tunnel Catch-All 404

Severity: Low (development/testing only)

Symptoms

  • Cloudflared tunnel configured for remote access returns 404 for all routes
  • Tunnel is connected but no traffic reaches the target service

Root Cause

The config.yml file at /etc/cloudflared/config.yml has a catch-all rule that intercepts all traffic before it reaches the intended ingress rules.

Solution

Move or rename the default config file:

Fix cloudflared catch-all
# Move the default config out of the way
sudo mv /etc/cloudflared/config.yml /etc/cloudflared/config.yml.bak

# Restart cloudflared to pick up the tunnel-specific config
sudo systemctl restart cloudflared
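Alternatively, a catch-all rule can be kept as long as it is ordered last. A sketch of a tunnel-specific config -- the hostname, tunnel ID, and port are placeholders:

```yaml
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json

ingress:
  # Specific routes must come before the catch-all
  - hostname: poc.example.com
    service: http://localhost:8080
  # The catch-all must be the LAST rule; listed first, it intercepts everything
  - service: http_status:404
```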

POC Context

This issue only applies if cloudflared is used for remote access to the POC environment. In the fully air-gapped deployment, cloudflared is not used.


Bond Not Forming / NIC Not Detected

Severity: High (if encountered)

Symptoms

  • cat /proc/net/bonding/bond0 shows fewer than 4 slave interfaces
  • Network connectivity is available but without full bandwidth/redundancy
  • ip link shows some NICs in DOWN state

Possible Causes

  1. Switch-side LACP not configured on all 4 ports
  2. NIC firmware requires update
  3. Cable issue on one or more ports
  4. NIC naming mismatch (interface names differ from expected)

Diagnostic Steps

Check bond status
cat /proc/net/bonding/bond0

# Check all network interfaces
ip link show

# Check for link on each NIC
ethtool enp134s0f0np0 | grep "Link detected"
ethtool enp134s0f0np1 | grep "Link detected"
ethtool enp175s0f0np0 | grep "Link detected"
ethtool enp175s0f0np1 | grep "Link detected"

# Check systemd-networkd status
systemctl status systemd-networkd
networkctl status bond0

Solution

  1. Verify switch LACP configuration on all 4 ports per node
  2. Verify cable connections
  3. If NIC names differ, update the systemd-networkd config files and rebuild the ISO
  4. Check journalctl -u systemd-networkd for errors
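The first symptom -- the bond reporting fewer than four slaves -- can be checked with a small helper that parses the bonding status file. The path is a parameter only so the helper can be exercised against a sample file:

```shell
# Count slave interfaces listed in a bonding status file
slave_count() {
    grep -c '^Slave Interface:' "$1"
}

# Usage on a node (expects 4 on a healthy NX-8150-G7):
#   [ "$(slave_count /proc/net/bonding/bond0)" -eq 4 ] || echo "bond0 degraded"
```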

Portworx Pool Initialization Failure

Severity: High (if encountered)

Symptoms

  • Portworx pods are running but no storage pool is created
  • pxctl status shows 0 pools or pool in error state
  • Drive errors in Portworx logs

Diagnostic Steps

Check Portworx status
# Get Portworx pod name
PX_POD=$(kubectl get pods -n portworx -l name=portworx -o jsonpath='{.items[0].metadata.name}')

# Check PX status
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl status

# Check drives
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service drive show

# Check PX logs
kubectl logs -n portworx "$PX_POD" --tail=100

Common Fixes

  1. Drives not wiped -- Re-run sgdisk --zap-all on all data drives
  2. Boot drive in pool -- Verify Portworx is not trying to use /dev/sda (the boot drive)
  3. Insufficient metadata space -- Ensure at least 64GB is available for PX metadata on the fastest NVMe drive
  4. KVDB failure -- Check internal etcd logs: kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service kvdb members
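For fix 2, the size filter from the drive-wipe script can be reused to list which devices should end up in the pool. A sketch; treating sda as the boot device is an assumption:

```shell
# List candidate Portworx data drives: 7T devices, excluding the boot disk.
# Reads `lsblk -dno NAME,SIZE` output on stdin.
data_drives() {
    awk -v boot="sda" '$2 == "7T" && $1 != boot {print "/dev/"$1}'
}

# Usage on a node:
#   lsblk -dno NAME,SIZE | data_drives
```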

General Troubleshooting Commands

Cluster health checks
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A -o wide

# Events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Palette agent status (SSH to node)
systemctl status stylus

# Palette agent logs (SSH to node)
journalctl -u stylus --since "1 hour ago"

# etcd health
kubectl exec -n kube-system etcd-<node-name> -- etcdctl endpoint health

# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status

# VMO/KubeVirt status
kubectl get kubevirt -A
kubectl get vm -A
kubectl get vmi -A