Known Issues and Troubleshooting¶
This page documents issues encountered during the Toyota TMNA POC deployment, their root causes, and the solutions applied. These are specific to the NX-8150-G7 (Supermicro) hardware and the air-gapped appliance mode deployment.
UEFI Boot Failure on Supermicro Hardware¶
Severity: Blocking (resolved with Legacy mode)
Symptoms¶
- Custom appliance ISO mounts via IPMI virtual media but does not appear in the UEFI boot menu
- ISO fails to boot in UEFI mode on all 3 NX-8150-G7 nodes
- Windows Server 64-bit ISO boots and installs in UEFI mode on the same hardware -- confirming UEFI is functional
Root Cause¶
The Spectro Cloud appliance ISO's EFI partition structure is not compatible with the Supermicro BIOS version on the NX-8150-G7 chassis. The IPMI firmware may be an older version that has limited UEFI support for non-Windows ISOs. The exact failure is related to the x86/x64 EFI loader handoff -- the ISO's EFI boot path is not recognized by the Supermicro UEFI implementation.
BIOS Settings Tested¶
| Setting | Values Tested | Result |
|---|---|---|
| CSM | Disabled (UEFI only) | ISO not visible in boot menu |
| CSM | Enabled (Legacy + UEFI) | ISO boots in Legacy mode only |
| Secure Boot | Disabled | No change |
| Boot Order | CD first, then hard disk | Correct but UEFI path fails |
| Legacy-to-UEFI Handoff | Enabled | Not confirmed effective |
Solution¶
Use Legacy boot mode with CSM enabled. The ISO boots and installs correctly in Legacy/CSM mode.
1. Enter BIOS setup (DEL or F2 during POST)
2. Navigate to Boot --> CSM Configuration
3. Set CSM Support to Enabled
4. Set Boot option filter to Legacy only or UEFI and Legacy
5. Set Launch Storage OpROM policy to Legacy
6. Save and exit
7. Boot from the virtual media ISO
All Nodes Must Match
Ensure all 3 bare-metal nodes have identical BIOS settings. Legacy boot mode must be consistent across the cluster.
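Once a node is up, you can confirm which firmware mode it actually booted in. A minimal check using standard Linux sysfs (no tooling assumptions beyond a booted node):

```shell
# If /sys/firmware/efi exists, the kernel was booted via UEFI;
# if it is absent, the node booted in Legacy/CSM mode.
if [ -d /sys/firmware/efi ]; then
  echo "Boot mode: UEFI"
else
  echo "Boot mode: Legacy"
fi
```

Run this on each of the 3 nodes after install to verify the boot mode is consistent across the cluster.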
Drive Wipe Errors from Prior Platform Metadata¶
Severity: Blocking (resolved with sgdisk wipe)
Symptoms¶
- Portworx fails to initialize storage pools on drives from the prior platform deployment
- `lsblk` shows unexpected partition tables on 7TB SSDs and NVMe drives
- Drive initialization errors referencing prior metadata or partition UUID conflicts
Root Cause¶
The NX-8150-G7 nodes were previously part of an existing cluster (STG-P-AHV001). The local SSDs and NVMe drives contain metadata, partition tables, and filesystem signatures from the prior deployment. Portworx requires clean, unpartitioned drives to build its storage pool.
Solution¶
Wipe all 7TB+ drives during the install stage of the appliance ISO using `sgdisk --zap-all`:
```shell
# Wipe all 7TB drives (SSDs and NVMe); the boot drive is smaller than 7T,
# so the size filter excludes it automatically
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
    echo "Wiping $disk..."
    sgdisk --zap-all "$disk"                    # Destroy GPT and MBR
    wipefs -a "$disk"                           # Clear filesystem signatures
    dd if=/dev/zero of="$disk" bs=1M count=100  # Zero first 100MB
done
```
This is baked into the install ISO user-data under `stages.install` and `stages.reset` so it runs automatically during both initial install and any future node reset.
Data Destruction
This wipe is destructive and irreversible. It is intentional for the POC -- these drives contained data from the decommissioned prior platform cluster.
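As a sketch, the wipe can be embedded in the ISO user-data roughly as follows. Stage and field names follow the Kairos-style cloud-init schema that Palette Edge appliances are built on; verify against your actual user-data format before rebuilding the ISO:

```yaml
#cloud-config
stages:
  install:
    - name: "Wipe prior-platform metadata from 7TB data drives"
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
            dd if=/dev/zero of="$disk" bs=1M count=100
          done
  reset:
    - name: "Wipe data drives again on node reset"
      commands:
        - |
          for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
            sgdisk --zap-all "$disk"
            wipefs -a "$disk"
            dd if=/dev/zero of="$disk" bs=1M count=100
          done
```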
IPMI Virtual Media Mount Issues¶
Severity: Moderate
Symptoms¶
- ISO fails to mount via IPMI virtual media
- Virtual media shows "Connected" but the drive is not visible as a boot device
- Intermittent disconnection of virtual media during ISO boot/install
Root Cause¶
IPMI firmware on the NX-8150-G7 may have limitations with large ISO files or specific virtual media configurations. Network latency between the management station and the IPMI interface can cause disconnections.
Solutions¶
- Use a smaller ISO when possible -- The production install ISO (~1.8 GB) is much more reliable than large ISOs via virtual media
- Upload to local storage -- If virtual media is unreliable, upload the ISO to a local USB drive or the node's IPMI SD card slot
- Verify IPMI firmware -- Check if a firmware update is available for the Supermicro BMC
- Use HTML5 console -- The IPMI HTML5 console may handle virtual media more reliably than the Java-based console
Site User Data ISO
The site-user-data ISOs are very small (~1 MB each) and mount reliably via virtual media. Issues are primarily with the larger install ISO.
Cloudflared Tunnel Catch-All 404¶
Severity: Low (development/testing only)
Symptoms¶
- Cloudflared tunnel configured for remote access returns 404 for all routes
- Tunnel is connected but no traffic reaches the target service
Root Cause¶
The config.yml file at /etc/cloudflared/config.yml has a catch-all rule that intercepts all traffic before it reaches the intended ingress rules.
Solution¶
Move or rename the default config file:
```shell
# Move the default config out of the way
sudo mv /etc/cloudflared/config.yml /etc/cloudflared/config.yml.bak

# Restart cloudflared to pick up the tunnel-specific config
sudo systemctl restart cloudflared
```
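For reference, a tunnel-specific config should list specific routes first and keep the catch-all as the final ingress rule only. A minimal sketch (tunnel ID, hostname, and port below are placeholders, not values from this deployment):

```yaml
# Sketch of a tunnel-specific cloudflared config -- IDs and hostnames are placeholders
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json

ingress:
  # Specific routes must come before the catch-all
  - hostname: poc-access.example.com
    service: http://localhost:6443
  # The catch-all must be the LAST rule, or it intercepts all traffic
  - service: http_status:404
```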
POC Context
This issue only applies if cloudflared is used for remote access to the POC environment. In the fully air-gapped deployment, cloudflared is not used.
Bond Not Forming / NIC Not Detected¶
Severity: High (if encountered)
Symptoms¶
- `cat /proc/net/bonding/bond0` shows fewer than 4 slave interfaces
- Network connectivity is available but without full bandwidth/redundancy
- `ip link` shows some NICs in DOWN state
Possible Causes¶
- Switch-side LACP not configured on all 4 ports
- NIC firmware requires update
- Cable issue on one or more ports
- NIC naming mismatch (interface names differ from expected)
Diagnostic Steps¶
```shell
# Check bond status and slave interfaces
cat /proc/net/bonding/bond0

# Check all network interfaces
ip link show

# Check for link on each NIC
ethtool enp134s0f0np0 | grep "Link detected"
ethtool enp134s0f0np1 | grep "Link detected"
ethtool enp175s0f0np0 | grep "Link detected"
ethtool enp175s0f0np1 | grep "Link detected"

# Check systemd-networkd status
systemctl status systemd-networkd
networkctl status bond0
```
Solution¶
- Verify switch LACP configuration on all 4 ports per node
- Verify cable connections
- If NIC names differ, update the systemd-networkd config files and rebuild the ISO
- Check `journalctl -u systemd-networkd` for errors
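If the interface names need correcting, the bond is defined in the systemd-networkd units baked into the ISO. A sketch using this cluster's expected NIC names (the file layout and bond parameters here are assumptions; mirror the actual config files when rebuilding):

```ini
# 10-bond0.netdev -- define the LACP (802.3ad) bond
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
MIIMonitorSec=1s
LACPTransmitRate=fast

# 20-bond0-slaves.network -- enslave all 4 NICs (a separate file in practice)
[Match]
Name=enp134s0f0np0 enp134s0f0np1 enp175s0f0np0 enp175s0f0np1

[Network]
Bond=bond0
```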
Portworx Pool Initialization Failure¶
Severity: High (if encountered)
Symptoms¶
- Portworx pods are running but no storage pool is created
- `pxctl status` shows 0 pools or a pool in an error state
- Drive errors in Portworx logs
Diagnostic Steps¶
```shell
# Get Portworx pod name
PX_POD=$(kubectl get pods -n portworx -l name=portworx -o jsonpath='{.items[0].metadata.name}')

# Check PX status
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl status

# Check drives
kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service drive show

# Check PX logs
kubectl logs -n portworx "$PX_POD" --tail=100
```
Common Fixes¶
- Drives not wiped -- Re-run `sgdisk --zap-all` on all data drives
- Boot drive in pool -- Verify Portworx is not trying to use `/dev/sda` (the boot drive)
- Insufficient metadata space -- Ensure at least 64GB is available for PX metadata on the fastest NVMe drive
- KVDB failure -- Check the internal KVDB membership: `kubectl exec -n portworx "$PX_POD" -- /opt/pwx/bin/pxctl service kvdb members`
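Before re-running the destructive wipe, a non-destructive check can confirm whether the data drives still carry old signatures. Running `wipefs` with no flags only lists signatures; the size filter matches the one in the wipe script:

```shell
# List (without erasing) any leftover partition/filesystem signatures
# on the 7TB data drives
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "7T" {print "/dev/"$1}'); do
  echo "== $disk =="
  wipefs "$disk"   # no -a flag: read-only listing of signatures
done
```

Clean drives produce no `wipefs` output; any listed signature means the wipe did not complete on that drive.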
General Troubleshooting Commands¶
```shell
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A -o wide

# Events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Palette agent status (SSH to node)
systemctl status stylus

# Palette agent logs (SSH to node)
journalctl -u stylus --since "1 hour ago"

# etcd health
kubectl exec -n kube-system etcd-<node-name> -- etcdctl endpoint health

# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status

# VMO/KubeVirt status
kubectl get kubevirt -A
kubectl get vm -A
kubectl get vmi -A
```