Nightmare In The NOC: The Perfect Storm of Failures
I have dealt with many network nightmare scenarios: simple switch port errors that cascaded throughout a stack, failed power systems causing widespread outages, and even some disaster recovery / business continuity scenarios. The one I will talk about here is among the longest-lasting and most widespread I have seen in a long time.
There have been maintenance or network issues that have taken down more devices, but this was one of the stranger ones I have seen. Sort of a perfect storm scenario.
This occurred in a SOHO (small office / home office) environment, but with a larger network and more servers / services than is typical for one. The equipment in question was a Citrix XenServer cluster with a Brocade FC backend storage network, other file and DFS servers, and four network-managed UPS units providing power to the various systems.
A Quiet Day
It all started on a day that was, other than the outage, pretty quiet. We had just done some upgrades to the environment and were monitoring various systems. No work was scheduled that day, but at about 4 PM I received an indication that a rising number of users were having problems logging in to the Exchange OWA (email) system.
Before I could troubleshoot those first issues, the network management system started throwing numerous errors, most of them related to the storage network at the local site. Thinking it was an issue with the SAN controller, I logged into it, and sure enough, the dynamic array, which was striped together in Windows, showed a health status of critical. OK, no big deal, drives fail sometimes. I swapped out the drives that were showing as missing or failed (only about 2–3 replacements in total) and rebooted the machine.
Bad to Worse
This is when everything started to go haywire. For some reason, after the DFS machine restarted, 3 of the 5 nodes had dropped out of the cluster with status:
HA STATUS ERROR— NETWORK UNAVAILABLE
On top of that, the array now showed a health warning, and the storage fabric was in an error state (most likely due to the loss of supervisory control on the Fibre Channel network, as the DFS machine was now completely locked up).
I knew I was in over my head, so it was time to call in the troops. Within 20 minutes, we had 2 network engineers (storage network), an EMC2 (HBAs / software) rep and a Dell (servers) engineer either on call or on site. During this time, we had lost about 60% of the production and development VMs on the entire cluster, with some machines losing their networked hard drives and BSODing / kernel panicking. There was definitely something wrong.
Our priority was to move all of the production VMs off-site or to storage that was not affected. After doing that, we proceeded to start what was to become hours of tedious troubleshooting and disaster recovery.
Now, I know the “proper” thing to do was to shut down any systems that we thought were bad, go back to the basics that worked, and work my way up from there. Except nothing worked. We were bombarded with seemingly unrelated failure after failure, including Xen nodes that were stuck in their +1 startup status, seemingly random errors on the FC network, and what looked like a broken storage array.
But What Actually Happened?
After hours of troubleshooting, support requests, and expert opinions, I found the issue: a cascading bug in the FC / converged storage adapters on the Xen cluster, which were incompatible with the version of XenServer we were running. But if it was a compatibility issue, why had the cluster remained stable since it was built (roughly 2 months earlier)? Also, why was this not mentioned in any of the HCLs?
The answer was the failover of a failed FC fiber link between one of the Xen nodes and its FC switch port (which, by the way, we believe was also caused by the improper cards). Because of the issues with the cards, the failover resulted in a cascading error that eventually brought down one whole side of the storage column. I/O then failed over to the other side, which had the same bad cards, and that caused the node to drop. The sudden loss of I/O and the cascade of resulting errors caused the DFS server to lock up and the connected array’s controllers to crash, leading to a total failure of the entire cluster.
Fortunately, once the problem was identified, we migrated the data onto a spare QNAP (awesome storage; we still use the production version of it to this day), restored the broken VMs, and brought the cluster back online. I have since made numerous changes to the environment (including tossing out XenServer and going with Hyper-V, and getting rid of Fibre Channel storage entirely) to ensure something like this never happens again.
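For anyone who wants to see why the redundancy did not help, the failover chain described above can be sketched as a toy model. This is only an illustration under my own assumptions, not anything taken from XenServer or the adapter firmware: the `Path` class, the `adapter_ok` flag, and the fabric names are all hypothetical, and the model simply assumes that a path with a bad adapter collapses as soon as it takes over the I/O load.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One side of a dual-fabric storage connection (hypothetical model)."""
    name: str
    adapter_ok: bool      # False models the incompatible adapter (assumption)
    link_up: bool = True


def surviving_path(paths, failed_name):
    """Fail the named path, then try each remaining path in order.

    Assumption: a path with a bad adapter goes down as soon as it has to
    take over the I/O load. Returns the name of the path that ends up
    carrying I/O, or None if every path is lost.
    """
    for p in paths:
        if p.name == failed_name:
            p.link_up = False
    for p in paths:
        if not p.link_up:
            continue
        if p.adapter_ok:
            return p.name
        p.link_up = False  # failover onto a bad adapter takes this side down too
    return None


# Same adapter model on both sides: failover has nowhere safe to land.
same_cards = [Path("fabric-A", adapter_ok=False), Path("fabric-B", adapter_ok=False)]
# A known-good adapter on the second side: failover succeeds.
mixed_cards = [Path("fabric-A", adapter_ok=False), Path("fabric-B", adapter_ok=True)]

result_same = surviving_path(same_cards, "fabric-A")
result_mixed = surviving_path(mixed_cards, "fabric-A")
print(result_same)   # None
print(result_mixed)  # fabric-B
```

The point of the sketch is that two paths built from the same faulty component are a single point of failure that merely takes a little longer to fail.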
This was a story I originally posted on my private blog, but I decided to share it here instead. This happened about 2 years ago now and I have since rebuilt that entire environment.
Here are some highlights of the new deployment:
- I no longer use storage servers. All storage is now handled either by Ceph or by the QNAP systems themselves.
- The environment no longer runs Hyper-V. It now runs ESX Enterprise.
- The storage has been replaced with SSD-accelerated QNAP arrays running on a 10G SAN.
- Shortly after that failure, I tossed the FC infrastructure. Everything now runs on 10G IP-based storage fabrics.
- Kubernetes now hosts a majority of the apps on the cluster.
- And most importantly, that site is no longer a primary production site. I have since moved critical services away to new facilities due to unrelated issues with our power provider.
Thanks for reading!