Nightmare In The NOC: The Perfect Storm of Failures

Overview

A Quiet Day

Bad to Worse

Disaster Recovery

But What Actually Happened?

Update

  • I no longer use storage servers. All storage is now handled either by Ceph or by the QNAP systems themselves.
  • The environment no longer runs Hyper-V. It now runs ESX Enterprise.
  • The storage has been replaced with SSD-accelerated QNAP arrays running on a 10G SAN.
  • Shortly after that failure, I tossed the FC infrastructure. Everything now runs on 10G IP-based storage fabrics.
  • Kubernetes now hosts a majority of the apps on the cluster.
  • And most importantly, that site is no longer a primary production site. I have since moved critical services to new facilities due to unrelated issues with our power provider.

Pavel Glukhikh

I am a senior solution architect and CEO of two tech startups. I enjoy all things tech, security, and physics. My background is in security and CI/CD.