Looking for help on Facebook

I’m going to share part of a conversation I had while attending KubeCon last fall. We swapped stories about our experiences with esoteric hardware, and two of them stuck with me. The first involved a regional bank that happened to have a VAX-based system running their daily transactions. An issue came up that required rebooting the system, which was critical to their day-to-day operations, but as you can probably guess, nobody knew how to do it.

Ultimately, it was taken care of, but not before they reached out to my conversation partner by way of… Facebook. He happened to live in the area and had an interest in the devices. Luckily for them, he was able to help, but it raises several questions: how did things get so bad that not only were day-to-day operations impacted, but the bank was reduced to finding someone on Facebook to attempt to resolve it? What would have happened if they couldn’t find anyone, or if he had said no?

What if this had happened on your watch? What would you have done, or how could you have prevented it?

The thing is, once something becomes critical to the business, it usually isn’t touched unless absolutely necessary. It has to keep functioning because the business depends on it, which paradoxically means it rarely gets the care and feeding it needs to keep functioning. Additionally, even now, in a time when software effectively drives all business, there are companies streamlining their IT operations by swapping their in-house team for an outside IT consultancy to fill the gaps.

Of course, you know all of this. That an outside IT firm doesn’t know your organization quite like you or your team do, and yet can’t rebuild it. That, despite what the vendors promised decades ago, the system they sold and built is not going to last forever, though it may end up outlasting the vendors themselves.

With my prior clients, the first question I always asked was: how much of this is documented? There are stories passed down that become part of the lore of the ecosystem, but I don’t mean that. Without a solid understanding of what each system does and why it is important to the business, a case cannot be made for making it more resilient or transitioning away from it. Granted, there will always be at least one server hiding in a sealed room.
Additionally, documenting the failure scenarios, much like an airline pilot’s emergency checklist, can go a long way towards helping your team gain confidence, providing training for new members, and ensuring that you have at least one way to handle an emergency without having to resort to putting out an ad for help on Facebook.

Now that you know what these systems are and why they matter, a case can be made. Perhaps not for a full replacement, but at least to lay the groundwork by building in flexibility to mitigate the worst-case scenario. Looking at how they can fail, and what it would take to recover, is a decent starting point for maintaining the status quo.

When you are ready to start the transition, leveraging public and open standards can go a long way towards giving your organization the flexibility to keep moving. For those going into cloud-based offerings, sticking to the core infrastructure services will give you more flexibility to scale across multiple cloud providers. Using something like the Strangler Pattern will require setting up a shim that not only monitors the traffic passing through the system, but also shunts traffic to the new system when the time is right, ensuring a smooth transition.
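To make that a bit more concrete, here is a minimal sketch of such a shim as a reverse proxy in Go. The backend addresses, the port, and the list of migrated routes are all hypothetical placeholders; the point is simply that the shim sits in front of both systems, observes every request, and shunts the migrated routes to the new system while everything else continues to hit the legacy one.

```go
// strangler_shim.go: a minimal Strangler Pattern shim (illustrative sketch only).
// It proxies all traffic, logging each request, and routes paths that have
// already been migrated to the new system while the rest still goes to legacy.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical backend addresses; substitute your own.
	legacyURL, _ := url.Parse("http://legacy.internal:8080")
	newURL, _ := url.Parse("http://replacement.internal:9090")

	legacyProxy := httputil.NewSingleHostReverseProxy(legacyURL)
	newProxy := httputil.NewSingleHostReverseProxy(newURL)

	// Routes already handled by the new system; this list grows as the
	// transition proceeds, until the legacy system receives no traffic at all.
	migrated := []string{"/reports", "/accounts"}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		for _, prefix := range migrated {
			if strings.HasPrefix(r.URL.Path, prefix) {
				log.Printf("migrated route %s -> new system", r.URL.Path)
				newProxy.ServeHTTP(w, r)
				return
			}
		}
		log.Printf("route %s -> legacy system", r.URL.Path)
		legacyProxy.ServeHTTP(w, r)
	})

	// The shim itself listens where the legacy system's clients already point.
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

Because clients only ever talk to the shim, routes can be moved over one at a time, and rolled back just as easily, which is exactly the kind of flexibility the worst-case scenario calls for.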

Hopefully this will provide an idea or two on how to take the first step towards slaying the monster under your organization’s bed.