IT: Let’s Learn From Mistakes

Jun 20, 2017 Posted by: Dwills

The aviation sector is very good at learning from mistakes. Eric Hughes, EMH Technology’s Director, has first-hand experience of this: “When I was learning to fly I read a (paper!) magazine in which there was a monthly section called ‘I Learned About Flying From That’. It was basically a short story about bad things that had happened, or nearly happened, in a bid to inform others and prevent accidents. It strikes me that a similar approach within IT might be beneficial.”

IT professionals each have their own opinions on the best way to do things. They are sure to have a story to tell – one from which they have learned something. In today’s online environment it’s very easy to laugh at, demonstrate one’s superiority to, and generally put down others. It’s not an encouraging environment in which to own up and genuinely say we had a learning moment.

“Experience is something you gain directly after you need it,” says Eric. So here are his learning stories, which have helped his IT journey (before and since establishing his technology and infrastructure business in Hitchin) and can help others too.

“I Learned About IT From That…”

Several years ago, I was working for a bank in London and we were performing a disaster recovery (DR) test. It was the culmination of many months of work for a lot of people and a lot of money had been spent.

The solution allowed us to deploy 400 PCs in a remote location, complete with around 60 applications, within 4 hours of an incident. A significant part of the solution revolved around a large and suitably expensive piece of IT kit called a Storage Area Network (SAN). Think of it as a very big box that contains all the hard disks for every server. The data in it were mirrored to an identical secondary box in our remote location to protect the business in case of failure. The theory is that when everything is working normally, you point your servers at the primary SAN, they see their hard drives and data, and off you go. In a disaster, you point separate physical secondary servers at the secondary SAN, and those servers then look and behave exactly as the primary servers did seconds before the disaster.

As part of the testing we had fired up the secondary servers and SAN and cut the links to the main office. About 100 traders and a similar number of IT staff gave up their weekend to come in and test it all and document their findings.

At the end of the weekend we reconnected the head office and replicated any changes back to the main office before reverting to the primary hardware. There was a nice graphical interface telling us that the old primary SAN was up to date and had all the changes from the weekend, so we swapped everything back and brought all the main servers online.

About five minutes after this, a couple of people popped their heads into the server room asking if we had finished yet, as they couldn’t see any of their documents. My colleague and I had what I heard wonderfully described as an ‘ohnosecond’: that small unit of time in which you realise something very bad has happened. The very bad thing was that the SAN software had lied and we’d just deleted a weekend’s work for a lot of people. Popular we were not. Subsequent investigation showed a glitch in the software that runs the SAN, which led to an incorrect reading. The bug was easily fixed and the testing was repeated successfully.

The points I took away from this are:

test any DR solution regularly, and
take as much care restoring your business to normal running after a disaster as you do getting it up and running during a disaster.
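
One way to act on the second point is to check the replicated data independently instead of relying on a single status display. The following is a minimal sketch in Python, not the tooling used in the story above. It assumes that both the secondary and the primary copies can be read as ordinary directory trees, and the function names (tree_digests, failback_discrepancies, safe_to_fail_back) are hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[path.relative_to(root).as_posix()] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


def failback_discrepancies(secondary_root, primary_root):
    """Return files on the secondary that are missing or different on the primary."""
    secondary = tree_digests(secondary_root)
    primary = tree_digests(primary_root)
    return sorted(
        name for name, digest in secondary.items() if primary.get(name) != digest
    )


def safe_to_fail_back(secondary_root, primary_root):
    """True only when every file on the secondary matches the primary copy."""
    return not failback_discrepancies(secondary_root, primary_root)
```

In this sketch, the switch back to the primary hardware would go ahead only when safe_to_fail_back returns True. Any paths listed by failback_discrepancies would be work done during the test that has not yet reached the primary. Reading every file in full is slow on large volumes, so a real check would probably sample or use the storage vendor's own verification tools. The point is only that the check is separate from the interface that reports the replication status.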

Learn From Others – Not Mistakes

Make the most of the support around you before disaster strikes. EMH Technology has a proven track record, spanning more than 12 years, of offering bespoke IT support for businesses. From single users to global businesses, you will receive a fast response, a fixed fee and no contractual lock-ins. Find out more by contacting us for an informal chat without obligation.