Oftentimes I have found myself pulled into a really big outage bridge, only to find a cast of numerous characters on the call. Unfortunately, most of them have nothing positive to contribute to solving the outage and end up doing the opposite, slowing the restoration of service. I am sure many of you have found yourselves in similar situations, thinking behind muted phones how stupid some of the suggestions from the so-called experts are. When you have 30-50 people on a call, how can anything get done? The truth is that nothing gets done, and progress is slow for most of the call, until some brave soul steps in with some conviction in their voice and carries the lantern. So it comes to pass that organizations that pride themselves on following one standard or another (be it ISO or ITIL or CMM or whatever) end up looking like a bunch of chickens running around with their heads cut off!
The problem is not that there are no talented people to solve outages. The problem is being unable to create an environment during an outage in which the organization knows who the "go-to guys" are and leaves them alone so that they can get the job done. It sounds so simple, yet for all the gusto about the IT department's achievements and certifications, they simply cannot let common sense prevail. Part of the blame rests with the so-called "leadership" or upper management, whose reputations and jobs are on the line with the business breathing down their necks. Unfortunately, all they can do is bark questions like "what is the ETA to get this thing fixed?" or "when should we expect service to be restored?" or "who can give me a summary of where we are?" or, worse still, "who can get this fixed, have we engaged the vendor?". The so-called "leadership" feels frustrated because, first, they lack the technical leadership to at least ask the right questions; second, they have no clue who in their organization is best suited to solve the problem; and third, at the back of their minds they have every motivation to pawn the root cause off on someone external to their organization and save their own skin. I am not saying all of them are in the same category, but more often than not you will come across outage calls so paralyzed by mindless questions that the people who are close to the problem are scared stiff to even volunteer solutions. Every forward-pushing solution is countered with questions like "are we sure this is foolproof and will work?" or "do we need to test this before we do it in production?", and the doubts this creates about executing seemingly well-understood procedures make the outage call go on forever. In many instances, the same guys who ask "should we test before we do it in prod?" don't know that non-prod not mirroring prod is probably the reason they have an outage in the first place, and that testing in non-prod does nothing to reduce risk because it is so far apart from prod!
So, my suggestion is to prepare for outages before they happen. Yes, prevention is better than cure, and that is where all those ITILs/CMMs/ISOs etc. come in handy, but outages cannot be eliminated. As long as there are changes, there is bound to be a risk of an outage. So make sure your organization has a SWAT team, if you will, made up of people who are usually the brightest and who understand the full breadth of the stack from the browser to the storage. They may not be experts in every component, but they can ask the right questions to draw the right responses from the people who are closest to the problem, who are usually your application engineers, production support specialists, etc. During peacetime you can make the same SWAT team part of a gate that all delivery has to go through, so that they can catch deficiencies in design, testing, performance, platform currency, architecture, life cycle environments, etc. and reduce risk. This way you don't have to defend to the bean counters why you need to spend some money on the SWAT team: tell them it is your insurance against outages! And for heaven's sake, don't push all the buttons available in desperation during an outage. Too many people will only add to your woes!