Update to Major Memphis Outage


Here is an update to the following thread:

Major Computer Outage Right Now

I spoke to Richard White, Director of Telecommunications for the Memphis Airport (901-922-8031).

He confirmed that there was an outage of one hour to long distance service due to a "card in a major switch downtown at AT&T going down."

Now what I'd like to know is: was it really a card, or some other problem? We know that switching equipment was considered highly suspect for problems going into 2000.

I believe there have been quite a few threads here recently concerning long distance service problems. Is there anything to this?

I called the Memphis newspaper to give an update on this, and the reporter told me he hadn't had time to follow up on it. I guess it wasn't "newsworthy" in their opinion.

Funny, sometimes things like this are blown out of proportion, while other times.....nothing.

It is interesting, IMHO, that the pollies have not even touched this thread along the way. They are so quick when people report "friend of mine" stories. But here is a live, first-hand account, and they are nowhere to be seen! I even provided names and phone numbers so they could independently verify the story.

I guess they know the truth.

-- Duke1983 (Duke1983@aol.com), January 25, 2000

Answers

"He confirmed that there was an outage of one hour to long distance service due to a "card in a major switch downtown at AT&T going down."

Duke,

It was an hour-long outage and it got fixed. I bet it wasn't a "card," though; it was probably Squirrel King at it again.

-- (I'm@pol.ly), January 25, 2000.


Exactly. This is a very clear example of a Y2K problem. Of course the pollies won't touch it.

-- (bernie@refdan.org), January 25, 2000.

Yup, it got fixed. It's a good thing we didn't have simultaneous failures across sectors... that was the worst-case scenario.

If we think back to the Naval War College (NWC) work, I think what we're seeing is the Tornado Model.

Incidents of short duration in random sectors of the economy. Now, the other possibility, which I sincerely hope is not the case, is that we are also experiencing the Flood Model.

If the flood waters are rising, they're coming on a wave of oil. And look out for anything residing in the "low-lying areas," as the NWC called them. Especially if we still have tornadoes out there!

-- Duke1983 (Duke1983@aol.com), January 25, 2000.


Sorry, but the 'card' statement is not enough to draw any conclusions from. Systems today are designed so that functionality is encapsulated into isolated units - often this means separate 'cards'. One reason this is done is to save troubleshooting time. If I'm not certain what the specific problem is but the symptom is loss of a certain type of functionality, it is far more efficient to swap in a known working 'card' and see if that fixes things than to attempt troubleshooting at the component level (bad resistor? capacitor? EPROM? processor? microcode? scratched/corroded circuit?).

So they replace a card and that fixes the problem. Well, that gives you a ballpark range of where the problem is, but most times the front-line technician doesn't care which low-level component failed. If the 'card' is cheap, it will simply be replaced and the bad one will be thrown away. If the card is expensive, it may be returned to the supplier for additional diagnostics and repair.

The fact that replacing a card fixed the problem is more indicative of a hardware failure on that specific card than of a software problem.

This functional encapsulation approach is used throughout industry, both in software and in hardware. When I worked on GPS for military apps, many components in a fighter plane were divided into "Line Replaceable Units" (LRUs) and "Shop Replaceable Units" (SRUs). If a given box in an aircraft did not function, the front-line technicians typically did not waste time doing advanced diagnostics on the unit; they simply replaced it with a known working LRU and sent the failed LRU to the shop. Back at the shop, technicians took this same approach, as each LRU was typically constructed from several SRUs (i.e. 'cards'). The second-tier technicians worked at the SRU level and rarely diagnosed component-level problems.

Most communication equipment today is built using this approach, so just knowing that the techs replaced 'a card' really doesn't tell you very much.
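
To make the idea concrete, here's a rough sketch of that two-tier swap approach (hypothetical Python; every class and function name is invented for illustration, not taken from any real maintenance system): the flight line swaps whole LRUs against known-good spares, and only the shop drills down to the SRU ('card') level.

    # Hypothetical sketch of two-tier swap diagnosis (LRU -> SRU); all names invented.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SRU:
        """Shop Replaceable Unit -- a 'card' inside an LRU."""
        name: str
        working: bool = True

    @dataclass
    class LRU:
        """Line Replaceable Unit -- a box built from several SRUs."""
        name: str
        srus: List[SRU] = field(default_factory=list)

        def functional(self) -> bool:
            # The box works only if every card in it works.
            return all(s.working for s in self.srus)

    def flight_line_repair(installed: List[LRU], lru_spares: List[LRU]) -> List[LRU]:
        """Tier 1: swap any failed LRU for a known-good spare; no component diagnostics."""
        sent_to_shop = []
        for i, lru in enumerate(installed):
            if not lru.functional() and lru_spares:
                installed[i] = lru_spares.pop()
                sent_to_shop.append(lru)          # failed box goes back to the shop
        return sent_to_shop

    def shop_repair(lru: LRU, sru_spares: List[SRU]) -> Optional[SRU]:
        """Tier 2: find the failed SRU ('card') and swap it; nobody asks which resistor died."""
        for i, sru in enumerate(lru.srus):
            if not sru.working and sru_spares:
                lru.srus[i] = sru_spares.pop()
                return sru                        # bad card: toss it or return it to the supplier
        return None

    radio = LRU("radio", [SRU("power card"), SRU("signal card", working=False)])
    spare = LRU("radio spare", [SRU("power card"), SRU("signal card")])
    for box in flight_line_repair([radio], [spare]):
        bad = shop_repair(box, [SRU("signal card")])
        print(f"{box.name}: shop replaced {bad.name if bad else 'nothing'}")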

-- Arnie Rimmer (Arnie_Rimmer@usa.net), January 25, 2000.


Couldn't agree more, Arnie. Thanks for your post. The real questions, as I see them, are:

Was the stated cause of the failure the true cause? Probably

Does AT&T have to pay anyone for the outage? Possibly, but not probably

If they do need to pay for loss of service, does insurance cover it? Yes, if it isn't a Y2K failure.

If the card failure (or whatever the true cause of the failure is) is in any way related to Y2K, would the company disclose that? Not on your life!

Are cards easy to replace? In most cases, yes.

Are replacement cards available? Depends on the card, doesn't it?

What I conclude is that we don't, and won't, know the entire story. But I do know that the rate of failures is most certainly manageable at this point, no matter what the cause.

-- Duke1983 (Duke1983@aol.com), January 25, 2000.



Duke1983..... Thanks for the post.

-- kevin (innxxs@yahoo.com), January 25, 2000.

Agree with Arnie on this one.

Duke asked:

Was the stated cause of the failure the true cause? Probably

Does AT&T have to pay anyone for the outage? Possibly, but not probably.

Most likely, no, they don't. AT&T is a service provider. Try holding them responsible for minor (and that's all it is) equipment failures.

If they do need to pay for loss of service, does insurance cover it? Yes, if it isn't a Y2K failure.

If the card failure (or whatever the true cause of the failure is) is in any way related to Y2K, would the company disclose that? Not on your life!

Are cards easy to replace? In most cases, yes.

Yes (unqualified). They are in racks of equipment. In every equipment rack I've seen or worked on, it's easy to slide the card out and slide a new one in. Each card mates with a backplane power and signal plug.

Are replacement cards available? Depends on the card doesn't it.

Hardly. If there are 1000 cards in a system, there are probably only a handful of different card types. Sparing is easy: keep a few of each type on hand. A major supplier that failed to keep spares available would be out of business in a hurry.
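
To put rough (made-up) numbers on that: say 1000 installed cards spread over five card types, and a policy of three spares per type. That's only 15 cards sitting in the stockroom. A quick sketch in Python, purely illustrative:

    # Back-of-envelope sparing math; every number here is hypothetical.
    from collections import Counter

    installed = (["line card"] * 400 + ["trunk card"] * 300 +
                 ["timing card"] * 150 + ["control card"] * 100 +
                 ["alarm card"] * 50)

    SPARES_PER_TYPE = 3
    by_type = Counter(installed)              # 1000 cards, but only 5 distinct types
    spares_to_stock = SPARES_PER_TYPE * len(by_type)

    print(f"{sum(by_type.values())} installed cards, {len(by_type)} types, "
          f"{spares_to_stock} spares to keep on hand")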

And, Bernie said,

This is a very clear example of a Y2K problem.

It really is more likely to be exactly what AT&T stated -- a failure in a module or card. Outage of one hour = notice failure, get tech away from coffee pot, have tech check things out, find failed card, go to stockroom and withdraw spare, put it into the system, check things out, put back on line. About right.

-- rocky (rknolls@no.spam), January 25, 2000.


Treat it all like a black box now. Forget trying to guess the 'cause'; just recognize the lessons about interdependency and what this means when things go bad. Y2K is probably in the mix somewhere, and possibly chewing the guts out of systems as we speak, but what really matters is the outward impact on you and yours in an operational sense.

-- ..- (dit@dot.dash), January 25, 2000.
