MCI WorldCom blames Lucent software for outage


Monday August 16 07:38 PM EDT
MCI WorldCom blames Lucent software for outage
John Rendleman, ZDNet
MCI WorldCom Inc.'s 10-day frame relay outage was technically due to a new software upgrade from Lucent Technologies Inc. that the carrier began using four weeks ago, MCI WorldCom officials said today.
The intermittent outage, which began Friday, Aug. 5, was finally resolved yesterday, but only after MCI WorldCom shut down the affected portion of its frame relay network Saturday for nearly 24 hours of troubleshooting.
Customers left without service during the disruption will receive two days' service credit for every day they were without data links, a credit that in most instances will equal 20 days' worth of free service, said MCI WorldCom President and CEO Bernard Ebbers during a press conference today.
During the outage, "we did not always meet our customers' expectations," Ebbers said. Nor did MCI WorldCom fully explain the nature of the outage and its own efforts to resolve the problem.
"The fact is that we did not always have those answers, and we still are investigating some of those software issues" that resulted in the network crash, Ebbers said.
So far, the company has determined that the outage began as the result of flawed Lucent software being loaded onto hardware on one of MCI WorldCom's four distinct frame relay networks. The one specific network affected is one that MCI WorldCom, of Jackson, Miss., uses to service customers with requirements for international data circuits.
That specific network serves 3,000 customers, including America Online Inc. and the Chicago Board of Trade, and consists of 300 different switches and switching nodes. Before the outage, the network had been certified by engineers at Lucent, of Murray Hill, N.J., as meeting all necessary operating parameters for a network of its scope, said MCI WorldCom officials.
"Lucent has acknowledged full responsibility and confirmed that it was a software upload problem," Ebbers said, related to a network upgrade intended to allow the network to expand during the next several years to meet growing customer traffic demand.
"We have gone back to a software load that we had run successfully for a continued period of time, and we have no plans to risk our network again," at least until the underlying cause of the software error is rooted out and fixed, Ebbers said.
========================================== End
Let the games begin!!
Ray

-- Ray (ray@totacc.com), August 16, 1999

Answers

So they "solved" the problem by reloading the old software!! Hmm, makes me think they haven't really diagnosed the new load - just know that it's at fault. Y2K related? Maybe - maybe not. It does show how fragile "upgrades" can be. What happens when the old software doesn't work and the new upgrade has multiple bugs from being rushed into production?

-- RDH (drherr@erols.com), August 16, 1999.

Yeah, like when this happens everywhere come Jan 1, 2000 for instance.... Gosh, maybe we have a problem! A big problem!! A very, very serious problem!!!!

-- King of Spain (madrid@aol.com), August 16, 1999.

Speaking of upgrades, the University of Wisconsin is having problems with a new software program called ISIS. I recently received a letter from UW-Madison telling me we would not be getting our usual August tuition bills. They would be delayed a month. According to a newspaper article in the Wisconsin State Journal, the UW is calling the ISIS software... Crisis software. :)

-- Mom (ParentOfStudent@UW.com), August 16, 1999.

Is "certified by engineers" any kin to "y2k compliant?"
Diane

-- Diane J. Squire (sacredspaces@yahoo.com), August 16, 1999.

King of Spain, Do you like to mudwrestle?

-- shellie (shellie01@hotmail.com), August 17, 1999.

Shellie:::: HARDLY FAIR!!! Particularly after what I removed from another thread!
chuck

-- Chuck, a night driver (rienzoo@en.com), August 17, 1999.

When in doubt, IPL...
If dat don't fix'em, fallback...
works every time ... except in 4 months...

-- Andy (2000EOD@prodigy.net), August 17, 1999.

FOF.
Boy is this term going to get old quick.

-- John Galt (jgaltfla@hotmail.com), August 17, 1999.

I expect the upgrade had to have somebody's Compliancy certification. We always knew that "compliant" was not a 100% guarantee.
I also imagine when the dust clears, the bottom line will be that they should have pulled software the first day, and simulated a closer situation to real time traffic on a test system to increase the quality of the testing.
(It may take a long time to find and fix all the timing contention and error correcting routines.....)

-- living in (the@real.world), August 17, 1999.

I hope all you'se guys realise that this is exactly the sort of thing that is going to hit banking on rollover, apart from the imported data problem the worldwide data path networks for worldwide banking are mind-boggling... satellites, land-lines, all the various carriers that will drop the ball (e.g. MCI) - remember, if one link goes down the fallback link may work but it will not have been designed to cope with the ENORMOUS data spikes due to hose-ups elsewhere in the network... throw in power outages to the mix, embedded chips, ups systems failing, diesel generators running out of fuel, malicious hacking, cyber-terrorism, MURPHYS LAW... well, you all get the picture except probably Chiefy, Al, Y2K Pro - the usual suspects...
I'm tellin' ya - it won't be pretty...

-- Andy (2000EOD@prodigy.net), August 17, 1999.

Thanks Ray. What you posted explains this:
Subject: *IMPORTANT* Notice of Emergency Network Maintenance
Date: Sat, 14 Aug 1999 17:44:14 -0400 (EDT)
From: feedback@bellsouth.net
Valued Customers:
At 11:15 pm 8/13/99 WorldCom, our Global Service Provider, notified our Network Operations Center of the need to perform emergency maintenance on their Frame Relay network beginning at 12 Noon (EDT) Saturday 8-14-99 and finishing at approximately 12 Noon (EDT) Sunday 8-15-99.
During the course of this emergency maintenance, you may or may not experience the following: congestion over the network, latency and potentially, loss of connectivity. The work being performed by WorldCom necessitates the complete shutdown of all frame relay switches within the WorldCom network, and a controlled, one by one, reinstatement of each frame relay switch back onto the network.
We have been assured by WorldCom that every effort will be made to reduce the impact to our network and to resolve the issue necessitating the emergency maintenance as expediently as possible.
We will notify you once we have received confirmation from WorldCom that all work has been completed.
Thank you for your patience and continued business,
BellSouth.net Consumer Communications & Relations feedback@bellsouth.net
[And the follow-up message here:]
Subject: Emergency Network Maintenance - Completed
Date: Mon, 16 Aug 1999 22:51:31 -0400 (EDT)
From: feedback@bellsouth.net
Dear BellSouth.Net Customers:
We would like to take this opportunity to inform you that the emergency maintenance that was being performed by MCIWorldcom has been completed. All issues related to this outage have been resolved. Customers should be able to connect, retrieve email, and read newsgroups as well as surf without problems.
We would like to personally thank you for bearing with us during this time and once again express our appreciation for your patience.
BellSouth.net Consumer Communications & Relations feedback@bellsouth.net

-- J (jart5@bellsouth.net), August 17, 1999.

Not that it proves anything, but my AOL system never so much as burped during all this mess.

-- Gordon (gpconnolly@aol.com), August 17, 1999.

Ooops - - not only was this failure due to one software change, it was installed at only one switch system: << The one specific network affected is one that MCI WorldCom, of Jackson, Miss., uses to service customers with requirements for international data circuits. >>
So, last summer, one program failed, one satellite failed, and about half the ATMs, gas machines, and pagers nationally went down until a replacement satellite was re-configured to accept the other loads.
Here, one program (from Lucent Labs - just about two weeks after Bell Labs loudly proclaimed their y2k-fixed software was tested in the labs under simulated loads, passing all lab tests between machines simulating company-company connections) gets installed in one switch center, and a network goes down.
One transformer in one substation goes down, and San Francisco loses power to the peninsula. One transformer blows in Chicago, and the business district is blacked out. One valve closes at the wrong time in LA, and 4 million gallons of sewage are dumped in the river. One fire occurs in one cable tunnel in NZ, and Auckland is without power for weeks. One program affecting one date in London utilities fails to reset itself, and thousands of customers begin getting their gas turned off because the actual date is wrong.
___
Does anybody see a pattern here? Now, just how big is this "bump" in the road we're approaching?

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), August 17, 1999.

Robert!!!
That's excellent - that's a real keeper! I have bookmarked it for Flint the next time he starts spewing his crapola!!!

-- Andy (2000EOD@prodigy.net), August 17, 1999.

POPOCATEPETL!
Thanks, Robert, excellent pattern picker.

-- Ashton & Leska in Cascadia (allaha@earthlink.net), August 18, 1999.

Robert A. Cook, shouldn't your title be PPE? Pattern Picker Extraordinaire.
Let's add one refinery glitch that causes a week-long gasoline shortage at Colorado Texaco stations.

-- Chris (%$^&^@pond.com), August 18, 1999.

Now Robert see what you've done - you've scared Flint and Hoff and Maria away!

-- a (a@a.a), August 18, 1999.

Naw, 'a', just discussing it currently on the "debate" forum.
Believe me, you'll hear my complete view on the MCI outage, since you seem so interested.

-- Hoffmeister (hoff_meister@my-deja.com), August 18, 1999.

Robert Cook isn't picking a pattern, he's picking cherries. He's found the six worst points of failure over a year, and he's trying to pass them off as typical. Now, we've seen estimates of 50 billion chips out there. Let's say the mean time between failure is a (*very*) generous 1000 years on average per chip. At that rate, we should have 50 MILLION chip failures per year. Note I'm not even counting bugs in software, just hardware failures. That 50 million failure estimate is extremely conservative.
Next, let's toss in programming bugs. It's a truism that no useful, nontrivial program is bug-free. Metrics we have over the years run on the order of a few hundred errors per million lines of code. We have trillions of lines of code. That's a few hundred million code bugs out there under normal circumstances.
Add all these failures together, you have hundreds of millions PER YEAR happening in all systems worldwide. And Robert finds SIX of them (count 'em -- SIX!) and thinks he's found a pattern? Well, maybe he has. A pattern of success, of redundancy, of recoverability.
Oh, and Robert -- this is a pleasant change for you. Rather than dismiss all known data points as anomalies and extrapolate from the unknown, you're now taking guaranteed anomalies and extrapolating from them. The only problem with your change in technique is, you give every indication of picking whatever invalid technique is required to support a foregone conclusion. As an engineer, you should be ashamed.

-- Flint (flintc@mindspring.com), August 18, 1999.
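
The back-of-envelope figures in the post above can be checked with a few lines of Python. This is only a rough sketch: the chip count and the 1000-year mean time between failures come straight from the post, while the 300 bugs per million lines and the 1 trillion lines of code are placeholder values chosen here to stand in for "a few hundred" and "trillions".

```python
# Inputs quoted from the post above (estimates, not measured data).
CHIPS_IN_SERVICE = 50_000_000_000   # "50 billion chips"
MTBF_YEARS = 1_000                  # "1000 years on average per chip"

# Placeholders assumed for this sketch; the post gives only rough ranges.
BUGS_PER_MILLION_LINES = 300        # stands in for "a few hundred"
LINES_OF_CODE = 1_000_000_000_000   # low end of "trillions"


def expected_failures_per_year(units: int, mtbf_years: int) -> int:
    """Expected yearly failures if each unit fails once per MTBF on average."""
    return units // mtbf_years


def latent_bugs(lines: int, bugs_per_million: int) -> int:
    """Estimated bug count for a code base at a given defect density."""
    return lines * bugs_per_million // 1_000_000


chip_failures = expected_failures_per_year(CHIPS_IN_SERVICE, MTBF_YEARS)
code_bugs = latent_bugs(LINES_OF_CODE, BUGS_PER_MILLION_LINES)

print(f"chip failures per year: {chip_failures:,}")
print(f"latent code bugs:       {code_bugs:,}")
```

Changing either placeholder scales the bug estimate linearly, so the "few hundred million" figure depends entirely on which end of those ranges is assumed.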

Well it appears Flint is BACK with a VENGEANCE accompanied by his faithful partner Hoffmeister.
Does anyone have the FEELING that the end is near for these folks?? Grasping at straws is about all that is left for them. Oh well, a few more days maybe a couple of weeks, not a long time!!
Ray

-- Ray (ray@totacc.com), August 18, 1999.

"Add all these failures together, you have hundreds of millions PER YEAR happening in all systems worldwide. And Robert finds SIX of them (count 'em -- SIX!) and thinks he's found a pattern? Well, maybe he has. A pattern of success, of redundancy, of recoverability." Flint
Flint Flint Flint! You missed the point of his pattern examples. He picked six, off the top of his head which have been discussed here, to show how one glitch can cause a huge mess. That's all. He's not trying to list all the Y2K problems that have been happening so far.
One little glitch's ripple effects. Try to imagine 10 little glitches like these happening at once in your town.

-- Chris (%$^&^@pond.com), August 18, 1999.

Sorry Flint... I wasn't yelling at you really...pesky tags.

-- Chris (%$^&^@pond.com), August 18, 1999.

Chris:
You should check out Las Vegas sometime. What they do is, they pick the biggest winners over the past year or so, and hype the bejeezus out of them, and do everything they can to create the impression that these Big Winners are *typical*, that they could be YOU, if you'd just keep gambling. In other words, they create the same false impression Robert is creating, using exactly the same technique. Nor is this technique at all unusual in any kind of propaganda -- you pick the very far tip of any curve and ignore ALL the rest. Robert is literally selecting one-in-a-hundred-million.
Yes, there are single points of failure that can lead to problems way out of proportion to the actual failure. This has *always* been true, in every system of any kind ever devised, even by cave men. Y2K doesn't increase the probability of KEY failures, it just increases the probability of failures generally. For every hundred million such failures, ONE will be newsworthy. And yes, there may be hundreds of newsworthy failures, at least for local newspapers. Will civilization collapse as a result? Chances are, the only way you'd know about them would be to comb the newspapers and the net. Otherwise you'd miss them. Hardly the end of the world, all in all.

-- Flint (flintc@mindspring.com), August 18, 1999.

Who's talking end-of-the-world here? Did I mention it?
If I can't get gas for my car for over a week, and if my tap water is contaminated, and my electricity is down, all at the same time because of 3 little glitches...It's not the end of the world for sure, but geez, in the middle of winter it sure isn't fun to have to walk to the nearest grocery store that has electricity to buy water and ready-to-eat food, and sleep with my winter coat on.
You know what Flint? I never fell for casino gimmicks. I live 2 hours from Atlantic City and went to a casino once to see how it was. You still didn't get the gist of Robert's post.

-- Chris (%$^&^@pond.com), August 18, 1999.

And the SAD thing is, he never will.......

-- Andy (2000EOD@prodigy.net), August 18, 1999.

Ray,
I've been keeping Flint and Hoff bloody busy over at the "Debate: Round 1" and "Re: Debate" threads. But I guess that after taking dozens of heavy blows and running out of breath, they've both decided to move on to safer ground over to this thread, without success, of course. Maybe they regret their decision to come over here and decide to come back to the H-H Debate for some more. I'm waiting...
Flint, the problem with your brilliant mind is similar to y2k non-compliance.
Listen Flint, actually you may be right on the mark: currently the world experiences 50 million chip failures per year, and redundancy and/or resilience take care of that. It amounts to 0.1% chip failures per year using your own figures as example. And because of this we get some problems. And you say, so what? THE GRANDE PROBLEMO Flint is that the embedded chips strategy worldwide has been mostly FOF, so the consequence is that y2k will mean not 0.1% chip failures but rather 2% or 3% or 5% or as high as 12-15% (PLCs) which means twenty, thirty, fifty, a hundred and twenty to a hundred and fifty TIMES HIGHER FAILURE RATE respectively, depending upon location, circumstances, countries, industry sectors, etc. Redundancy and resilience will probably be maxed out way before that.
Furthermore, "there are lies, damn lies, and statistics". Averages don't count at all if the iron triangle doesn't hold both here and ABROAD. And the iron triangle isn't really enough if Federal, provincial, and municipal governments don't work functionally well both here and ABROAD. If government personnel in Venezuela don't get paid Flint, they block off the roads and you don't get their oil, which you need badly, whether you know it or not. 60-80% of Fortune 500 revenue is FOREIGN. If the international banking system doesn't get Y2K-ready soon, we are toast Flint, no matter that Wachovia bank may claim victory. And we wouldn't need billions of lines of bad code to screw up the international financial system, a couple of thousand would do.
But Andy is right, so...

-- George (jvilches@sminter.com.ar), August 19, 1999.
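
The multipliers in the post above follow from dividing each projected failure rate by the 0.1% baseline. A minimal sketch of that arithmetic, assuming the 50 million failures and 50 billion chips quoted earlier in the thread and taking the 2% to 15% rates as the poster's own projections rather than established figures:

```python
from fractions import Fraction

# Baseline from the thread: 50 million failures a year across 50 billion chips.
baseline = Fraction(50_000_000, 50_000_000_000)   # exactly 1/1000, i.e. 0.1%

# Projected Y2K failure rates in percent, as claimed in the post above.
y2k_rates_percent = [2, 3, 5, 12, 15]

# How many times higher each projected rate is than the baseline.
multipliers = {
    pct: Fraction(pct, 100) / baseline
    for pct in y2k_rates_percent
}

print(f"baseline failure rate: {float(baseline):.1%}")
for pct, factor in multipliers.items():
    print(f"{pct}% is {int(factor)}x the baseline")
```

Exact fractions are used here so the ratios come out as whole numbers instead of picking up floating-point rounding.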
