"We'll be ready in time" documents


Don't you get tired of statements like this one from telecom CEOs, posted today on de Jager's site:

"...AT&T and many other carriers have long said they expect to be ready for year 2000, with fixes in place by either Jan. 1, 1999, or June 30, 1999, to allow time for testing. The Federal Communications Commission has said that the top 20 carriers with more than 95% of the land-line voice and data traffic in the U.S. should be ready in time..."

What does "ready in time" mean? If AT&T gets 99% of its code remediated (a remarkable feat), and Bell Atlantic, Sprint, and GTE each get their 99%, etc., just what do the 1% errors do when they are exponentially multiplied against all the other 1% errors?

An example is from the field of surveying. If you make a 0.01-degree error in an angle in a survey, it may not be significant if your survey line is very short. However, if the line is a mile long, by the time you get to the end you have introduced an error of about 11 inches. Now imagine that other survey companies use your survey's end line as their starting point and make more surveys. The errors in their surveys are multiplied (not merely added) by your error. It can get out of hand real fast.
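To put a number on that, here is a quick back-of-the-envelope sketch in Python (the 0.01-degree angle and the one-mile line are the figures above; the script is just illustrative arithmetic):

    import math

    # Lateral offset produced by a small angular error over a distance.
    # Figures from the example above: 0.01 degrees over one mile.
    angle_error_deg = 0.01
    distance_ft = 5280.0  # one mile

    # For a small angle, offset = distance * tan(angle).
    offset_ft = distance_ft * math.tan(math.radians(angle_error_deg))
    print(f"Offset after one mile: {offset_ft * 12:.1f} inches")  # ~11.1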

A 1% error in your program may not seem like much, but wait until the effects are played out. Wait until the 1% errors in all your data exchanges with others begin multiplying themselves out.
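As a rough sketch of that compounding (the 1% error rate is the figure above; the chain lengths are hypothetical):

    # Chance that a record survives a chain of data exchanges intact,
    # assuming (hypothetically) each party independently introduces
    # errors at the same 1% rate. Illustrative arithmetic only.
    error_rate = 0.01

    for exchanges in (1, 5, 10, 25, 50):
        clean = (1 - error_rate) ** exchanges
        print(f"{exchanges:2d} exchanges: {clean:.1%} of records still clean")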

Assurances like "we'll be ready in time" are based on looking at your own organization as an island and thinking that 95 to 99% is the goal.

We really do need to depend on a higher power than man.

-- James Chancellor (publicworks1@bluebonnet.net), October 13, 1998

Answers

If I could play Devil's Advocate here for a moment:

One of the techie guys who I have an ongoing debate with on another forum would say something like - "Most big systems deal with a much greater than 1% failure rate right now, you just don't notice it because there's so much redundancy built in. Failures happen every day, and they keep the system honest."

I don't know, maybe there's something to it. I personally think the failure rates will be much greater than 1%.

-- pshannon (pshannon@inch.com), October 13, 1998.


I would reply to your techie friend that:

- y2k errors will be more randomly generated than conventional day-to-day errors (errors that occur regularly in corporations tend to develop a pattern that makes troubleshooting easier)

- it will be more difficult to prevent the transfer of y2k errors to other electronic links than conventional errors (unless you isolate yourself, which usually defeats the purpose of doing business)

- your errors may be imported errors, making the source more difficult to locate

- repairing systems after Jan. 1, 2000 will be much more difficult than before (i.e., telecommunication and power interruptions)

- a significant portion of your programming staff may have left due to social chaos or a host of other reasons

-- James Chancellor (publicworks1@bluebonnet.net), October 13, 1998.


This is one of the most troubling aspects of this whole situation for me, especially because I am a "techie."

The problem is whether to believe the statements coming from companies that should know what they are talking about, or the speculation coming from a few who may or may not have enough of a technical background to even have a clue.

Ordinary "common sense" would have you believe statements coming from the likes of AT&T, Intel, Microsoft, power companies, etc. They should know what they are talking about, right?

You are right about the "island" concept, James. Most of the public still doesn't have a clue how broad and complicated this issue is, and most of the employees of the corporations that should know are simply part of that public. They can't see beyond the "island."

The other problem is that even if AT&T and other telecoms, along with the power companies, do know what they are saying and will be ready, we are still left with all of the companies in other industries, not to mention governments, that probably don't know and won't be ready.

-- Buddy Y. (DC) (buddy@bellatlantic.net), October 13, 1998.


James, let's be fair about the meanings of these statements (although I am not defending them):

"If AT&T gets 99% of their code remediated (a remarkable feat) and so does Bell Atlantic 99% and Sprints 99% and GTE's 99%, etc., just what do 1% errors do when they are exponentially multiplied against all the other 1% errors?"

We are talking about remediating 99% of the systems. You seem to take this to mean the systems will be 99% effective/accurate, and that is not the case. If the systems which do not work (say, 1% of the routers and switches) shut down, this does not take out the entire network. Obviously, 911 calling will be a higher-priority fix than your voice mail and call forwarding systems.

Although I don't believe for a minute that they will have 99% of their systems tested and remediated, this would not in itself create a cascade event as you describe. It could, however, cause a decay in effectiveness, such as 30% of the phone switching systems being inoperable. Working around these will cause an overload, and not all data/voice will get through.
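To put a rough number on that overload effect, here is a toy calculation using the standard Erlang B blocking formula (the 30% figure is the one above; the trunk count and traffic load are made-up numbers):

    def erlang_b(trunks: int, offered_load: float) -> float:
        """Blocking probability from the Erlang B formula (iterative form)."""
        b = 1.0
        for n in range(1, trunks + 1):
            b = (offered_load * b) / (n + offered_load * b)
        return b

    # Hypothetical exchange: 100 trunks offered 85 erlangs of traffic.
    trunks, load = 100, 85.0
    print(f"all trunks up:   {erlang_b(trunks, load):.1%} of calls blocked")
    # Now lose 30% of the switching capacity, per the figure above:
    print(f"30% trunks down: {erlang_b(int(trunks * 0.7), load):.1%} of calls blocked")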

Unlike power, it seems unlikely that a lack of telephone resources will cause a system-wide shutdown, but rather a bad situation which will cause a great deal of difficulty, lost business, and an inability to contact emergency services.

Brad

-- Brad Waddell (lists@flexquarters.com), October 13, 1998.


Brad, you stated that you don't think that there is a potential cascading problem from these 1% errors:

This is from the Senate Special Committee on Year 2000 website:

...The susceptibility of the current generation of switching equipment to software-based disruption was demonstrated in the collapse of AT&T's long distance service in January 1990. A line of incorrect code caused a cascading failure of 114 electronic switching systems. * * * [Again] the potential for software-based disruption of common channel signaling was demonstrated in June 1991, when phone service in several cities, including 6.7 million lines in Washington, DC, was disrupted for several hours due to a problem with the network's Signaling System 7 protocol. The problem was ultimately traced to a single mistyped character in the protocol code.[2]

[2] Critical Foundations: Protecting America's Critical Infrastructures. The President's Commission on Critical Infrastructure Protection; October 1997...

-- James Chancellor (publicworks1@bluebonnet.net), October 14, 1998.



But it only takes one critical failure to count. People can probably live with many less-critical errors, but which "one" will kill a computer? Which "one" will affect home mortgage data or tax data or salaries or stock market trading if 4500 systems are exchanging varying percentages of "bad" data continuously for several days undetected?

Now, where is the bad data, and where is the good data? You are looking for 20,000 10-carat gold needles amid 20,000 12-carat gold needles in a haystack of 20,000,000 yellow needles.
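As a crude illustration of how fast bad data can spread among exchanging systems, here is a toy simulation (only the 4500-system count and a 1% seed rate come from this thread; the daily-exchange pattern is invented):

    import random

    # Start with 1% of 4500 systems holding bad data, then let each
    # system pull a feed from one random partner per day. Purely a
    # thought experiment, not a model of any real network.
    random.seed(1)
    N = 4500
    bad = [i < N // 100 for i in range(N)]  # 1% seeded with bad data

    for day in range(1, 6):  # "several days undetected"
        bad = [bad[i] or bad[random.randrange(N)] for i in range(N)]
        print(f"day {day}: {sum(bad) / N:.1%} of systems hold bad data")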

Another example:

One switch (of two in the telephone "substation" (?)) failed about 6 weeks ago on the east side of Atlanta.

It wiped out 911 service for 9 counties and 14 smaller cities for 3 days. Regular phone service was okay; you just had to dial the non-emergency number to get to the police.

Failure in these cases is sequential, so the probability of success in a sequential situation is calculated exponentially: if 2% randomly fail in a sequence of 8 machine transactions, the probability of success is

(.98)^8 ≈ .851

If 3% randomly fail, the probability of getting the right answer is

(.97)^8 ≈ .784
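The same arithmetic, generalized to longer chains (only the 2%/3% rates and the 8-transaction chain are the figures above; the longer chains are extrapolation):

    # Probability that a chain of sequential machine transactions all
    # succeed: (1 - failure_rate) ** chain_length.
    for fail_rate in (0.02, 0.03):
        for hops in (8, 20, 50):
            ok = (1 - fail_rate) ** hops
            print(f"fail rate {fail_rate:.0%}, {hops:2d} transactions: "
                  f"success probability {ok:.3f}")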

-- Robert A. Cook, P.E. (Kennesaw, GA) (cook.r@csaatl.com), October 14, 1998.


James,

I understand they have made changes to avoid cascading (and embarrassing) problems like that in the future and have set up agreements with competing carriers to improve their reliability guarantees; besides, that was not a complete USA shutdown, just several cities.

I'm not defending them, and nobody can know which failure will cause how much shutdown, but I was clarifying the difference between 100 systems each at 99% complete (the original question) and 100 systems with 99 of them (99%) fully complete - which is what they mean when they say 99% complete. I also said the odds of meeting this goal are 0.000000013 percent.
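To make the two readings concrete, a small sketch (the 100-system and 99% figures are from the original question; the per-transaction error behavior is an invented assumption):

    # Two readings of "99% complete" across 100 interconnected systems:
    systems = 100

    # (a) Every system has 99% of its code remediated. If, hypothetically,
    # the remaining 1% gives each system a 1% chance of corrupting a
    # transaction, a transaction touching all 100 rarely survives:
    print(f"(a) all at 99%: {0.99 ** systems:.1%} chance of a clean pass")

    # (b) 99 of the 100 systems are fully done; one is simply broken.
    # That is one known-bad system to shut down or route around.
    print("(b) 99 of 100 done: one failed system to isolate")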

Brad

-- Brad Waddell (lists@flexquarters.com), October 15, 1998.

