Y2k Metrics and Error Rates - II


Rather than continue the discussion on the original thread, Y2k Metrics and Error Rates, I've started a new thread.

Basically, threads reaching 60 posts become too long to really follow. Plus, the discussion had really moved on to a secondary aspect.

Quick review: the point of the first thread was an attempt to quantify the number of bugs or errors that could be expected at rollover, using a set of metrics developed by Howard Rubin. The result was an expected .6 bugs per 1000 LOC, which compares to the existing bug density of between 2.3 and 5.3 per 1000 LOC. Realize, of course, that this is in addition to the existing bugs. The general point was to see if the expected errors would fall somewhere within what I would call the "fault tolerance level" of the existing systems.
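As a rough sketch, here is that comparison spelled out. Purely illustrative; the densities are just the figures quoted above, applied to an arbitrary 1 million LOC system:

    # Purely illustrative: the quoted densities applied to a 1 million LOC system.
    Y2K_BUGS_PER_KLOC = 0.6                    # expected new Y2K bugs at rollover
    EXISTING_BUGS_PER_KLOC = (2.3, 5.3)        # existing bug density range

    kloc = 1_000                               # a 1 million LOC system
    new_bugs = Y2K_BUGS_PER_KLOC * kloc
    low, high = (d * kloc for d in EXISTING_BUGS_PER_KLOC)

    print(f"New Y2K bugs:         {new_bugs:,.0f}")                        # 600
    print(f"Existing latent bugs: {low:,.0f} to {high:,.0f}")              # 2,300 to 5,300
    print(f"Addition on top:      {new_bugs/high:.0%} to {new_bugs/low:.0%}")  # 11% to 26%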

No one really questioned the above. However, the basic argument became that, unlike today, where "most" critical errors have already been dealt with, the rollover for Y2k will introduce a large number of critical errors. (I'm sure someone will correct me if I misstated the basic idea here).

OK, then. Some of this will be SWAG, so bear that in mind.

First, how many will occur in mission-critical systems? One measure could come from the Gartner Group, which estimates that 15% of US businesses will have at least one mission-critical failure. That certainly does not directly imply that 15% of the bugs will be in mission-critical systems; the number will be somewhat less, for a variety of reasons. One being that, compared to normal maintenance, much more testing is being done, at least on mission-critical systems. My first guess would be 10% of that, leaving 1.5% of the bugs in mission-critical systems. But to be conservative, I'll use 5%. Again, mostly SWAG, but 5% of the .6 per 1000 LOC will be in mission-critical systems, or .03 bugs per 1000 LOC.

OK, but how many of those will be what someone referred to as "showstoppers"? Again, I believe Y2k errors will run the gamut from trivial to showstoppers. I believe Ed Yourdon used 1%, but again, not strictly in this context. So once again, being conservative, I'll use 10%. So 10% of the .03 per 1000 LOC will truly be showstoppers, or .003 per 1000 LOC, meaning processing cannot continue until fixed.

But finally, the last question is how many can be fixed in relatively short order, basically minimizing the impact? Again, partially SWAG, but the Gartner Group estimates 90% of the mission-critical failures will be fixed within 3 days, meaning 10% will require longer. In and of itself, the length of time is not a true measure, but it's the best available. For example, one client went 8 days with an error that stopped them from making any electronic payments through their house bank. In fact, it was a full 3 days before the bank even reported the error back, that the account number being passed was wrong. This I consider a truly mission-critical failure that lasted 8 days, but other than quite a few ulcers in AP and some additional late charges, it caused no real havoc for the company.

But using the Gartner Group's numbers, 10% of the .003, or .0003 per 1000 LOC, will be mission-critical, showstopper-type errors that take longer than 3 days to fix. So for Ed Yourdon's example of a 100 million LOC inventory, 30 errors would be expected to meet all of the above.
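For anyone who wants to check the arithmetic, here is the whole chain as a short script. Every percentage is just the SWAG figure from above, so change the constants and the result scales accordingly:

    # The estimate chain from above. All percentages are the SWAG values from the text.
    BUGS_PER_KLOC          = 0.6    # expected Y2K bugs introduced per 1000 LOC
    MISSION_CRITICAL_SHARE = 0.05   # share landing in mission-critical systems
    SHOWSTOPPER_SHARE      = 0.10   # share of those that halt processing
    SLOW_FIX_SHARE         = 0.10   # Gartner: 10% take longer than 3 days to fix

    rate = (BUGS_PER_KLOC * MISSION_CRITICAL_SHARE
            * SHOWSTOPPER_SHARE * SLOW_FIX_SHARE)      # .0003 per 1000 LOC

    inventory_kloc = 100_000                           # Ed Yourdon's 100 million LOC example
    print(f"{rate:.4f} per 1000 LOC -> {rate * inventory_kloc:.0f} slow showstoppers")
    # 0.0003 per 1000 LOC -> 30 slow showstoppers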

Once again, most of the above is SWAG. Some of the errors will have temporary workarounds; some will not. At best, this represents an average. The earlier someone started, with more testing, the smaller the number becomes; conversely, a later start, with less testing, will yield a higher number. As well, this represents code requiring remediation; a substantial subset of code already implemented is compliant.

To be honest, I truly don't know if the above, given the lack of real metrics, is worth much. But I thought I'd take a stab at quantifying these errors again. For what it's worth.

-- Hoffmeister (hoff_meister@my-dejanews.com), May 22, 1999

Answers

And this is what Hoff will be content to do through Dec 31, 1999 -- play games like this. The lull in Y2K awareness on the part of Joe Sixpack will not last -- count on it. Where do YOU stand with your preparations for Y2K? You may intellectually believe that Y2K is going to be a disaster, but are you actually doing anything about it? Responding to polly tripe like this may be challenging, but honestly, if you are not actively preparing at this late date, you are effectively a pollyanna, regardless of what you profess.

-- King of Spain (madrid@aol.com), May 22, 1999.

Nice job, King. Keep up the good work. Maybe someday, you'll "Get It", that understanding the nature of the problem, its extent, and what can be expected is called "risk analysis", and is basically essential to determine what preparations are needed, beyond the basic preparations everyone should make.

Till then, continue to respond by simply ignoring the post. You do it so well.

-- Hoffmeister (hoff_meister@my-dejanews.com), May 22, 1999.


Well, I don't know _exactly_ what systems Ed has worked with, but the figure of 100 million LOC seems ridiculously high to me. In my career I have worked on hospital information systems and core banking systems - and I am not talking about little bitty accounting systems either. Rather, systems that hold all the patient records for large hospitals (HISs) and retail banking systems that hold and process customer accounts and send electronic transfers. These systems have, on average, about 1000 to 5000 routines with an average of about 300 lines per routine. I would say, then, that more modern systems built over the past 10 years would average about 1 million lines. This would give an average 1 in 3 chance of a system failing totally without remediation.
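To spell out where that "1 in 3" comes from: I'm applying Hoffmeister's .0003-per-1000-LOC rate to a system of about a million lines, and treating the bugs as independent (a Poisson assumption on my part, not anything from his post):

    import math

    # Rough reading of "1 in 3": the thread's .0003-per-1000-LOC rate applied
    # to a ~1 million LOC system, treated as the mean of a Poisson process
    # (my assumption) to get the chance of at least one such failure.
    RATE_PER_KLOC = 0.0003
    system_kloc = 1_000                              # roughly 1 million lines

    expected = RATE_PER_KLOC * system_kloc           # 0.3 per system
    p_at_least_one = 1 - math.exp(-expected)         # ~26% under Poisson

    print(f"Expected showstoppers per system: {expected:.1f}")        # 0.3
    print(f"Chance of at least one:           {p_at_least_one:.0%}")  # 26%, call it 1 in 3 or 4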

-- a person (a@person.com), May 22, 1999.

................$000.00 hundreds

................$0,000.00 thousands

................$0,000,000.00 millions

................$0,000,000,000.00 billions

................$0,000,000,000,000.00 trillions

................$3,000,000,000,000.99 potential in lawsuits, as has been reported

................ divided by population of earth = +- $600.00 from every man, woman, child on earth PLUS

....................$650,000,000,000.99 to fix the non-problem, as has been reported

.................divided by population of earth = +- $130.00 from every man, woman, child on earth EQUALS

.................... $600.00 + $130.00 = $730.00 from every man, woman, child ON EARTH

NO PROBLEM??? UH.... HELLO

We won't even attempt to figure lost wages, medical expenses, (insert here - ad infinitum)

GO FIGURE!!!

PS Not responsible for the accuracy of these REPORTED estimates, or of course the actual population of the ENTIRE EARTH but I think you get the idea (plus or minus).

-- unspun@lright (mikeymac@uswest.net), May 22, 1999.


Well, now, two for two. Can we go for the hat trick?

-- Hoffmeister (hoff_meister@my-dejanews.com), May 22, 1999.


I'll try. Hoff, this stuff is too complicated. It takes too much thinking. There are too many details. You're asking us to concentrate. And for what? If we go through all that effort, we only arrive at conclusions we already know are false, and we reached those conclusions without having to think at all. You call this a reward for our efforts? It's just not worth it all the way around.

-- Flint (flintc@mindspring.com), May 22, 1999.

Wow. Hoff, I hope you did not hurt yourself reaching for that straw. As I have been wont to say, you are trying to rationalize chaos. What causes the problems at the rollover may be logically explained but I am inclined to think that the results will not stand up to the same scrutiny. Too many variables.

-- Mike Lang (webflier@erols.com), May 22, 1999.

Mike:

I agree that the details cannot be known. I expect problems we'd never have dreamed of. We'll be amazed.

But predicting details and predicting scope and duration are different things. We can't know what will go wrong, but maybe we can guess how much will go wrong, within an order of magnitude or so.

-- Flint (flintc@mindspring.com), May 22, 1999.


Repost from Nabi's earlier Euro thread -- which Hoff and Flint conveniently ignored:

The Return of Inflation Or the Beginning of Y2K?

By Martin A. Armstrong, Princeton Economics International (London Office), © Copyright May 14th, 1999

[snip]

The more important issue behind the Euro weakness is the dirty little secret that is being kept from the general media at all costs - the Euro clearing system still does NOT work! Merchants who take Visa or Master Card normally receive instantaneous cash when they deposit your transaction with their bank. This is now true for all currencies EXCEPT the Euro! Euro credit cards are taking days to clear and as such the Merchants have been the first to feel the effects of a clearing system that still does not work. Between banks, all currency transactions settle at the end of every day. Euro settlements are also taking days. Banks in London are putting Euro checks on a 4-week clearing status. The net effect, many are starting to discount the Euro in order to accept it. Even American Express has issued only 5,000 Euro based cards. This is not such a good story for a currency that was going to knock the dollar off this planet. Most central banks are still unofficially not accepting Euros as a reserve currency, which has been told to us on a confidential basis. If publicly confronted on this issue, everyone would naturally deny it, but the failure of the Euro has been expressed in its near perfect swan dive since January 1st.

The Europeans are having extreme difficulty solving the problems of the Euro. Most computers cannot calculate fractions of a currency and therein lies a far worse problem than merely Y2K. China's work around for Y2K is to simply turn their computers back 20 years. That trick will work, but calculating fractions of a currency remains impossible when such functionality never before existed. For this reason, your taxes in Germany are still payable in DMarks - not Euros.

-- a (a@a.a), May 22, 1999.


A,

I cannot believe you continue to post the Princeton guy's comments, when most of what he says has already been debunked.

Oops ... I forgot. His Artificial Intelligence told him to say that. :)

-- Stephen M. Poole, CET (smpoole7@bellsouth.net), May 22, 1999.



Mike, Flint "got it". No, there is no real way to predict the details at rollover. This was an admittedly imprecise attempt at quantifying the magnitude, as a comparison to what is "normal".

Too many of these discussions end up as "well, I just know it will be huge, too much to be handled".

-- Hoffmeister (hoff_meister@my-dejanews.com), May 22, 1999.


Where was "most of what he says" debunked Stephen -- by the 18 year olds on Biffy?

-- a (a@a.a), May 22, 1999.

Has anyone got a figure on what percentage of software uses date functions, whether the programs are small, medium or large?

I ask this because there would be some common date-related functions among nearly all software that uses dates. An example is reading the date for the first time in the program, from the BIOS or whatever. Another one is writing to a log file, which typically has the date and time at the beginning of each log entry.

Another fairly common date-related function, which I imagine is in most software, is the comparison of dates between events. Has anyone got any others?

I say this because if the majority of showstoppers happen in a function that is common to every program, then it may be reasonable to assume that the rate of showstoppers would be higher across all software.

Say you have a 1 million LOC program, and one date function in it is a showstopper. Now what if we deal with smaller programs, but the same amount of lines of code in total? 10,000 LOC per program comes to 100 programs. Imagine that the showstopper in the 1 million LOC program is in every single one of the hundred. I come to this theory because they may all be using a common date function.

In my programming I have a small library of functions that I have written myself and use frequently in software I write. So I use the same code over and over again in all my programs.

So if we take Hoffmeister's 100 million LOC inventory, from which it has been extrapolated that there would be 30 showstoppers: what if 10% of those showstoppers are common among 10,000 pieces of software, each having 10,000 LOC? This would mean that each piece of software would have 3 showstoppers.

The actual rate of failure may be higher than expected because of this commonality. Anyone else have thoughts on this?
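As a purely hypothetical illustration of what I mean by a common date function (not taken from any real system), one windowing bug in a shared routine surfaces in every program that calls it:

    from datetime import date

    def parse_yymmdd(s: str) -> date:
        """Hypothetical shared helper reused by many programs.
        Bug: the two-digit year is blindly prefixed with '19', so '00'
        becomes 1900 in every single caller at once."""
        yy, mm, dd = int(s[0:2]), int(s[2:4]), int(s[4:6])
        return date(1900 + yy, mm, dd)        # the common showstopper

    def days_overdue(invoice_yymmdd: str, today: date) -> int:
        """One of many 'different' programs leaning on the same helper."""
        return (today - parse_yymmdd(invoice_yymmdd)).days

    # An invoice dated 2000-01-03, checked on 2000-01-10:
    print(days_overdue("000103", date(2000, 1, 10)))   # 36531 days, not 7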

I tend to agree with Flint's comments that some things will be unexpected and weird.

If a company suffers several showstoppers across a range of software, then things could turn nasty, as management will be screaming from the top to get everything fixed at once, when a programmer can only reasonably work on one thing at a time. Knowing management, they'll keep changing priorities at the drop of a hat. :-)

Personally, in my situation, several pieces of software failing at the same time is going to cause no end of stress. (Actually, is it worth starting a thread about the impact of stress from trying to fix multiple failures at once, or from knowing that you have a whole range of failures to contend with?) If the boss breathes down my neck, as he usually does when something fails, the time to fix is actually anything up to twice as long. He constantly asks me how this works and wants explanations, and at the same time interjects his own thoughts on how to fix it or where to look. Hey, I want to knuckle down and concentrate so I can fix the damn problem.

Just before I go, a little note about that stress thread idea: what are the legal and medical ramifications of an employer placing too much stress on an employee?

Regards, Simon Richards

-- Simon Richards (simon@wair.com.au), May 22, 1999.


Don't be fooled into thinking that Y2K just affects systems that depend on dates. One of the projects I am working on is upgrading a system to use a Y2K-compliant OS. There is actually little problem with the system as is, since none of it really depends on date or year. But the mandate that has been handed down by TPTB is that it must be upgraded. This has set off a string of problems that have absolutely nothing to do with Y2K, and would normally be considered routine maintenance. But the difference is, instead of being able to schedule these problems when time permits, we, and undoubtedly a large portion of the computing world, are being forced to deal with them all at once, and with an immovable deadline.

-- a (a@a.a), May 22, 1999.

Putting some of the stats in context:

Given: a 15-year-old, 1 million LOC system, which at 2.3 bugs per 1000 LOC should have 2,300 existing old non-Y2K bugs.

In the past 2 years, how many of these bugs have not even been bothersome enough to notice?

Most, if there are 2,300 such bugs in the system.

In the past 2 years, how many of these bugs have been noticed but not considered worth spending time to fix?

Roughly, from one to three hundred. Not worth a close count.

In the past 2 years, how many of these bugs turned out to be less than serious, but worth the bother? i.e. ranging from nuisance to requiring a workaround.

Roughly 20 to 30.

In the past 2 years, how many of these bugs turned out to be serious, critical or showstoppers?

Two!

One of these two was in a subroutine used by several main programs. The subroutine had been changed, and the change tested in some of the main programs that use it, but not all. When another main program that uses it was modified, and that modification was tested, the program was put into production with the defective subroutine, which did not break in the tests. In production, it broke. The programmer who made the change spent most of the night looking in the wrong place: the code he had changed.

Did the system previously have other old non Y2K bugs that surfaced and turned out to be serious, critical or showstoppers? Yes, but they were FOF (fix(ed) on failure) prior to two years ago.

So, what does the statistic of 2.3 bugs per 1000 LOC mean to this system? After years of serious bugs having been fixed on failure: practically nothing, nada, zilch.

Continuing:

Now let's take the estimated average of .6 Y2K bugs per 1000 remediated LOC to be introduced at rollover. Also, let's assume that the remediation team did an above-average job and got that down to .2 Y2K bugs per 1000 LOC. That would translate to 1 bug per 5000 LOC, or in this system, 200 new bugs ready to rock and roll on 1-4-2000. Let's even assume that none of these are "showstoppers" (I'm not sure who means what by that term, but I am guessing it means that the system crashes).

Further, let's assume that 90% of them have absolutely no adverse effects, other than being there, being among the things that may need to be checked when a problem is noticed, and therefore taking up programmers' time looking in the wrong places.

That would leave us with about 20 bugs that do not crash the system, but have some adverse effects. Some of the adverse effects will be noticed quickly; others will take longer. Some will be minor, some will be serious. In this system, a problem that is minor at 10 in the morning can be critical by 3 in the afternoon, and disastrous by 9 at night. Any loss or corruption of data is usually serious, and gets more so by the hour.
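To keep the running totals straight, here is the same tally as a short script. All of the densities and percentages are just the assumptions above, nothing measured:

    # Jerry's hypothetical 1 million LOC system, restated. All figures are
    # the assumptions given above, not measurements.
    SYSTEM_KLOC = 1_000

    existing_bugs  = 2.3 * SYSTEM_KLOC       # ~2,300 old non-Y2K bugs already in place
    new_y2k_bugs   = 0.2 * SYSTEM_KLOC       # above-average remediation: ~200 new bugs
    harmless_share = 0.90                    # assumed to have no adverse effect at all

    adverse = new_y2k_bugs * (1 - harmless_share)
    print(f"Existing latent bugs:         {existing_bugs:,.0f}")   # 2,300
    print(f"New Y2K bugs at rollover:     {new_y2k_bugs:,.0f}")    # 200
    print(f"Non-crashing, adverse-effect: {adverse:,.0f}")         # about 20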

Then there are the inputs to this system from a variety of sources. One or more of these can introduce erroneous data.

Jan 4, 2000

The day starts with everyone, excepting those still hung over, wired; avidly watching for signs of a problem, worried that they might find some. Things start smoothly, but then someone notices symptom A, while someone else notices symptoms B and C. The common phenomenon that multiple new symptoms are usually, but not always, results of the same cause is not a safe rule of thumb on that day, so triangulation is of limited use. Each symptom needs to be regarded as likely to be the result of a unique cause, until diagnosis indicates otherwise. Backing out recent changes is not an option.

Diagnosis continues amidst continuing phone interruptions. New symptoms are noticed before diagnosis of any of those noticed earlier has gotten very far.

The programming manager of this system asks his management for some help from other programming groups, but they are also up to their elbows, and he has to make a case for keeping his programmers from being loaned out to other groups. He tries to bring in consultants from "quality" consulting firms, but they are either booked or shy of liability problems.

The Y2K remediation team anxiously pores over remediated programs, looking for errors. Late in the day an erroneous remediation is found, but it does not explain any of the symptoms. One symptom is tracked back to a missed Y2K bug, but that bug does not account for any of the other symptoms.

By 7 the next morning, panic buttons are worn out all over the place. Several levels of management, both IT and users, are getting hoarse. The programmers are getting bleary-eyed, but no manager is willing to tell them to go get some sleep. No one can remember a problem in this system that could not be fixed, or at least backed out, before the next business day.

Jan 20, 2000

Many of the programmers have bailed out in spite of bonus offers to work 18-hour days, 7 days a week. Many managers have resigned or been fired. The board of directors has announced an interest in a merger, with anybody. :-)

Back to today

But, could it actually get that bad? It could. Is it likely? I hope not. From the remediation effort of which I'm aware, this system would be presumed to have a better chance of coping fairly well with Y2K than other systems which have had more rushed remediation efforts, or no remediation at all.

I would not want to be anywhere near any critical systems that have had below average quality remediation. Some of them will break hard and be slow to fix, and the programmers supporting them will be asked, begged, pressured, etc. to work bizarre hours until the system is working, or until the programmer quits. How many of these will be critical dominos, and in what parts of the domestic or international economies, remains to be seen.

Statistics are not adequate to enable us to forecast which systems will get how bad for how long. Gartner has published some estimates of future outcomes, but we, and they, have yet to learn how well their methodology works at forecasting Y2K events. Recent statistics of increasing slippage of intermediate remediation deadlines lead me to wonder whether Gartner had anticipated them, and had them in the forecast model, or whether the model needs to be revised.

So, as has been said before, you take your guesses and place your bets.

Jerry

-- Jerry B (skeptic76@erols.com), May 24, 1999.



Agreed that this analysis was a real stretch, which I basically said at the top. You can put in conservative guesses to try and get some "bounds" on the problem, but they are still guesses.

One point. I think you're assuming the 2.3/KLOC is entirely within stable systems, and I don't know that I agree. Any bug density stat would have to include newly implemented systems, as well, which increases the level of severe errors present.

-- Hoffmeister (hoff_meister@my-dejanews.com), May 24, 1999.


I think another factor that gets lost in the translation is that when the number of failures goes up linearly, the time to solve them goes up exponentially. I have no data on this, but I base it on the availability of key resources, both man and machine.
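One toy way to picture it (no data behind this, just a single-queue model of a fixed programming resource, which is my own illustration): as the failure rate approaches the rate at which fixes can be produced, the average time to resolution blows up, far faster than linearly.

    # Toy model only: a single programmer working a queue of failures
    # (an M/M/1 queue). No real data here, just an illustration of how
    # resolution time grows as arrivals approach the fix capacity.

    def avg_days_to_resolve(failures_per_day: float, fixes_per_day: float) -> float:
        """Average wait-plus-repair time per failure for one fixer."""
        if failures_per_day >= fixes_per_day:
            return float("inf")               # the backlog grows without bound
        return 1.0 / (fixes_per_day - failures_per_day)

    FIXES_PER_DAY = 1.0                       # one failure cleared per day, on average
    for rate in (0.5, 0.8, 0.9, 0.95):
        print(f"{rate:.2f} failures/day -> "
              f"{avg_days_to_resolve(rate, FIXES_PER_DAY):.1f} days per failure")
    # 0.50 -> 2.0, 0.80 -> 5.0, 0.90 -> 10.0, 0.95 -> 20.0 days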

-- a (a@a.a), May 24, 1999.
