My Thoughts on FOF and Crisis Centers

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

I am not a computer type. Let me say that first. It may become apparent that it takes me a while to grasp some of these concepts. (Might as well flame myself first.)

From the information available, it seems there are two important trends that are becoming growth industries. Neither gives me a great deal of confidence about the state of compliance in the IT industry.

Last fall (I think) I heard the term "fix on failure" for the first time. At the time I was still learning about the problem and did not understand what this implied. Now I have come to realize that any company or government that adopts this as a valid policy has already admitted defeat: it is unable to remediate and become compliant on time.

In addition, it will not work in many cases. By the time some of these failures are discovered, sometimes well down the road, there will be too much data corruption to correct unless a LOT of time is invested. Meanwhile, the system keeps functioning, but the data is kaput.

The second wonder term of the year is "crisis center," or one of the variations corporations are using to describe the place that may become their Alamo. Given that these problems are hard to fix, these centers may become a long-term operation if the company is fortunate enough to survive.

Also acknowledged is that many of these problems will not surface for a period of time and then the problem becomes magnified. At this point the data is thoroughly corrupt and may have corrupted numerous other systems.

So to my question. Even if the original problem is fixed after a few days, weeks, or months, just how long will it take to fix all the data?

I think I needed to write this to understand it all better. It is very troubling. Am I missing anything?

-- Mike Lang (webflier@erols.com), July 13, 1999

Answers

You are absolutely correct. Fix on failure is the equivalent of driving around with no oil in your car. You know that you're going to have a problem. You just hope that there's a tow truck nearby when your car dies.

Well, some people are going to be stopped right next to the Standard station, others are going to be in the bad lands...

-- simple (notTooCompex@really.duh), July 14, 1999.


When I used to work for a railroad, "Fix on Failure" was called a derailment, which really means an awesome amount of scrap metal lying where a perfectly good track used to be. Broke up the usual labor/management internecine warfare. We used to look forward to one (as long as we weren't headed for the investigation).

-- Sure M. Worried (SureMWorried@about.Y2K.coming), July 14, 1999.

Mike:

Fix on failure is a perfectly legitimate way for some small corporations to operate. There is NO correlation between a fix-on-failure policy and a company being unable to remediate or become compliant on time. If you're talking LARGE corporations, or utilities, you're still not discussing inability, but unawareness.

My personal opinion is that this data corruption concept has been WAY over-stated, but what do I know? I only have 30+ years in this field. Data corruption will occur during testing. We actually DO ask folks to sit down and look at results before we ASSUME they're correct.

-- Anita (spoonera@msn.com), July 14, 1999.


Anita,

I appreciate your response to Mike. It leads me into similar concerns.

I don't question that testing is verified; my concern, however, is twofold.

1. What is the propensity for error? (Either in the remediated code or in the failure to catch all the old code?)

2. How difficult is it to identify corrupted data?

Two questions that I find all too easy to be pessimistic about, but I am interested in an experienced opinion.

Father

-- Thomas G. Hale (hale.tg@att.net), July 14, 1999.


Mike- (1) You are basically right in your analysis...but... (2) Fix on failure is how (almost) EVERYBODY runs when the system is put into production. It is recognized that some (15%?) errors will be missed in normal component and system testing... The hope is that the errors will occur at a slow enough rate to bumble through. Companies have survived some pretty major errors...and others have gone quietly into the night (and bankruptcy)! (3) Personal experience (33 years in IT, including a lot of EDP auditing). (4) Hang in there. You sound like you are on the right track.

-- Mad Monk (madmonk@hawaiian.net), July 14, 1999.


Mad Monk, your opinion matches other knowledgeable posts on this and other fora, but Hoffmeister believes that your 15% figure is way overblown. Could you please expand on this? Thank you.

-- George (jvilches@sminter.com.ar), July 14, 1999.

Fix on failure is a perfectly legitimate way for some small corporations to operate. There is NO correlation between a fix-on-failure policy and a company being unable to remediate or become compliant on time.

Polly wanna cracker?

If you're talking LARGE corporations, or utilities, you're still not discussing inability, but unawareness.

That's a good one.

You're not discussing "unawareness"; you're discussing "we don't want to spend a dime we don't have to spend; now that we're finally getting protection from liability, wait to see what doesn't work right, cause we'll have 90 days to fix it."

Geesh.

-- Lane Core Jr. (elcore@sgi.net), July 14, 1999.


Thomas:

You asked:

1. What is the propensity for error? (Either in the remediated code or in the failure to catch all the old code?)

Some systems are more complex than others and are fed by more inputs than others. As I stated in another thread, the standard approach to remediation for large systems is to define the system to be remediated, scan for all file/databases used by that system, and then scan for all programs that touch those files/databases. Of course these feeding programs need to then be scanned for any files/databases that THEY use, etc. etc. backwards throughout a chain. On the other hand, a system may be stand-alone, in which case remediation would be much easier and the chance of failure much less.
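The chain scan Anita describes amounts to a transitive walk over program/file dependencies. A minimal sketch of that idea, with hypothetical function names standing in for whatever repository scan a real shop would use:

```python
from collections import deque

def remediation_scope(start_programs, files_used_by, programs_touching):
    """Walk the dependency chain: programs -> files/databases -> feeding
    programs -> their files, and so on, until the chain is exhausted.

    files_used_by(prog)     -> files/databases a program reads or writes
    programs_touching(file) -> programs that touch that file
    Both lookups are assumed to come from a repository scan; the names
    here are illustrative, not from any real tool.
    """
    programs, files = set(start_programs), set()
    queue = deque(start_programs)
    while queue:
        prog = queue.popleft()
        for f in files_used_by(prog):
            if f in files:
                continue
            files.add(f)
            # Every program feeding this file must also be scanned.
            for p in programs_touching(f):
                if p not in programs:
                    programs.add(p)
                    queue.append(p)
    return programs, files
```

A stand-alone system terminates this walk immediately, which is why Anita notes it is much easier to remediate; a heavily interconnected one can pull in most of the shop.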

I've seen more problems at sites that had poor or no source control than at sites using software like (UGH!) Changeman. At the sites with poor or no control over source, I've seen programs remediated and moved into production, and then someone comes along later, makes a change to an unremediated version of the program, and wipes out the correct results. I'd find it pretty hard to estimate a propensity for error given all the different environments in which companies work.

2. How difficult is it to identify corrupted data?

Here again, you're looking for a blanket statement that ignores the complexity of a system. An error in an online system will typically be found either immediately (when a program abends) or when the data is fed to its batch counterparts, typically in the nightly run. In a test environment, the online system can be shut down whenever we feel like it and the batch jobs run and verified by folks who know what the results should look like. I've seen other systems, however, that fed files/databases every day for a week, and then the weekly sum would be fed into another system. This system may feed another system that runs monthly. The weekly sums may look quite on the mark, but not match up to the monthly run.

Lane:

Many small companies simply use OTS software in their businesses. They may very well be in a position to simply get an upgrade if failures occur. I'd like a sardine on that cracker, please. [grin]

You then said: "You're not discussing "unawareness"; you're discussing "we don't want to spend a dime we don't have to spend; now that we're finally getting protection from liability, wait to see what doesn't work right, cause we'll have 90 days to fix it."

I doubt very much that (for instance) Russian utility companies have been waiting for the American Congress to pass legislation before determining that they will fix on failure. I doubt as well that American firms ignored the problem for years KNOWING that mid-1999 legislation would be passed giving them 90 days to fix a problem.

-- Anita (spoonera@msn.com), July 14, 1999.


I think that what a lot of CIOs/CEOs miss is that with the FOF mentality, they are setting themselves up for a big fall. It is well known that a problem can be analyzed, modified, and tested for a fraction of the cost it will take to fix AFTER it has occurred. What would cost $100 to fix beforehand may cost the company hundreds of thousands of dollars after it has failed.

-- Carlie (carlie_scott@yahoo.com), July 14, 1999.

According to recent surveys (Cap Gemini, Gartner, etc.), about 22% of Fortune 1000 companies now concede they won't get even all their mission-critical systems fixed in time. That's why you see almost all major corporations (over 90%) now setting up crisis centers. It's not "unawareness"--it's simple lack of time to get everything properly fixed. The old bugaboo plaguing most such IT projects.

The 15% residual error rate has been often cited by Dick Mills, a rather "middle of the road" guy on the whole Y2K issue (as am I, incidentally); he says he got it from Capers Jones, though as I recall some of Jones's magisterial books on software metrics, this is a high, "outside" number. I believe the actual Y2K residual error rates will be much lower, at least among those firms that do conscientious, thorough testing (that's another big point of contention, of course); the data I've seen, and which Yourdon cited in an article some months ago, based on actual analysis of "remediated and tested" code, showed residual errors of 450-900 per million LOC. But yes, residual errors are going to be a significant problem next year. Anybody who says otherwise has not done his homework.

Re FOF policies in "small" companies: the latest national surveys (NFIB, etc.) indicate that 20-25% of small companies don't plan to do anything about Y2K until next year; some still think it is a "hoax" and some don't want to mess with costly inventory and assessment procedures. Part of the concern for analysts here is in defining just what a "small" company is; definitions vary, but one common definition is that a small company may have anywhere up to 2,000 employees; a medium-sized enterprise is defined as having 2,000 to 20,000 employees. (In Capers Jones' books, a small company is defined as having up to 1,000 employees, and a medium-sized company as having 1,000 to 10,000 employees.) One hopes that most of those outfits adopting an FOF policy are toward the "mom-and-pop" end of the scale but I have not really seen any conclusive data one way or the other on this score.

If you caught my earlier thread on Rep. George Grindley's response to me, you know that Grindley (R-35th District, Marietta, Georgia) wrote that he has been told by a leading remediation expert in Georgia that most SMEs (small and medium-sized enterprises) in Georgia are doing nothing about Y2K; Grindley, who heads Georgia's Y2K task force, considers the potential impact of this to be "devastating." I consider some of this to be exaggerated, but given that many SMEs are suppliers to big corporations, there probably will be some major trouble.

-- Don Florence (dflorence@zianet.com), July 14, 1999.



P.S. I belatedly realized that I needed to connect some dots in my post above. (I seem to have a lot of belated realizations these days, hence my number of P.S. posts!)

Perhaps this will make the "residual error" rate issue clearer. As I recall, GartnerGroup originally estimated that an average 2-5% of all LOC (lines of code) worldwide have Y2K date problems. Now, this estimate is obviously open to all kinds of debate (don't ask me how Gartner derived it in the first place), but for the sake of peace in the family let's accept it. Now, of course, your mileage may vary: some programs may have much fewer (if any) lines with date problems, while other programs (esp. in banking, insurance, etc.) could be chock-a-block full of Y2K-plagued dates. But we are talking here about that mythical beast (never seen in captivity but once allegedly interviewed by Geraldo Rivera), the "average" software program.

OK, take the data supplied in my previous post: testing (actually post-testing testing!) firms have found, on average, 450-900 residual errors (missed Y2K problems plus inadvertently introduced "bad fixes") per million LOC in programs already pronounced "remediated and tested" for Y2K. (N.B. to purists: yes, LOC is a rather lousy software measurement anyway, given the way languages, etc., vary, as compared to the "function point" metric introduced many moons ago by IBM, SPR, and Capers Jones; but I don't have function point data here, and, besides, as a nontechnical outsider, I have enough trouble with LOC!). I presume that these programs used for these post-testing testing samples are supposed to be typical or "average," else the data doesn't mean much. (Perhaps it doesn't anyway!) So, let's take the lowest Gartner estimate, to be conservative, for the number of lines of code that would originally be expected to have date problems before any remediation: 2% (.02) x 1,000,000 LOC = 20,000 LOC with date problems originally. We will pleasantly assume (the sun is shining outside) that there is only one Y2K problem per each of these affected lines of code, so we are talking about 20,000 Y2K errors in our million LOC before any remediation was done. After remediation and testing, our post-testing testing firm found 450-900 errors still lurking per every million LOC, or a residual error rate of 2.25% (450/20,000 = .0225) to 4.5% (900/20,000 = .045). That's why I think the figure Dick Mills uses, a 15% residual error rate, is perhaps too high.
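Don Florence's arithmetic can be checked in a few lines. Every input below is one of the thread's own estimates (Gartner's 2% low figure, the 450-900 residual errors per million LOC), taken as given rather than as measured fact:

```python
LOC = 1_000_000
date_problem_rate = 0.02                        # Gartner's low estimate
original_errors = int(date_problem_rate * LOC)  # 20,000 Y2K bugs pre-remediation,
                                                # assuming one bug per affected line

for residual in (450, 900):
    survival = residual / original_errors
    print(f"{residual} residual errors/million LOC -> "
          f"{survival:.2%} of the original bugs survive")
# prints 2.25% and 4.50% -- well below the 15% figure attributed to Dick Mills
```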

However (qualification/concession time; sackcloth and ashes), I have seen other reports of post-testing testing that found much higher residual error rates than the ones I used above; I don't know the details, how reliable those studies were, or what types of programs were being sampled (e.g., if they were date-intensive banking and insurance programs, you might expect considerably higher residual error rates). Furthermore, if you'll recall Mr. Yourdon's discussion of such matters in his "TB 2000" book and various articles, the efficiency of firms in testing and de-bugging software easily varies by a factor of 100x or more, from the best outfits (e.g., Lucent, Motorola, etc.) to the really mediocre and downright bad ones (which hopefully don't include your vendors!). So, when you see the phrase "remediated and tested," you have to think about the mother of Forrest Gump and her observation about life and a box of chocolates: you never know what you are going to get. Come next January, we might get a lot of goo.

-- Don Florence (dflorence@zianet.com), July 14, 1999.

