Fix on Failure: Not Really Possible


Many communities and companies, especially smaller ones, are anticipating employing a Fix on Failure (FOF) strategy for dealing with Y2K failures. As the name implies, FOF involves waiting for failures to occur before taking action. The implication is that mitigation and pre-deployment of resources are not central to this type of response. FOF is adequate only as a backstop: all mitigation efforts must be undertaken prior to the event, with FOF reserved for the breakthrough failures that still occur.

In the emergency management community a similar strategy seems to be in vogue, one called Respond on Failure (ROF). This strategy is based on the premise that every disruption will have a viable and timely response. Such a strategy might be possible for normal disaster situations, but it is problematic for Year 2000. Many activities that might be required of a community cannot be accomplished through ROF alone. In these situations, failure to deploy resources prior to the event could be catastrophic.

Exercises and tests have demonstrated that communities can respond to routine failures or disruptions (such as flashing traffic signals, computer viruses, stuck elevators, and disruptions in the delivery of prescriptions). What they fail to demonstrate is the ability to prevent or alleviate more serious threats: those affecting special populations. There are three special populations in every community that are extremely vulnerable to Y2K-related failures. They are as follows:

1) The frail elderly

2) People with deficient housing

3) People who are medically and electrically dependent

Support for these groups in the case of Year 2000 disruptions must be preplanned, procedures written, and response plans exercised.

Routine failures quite often lead to or precede serious failures. For example, a routine general failure (a power disruption) can lead to serious individual failures (cessation of ventilators and heating). For people who are ventilator dependent or who have deficient housing, this cascade of events can be catastrophic. When serious failures occur, these populations must be sheltered in an appropriate facility. Opening shelters, transporting elderly and dependent populations safely, and supporting those people once sheltered should be the objectives, and they should be examined in any Year 2000 contingency planning tests.

If any of the following failures occurs, a community's ability to respond to serious events (e.g., heart attacks, automobile crashes, and childbirth) will be impaired: 1) failure of the 911 system, 2) inability of EMS to respond, or 3) inability of healthcare facilities to treat patients. Alternative methods for providing health and medical care must be identified and established now, before a failure occurs. There are many strategies that communities can implement, depending upon the numbers and geographic distribution of potentially affected populations. Two of these are the establishment of casualty collection points (CCPs) and Community Emergency Response Teams (CERTs).

Both approaches share the same goal: establishing and deploying alternative medical treatment capability. CCPs and CERTs should be organized, trained, publicized, and deployed prior to January 1, 2000. Residents must know where they can go if they experience a problem and customary response personnel such as law enforcement, EMS, and Fire are not available. Following Hurricane Andrew, Disaster Medical Assistance Teams (DMATs) were used to staff CCPs, providing austere medical care in the disaster environment. Although not preplanned, the DMATs were quickly established using federal and state resources. For Year 2000 planning, we need to identify a similar response capability that is a local asset, as State and Federal assets may not be available.

Community Emergency Response Teams were initiated by a FEMA program and give local citizen groups the ability to render first aid, perform light search and rescue, and carry out fire suppression. The FEMA program provides an excellent vehicle for establishing alternative response capabilities in the community. This capability can be tapped whenever the normal responders are unable to respond. CERTs will respond on failure, but their response must be preplanned and exercised.

We need to keep in mind that response on failure is acceptable only for disruptions that cannot be preplanned or anticipated. Life safety systems, like emergency care, must have alternative strategies beyond the normal response. Designing and exercising those alternative strategies should be community goals for the fourth quarter of 1999.

-- G Bailey (glbailey1@excite.com), November 22, 1999

Answers

Fix on failure is a misnomer.

If the code is broken, hitting the reset button will not help.

In the case of industrial equipment, "failure" might result in physical damage (blown gauges, busted boilers, fires, fried transformers). Nobody has spare parts for the ENTIRE PLANT.

The proper name is "kluge on failure," which usually means emergency surgery on the system: cutting out the offending subsystems (assuming that is possible), or just shutting the system off entirely and switching to a manual backup plan.
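
For the software side, the classic kluge is date "windowing": instead of fixing the stored data, you patch every place that reads a two-digit year to pivot it into the right century. A minimal sketch in C (the pivot value of 70 here is an illustrative assumption; real shops picked their own):

    /* Windowing kluge: reinterpret two-digit years against a pivot
       instead of widening every date field in the system.
       PIVOT is an illustrative assumption, not a standard value. */
    #include <stdio.h>

    #define PIVOT 70   /* 00-69 -> 2000-2069, 70-99 -> 1970-1999 */

    int full_year(int yy)   /* yy is a stored two-digit year, 0-99 */
    {
        return (yy < PIVOT) ? 2000 + yy : 1900 + yy;
    }

    int main(void)
    {
        printf("99 -> %d\n", full_year(99)); /* 1999 */
        printf("00 -> %d\n", full_year(0));  /* 2000, not 1900 */
        return 0;
    }

It buys time, but note what it does not do: the two-digit data stays in place, so the same surgery has to be repeated at every point the year is touched.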

It should be pointed out that this is the way the electric power industry operates under normal conditions: things break, and they fix them. This works for two reasons:

1) The failures are usually "single point failures." 2) The remedy is obvious (e.g., a transformer blew up, so replace it).

FOF is fraudulent as a Y2K strategy (even though that is how virtually all emergency management systems work), because Y2K problems are likely to be both systemic (that is to say, a common mode failure, like 1000 valves of the same kind going bonkers all at once) and nonobvious (meaning it could take WEEKS to figure out what happened and how to fix it).
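
To make "systemic and nonobvious" concrete, here is a minimal sketch (in C; the record layout and names are invented for illustration) of the kind of bug at issue. Nothing crashes and no alarm sounds; every copy of this code simply starts picking the wrong record at the same moment:

    /* A silent common-mode failure: comparing raw two-digit years.
       After rollover, year 00 sorts below 99, so the newest record
       quietly stops being selected. Struct and names are invented. */
    #include <stdio.h>

    struct reading { int yy; const char *data; };  /* yy = two-digit year */

    const struct reading *newest(const struct reading *r, int n)
    {
        const struct reading *best = &r[0];
        int i;
        for (i = 1; i < n; i++)
            if (r[i].yy > best->yy)   /* 0 > 99 is false: 2000 data loses */
                best = &r[i];
        return best;
    }

    int main(void)
    {
        struct reading recs[] = { {99, "1999 reading"}, {0, "2000 reading"} };
        printf("newest: %s\n", newest(recs, 2)->data); /* the 1999 one wins */
        return 0;
    }

Deploy that comparison in a thousand places and they all go wrong together, with no error message pointing at any of them.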

A FOF strategy is basically a corporation saying, "We'll have fifty Dutch boys with their thumbs ready to stem the oncoming North Atlantic."

-- huntchback (quasimodo@belltwor.com), November 22, 1999.


They also seem to assume that when something breaks, they will be able to call the manufacturer or parts warehouse and have the part FedExed to them the next day. This may not be possible: the part may not be available, or the transportation system may be screwed up too. Or the bank may not be able to process the transaction to pay for the parts (after all, the guy selling you the part wants to get PAID before he ships it, unless he is a complete fool).

Also, there may be a lot of companies competing for the same parts, and those parts will almost certainly be in short supply due to the increased demand.

FOF is a sham.

-- Bill (billclo@msgbox.com), November 22, 1999.


I disagree with flat statements that FOF cannot work. I have seen large systems totally stopped by processing errors, design errors, etc., and then restarted with workable (perhaps not pretty) fixes in a day or so.

The key to FOF is the degree of interlocking failures. The fixes I've seen always had a working infrastructure under them: we could get to work, we had electricity, there were no riots in the streets, and so on. FOF will fail when there are too many simultaneous failures for the fixers to do their work. FOF will also fail when the failure takes longer to repair than the organization can survive without the affected system. If it takes 6 months to repair a grocery chain's ordering system, that chain will have ceased to exist.

FOF is not impossible, but it is highly risky. I would not willingly put myself in a FOF situation.

-- bw (home@puget.sound), November 22, 1999.


Communities? Our entire state gov't (AL) is going to FOF.

-- rolltide (rolltide@bama.com), November 22, 1999.


All --

Fix on Failure will probably *not* work with embedded systems. What is *most* likely to be broken in an embedded system is the internal code. And simply replacing the prom isn't going to do the trick (that's one of the standard dodges, and it works fine as long as the actual code isn't wrong). If the code is broken, then you have to fix the code (see the sketch after this list for the kind of bug involved), which presumes:

a). The fault has already been found by the vendor and repaired and they can send out a replacement with 'fixed' code. (Although, if this is the case, why the heck didn't they fix it before it broke?)

b). Failing that, you have to get the original source code. (Assuming it is still available and exists on media you can access; it won't help if it is on an 8" floppy disk or, for that matter, on a stack of paper.)

c). The errors must be corrected.

d). The program must be compiled. (Assuming that you have a compatible compiler that runs on a platform you still have.)

e). The program must be linked. (Again, assuming that you have a compatible linker that runs on a platform you have.)

f). The program must be loaded. (See above.)

g). The program must be burned into a prom, or flashed or what have you. (Again, assuming that you have a supply of the correct proms, or a supply of compatible replacement parts.)

h). The prom must be tested. (Which assumes that the original test bed for the product either still exists or can be reconstructed.)

i). Now, assuming ALL of the above, and that you made no *new* errors, you can send out a replacement part.
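
For the flavor of what's actually sitting in that prom, here is a minimal sketch (in C; the RTC variable stands in for a hardware register, and every name and value is invented for illustration) of the classic firmware failure: the clock chip hands back a two-digit BCD year, and the interval arithmetic built on it goes negative at rollover.

    /* Hypothetical firmware fragment. rtc_year_bcd stands in for a
       real-time-clock register holding the year as two BCD digits.
       All names and values are invented for illustration. */
    #include <stdio.h>

    static unsigned char rtc_year_bcd = 0x00;   /* was 0x99 before rollover */

    static int read_year(void)                  /* BCD -> binary, 0-99 */
    {
        return (rtc_year_bcd >> 4) * 10 + (rtc_year_bcd & 0x0F);
    }

    int main(void)
    {
        int last_service_year = 97;             /* stored as two digits: 1997 */
        int years_in_service = read_year() - last_service_year;

        /* In 1999: 99 - 97 = 2, fine. In 2000: 0 - 97 = -97, and the
           overdue alarm below silently stops firing until the code
           itself is corrected. */
        if (years_in_service >= 2)
            printf("maintenance overdue\n");
        else
            printf("years in service: %d\n", years_in_service);
        return 0;
    }

A fresh prom burned from the same source fails exactly the same way, which is why none of steps a) through i) can be skipped.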

BWAHAHAHAHAHAHAHAHAHAHA!

ROFLMAOPMP!

In the case of stuff that was built in the early to mid 80's, I would be absolutely *FASCINATED* to see somebody try to pull all of that together! I mean, I'd pay *money*!

And, before ole 'FactFinder' comes on, this is not a hypothetical. This is exactly what one of my former clients is attempting to do. (And having 0 success, I might add. The source code is a stack of paper about 6 inches high. The compiler doesn't work on their current development platforms. The linkers and loaders don't either. The test equipment was scavenged 9 years ago for other projects, as it is expensive. And there isn't any replacement part for the microcontroller (which is where the ROM lives). So, in about 3 months, they are trying to hack together a replacement for a system that took the better part of a year to design and implement.)

-- just another (another@engineer.com), November 22, 1999.

