I'm With Orson, how come the "Dreaded Embeddeds" haven't failed before?

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

Need an honest, experienced person to explain, in simple language.

-- Simpleton (justexplain@manytimes.com), December 04, 1999

Answers

Good question!
But, perhaps some *HAVE* failed and we just weren't informed about them? They may have failed because they were "bad" or because of other reasons than a wrong date?

-- Birdlady (Birdlady@nest.net), December 04, 1999.

Chips fail all the time. They just don't do it in massive numbers like the y2k problem has the potential to cause. If I interpreted the articles posted recently correctly, many of these chips are running an internal clock which may be isolated from any outside time reference, and the internal clock is not very accurate in many cases. It would seem to me then that many of the plant explosions and such we are hearing about could be due to inaccurate chip clocks reaching rollover prematurely and malfunctioning. Any of the experts have an opinion on this?
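One rough way to sanity-check the drift idea is to work out how far ahead a free-running clock could get. The sketch below is only an illustration: the function name `drift_days` and the ppm (parts per million) error figures are assumptions, not data from any real controller or from the articles mentioned.

```python
# Sketch only: the ppm values below are illustrative assumptions,
# not measurements from any real controller.
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def drift_days(ppm_fast: float, years_running: float) -> float:
    """Days a free-running clock gains after `years_running` years
    if its oscillator runs `ppm_fast` parts per million too fast."""
    elapsed_seconds = years_running * DAYS_PER_YEAR * SECONDS_PER_DAY
    gained_seconds = elapsed_seconds * ppm_fast * 1e-6
    return gained_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for ppm in (20, 100, 1000):
        print(f"{ppm:>5} ppm fast, 10 years unattended: "
              f"{drift_days(ppm, 10):.2f} days ahead")
```

On these assumed figures, even a clock that is off by 1000 ppm for ten years ends up less than four days ahead, so drift of this kind would move a rollover by days rather than months. A clock that was set wrong to begin with is a different matter.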

-- Nikoli Krushev (doomsday@y2000.com), December 04, 1999.

I will post this again. But what about all the rumor-mongering about the chips in offshore oil? Did they never, ever fail from "whatever"? Or have we been "duped again" with "they're all gonna crash because of a date"?

-- Simpleton (justexplain@manytimes.com), December 04, 1999.

I've done some embedded systems programming, and it seems a completely bizarre myth to me that systems with no external clock interface can still fail. The clock on the board has to be set, just like any other clock. Either it gets set from a console/keypad, or it gets the time from some other controller. That controller again must either get the time from a user interface or from another system. Somewhere, there has to be a clock you can set!
The only other alternative would be that the manufacturer sets the clock before installation and then carefully puts it into a board with a clock battery on it, keeping the processor powered all the time. This is kind of unlikely. If nothing else, you are creating a new failure mode when the clock battery finally goes. From my experience, there probably are millions of clocks out there, but they are all reset to some base time at power up (like an old DOS machine coming up at 1/1/80), and their clocks have no relationship to the correct time. They will not fail at 1/1/2000.
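The power-up behaviour described above can be sketched in a few lines. This is a minimal illustration, not any vendor's firmware: the class `PowerUpClock` and the 1/1/1980 base date are assumptions borrowed from the old-DOS example.

```python
from datetime import datetime, timedelta

# Assumed base date, echoing the old-DOS 1/1/80 example; real devices vary.
BASE_TIME = datetime(1980, 1, 1)


class PowerUpClock:
    """Hypothetical clock with no battery backup and no way to set it."""

    def __init__(self) -> None:
        self.seconds_since_power_up = 0

    def tick(self, seconds: int) -> None:
        self.seconds_since_power_up += seconds

    def power_cycle(self) -> None:
        # Losing power loses the count; the clock starts over at BASE_TIME.
        self.seconds_since_power_up = 0

    def now(self) -> datetime:
        return BASE_TIME + timedelta(seconds=self.seconds_since_power_up)


if __name__ == "__main__":
    clock = PowerUpClock()
    clock.tick(10 * 86_400)      # ten days of uptime
    print(clock.now())           # ten days past the base date
    clock.power_cycle()
    print(clock.now())           # back to the base date
```

A clock like this only measures uptime. With the assumed base date it would read 1/1/2000 only after twenty years of uninterrupted power, and that moment would have nothing to do with the calendar rollover.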
So, I expect that on 1/1/2000, only a small minority of systems will fail due to date problems. Almost all of those will be fixable by setting the clock back. There will be three big problems however:

Finding the system that needs to be reset, and figuring out how to do it. A failure has to be traced to a system, and then someone who knows how to operate the thing has to be found to reset the time.

Damage done due to control system errors. This is the (hopefully rare) case like the New Zealand incident, where a bad controller causes a system to physically damage itself (a steel plant, I think, and the damage was in the millions).

Damage done by operators who misinterpret bad data displays.

None of this helps you decide on the odds of power off or chemical accidents, but at least it isn't 100%. My guess though is that there is enough of this stuff around that we will see at least some accidents.

-- You Know... (notme@nothere.junk), December 04, 1999.

----------------------------------------------------------------------
Keywords: Embedded Systems, Y2K, embedded chips, microcontrollers, microchips, chips, data acquisition, SCADA, programmable logic controller (PLC), process control, manufacturing automation
----------------------------------------------------------------------
Chemical Safety and Hazard Investigation Board - PIPELINE SAFETY ADVISORY BULLETIN, July 7, 1999
The Office of Pipeline Safety, U.S. Department of Transportation, has issued a Pipeline Safety Advisory Bulletin following a June incident in Washington State which claimed three lives.
Background: During an Office of Pipeline Safety (OPS) investigation of a recent pipeline incident, OPS inspectors identified inadequate SCADA performance as an operational safety concern. Immediately prior to and during the incident, the SCADA system exhibited poor performance that inhibited the pipeline controllers from seeing and reacting to the development of an abnormal pipeline operation. Preliminary review of the SCADA system indicates that the processor load (a measure of computer performance utilization) was at 65 to 70 percent during normal operations. Immediately prior to an upset condition occurring on the pipeline, the SCADA encountered an internal database error. The system attempted to reconcile the problem at the expense of other processing tasks. The database error, coupled with the increased data processing burden of the upset condition, hampered controller operations. In fact, key operator command functions were unable to be processed immediately prior to and during the abnormal operation. It is possible that post-installation modifications may have hampered the system's ability to function appropriately. The combination of the database error, the inadequate reserve capacity of the SCADA processor, and the unusually dynamic changes that occurred during the upset condition appears to have temporarily overburdened the SCADA computer system. This may have prevented the pipeline controllers from reacting and controlling the upset condition on their pipeline as promptly as would have been expected.
For further information, contact Chris Hoidal, Director, OPS Western Region at 303-231-5701.

-- G Bailey (glbailey1@excite.com), December 04, 1999.

LINK http://www.y2k-status.org/EmbeddedFailures.htm
The Institution of Electrical Engineers (IEE.org.uk) The Millennium Problem in Embedded Systems - Casebook
Much has been written about Year 2000 problems in embedded systems, but the emphasis has been principally on the process of investigation, with little information about real cases of failure. While the incidence of Year 2000 problems in embedded systems has been found to be relatively low, the impact of the problems has in some cases been business threatening. Action 2000 in conjunction with The Institution of Electrical Engineers has undertaken a data collection initiative to collate facts about actual Year 2000 failures in a wide range of embedded systems. Action 2000 through the IEE requested leading consultant engineering companies in the UK to list the occurrence of actual faults found in equipment. Because of the range of specialisms and industries worked in by these companies, a good representative sample is thought to have been found.
- AEA Technology, BSC Consulting, ERA, IBM, ICS, Real Time Engineering, The Houndscroft Partnership
The equipment categories used in the collection of data for the non-computer entries (60% of the total) were: Logging / monitoring; Other; PLC; SCADA; Smart Instruments; Stand alone instrument.
The areas reported as having most problems with non-computer based systems are (in decreasing order):
1. Calibration, monitoring, data logging, detectors, analysers
2. Building management, including HVAC, fire and security systems
3. Manufacturing and process systems (SCADA, PLC, DCS)
4. Telecommunications and networking
5. Other
The dates which caused the problems were: millennium rollover 71%; leap year problem 9%; multiple date problems 6%; other dates or unknown 14%.
Tava Technologies : A White Paper that Discusses the Significance of the Effect of the Millennium Bug (Y2K) on Process Control, Factory Automation & Embedded Systems in Manufacturing Companies. Feb 98. (pdf)
"To date, with plant floor Y2K experience at over 400 sites, the company has yet to find a single site that did not require some degree of remediation; and, to date, having researched tens of thousands of manufacturing automation systems and components for Y2K readiness, the company has found more than 20% to be either non-compliant or "suspect", that is non-compliant under certain circumstances."
"Problems range from major operational nuisances to erratic production shortages to complete plant shutdowns. But, perhaps the worst case of all will be systems that continue to work but make bad decisions affecting product yields. It may be on January 1, 2000, or it may be days or even months later."
Industry Wakes Up to the Year 2000 Menace (Fortune article)
Ralph J. Szygenda, chief information officer at General Motors, whose staff is now feverishly correcting what he calls "catastrophic problems" in every GM plant. In March the automaker disclosed that it expects to spend $400 million to $550 million to fix year 2000 problems in factories as well as engineering labs and offices.
Rob Baxter, Honeywell's vice president in charge of making his company's line of industrial control products "year 2000 compliant." From what he has seen among Honeywell customers, Baxter fears that "some plants will have trouble operating and will have to shut down. Some will run at a reduced scope. I expect considerable system outages during December 1999 through February 2000."
Manufacturing's task is compounded by the multiplicity of its computer programs. Below the layers of more or less standard software is a vast range of equipment run directly by built-in chips and programs, which outnumber those in the rest of business by a factor of ten.
General Motors - "At each one of our factories there are catastrophic problems," says the blunt-talking executive. "Amazingly enough, machines on the factory floor are far more sensitive to incorrect dates than we ever anticipated. When we tested robotic devices for transition into the year 2000, for example, they just froze and stopped operating."
Only a few companies offer software that can deal with factory problems. Among them are Raytheon Engineers & Constructors, Fluor-Daniel, and Peritus Software Services of Billerica, Mass., as well as the service operations of companies that sell industrial controls, such as Foxboro and Honeywell.
Tava Technologies. Its Plant Y2kOne software includes a database on 10,000 microprocessors, related control devices, and software from more than 1,000 vendors that is used on the factory floor. Among other things, Plant Y2kOne can check out software in robots, PCs, and PLCs; operating systems such as Unix, DOS, and Windows NT; and embedded software such as a program used to guide automated vehicles.
Leap-year snafus damaged production lines when programmers failed to account for the extra day in February 1996. At a small U.S. manufacturer of industrial solutions that prefers to remain unnamed, production ground to a halt on Jan. 1, 1997. Before workers could remedy the situation, the liquids hardened in the pipelines, which had to be replaced at a cost of $1 million. That caused late deliveries and the loss of three customers. A similar leap-year oversight caused $1 million of damage at Comalco's aluminum refinery in Tasmania, when controls at all smelting-pot lines shut down, damaging five pot cells beyond repair.
Year 2000 Problem Sightings ( http://info.cv.nrao.edu/y2k/sighting.htm ) Excellent source for general Y2K failures
report Anesthesia machines non-compliant - supplier tries to sell new systems
report Congressional Subcommittee survey
Phillips Petroleum Y2K test - an oil rig hydrogen sulfide detector system stopped working.
Chrysler plant lock out
NORAD Y2K - total system blackout
Cara Corporation Embedded Systems Specialist David C. Hall stated that there are over 40 billion microprocessors worldwide, and anywhere from one to ten percent may be impacted by the date change. Hall described an oil company that has determined the need to replace thousands of chips controlling an oil dispensation system. The chips, he said, do not fit on the existing motherboards and new motherboards do not fit into existing valves. As a result, the valves themselves will have to be replaced, Hall said.
report Users Demand Y2K Lemon Aid, Control Magazine
Y2K failure rate in semiconductor plants - 3.3 billion micro-controllers embedded in the automation infrastructure, 50 million will have Y2K anomalies. As a reference point, Woll reviewed the Dept. of Defense Year 2000 project inventory report. He said of 3,962 applicable systems, 582 were OK, 623 were being renovated, 628 were retired, and the balance of 1,900 was being assessed. The numbers suggested that about 25% of all the systems would require some level of fixing.
Patrick Meehan, Y2K program manager, DuPont Operations, presented the large-user perspective. "Let's face it, there's not much upside and a lot of downside," he offered. He sees that 50% of DuPont's work will be with process control devices and systems and his current estimate is that, while 100% will be examined, 10-15% will need remediation. "Towards the end of 1998, those who haven't yet worried about Y2K will find themselves forced to. If they don't, Y2K becomes the best thing that happened to lawyers since divorce."
http://www.xs4all.nl/~zooko/Y2k-real-life.html
full story General Motors tested robotic devices - they "just froze and stopped working"
full story control valve for generator cooling integrated over time for smoothing
full story Chrysler plant test locks the doors on testers
"We're pretty sure our first tier will work," Chrysler President Thomas Stallkamp said of his company's largest suppliers. "It's the second and third and fourth tier who supply not just our industry but others. As you get further down the food chain, you've got a guy making widgets for us as well as for Boeing and Maytag, and those guys are the ones we're worried about."
"We got lots of surprises," said Chrysler Chairman Robert Eaton. "Nobody could get out of the plant. The security system absolutely shut down and wouldn't let anybody in or out. And you obviously couldn't have paid people, because the time-clock systems didn't work."
http://www.euy2k.com/reallife.htm
a power plant in the United Kingdom - control valve for generator cooling is integrated over time for smoothing
ITRON meter reader decks and associated upload/download equipment fail on 2000
NRC-NEI Meeting (If a plant can be shut down because flooding prevents proper emergency response, then Y2K failures of emergency procedures could require shutdowns)
details Hawaiian Electric Company
Western Power - Many of the control systems represented in power systems have dates associated with them. These could be reclosers, voltage regulators, governors, PLCs etc. The list is endless. You then have a swathe of actual 'applications' involved in the delivery of electricity such as your Distributed Control Systems and your SCADA (System Control and DATA (e.g. dates) Acquisition) systems, all of which have dates associated with them. Much of what happens throughout the process of generating and delivering electricity is 'DATE AND TIME STAMPED'
http://www.sysmod.com/embexamp.htm
North Sea Expro (Shell-Exxon JV) Platform, Pipeline and Gas Plants - 12% failure rate
Alcoa Steel Plants : 50% of control systems will fail
BP Refinery - vendor not found for 20, 3 will fail, 2 will cause shutdown
Capelrig Millennium Test Centre for Shell demonstrates how failing system controlling an oil rig pump would float the platform
oil rig typically has 8000-10000 embedded systems
details Hawaiian Electric Company energy management system (EMS) failure would have resulted in HECo's transmission network crashing, and by default, a major power outage and loss of all generating capacity
Programmable thermostats fail, one cannot be restarted.
Chip failure would cut off cooling system and cause explosion in chemical plant
Fossil power plant control and downstream PLC clock mismatch would trip plant
Gas pipeline metering failure
PLC's locking up due to Year field overflow
Sewage controls fail to track tide tables properly
http://www.year2000.com/archive/similar.html (Computer problems similar to Y2K)
telephone outage that occurred in New York on September 17, 1991
Gulf War Patriot missile system had an unrecognized clock drift over a 100-hour period - tracking error of 678 meters
the software for the F-16 fighter would cause the plane to fly upside down whenever it crossed the equator
Berlin 1993, two trains collided - the track was set on the holiday two-way traffic setting
Cement factory chip failure drops rocks on cars
99 year old man's blood count judged by infant norms
In Colorado Springs one child was killed, another injured - the traffic light systems continued in weekend mode and ignored the school schedule - failure getting the time transmitted to them from the atomic clock in Boulder
Several leap year problems noted including aluminum smelter
http://www.granite.ab.ca/year2000/incidents.htm
The Tiwai Pt, New Zealand aluminium smelter
PCMH Biomedical Department - Hamilton ventilator failure
UK National Health Service problems
Credit card failure
"a major, catastrophic problem" in ICBM launch controls
Bank merges due to Year 2000 problem
Robot has the wrong date
Therac-25 X-ray system kills six patients (Details on non-y2k "software" problem)
More Visa card problems
----------------------------------------------------------------------
Embedded SystemS Problem (ESSP) Ltd
Embedded systems are used extensively to control and monitor engineering and manufacturing processes. They underpin the whole of the world's manufacturing and engineering base: energy (oil, coal, gas, nuclear), planes, ships, pharmaceutical industries, food, drink and clean water, car manufacturing, national and international defence, railway networks, telecommunications, medical equipment, broadcast media; washing machines, microwave ovens, video recorders, alarms/intruder detection systems and central heating controllers. They control temperature, lighting, air conditioning and security access in many offices. And they also support point of sale equipment, cash dispensers and traffic management in a typical High Street. During 1995, more than 200 million PCs were shipped worldwide. In the same period, the number of embedded systems shipped exceeded 3 billion. According to research conducted over the past year, around 5% of simple embedded systems were found to fail Millennium Bug tests. For more sophisticated embedded systems, failure rates of between 50% and 80% have been reported (Action 2000 UK Government Taskforce). In our own experience however we have found it closer to 15%-20% in processor intensive

-- G Bailey (glbailey1@excite.com), December 04, 1999.

I want to add that the reason it may have never happened before is that some of the equipment has never had its clock function tested with the year as "00". You can kind of feel the problem when you test an old PC with a bad BIOS. I understand that some of the devices have clocks with batteries which keep the clock on even if the system has been turned off (as is the case with PCs). This means that it can count up but won't stop unless the battery is located and removed.
It is my understanding that some devices use a clock function only for the difference in seconds (or perhaps even milliseconds). The year can still affect its function even though the year is not used. This is because the year is still a part of the calculation. You might wonder why this is done. I understand this is done because it is cheaper to buy a single clock function and use it for all time issues that the manufacturer may need it for.
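To make that concrete, here is a small sketch of an interval calculation that only wants "how many days passed" but still runs the two-digit year through the arithmetic. The helper names `day_number` and `elapsed_days`, and the simplification that every year has 365 days, are assumptions made for illustration.

```python
def day_number(yy: int, day_of_year: int) -> int:
    """Naive day count from a two-digit year.

    Assumption for brevity: every year has 365 days and yy means 19yy.
    """
    return yy * 365 + day_of_year


def elapsed_days(start: tuple, end: tuple) -> int:
    """Difference between two (yy, day_of_year) readings."""
    return day_number(*end) - day_number(*start)


if __name__ == "__main__":
    # Both readings in 1999: the year cancels out and the answer is sane.
    print(elapsed_days((99, 360), (99, 364)))
    # Readings taken across the rollover: yy goes from 99 to 00.
    print(elapsed_days((99, 364), (0, 2)))
```

Within a single year the result is the expected small positive number. Across the 99-to-00 rollover the same code returns a large negative interval, even though the device never displays or stores a date.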
One other thing. I have heard that time may be used as a part of a "randomizing" function. A computer doesn't really understand the concept of "random." (Actually, many scientists believe the universe doesn't understand it either.) The possibility exists that if the year is used as a part of a "randomizing" function, it could attempt to divide by zero. This could very well "crash" the device since a division by zero is "undefined."
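A hedged sketch of how that could look: the `jitter` function below is entirely hypothetical (including the constant 7919) and simply folds the two-digit year into a modulus. Whether any shipped device does this is hearsay; the point is only what happens when the year reads 00.

```python
def jitter(seed: int, yy: int) -> int:
    """Hypothetical pseudo-random helper that uses the two-digit year
    as a modulus. Not taken from any real device."""
    return (seed * 7919) % yy


def try_jitter(seed: int, yy: int):
    """Return the jitter value, or None if the arithmetic blows up."""
    try:
        return jitter(seed, yy)
    except ZeroDivisionError:
        return None


if __name__ == "__main__":
    print(try_jitter(12345, 99))   # year 99: an ordinary value
    print(try_jitter(12345, 0))    # year 00: None, the modulus is zero
```

Python raises an exception here. What a microcontroller does on a divide by zero depends on the processor: some trap, some return a meaningless result and carry on.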

-- Reporter (reporter_atlarge@hotmail.com), December 04, 1999.

Simpleton --
Okay, fair question.
Answer: (short version); Embedded chips fail all of the time. (You will kindly note that I have just parroted the 'Polly mantra'. And it is true. They do fail all the time.)
[Long version, for the non-attention-deficit impaired, thus excluding all Pollies]; Embedded chips fail all the time. These failures are usually due to things like power spikes frying chips, under or over voltage conditions, operators spilling coffee into them (it happens), various and sundry other *physical* problems, and yes, those which are in 'difficult' environments occasionally fail due to environmental factors.
I don't believe anyone has ever disputed this fact. However, and here is the tough part, (most of the pollies read the first paragraph and never get to this part), these failures don't happen all at the same time (I am counting the first week or two as 'the same time' because, on the scale of the *normal* number of failures, to all intents and purposes it *is* the same time.)
Usually, failures occur in a statistically well distributed fashion. There are terms such as 'Mean-Time-Between-Failures' (MTBF), etc, which describe these distributions. And again, note that these are caused by *physical* phenomena, the coffee, the power, the environment in which they operate.
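A back-of-envelope comparison shows the difference between spread-out physical failures and a shared software fault. Everything numeric below is an assumption chosen for illustration (fleet size, MTBF, affected fraction), and the function names are made up for this sketch.

```python
HOURS_PER_WEEK = 168


def expected_random_failures(units: int, mtbf_hours: float,
                             window_hours: float = HOURS_PER_WEEK) -> float:
    """Expected failures in the window for independent units with a
    constant failure rate (valid when the window is much shorter
    than the MTBF)."""
    return units * window_hours / mtbf_hours


def common_mode_failures(units: int, affected_fraction: float) -> float:
    """Failures when one shared fault hits a fraction of the fleet
    inside the same window."""
    return units * affected_fraction


if __name__ == "__main__":
    fleet = 10_000                      # assumed fleet size
    print(expected_random_failures(fleet, mtbf_hours=500_000))
    print(common_mode_failures(fleet, affected_fraction=0.02))
```

With these assumed inputs, ordinary wear-and-tear gives a handful of failures per week across the fleet, while a date fault in just 2% of the units gives a couple of hundred in the same week.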
What we are talking about in this case, is a failure of the *software*. (Funny how this goes. On a link a week or so ago, one of the pollies was arguing with the typical hand-waving about this and made the statement 'Well, of course it is *possible* to put non-compliant code into a compliant piece of hardware.' And I refrained, admittedly with great difficulty, from pointing out that this was *EXACTLY* what the problem was.)
Software failures in embedded chips, at least the custom type I am most familiar with, where the firmware was specifically designed for a particular application, such as running an HVAC, or controlling a physical process, such as a helium liquefaction plant, are rare to the point of being *non-existent*. Due to the nature of the business, these things get *rigorously* tested before they ever order a production run of the microcontrollers or microprocessors. (This has caused me trouble a number of times, coming off of a contract where I worked on this type of application, then went to a 'pure' software outfit, and made plans for a number of tests, and was shot down in flames, because it was 'too expensive', 'overkill', 'a budget buster', etc.)
One of the reasons why these types of chips get so thoroughly tested is that they are *very* difficult to 'fix'. The 'fix' is to throw away the chip with the bad software, fix the software, and order a *whole new batch of chips* from the manufacturer with the corrected software as the binary image burned into the on-board ROM. (Please note that this is somewhat of an oversimplification. There are a *bunch* of different types of these things, Microcontrollers, Microprocessors, PLC's, PLA's, ASIC's, and probably another bunch I've never heard of. The ones *I* am familiar with were Microcontrollers, and Microprocessors. I posted a discussion about these types, and a little about all of the ramifications, under a thread called 'Clarification on Embedded Chips'. For more detail, and a 'less' simplified version of this, please see that thread. I believe it is still under "New Answers", about 3/4 of the way down.)
The problem we are facing isn't just that the chips will fail (in some cases), but that the 'Mean-Time-To-Repair' or MTTR, is *VERY* long, and in a number of cases, where the product is more than, say, five years old, it will be easier to just design a new one, for a number of reasons, including, but not limited to, lack of source code, lack of documentation, lack of compilers, linkers, loaders, and development platforms, and lack of test hardware.
So, to summarize:
1. Chips fail.
2. They normally do not do so in large numbers all at once.
3. When they currently fail, it is normally due to a physical hardware casualty.
4. The 'repair' process in this type of failure is 'replace the chip with a new one of the same version.'
5. What is fixing to happen in Jan. is failures of the internal *code*.
6. The 'repair' process here is to create brand new chips.
7. The difference in the time required to repair between these two scenarios is *SEVERAL ORDERS OF MAGNITUDE*. That is, it takes a couple of *minutes* to change out a chip. It takes anywhere from (lowest number *I've* ever seen) 18 *weeks* to (longest one *I've* ever seen) 18 *MONTHS* to get new chips made.
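The "several orders of magnitude" in point 7 can be checked with quick arithmetic. The sketch below assumes "a couple of minutes" means 2 minutes and uses a 30-day month; both are approximations introduced here, not figures from the post.

```python
import math

MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200, assuming a 30-day month


def orders_of_magnitude(short_minutes: float, long_minutes: float) -> float:
    """Base-10 log of the ratio between two repair times."""
    return math.log10(long_minutes / short_minutes)


if __name__ == "__main__":
    chip_swap = 2                               # "a couple of minutes"
    best_case = 18 * MINUTES_PER_WEEK           # 18 weeks
    worst_case = 18 * MINUTES_PER_MONTH         # 18 months
    print(f"{orders_of_magnitude(chip_swap, best_case):.1f}")
    print(f"{orders_of_magnitude(chip_swap, worst_case):.1f}")
```

Under those assumptions the gap works out to roughly five orders of magnitude at the low end and a bit more than five and a half at the high end.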
Does this answer the question? (Think about that last number. Try to envision New York City waiting 'patiently' for 18 months for a chip to get their water system back in operation.)

-- just another (another@engineer.com), December 04, 1999.

Simpleton,
It's simple...Embedded systems do not exhibit the Jo Anne Effect. If they did, they would have failed by now. Instead, they fail during testing.
Incidentally, every embedded scenario in that much maligned "Y2K The Movie" is based on an ACTUAL failure found during testing, mostly done in England.
Once again, since Embeddeds do not exhibit the Jo Anne Effect, come January the Infrastructure will fail!!!

-- K. Stevens (kstevens@ It's ALL going away in January.com), December 05, 1999.

I have been in the HVAC field for about 20 years. When I started, most of the building control was achieved with mechanical and pneumatic analog control. In the last 10 years this has been mostly replaced by DDC devices. Control consists of an end device (on a fan, let's say) sending airflow, temp, and similar information up to a cabinet that holds a processor. These end devices may hold a "canned" routine, but to change set points, etc., communication with the cabinet is required. In some cases a "lap top" can be connected to the end device to set it. This "cabinet" has the actual programming. If there are multiple "cabinets" they are networked to a front end PC so it's all seamless. The point of this is that I have never seen anything repaired at the "chip" level. If a device fails, then the entire device needs replacing. How much replacement hardware is available at any one moment in time? In addition, because of cost, most of the DDC systems do not have mechanical instruments (i.e. thermometers, pressure gauges, etc.) to allow manually looking "into" the system. These systems definitely process dates. The vendor that my organization deals with has remediated and upgraded our systems for Y2K compliance, but we're not going to really know until the clock strikes 12.
I hope, for my sake, and for those that I serve that Y2K is a BITR.

-- simon (simon5@mail.com), December 05, 1999.

Very good reading, but I have a question concerning testing and compliance statements for embedded processes. Let's say I have 20 chips of the same make. Do I need to test ALL the chips or just one to make them all compliant??

-- y2k dave (xsdaa111@hotmail.com), December 05, 1999.

y2k dave:
Testing doesn't make anything compliant, it just determines what problems, if any, are present.
If one of those 20 is OK, your time and effort are best spent looking for problems with some other system. If that one fails, you're looking at fixing all 20.
Software problems do have the issue Just Another raises -- that if one fails, they'll all fail (if they're identical). However, if one works they'll all work.
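The "test one, and it speaks for the identical ones" approach can be sketched as grouping an inventory and picking one unit per group. This is only an illustration: the function `pick_representatives`, the inventory layout, and the rule that "identical" means same model and same firmware revision are all assumptions, and the device names are invented.

```python
from collections import defaultdict


def pick_representatives(inventory):
    """Group devices by (model, firmware) and return one device ID
    to test per group. Assumes units in a group really are identical."""
    groups = defaultdict(list)
    for device_id, model, firmware in inventory:
        groups[(model, firmware)].append(device_id)
    return {key: ids[0] for key, ids in groups.items()}


if __name__ == "__main__":
    # Hypothetical inventory: 20 matching chips plus one odd revision.
    inventory = [(f"chip-{i}", "ACME-100", "v1.0") for i in range(20)]
    inventory.append(("chip-20", "ACME-100", "v1.1"))
    for key, device_id in pick_representatives(inventory).items():
        print(key, "-> test", device_id)
```

The catch is in the grouping key. Twenty chips of the same make collapse to one test only if the firmware is also the same, so the odd revision in the example gets its own test.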

-- Flint (flintc@mindspring.com), December 05, 1999.
