Utility crosspost

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

This is an excellent analysis of PC based embedded systems by Andrew Edgar. I felt it deserved a wider audience. At the end where he talks about "fun" programming, I know exactly what he means. I commonly wrote my own "special" subroutines rather than rely on someone else.
As an aside, the Springfield utility has to replace their SCADA systems despite saying they only found "minor" problems.
=================================================================
I can give you this much factual information in a public forum: every single system that I have been personally involved with, or about which I have received first-hand news from a close colleague, that has involved IBM PC compatible embedded systems or heads has _REQUIRED_ remediation in order to function correctly. These systems have included multiple PC based business servers, two telecommunications product lines, one telecommunications PC client program, and multiple nuclear power plant monitoring systems. And that's just in the last few months. This by no means represents a meaningful statistical sample, but I don't need a PhD in Applied Math to know that five out of five is alarming.
The following are hypothetical examples, but I guarantee you that a variation of each of these is highly likely to (read: _will_) occur.
Example #1: A PC compatible embedded systems board is being used inside a plant's Programmable Logic Controller (PLC). If the PLC goes down, the particular plant process that it is part of goes down too. This PLC gives no visible clue that it uses dates in any way. It is almost completely stand-alone. In fact, it is so simple and so obviously doesn't use dates or times that it is not included in the Y2K inventory and assessment. The plant technician conducting the inventory scratches his head and wonders why on earth such an expensive piece of equipment is being used for such a mundane task. All it does is measure the temperature of a step-up transformer and raise an alarm if the temperature goes out of range. Unfortunately, the technician isn't aware that this PLC is PC based.
After midnight 12/31/1999 the software clock happily rolls over to 00:00:01 1/01/2000 and the hardware clock happily rolls over to 00:00:01 1/01/1900. The PLC keeps humming along fine. At 00:15 the power goes out due to a deliberate temporary power outage in order to balance disturbances elsewhere in the power grid. A few minutes later at 00:21 the power comes back on.
The PLC starts its boot procedure. During the Power On Self Test (POST) of the Initial Program Load (IPL), the BIOS (Basic Input/Output System -- in firmware) reads the hardware clock and sees that the year is set to "00". The BIOS code has been programmed to "know" that this is not a valid date and drops the PLC into the BIOS setup screen. Of course there's no display monitor or serial terminal attached, so there's no obvious visual clue as to what's wrong. The PLC is simply not booting up. No amount of re-booting will change the behavior. A PLC (PLC2) further down the line (that has come up successfully) times out on our rogue PLC (PLC1). PLC2 had been programmed to raise an alarm if it had not received a "temperature OK" update from PLC1 within 5 minutes of a power-up reboot (3 minutes for the systems to reboot, 1 minute to initialize and 1 minute to give the first status). At 00:26 the main SCADA computer shuts the plant down in fail-safe mode because it has been unable to get a critical temperature data point (our PLC).
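The chain of events in Example #1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed "19" century, and the 1980 cutoff used by the POST check are all hypothetical, chosen to mirror the behavior described above rather than taken from any particular BIOS.

```python
from datetime import datetime, timedelta


def rtc_rollover(year2: int) -> int:
    """Advance a two-digit RTC year by one; 99 wraps to 00 with no century carry."""
    return (year2 + 1) % 100


def bios_year(year2: int, century: int = 19) -> int:
    """Assumed BIOS behavior: the century is fixed, so year 00 reads as 1900."""
    return century * 100 + year2


def post_accepts_date(year: int, earliest_valid: int = 1980) -> bool:
    """Assumed POST check: any year before the cutoff is treated as invalid."""
    return year >= earliest_valid


def boot(year2: int) -> str:
    """Return what the board does at power-up for a given two-digit RTC year."""
    if post_accepts_date(bios_year(year2)):
        return "boot OS"
    return "halt in BIOS setup"


# Timeline from the example: power returns at 00:21, and PLC2 allows
# 3 minutes to reboot + 1 to initialize + 1 for the first status report.
power_restored = datetime(2000, 1, 1, 0, 21)
alarm_deadline = power_restored + timedelta(minutes=3 + 1 + 1)
```

With these assumptions, `boot(99)` starts the operating system, `boot(rtc_rollover(99))` stops in the setup screen on every reboot, and `alarm_deadline` lands on 00:26, the moment the SCADA computer gives up on PLC1.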
Example #2: We start with the above scenario but make our PLC a little more sophisticated. It is a renovated model that has been declared Y2K compliant by our hypothetical vendor "Surelywell Controls". This enhanced version of our previous PLC integrates the temperature over a period of one minute and then sends the result to the downstream PLC with a timestamp. This system had been inventoried and assessed, and the PLC had been upgraded to a Y2K compliant model from Surelywell Controls. The system had even been _tested_ and _passed_. But it is still going to fail. Why? Two reasons: the local time vs UTC problem (see below) and the "sleeping" code problem.
In this system the PLC operating system is running on local time with the hardware clock using UTC (or GMT) time. So at 00:00:00 01/01/2000 UTC it is actually only 19:00:00 12/31/1999 local time in New York. When this system passed the Y2K test, the tester used the BIOS setup to set the date and time. This sets the hardware clock, not the OS software clock. So when the Y2K test was run, the internal software time and date was actually 19:00 12/31/1999, NOT 00:00 01/01/2000. This sets up the first part of the failure: insufficient or poorly understood testing.
The second part of the failure is caused by "sleeping" code. This PLC is only transmitting the temperature and a timestamp, not the year. Furthermore, it's only integrating the temperature over 1 minute, so how can the year be a factor? Because the code originally transmitted the year and the time, but a code change early in development or an Engineering Change Order (ECO) after deployment required that the year be removed. Fine, the programmer simply changes the piece of the code that used to transmit the date and time to only transmit the time. All the code that did "whatever" manipulation with the date is still there and the code path is still executed; it's just that the final result is not transmitted.
When the hardware clock rolls over to 05:00 01/01/2000 the software clock will roll over to 00:00 01/01/2000 EST and the Y2K bug in the "sleeping" code is hit and the PLC faults causing it to fail and hence causing the system to fail. In a follow-up investigation we find that in the fine print Surelywell Controls only claimed the PLC hardware, PLC firmware and PC BIOS to be Y2K compliant. They specifically spelled out that they could not warrant or be held liable for defective PLC application code. They also gave a polite warning that there was no substitute for thorough end-to-end systems testing after the new Y2K compliant PLC had been deployed.
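A small Python sketch shows how both halves of Example #2 combine. Everything named here is hypothetical: `format_year_field` stands in for whatever leftover date handling the "sleeping" code performs, and the fixed five-hour offset assumes New York in winter.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no daylight saving


def software_clock(hardware_utc: datetime) -> datetime:
    """Local time the OS derives from a hardware clock kept in UTC."""
    return hardware_utc.astimezone(EST)


def format_year_field(year: int) -> str:
    """Hypothetical legacy helper that assumes years 1900..1999."""
    offset = year - 1900
    if not 0 <= offset <= 99:
        raise ValueError("year does not fit a two-digit field")
    return f"{offset:02d}"


def build_message(now_local: datetime, temperature: float) -> str:
    """Build the timestamped reading; the year is computed but never sent."""
    format_year_field(now_local.year)  # "sleeping" code: result is discarded
    return f"{now_local:%H:%M:%S} {temperature:.1f}"


# The tester sets the hardware clock to midnight through BIOS setup,
# so the software clock is still at 19:00 on 12/31/1999.
local_during_test = software_clock(datetime(2000, 1, 1, 0, 0, tzinfo=UTC))

# The real local rollover happens when the hardware clock reads 05:00 UTC.
hardware_at_local_midnight = datetime(2000, 1, 1, 0, 0, tzinfo=EST).astimezone(UTC)
```

Calling `build_message(local_during_test, 41.5)` succeeds, which is why the test passed. Calling it with `software_clock(hardware_at_local_midnight)` raises `ValueError` in code whose output is never transmitted, which is the fault that takes the PLC down five hours later.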
Variations on a theme: All of the above scenarios are complicated by several subtle factors. If the operating system and/or application is using local time in software with the hardware clock programmed to UTC (GMT), then the "Y2K" problem (in the western hemisphere) will begin to manifest itself some hours _before_ midnight (19:00 12/31 on the east coast of the States and 16:00 on the west coast). If the Y2K test is performed with the hardware clock set to 00:00 01/01/2000, then the local time in software will be something like 19:00 12/31/1999 -- not a valid test by itself. The reverse is also true. If the tester sets the local time to 00:00 01/01/2000 for the test, then the internal hardware clock will be set to 05:00 01/01/2000 and the previous condition hasn't been tested. For thorough testing you have to test both scenarios.
Also, there is absolutely no accounting for an application programmer's creativity. I have witnessed with my own eyes code where the programmer had bypassed all operating system APIs and BIOS calls and written his own "hardwired" library routine that read the hardware clock directly. The routine of course did not correct for the Y2K hardware bug in the RTC. Why did this programmer write this routine instead of using the documented OS or BIOS methods? Who knows? But I can give you two real reasons that happen every day.
1> He couldn't find the OS or BIOS manual that documented the proper call to use, but he did happen to have a copy of the IBM PC/AT tech reference on his shelf which explains -- in gory detail -- how to talk to the RTC hardware. So it was easier for him to write his own routine (doesn't have to leave his desk) instead of perhaps spending hours trying to locate the proper documentation.
2> Because it's FUN! Programmers get a kick out of developing their own code. In fact, given a choice, many embedded programmers (especially junior ones) would rather code everything themselves than use "someone else's stuff".
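The "hardwired" routine described above might look roughly like the following Python sketch. The CMOS is modeled as a plain dictionary rather than real port I/O, and the function names are hypothetical. The year register at index 0x09 holding two BCD digits follows the PC/AT RTC layout. The windowed variant is one common remediation technique shown for contrast, not something the post itself prescribes, and its pivot of 80 is an assumption.

```python
YEAR_REG = 0x09  # RTC year register on the PC/AT: two BCD digits, no century


def bcd_to_int(value: int) -> int:
    """Decode one packed BCD byte, e.g. 0x99 -> 99."""
    return (value >> 4) * 10 + (value & 0x0F)


def naive_read_year(cmos: dict) -> int:
    """The 'hardwired' routine: read the register and assume the 1900s."""
    return 1900 + bcd_to_int(cmos[YEAR_REG])


def windowed_read_year(cmos: dict, pivot: int = 80) -> int:
    """Assumed fix: two-digit years below the pivot are taken as 20xx."""
    yy = bcd_to_int(cmos[YEAR_REG])
    return (1900 if yy >= pivot else 2000) + yy


before_rollover = {YEAR_REG: 0x99}  # simulated CMOS contents on 12/31/1999
after_rollover = {YEAR_REG: 0x00}   # simulated CMOS contents on 01/01/2000
```

Both routines agree on `before_rollover`. On `after_rollover` the naive routine reports 1900 while the windowed one reports 2000, and no OS or BIOS patch can help the naive version because it never calls them.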
Although the above examples are hypothetical, they are based on real world systems that I am personally aware of. I know for a fact that there are systems deployed that many people depend on that have problems similar to these, and they will NOT be fixed. It's not that there isn't time to fix them, or that they can't be fixed; it's simply that there was an executive decision not to fix them. The systems are obsolete; they will not be repaired. If you want to be compliant you have to upgrade.
Conclusions:
1> Lots of stuff is going to break and there's nothing we can do about it.
2> Any scenario you can imagine probably can and will happen.
You may ask, "On what authority do I speak?" I am a software systems engineer with over 15 years' experience. For 10 of those years I have been a distributed and embedded systems specialist. What does that mean? It means I am not a specialist in desktop applications. It means I am a specialist in complex multi-processing, multi-tasking, realtime, distributed network systems. I know what I am talking about.
For more info see my comments in these threads:
http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000Da8
http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000BzM
Regards,
A. J. Edgar
Manager, Systems Software
Centigram Communications Corp.
Disclaimer: In this forum I speak only for myself based on my own personal experience. I in no way, shape or form speak for my employer.
Answered by Andrew J. Edgar (ajedgar@centigram.com) on November 20, 1998.

-- R. D..Herring (drherr@erols.com), November 22, 1998

Answers

R.D.,
My thanks to Mr. Edgar and you for this information and these examples.
Because I lack experience with embedded systems(*), I have felt handicapped in warning about them even though I've known that the same basic Y2K principles apply to them as apply to mainframes. Now I'm glad to have these excellent examples.
(* - Saaaayyy ... would a bank ATM count as having an embedded system? I know some IBM models had recycled IBM 360 CPUs running them, and other ATMs had x86s.)

-- No Spam Please (anon@ymous.com), November 22, 1998.

While I do not argue with the technical points, that post makes a lot of assumptions about things being ignored.
Rick

-- Rick Tansun (ricktansun@hotmail.com), November 22, 1998.

Rick, you're right, there are a lot of technical assumptions in the post. But I've had to track down even weirder program bugs - these actually make sense, in a certain convoluted way. The ones that are almost impossible to debug are the ones that don't make sense, that aren't easily repeatable, or that are "wild" goofs and end-condition dependent, rather than mid-point iterations.
The first and last printout is bad, but only if the last record from the previous cycle did not get updated in the previous edit session, and if the user did not reload the base drawing on start-up the next session (but it will be okay if the machine rebooted) - then the program hangs the computer because it can't refresh the screen.....that kind of program will drive you nuts.

-- Robert A. Cook, P.E. (Kennesaw, GA) (cook.r@csaatl.com), November 22, 1998.

"Rick, you're right, there are a lot of technical assumptions in the post. But I've had to track down even weirder program bugs - these actually make sense, in a certain convoluted way."
And that is partly what I mean by assumptions. He is assuming that because these items haven't been looked at yet, they won't be. I do see the merit in this type of hypothetical thinking, but I am also wary of it when I think back to the recent Gary North fiasco with the Australian report of computer terrorism that was hypothetical. Even though it was stated as such, it was taken to be real. So it is probably just a lot of my feelings on past thoughts like this kicking in.
Rick

-- Rick Tansun (ricktansun@hotmail.com), November 22, 1998.

A couple of points
1: There are very few embedded controllers that use x86 type system boards. They cost too much compared to other devices. I have some trouble imagining them in common use where virtually none of the functionality of the device is being used - any engineer who threw away money like that on anything but a prototype would get in trouble quickly.
2: Have been running a Y2K test program for quite a while that resets the clock to a couple of seconds before midnight 12/31/99 and reboots the machine. Haven't had one go to the BIOS setup screen yet. BTW, this system does a cold boot, not warm boot.

-- Paul Davis (davisp1953@yahoo.com), November 23, 1998.

Paul, Why do you think there are "very few" x86 embedded systems? I know that Sub-Zero uses a 486 for its units sold to single family homes. I also know of two skyscrapers in Philadelphia whose heating and air conditioning are run by various combos of x86's. These things are and have been quite cheap for the last 4 years or so. In any event, the cost of the embedded programming greatly exceeds raw board costs. Eight hours of your time (programming) will buy a Pentium 166 board today!!

-- R. D..Herring (drherr@erols.com), November 23, 1998.

From example #1:
After midnight 12/31/1999 the software clock happily rolls over to 00:00:01 1/01/2000 and the hardware clock happily rolls over to 00:00:01 1/01/1900.

From example #2:
When the hardware clock rolls over to 05:00 01/01/2000 the software clock will roll over to 00:00 01/01/2000 EST and the Y2K bug in the "sleeping" code is hit and the PLC faults causing it to fail and hence causing the system to fail.

Reading thru this I noticed that the hardware clock rolls over to 1900 in the first example, but to 2000 in the second example. Is this a typo? (2000 presumably would not trigger the Y2K bug??)

-- Tom Carey (tomcarey@mindspring.com), November 24, 1998.
