TD - Looks like the jury's still out

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

I received the following story via email. I haven't registered for the NY Times online version so I've not read it there.
http://www.nytimes.com/library/tech/98/11/biztech/articles/09bug.html
(begin excerpt)
November 9, 1998
Dispute on a Wrinkle in the Year 2000 Problem
By BARNABY J. FEDER
Did a history teacher from Michigan and an obscure Canadian programmer discover a twist to computing's Year 2000 problem that the experts had missed?
That question has ignited a simmering technical debate in which some experts -- including Year 2000 specialists at Compaq Computer's Year 2000 Expertise Center in Albany, N.Y. -- see a real, if imperfectly understood, flaw that could cause many computers to malfunction after seeming to sail smoothly into the year 2000. Many experts, however, see it as a case of unfounded scare mongering.
The controversy turns on the intricacies of how computers keep time. It underscores how the approach of 2000 is stirring doubts among computer users and corporate America about who the true experts are and how much they can be trusted.
The Year 2000 problem stems from a seemingly simple glitch: Computers have long used just two digits to refer to the year, such as 98 for 1998, and often have trouble recognizing that 00 is 2000 rather than 1900. But, as this dispute shows, that simple problem spawns questions that can defy quick, clear-cut answers.
The purported timekeeping flaw was first reported in August 1997 by Jace Crouch, a 46-year-old professor who teaches courses on Western civilization at Oakland University in Michigan. It was subsequently described at length on the Internet (www.intranet.ca/~mike.echlin/bestif) by Michael Echlin, a 35-year-old programmer at Atomic Energy Canada Ltd.
Not only is the timing of the flaw's onset unpredictable, they say, but so are the results, which include malfunctions that can wipe out data, prevent computers from starting up, and lead software programs to make faulty calculations. The problem is said to affect primarily older computers, but some more recent IBM clones are also said to be susceptible. One report of the flaw involved an Apple Computer Macintosh, and the flaw may lie hidden in a wide range of electronic machinery.
But is the flaw real? Or is it, as some critics say, a reckless claim by the two men, who have set up a small company that markets software to identify and neutralize the problem?
"They are selling a fix for something they can't even explain," said Thomas Becker, chief executive of Rightime Co., a Miami-based software company that specializes in products that regulate timekeeping on personal computers. "They are trying to capitalize on fear."
For more than a year, the conflict over what has been variously called the Crouch-Echlin Effect, time dilation, or simply, T.D., has been waged largely on a handful of Internet sites favored by computer buffs. Even critics like Becker say that Crouch and Echlin have undoubtedly encountered something strange that urgently needs further research and explaining.
It all began with Crouch's decision more than a year ago to set his office computer forward to Dec. 31, 1999, to test how it would handle the transition to the year 2000. The rollover happened without a hitch, even though the machine ran on a clone of Intel's aged 286 microprocessor chip -- a relic from the mid-1980s.
Since Crouch was using the computer for word processing in which the date it logged made no difference, he decided not to change the date back. But to his consternation, during the next two weeks the computer's clock jumped ahead to December 2000. Other odd malfunctions cropped up.
Crouch's report of these anomalies on the Year 2000 forum at the comp.software newsgroup on the Internet intrigued Echlin. He set up experiments on several computers, then posted results suggesting Crouch's problems were not isolated.
But it certainly was baffling. While some testers reported computers jumping ahead for minutes or months, others said they experienced leaps backward, while on some machines the clock appeared to simply slow down. Some afflicted computers were unable to locate the pathway to outside phone lines or even their own hard disk, making it impossible to fire up programs.
"The jury is still out on exactly what is happening," said Douglas de Lacey, who oversees computer systems at Cambridge University's School of Arts and Humanities in Britain and has reported encountering the Crouch-Echlin Effect on two aging Toshiba laptops.
Pressure for a verdict is building, though, especially since a recent, widely distributed e-mail announcement from Compaq's Year 2000 office in Albany that said the company would be reselling the software fix created by Crouch and Echlin.
Becker said that he was being peppered by anxious calls from major clients like General Motors and Exxon asking what they should do. If the Crouch-Echlin Effect is real, computer users may have to spend billions of dollars testing and possibly replacing equipment that seemed ready for the next century.
Theories of what is causing the Crouch-Echlin Effect have come and gone. Most Crouch-Echlin believers now suspect that the problem stems from a glitch in the process through which computers can -- each time they are turned on -- update the time from a battery-powered chip called the real time clock, or RTC. The real time clock keeps track of time even during periods when the computer's external power is switched off.
The clock is used as the source of the time and date by a computer's operating system and is also directly accessed by some software programs, unless the computer is part of a network in which a central server keeps time for a coordinated group of devices.
So far, the Crouch-Echlin Effect has only been observed in computers with "nonbuffered" real time clocks, a design not used in today's name-brand computers but common in older devices.
Nonbuffered real time clocks cannot be read for an instant as their own second counter clicks over to the next second. Because computer designers know of this limitation, all computers are programmed to avoid checking nonbuffered real time clocks during that update.
Here's how: A chip known as the Basic Input-Output System, or BIOS, looks to the real time clock for the time and date when the computer user hit the "on" switch.
In the case of a nonbuffered real time clock, the BIOS chip will see the electronic equivalent of a red flag for 244 microseconds before the update is to occur. Seeing this flag, the BIOS waits briefly. If the flag is not there, the BIOS figures it has enough time to complete its reading and proceeds to do so.
All this works fine until the computer reaches the year 2000, according to Echlin. After that, he says, computers with nonbuffered real time clocks may trip up if an unlucky user turns them on at the wrong instant in the update cycle.
The problem, Echlin said, is that at least some BIOS chips actually calculate by relating what they read to elapsed time since a Jan. 1, 1980, start date -- a sort of universally presumed dawn of time in the PC industry.
After 2000, that takes two steps: Time from 1980 until the end of 1999 is added to time since the beginning of 2000, he said. The extra fraction of a second it takes to complete the Year 2000 translation could push the time check beyond the 244 microsecond safety window if the computer happens to be switched on just before the flag goes up, Echlin said. The result is a garbled time check with unpredictable results, he said.
"They make a very solid argument," said Jeff Floyd, a real time clock specialist for Motorola Inc., the giant semiconductor company that manufactured the real time clock chip on Crouch's computer.
But critics like Becker contend that Echlin is being needlessly alarmist. They complain that Echlin and Crouch have declined to provide equipment on which they have seen their effect to critics who have asked to test it.
Becker said his doubts grew last year after he set up his own test. He harnessed two computers with nonbuffered real time clocks to a third device that turned them on and off 25,000 times over a three week period. There was no sign of the Crouch-Echlin Effect, he said. Testing just two machines is hardly definitive, Becker conceded. But he said it was pointless to devote more time and resources without the cooperation of Crouch and Echlin.
"I don't deny there is something going on, but we tried to work with them and couldn't get their cooperation," said James Lott, manager of timekeeping devices at Dallas Semiconductor, a maker of real time clocks and other chips. "We have full lab facilities and would guarantee confidentiality. The cloak of secrecy turns all of us off."
Some of the critics suspect the supposed Crouch-Echlin Effect is a variation on common timekeeping problems that are caused by faults in a computer's power supply.
Other critics challenge the BIOS explanation. David Ross, an engineer with Phoenix Technologies, a leading BIOS manufacturer, said that the process by which a BIOS chip checks the time and date occurred too rapidly to create the situation Echlin had described. A more likely explanation, in his view, is not that the affected BIOS chips are taking too long to identify the year 2000 date, but that they are simply confused by encountering "00."
Echlin insists that power supplies are not the problem because the computers performed flawlessly on dates earlier than 2000. He also bridles at charges that he and Crouch withheld crucial evidence. Early this year, they sent the motherboard from Crouch's computer to Mark Slotnick, a sympathetic technician at what was then Digital Equipment's Year 2000 Expertise Center, to verify their results. Digital was acquired by Compaq last June.
In April, Barry Pardee, manager of the center, told Echlin that Digital had confirmed the Crouch-Echlin findings in its own tests. Shortly after Digital's merger with Compaq, Pardee agreed to resell the software fix developed by Yeovil Systems and Development Research Inc., a venture set up by Echlin, his sister-in-law, and Crouch. Pardee said at the time that the testing had not turned up any problems in Digital or Compaq computers but that the software would be sold as a service to customers whose offices had a mix of brands.
Compaq-Digital also sent along an initial royalty payment that Echlin described as "about $1,000 for each of us." Crouch and Echlin met soon afterward for the first and only time, in Sault Ste. Marie, Ontario, half-way between their homes, where Echlin hand-delivered Crouch's check.
"We're not planning to get rich," Echlin said, adding that Yeovil gets just $1 for each copy of the software sold. "We were going to give this away but corporate America doesn't trust anything that's free."
Word of Digital's endorsement spread in October via an e-mail from Pardee that Digital salesmen distributed to customers. It included pricing for the software fix. Digital offered the software at prices from $32 a unit for 50 computers down to $3 each for orders of 25,000 or more.
Crouch-Echlin critics were dumbfounded by Pardee's endorsement. Becker alerted Karl Fielder, chief executive of Greenwich Mean Time Ltd., a Year 2000 consulting company, who contacted Compaq officials at the company's headquarters in Houston. Some engineers within Digital also expressed doubts, a Compaq spokesman said.
The flurry of activity may soon bring the issue of whether Crouch-Echlin is real to a head. Compaq officials plan to test the device that produced the results Pardee found convincing and to ship it to Becker, a spokesman said. Pardee and Slotnick did not respond to requests for comments.
Company officials said they were skeptical and would not ship the Crouch-Echlin software fix unless further testing changes their minds. One customer has purchased it, the company said, and will receive a refund if the testing does not confirm the flaw.
But Echlin is not backing down. "Jace and I have been hoping ever since we found this that someone would come along and prove us wrong," he said. "No one who has followed our test procedures has." (end excerpt)
http://www.nytimes.com/library/tech/98/11/biztech/articles/09bug.html

-- Mike (gartner@execpc.com), November 10, 1998

Answers

As a PC BIOS developer, I was initially really interested in what might be causing TD. Last December I began asking Jace Crouch and Mike Echlin for a binary dump of the (possibly) offending BIOS ROM chip(s), so that I could disassemble them and analyze logic paths, timings, etc. (I've done this trick many times in other situations). No response.
I asked again and again. I pleaded. I sent debug scripts to them to generate the dump and create a binary image file I could use. They were glad to talk to me about their theories, but ignored all requests for the code. So did Slotnick.
I've asked them for the make and model of *any* of the computers they've seen TD happen on, so that possibly I could track one of these down on my own and get to the bottom of it. They refused to do even this much!
I wrote a test program intended to generate the error. Since I cannot make it happen on any computer I have access to, I asked them to run my test code to see what happens. I sent them the source to modify as they saw fit. Instead, they saw fit to ignore what I'd written.
Since that time, I notice that they still have no concrete explanation, but they *have* named the symptoms after themselves and started to market some utility that seems to make the problem go away. I can guarantee that some of their theories make no sense -- they are guessing that the BIOS does things I know that no BIOS ever written has ever done. I have tried to tell them this, and been ignored.
I've been in chat rooms with Mike Echlin, and he talks to me until I make yet another request for a ROM dump. Then he shuts up or changes the subject. Others in the chat room ask Mike why he refuses to respond to my request. He ignores them too. I've told him that I don't want any credit or to undermine his precious claim to fame, I just want to know what is really going on. No response. I've finally come to the conclusion that Crouch and Echlin are interested *only* in the glory, and not the answers.
If anyone here ever sees TD happen on any older (286 or early 386) computer, please contact me. I'll show you how to get that dump, and I'll come back in a day or two with a precise explanation of what is happening on that particular computer. I'm fed up with Crouch and Echlin.

-- Flint (flintc@mindspring.com), November 10, 1998.

Testing doesn't sound right:
Turning on/off an older PC takes 2-3 minutes, depending on op system, memory checks, etc. You guys know the drill. And I'm not sure NT on a faster box is any quicker.....
In 2 weeks, there are 20,160 minutes. So was it a valid test to turn a motherboard on and off (25,000 times, as stated, "automatically") to check for a randomly occurring "running" symptom? The motherboard would spend more "time" off than on.
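The arithmetic behind this objection can be checked quickly. A minimal sketch in Python; the function name is made up for illustration. The two-week figure is the one used in this post, while the quoted article says three weeks, so both are shown.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes

def seconds_per_cycle(weeks, cycles):
    """Average wall-clock seconds available for each on/off cycle."""
    return weeks * MINUTES_PER_WEEK * 60 / cycles

for weeks in (2, 3):
    avg = seconds_per_cycle(weeks, 25_000)
    print(weeks, "weeks:", round(avg, 1), "seconds per cycle")
```

Either way each cycle gets somewhere between about 48 and 73 seconds on average, well short of the 2-3 minutes estimated above for a full boot.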

-- Robert A. Cook, P.E. (Kennesaw, GA) (cook.r@csaatl.com), November 10, 1998.

Flint, thanks for taking the time to shine the cruel light of truth on Crouch-Echlin B.S.

-- Woe Is Me (wim@doom.net), November 10, 1998.

Flint - I haven't seen CE in action either. I had assumed that the effect might be caused in a small number of very old machines (generally 'who cares' machines) due to BIOS confusion as to whether a clock tick that was not updated on time was a single or a double tick. I had not heard they were claiming a year in two weeks. My assumption would have doubled the clock speed at most. Since my assumption would also have required the CPU to be spending a great deal of its time in 'clock confusion' the PC would have been running so slowly on anything else as to be useless anyway. Running OK and upping the rate the clock is ticking to about 50 times normal? Can't believe that one, pull the other leg.

-- Paul Davis (davisp1953@yahoo.com), November 10, 1998.

By way of negative criticism of the article By BARNABY J. FEDER, I offer the following information and ask that Flint and Mike confirm it, as I'm sure that they both are aware of the accuracy of the following facts. My point here is that if Feder claims to be technically competent, he's not and if he isn't, what the hell is he doing authoring the article?
BIOS is an acronym for "Basic Input Output System". The BIOS is a PROGRAM (firmware, microcode, whatever) that is STORED in a CHIP called a ROM (Read Only Memory). There are a number of different types of ROM CHIPS (EPROM, EEPROM, etc.) but they are all simple storage devices and perform NO calculations of any sort. Yet Feder claims that Echlin said that the BIOS Chip calculates. Feder himself clearly speaks as if the BIOS was the chip, and reveals in a number of other places that he isn't aware of the technical realities of his subject. ROM chips are a "set in stone" repository of instructions which tell the system how to exchange data with, or issue commands to, the I/O (Input Output) devices such as the keyboard, the monitor, the hard disk, etc. A possible exception is the fact that some processor or control chips have an internal ROM, but even that makes no sense in the context of Feder's article.
The BIOS is a set of instructions that direct the actions of the hardware, including the movement of the binary number that represents the time from the RTC (Real Time Clock) CHIP (which is an I/O device) where it is generated, to the location in the computer's RAM (Random Access Memory) which is made up of yet other CHIPS that hold data. That location in RAM is a constant location where the OPERATING SYSTEM (another program) keeps track of what time it is. Most operating systems only access the RTC once and then update their own copy of the time value in RAM.
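To make that division of labor concrete, here is a minimal sketch in Python of an operating system that reads the RTC once and then keeps its own time. The class and method names are hypothetical, and the tick rate is an assumption based on the classic PC timer (1,193,182 Hz divided by 65,536, about 18.2 ticks per second); real operating systems differ in the details.

```python
from datetime import datetime, timedelta

PIT_HZ = 1_193_182          # classic PC timer input clock (assumption)
TICK_HZ = PIT_HZ / 65_536   # about 18.2 timer interrupts per second

class SoftwareClock:
    """Hypothetical OS clock: seeded once from the RTC, then tick-driven."""

    def __init__(self, rtc_reading):
        self.boot_time = rtc_reading   # the only RTC access in this model
        self.ticks = 0

    def on_timer_interrupt(self):
        self.ticks += 1

    def now(self):
        return self.boot_time + timedelta(seconds=self.ticks / TICK_HZ)

clock = SoftwareClock(datetime(1999, 12, 31, 23, 59, 0))
for _ in range(round(60 * TICK_HZ)):   # roughly one minute of ticks
    clock.on_timer_interrupt()
print(clock.now())
```

In this model, if the single seed reading is wrong, every later answer from `now()` inherits the same error.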
Now, inasmuch as most of the interest in this thread may come from technically literate readers, I may be preaching to the choir here. I mean no offense to anyone (save mild contempt for Feder, and Echlin if he believes that ROM chips calculate), but the inaccuracies of the author destroy any credibility for me.
It is apparent to me that the "effect" (really just a design/timing error) is real, the "hype" is unwarranted (Hey, "Woe"! Here's some evidence for your claim.) and the problem can usually be dealt with by the simple expedient of verifying the correct time with an outside source after initialization of the operating system. In those cases where the application software accesses the RTC directly, a software fix is appropriate unless the application is re-written to use the OS clock instead.

-- Hardliner (searcher@internet.com), November 10, 1998.

I requested a copy of e-mail addresses from the NY Times. They sent me the addresses for about 100 staff members who have made their e-mail addresses public. Guess what? Barnaby J. Feder's was NOT one of them.

-- Gayla Dunbar (privacy@please.com), November 10, 1998.

OK, one more time then. The most likely TD theory goes like this: (NOTE: this is a THEORY. What actually happens depends on code that Echlin doesn't know how to examine and won't let anyone else who does, even look at).
1) At some time during bootup, the OS asks the BIOS for the time and date. Subsequent time/date is maintained by the OS based on the periodic timer interrupt happening 18.2 times a second. This timer interrupt is based on the 8254 timer chip, timer 0, and NOT on the real time clock.
2) As part of the get-date BIOS call (interrupt 1Ah), the BIOS reads a number of registers from the RTC chip directly. To do this, the BIOS must disable interrupts, then check to see if the RTC is safe to read right now, or is in the process of changing to the next second. RTC register values read during the change (the update process) return UNRELIABLE data on most (that is, unbuffered) RTC chips. Buffered RTC chips change the time in shift registers invisible to the bus, then latch all new values all at once. Unbuffered chips don't latch the data, and run the risk of a read operation while the shift registers are in the process of changing, generating garbage.
3) The RTC spec says that IF the update bit is CLEAR, it is guaranteed to be safe to read all registers for 244 microseconds. This grace period is necessary because the update bit might have gone SET immediately AFTER you read it as CLEAR. How would you know?
4) If the BIOS diddles around too long reading these registers, for any reason, it runs the risk of violating that 244 microsecond grace period and reading bad data, resulting in a bad date or time.
5) If the BIOS is written in such a way as to require extra logic after 2000, this can happen. Since the get-date call is when the BIOS looks for a century change, this is possible. The BIOS code MIGHT have logic that says, OK, the year is 00, better check the century. Oops, the century is 19, better make it 20. NOW go read month, day of month, etc. That EXTRA time spent examining and possibly changing the century (which is maintained by software, NOT the RTC) is what pushed the timing past the grace period, and subsequent registers on rare occasion have started the update cycle and are unreliable.
6) All unknowing, the BIOS returns the resulting garbage to the operating system, never having done a sanity check, or having read garbage that just happens to look sane. The OS uses these register values to create a time and date, which is also invalid as a result. Garbage in, garbage out. Applications asking the OS for the time and date get numbers that appear to skip wildly forward or backward from the actual time, depending on what garbage came back.
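The steps above can be sketched as a small timing model. This is a Python illustration of the theory only, not BIOS code: the function names are hypothetical, the check is assumed to land at a uniformly random point in the one-second cycle, the short interval during which the update bit is set is ignored, and the 300-microsecond read duration is an invented example figure.

```python
GRACE_US = 244          # documented safe window after reading the update bit as clear
CYCLE_US = 1_000_000    # the RTC updates once per second

def read_is_safe(us_until_flag_sets, read_duration_us):
    """True if a read that began with the flag clear finishes before the update.

    The update starts GRACE_US after the flag sets, so the time available is
    the time remaining until the flag sets plus the grace period.
    """
    return read_duration_us <= us_until_flag_sets + GRACE_US

def violation_probability(read_duration_us):
    """Chance that one randomly timed read overruns into the update."""
    overrun = max(0, read_duration_us - GRACE_US)
    return overrun / CYCLE_US

print(read_is_safe(0, 244))     # True: worst case, still inside the window
print(read_is_safe(10, 300))    # False: 300 us needed, only 254 us available
print(violation_probability(300))
```

Under these assumptions a read that fits inside 244 microseconds never fails, while one that runs slightly long fails only a few dozen times per million attempts, which would be consistent with the effect being rare.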
NOTE that this phenomenon has NOTHING to do with the RTC speeding up or slowing down. The RTC works fine, but whoever wrote the BIOS violated the documented 244us timing window, and never bothered to test for post-2000 dates enough times for this rare violation to show up. This is entirely possible.
Also note that when we talk of the BIOS doing calculations, we aren't talking about some kind of hidden processor inside the EEPROM. This is simply a shorthand way of saying that there are calculations performed by the BIOS code, which is stored in the EEPROM but actually gets executed by the CPU.
Finally, note that this theory explains sudden jumps in the date, and explains why the phenomenon might be rare. It does NOT explain jumps in the time of day very well, and it definitely doesn't explain why this happens ONLY during POST (bootup) and never at runtime. The test code I wrote requested the time and date from the BIOS repeatedly but asynchronously, over and over, in the hopes that sooner or later I'd hit a BIOS violation if there was one. I suspected that Crouch and Echlin were seeing the problem only at bootup because all subsequent get-time-and-date calls were fielded by the OS, which didn't use the BIOS. Maybe if he'd ever run my test on a failing unit, we'd have known the truth. Perhaps Crouch and Echlin were afraid that if my test utility found the problem, they'd have to share the CREDIT!
If anyone has any questions about all this, just give me a holler. I'll tell you what I know; Echlin will sell you something that might fix what he doesn't understand.

-- Flint (flintc@mindspring.com), November 10, 1998.

Special detail for Hardliner about all this:
This TD isn't a problem for PC's in most cases. If the time/date are off, just set them right, who cares how they got off. The real problem is that a whole lot (possibly the majority) of embedded systems are based on PC motherboards running 286 and 386sx processors, with DOS (and sometimes the application) in ROM! These are closed systems where you can't easily just change the time/date, and these are slow systems and therefore more likely to violate that 244us grace period window. And it's systems like these that make decisions to shut down power plants, sometimes for date-related reasons.

-- Flint (flintc@mindspring.com), November 10, 1998.

Harlan Smith did an article on this appearing on the Westergaard web-site today:
http://www.y2ktimebomb.com/Computech/Issues/hsmith9845.htm

-- Buddy (DC) (buddy@bellatlantic.net), November 10, 1998.

Flint,
If it walks like a duck and quacks like a duck, etc. I think it highly likely that your analysis is exactly correct.
I suspect that if you had access to all the failure parameters, you'd find that the time of day failure mode frequency had to do with some hardware idiosyncrasies. Even on-chip transistors, even of the same spec, do not all switch exactly as scheduled. That fact is one of the reasons that the register-to-latch buffer arrangement was developed originally. "TD" is simply pushing the envelope and without the data you've asked for, and probably then some, I don't think it WILL be explained.
As to the difference between POST and runtime, whatever is different is not apparent but if you had the failing code, it might just leap out at you.

-- Hardliner (searcher@internet.com), November 10, 1998.

Flint,
I understand your point about the closed systems.
What is your evaluation of the degree of congruence between this group (old, closed systems which may exhibit TD) and the group (old, closed systems which do not exhibit TD) which will have "vanilla" century date flaws?
Also, if I may, I'd like to pick your brain a bit and ask that you share of your knowledge of critically time dependent systems that are totally closed (so tight that Daylight Savings Time must be ignored)?

-- Hardliner (searcher@internet.com), November 10, 1998.

Hardliner,
Maybe we'd better take this offline and not chew up Yourdon's server.
Some preliminary points to make are:
1) When the spec sheet on the MC146818 and clones RTC says 244us, they MEAN 244us. Crouch and Echlin have sent me both the source and output of their code to measure this, and it is extremely precise on EVERY clone out there, whether TD is observed to happen or not. I am fully satisfied that this is NOT a case of variable or sloppy hardware in any way.
2) Be careful about implicit assumptions. These embedded systems may be closed, but they are not necessarily old. You only have to cruise some part manufacturers' sites to find 286 logic in kitchen-sink ASICs, along with all the other AT chips that used to be discrete logic - the 6818 RTC, 8237 DMA controllers, 8259 PICs, 8042 keyboard controllers, 8254 timers and the works. All in one chip, all running off a 6MHz oscillator with internal divisors and multipliers (actually, a built in synthesizer) to generate the other clocks. Amazing stuff. And expressly intended for embedded systems.
3) I assume that by 'vanilla' date bugs, you're talking applications? That's a bit different, since most of those (my guess - hard to generalize about embeddeds, they're all customized) are actual software errors running out of RAM and loaded from disk, usually on some server or control computer that uses the embeddeds as data-generating peripherals. Clearly, it's easier to fix those than to rewrite the ROM code, but not always.
4) I don't know about daylight savings being ignored, though I've heard about places that shut down the systems during the change and then restart them afterward. I'd need details to understand why this is necessary. I can tell you that some RTC chips can (if set to do so) handle daylight savings automatically, skipping an hour the first Sunday in April and repeating that hour the last Sunday in October. I'd guess (wildly) that if this change causes problems, either someone doesn't understand their system very well, or someone was very careless when writing *something* somewhere. If the daylight savings changes cause problems, I guarantee if I could see what they wrote I could solve these problems. If this sort of shutdown kludge is necessary, this is an obvious, serious bug. Fix it, dammit.
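For illustration, the two switch dates described above (first Sunday in April, last Sunday in October) can be computed as follows. A minimal Python sketch with hypothetical function names; it models only the calendar rule as stated here, not the behavior of any particular RTC chip.

```python
from datetime import date, timedelta

def first_sunday(year, month):
    d = date(year, month, 1)
    # weekday(): Monday is 0, Sunday is 6
    return d + timedelta(days=(6 - d.weekday()) % 7)

def last_sunday(year, month):
    # Start from the last day of the month and step back to a Sunday.
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    d = next_month - timedelta(days=1)
    return d - timedelta(days=(d.weekday() - 6) % 7)

print(first_sunday(1999, 4), last_sunday(1999, 10))
```

Software that has to survive the change could use dates like these to test both the skipped hour and the repeated hour ahead of time.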

-- Flint (flintc@mindspring.com), November 10, 1998.

Excellent comments by Flint and I believe he is probably correct in his general theory. I just found an old AT system in the backroom of another department gathering dust. I'm going to try and acquire it for some more testing.
Regarding the daylight savings time comment. That jogged my memory of what Mills has said about SCADA/EMS systems. He was trying to argue they weren't "necessary" as long as you don't require RSA (reliable, safe, affordable) power. Anyway, he mentioned in an article that he knew of a utility that would bi-yearly shut down their SCADA/EMS (for a short time) because it couldn't handle DST changes! I regarded that as a serious bug as well and they shouldn't have just muscled their way around it.

-- R. D. Herring (drherr@erols.com), November 11, 1998.

Flint,
We're in violent agreement here, and what's sloppy is those implicit assumptions that I made. You're right of course about the age v. technology point.
I should have made myself a little clearer about the hardware idiosyncrasies. I did not say nor did I mean sloppy or variable in the sense that you apparently read. I agree that no RTCs violate the 244us window, but the internal update process (which should be irrelevant) may happen in such a way that some RTCs are more prone to TOD "tardiness" than to month, year, etc. "tardiness" or the other way around. During the time that the flag is set to signal unreliable data, as you've pointed out, a number of internal RTC registers get updated. If the BIOS is moving garbage data because the RTC is in an update, then the sequence of register updates internal to the RTC becomes relevant. If the "garbage" is from the front of the RTC update, certain registers may still be valid, just as if the "garbage" comes near the end of the RTC update window, some registers may already be updated. Since the start point is asynchronous and random (within limits), it seems to me that your analysis explains year and day failures equally as well as hour and minute failures. It may simply depend on when the "garbage" is transferred. Even if the "garbage" came from the exact point in time within the "unreliable window" every time, there's no guarantee that any given transistor will have switched as soon as it did the previous time. The only guarantees are the length of the 244us window and that if you try to use the data during the update, it will almost certainly not be what you want. As IBM says in lots of manuals, "Results are unpredictable."
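The point about partial updates can be illustrated with a toy model. This Python sketch assumes, purely for illustration, that the RTC rewrites its registers one at a time in a fixed order and that the read itself is instantaneous; the actual internal order and timing on a real chip are not known here, and the function name is hypothetical.

```python
FIELDS = ["second", "minute", "hour", "day", "month", "year"]

# Register contents just before and just after the rollover into 2000.
before = {"second": 59, "minute": 59, "hour": 23, "day": 31, "month": 12, "year": 99}
after = {"second": 0, "minute": 0, "hour": 0, "day": 1, "month": 1, "year": 0}

def torn_read(registers_already_updated):
    """Values seen if the read lands after that many registers have changed."""
    return {
        name: (after if i < registers_already_updated else before)[name]
        for i, name in enumerate(FIELDS)
    }

for n in range(len(FIELDS) + 1):
    print(n, torn_read(n))
```

In this model a read landing after three registers have changed reports 00:00:00 on 31 December 99, nearly a day in the past, while a read landing after five reports 1 January 99. Different landing points corrupt different fields, which is the sense in which the same mechanism could produce both date and time-of-day errors.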

-- Hardliner (searcher@internet.com), November 11, 1998.
