Y2k Retrospectives From Power Industry Insiders: Chapter 3 -- Early Test Results and Encounters with Vendors
greenspun.com : LUSENET : TB2K spinoff uncensored : One Thread
Hello again, everyone. This is the third chapter in a series of Y2k retrospectives from power industry insiders. NEWS FLASH! "The Engineer" has joined our team with his comments (see below).
Chapter 1 at http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=002zGc kicked off the discussion of how each of us first encountered the Y2k bug.
In Chapter 2 (at http://www.greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=003107), we discussed our early hypotheses of how Y2k might affect power.
Chapter 3: Early Test Results and Encounters With Vendors (January-March, 1998)
At this time, everyone was desperate for a test procedure. What dates do we test for? Should we try the re-booting tests? How do we avoid damaging equipment or voiding warranties by Y2k testing?
To put it in Alanis-speak: what it all boils down to is that no one's really got it figured out just yet.
In January 1998, Atlanta hosted the first big EPRI conference. Many presenters had much to say (several of them were consultants looking for Y2k clients), but none had many Y2k test procedures or much test experience. It was here that a guy named Rick Cowles from Digital Equipment Corporation shared his knowledge on the subject, including a statistic that an estimated 1 to 5% of embedded systems have Year 2000 problems.
The only solid information on Y2k testing came from a rather odd source--General Motors. Many of us began adapting the GM model for testing power equipment. Four test types were found to most effectively determine a device's Y2k readiness: roll-over; roll-over and reboot; roll-over with power off; and simple manual date-setting tests. The primary date of concern was 1/1/2000, but other dates began to emerge: 1/1/1999, 9/9/1999, 2/29/2000, 12/31/2000, even 1/1/2027! In all, nearly 30 dates could be found in various test procedures.
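For readers who never saw one of these procedures, the basic roll-over test above can be sketched in a few lines of modern Python. This is a minimal illustration only, not any utility's actual procedure: the `SimulatedDevice` class and its method names are invented stand-ins for a real device's clock interface, and the date list is abbreviated to the dates named above.

```python
from datetime import datetime, timedelta

# A few of the roughly 30 dates of concern (abbreviated to the ones named above).
TEST_DATES = [
    datetime(1999, 1, 1),    # first "99" date
    datetime(1999, 9, 9),    # 9/9/99, sometimes used as a sentinel value
    datetime(2000, 1, 1),    # the main rollover
    datetime(2000, 2, 29),   # 2000 IS a leap year (divisible by 400)
    datetime(2000, 12, 31),  # day 366 of a leap year
    datetime(2027, 1, 1),    # far-future date used by some procedures
]

class SimulatedDevice:
    """Hypothetical stand-in for a relay; a correct device just keeps time."""
    def __init__(self):
        self.clock = datetime(1998, 1, 1)
    def set_clock(self, t):
        self.clock = t
    def advance(self, delta):
        self.clock += delta
    def read_clock(self):
        return self.clock

def rollover_test(device, target):
    """Set the clock just before a date of concern and let it tick across."""
    device.set_clock(target - timedelta(minutes=1))
    device.advance(timedelta(minutes=2))
    return device.read_clock().year == target.year

device = SimulatedDevice()
results = {t.date().isoformat(): rollover_test(device, t) for t in TEST_DATES}
```

The other three test types in the GM model vary only the surrounding conditions (reboot during the transition, power removed across the transition, or manually keying in the date), with the same pass criterion.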
We decided that our approach would be to test any device we could get our hands on, regardless of its criticality, to get a feel for whether our hypothesis was correct (see Chapter 2). So in early March, we tested an Asea Brown Boveri digital relay. The relay itself passed all tests, but curiously, when it rolled over to 2000, the year was displayed as ;0, and no date entry in year 00 was allowed. I contacted the vendor; they had just become aware of the problem and promised a fix, which was provided within one month. Thus the first test, albeit a bit rocky, suggested that problems were cosmetic in nature, wouldn't cause the device to crash, and vendors were willing to provide fixes.
By the end of March, we had tested four Schweitzer Relays and another ABB relay. All devices passed tests (although the SEL relays displayed the year 2000 as 0).
David (Factfinder)'s Comments:
In mid-1998 I started working full time on a y2k nuclear plant project, and by that time this was a "crash" project for embedded systems, far behind the IT mainframe software y2k work that had been underway for a few years. By this time, we had the benefit of the testing experiences of other utilities and equipment vendors, via the EPRI Web or directly from the vendors.
"You can't trust the vendors." I heard this over and over, and while I am sure there was a bit of truth in it, a much larger truth was "you better at least CHECK with your vendors before you start y2k testing and before accepting all the rumors of y2k failures!"
One thing I learned early on was that many of the speculated failures in power generation equipment such as DCS's and PLC's were often contradicted by vendor test information, which indicated that y2k bugs were in many cases nuisance problems, not problems that would affect operation. Examples of such systems having nuisance-type problems were the Westinghouse WDPF distributed control system, the Westinghouse RVLIS system, and a number of General Electric control systems. Some of the early testing "failures" were actually a result of problems with the testing methodology; in fact, I found many examples of this. Rolling a date far in advance for testing often caused problems that would NEVER have been encountered in the real y2k rollover, and advancing the date and then rolling back often caused problems as well. One thing was quite evident: the vendor had to be consulted for testing recommendations and to identify any precautionary measures recommended for testing. Here's one documented example of "testing methodology" causing invalid y2k "failures":
Subject: SpeedTronic* Mark V Turbine Control System - Reported "LOCKUP" of alarm display during performance of Y2K testing
It has been reported by customers that the Alarm Display "locks up" when the customer performs Y2K evaluation tests of Mark V equipment at its site.
This situation has been duplicated in GE Industrial Systems' Salem location using lab equipment with IDP version 4.2 loaded on the operator interface. Though the screen display does not change, there is actually not a complete lockup. The cause of the apparent "lockup" of the alarm display is the disabling of the MSP task in the operator interface, as a result of issuing a TIMESET command after setting the date ahead by more than 28 days. The display is still functional, as demonstrated by the ability to enter and exit the display, and to select optional features using the soft switches located at the bottom of the display. However, the underlying communications for alarm functions with the panel is disabled. The reset, acknowledge and status update functions, including the receipt of new alarms, will not operate until the operator interface has been reset, thus reinitializing the MSP communications services. Testing of other dates, even those in 1998 and 1999, which were more than 28 days ahead of the previous date, produced the same conditions. This characteristic of the operator interface's MSP services is NOT A Y2K CAPABILITY ISSUE. It is a testing issue only, and is not experienced during normal date progressions.
This condition may be avoided by utilizing the following procedure before conducting an appropriate Mark V evaluation test that involves advancing the date:
1. Exit IDOS, using the IDOSEXIT command
2. Advance date to within 20 days of the future date required for performing tests
3. Resume operations by issuing a RUN_IDP command
4. Perform evaluation test
Another example (and a more "infamous" one) of a y2k testing methodology problem is the lockup of the Peach Bottom Nuclear station Plant Monitoring System (and the associated SPDS). The Licensee Event Report to the NRC for this incident included a root cause analysis that concluded thusly:
"After review of system documentation and discussions with the system vendor, it is believed that the PMS stalled because the time differential between the new inserted time and the existing time caused the clock to become unstable."
Most vendors in the power industry provided excellent and accurate y2k information to me and my associates. There was at least one exception, however, where the vendor indicated minor problems and my testing revealed more serious ones. Fortunately this was a small vendor with a "customized" test system not installed in the plant. It was PC based, and a patch file fixed the problem, which originated in the DOS operating system date.
NEW! "The Engineer's" Comments:
We found similar things. We found the BEN (EI) DFR's had a problem in that after the rollover the time wasn't consistent and changed very rapidly (shades of the Time Dilation). However, it did not have any effect on the data (currents, voltages) recorded, just the time stamp. A fix was provided by the vendor.
At one of the first (and last) meetings I attended, I mentioned the other dates to our computer people. They seemed totally unaware of the Feb. 29th or all-nines dates (and their potential problems). I did find an interesting twist on the nines date problem. It turned out that one of our vendors typed in 9999 when they had an order for equipment but no firm delivery date. It was their equivalent of putting it off well into the future. If and when a firm order and date came in, they would put in the corrected data.
I was talking to one of their executives over the phone (gathering data on Y2K) and mentioned the problem of Feb. 29th and the nines. He became very quiet for a period and then mentioned their procedure for putting that date into their order database. Note that this wasn't a major problem and would not have caused any harm, but it is an interesting example of doing things the same way for a long period of time and then not thinking about them. They corrected it when they went through their own Y2K procedures.
As for testing our desktop computers: I was fortunate enough to have had contact with Tom Becker (RighTime) in the past in reference to something else. Through CPR I made connections with him again, and I got some good advice on testing the machines. Most of our machines passed except for some laptops used by the field. We did find a few machines the secretaries used that had bad BIOSes and couldn't be patched. They were replaced.
At one time I traveled a lot and served on a number of committees. I was lucky enough to have made the acquaintance of other engineers across the country. I spent a number of days on the phone and writing emails to them asking for information. Basically the information I got back dovetailed with what I found and what Dan and Dave have written. No one really found anything major. Most of it was nuisance problems such as incorrect displays. Some of our people did a tour of a local power plant (coal) and were told that "stuff" had been found and that further investigations were ongoing. What stuff? Stuff kind of stuff.
There were modifications made to the SCADA system to make it compliant. But again, none of this was major, and it was looked at well ahead of time. Most discrepancies were found to be in record keeping and data gathering, not in control and protection.
The main problem was that everyone had heard of someone else who had heard from others that they had heard of problems. I spent a lot of time trying to track down and verify the truth of these rumors.
As I explained in the previous chapter, we decided early on that we would contact all vendors prior to testing, and that we wouldn't even bother testing anything that the vendors said was non compliant. However, anything that the vendors couldn't confirm as non compliant, or anything for which we didn't get a vendor response would be tested. And the very first test was to see if the system (hardware or software) used any dates anywhere.
We were not surprised to find that most embedded systems did not use any dates at all, and of those that did, the majority used them purely for date/time tagging, not in any calculations. We found no items that would cause any generator to trip off line on rollover in our group of stations; however, two technicians at an old GT station did report a single device, a rate-of-change relay, that would have caused a loss of generation if the gas turbine had been started within a few minutes of rollover. It would not have caused any loss if the generator had been on a steady load for any length of time. I have been unable to discover any details about this particular relay other than that it was manufactured by Stal-Laval in the late 1960s or early 1970s. It did have the ability to simply count clock pulses, but for some unknown reason had been set up to take the difference in elapsed time.
We did find a number of high-level embedded systems that would have failed on rollover and could have affected generation, but would not have actually caused any direct loss. The main item in this group was our SCADA (which was due for an upgrade anyway). An updated system was ordered from MITS in Australia, but I'll report more on this system in a later chapter. Other items included a Honeywell Scan3000 control system, a STREATS Macroview control system and an Accusonic flow recording system.
I'll jump ahead a bit here, because of all the vendors that we had to deal with, the New Zealand agents for Accusonic were among the worst. Not only did the system fail to perform to expectations after remediation, the agents did not appear interested enough to even try to make it work. We eventually had to dump the entire system and develop a flow recording system in-house.
One interesting situation arose when a vendor who supplies us with some very critical data assured us that although the dataloggers they used were fully compliant, the data that we would receive would only have a two-digit year, and it would be up to us to ensure that the software at our end would handle the data in this format. However, the same vendor had also supplied the software at our end. All of our testing indicated that the software would work fine with a two-digit year, but then in November 1999 they informed us that due to a previously unknown irregularity they couldn't guarantee that the software was OK, and that they would write a new program for us free of charge and install it before rollover. They did this with four days to spare, but that only left me three days to modify another application that used their data to accept the new format they gave us. The loss of this data would not have caused any loss of generation, but could have caused a loss of efficiency of up to 3%.
In general we found all vendors to be very co-operative, and prepared to assist with data whenever we required it. No systems would have caused a loss of generation on rollover, however some would have caused a loss of efficiency or required manual control, and most that did have data handling difficulties would only have caused cosmetic errors.
Discussion Question: Did you ever Y2k test anything? Any odd results, or issues with vendors?
Chapter 4: Experiences with the Media
-- Dan the Power Man (firstname.lastname@example.org), April 27, 2000
Sorry that I didn't get my comments to you in time. If it's OK, I'll post them here as an addendum.
Developing testing procedures was the first order of business. A big problem was that there was no way to test some systems while they were on line and no way to take them off line without a significant loss of generation. We developed a test lab where we were able to set up some system simulations to do the testing.
We found that EPRI was essentially worthless in helping to develop any testing standards. Our lawyers were initially concerned about trading too much information with other utilities for fear of antitrust problems. That fear faded once NERC was established, and we were then able to use the testing procedures developed by other utilities rather than reinventing the wheel each time.
Relays made up about 30% of total embedded systems. Luckily for us, the majority of our relays were older analog relays so they had no date problems. The digital relays caused us some concern but the testing showed they would rollover correctly but the software had some cosmetic date problems. The manufacturer took much longer to develop software fixes than originally scheduled and we were still updating some of the software in October of 1999.
Our major date problems were with systems that were used for environmental monitoring. These were things like stack emissions, fishwater releases, and the like. We could still generate but we would be out of compliance with environmental regulations. The lawyers wanted to have Congress grant an exemption to these requirements if we couldn't remediate everything in time but we fought that idea since it would be an excuse to slip the schedule and would make the public wonder what else we had not gotten done. As it turned out, this was a good decision because we got everything done before the rollover and would probably still be working on it if we had gotten an exemption.
The strangest part of this period was how our spending was being viewed. Our original estimates were that we would spend much more money than we were actually spending. The testing procedures were not as complicated as we thought they would be, and our results showed that testing samples gave as reliable a result as testing the whole population. As a result, we were spending less than some other utilities of comparable size. This caused a lot of concern among the officers, because they didn't want to explain why we were spending less when spending seemed to be the only measure of "progress" at that time. We were told to bring on more people and to expand our testing activities, even though we argued this wasn't needed. So, the two lessons I learned were:
1. If engineers and accountants have a disagreement, the accountants will win.
2. If you're told you're not spending enough money, you have to be more creative.
-- Jim Cooke (JJCooke@yahoo.com), April 27, 2000.
I must endorse your comment that "if engineers and accountants have a disagreement, the accountants will win."
We managed to get a number of enhancements to some of our equipment by simply stating that it was Y2K related. It saved the endless arguments that we would have had with the bean counters if we had tried to justify them on purely economic grounds.
-- Malcolm Taylor (email@example.com), April 27, 2000.
This is a wonderfully informative series, and I'm enjoying learning what went on under the hood. My only question is, why are only the shills allowed to relate their experiences? Can't you find someone *trustworthy*? [grin]
-- Flint (firstname.lastname@example.org), April 27, 2000.
Flint's right. Really, Jim, claiming that your organization was coming in comfortably under budget? I guess you're not cut out to be an empire builder. 8^)
I do have a question. One of the concerns expressed about the Y2K remediation problem was that an enterprise with a complex computing topology but limited resources might have difficulty anticipating the impact of remediating (or replacing) certain systems while leaving others unaddressed.
From the segments thus far, it sounds as if this concern didn't really apply in your situations, but I'd be interested in any observations you might have on this. Thanks.
-- David L (email@example.com), April 27, 2000.
Yes, there were some people who got some very nice computers because they "needed" them for Y2K. The amusing aspect was that our own officers were filling the role of Y2K doomers for us - "You're not spending enough money so that means you're not doing enough work so that means you'll never get done in time". I remember one meeting about this where I spent an hour reviewing schedules and test plans, showing them how we were ahead of schedule and developing some economies due to learning how to do testing more efficiently. The response? "But, you're still not spending enough money"!
I wonder if anyone is surprised that all of us shills were actually doing what we said we were doing?
I'm not sure I understand your question. Can you restate it for me? It sounds like you're asking about critical vs noncritical systems and what would get repaired or replaced but I'm confused about exactly what you want to know.
-- Jim Cooke (JJCooke@yahoo.com), April 27, 2000.
Good thing I didn't use Cyrillic letters. 8^)
I think this is what I meant to ask:
1. Was there question early on as to whether all the date affected "stuff" could be remediated or replaced in time?
1a. If the answer to #1 is yes, was the question of how to prioritize remediation/replacement explored in detail?
1b. If the answer to #1a is yes, what plan(s) or methodology(ies) were being considered to assist prioritization decisions?
2. If the answer to #1 or #1a is no, i.e., it was determined early that it would be feasible to remediate or replace all date affected stuff, does this suggest anything about the extent to which date processing and/or computing technology play a role in the power industry relative to industry in general? (Sorry if this sounds "loaded," not meant to be.)
Hope this is a little less muddy. Actually, I just thought of an additional question in a totally different vein. Was the remediation/replacement task at all impeded by cross-organizational tensions or dynamics? Thanks.
-- David L (firstname.lastname@example.org), April 27, 2000.
I believe I can answer these questions as far as our company was concerned, but we may differ slightly in how other power companies acted.
1. Was there question early on as to whether all the date affected "stuff" could be remediated or replaced in time?

Yes, there were questions early on, otherwise a Y2K team would never have been established. This was covered in Chapter 1, our first experiences with Y2K.

1a. If the answer to #1 is yes, was the question of how to prioritize remediation/replacement explored in detail?

Yes, all systems were prioritised according to their level of criticality.

1b. If the answer to #1a is yes, what plan(s) or methodology(ies) were being considered to assist prioritization decisions?

Our methods were described in Chapter 2, hypothesizing on how Y2K might affect power.

2. If the answer to #1 or #1a is no, i.e., it was determined early that it would be feasible to remediate or replace all date affected stuff, does this suggest anything about the extent to which date processing and/or computing technology play a role in the power industry relative to industry in general? (Sorry if this sounds "loaded," not meant to be.)

Once we had completed the inventory stage, we were very certain that everything could and would be ready for the rollover. Notice I say ready, and not compliant. We are still not compliant, and never will be, as that requires us to give guarantees on matters that are outside of our control.
Was the remediation/replacement task at all impeded by cross-organizational tensions or dynamics?

A very simple answer: No.
-- Malcolm Taylor (email@example.com), April 27, 2000.
Thanks for laying it out in such detail. I think I get it now :^)
What you call prioritization we called criticality. The process was to decide what was Mission Critical (defined as we couldn't generate power or supply natural gas), what was Mission Important (defined as we could still generate or deliver but it might have to be done manually or it would cost us more money), and Non-Critical.
The way we assigned items to the three categories wasn't very scientific. We had three days worth of meetings with the experts in each of the areas and went through each system. The criticality was assigned by agreement. The bias was that, if anyone thought it was critical, we assigned it to Mission Critical status. As it turned out, this was actually a pretty good process since we had buy-in from the people who were ultimately responsible for the work.
I plan to write about some of the other questions you raise for the next installment. I'll say that there was never a question about the Mission Critical items. Those had to be tested and repaired or replaced as needed. We never thought any other course of action was an option.
-- Jim Cooke (JJCooke@yahoo.com), April 27, 2000.
Jim Cooke: Thanks for posting your portion of this chapter, and for following up on the questions (same to you, Malcolm).
David L.: Good questions. I hope the other answers suffice.
Flint: Yep, I was called a shill on the old forum. I was also called a "friend of the Clinton administration" (my wife got a good chuckle about that) and some other names. I'll delve into that later...meanwhile, the robots from the Pentagon are sending me telepathic messages...must...get...back...to rewriting...history...:)
-- Dan the Power Man (firstname.lastname@example.org), April 28, 2000.
Thanks for the responses.
Another question: could a future chapter give some rough numbers to convey the scope of the task, such as the number of devices and the number of people working on testing/remediation/replacement? Jim, I think you mentioned in a previous chapter that deducting the miscellany resulted in 65,000 line items. (Gadzooks!) I'd be interested in how a task of that size was structured and managed effectively. Thanks.
-- David L (email@example.com), April 29, 2000.
I'm not sure we'll specifically answer your questions in a later chapter, but here are mine:
My company is a medium to large one (about a million customers). Our initial inventory found about 10,000 devices with date awareness. After assessing everything, we found about 800 device types, and only about 300 of those were deemed mission critical.
We had a core group of about 15 working on the power side (generation, transmission, distribution). Only three or four of us were working full time. Company-wide, there were about 100 people working on Y2k, although more than a thousand people were involved in it in some way.
-- Dan the Power Man (firstname.lastname@example.org), April 30, 2000.
Getting everything in a database and then making sure everything got tested and fixed was a real pain.
We first put the 65,000 items in major categories like PLC's and relays. We then had to subcategorize by manufacturer and then subcategorize again by both model and serial number since some items had different innards depending on when they were made.
We next had to assign one of the categories of Criticality to each of the items which was made somewhat easier by grouping them. We then had to determine which of the Mission Critical items had dates and set those up for testing. We had to track the test results and any remediation or replacement needed.
We used Access initially for the database but grew out of that quickly and ended up with a SQL database running under Oracle. We tracked all the work using MS Project with resource-loaded schedules. All the schedules got rolled up every two weeks up until about April of 1999, and weekly thereafter. We ran variance analysis on the schedules to determine where we were slipping or ahead of schedule and adjusted the work accordingly.
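The categorize-then-filter scheme described above can be sketched as a toy database. This is purely an editor's illustration of the idea, not the utility's actual schema: the table, every column name, and the sample rows are invented, and it uses Python's built-in sqlite3 rather than Access or Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id          INTEGER PRIMARY KEY,
    category    TEXT NOT NULL,      -- major category, e.g. 'PLC', 'relay'
    maker       TEXT NOT NULL,      -- subcategorized by manufacturer...
    model       TEXT NOT NULL,      -- ...then by model and serial number,
    serial_no   TEXT NOT NULL,      -- since innards varied by build date
    criticality TEXT NOT NULL CHECK (criticality IN
                ('Mission Critical', 'Mission Important', 'Non-Critical')),
    uses_dates  INTEGER NOT NULL DEFAULT 0,
    test_result TEXT                -- NULL until tested
);
""")

# Invented sample inventory rows.
conn.executemany(
    "INSERT INTO item (category, maker, model, serial_no, criticality, uses_dates)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("relay", "SEL", "SEL-321",  "A100", "Mission Critical", 1),
        ("relay", "ABB", "DPU-2000", "B200", "Mission Critical", 1),
        ("PLC",   "GE",  "90-30",    "C300", "Non-Critical",     0),
    ],
)

# Only Mission Critical items with date functions go into the test queue.
queue = conn.execute(
    "SELECT maker, model FROM item"
    " WHERE criticality = 'Mission Critical' AND uses_dates = 1"
    " AND test_result IS NULL"
).fetchall()
```

The point of the structure is the final query: once criticality and date usage are recorded per item, the test queue falls out of a single filter instead of a manual sweep over 65,000 line items.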
None of this qualified as fun :^)
-- Jim Cooke (JJCooke@yahoo.com), April 30, 2000.
Thanks for the responses.
Jim, from what you describe, it certainly doesn't sound like fun. But look at it this way, in categorizing the items, each of you was like a Linnaeus of the phylum of embedded devices. Are we having fun yet?
-- David L (email@example.com), April 30, 2000.
This series must have really captured my imagination, because questions keep popping into my head. How often (if at all) was it infeasible to obtain a direct replacement for a date-dependent device, so that another remedy was sought (e.g., replacing a larger-scale unit of which the device was one of several elements)? Thanks again.
-- David L (firstname.lastname@example.org), April 30, 2000.
David L: Glad to see you've gotten some answers to your questions. Regarding your latest:
The only situation I encountered where it was "infeasible" to do a direct replacement was with PC's and software--in these cases a software upgrade and a new operating system were installed.
The vast majority of Y2k problems were cosmetic in nature, and it was cost-prohibitive to fix them, so we just live with them. I sent out a special internal memo that listed all the "glitches" by model number, so that technicians wouldn't be surprised when they encountered odd things. For example, some Schweitzer relays display the year 2000 as "0", and I was concerned that a field technician might panic if he saw a date of "2/1/0" or something like that.
In one case, we could not upgrade a piece of critical software in time, and because it is always on line and needed, we couldn't thoroughly test it. So we just rolled the clock back to a year that the system had already been through, and it worked fine.
-- Dan the Power Man (email@example.com), May 01, 2000.
Dan the Power Man:
When you roll the clock back like that, do you generally schedule a time to complete the remediation process?
Also, does having different systems operating in "different years" ever create problems?
-- some questions (somequestions@here&now.cum), May 02, 2000.
I don't have any comments on power right now (probably won't). But I did want to say this series is great reading, and I've saved every chapter so far.
Thanks for spending the time to do this.
PS to dan the power man. I thought I would give you a snicker. did you ever see this thread on old Stinkbomb? the one where "perlie sweetcake" warned the forum to ignore dan the powerman?
meme girl didn't know what NERC stands for!
I thought you might get a kick out of that one!
-- Super Polly (FU_Q_Y2kfreaks@hotmail.com), May 02, 2000.
Our inventory and testing stages revealed around 5200 items that had some date functions that were not compliant. Of these just under 10% were classed as level 1 criticality. However, one of the issues that I often tried to make clear to people was that even if a system had some date function in it, the failure of the date didn't always mean that the system itself would fail. In many cases the date was simply used on a display, or for archiving purposes, and it didn't really matter what date was used.
We also had one very important control system that was non compliant, and that could not be remediated in time. This particular system had dual servers, one in service, and one on standby at all times. We rolled the date back on the standby server with the intention of changing servers a few minutes before rollover, and then going back to the main server two weeks later. In the event, the duty controller forgot to change servers and we rolled over on the non compliant server. And, nothing happened. It continued to function without any problem and has worked just fine all along.
-- Malcolm Taylor (firstname.lastname@example.org), May 02, 2000.
Some questions: The one system that we rolled the clock back on is in the process of being replaced now, long before the year 2004 when it would roll over into year 2000. Yes, there is a potential danger in doing this if someone sees the wrong date and tries to "fix the problem". We made sure everyone who has the ability to change dates on this system knew to leave it alone.
Super polly: Thanks for the link; I had never seen that thread. I wonder what thread I had written in April, 1999 that she was referring to?
-- Dan the Power Man (email@example.com), May 06, 2000.
One of my above answers said, "One of the concerns expressed about the Y2K remediation problem was that an enterprise with a complex computing topology but limited resources might have difficulty anticipating the impact of remediating (or replacing) certain systems while leaving others unaddressed."
I've attempted to clarify and expand upon this in an answer to the thread entitled "Post Y2K Report Dale Way of IEEE Y2K Doomsday Squad Now Polly but Still Clueless".
-- David L (firstname.lastname@example.org), May 30, 2000.