Power Grid Engineer Speaks Out


I have been swapping email with an engineer at Bonneville Power Administration for the last week. At one point I asked him for his opinion of the IEEE Letter To Congress, and the NIST Report. With his permission, I've posted his response (in blue), followed by my comments:

These are my opinions NOT BPA's, since I am not an official spokesperson. With that understanding (all the usual nons, so to speak): The IEEE was basically a CYA letter. It really didn't say anything we don't already know. If you think about it what it said was: "Things have to be looked at."

The NIST letter was also a repeat of old information. The nub of it was that if you use a time calculation that uses a date the system has to be looked at. We know that and have for several years. In my opinion the reason that all of this has been so overblown is that people think or maybe I should say believe that Time and Date are used much more than they really are. There is an assumption that every chip MUST have a clock in it, the clock MUST use the time and date, and the chip MUST fail come Y2K.

While chips have clock speeds in them, they don't have anything to do with time and date. If I can use a rather simplistic analogy: if you have an "old-fashioned" watch that also gives the date and you forget to set the date forward from November (which only has 30 days), does that mean the time is also incorrect? The answer is of course, no. Most of the things we use the date and time for is just the recording of the date and time. We use it for record keeping, not controlling. Big difference.

There are a whole host of other incorrect assumptions, at least as far as the utility business is concerned. Electricity is provided on a load basis and the load is variable. You can use statistics to calculate approximate values and look at historical data to see how it changes during the day, week, months, etc. You and your neighbors affect it whenever you turn things on and off. And since we have no way of knowing exactly when you will or won't do that, the system was built to respond to the load, not the time.

I don't own a generator and I don't plan to go out and buy one.

Thank you.

Three things I would like to comment on:

1. BPA Engineer states, "The IEEE was basically a CYA letter. It really didn't say anything we don't already know. "

I've heard from others on this forum that they feel the IEEE letter had more to do with mainframes than embeddeds, and that it was a CYA job. But this excerpt from the letter troubles me:

"The internal complexity of large systems, the further complexity due to the rich interconnections between systems, the diversity of the technical environments in type and vintage of most large organizations and the need to make even small changes in most systems will overwhelm the testing infrastructure that was never designed to test 'everything at once.' Hence, much software will have to be put back into use without complete testing, a recipe, almost a commandment, for widespread failures."

IEEE is here talking about "complete testing" of software. What about embedded systems? Don't they need end-to-end testing as well? There are elements of the letter that seem to be CYA. But this paragraph in particular seems to be a very frank warning of "widespread failures". Even a small possibility of such a thing should warrant some kind of personal-family-community preparation, IMHO.

2. BPA Engineer states, " The NIST letter was also a repeat of old information. The nub of it was that if you use a time calculation that uses a date the system has to be looked at. We know that and have for several years."

Here are some excerpts from the NIST report:

Third party testing instruments often do not detect the presence of dates in data transmissions that are encoded in proprietary codes. Hence, if a date is not detected, the embedded device may not be tested

[Among concerns] Date usage that is not apparent and consequently overlooked. This last factor is especially pernicious

Since there is no way to determine what combinations of factors will actually cause a failure, it may be difficult to determine when a failure has actually occurred.

I believe what NIST is saying here is that there may be many systems BPA and others missed. Either because of "date usage that is not apparent" or because third party testing instruments "often do not detect the presence of dates in data transmissions". If BPA has known about this for several years, why is NIST warning that they may have "overlooked" some systems?

If "there is no way to determine what combinations of factors will actually cause a failure", how can any utility, refinery, chemical plant or government possibly declare that Y2K will be a bump in the road? How can anyone tell me that I need only prepare for a 3-day storm?

3. BPA Engineer states, "Most of the things we use the date and time for is just the recording of the date and time. We use it for record keeping, not controlling. Big difference."

Here I must agree with him. We've heard a lot of tongue-wagging around here about all the horrible things that will happen when the chips come crashing down. While it's apparent from IEEE and NIST that the needed end-to-end testing hasn't been done, and that many apparently non-date-sensitive systems may have been missed - this does not automatically guarantee TEOTWAWKI. As BPA Engineer points out, most of these things are used for record keeping, NOT controlling. We ought to be cautious and prepared. But let's not throw out this man's input simply because it doesn't happen to fit our particular doomer framework. We need diversity of information and good, hard data.
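
To illustrate the distinction between a date used in a calculation and a date that is merely recorded, here is a minimal Python sketch. It is not taken from any BPA, IEEE, or NIST material; the function names and the pivot value of 50 are assumptions made purely for illustration.

```python
def elapsed_years_two_digit(start_yy: int, end_yy: int) -> int:
    """Interval computed directly from two-digit years (hypothetical legacy logic)."""
    return end_yy - start_yy


def elapsed_years_windowed(start_yy: int, end_yy: int, pivot: int = 50) -> int:
    """Same interval using a pivot window: yy below the pivot is read as 20yy,
    otherwise as 19yy. The pivot of 50 is an assumed value."""
    def expand(yy: int) -> int:
        return 2000 + yy if yy < pivot else 1900 + yy
    return expand(end_yy) - expand(start_yy)


def log_reading(yy: int, value: float) -> str:
    """Record keeping only: the year is stored as a label and never used in arithmetic."""
    return f"{yy:02d}: {value}"


# A calculation that spans the rollover goes wrong without windowing...
print(elapsed_years_two_digit(99, 0))   # -99
print(elapsed_years_windowed(99, 0))    # 1
# ...while a log entry merely shows an ambiguous label.
print(log_reading(0, 42.5))             # 00: 42.5
```

In this sketch only the first function produces a wrong result at the rollover. The logging function displays "00" but computes nothing from it, which is the "record keeping, not controlling" case.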

I'd like to hear Flint's comments on this. I have called him on some knee-jerk obfuscation, but I do value Flint's opinion.

FWIW, I'm at a 6 to 9 in my preps

-- RPGman (tripix@olypen.com), December 08, 1999

Answers

OOOPS! Goofed the color format. BPA Engineer's response ends with "Thank you."

Sorry...

-- RPGman (tripix@olypen.com), December 08, 1999.


That BPA engineer posts here often as the troll The Engineer. We'll see if the power stays up in Washington / Oregon.

Even Koffinsky has sent a memo around alerting the "newfound" dangers of embeddeds.

-- Ashton & Leska in Cascadia (allaha@earthlink.net), December 08, 1999.


o.k.

If he is right, I'm fine and will use my preps during next year.

If he is not, I'm prepd for the winter and some more.

Power outages will have only (only?!) very negative but short term consequences. What about SMEs, the oil market, gas shortages, unemployment and interest rates? Is our world only affected by a more polly engineer?

I've prepd

-- Rainbow (Rainbow@123easy.net), December 08, 1999.


A&L:

Why is The Engineer a troll? Because he knows what he's talking about and sees few problems, whereas you don't know squat about power generation and speak from pure ignorant fear? If your power stays up for the next couple of months, then who's the troll after all?

-- Flint (flintc@mindspring.com), December 08, 1999.


This is part of the response I received from my local power company concerning end to end testing: End-to-end testing was performed to the extent possible. This mainly pertained to internal IT systems. All interfaces and core programs were tested together to make sure all of the "hand-offs" or exchanges of data were performed correctly. Where we did not do end-to-end testing was where we have field devices that are queried by a central system and transmitted over the public switched telephone network. We did the end-to-end tests, but not in conjunction with the telephone companies (in other words, the phone companies did not rollover their equipment during those tests) and this was due to the extreme degree of difficulty in doing so. There is very little or no interfacing between systems in the plants and all systems (such as control systems or continuous emissions monitoring systems) have been tested.

Does Flint or the Engineer have a comment on this statement??

-- y2k dave (xsdaa111@hotmail.com), December 08, 1999.



<< Where we did not do end-to-end testing was where we have field devices that are queried by a central system and transmitted over the public switched telephone network. We did the end-to-end tests, but not in conjunction with the telephone companies (in other words, the phone companies did not rollover their equipment during those tests) and this was due to the extreme degree of difficulty in doing so. There is very little or no interfacing between systems in the plants and all systems (such as control systems or continuous emissions monitoring systems) have been tested. >>

Let us then hope that all such systems (and the processes that control them) do operate correctly early next year....remember, he just confirmed exactly the points that:

- are required for normal operations, are subject to failures during abnormal operations

- are used by the various systems (even in a strongly hydro-powered system)

- required absolute operation by other systems not checked by the primary company

- have not been tested - and because of the "expense and complexity" in testing that is cited as reason for NOT testing, they remain subject to failure.

- may or may not work. Certainly, I know of no credible people saying "ALL" embedded systems use dates, nor that anybody credible is saying that "ALL" will fail. At most, experience shows from every list cited over the past year that 2-3% are "affected" by dates/date-time data, or process time information that is affected. Regardless of definition, it's the entire process that must succeed for the companies to avoid losses in productivity, profits, and sales next quarter, next year, and beyond.

Of these 3% (on average) that appear to be affected, how many will fail is unknown. The effect of those failures is unknown. The secondary and tertiary effects of these potential failures are unknown.

BUT, we do know that these have not been tested - even when the process is absolutely KNOWN to rely on such automated links to other systems in other companies. I see no reason for optimism in his comments - but acknowledge that, if every company had remediated as thoroughly as possible, far fewer problems would actually occur.

Perhaps BPA will succeed/have succeeded. We don't know yet.

-- Robert A. Cook, PE (Marietta, GA) (cook.r@csaatl.com), December 08, 1999.


Gosh, here we are 12/8/99 and the answer to most questions is still, "We don't know what will happen." I was so darn sure about a year ago that certainly we would have a better idea what is going to happen the closer we get to 01/01/00. <--- yeah I know non-compliant notation, just trying to make a point here. ;)

-- Ken Seger (kenseger@earthlink.net), December 08, 1999.

Robert:

You genuinely see no reason for optimism? It's true that no business is an island, and failures outside their jurisdiction have *always* threatened them. There are limits to testing, but when you learn that an organization has reached these limits, applying all of the diligence within their own scope, and you STILL find no reason for optimism, your problem doesn't lie with the quality of the effort. Your problem lies with your own abiding pessimism.

Yes, these systems may work and they may fail. And you may be alive tomorrow and you may not. But when you imply that two outcomes are equally probable just because there are two of them, you have abandoned rationality in favor of fear. And you should know better.

I'm absolutely amazed you're willing to drive a vehicle. After all, every single oncoming car MAY veer in front of you or it MAY NOT. And you have no control over this, and no way to "test" to see if it will happen. And your life depends on it. And you drive anyway? Not very consistent, are you?

-- Flint (flintc@mindspring.com), December 08, 1999.


Flint,

To continue your car analogy: I bet Robert and most others drive with an air bag and with their seatbelt firmly fastened JUST IN CASE there is an accident. A prudent man would call that appropriate preparation for the quantifiable risk.

Re: Y2K

With very little solid information from those in the know which makes it impossible to determine what is "appropriate preparation", the prudent man would prep for something slightly greater than his expected probable scenario.

I agree with you that some optimism is warranted - I believe we in the US will not experience TEOTWAWKI but I never did expect or plan for a 9 or 10. As to a 6, 7 or 8; it depends on the aggregate reaction of the multitudes, the populace can not all be buyers or all be sellers of any item, if a critical mass chooses to respond in an identical fashion our system of systems could lock up.

I prep for an 8, expect a 7 and hope for a 6.

Best wishes to you.

-- Bill P (porterwn@one.net), December 08, 1999.


In driving, yes I've noticed that on-coming traffic "could" veer my way and kill me instantly - to date, that hasn't happened. The oncoming traffic, after all, is under the intelligent control of humans who don't want to die. (Drunks and drug users are the exception.)

To date, 3 deer and two dogs have "deliberately" crossed into the road in front of me, even after they saw me on-coming and stopped at the side of the road as they watched me approach. Result: five dead animals that could think and avoid danger.

In this case, computers don't think, don't plan, don't do "safe" alternatives - they merely walk across the road in front of you, oblivious as to whether you're coming or not. Whether you want them to avoid killing you or not... Whether you can stop, did stop, or don't stop.

The automated processes don't care...they either work correctly or they fail, or they get overwritten by their users.

Thus, I do slow excessively when I see an animal - or worse, a computer - on the side of the road that cannot think, plan, or has a reason to take precautions. I have seen too many bugs get created, take too long to get fixed, get introduced after software changes, and after seemingly innocent modification, seen too many programs simply crash to continue driving as if this will be a "bump in the road."

Bumps can kill - if you hit them at high speed.

-- Robert A. Cook, PE (Marietta, GA) (cook.r@csaatl.com), December 08, 1999.



Bottom line, until any automated process is remediated, it is subject to failure.

After it is remediated, but not adequately tested, it is still subject to failure.

After it is tested thoroughly, it is still subject to failure, but the probability of failure is greatly reduced.

--- Until it runs successfully under real world installed conditions in the actual environment that will occur in the real world next year, it is still subject to failure. After that, it probably will run most of the time when it is needed, as long as conditions don't change.

---...---...---

Be happy when the systems work correctly next year - certainly, not all will fail, but don't count on any of them working correctly the first time, or the second, or the third time....

-- Robert A. Cook, PE (Marietta, GA) (cook.r@csaatl.com), December 08, 1999.


Robert:

You're getting warmer. Yes, each stage of the process reduces the exposure a bit. But even systems with no bugs fail, so this exposure can never be reduced to zero. I've seen a lot of failures too, and I've seen "fixes" make problems worse all too often. But I can't fault a company for doing everything within their power, nor can I fault them for recognizing and admitting that not everything is under their control. There will be failures.

Bill P:

Good philosophy. I'm heavily prepared, and expect very little that goes wrong will affect me much. I tend to overinsure anyway.

-- Flint (flintc@mindspring.com), December 08, 1999.


This so-called power engineer thinks he knows more than NIST, IEEE, and almost everyone. Perhaps he has worked as an engineer, but I get the impression that he is more of a PR person than an engineer.

-- Dave (dannco@hotmail.com), December 08, 1999.

Flint:

But I can't fault a company for doing everything within their power, nor can I fault them for recognizing and admitting that not everything is under their control. There will be failures.

Oh, would that it were true!
I think one of the biggest problems has been the denial by your said Y2K resolution/remediation 'implementors' of the seriousness and scope of the y2k problem. Had this reality actually and adequately been acknowledged, promulgated, and addressed 4-8yrs ago, we would be in much less of a precarious, nay perilous, set of circumstances.

While I truly respect your knowledge and ethic, you seem to blindly trust the purported efforts of others (a sweeping reference I know).
Again, would that it were true, that their efforts be as noble as yours.

I do agree, that there will be failures.
More t'ords reality, I think we are in for some ugliness.
'Fonly it weren't so.

Regards,

-- faith'nhope (y2kaos@home.com), December 08, 1999.


Here is more information about how my utility company is testing embedded items. Any comments?? Note they are not testing all embedded items.

Our internal testing program included testing on the front end to determine Y2K compliance and failure mode (inconvenience, severe, or catastrophic failure mode; 95% were inconveniences with no loss of functionality, just date display issues). If we determined it necessary to remediate the device, then post remediation, verification testing was performed as well. Where we had large numbers of identical devices, we performed tests on a sampling of them. If no problems were discovered, then we did not test any more of that type and model (the key is separating larger groups of devices into smaller discrete types by model numbers, date sequences, discrete component type, etc.). If problems were found, we tested all of them, and again performed post remediation, verification testing on all of them as well.

Later on, for those devices where no problems were found or only had inconvenience issues, we also consulted the EPRI (Electric Power Research Institute) Y2K database to see if other utilities or users of that equipment were seeing the same results. In all cases, our results were consistent with others in the industry.

We also contacted all the vendors and obtained their information as well for our files, but we learned early on that vendor information was not available, unreliable, and/or inconsistent. We used it more as a check, but definitely not the ultimate or final opinion on Y2K readiness. All too often, vendors changed their finding from Y2K ready or compliant to non-compliant.

When we were performing our tests internally, we had two or three consulting firms involved as well as our own internal company experts, so we had a good mix of experience levels and checks and balances to develop and maintain as thorough an inspection and testing process as possible.

One other thing, even though we have thoroughly tested, remediated and tested as necessary, we have also been through an exhaustive contingency planning process. This process assesses not only the probability of failure but also the impact of failure. Since all of our devices have been remediated or are already Y2K ready, the probability of failure is very, very low or non-existent. However, if the impact of failure (even if it were determined to be very, very low or non-existent) of a device would potentially cause safety issues or customer impacts, we still wrote a contingency procedure (manual work-arounds) for those devices, trained personnel on how to use them, and actually tested them during one or more of our six company-wide Y2K drills.

Oh, and one final item: wherever possible, even though a device has been remediated and even though contingency procedures have been developed, we have taken one final step on a number of devices. Again, based upon impact of failure, we have set the dates on some devices so that they are already functioning as if it were the Year 2000 or beyond, and therefore those devices will not experience a 1999 to 2000 date rollover at all.

-- y2k dave (xsdaa111@hotmail.com), December 08, 1999.
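
The sample-then-escalate procedure described in the utility's reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not the utility's actual tooling: the function name `plan_tests`, the `(device_id, type_key)` tuples, and the choice to sample the first few devices of each type (rather than a random draw) are all hypothetical.

```python
from collections import defaultdict


def plan_tests(devices, sample_size, is_compliant):
    """Group devices by discrete type, test a sample of each group, and
    escalate to testing the whole group if any sampled device fails.

    devices:      iterable of (device_id, type_key) pairs (assumed format)
    sample_size:  number of devices sampled per type (assumed parameter)
    is_compliant: callable standing in for a real rollover test
    Returns a dict mapping type_key to the list of device ids tested.
    """
    groups = defaultdict(list)
    for device_id, type_key in devices:
        groups[type_key].append(device_id)

    tested = {}
    for type_key, ids in groups.items():
        # Deterministic sample for illustration; a real plan might randomize.
        sample = ids[:sample_size]
        if all(is_compliant(d) for d in sample):
            tested[type_key] = sample        # no problems: stop at the sample
        else:
            tested[type_key] = list(ids)     # problem found: test every unit
    return tested


devices = [("a1", "A"), ("a2", "A"), ("a3", "A"),
           ("b1", "B"), ("b2", "B"), ("b3", "B")]
result = plan_tests(devices, 1, lambda d: d != "b1")
print(result)
```

Note the limitation this makes visible: if the only faulty unit of a type falls outside the sample, that type is never escalated. That is why splitting devices into small, genuinely homogeneous groups matters so much in the procedure described.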



Dave (dannco@hotmail.com) wrote: Perhaps he has worked as an engineer, but I get the impression that he is more of a PR person than an engineer.

This gentleman supplied me with his name and work phone. I checked him out on BPA's phone list. He is indeed an engineer at BPA.

A&L: If he was a troll, would he give me his full name and phone number? Let's take in all the information we can get, and each make up our own minds. Lies and PR spin ought to be denounced. But this is a legitimate opinion from someone working directly on the problem. He deserves our respect and gratitude. Think about it next time you flip a light on.

-- RPGman (tripix@olypen.com), December 08, 1999.


Just a short aside. The IEEE letter was written to encourage passing liability limitations from y2k litigation. (Sorry about the alliteration, there.)

It worked too, as you may recall. But, it's no more predictive of the future than most of the BITR crowd.

Figure out how much you can deal with/afford, prep for that, and if it gets worse.... Well, no one lives forever. C'est la vie.

harl

-- harl (harlanquin@aol.hell), December 09, 1999.


The test sequence described by y2k dave is the first I've seen that was even that thorough - good job. EPRI, a private (read expensive but very effective) association of utilities, has long had a good reputation for real data in their database on equipment affected .... they also have found that the same make and model number of processor, but made at different times by different vendors, different assembly runs at the same factory, or even different times on the same assembly plant have created different results.

EPRI began its y2k support only for power companies, but several natural gas suppliers found that they had no other source of data other than the electric distribution community - and so had to use the EPRI "electric" info as a reference.

But again, notice the sampling and reliance on surveys and re-testing. It's impossible to test everything - yet everything will be affected. Even in the best possible remediation and retesting - you can't "absolutely" find all problems - you can only hope you've found enough to minimize the troubles to the ones that you yourself (your own company) can fix.

Otherwise, you're dependent on the remediation and repair efforts of others.

Kind of like a deer, standing in headlights frozen by assumptions that nothing will happen - simply hoping that the car will stop in time. But this calendar has no brakes.

----

Then there are those who have their head in the sand, exposing their rear end without even bothering to look up.

-- Robert A. Cook, PE (Marietta, GA) (cook.r@csaatl.com), December 09, 1999.


Generically, I believe that power is very likely to manage with only irregular failures at irregular times affecting irregular areas - it will be far easier to get power created than distributed though. So failures are probable, more than 50% in any area.

The news media then will decide that "y2k was nothing" since all areas failed 100% .....

Banks (retail stores/gas stations/any institution) are only as good as the phones/satellites/other power systems at the other end .... even if they themselves are okay and remediated (the banks themselves certainly are okay) - they cannot be in normal business UNTIL all areas are up and consistently running.

Then the fun really starts.

--- Most at risk?

Local governments, state agencies, schools, medical and health care, any agency that gets/receives/distributes/manages/controls state local or federal programs.

Next most at risk?

Any automated fabrication/distribution/processing/business management....or manufacturing facility that has not spent a proportional amount to the big three automakers.

-- Robert A. Cook, PE (Marietta, GA) (cook.r@csaatl.com), December 09, 1999.

