We may have seen a Y2K failure in a PLC

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

While operating our spillway gates (these are the gates that allow water to spill through the dam and bypass the turbines), one of the four gates stopped operating and brought up an alarm condition that should have been impossible for the type of operation it was performing.

All four gates were commanded to raise, which three of them did successfully. But #2 gate started to move, then stopped and remained at its position, and after a short time brought up a "control Failed" alarm. This alarm told us nothing that we didn't already know, and although a further command was sent, it had the same result. A technician was called in to fix the problem, and of course he wanted to see for himself just what was happening. A further command was sent to the gate while he was there, and the gate performed perfectly with no sign of any failure. He didn't believe there ever had been a fault, but he did download the last few events from the PLC to see what had happened.

To our surprise, the control failed due to an excessive "gate drift" situation. This is a method of detecting if the gate moves when it isn't meant to. On questioning the technician further as to how the software is set up, he commented that the gate position is polled every 600 ms, and if there is any movement in the position, then the PLC checks back to see if there should have been any movement, and that the movement is in the correct direction. I then pressed further as to how the 600 ms timer works, and discovered that it gets its time from an RTC in the PLC. This technician is also the one who tested these controllers for Y2K compliance, so I pressed him further on how the 600 ms time is calculated, and whether it could have been a Y2K issue. He looked startled for a second, then quickly claimed that it couldn't be Y2K as only the millisecond counter was used, and no date had ever been set in the RTC since it was installed in 1991.
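
For readers who want the drift check spelled out, here is a minimal sketch of the logic as the technician described it: poll the position, and if it changed, confirm that movement was commanded and that it went the right way. The function name, the direction codes, and the deadband parameter are all assumptions made for illustration; the actual PLC program is not shown anywhere in this thread.

```python
# Hypothetical direction codes; the real PLC program's encoding is not known.
RAISE, LOWER, HOLD = 1, -1, 0


def drift_fault(prev_pos: float, cur_pos: float, commanded: int,
                deadband: float = 0.0) -> bool:
    """Return True if the movement seen in one poll looks like gate drift."""
    delta = cur_pos - prev_pos
    if abs(delta) <= deadband:
        return False              # no movement, nothing to check
    if commanded == HOLD:
        return True               # moved when no movement was commanded
    observed = RAISE if delta > 0 else LOWER
    return observed != commanded  # moved, but in the wrong direction


# One call per poll, i.e. every 600 ms in the setup described above.
print(drift_fault(2.00, 2.05, RAISE))  # gate rising while commanded to raise
print(drift_fault(2.00, 1.95, RAISE))  # gate falling while commanded to raise
```

The first call reports no fault and the second reports drift, since the observed direction disagrees with the command.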

However on looking into it I am of the opinion that what has happened is that during its operation, the RTC has rolled over, and for a short period of time, the time calculation was negative, which would make the controller think that the gate was moving in the wrong direction. Within a couple of minutes the whole calculation would have been operating in year 00, and so no further problems.
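
The "negative time" idea can be illustrated with a small sketch. It assumes, purely for the sake of example, that the controller computes a rate of travel as position change divided by elapsed time, and that the time source is a counter that wraps at some period (60,000 ms here). Neither assumption is confirmed for this PLC.

```python
def apparent_rate(pos_then, pos_now, t_then_ms, t_now_ms):
    """Naive rate of travel: position change over signed elapsed time."""
    return (pos_now - pos_then) / (t_now_ms - t_then_ms)


def apparent_rate_wrapped(pos_then, pos_now, t_then_ms, t_now_ms, period_ms):
    """Same calculation, but elapsed time is taken modulo the counter period."""
    elapsed = (t_now_ms - t_then_ms) % period_ms
    return (pos_now - pos_then) / elapsed


# Gate rises from 2.00 to 2.05 while the counter wraps from 59,700 to 300.
naive = apparent_rate(2.00, 2.05, 59_700, 300)
fixed = apparent_rate_wrapped(2.00, 2.05, 59_700, 300, 60_000)
print(naive < 0, fixed > 0)
```

With the naive subtraction the elapsed time comes out as -59,400 ms, so a rising gate appears to be falling. Taking the difference modulo the counter period recovers the true 600 ms and the sign is correct again.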

This may have been a Y2K issue, or it may have been a self correcting random failure of a type that does happen from time to time. Maybe we'll never know. I initially posted this message on the Electricity Industry Y2K forum, but then thought that if it was a Y2K issue then it may have a greater interest than just those concerned with electricity.

Malcolm

-- Malcolm Taylor (taylorm@es.co.nz), December 16, 1999

Answers

I think you're right, and furthermore, I think it's a leading edge event that we'll be seeing an increasing amount of between now and 1/1/0 -- and for a roughly equivalent period after.

I posted something along those lines a week or two ago in a different thread. In a nutshell, my thinking was that for all the things we *don't* know, there are some things we *do* know, and they are incontrovertible.

First and foremost is a universal truth pertaining to clocks: RTCs drift. Some drift forward, some drift backward. And, they drift at different rates.

You can easily verify this by checking the RTC on your computer. Unless it's linked to an accurate time server, it's either fast, or slow, or, you recently *manually* adjusted it.

It's easy to set the date and time on a desktop computer.

It's generally *not* easy to do that on an embedded controller -- and in the case of a controller that is "officially" immune to date issues, I'd wager it's never done at all.

As to the "no date had ever been set in the RTC since it was installed" claim, all I have to say is, "What about *before* it was installed?"

It's completely within the realm of believability for an RTC to drift two weeks since 1991.
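
To put a number on that, here is a rough calculation of the drift rate that two weeks of accumulated error would imply. The eight-year service period is an assumption (the post says only "installed in 1991"), and the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400


def drift_rate(drift_days: float, years_in_service: float):
    """Return (parts per million, seconds gained or lost per day)."""
    service_days = years_in_service * 365.25
    fraction = drift_days / service_days
    return fraction * 1e6, fraction * SECONDS_PER_DAY


# Assumption: roughly late 1991 to late 1999, so about 8 years.
ppm, sec_per_day = drift_rate(14, 8)
print(f"{ppm:.0f} ppm, about {sec_per_day / 60:.1f} minutes per day")
# prints: 4791 ppm, about 6.9 minutes per day
```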

Keep an eye on the others!

Of course, if your situation is one where nothing will happen unless you're performing an action *during* the RTC's grief-time, they may have already passed.

And as to the claim that it only uses the millisecond counter, he could be wrong, or, *it* could be basing *its* results on the whole stinkin' date.

I'd like to know exactly what he meant by "millisecond counter" too. Is it a simple count of milliseconds that counts from the time the device was first powered up? If so, it's either got a *humongous* bit length, or, *it* has to periodically roll over.
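
A quick sketch of the bit-length arithmetic, for anyone who wants to check it. The function names are invented for this example, and the eight-year figure assumes the counter has run since a 1991 installation.

```python
def rollover_period_days(bits: int) -> float:
    """Days until an unsigned millisecond counter of this width wraps."""
    return (2 ** bits) / 1000 / 86_400


def min_bits_for_years(years: float) -> int:
    """Smallest counter width that can hold this many years of milliseconds."""
    total_ms = int(years * 365.25 * 86_400 * 1000)
    return total_ms.bit_length()


for bits in (16, 32):
    print(bits, "bits wraps after", rollover_period_days(bits), "days")
print("bits needed for 8 years:", min_bits_for_years(8))
```

A 16-bit counter wraps after about 65.5 seconds and a 32-bit counter after roughly 49.7 days, while counting milliseconds continuously for eight years takes 38 bits.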

Or, perhaps -- and I suspect, perhaps more *likely* -- it's a millisecond *interval* counter. You give it a value, start it up, and when the elapsed time matches the value you supplied, it fires an interrupt to alert you. So, in your application, you'd tell it to give you 600 ms, and it would "go off" 600 ms later.

If that is the case, I can *easily* see how it could be subject to failure when the RTC crosses the line past 12/31/99.

There are a few ways that I can think of that it can determine elapsed time: it can simply count "heartbeats" from a tick source, or, it can take a snapshot of the date/time, and then each time it receives a "heartbeat", it compares the current date/time against the initial value and sees if it matches the target value. Or, it can take a snapshot of a since-power-on heartbeat counter, and compare the initial heartbeat count + the target value against the current heartbeat count.
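
The three approaches can be sketched side by side. Everything here is hypothetical: the names, the 32-bit counter width, and especially the `1900 + yy` conversion, which is included only to show how a date/time comparison could go wrong at rollover. It is not a claim about how any real controller firmware works.

```python
from datetime import datetime, timedelta


class Countdown:
    """Approach 1: count heartbeats down from the target; no clock involved."""

    def __init__(self, target_ticks: int):
        self.remaining = target_ticks

    def tick(self) -> None:
        self.remaining -= 1

    @property
    def expired(self) -> bool:
        return self.remaining <= 0


def rtc_reading(yy, mo, dd, hh, mi, ss, ms) -> datetime:
    """Two-digit-year RTC reading, naively treated as 19yy (the flaw)."""
    return datetime(1900 + yy, mo, dd, hh, mi, ss, ms * 1000)


def datetime_expired(start: datetime, now: datetime, target_ms: int) -> bool:
    """Approach 2: compare the current date/time against a snapshot."""
    return (now - start) >= timedelta(milliseconds=target_ms)


def counter_expired(start_count: int, now_count: int, target_ms: int,
                    bits: int = 32) -> bool:
    """Approach 3: compare a free-running counter against a snapshot."""
    elapsed = (now_count - start_count) % (1 << bits)  # wrap-safe difference
    return elapsed >= target_ms


# 600 ms really do pass here, straddling midnight on 12/31/99.
start = rtc_reading(99, 12, 31, 23, 59, 59, 800)
now = rtc_reading(0, 1, 1, 0, 0, 0, 400)
print(datetime_expired(start, now, 600))           # False: elapsed looks negative
print(counter_expired(2**32 - 200, 400, 600))      # True: wrap handled by modulo
```

Only the date/time comparison depends on the calendar, so it is the only one of the three that can be upset by the year changing from 99 to 00. The free-running counter also wraps, but a modular subtraction absorbs that.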

Depending on the nature of the RTC, and the philosophy (and skill level) of the firmware author, it might be easier to do a date/time comparison rather than a "simple" heartbeat count (of either type).

Why? Because counting the heartbeats means dealing with interrupts, and it's probably "easier" to just loop, and compare once per pass.

Damn, I hate typing into this tiny window. It reminds me of my old TRS-80 Model 100.

-- Ron Schwarz (rs@clubvb.com.delete.this), December 16, 1999.


Malcolm,

Why don't you ask Hawk? He has an IQ of 142. I bet he could give you an answer.

-- (smiley@face.com), December 16, 1999.


Now this could cause problems in any number of places if they occur before many plants and pipelines power down for rollover. Some could be harmless like this one. Some could be catastrophic. Some could be lethal.

Are these fair statements?

-- ghost (fading into the@background.com), December 16, 1999.


This is an enormously significant thread. Thank you, Malcolm.

-- Ashton & Leska in Cascadia (allaha@earthlink.net), December 17, 1999.

This is exactly the situation I was speculating about in my post a little earlier.

LINK

I got a good reply from andy in:

LINK

where I also posted the question.

-- Interested Spectator (is@the_ring.side), December 17, 1999.



Malcolm, type testing assumptions aside, please let us know whether the other three gates malfunction in a similar way over the next few weeks. Thanks for the post and your analysis.

-- Brooks (brooksbie@hotmail.com), December 17, 1999.

Is it correct to infer that the spillway gates were in operation as part of normal functioning, and that this occurred at a time when no Y2K testing was ongoing?

-- nothere nothere (notherethere@hotmail.com), December 17, 1999.

Malcolm, could you please post here any responses from the Electricity Forum? Thank you.

-- Ashton & Leska in Cascadia (allaha@earthlink.net), December 17, 1999.

Thanks for the informative post, Malcolm.

I'm curious: Can you clarify if each gate is controlled by a separate PLC getting its time/date from a separate RTC?

I assume they have separate RTCs, but if all 4 gates share a controller, why didn't the others act oddly?

Just trying for a little differential diagnosis...

-- Lewis (aslanshow@yahoo.com), December 17, 1999.


nothere nothere: Yes, the spillway gates were in normal operation, and this was not a Y2K test. However perhaps I should add that the gates are only moved infrequently, and this type of issue would only be noticed if a Y2K fault arose while they were being moved.

Ashton & Leska: I will not copy the responses from the Electricity Forum verbatim, as it is not a good policy to publish items from a private forum onto a public one. But I will summarise the responses in a day or two.

Lewis: Each gate has its own PLC, but they share a common RTU for communicating with the SCADA. And as none of the other gates were affected, this is the reason I suspect it may be Y2K related.

Malcolm

-- Malcolm Taylor (taylorm@es.co.nz), December 17, 1999.



Relax everyone, I did jump the gun on this one. I think we are now fairly certain that this failure was NOT a Y2K issue. The PLC concerned is an Allen-Bradley MicroLogix 1500 which Rockwell list as being ready. However the reason we can now be certain that it was not Y2K related is that the same controller failed again last night with exactly the same symptoms. We can find no reason for it to measure a gate drift, which requires either negative movement (possible but wasn't happening at the time), or negative time (an impossible situation except during Y2K rollover). We have now passed this issue on to our performance engineer who is usually quite good at troubleshooting, and we'll see what he finds.

Malcolm

-- Malcolm Taylor (taylorm@es.co.nz), December 18, 1999.


Thanks for the update, Malcolm. I greatly appreciate your level-headed approach.

I can visualize a situation where a system gets confused once a day post rollover. It may only use the mm/dd/yy part at "midnite" to track events straddling midnite. The US Dept of Commerce report from Nov 22 referred to a similar situation involving "epoch" dates. If the gate behaved strangely at exactly the same time every day, could it be a Y2K issue?

I'm probably straining at gnats. This area is not my specialty, so feel free to dismiss my comments as the ravings of a stressed-out and exhausted geek who would very much like to look back on all this and laugh.

All the best-

-- Lewis (aslanshow@yahoo.com), December 18, 1999.

