Failure at the London Stock Exchange due to bug

greenspun.com : LUSENET : Grassroots Information Coordination Center (GICC)

Issue date: 13 April 2000
Article source: Computer Weekly News

Emergency procedures fail the stock exchange
Tony Collins

The failure of systems at the London Stock Exchange last week was due initially to a software bug - as yet still unidentified - but compounded by weaknesses in emergency escalation procedures, Computer Weekly has learned.

The problem began with a bug in a non-critical overnight trading systems programme that purges old message logs and the previous day's market data. The batch programme usually takes about an hour to run. In the early hours of Wednesday it took four hours.

This was not a disastrous problem in itself, but all of the exchange's 300 overnight batch programmes must run one after another, and not in tandem. This time, while the first batch programme was still running, a second unrelated batch programme started, for reasons that are not yet clear.
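The constraint described here, with overnight batch programmes that must run strictly one after another, can be illustrated with a simple sequential runner. This is a hypothetical sketch, not the exchange's scheduler: the `run_batch_chain` function, the job names and the use of the usual one-hour run time as an "overrun" threshold are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical job type: a name plus a callable that does the work.
Job = Tuple[str, Callable[[], None]]


def run_batch_chain(jobs: List[Job], expected_seconds: float) -> List[str]:
    """Run jobs strictly one after another; return the names of jobs that overran.

    Each job starts only after the previous call has returned, so this runner
    never has two jobs in progress at once.
    """
    overruns = []
    for name, work in jobs:
        start = time.monotonic()
        work()  # blocks until this job has finished
        elapsed = time.monotonic() - start
        if elapsed > expected_seconds:
            # Assumption: an overrun is only recorded here, not escalated.
            overruns.append(name)
    return overruns


# Usage with two hypothetical jobs that just record their execution order.
order = []
jobs = [
    ("purge_logs", lambda: order.append("purge_logs")),
    ("prepare_day", lambda: order.append("prepare_day")),
]
print(run_batch_chain(jobs, expected_seconds=3600))  # []
print(order)  # ['purge_logs', 'prepare_day']
```

A runner like this only guarantees ordering for jobs it launches itself; it cannot stop a job started from somewhere else, which is what the article says happened.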

This caused a set of problems that had not been predicted, Chris Broad, the exchange's head of service development, said.

With the two programmes running in tandem, rather than sequentially, data was corrupted. Information from the previous day's trading became mixed with that being prepared for the coming day.
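The article does not describe the software controls the exchange introduced afterwards. One common way to stop a second batch job from starting while another is still running is an exclusive guard that each job must acquire first, refusing rather than waiting if it is held. The sketch below is a hypothetical illustration: `BatchGuard` and the job names are invented, and it uses an in-process `threading.Lock`, whereas jobs running as separate processes or on separate machines would need an inter-process mechanism such as a lock file or a database row.

```python
import threading
from typing import Callable, Optional


class BatchGuard:
    """Hypothetical guard that lets at most one batch job run at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current: Optional[str] = None  # name of the job holding the guard

    def run(self, name: str, work: Callable[[], None]) -> bool:
        """Run `work` only if no other job is in progress.

        Returns True if the job ran, False if it was refused because another
        job still held the guard.
        """
        # Non-blocking acquire: refuse at once instead of running in tandem.
        if not self._lock.acquire(blocking=False):
            return False
        self.current = name
        try:
            work()
        finally:
            # Release even if the job raises, so later jobs are not locked out.
            self.current = None
            self._lock.release()
        return True


# Usage: a second job attempted while the first is still running is refused.
guard = BatchGuard()
refused = []


def purge_logs() -> None:
    # Simulate an overlapping start attempt made while purge_logs is running.
    refused.append(not guard.run("prepare_day", lambda: None))


print(guard.run("purge_logs", purge_logs))     # True
print(refused)                                 # [True]
print(guard.run("prepare_day", lambda: None))  # True: the guard is free again
```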

One of the main lessons from the incident appears to concern the escalation procedures involving the system operators and the developers, Andersen Consulting. Escalation procedures define the actions that computer operators must take to deal with a potential emergency.

"The procedures were complied with," said one executive, "but it was the escalation procedures themselves that were found wanting."

The exchange has now introduced manual and software procedures to prevent the batch programmes overlapping. But some key questions remain unanswered:

What was the software bug and can it be replicated and therefore identified?

Why was the potential seriousness of the problem not realised sooner?

With programmes that must run sequentially rather than overlap, why had risk analyses not spotted the potentially disastrous consequences of a problem that caused two programmes to run into each other?

Andersen Consulting declined to comment on the failure.

Computer Weekly has also learned that a top audit partner at Andersen has flown in from the US to help with the inquiries.

http://www.computerweekly.co.uk/cwarchive/news/20000413/cwcontainer.asp?name=C1.HTML

-- Martin Thompson (mthom1927@aol.com), April 13, 2000

