[Shake-dev] queue bug?

Peter Lombard lombard at seismo.berkeley.edu
Mon Apr 19 17:33:22 UTC 2010


I have found a bug or "feature" in the queue program, which is used to
schedule ShakeMaps within AQMS (formerly known as "CISN Software"). The
problem can occur only when a ShakeMap is being cancelled, and then only under
certain conditions. This problem exists in versions 3.2 and 3.5, and probably
all versions back to when the MySQL database was added to ShakeMap.

When queue receives a socket message to cancel a ShakeMap, it starts a
sequence of actions. First it removes the event from the "event" queue, the
list of events for which ShakeMap runs are scheduled. Then it removes the
event from the "process" queue. If the event actually was in the "process"
queue, then a 'KILL' signal is sent to the running "shake" program. If any
event in the "event" queue is ready to run, that event is started as a
separate forked process. Finally, queue will run the "cancel" program to
delete the event from web servers and from the ShakeMap database.
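
The sequence above can be sketched roughly as follows. This is only an
illustration in Python, not the actual queue code: the function name, the
dictionaries, and the action labels are all hypothetical, and the sketch
assumes that only the earliest ready event is started. It records the order
of actions instead of really signalling or forking anything.

```python
# Hypothetical sketch of the cancel sequence; none of these names come
# from the real queue source.

def handle_cancel(event_id, event_queue, process_queue, now):
    """Return the ordered actions taken for one cancel message.

    event_queue:   dict of event id -> scheduled run time (assumed layout)
    process_queue: dict of event id -> pid of the running "shake" (assumed)
    """
    actions = []

    # 1. Drop any scheduled run for this event.
    event_queue.pop(event_id, None)

    # 2. Drop the event from the "process" queue. If it was there, "shake"
    #    is running for it and gets a KILL signal.
    pid = process_queue.pop(event_id, None)
    if pid is not None:
        actions.append(("kill", pid))

    # 3. Start a scheduled event that is ready to run (assumption: the one
    #    with the earliest scheduled time).
    ready = [e for e, t in event_queue.items() if t <= now]
    if ready:
        next_event = min(ready, key=event_queue.get)
        del event_queue[next_event]
        actions.append(("fork_shake", next_event))

    # 4. Run "cancel" immediately, without waiting for anything the killed
    #    "shake" may have left running.
    actions.append(("run_cancel", event_id))
    return actions
```

Note that step 4 follows the kill in step 2 with no check on what the
killed "shake" had already started.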

The problem is that sending a 'KILL' signal to "shake" will only terminate
"shake"; it will NOT terminate any programs that "shake" has started. Those
programs keep running, and this may leave the ShakeMap database in a state
that does not permit "cancel" to run.
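
The signal behaviour can be reproduced outside ShakeMap with a small
POSIX-only sketch. Python is used here purely for illustration, and every
name in it is hypothetical: a stand-in parent plays the role of "shake" and
a sleeping child plays the role of a program it started. The child inherits
the write end of a pipe, so the pipe reaches end-of-file only once the child
is gone. Killing the whole process group is shown for contrast only; it is
one possible approach and not necessarily one of the fixes under
consideration.

```python
import os
import select
import signal
import subprocess
import sys

# Stand-in for "shake": start a long-running child, report readiness on
# stdout, then wait for the child.
PARENT_SRC = """
import subprocess, sys
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print("ready", flush=True)
child.wait()
"""

def child_survives_kill(kill_group, wait_seconds=2.0):
    """KILL the stand-in parent (or its whole process group) and report
    whether the child it started is still running afterwards."""
    read_fd, write_fd = os.pipe()
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC],
        stdout=write_fd,          # the child inherits this as its stdout
        start_new_session=True,   # parent becomes leader of a new group
    )
    os.close(write_fd)
    try:
        os.read(read_fd, 64)      # blocks until the parent prints "ready"
        if kill_group:
            os.killpg(parent.pid, signal.SIGKILL)
        else:
            os.kill(parent.pid, signal.SIGKILL)
        parent.wait()
        # End-of-file on the pipe means every holder of the write end,
        # including the child, has exited. No EOF means it is still alive.
        readable, _, _ = select.select([read_fd], [], [], wait_seconds)
        return not readable
    finally:
        # Clean up whatever is left in the group.
        try:
            os.killpg(parent.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        os.close(read_fd)

if __name__ == "__main__":
    print("child survives KILL of parent only:", child_survives_kill(False))
    print("child survives KILL of whole group:", child_survives_kill(True))
```

The group kill reaches the child only because the child stays in the
parent's process group; a program that moves itself into a new group or
session would still be missed.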

This situation happened to me this past weekend. Here are the entries from
"shake.log", which captures stdout/stderr from queue and everything that it
runs, starting with the last entry from "grind":

grind: 0 stations flagged out this iteration
mp 2.8.6 - Peter N. Schweitzer (U.S. Geological Survey)
No errors
transfer: ----- Starting Transfer at 04/17/2010 14:29:59 -----
cancel: Can't run  until 'transfer' has been brought up to date
cancel: Couldn't make new version: sequence needs to be brought up to date
cancel: Error: couldn't set version
transfer: ----- Transfer finished at 04/17/2010 14:30:12 -----

From this you can see that queue tried to run "cancel" while "transfer" was
running. But since the version sequence was not in the appropriate state,
"cancel" would not proceed with its actions. The result was that the event did
not get cancelled. I later ran "cancel" by hand to get the event deleted.

I see a few ways to fix this problem, none of them easy or particularly
desirable. Perhaps Bruce and I should discuss this by separate email unless
others have ideas or comments.

Pete

