[Shake-dev] queue bug?

Tue May 18 00:47:49 UTC 2010

Hi all,

There is a new version of 'queue' that addresses the bug Pete reported  
below. Thanks go to Pete for helping to test the new version. You can  
get the new code through the repository, as usual.

Also fixed is an earlier bug: when an event that was previously  
processed was re-alarmed with a new magnitude, but the new magnitude  
was below the minimum threshold for processing set in queue.conf, the  
event was skipped, and thus the online maps reflected an incorrect  
magnitude. Now these events will be reprocessed. This means that you  
may occasionally have an event online that is below the threshold  
you've set. If it bothers you, you'll need to manually cancel the event.
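
As a rough illustration of the new rule (this is not the actual queue code; the function and parameter names are hypothetical, and treating a magnitude equal to the threshold as processable is an assumption), the decision can be sketched as:

```python
def should_process(new_magnitude, min_magnitude, already_processed):
    """Decide whether a (re-)alarmed event should be run.

    Hypothetical names: min_magnitude stands in for the minimum
    threshold set in queue.conf.
    """
    if already_processed:
        # An event that already has maps online is always reprocessed,
        # so the maps pick up the revised magnitude even when it has
        # dropped below the threshold.
        return True
    # New events are still filtered by the threshold (assumed inclusive).
    return new_magnitude >= min_magnitude


if __name__ == "__main__":
    print(should_process(3.1, 3.5, already_processed=True))
    print(should_process(3.1, 3.5, already_processed=False))
```

Under the old behavior the first case would have been skipped, which is what left the stale magnitude online.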

I've attached the new version below for those of you still using
pre-v3.5 ShakeMap. I think it will work since that part of the code
hasn't been changed much, but do test it a bit to make sure it doesn't
die on startup or with the first alarm...

Cheers,
Bruce

-------------- next part --------------
A non-text attachment was scrubbed...
Name: queue
Type: application/octet-stream
Size: 47309 bytes
Desc: not available
URL: <http://geohazards.usgs.gov/pipermail/shake-dev/attachments/20100517/f0eead9c/attachment-0001.obj>
-------------- next part --------------

On Apr 19, 2010, at 10:33 AM, Peter Lombard wrote:

> I have found a bug or "feature" in the queue program, which is used to
> schedule ShakeMaps within AQMS (formerly known as "CISN Software"). The
> problem can occur only when a ShakeMap is being cancelled, and then only
> under certain conditions. This problem exists in versions 3.2 and 3.5, and
> probably all versions back to when the MySQL database was added to ShakeMap.
>
> When queue receives a socket message to cancel a ShakeMap, it starts a
> sequence of actions. First it removes the event from the "event" queue, the
> list of events for which ShakeMap runs are scheduled. Then it removes the
> event from the "process" queue. If the event actually was in the "process"
> queue, then a 'KILL' signal is sent to the running "shake" program. If any
> event in the "event" queue is ready to run, that event is started as a
> separate forked process. Finally, queue will run the "cancel" program to
> delete the event from web servers and from the ShakeMap database.
>
> The problem is that sending a 'KILL' signal to "shake" will only terminate
> "shake"; it will NOT terminate any programs that "shake" has started. And
> this may leave the ShakeMap database in a state that does not permit
> "cancel" to run.
>
> This situation happened to me this past weekend. Here are the log entries
> from "shake.log", the stdout/stderr entries from queue and all that it runs,
> starting with the last entry from "grind":
>
> grind: 0 stations flagged out this iteration
> mp 2.8.6 - Peter N. Schweitzer (U.S. Geological Survey)
> No errors
> transfer: ----- Starting Transfer at 04/17/2010 14:29:59 -----
> cancel: Can't run  until 'transfer' has been brought up to date
> cancel: Couldn't make new version: sequence needs to be brought up to date
> cancel: Error: couldn't set version
> transfer: ----- Transfer finished at 04/17/2010 14:30:12 -----
>
> From this you can see that queue tried to run "cancel" while "transfer" was
> running. But since the version sequence was not in the appropriate state,
> "cancel" would not proceed with its actions. The result was that the event
> did not get cancelled. I later ran "cancel" by hand to get the event deleted.
>
> I see a few ways to fix this problem, none of them easy or particularly
> desirable. Perhaps Bruce and I should discuss this by separate email unless
> others have ideas or comments.
>
> Pete
> _______________________________________________
> Shake-dev mailing list
> Shake-dev at geohazards.usgs.gov
> https://geohazards.usgs.gov/mailman/listinfo/shake-dev
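
To make the signal behavior Pete describes above concrete, here is a minimal, POSIX-only sketch. It is not taken from queue and is not necessarily the fix that went into the new version. The stand-in "parent" plays the role of "shake" and its "helper" plays the role of a program such as "transfer"; all names are hypothetical. It assumes the parent is started in its own process group, so that signalling the group reaches the helper as well.

```python
import os
import select
import signal
import subprocess
import sys

# Stand-in for a program started by "shake": announce startup, then run a while.
HELPER = "import sys, time; sys.stdout.write('x'); sys.stdout.flush(); time.sleep(30)"
# Stand-in for "shake": start the helper (passed as argv[1]) and wait for it.
PARENT = "import subprocess, sys; subprocess.run([sys.executable, '-c', sys.argv[1]])"


def start_parent():
    """Start the parent in a new session, so its PID is also its process group ID."""
    return subprocess.Popen(
        [sys.executable, "-c", PARENT, HELPER],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )


def helper_exited(proc, timeout):
    """True if the pipe reaches EOF, i.e. no process holds its write end any more."""
    fd = proc.stdout.fileno()
    ready, _, _ = select.select([fd], [], [], timeout)
    return bool(ready) and os.read(fd, 1) == b""


def cancel(kill_group):
    """Kill the parent (or its whole group); report whether the helper died too."""
    proc = start_parent()
    os.read(proc.stdout.fileno(), 1)  # block until the helper says it is running
    if kill_group:
        os.killpg(proc.pid, signal.SIGKILL)  # signal every process in the group
    else:
        os.kill(proc.pid, signal.SIGKILL)  # signal only the parent
    proc.wait()
    gone = helper_exited(proc, 1.0)
    if not gone:
        os.killpg(proc.pid, signal.SIGKILL)  # clean up the orphaned helper
    proc.stdout.close()
    return gone


if __name__ == "__main__":
    print("helper gone after killing parent only:", cancel(kill_group=False))
    print("helper gone after killing the group:", cancel(kill_group=True))
```

Killing only the parent leaves the helper running, which matches the log above where "transfer" was still active when "cancel" was attempted. Signalling the group is one possible approach; waiting for the helper to finish before running "cancel" would be another.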