[EHPweb] QDDS failure

Tue Jun 2 23:10:18 GMT 2009

I checked a few QDDS clients (leaves) and they are up and running.  However, their notion of the last message id# sent by the 3 hubs is incorrect. When the two (3?) QDDS hubs crashed, they reset their counters to zero, so that when I restarted them, their "current message id#" started at zero. Meanwhile, the clients stayed up, and their notion of the hubs' current message id# is 349,682 for qdds1 and 178,940 for iris.  This means that a client will never re-request a message from a hub until that hub's message id# exceeds the value it had before the crash. The last time this happened was 9/8/2008, so that would probably take about 9 months.  That's too long to wait.
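To make the problem concrete, here is a rough sketch of the comparison and of the 9-month estimate. It is not QDDS code: the function names are hypothetical, and it assumes the qdds1 counter started at zero on 9/8/2008 and that the hub issues ids at a steady rate.

```python
from datetime import date


def needs_rerequest(saved_max_received, hub_current_id):
    """Hypothetical leaf-side check: a leaf only asks a hub for missed
    messages when the hub's current id is ahead of the leaf's saved maximum."""
    return hub_current_id > saved_max_received


def days_until_hub_catches_up(saved_max_received, hub_current_id,
                              counter_started, crash_day):
    """Estimate how long a reset hub needs to pass the leaf's saved maximum,
    assuming ids are issued at the same average rate as before the crash."""
    elapsed_days = (crash_day - counter_started).days
    ids_per_day = saved_max_received / elapsed_days
    return (saved_max_received - hub_current_id) / ids_per_day


if __name__ == "__main__":
    saved_max = 349_682          # leaf's notion of qdds1's last message id#
    hub_id_after_restart = 0     # counter reset by the crash

    # The leaf sees nothing to re-request while the hub is "behind" it.
    print(needs_rerequest(saved_max, hub_id_after_restart))

    days = days_until_hub_catches_up(
        saved_max, hub_id_after_restart,
        counter_started=date(2008, 9, 8),   # assumed start of the old count
        crash_day=date(2009, 6, 2),
    )
    print(f"about {days:.0f} days before re-requests resume")
```

Under those assumptions the wait is simply the time that has passed since the counter last started from zero, which is where the 9-month figure comes from.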

The solution is for me to send out an email to all QDDS clients asking them to stop QDDS, delete the file called save_max_received, and restart. I'll do that when Reston rejoins the network.
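For anyone who wants to script the middle step, a minimal sketch follows. Only the file name save_max_received comes from the procedure above; the directory argument and the function name are assumptions, and stopping and restarting QDDS is left to whatever each site normally uses.

```python
from pathlib import Path


def clear_saved_max_received(qdds_dir):
    """Delete the leaf's saved message-id state so it starts fresh.

    qdds_dir is an assumed location: pass whichever directory holds
    save_max_received on your installation. Returns True if a file was
    removed and False if there was nothing to delete.
    """
    state_file = Path(qdds_dir) / "save_max_received"
    if state_file.exists():
        state_file.unlink()
        return True
    return False


# Intended order (the stop/restart steps are site-specific and not shown):
#   1. stop QDDS
#   2. clear_saved_max_received("<your QDDS directory>")
#   3. restart QDDS
```

Run it only while QDDS is stopped, so the running process cannot write the file back before the restart.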

-David

-------------------------------------------------------
David Oppenheimer                   office:650.329.4792
U.S. Geological Survey              fax:   650.329.4732
345 Middlefield Road.-MS 977    email: oppen at usgs.gov
Menlo Park, CA 94025

-----Original Message-----
From: Christopher J Bidwell [mailto:cbidwell at usgs.gov] 
Sent: Tuesday, June 02, 2009 3:13 PM
To: David H Oppenheimer; EHP Web
Subject: Re: [EHPweb] QDDS failure

I'm getting alerts that graben and ehzeast are unreachable. 
--------------
Thanks,

Chris Bidwell, RHCT
Web Admin
Geologic Hazards Team
303-273-8642
cbidwell at usgs.gov
(Sent via Blackberry)

----- Original Message -----
From: "David Oppenheimer" [oppen at usgs.gov]
Sent: 06/02/2009 02:57 PM MST
To: <ehpweb at geohazards.usgs.gov>
Subject: [EHPweb] QDDS failure

For unknown reasons, all 3 QDDS hubs died. I've successfully restarted QDDS
at qdds1.wr.usgs.gov and dmc.iris.washington.edu. I am unable to ssh into
qdds2.er.usgs.gov. Not sure what to do about that machine. Does anyone have
a contact there who can walk up to the machine?

I don't see anything obvious in the 2 QDDS logfiles that would explain
their deaths. This has never happened before.

Thanks to Stan Schwarz and Stan Silverman for notifying me.

-David 

-------------------------------------------------------
David Oppenheimer                   office:650.329.4792
U.S. Geological Survey              fax:   650.329.4732
345 Middlefield Road.-MS 977    email: oppen at usgs.gov
Menlo Park, CA 94025

-----Original Message-----
From: ehpweb-bounces at geohazards.usgs.gov
[mailto:ehpweb-bounces at geohazards.usgs.gov] On Behalf Of Eric M Martinez
Sent: Monday, June 01, 2009 4:06 PM
To: ehpweb at geohazards.usgs.gov
Cc: Earle Paul
Subject: Re: [EHPweb] DYFI/PAGER

I've started both indexers back up at this time.  EHPMaster seems to
have stabilized a bit but there is still a massive backup running to
ehpnas which may continue to cause problems.

Thanks,
	~Eric.

On Jun 1, 2009, at 4:45 PM, Eric M Martinez wrote:

> I'm shutting down both the DYFI and PAGER indexers for the next 15
> minutes to try to stabilize EHPMaster.  Both of these processes have
> been generating a significant number of errors all day and I
> have been fighting to keep them running.  Please let me know if you
> know of any outside factors (config changes etc) that could be causing
> this.
>
> Thanks,
> 	~Eric.
>
>
>
>
> _______________________________________________
> EHPweb mailing list
> EHPweb at geohazards.usgs.gov
> https://geohazards.usgs.gov/mailman/listinfo/ehpweb

_______________________________________________
EHPweb mailing list
EHPweb at geohazards.usgs.gov
https://geohazards.usgs.gov/mailman/listinfo/ehpweb
