[ANSS-netops] reftek and data stoppages

Philip Crotwell crotwell at seis.sc.edu
Tue Mar 5 12:53:59 UTC 2013


Hi all

Just to follow up on this issue: there are two problems. One really seems
to be a bug; the other is "operator error", i.e. me.

My error was that I was only running RTPD on the server. It turns out that
you need to have both RTPD and RTPID running all the time to avoid
connection drops. RTPD just sits listening for data to arrive. It is just
as happy if data is not flowing as it is when it is flowing. RTPID is the
thing that yells "wake up, back to work you lazy dogs" if the data flow
stops. It seems a bit strange to me that these two are not one executable,
but that is the way it is.
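
A minimal sketch of a check that both daemons are up, suitable for a cron
job or a monitoring script. It assumes the processes show up in the process
table as "rtpd" and "rtpid"; those names, and the helper names below, are
assumptions and may differ on your install.

```python
import subprocess

# Assumed process names; adjust to match how the daemons appear on your server.
REQUIRED = ("rtpd", "rtpid")

def missing_processes(running, required=REQUIRED):
    """Return the required process names that are absent from `running`."""
    running_set = {name.strip().lower() for name in running}
    return [name for name in required if name not in running_set]

def list_running():
    """Return the command names of all processes, using POSIX `ps -e -o comm=`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling missing_processes(list_running()) returns an empty list when both
are running, and otherwise the names that need to be started.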

The actual bug occurs on a restart of the server. There seems to be a case
where the systems restart successfully but without any data flowing and
RTPID doesn't start yelling. The result is that no data arrives unless you
kick the system by hitting "das-discovery" in the web-based RTCC. Ian says
that Reftek is looking into making RTPID a little more vocal at startup to
avoid this case in the next version.

So keep RTPID running, be careful on restarts, and all should be well.
Philip




On Fri, Feb 22, 2013 at 11:41 AM, Philip Crotwell <crotwell at seis.sc.edu> wrote:

>
> Hi all
>
> We have four stations with reftek 130s on cell modems going into earthworm
> via rtpd. I recently moved my server to new hardware and rediscovered an
> old problem. About every day or two some of the stations stop sending data
> even though the link is ok. I don't know what the initial cause is, maybe
> the cell link briefly died, but at the time we check on things, the cell
> link is fine but data is just not flowing.
>
> The odd thing is that clicking the "das discovery" button in the web admin
> tool RTCC causes all the stations to start flowing again. The rediscovery
> part is that some years ago when we first noticed this, I put in a cron job
> to hit the "das discovery" url once every 15 minutes and the problem went
> away. Given limited brain cells, I promptly forgot about it. Not until I
> switched server machines, and forgot to transfer the cron job, did I
> remember the issue.
>
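
For reference, the cron stopgap described above could look roughly like the
sketch below. The URL is a placeholder only, since the real RTCC
das-discovery address is not given in this thread, and the function name is
made up for illustration.

```python
import urllib.request

# Placeholder: substitute the das-discovery URL of your own RTCC install.
DISCOVERY_URL = "http://rtcc.example.invalid/das-discovery"

def trigger_discovery(url=DISCOVERY_URL, timeout=30, opener=urllib.request.urlopen):
    """Issue one GET to the discovery URL; return True on an HTTP 200 reply."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are socket timeouts.
        return False

# Hypothetical crontab entry to run a script that calls this every 15 minutes:
# */15 * * * * /usr/bin/python3 /path/to/das_discovery.py
```

The opener argument exists only so the function can be exercised without a
live server.
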
> It is very puzzling to me that if getting the stations back on line is as
> easy as clicking a url, then why in the world can't rtpd do it itself!?!??
> Have any of you seen this issue? Any suggestions on ways to deal with it
> other than a cron based das discovery? I should say we run a mixed network
> with other stations using either q330s or guralps, and only the refteks seem
> to have trouble noticing that the cell link is working.
>
> One other puzzle is that my understanding is that the rt130s will cache up
> to 99 minutes of data in the case of a lost connection. My experience is
> that you get the benefit of the cache only in cases of the outage lasting
> less than 99 minutes. If the outage is longer, then when the link comes up
> the rt130 starts sending real time data and never sends the previous 99
> minutes. If however the outage is less than the cache time, it will start
> sending the old cached data first. Seems weird that a 98 minute outage
> results in no data loss, but a 100 minute outage results in a 100 minute
> data loss.
>
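
To make the arithmetic concrete, here is a small model of the behavior as
observed above. It is not taken from Reftek documentation; the 99 minute
figure and the all-or-nothing rule come only from the description in this
message, and what happens at exactly 99 minutes is a guess.

```python
CACHE_MINUTES = 99  # cache depth as described above, not a documented constant

def minutes_lost(outage_minutes, cache_minutes=CACHE_MINUTES):
    """Minutes of data lost for an outage, under the observed behavior.

    Outages shorter than the cache are backfilled completely; longer
    outages lose the whole gap because the cached data is never sent.
    """
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    if outage_minutes < cache_minutes:
        return 0
    return outage_minutes

for outage in (98, 100):
    print(outage, "minute outage ->", minutes_lost(outage), "minutes lost")
```
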
> We have recent, but not the absolute latest versions of firmware, so I
> should probably upgrade those just in case. We have stations showing this
> issue with firmware as recent as 3.3.1 and I don't see anything in the
> release notes that would suggest newer firmware addresses this. RTPD on the
> server is the latest version, 2.1.9.0b.
>
> Here is some output of me running rtpid around the time I hit the "das
> discovery" button. I hit the button at 11:29:00 and all the stations had
> come back to life and were sending data within 12 seconds of the discovery
> action.
>
> thanks,
> Philip
>
> 2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
> 2013:053-11:27:56 earthworm rtpid[3545] Options:
> 2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
> 2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
> 2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
> 2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
> 2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
> 2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
> 2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
> 2013:053-11:27:56 earthworm rtpid[3545] Attempting connection: localhost:2543
> 2013:053-11:27:56 earthworm rtpid[3545] Successful connection: localhost:2543
>
>     ---- hit "DAS-DISCOVERY" at 11:29:00 ----
>
> 2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
> 2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
> 2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
> 2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
>
>
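
The recovery times can be pulled out of the rtpid log with a short script.
This is only a sketch: it assumes the timestamp layout seen above
(year:day-of-year-HH:MM:SS) and the "Unit XXXX detected" wording, and the
function name is made up for illustration.

```python
import re
from datetime import datetime

STAMP_FORMAT = "%Y:%j-%H:%M:%S"  # year:day-of-year-HH:MM:SS, as in the log above
DETECTED = re.compile(
    r"(\d{4}:\d{3}-\d{2}:\d{2}:\d{2}) \S+ rtpid\[\d+\] Unit (\w+) detected"
)

def detection_delays(log_text, discovery_stamp):
    """Map each detected unit to seconds elapsed since the discovery request."""
    start = datetime.strptime(discovery_stamp, STAMP_FORMAT)
    delays = {}
    for stamp, unit in DETECTED.findall(log_text):
        seen = datetime.strptime(stamp, STAMP_FORMAT)
        delays[unit] = int((seen - start).total_seconds())
    return delays

LOG = """\
2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
"""

print(detection_delays(LOG, "2013:053-11:29:00"))
# {'A898': 2, 'A064': 3, 'A872': 6, 'A900': 12}
```
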