[ANSS-netops] reftek and data stoppages

Fri Feb 22 16:41:11 UTC 2013

Hi all

We have four stations with reftek 130s on cell modems going into earthworm
via rtpd. I recently moved my server to new hardware and rediscovered an
old problem. About every day or two some of the stations stop sending data
even though the link is ok. I don't know what the initial cause is; maybe
the cell link briefly died. But by the time we check on things, the cell
link is fine and data is simply not flowing.

The odd thing is that clicking the "das discovery" button in the web admin
tool RTCC causes all the stations to start flowing again. The rediscovery
part is that some years ago, when we first noticed this, I put in a cron job
to hit the "das discovery" url once every 15 minutes and the problem went
away. Given limited brain cells, I promptly forgot about it. Not until I
switched server machines, and forgot to transfer the cron job, did I
remember the issue.
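
For anyone wanting to try the same workaround, here is a minimal sketch of
what such a cron-driven request could look like. It is not the actual cron
job described above. The URL is a placeholder (the real RTCC discovery url
is site-specific and not given in this message), and the function name,
the `opener` parameter and the timeout value are assumptions made for
illustration.

```python
import sys
import urllib.request

# Placeholder only: substitute the "das discovery" url of your own RTCC install.
DISCOVERY_URL = "http://localhost:8080/das-discovery-placeholder"


def trigger_discovery(url=DISCOVERY_URL, opener=urllib.request.urlopen, timeout=30):
    """Request the url once; return True on HTTP 200, False on any failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"das discovery request failed: {exc}", file=sys.stderr)
        return False
```

Saved in a small script that calls `trigger_discovery()`, this could be run
from a crontab entry with a `*/15 * * * *` schedule to match the 15 minute
interval mentioned above.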

It is very puzzling to me: if getting the stations back online is as easy
as clicking a url, then why in the world can't rtpd do it itself!?
Have any of you seen this issue? Any suggestions on ways to deal with it
other than a cron-based das discovery? I should say we run a mixed network
with other stations using either q330s or guralps, and only the refteks seem
to have trouble noticing that the cell link is working.

One other puzzle: my understanding is that the rt130s will cache up to 99
minutes of data in the case of a lost connection. My experience is that you
get the benefit of the cache only when the outage lasts less than 99
minutes. If the outage is longer, then when the link comes up the rt130
starts sending real time data and never sends the previous 99 minutes. If
however the outage is shorter than the cache time, it will start by sending
the old cached data first. Seems weird that a 98 minute outage results in
no data loss, but a 100 minute outage results in a 100 minute data loss.
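
To make the arithmetic concrete, here is a small sketch that models the
behaviour as observed, next to what one might naively expect from a
99 minute cache. Both functions are illustrations of the description
above, not anything taken from RefTek documentation or firmware. Treating
an outage of exactly 99 minutes as a loss is an assumption, since only the
98 and 100 minute cases are described.

```python
CACHE_MINUTES = 99  # cache depth as described above, not a documented figure


def minutes_lost_observed(outage_minutes, cache_minutes=CACHE_MINUTES):
    """Observed behaviour: all or nothing around the cache depth."""
    if outage_minutes < cache_minutes:
        return 0           # cached data is sent first, so nothing is lost
    return outage_minutes  # only real time data resumes, whole outage is lost


def minutes_lost_expected(outage_minutes, cache_minutes=CACHE_MINUTES):
    """Naive expectation: only data older than the cache depth is lost."""
    return max(0, outage_minutes - cache_minutes)


for outage in (98, 100):
    print(outage, minutes_lost_observed(outage), minutes_lost_expected(outage))
```

The gap between the two functions at 100 minutes (100 lost versus 1 lost)
is the puzzle.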

We have recent, but not the absolute latest versions of firmware, so I
should probably upgrade those just in case. We have stations showing this
issue with firmware as recent as 3.3.1, and I don't see anything in the
release notes that would suggest newer firmware addresses this. RTPD on the
server is the latest version, 2.1.9.0b.

Here is some output of me running rtpid around the time I hit the "das
discovery" button. I hit the button at 11:29:00 and all the stations had
come back to life and were sending data within 12 seconds of the discovery
action.

thanks,
Philip

2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
2013:053-11:27:56 earthworm rtpid[3545] Options:
2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
2013:053-11:27:56 earthworm rtpid[3545] Attempting connection: localhost:2543
2013:053-11:27:56 earthworm rtpid[3545] Successful connection: localhost:2543

    ---- hit "DAS-DISCOVERY" at 11:29:00 ----

2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
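
As a cross-check of the "within 12 seconds" figure, here is a short sketch
that reads the four "Unit ... detected" lines above and computes how long
after the 11:29:00 button press each unit reappeared. The timestamp layout
(year, day of year, time) is inferred from the log itself, and the function
names are made up for illustration.

```python
from datetime import datetime

LOG_LINES = """\
2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
""".splitlines()


def parse_stamp(stamp):
    """Parse a stamp such as 2013:053-11:29:02 (year:day-of-year-HH:MM:SS)."""
    return datetime.strptime(stamp, "%Y:%j-%H:%M:%S")


def detection_delays(lines, action_stamp):
    """Map unit id to seconds elapsed between the action and its detection."""
    start = parse_stamp(action_stamp)
    delays = {}
    for line in lines:
        fields = line.split()
        # Expected layout: <stamp> <host> <proc[pid]> Unit <id> detected
        if len(fields) == 6 and fields[3] == "Unit" and fields[5] == "detected":
            delays[fields[4]] = (parse_stamp(fields[0]) - start).total_seconds()
    return delays


delays = detection_delays(LOG_LINES, "2013:053-11:29:00")
for unit, seconds in delays.items():
    print(f"{unit}: {seconds:.0f} s after discovery")
```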