[ANSS-netops] reftek and data stoppages [USGS]

Ian Billings i.billings at reftek.com
Fri Feb 22 17:31:54 UTC 2013


Philip,



First, the issue with RTP/RTPD links sleeping due to an outage.  This points
to something at the RTPD server end.  If an RTP/RTPD link is declared down,
the 130 goes into a sequence of sending server discovery packets to the RTPD
host address every 6-8 seconds, in a cycle of 300 seconds on and 120 seconds
with RTP sleeping, until the link is re-established.  When RTPD senses this
unconditional sync, also known as server discovery, it starts to negotiate
with the unit's RTP to bring the link up.  I would need to look at the RTPD
log file to see whether the unconditional syncs are coming into RTPD and, if
so, what RTPD responds with.  I suspect there is a timeout in the local
firewall or router handling the traffic from the 130 units to the RTPD
server, and that sending the DAS discovery resets this timeout.  A server
discovery from RTPD to a 130 is different from what RTPDID sends out and
from what RTPD sends to a DAS if it receives an unconditional sync from it.
If you could send an RTPD log file, I could at least confirm that server
discoveries from the 130s are being seen by RTPD.
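
To make the timing of that cycle concrete, here is a minimal sketch of when discovery packets would go out after a link-down declaration. It is only a model of the description above, not firmware behavior taken from any REF TEK documentation: the function name, the fixed 7-second interval (a stand-in for "every 6-8 seconds"), and the assumption that the cycle starts with the 300-second sending phase are all mine.

```python
def discovery_send_times(total_s, interval_s=7, on_s=300, sleep_s=120):
    """Return the times (seconds after link-down) at which a discovery
    packet is sent, under the assumed on/sleep cycle."""
    cycle_s = on_s + sleep_s
    times = []
    t = 0
    while t < total_s:
        if t % cycle_s < on_s:
            # Inside the "on" phase: send a packet and wait one interval.
            times.append(t)
            t += interval_s
        else:
            # Inside the sleeping phase: skip to the start of the next cycle.
            t = (t // cycle_s + 1) * cycle_s
    return times


def longest_gap(times):
    """Longest silence between two consecutive packets, in seconds."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))
```

Under these assumptions the longest silence from the 130 is a little over two minutes per cycle, which is the number to compare against any idle timeout configured in the firewall or router.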



Second, your issue with 99 minutes of buffered data.  The firmware in the
130 is written so that when the RTP/RTPD link is declared down, the RTPD
thread data is saved to RAM for up to the thread’s TOSS threshold, in your
case 99 minutes.  However, if the link remains down for longer than the TOSS
threshold, all of the RTPD thread data saved since the link-down declaration
is deleted from RAM.  Again, in your case this happens at 99 minutes.  REF
TEK’s rationale, as it has been explained, is that this frees RAM to handle
other thread data, such as the Disk Thread: the link is unlikely to
re-establish once the TOSS threshold has been met, and the old RTPD thread
data keeps getting older and is therefore of lesser value when the link is
re-established.  If you find your link outages last 110 minutes on average,
then simply increase the TOSS threshold so most data will be recovered.
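
The all-or-nothing effect of the threshold can be written down as a small model. This is a simplified sketch of the behavior described above, not the firmware logic itself; the function names are hypothetical, and treating an outage of exactly the threshold length as recoverable is an assumption.

```python
def recovered_minutes(outage_min, toss_min=99):
    """Minutes of buffered RTPD thread data sent once the link returns."""
    if outage_min <= toss_min:
        # Buffer is still in RAM, so the whole outage is back-filled.
        return outage_min
    # Threshold exceeded: the buffered data was deleted, nothing is back-filled.
    return 0


def lost_minutes(outage_min, toss_min=99):
    """Minutes of data missing at the server after the outage."""
    return outage_min - recovered_minutes(outage_min, toss_min)
```

With the default 99-minute threshold, `lost_minutes(98)` is 0 and `lost_minutes(100)` is 100, while `lost_minutes(110, toss_min=120)` is 0, which is why raising the threshold above the typical outage length recovers most of the data.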



Again, please send me an RTPD log file that covers a time window with known
130-to-central link stability but no RTP/RTPD connection, as well as the
part of the log that shows the result before and after the user-issued
server discovery.



Thanks,

Ian Billings

Field Technician

*REF TEK – A Division of Trimble Navigation*

http://support.reftek.com

Skype ian_billings1

PH 214 440 1265

PH 214 440 1289 (Direct)





*From:* ANSS-netops [mailto:anss-netops-bounces at geohazards.usgs.gov] *On
Behalf Of *Philip Crotwell
*Sent:* Friday, February 22, 2013 10:41 AM
*To:* anss-netops at geohazards.usgs.gov
*Cc:* Thomas J. Owens
*Subject:* [ANSS-netops] reftek and data stoppages





Hi all

We have four stations with reftek 130s on cell modems going into earthworm
via rtpd. I recently moved my server to new hardware and rediscovered an
old problem. About every day or two some of the stations stop sending data
even though the link is ok. I don't know what the initial cause is, maybe
the cell link briefly died, but at the time we check on things, the cell
link is fine but data is just not flowing.

The odd thing is that clicking the "das discovery" button in the web admin
tool RTCC causes all the stations to start flowing again. The rediscovery
part is that some years ago when we first noticed this, I put in a cron job
to hit the "das discovery" url once every 15 minutes and the problem went
away. Given limited brain cells, I promptly forgot about it. Not until I
switched server machines, and forgot to transfer the cron job, did I
remember the issue.

It is very puzzling to me that if getting the stations back on line is as
easy as clicking a url, then why in the world can't rtpd do it itself!?!??
Have any of you seen this issue? Any suggestions on ways to deal with it
other than a cron based das discovery? I should say we run a mixed network
with other stations using either q330s or guralps, and only the refteks seem
to have trouble noticing that the cell link is working.

One other puzzle is that my understanding is that the rt130s will cache up
to 99 minutes of data in the case of a lost connection. My experience is
that you get the benefit of the cache only in cases of the outage lasting
less than 99 minutes. If the outage is longer, then when the link comes up
the rt130 starts sending real time data and never sends the previous 99
minutes. If however the outage is less than the cache time, it will start
sending the old cached data first. Seems weird that a 98 minute outage
results in no data loss, but a 100 minute outage results in a 100 minute
data loss.

We have recent, but not the absolute latest versions of firmware, so I
should probably upgrade those just in case. We have stations showing this
issue with firmware as recent as 3.3.1 and I don't see anything in the
release notes that would suggest newer firmware addresses this. RTPD on the
server is the latest version, 2.1.9.0b.

Here is some output of me running rtpid around the time I hit the "das
discovery" button. I hit the button at 11:29:00 and all the stations had
come back to life and were sending data within 12 seconds of the discovery
action.

thanks,

Philip


2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
2013:053-11:27:56 earthworm rtpid[3545] Options:
2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
2013:053-11:27:56 earthworm rtpid[3545] Attempting connection: localhost:2543
2013:053-11:27:56 earthworm rtpid[3545] Successful connection: localhost:2543

    ---- hit "DAS-DISCOVERY" at 11:29:00 ----


2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
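
The delay between the button press and each "Unit ... detected" line can be pulled out of a log like the one above with a short script. This is a minimal sketch: the function name and the regular expression are assumptions based only on the four detection lines shown here, not on any documented RTPID log format. The timestamps are read as year, day of year, and time of day.

```python
import re
from datetime import datetime

TIME_FORMAT = "%Y:%j-%H:%M:%S"  # e.g. 2013:053-11:29:02 (year:day-of-year-time)
DETECT_RE = re.compile(
    r"^(\d{4}:\d{3}-\d{2}:\d{2}:\d{2}) \S+ rtpid\[\d+\] Unit (\w+) detected$"
)


def detection_delays(log_text, action_time):
    """Map each detected unit ID to seconds elapsed since action_time."""
    start = datetime.strptime(action_time, TIME_FORMAT)
    delays = {}
    for line in log_text.splitlines():
        match = DETECT_RE.match(line.strip())
        if match:
            seen = datetime.strptime(match.group(1), TIME_FORMAT)
            delays[match.group(2)] = int((seen - start).total_seconds())
    return delays


SAMPLE_LOG = """\
2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
"""

delays = detection_delays(SAMPLE_LOG, "2013:053-11:29:00")
```

For the excerpt above this gives delays of 2, 3, 6 and 12 seconds for A898, A064, A872 and A900, matching the "within 12 seconds" observation.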