[ANSS-netops] ANSS-netops Digest, Vol 44, Issue 7

Sat Feb 23 16:17:41 UTC 2013

Philip, does your RT130 traffic go through NAT on the cell modem end, the
server end, or both?

Have you tried using tcpdump on the server side to get a packet trace?

e.g.
  sudo tcpdump -i (serverinterface) host (dasiporhostname)

It's really helpful for debugging these kinds of problems.
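
For a longer capture it can help to skip name lookups (-n) and write the
packets to a file (-w) for later inspection. A minimal Python sketch that only
assembles such a command line; the interface name, DAS address and output file
are placeholders I made up, not values from this thread:

```python
import shlex


def build_tcpdump_cmd(interface, das_host, pcap_path=None):
    """Assemble a tcpdump command for traffic to/from a single DAS.

    -n        do not resolve addresses to names
    -i IFACE  capture on the given server interface
    -w FILE   (optional) write raw packets to FILE instead of printing them
    """
    cmd = ["sudo", "tcpdump", "-n", "-i", interface]
    if pcap_path is not None:
        cmd += ["-w", pcap_path]
    cmd += ["host", das_host]
    return cmd


if __name__ == "__main__":
    # Placeholder values: substitute the real interface and DAS address.
    print(shlex.join(build_tcpdump_cmd("eth0", "192.0.2.10", "rt130.pcap")))
```

The saved file can be read back later with tcpdump -r.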

I don't run NAT on any of our 8 cell routers (VZW 3G, dynamic IP); it's all
pure routing and stable except for the antenna ice :>)
-Dave

On Sat, Feb 23, 2013 at 4:00 AM, <anss-netops-request at geohazards.usgs.gov> wrote:

> Send ANSS-netops mailing list submissions to
>         anss-netops at geohazards.usgs.gov
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://geohazards.usgs.gov/mailman/listinfo/anss-netops
> or, via email, send a message with subject or body 'help' to
>         anss-netops-request at geohazards.usgs.gov
>
> You can reach the person managing the list at
>         anss-netops-owner at geohazards.usgs.gov
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ANSS-netops digest..."
>
>
> Today's Topics:
>
>    1. reftek and data stoppages (Philip Crotwell)
>    2. Re: reftek and data stoppages [USGS] (Ian Billings)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 22 Feb 2013 11:41:11 -0500
> From: Philip Crotwell <crotwell at seis.sc.edu>
> To: "anss-netops at geohazards.usgs.gov"
>         <anss-netops at geohazards.usgs.gov>
> Cc: "Thomas J. Owens" <owens at seis.sc.edu>
> Subject: [ANSS-netops] reftek and data stoppages
> Message-ID:
>         <CAGFrVcWJ+fWpRyqaPVp9nbaR2-=
> fcbFA9+8EKNUoyymQ0G-P8A at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi all
>
> We have four stations with reftek 130s on cell modems going into earthworm
> via rtpd. I recently moved my server to new hardware and rediscovered an
> old problem. About every day or two some of the stations stop sending data
> even though the link is ok. I don't know what the initial cause is, maybe
> the cell link briefly died, but at the time we check on things, the cell
> link is fine but data is just not flowing.
>
> The odd thing is that clicking the "das discovery" button in the web admin
> tool RTCC causes all the stations to start flowing again. The rediscovery
> part is that some years ago when we first noticed this, I put in a cron job
> to hit the "das discovery" url once every 15 minutes and the problem went
> away. Given limited brain cells, I promptly forgot about it. Not until I
> switched server machines, and forgot to transfer the cron job, did I
> remember the issue.
>
> It is very puzzling to me that if getting the stations back on line is as
> easy as clicking a url, then why in the world can't rtpd do it itself!?!??
> Have any of you seen this issue? Any suggestions on ways to deal with it
> other than a cron based das discovery? I should say we run a mixed network
> with other stations using either q330s or guralps, and only the refteks seem
> to have trouble noticing that the cell link is working.
>
> One other puzzle is that my understanding is that the rt130s will cache up
> to 99 minutes of data in the case of a lost connection. My experience is
> that you get the benefit of the cache only in cases of the outage lasting
> less than 99 minutes. If the outage is longer, then when the link comes up
> the rt130 starts sending real time data and never sends the previous 99
> minutes. If however the outage is less than the cache time, it will start
> sending the old cached data first. Seems weird that a 98 minute outage
> results in no data loss, but a 100 minute outage results in a 100 minute
> data loss.
>
> We have recent, but not the absolute latest versions of firmware, so I
> should probably upgrade those just in case. We have stations showing this
> issue with firmware as recent as 3.3.1 and I don't see anything in the
> release notes that would suggest newer firmware addresses this. RTPD on the
> server is the latest version, 2.1.9.0b.
>
> Here is some output of me running rtpid around the time I hit the "das
> discovery" button. I hit the button at 11:29:00 and all the stations had
> come back to life and were sending data within 12 seconds of the discovery
> action.
>
> thanks,
> Philip
>
> 2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
> 2013:053-11:27:56 earthworm rtpid[3545] Options:
> 2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
> 2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
> 2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
> 2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
> 2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
> 2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
> 2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
> 2013:053-11:27:56 earthworm rtpid[3545] Attempting connection:
> localhost:2543
> 2013:053-11:27:56 earthworm rtpid[3545] Successful connection:
> localhost:2543
>
>     ---- hit "DAS-DISCOVERY" at 11:29:00 ----
>
> 2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
> 2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
> 2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
> 2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
>
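
To put numbers on "within 12 seconds", here is a rough Python sketch that
pulls the "Unit ... detected" lines out of rtpid output like the above and
reports how long after the button press each unit showed up. It assumes the
timestamps are YYYY:DDD-HH:MM:SS with DDD as day of year (day 053 of 2013 is
Feb 22, which matches the message date); the function names are my own:

```python
import re
from datetime import datetime

# Matches e.g. "2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected"
LINE_RE = re.compile(
    r"^(\d{4}:\d{3}-\d{2}:\d{2}:\d{2}) \S+ rtpid\[\d+\] Unit (\w+) detected$"
)


def parse_stamp(stamp):
    """Parse a YYYY:DDD-HH:MM:SS stamp (DDD assumed to be day of year)."""
    return datetime.strptime(stamp, "%Y:%j-%H:%M:%S")


def detection_delays(lines, discovery_stamp):
    """Map unit id -> whole seconds between discovery and detection."""
    start = parse_stamp(discovery_stamp)
    delays = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            seen = parse_stamp(match.group(1))
            delays[match.group(2)] = int((seen - start).total_seconds())
    return delays


LOG = [
    "2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected",
    "2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected",
    "2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected",
    "2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected",
]

if __name__ == "__main__":
    print(detection_delays(LOG, "2013:053-11:29:00"))
    # {'A898': 2, 'A064': 3, 'A872': 6, 'A900': 12}
```
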
> ------------------------------
>
> Message: 2
> Date: Fri, 22 Feb 2013 11:31:54 -0600
> From: Ian Billings <i.billings at reftek.com>
> To: Philip Crotwell <crotwell at seis.sc.edu>,
>         anss-netops at geohazards.usgs.gov
> Cc: "Thomas J. Owens" <owens at seis.sc.edu>
> Subject: Re: [ANSS-netops] reftek and data stoppages [USGS]
> Message-ID: <aec7190bf6de0fefacaf958f291bd3a3 at mail.gmail.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Philip,
>
>
>
> Firstly, the issue with RTP/RTPD links sleeping after an outage.  This points
> to something at the RTPD server end.  If an RTP/RTPD link is declared down, the
> 130 goes into a sequence of sending server discovery packets to the RTPD
> host address every 6-8 secs, in a cycle of 300 sec on / 120 sec RTP sleeping,
> until the link is re-established.  When RTPD senses this unconditional
> sync, also known as server discovery, it starts to negotiate with the unit's
> RTP to bring the link up.  I would need to look at the RTPD log file to see
> if the unconditional syncs are coming into RTPD and, if so, what RTPD
> responds with.  I suspect there is a timeout in the local firewall or
> router handling the traffic from the 130 units to the RTPD server, and
> sending the DAS discovery resets this timeout.  A server discovery from
> RTPD to a 130 is different from what RTPDID sends out and what RTPD sends
> to a DAS if it receives an unconditional sync from it.  If you could send an
> RTPD log file I could at least confirm that server discoveries from the
> 130's are being seen by RTPD.
>
>
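
To see what that retry cycle looks like on the wire, here is a small Python
sketch of the timing as I read Ian's description: discovery packets every few
seconds for 300 s, then 120 s of silence, repeating. The 7 s spacing is just
the midpoint of his 6-8 s range and the function names are mine, so treat it
as an illustration only:

```python
def discovery_send_times(duration_s, interval_s=7, on_s=300, sleep_s=120):
    """Seconds since link-down at which a discovery packet would be sent.

    Assumes packets every interval_s seconds during each on_s window,
    followed by sleep_s seconds with no packets, repeated until duration_s.
    """
    cycle = on_s + sleep_s
    times = []
    cycle_start = 0
    while cycle_start < duration_s:
        t = cycle_start
        while t < cycle_start + on_s and t < duration_s:
            times.append(t)
            t += interval_s
        cycle_start += cycle
    return times


def longest_silence(times):
    """Largest gap in seconds between two consecutive packets."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))


if __name__ == "__main__":
    times = discovery_send_times(900)
    print(longest_silence(times))  # 126
```

Under those assumptions the unit is quiet for a bit over two minutes in every
cycle, which is the number to compare against whatever idle timeout the
firewall or router between the 130 and RTPD uses.
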
>
> Secondly, your issue with 99 minutes of buffered data.  The firmware in the
> 130 is written in such a way that when the RTP/RTPD link is declared down, the
> RTPD thread data is saved to RAM for the thread's TOSS threshold, in
> your case 99 mins.  However, if the link remains down for more than the TOSS
> threshold, all RTPD thread data saved since the link-down declaration is
> deleted from RAM.  Again, in your case this happens at 99 mins.  REF TEK's
> reasoning, as it has been explained, is that this frees up RAM to handle other
> thread data, like the Disk Thread, because the link is unlikely to
> re-establish once the TOSS threshold has been met, and the old RTPD thread
> data would only get older, and therefore of lesser value, by the time the
> link is re-established.  If you find you have link outages lasting on
> average 110 mins, then simply increase the TOSS threshold so most data will
> be recovered.
>
>
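
The TOSS behaviour described above boils down to an all-or-nothing rule. A
toy Python model, assuming an outage of exactly the threshold length is still
recovered (the thread does not pin down that boundary) and that nothing is
backfilled once the buffer has been discarded, as Philip observed:

```python
def minutes_lost(outage_min, toss_threshold_min=99):
    """Minutes of RTPD thread data lost for an outage of outage_min minutes."""
    if outage_min <= toss_threshold_min:
        # Buffer survived: cached data is sent once the link returns.
        return 0
    # Buffer was discarded at the threshold: the whole outage is lost.
    return outage_min


if __name__ == "__main__":
    for outage in (98, 100, 110):
        print(outage, minutes_lost(outage), minutes_lost(outage, toss_threshold_min=120))
```

With the default 99 minute threshold a 98 minute outage loses nothing and a
100 minute outage loses all 100 minutes; with the threshold raised to 120, a
110 minute outage is fully recovered.
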
>
> Again, please send me an RTPD log file covering a time window in which the
> 130-to-central link was known to be stable but there was no RTP/RTPD
> connection, as well as the part of the log that shows the before and after
> result of the user-issued server discovery.
>
>
>
> Thanks,
>
> Ian Billings
>
> Field Technician
>
> *REF TEK - A Division of Trimble Navigation*
>
> http://support.reftek.com
>
> Skype ian_billings1
>
> PH 214 440 1265
>
> PH 214 440 1289 (Direct)
>
>

-- 
Sent from my iNTERNETS!!!