[ANSS-netops] ANSS-netops Digest, Vol 44, Issue 7

Sun Feb 24 13:07:29 UTC 2013

No NAT on the station end. Harder to say on the server end, as I am not
completely sure how the network people for the university do things. We
have a static, routable IP, which would seem to say no NAT on the server
end, but what happens in the university firewall is mysterious to say the
least.

I have had some back and forth with Ian from Reftek, and he thinks there is
likely a bug in rtpd. He is going to post something with his hypothesis
shortly. But you are right that some judicious tcpdumping might be useful.
Thanks for the tip.

Philip

On Sat, Feb 23, 2013 at 11:17 AM, David Slater <davideslater at gmail.com> wrote:

> Philip, does your RT130 traffic go through NAT on the cell modem or server
> end? Both?
>
> Have you tried using tcpdump on the server side to get a packet trace?
>
> e.g.
>   sudo tcpdump -i (serverinterface) host (dasiporhostname)
>
> It's really helpful for debugging these kinds of problems.
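
For anyone who wants to script the capture, here is a minimal sketch in Python that only assembles the tcpdump command line suggested above. The interface name, the DAS address (taken from the documentation address range) and the capture file name are placeholders, not values from this thread. The `-n` option (skip name resolution) and `-w` option (write raw packets to a file) are standard tcpdump options, added here as a suggestion rather than something Dave specified.

```python
import shlex


def build_tcpdump_command(interface, das_host, capture_file=None):
    """Return the tcpdump argument list for tracing traffic to/from one DAS."""
    # -n avoids DNS lookups so the trace is not delayed by name resolution.
    cmd = ["sudo", "tcpdump", "-n", "-i", interface]
    if capture_file is not None:
        # -w saves raw packets so the trace can be inspected later.
        cmd += ["-w", capture_file]
    # Same "host" filter as in the command above.
    cmd += ["host", das_host]
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute the real interface and DAS address.
    print(shlex.join(build_tcpdump_command("eth0", "192.0.2.10", "das.pcap")))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` if the capture should be started from a script.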
>
> I don't run NAT on any of our 8 cell routers (VZW 3G, dynamic IP); it's all
> pure routing and stable except for the antenna ice :>)
> -Dave
>
> On Sat, Feb 23, 2013 at 4:00 AM, <anss-netops-request at geohazards.usgs.gov> wrote:
>
>> Send ANSS-netops mailing list submissions to
>>         anss-netops at geohazards.usgs.gov
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://geohazards.usgs.gov/mailman/listinfo/anss-netops
>> or, via email, send a message with subject or body 'help' to
>>         anss-netops-request at geohazards.usgs.gov
>>
>> You can reach the person managing the list at
>>         anss-netops-owner at geohazards.usgs.gov
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of ANSS-netops digest..."
>>
>>
>> Today's Topics:
>>
>>    1. reftek and data stoppages (Philip Crotwell)
>>    2. Re: reftek and data stoppages [USGS] (Ian Billings)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 22 Feb 2013 11:41:11 -0500
>> From: Philip Crotwell <crotwell at seis.sc.edu>
>> To: "anss-netops at geohazards.usgs.gov"
>>         <anss-netops at geohazards.usgs.gov>
>> Cc: "Thomas J. Owens" <owens at seis.sc.edu>
>> Subject: [ANSS-netops] reftek and data stoppages
>> Message-ID:
>>         <CAGFrVcWJ+fWpRyqaPVp9nbaR2-=
>> fcbFA9+8EKNUoyymQ0G-P8A at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hi all
>>
>> We have four stations with reftek 130s on cell modems going into earthworm
>> via rtpd. I recently moved my server to new hardware and rediscovered an
>> old problem. About every day or two some of the stations stop sending data
>> even though the link is ok. I don't know what the initial cause is, maybe
>> the cell link briefly died, but at the time we check on things, the cell
>> link is fine but data is just not flowing.
>>
>> The odd thing is that clicking the "das discovery" button in the web admin
>> tool RTCC causes all the stations to start flowing again. The rediscovery
>> part is that some years ago when we first noticed this, I put in a cron job
>> to hit the "das discovery" url once every 15 minutes and the problem went
>> away. Given limited brain cells, I promptly forgot about it. Not until I
>> switched server machines, and forgot to transfer the cron job, did I
>> remember the issue.
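
A cron-driven workaround like the one described above could look roughly like the following Python sketch. The actual "das discovery" URL of RTCC is not given in this thread, so the script takes it as a command-line argument instead of guessing one. The function name, the timeout value and the exit codes are illustrative choices only.

```python
import sys
import urllib.request


def trigger_discovery(url, timeout=30, opener=urllib.request.urlopen):
    """Request the discovery URL and return the HTTP status code."""
    # "opener" is injectable so the function can be exercised without a network.
    with opener(url, timeout=timeout) as response:
        return response.status


def main(url):
    """Hit the URL once; return 0 on success and 1 on failure."""
    try:
        status = trigger_discovery(url)
    except OSError as err:  # urllib.error.URLError is a subclass of OSError
        print(f"das discovery request failed: {err}", file=sys.stderr)
        return 1
    print(f"das discovery returned HTTP {status}")
    return 0


# Only runs when exactly one argument (the URL) is supplied.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Saved under a name of your choosing, it could be run from a crontab entry with the schedule `*/15 * * * *` to match the 15-minute interval mentioned above.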
>>
>> It is very puzzling to me that if getting the stations back online is as
>> easy as clicking a url, then why in the world can't rtpd do it itself!?
>> Have any of you seen this issue? Any suggestions on ways to deal with it
>> other than a cron-based das discovery? I should say we run a mixed network
>> with other stations using either q330s or guralps, and only the refteks
>> seem to have trouble noticing that the cell link is working.
>>
>> One other puzzle is that my understanding is that the rt130s will cache up
>> to 99 minutes of data in the case of a lost connection. My experience is
>> that you get the benefit of the cache only in cases of the outage lasting
>> less than 99 minutes. If the outage is longer, then when the link comes up
>> the rt130 starts sending real time data and never sends the previous 99
>> minutes. If however the outage is less than the cache time, it will start
>> sending the old cached data first. Seems weird that a 98 minute outage
>> results in no data loss, but a 100 minute outage results in a 100 minute
>> data loss.
>>
>> We have recent, but not the absolute latest, versions of firmware, so I
>> should probably upgrade those just in case. We have stations showing this
>> issue with firmware as recent as 3.3.1, and I don't see anything in the
>> release notes that would suggest newer firmware addresses this. RTPD on
>> the server is the latest version, 2.1.9.0b.
>>
>> Here is some output of me running rtpid around the time I hit the "das
>> discovery" button. I hit the button at 11:29:00 and all the stations had
>> come back to life and were sending data within 12 seconds of the discovery
>> action.
>>
>> thanks,
>> Philip
>>
>> 2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
>> 2013:053-11:27:56 earthworm rtpid[3545] Options:
>> 2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
>> 2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
>> 2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
>> 2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
>> 2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
>> 2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
>> 2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
>> 2013:053-11:27:56 earthworm rtpid[3545] Attempting connection: localhost:2543
>> 2013:053-11:27:56 earthworm rtpid[3545] Successful connection: localhost:2543
>>
>>     ---- hit "DAS-DISCOVERY" at 11:29:00 ----
>>
>> 2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
>> 2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
>> 2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
>> 2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
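
The "within 12 seconds" figure can be read straight off the rtpid lines above. As a rough sketch, the Python below parses the `YYYY:DDD-HH:MM:SS` stamps from "Unit ... detected" lines and reports how long after the button press each unit appeared. The regular expression is inferred only from the four lines shown here, so treat it as an assumption about the log format rather than a description of everything rtpid can print.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines only (assumption, not a format spec).
LOG_LINE = re.compile(
    r"(\d{4}:\d{3}-\d{2}:\d{2}:\d{2}) \S+ rtpid\[\d+\] Unit (\w+) detected"
)
STAMP_FORMAT = "%Y:%j-%H:%M:%S"  # year, day of year, then time of day


def detection_delays(log_text, discovery_stamp):
    """Map each unit ID to seconds between the discovery and its first detection."""
    start = datetime.strptime(discovery_stamp, STAMP_FORMAT)
    delays = {}
    for match in LOG_LINE.finditer(log_text):
        stamp, unit = match.groups()
        seen = datetime.strptime(stamp, STAMP_FORMAT)
        # Keep only the first detection of each unit.
        delays.setdefault(unit, (seen - start).total_seconds())
    return delays


SAMPLE = """\
2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
"""

if __name__ == "__main__":
    for unit, seconds in detection_delays(SAMPLE, "2013:053-11:29:00").items():
        print(f"{unit}: {seconds:.0f} s after discovery")
```

For the sample lines this gives delays of 2, 3, 6 and 12 seconds for A898, A064, A872 and A900.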
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Fri, 22 Feb 2013 11:31:54 -0600
>> From: Ian Billings <i.billings at reftek.com>
>> To: Philip Crotwell <crotwell at seis.sc.edu>,
>>         anss-netops at geohazards.usgs.gov
>> Cc: "Thomas J. Owens" <owens at seis.sc.edu>
>> Subject: Re: [ANSS-netops] reftek and data stoppages [USGS]
>> Message-ID: <aec7190bf6de0fefacaf958f291bd3a3 at mail.gmail.com>
>> Content-Type: text/plain; charset="windows-1252"
>>
>> Philip,
>>
>>
>>
>> First, the issue with RTP/RTPD links sleeping after an outage. This points
>> to something at the RTPD server end. If an RTP/RTPD link is declared down,
>> the 130 will go into a sequence of sending server discovery packets to the
>> RTPD host address every 6-8 secs, in a cycle of 300 secs on and 120 secs of
>> RTP sleeping, until the link is re-established. RTPD, when sensing this
>> unconditional sync (also known as server discovery), will start to
>> negotiate with the unit's RTP to bring the link up. I would need to look at
>> the RTPD log file to see if the unconditional syncs are coming into RTPD
>> and, if so, what RTPD responds with. I suspect there is a timeout in the
>> local firewall or router handling the traffic from the 130 units to the
>> RTPD server, and sending the DAS discovery resets this timeout. A server
>> discovery from RTPD to a 130 is different from what RTPDID sends out and
>> what RTPD sends to a DAS if it receives an unconditional sync from it. If
>> you could send an RTPD log file, I could at least confirm that server
>> discoveries from the 130's are being seen by RTPD.
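
To put numbers on the discovery cycle described above, here is a small Python sketch that lists when discovery packets would be sent and finds the longest quiet period between them. It is arithmetic on the figures in this message, not measured behaviour. The assumptions are a fixed 7 second spacing (the middle of the 6-8 sec range), packets starting at the beginning of each 300 sec "on" phase, and no packets at all during the 120 sec sleep.

```python
def discovery_packet_times(total_seconds, interval=7, on_seconds=300, sleep_seconds=120):
    """Return send times (in seconds) of discovery packets under the assumed cycle."""
    times = []
    cycle_start = 0
    while cycle_start < total_seconds:
        t = cycle_start
        # Packets are sent only during the "on" part of each cycle.
        while t < cycle_start + on_seconds and t < total_seconds:
            times.append(t)
            t += interval
        # Skip the sleeping phase before the next cycle starts.
        cycle_start += on_seconds + sleep_seconds
    return times


def longest_silence(times):
    """Return the largest gap between consecutive packets (needs at least two times)."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))


if __name__ == "__main__":
    times = discovery_packet_times(900)
    print(f"longest silence: {longest_silence(times)} s")
```

Under these assumptions the longest silence is 126 seconds (last packet at 294 s, next at 420 s), so any firewall or router idle timeout shorter than that would expire during each sleeping phase. Whether that is what actually happens here is exactly what the RTPD log or a packet trace would have to show.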
>>
>>
>>
>> Second, your issue with 99 minutes of buffered data. The firmware in the
>> 130 is written in such a way that when the RTP/RTPD link is declared down,
>> the RTPD thread data will be saved to RAM for the thread's TOSS threshold,
>> in your case 99 mins. However, if the link remains down for more than the
>> TOSS threshold, the RTPD thread data saved since the link-down declaration
>> is all deleted from RAM. Again, in your case this happens at 99 mins. REF
>> TEK's logic for this, as it has been explained, is to free up RAM to handle
>> other thread data, like the Disk Thread, because the link is unlikely to
>> re-establish once the TOSS threshold has been met, and this old RTPD thread
>> data will continue to get older and therefore be of lesser value when the
>> link is re-established. If you find you have link outages lasting on
>> average 110 mins, then simply increase the TOSS threshold so most data will
>> be recovered.
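
The all-or-nothing behaviour described above can be written down as a small Python sketch. This is a simplified model of the explanation in this message, not of the firmware itself. In particular, treating an outage of exactly the threshold length as still recoverable is an assumption, and the function names are illustrative only.

```python
def recovered_minutes(outage_minutes, toss_threshold=99):
    """Minutes of buffered data delivered once the link comes back."""
    if outage_minutes < 0:
        raise ValueError("outage_minutes must not be negative")
    if outage_minutes > toss_threshold:
        # Past the threshold, everything saved since link-down is discarded.
        return 0
    return outage_minutes


def lost_minutes(outage_minutes, toss_threshold=99):
    """Minutes of data missing after the link comes back."""
    return outage_minutes - recovered_minutes(outage_minutes, toss_threshold)


if __name__ == "__main__":
    for outage in (98, 100, 110):
        print(f"{outage} min outage -> {lost_minutes(outage)} min lost")
```

With the 99 minute threshold, a 98 minute outage loses nothing and a 100 minute outage loses all 100 minutes, which matches what Philip reported. Raising the threshold to, say, 120 minutes would bring a 110 minute outage back under the limit in this model.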
>>
>>
>>
>> Again, please send me an RTPD log file that covers a time window with
>> known 130-to-central link stability but no RTP/RTPD connection, as well as
>> the part of the log that shows the before and after result of the
>> user-issued server discovery.
>>
>>
>>
>> Thanks,
>>
>> Ian Billings
>>
>> Field Technician
>>
>> *REF TEK - A Division of Trimble Navigation*
>>
>> http://support.reftek.com
>>
>> Skype ian_billings1
>>
>> PH 214 440 1265
>>
>> PH 214 440 1289 (Direct)
>>
>>
>
>
>
> --
> Sent from my iNTERNETS!!!