[ANSS-netops] reftek and data stoppages

mwithers at memphis.edu mwithers at memphis.edu
Tue Mar 5 13:18:28 UTC 2013


I didn't know about rtpid either, Philip.  It looks like it might be a useful
thing to run.  The documentation says that if there's no data after 5 restart
requests, rtpid removes that DAS from its list of units to monitor.  It might
also be useful to have something that monitors the rtpid log and sends a
notification when that happens (though there are other ways to detect down
stations).
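
As a rough illustration of the log-watching idea, here is a minimal Python
sketch. The line layout (timestamp, host, rtpid[pid], message) is taken from
the rtpid output quoted further down in this thread. The wording rtpid uses
when it drops a unit is not shown anywhere here, so REMOVAL_PATTERN is only a
placeholder assumption, and find_removals is a hypothetical helper name.

```python
import re

# Line layout taken from the rtpid output quoted in this thread, e.g.
#   2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
LOG_LINE = re.compile(
    r"^(?P<stamp>\d{4}:\d{3}-\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"rtpid\[(?P<pid>\d+)\]\s+(?P<message>.*)$"
)

# ASSUMPTION: placeholder only. The real wording of the "unit removed"
# message is not known from this thread; adjust after checking a real log.
REMOVAL_PATTERN = re.compile(r"remov", re.IGNORECASE)


def find_removals(lines, pattern=REMOVAL_PATTERN):
    """Return (timestamp, message) pairs for log lines whose message matches."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and pattern.search(match.group("message")):
            hits.append((match.group("stamp"), match.group("message")))
    return hits
```

The returned list could then be handed to whatever notification mechanism is
already in place (mail, pager, and so on).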

Mitch

Center for Earthquake Research and Information (CERI)
University of Memphis                Ph: 901-678-4940
Memphis, TN 38152                   Fax: 901-678-4734


On Tue, 5 Mar 2013, Philip Crotwell wrote:

> Hi all
>
> Just to follow up on this issue: there are two problems. One really seems
> to be a bug; the other is "operator error", i.e., me.
>
> My error was that I was only running RTPD on the server. It turns out that
> you need to have both RTPD and RTPID running all the time to avoid
> connection drops. RTPD just sits listening for data to arrive. It is just
> as happy if data is not flowing as it is when it is flowing. RTPID is the
> thing that yells "wake up, back to work you lazy dogs" if the data flow
> stops. It seems a bit strange to me that these two are not one executable,
> but that is the way it is.
>
> The actual bug occurs on a restart of the server. There seems to be a case
> where the systems restart successfully but without any data flowing and
> RTPID doesn't start yelling. The result is that no data arrives unless you
> kick the system by hitting "das-discovery" in the web-based RTCC. Ian says
> that Reftek is looking into making RTPID a little more vocal at startup to
> avoid this case in the next version.
>
> So keep RTPID running, be careful on restarts, and all should be well.
> Philip
>
>
>
>
> On Fri, Feb 22, 2013 at 11:41 AM, Philip Crotwell <crotwell at seis.sc.edu> wrote:
>
>>
>> Hi all
>>
>> We have four stations with reftek 130s on cell modems going into earthworm
>> via rtpd. I recently moved my server to new hardware and rediscovered an
>> old problem. About every day or two some of the stations stop sending data
>> even though the link is ok. I don't know what the initial cause is, maybe
>> the cell link briefly died, but at the time we check on things, the cell
>> link is fine but data is just not flowing.
>>
>> The odd thing is that clicking the "das discovery" button in the web admin
>> tool RTCC causes all the stations to start flowing again. The rediscovery
>> part is that some years ago when we first noticed this, I put in a cron job
>> to hit the "das discovery" url once every 15 minutes and the problem went
>> away. Given limited brain cells, I promptly forgot about it. Not until I
>> switched server machines, and forgot to transfer the cron job, did I
>> remember the issue.
>>
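The periodic "das discovery" workaround described above could be sketched in
Python roughly as follows. The actual RTCC URL is not given in this thread,
so DISCOVERY_URL is a deliberately invalid placeholder, and
trigger_discovery is a hypothetical helper name, not part of any Reftek tool.

```python
import urllib.request

# ASSUMPTION: placeholder only. The real RTCC "das discovery" URL is not
# given in this thread and must be filled in by the operator.
DISCOVERY_URL = "http://example.invalid/das-discovery"


def trigger_discovery(url=DISCOVERY_URL, timeout=30,
                      opener=urllib.request.urlopen):
    """Request the discovery URL once; return True on an HTTP 2xx reply."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError, so
        # connection failures and HTTP error replies both end up here.
        return False
```

A scheduler such as cron could call this every 15 minutes, matching the
interval of the original job. The opener argument exists only so the function
can be exercised without a live server.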
>> It is very puzzling to me that if getting the stations back on line is as
>> easy as clicking a url, then why in the world can't rtpd do it itself!?!??
>> Have any of you seen this issue? Any suggestions on ways to deal with it
>> other than a cron based das discovery? I should say we run a mixed network
>> with other stations using either q330s or guralps, and only the refteks seem
>> to have trouble noticing that the cell link is working.
>>
>> One other puzzle is that my understanding is that the rt130s will cache up
>> to 99 minutes of data in the case of a lost connection. My experience is
>> that you get the benefit of the cache only in cases of the outage lasting
>> less than 99 minutes. If the outage is longer, then when the link comes up
>> the rt130 starts sending real time data and never sends the previous 99
>> minutes. If however the outage is less than the cache time, it will start
>> sending the old cached data first. Seems weird that a 98 minute outage
>> results in no data loss, but a 100 minute outage results in a 100 minute
>> data loss.
>>
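The cache behavior described above can be written down as a small model.
This only encodes the observation reported in this message, not documented
rt130 behavior. What happens at exactly 99 minutes is not stated, so counting
that case as a loss is an assumption, and observed_loss_minutes is a
hypothetical name.

```python
CACHE_MINUTES = 99  # cache depth as described in this message


def observed_loss_minutes(outage_minutes, cache_minutes=CACHE_MINUTES):
    """Minutes of data lost for a given outage, per the reported behavior."""
    # Outage shorter than the cache: the backlog is sent first, nothing lost.
    if outage_minutes < cache_minutes:
        return 0
    # Otherwise the unit resumes with real-time data and the whole gap is
    # lost. ASSUMPTION: an outage of exactly cache_minutes is counted here.
    return outage_minutes


print(observed_loss_minutes(98))   # 0
print(observed_loss_minutes(100))  # 100
```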
>> We have recent, but not the absolute latest versions of firmware, so I
>> should probably upgrade those just in case. We have stations showing this
>> issue with firmware as recent as 3.3.1 and I don't see anything in the
>> release notes that would suggest newer firmware addresses this. RTPD on the
>> server is the latest version, 2.1.9.0b.
>>
>> Here is some output of me running rtpid around the time I hit the "das
>> discovery" button. I hit the button at 11:29:00, and all the stations had
>> come back to life and were sending data within 12 seconds of the discovery
>> action.
>>
>> thanks,
>> Philip
>>
>> 2013:053-11:27:56 earthworm rtpid[3545] RTPID version 2.1.0.0
>> 2013:053-11:27:56 earthworm rtpid[3545] Options:
>> 2013:053-11:27:56 earthworm rtpid[3545]   Host      = localhost
>> 2013:053-11:27:56 earthworm rtpid[3545]   Port      = 2543
>> 2013:053-11:27:56 earthworm rtpid[3545]   Retry     = nonfatal
>> 2013:053-11:27:56 earthworm rtpid[3545]   Log file  = rtpid.log
>> 2013:053-11:27:56 earthworm rtpid[3545]   Verbose   = FALSE
>> 2013:053-11:27:56 earthworm rtpid[3545]   Timeout   = 60
>> 2013:053-11:27:56 earthworm rtpid[3545]   Attempts  = 9999
>> 2013:053-11:27:56 earthworm rtpid[3545] Attempting connection: localhost:2543
>> 2013:053-11:27:56 earthworm rtpid[3545] Successful connection: localhost:2543
>>
>>     ---- hit "DAS-DISCOVERY" at 11:29:00 ----
>>
>> 2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected
>> 2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected
>> 2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected
>> 2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected
>>
>>
>
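
The "within 12 seconds" figure can be checked directly from the log excerpt
above. The following minimal Python sketch assumes the rtpid timestamps are
year, day of year, and time of day (day 053 of 2013 is Feb 22, the date of
the original message); seconds_after is a hypothetical helper name.

```python
from datetime import datetime

# ASSUMPTION: rtpid timestamps are year:day-of-year-HH:MM:SS.
STAMP_FORMAT = "%Y:%j-%H:%M:%S"


def seconds_after(reference, log_line):
    """Seconds between a reference timestamp and a log line's timestamp."""
    stamp = log_line.split()[0]
    delta = (datetime.strptime(stamp, STAMP_FORMAT)
             - datetime.strptime(reference, STAMP_FORMAT))
    return int(delta.total_seconds())


# "Unit ... detected" lines copied from the log excerpt in this thread.
lines = [
    "2013:053-11:29:02 earthworm rtpid[3545] Unit A898 detected",
    "2013:053-11:29:03 earthworm rtpid[3545] Unit A064 detected",
    "2013:053-11:29:06 earthworm rtpid[3545] Unit A872 detected",
    "2013:053-11:29:12 earthworm rtpid[3545] Unit A900 detected",
]

# The unit ID is the second-to-last field of each line.
delays = {line.split()[-2]: seconds_after("2013:053-11:29:00", line)
          for line in lines}
print(delays)  # {'A898': 2, 'A064': 3, 'A872': 6, 'A900': 12}
```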


