[EHPweb] rsync creating problems on ehpmaster and ehpbackup?

Robert Simpson simpson at usgs.gov
Fri Jun 27 22:01:34 GMT 2008


Hi Guys,

I've been trying to figure out why ehp3 and earthquake (Akamai) are  
intermittently unreachable from ehpmaster and ehpbackup, triggering  
zabbix and REQ_monitor messages.

My suspicion is that numerous simultaneous rsync jobs occasionally hog  
all of the resources on ehpmaster and ehpbackup.  At one point today  
there were 14 rsync jobs running simultaneously on ehpmaster.
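(For what it's worth, a quick way to count them is something like:

%> ps ax | grep -c '[r]sync'

where the [r] in the pattern keeps grep from counting itself.)

I've spotted three types of rsyncs so far, and there may be more: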

------------------------------
1) rsync of earthquake htdocs to ehpbackup and ehp1-4  [5 going  
simultaneously]

This is the standard update of the earthquake htdocs directory.  On a  
typical day, judging from the logfile at /var/log/rsync-transfer.log,  
there are roughly 3340 rsync jobs of this sort, transferring something  
like 300,000 files in total.

Recommendations:
    A)  Could these jobs be run sequentially instead of concurrently?  
This is presumably all static content, not real-time info.
    B)  Instead of doing 5 blanket rsyncs of the whole htdocs  
directory, couldn't just the files that have changed since the last  
update be transferred?  A single find at the start of each update  
could build that list; a rough sketch follows below.
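Something along these lines might work, one destination at a time  
(just a sketch: the paths, marker file, and host list are made-up  
placeholders, and a real script would want error handling):

%> cd /www/htdocs    # wherever the master copy of htdocs lives
%> find . -type f -newer /var/run/htdocs.last > /tmp/htdocs-changed.lst
%> touch /var/run/htdocs.last    # mark the time of this scan
%> for host in ehpbackup ehp1 ehp2 ehp3 ehp4; do
       rsync -a --files-from=/tmp/htdocs-changed.lst . ${host}:/www/htdocs/
   done    # one destination at a time, not 5 at once

With --files-from, rsync only considers the listed paths instead of  
walking the whole tree, so the scan happens once locally rather than  
five times, and running the loop serially covers recommendation A as  
well.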

------------------------------
2) rsync of pager stuff to ehp1-4  [4 rsyncs going simultaneously]

Recommendations:
    C)  Could these rsyncs also be logged to a file in /var/log, the  
way the htdocs rsyncs are?
    D)  Could a list of just the changed files be made in advance, so  
that only those files get rsync'ed?  (A sketch covering both C and D  
follows.)
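If these rsyncs are launched from a script we control, rsync's  
--log-file option handles C directly, and the same find trick as  
above handles D.  The pager paths and log name here are pure guesses:

%> find /home/pager -type f -newer /var/run/pager.last > /tmp/pager-changed.lst
%> touch /var/run/pager.last    # (create the marker by hand the first time)
%> rsync -a --log-file=/var/log/rsync-pager.log \
         --files-from=/tmp/pager-changed.lst / ehp1:/
   (then ehp2, ehp3, ehp4 in turn)

Since --files-from implies --relative, the paths in the list get  
recreated as-is on the destination.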

------------------------------
3) rsync of Shake stuff from horst, graben, mesa. [3 incoming rsyncs  
going simultaneously]

Recommendations:
   E)  This seems potentially very confusing for the receiving rsync  
on ehpmaster, since the same files are being examined and manipulated  
simultaneously by 3 separate processes from 3 separate computers.  
The best solution would be for the Shake folks to send their  
information directly to ehpmaster and ehpbackup.
   F)  An interim solution might again be to use find to build a list  
of files changed since the last transmission, and to stagger the  
rsyncs slightly so they aren't all running at once (e.g. with offset  
cron schedules; see the sketch below).
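If the pushes from horst, graben, and mesa are cron-driven, offsetting  
their schedules would be enough to stagger them (the script name below  
is invented, just to show the idea):

# on horst:    0,15,30,45 * * * *  /usr/local/bin/push_shake_to_ehp
# on graben:   5,20,35,50 * * * *  /usr/local/bin/push_shake_to_ehp
# on mesa:    10,25,40,55 * * * *  /usr/local/bin/push_shake_to_ehp

That gives each sender a 5-minute window to itself instead of all  
three piling onto ehpmaster at the same moment.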

------------------------------

This made me wonder if there were some disk or network bottlenecks.   
Watching disk performance using iostat yields values like this:

%> iostat -x -d sdb 5

Device:   rrqm/s   wrqm/s    r/s    w/s  rsec/s    wsec/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await svctm  %util
sdb         0.00   902.81 202.00 202.40 1614.43   8992.38  807.21   4496.19    26.23    40.67  99.74  2.44  98.58
sdb         0.00   453.40 270.00  39.80 2161.60   3945.60 1080.80   1972.80    19.71     6.54  21.11  3.17  98.08
sdb         0.00   374.90 331.33  33.94 2652.21   3270.68 1326.10   1635.34    16.22     3.35   9.17  2.73  99.54
sdb         0.00   438.12 392.22 147.70 3136.13   4686.63 1568.06   2343.31    14.49     7.81  14.47  1.84  99.34
sdb         0.00  3262.73 232.46 256.91 2670.94  28157.11 1335.47  14078.56    62.99    77.10 157.54  2.04  99.64

(one sample every 5 seconds)


The standard advice seems to be that if await and svctm aren't  
comparable values, there's a problem: await includes the time a  
request spends waiting in the queue, while svctm is just the time the  
device takes to service it, so a big gap means requests are mostly  
sitting in line.  This looks like a big problem.  (See  
http://www.pythian.com/blogs/247/basic-io-monitoring-on-linux for an  
explanation of the fields to watch.)
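To put numbers on it: in that last sample the disk services a request  
in about 2 ms (svctm 2.04), but each request takes about 157 ms  
overall (await 157.54), so requests are spending roughly 155 ms, about  
98% of their time, waiting in a queue that is some 77 requests deep  
(avgqu-sz 77.10).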

I suspect there are also network bottlenecks when many rsync jobs  
happen to run at the same time, which may be why ehpmaster  
occasionally can't get a response from Akamai or ehp3.  And if  
ehpmaster is struggling at times when nothing special is happening,  
it's really going to have a hard time when things get busy.  It might  
also explain the meltdown of both machines on May 4th, when there was  
apparently a network problem in Denver that probably stalled a bunch  
of rsyncs on both machines.

What do you think?  Does this make any sense?

Bob
