what's "sure kill" when talking about NFS mount option? - kill

In the following link
http://www.faqs.org/docs/Linux-HOWTO/NFS-HOWTO.html
It says a process is not killable except by a "sure kill", but what is a "sure kill"?
hard (NFS client mount option)
The program accessing a file on a NFS
mounted file system will hang when the
server crashes. The process cannot be
interrupted or killed (except by a
"sure kill") unless you also
specify intr. When the NFS server is
back online the program will continue
undisturbed from where it was. We
recommend using hard,intr on all NFS
mounted file systems.

I think it means kill -9. Note, though, that you are reading a quite old HOWTO (like most other TLDP HOWTOs, I think) from 2002; regarding NFS mount options, hard,intr are already the default, and:
The intr / nointr mount option is
deprecated after kernel 2.6.25. Only
SIGKILL can interrupt a pending NFS
operation on these kernels, and if
specified, this mount option is
ignored to provide backwards
compatibility with older kernels.
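In other words, the "sure kill" is SIGKILL (kill -9). For illustration only, a hard mount and the one signal that can still interrupt a process blocked on it might look like this (server, export path, mount point and PID are placeholders):
# intr is accepted but ignored on kernels after 2.6.25
mount -t nfs -o hard,intr server:/export /mnt/export
# SIGKILL is the only signal that interrupts a pending NFS request there
kill -9 <pid-of-stuck-process>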


Cannot successfully disconnect from remote machine using 'nohup' or 'screen'

I am trying to do some work on a remote machine and disconnect without terminating the work. I have tried both nohup and screen, but unfortunately it is not working out. After I type exit to log out, my work also terminates immediately.
I am trying to run 108 simulations on a remote machine. For that purpose I have written a script named batch.sh which runs one simulation after the other until all 108 are done. The program that actually runs a simulation launches 5 programs in 5 different terminals (using xterm -e). I run batch.sh using:
nohup bash batch.sh &
As long as I am connected everything works just fine. If I disconnect and then reconnect to check whether everything is working as it should...no joy :(
Are there any caveats I am overlooking? Possibly because my program launches other programs in external terminals?
UPDATE
If I use the suggestions of adding -oForwardX11=no to ssh and of unsetting DISPLAY before launching my script, I get these errors:
nohup: ignoring input and appending output to nohup.out
In nohup.out I have these messages:
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
Apparently your script/program is trying to launch xterm on its own. These days many systems enable X11 forwarding for their SSH client by default - as a result the DISPLAY variable is set in your shell session but becomes invalid once you disconnect. Therefore, as long as you are connected to the remote system, the xterm processes can access the X server on your local machine through the SSH connection, but die once that connection is severed.
I have occasionally encountered the same issue with Java programs that use e.g. the Java AWT subsystem to generate image files, even when there is no actual graphical window. You should first see if your program will somehow adapt if there is no X server available. One option is to disable X11 forwarding with the -oForwardX11=no option to ssh:
$ ssh -oForwardX11=no user@server.host.name
You could also try unsetting the DISPLAY environment variable before starting your script and see what happens.
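For instance, a quick check along those lines (reusing the batch.sh name from the question):
# clear DISPLAY so child processes don't try to reach the forwarded X server
unset DISPLAY
nohup bash batch.sh &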
However, if your program is launching xterm windows indiscriminately then you'd have to make it e.g. use an output file on the server instead - by modifying it, if necessary. As an added advantage, you would get rid of the network load and timing overhead involved with forwarded X connections.
If you cannot change the way your program works and you do not actually care about the output in those xterm windows, then you could try launching a virtual framebuffer X server on the remote system and have your script use that for xterm.
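A rough sketch of that approach, assuming Xvfb is installed on the remote system (the display number :99 is arbitrary):
# start a virtual framebuffer X server and point the script's xterms at it
Xvfb :99 -screen 0 1024x768x16 &
export DISPLAY=:99
nohup bash batch.sh &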

Which service monitors events like reset, halt, power off and in turn executes runlevel 6, 0 or 3?

When I issue power off to a device, ideally the kernel should get this event and then runlevel 0 should get executed.
As of now the hardware turns off but the runlevel 0 is not executed.
When I manually executed the script
/etc/rc.d/rc
with the runlevel value hardcoded to 0, the script worked fine and the system halted.
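For reference, on a classic sysvinit layout the runlevel is normally passed to that script as an argument, so the manual equivalent is roughly (paths vary by distribution):
# run the K*/S* links for runlevel 0 (halt)
/etc/rc.d/rc 0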
The halt and reboot services are present in the /etc/init.d directory;
your runlevel-specific links are present in the /etc/rc.d/ directory.
The rc.d directory convention is quite old.
Actually, the changing of the runlevel is done by the kernel modules; the issue in my case was that the binary responsible for issuing this reset was not packaged, as it was missing from our package mapping list.

rsync suddenly hanging indefinitely during transfers

For the past few years, I have been using an rsync one-liner to back up important folders on my Mac Mini desktop (OSX 10.9, 2.5 GHz i5, 4 GB RAM) to a FreeNAS box (0.7.2 Sabanda revision 5266, Pentium D 2.66 GHz, 822MiB RAM [reported by the system, I think there's 1 GB in there]). I am running an rsync daemon on the FreeNAS box. Recently, these transfers have been hanging indefinitely. I have done the usual Google-fu and am unable to identify the source of the problem or a solution.
The one-liner is:
rsync -rvOlt --exclude '.DS_Store' \
--exclude '.com.apple.timemachine.supported' \
--delete /Volumes/Storage/Music/Albums/ 192.168.1.100::albums
I have tried enabling -vvv and --progress, but there is no pattern that I can discern between what hangs and what doesn't. Heck, if I retry, the same file might hang at a different point during the transfer or not at all. A dry run (-n) does not always succeed either. The only "success" I've had is implementing a timeout (--timeout=10) and rerunning the command over and over. Eventually, I creep along, but with no guarantee of success and at a pace that is unacceptable. I've reached a point where I have one file that I can't get past.
The Mac Mini is connected to my router via 5 GHz Wi-Fi. The FreeNAS box is wired into that same router on a 100 Mbit port. When transfers are actually going, rsync --progress reports 2.5-4 MB/s. According to --progress, a hang is literally just that: no data transfer is occurring as far as I can tell.
I need help with both the diagnostics and the solution.
I was having the same problem. Removing -v didn't work for me. My use-case is slightly different in that I'm going from source (EXT4) to ExFAT. The issue for me was that rsync was attempting to preserve device files and permissions, which ExFAT doesn't support. I was using the -hrltDvaP switches. The -D and -a switches seemed to be my problem. The -a switch translates to -rlptgoD (no -H,-A,-X). The -p, -g, and -o switches seemed to be my root cause as rsync was barfing on one or all of those during runtime. Removing -a and specifying -Prltvc switches explicitly is working for me.
bkupcmd="nice -n$nicelevel /usr/bin/rsync -Prltvc --exclude-from=/var/tmp/ignorelist "
I've been running into the same thing again and again and it seems to help if you drop the -v option (which is annoying if you need that output).
Try using --whole-file / -W.
This option disables the rsync delta-transfer algorithm.
That is what worked for us (WSL to OSX).
Our full sync flags were -avWPle
(e was there because we were using ssh, and that has to be the last flag).
This happened to me when the remote device ran out of space. The error wouldn't show when the --verbose option was used; turning that off yielded some STDERR output that explained that the remote device was out of space. When I freed some space, I was able to run rsync again with --verbose and everything went fine.
I am using openSUSE 13.2 Linux, rsync version 3.1.1-2.4.1.x86_64, and I experienced similar problems, doing an rsync between my laptop and an external hard disk, with the destination device definitely having enough free space.
I thought I got an improvement omitting option -v, but after 10 minutes it was hanging again: strace said:
select(5, [], [4], [], {60, 0}) = 0 (Timeout)
And with "iotop" I counld see confirm that the rsync processes did no significant disk IO any more.
Neither removing the -v option nor limiting the bandwidth using --bwlimit fixed the problem.
Just had a similar problem while doing rsync from a hard disk to a FAT32 USB drive. rsync froze in less than a second in my case and did not react at all after that ... I killed it with CTRL+C.
Found out that the problem was a combination of usage of hardlinks on the harddisk and having FAT32 filesystem on the USB drive, which does not support hardlinks.
Formatting the USB drive with ext4 solved the problem for me.
In my situation rsync was not actually failing.
I have regular server backups which transfer large files (500 GB+) and have --append-verify or --checksum specified over ssh.
What I found upon analysis is that once the client side completes its file checks, the server-side checks start. This means that while the server is doing its checks, the client side will appear hung and frozen; run htop on the server to see rsync working away.
This is likely a non-issue if rsync is run in daemon mode on the server, using the rsync protocol instead of ssh for transfers.
On a related note, this very LONG wait would trigger an SSH timeout and an "rsync: connection unexpectedly closed (254 bytes received so far) [sender]" error message; the solution is to add ClientAliveInterval 120 and ClientAliveCountMax 720 to /etc/ssh/sshd_config.
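Those two directives go in the sshd configuration on the receiving server, e.g.:
# /etc/ssh/sshd_config on the server: probe the client every 120 s and allow
# up to 720 missed probes, so long checksum phases don't drop the connection
ClientAliveInterval 120
ClientAliveCountMax 720
Remember to reload or restart sshd afterwards for the change to take effect.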
I've seen this quite often on 3.0.9 on a directory with hardlinks, but it also happened on 3.1.3.
There is a nice analysis in Debian bug 820916: when its internal sockets are congested with errors, rsync could go into a deadlock.
This might have been fixed in a 3.2 release just a few days ago (Jun 2020):
Avoid a hang when an overabundance of messages clogs up all the I/O buffers.
The only good workaround I can think of is, if the problem is not persistent, then put timeout in front of it: timeout rsync <args> <source> <destination>, then retry. If it is persistent for you, you're the lucky one who can debug it :D
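A minimal sketch of that workaround, reusing the one-liner from the question (the 300-second limit is an arbitrary choice):
# keep retrying, aborting any attempt that exceeds the time limit
until timeout 300 rsync -rvOlt --delete /Volumes/Storage/Music/Albums/ 192.168.1.100::albums; do
    echo "rsync timed out or failed, retrying..." >&2
    sleep 10
done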
It also happens when the user on the target machine does not have write permissions on the target folder.
You can try giving write permission for others on the target folder:
sudo chmod -R o+w /path/to/target-folder
In my case, it was the IPC (Intrusion Protection Component) in our firewall. It sees all the TCP SYN packets as a flood attack and kills the connection. I left an rsync-over-NFS session open and turned off the IPC in the server's firewall rule, and it started working again right away.
rsync -ravh /source /destination
When it happened I was not able to kill the rsync session. It locked up the NFS mount and I would have to reboot the client machine to get it to work again. The strange thing is it would copy some files over and then all of a sudden stop. It always seemed to stop on the same file, so I was looking for file issues, permission issues, TCP offloading issues, and tried removing the -v in the rsync call. In my case it even happened with a simple:
cp -rp /source /destination
So I knew then to start looking at other factors. If you have any sort of intrusion protection on a firewall or router between the servers, try turning it off temporarily to see if it solves your issue as well.
Most likely not "your" problem, but I stumbled upon this question when I was researching a similar behavior:
I'm observing "hanging" when the target site has too much io load. e.G. on one of my small business servers, when someone is resyncing his IMAP account and downloading large batchs of data and a backup job runs that writes his data.
In this situation I notice a steep drop in performance for rsync. Noticeable in a high load value in top on the target machine, even though CPU and Mem are fine.
Waiting for the process to finish has helped every time or interrupting and attempting the rsync at a later time again.
I was having the same problem and it was because I was running out of memory during the rsync. Created a swap file and problem solved.
Had an rsync hanging issue on Ubuntu 16. None of the options above helped. The problem was in the source drive (an external SSD), which had suddenly become faulty. I tried several disk checks, but all of them got stuck. I ended up rebooting the system and the disk suddenly became accessible again.
Holger Ohmacht aka h8ohmh / 8ohmh:
As far as I could investigate, the problem lies in the filesystem buffering / the interworking of the hard disk and hardware.
Temporary solution for local drives (e.g. USB3 <-> HD): a script that polls for changing disk space. If the free disk space is not changing, rsync is stalled and has to be restarted:
cmd="rsync -aW --progress --stats --preallocate --super \
<here your source dir> \
<here your dest dir>"
eval "$cmd" &
rm ./ndf.txt
rm ./odf.txt
while [[ 0 == 0 ]]; do
df > ./ndf.txt
cmp ./odf.txt ./ndf.txt
res="$?"
echo "$res"
if [[ $res == 0 ]]; then
echo "###########################################"
ls -al "./ndf.txt"
ls -al "./odf.txt"
killall rsync
eval "$cmd" &
else
cp ./ndf.txt ./odf.txt
fi
sleep 60
done
Change <source dir> etc. to your paths!
In my case it always stalls when rsync's --preallocate option is used (normally chosen for better disk performance and to reserve contiguous blocks), so until the disk and filesystem drivers are reworked, this workaround is all there is.

device /dev/ttyusb0 lock failed: operation not permitted

I was playing around with a router earlier this evening using minicom and I must not have closed it cleanly. Here is the error message that I get when I try to open minicom:
device /dev/ttyusb0 lock failed: operation not permitted
I have two questions: 1) how would I go about getting out of this state, and 2) how do I exit minicom cleanly so that I can avoid this happening again?
I found I was able to fix the situation on my CentOS box by running minicom -S <device> -o and then doing the normal exit key sequence (CTRL-A, X).
In your situation it would have been
sudo minicom -S ttyusb0 -o
This cleared the lock files minicom had placed in /var/lock/
Good luck
Ash
I ran into a similar issue using gtkterm from a remote terminal. I had shut down the terminal without explicitly terminating gtkterm. The result was that subsequent gtkterm sessions gave me the error:
Device /dev/ttyUSB0 is locked.
Checking the process list via ps did not show any gtkterm processes still running.
I corrected this by simply deleting /run/lock/LCK..ttyUSB0. After doing that, gtkterm was able to open ttyUSB0 successfully.
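In shell terms that was simply (the lock file name is the one mentioned above; root may be needed):
sudo rm /run/lock/LCK..ttyUSB0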
[root@edge-tc lock]# minicom
Device /dev/ttyUSB0 lock failed: Operation not permitted.
Solution: check which process holds the lock and kill it:
[root@edge-tc lock]# fuser /dev/ttyUSB0
/dev/ttyUSB0:        18328
[root@edge-tc lock]# kill -9 18328
[root@edge-tc lock]# minicom
Welcome to minicom 2.1
The canonical way is to use lockdev. This manages the lock files on a per-device basis in /run/lock/lockdev/ (at least under CentOS 7.x).
lockdev <device> can be used without being root, and returns non-zero if the device has already been locked, in which case it can be unlocked with lockdev -u <device>.
This is apparently obsolete these days, but minicom (at least as of version 2.6.2) still uses it.
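A short sketch of that usage, with the device node from this question as the example:
# returns non-zero if the device is already locked
lockdev /dev/ttyUSB0 || echo "already locked"
# release the lock
lockdev -u /dev/ttyUSB0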

What is the best way to run ntpdate at reboot, only after network is ready

I'm using a BeagleBone, and since it has no built-in RTC and battery backup, it loses the date on every reboot. I can easily set the date with the command:
/usr/bin/ntpdate -b -s -u pool.ntp.org
But if the power goes out and comes back on for the house, for example, then the time is lost. The solution that comes with the latest BeagleBone Angstrom Linux distribution is to put in a crontab line that updates the time every half hour. But I would prefer to just run the command once on power-up.
I tried putting the command listed above in crontab with the @reboot line, but I believe that ran before the network was configured, or something else failed, since it didn't give me the right time when I pulled the power from the BeagleBone for 5 minutes and plugged it back in.
Is there some way to use ifconfig or something like that to run a script from init.d only after network is available?
opkg install ntp-systemd
systemctl enable ntpdate.service
systemctl enable ntpd.service
Edit /etc/ntp.conf and comment out the following lines (no fallback to a hardware clock that doesn't exist, and because the ntpdate service uses the "ntpd -q" command):
#server 127.127.1.0
#fudge 127.127.1.0 stratum 14
Two services are installed:
/lib/systemd/system/ntpd.service:
[Unit]
Description=Network Time Service
After=network.target
[Service]
Type=forking
PIDFile=/run/ntpd.pid
ExecStart=/usr/bin/ntpd -p /run/ntpd.pid
/lib/systemd/system/ntpdate.service:
[Unit]
Description=Network Time Service (one-shot ntpdate mode)
Before=ntpd.service
[Service]
Type=oneshot
ExecStart=/usr/bin/ntpd -q -g -x
RemainAfterExit=yes
ntpd is started after the network is up (After=network.target), so the date should be continuously synchronized. BUT, as explained in the ntpd man page:
Most operating systems and hardware of today incorporate a
time-of-year (TOY) chip to maintain the time during periods when the
power is off. When the machine is booted, the chip is used to
initialize the operating system time. After the machine has
synchronized to a NTP server, the operating system corrects the chip
from time to time. In case there is no TOY chip or for some reason its
time is more than 1000s from the server time, ntpd assumes something
must be terribly wrong and the only reliable action is for the
operator to intervene and set the clock by hand. This causes ntpd to
exit with a panic message to the system log. The -g option overrides
this check and the clock will be set to the server time regardless of
the chip time. However, and to protect against broken hardware, such
as when the CMOS battery fails or the clock counter becomes defective,
once the clock has been set, an error greater than 1000s will cause
ntpd to exit anyway.
So we need to set the date before starting ntpd and this is done by the ntpdate service by executing "ntpd -q -g -x" before starting ntpd.service.
From ntpd man page:
-q Exit the ntpd just after the first time the clock is set. This behavior mimics that of the ntpdate program, which is to be retired.
The -g and -x options can be used with this option. Note: The kernel
time discipline is disabled with this option.
Another service installed on the BeagleBone interacts with the date/time:
timestamp.service
[Unit]
Description=Timestamping service
ConditionPathExists=/etc/timestamp
After=remount-rootfs.service
[Service]
RemainAfterExit=yes
ExecStart=/usr/bin/load-timestamp.sh
ExecStop=/usr/bin/load-timestamp.sh --save
This service stores the current timestamp in /etc/timestamp when it's stopped and sets the date from that timestamp when it's started. So if ntpd isn't installed, the date is set manually, and the BeagleBone is rebooted, the date is only behind by the boot duration.
Do you have the /etc/network/if-post-up.d/ directory on your target system? If so, scripts in that directory should be run when the network comes up. If not, are you using DHCP? Your DHCP client may support running scripts.
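A sketch of such a hook, using the directory mentioned above and the ntpdate invocation from the question (the file name is made up; the script must be executable):
#!/bin/sh
# /etc/network/if-post-up.d/set-clock (hypothetical file name)
# set the clock once as soon as a real interface comes up
[ "$IFACE" = "lo" ] && exit 0
/usr/bin/ntpdate -b -s -u pool.ntp.org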
