shared memory read/write is slow for certain files - performance

Recently we have seen a shared memory read/write performance issue in our application.
In our application, we copy data from shared memory (/run/shm) to pinned memory to achieve higher GPU transfer throughput. We observed that the copy speed differs greatly between files of the same size.
To reproduce the issue, we copied the same source file to two files (file1 and file2) in shared memory. We then confirmed the problem by timing a copy of each file to /dev/null; the results are below.
numactl --cpubind=0 cp /run/shm/file1 /dev/null 0.05s user 0.28s system 94% cpu 0.355 total
numactl --cpubind=1 cp /run/shm/file1 /dev/null 0.11s user 0.22s system 91% cpu 0.360 total
numactl --cpubind=0 cp /run/shm/file2 /dev/null 0.08s user 2.15s system 93% cpu 2.383 total
numactl --cpubind=1 cp /run/shm/file2 /dev/null 0.16s user 5.44s system 97% cpu 5.748 total
We checked meminfo: only 50% of memory was in use. Nothing was swapped out (we had explicitly disabled swap).
We searched around but could not identify the cause. In the end we rebooted the system and the issue was gone.
Does anyone know what might be the root cause?
Thanks!
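For anyone trying to reproduce this, the measurement above can be repeated per file and per NUMA node with a small helper. This is only a sketch: the function name time_read is made up for illustration, it assumes GNU date (for %N nanoseconds), and it pins the read with numactl only if numactl is installed.

```shell
#!/bin/sh
# time_read FILE [NODE]: read FILE once to /dev/null and print elapsed milliseconds.
# If NODE is given and numactl is installed, the read runs on that node's CPUs.
# Assumes GNU date, which supports %N (nanoseconds).
time_read() {
    local file="$1" node="${2:-}" start end
    start=$(date +%s%N)
    if [ -n "$node" ] && command -v numactl >/dev/null 2>&1; then
        numactl --cpubind="$node" cat "$file" >/dev/null
    else
        cat "$file" >/dev/null
    fi
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example (commented out): compare both files on both nodes.
# for f in /run/shm/file1 /run/shm/file2; do
#     for n in 0 1; do echo "$f node $n: $(time_read "$f" "$n") ms"; done
# done
```

Running it several times per file helps separate a consistently slow file from one-off noise.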

Related

What is causing Xorg high CPU usage?

I am running the feh image viewer on Debian, and after some hours of normal CPU usage (approx. 3%), Xorg suddenly starts using much more CPU (approx. 80%) and everything runs very slowly. I am not running anything else, so the bug should be in either feh or the X server.
I am using the command "feh -z -q -D20 -R 1" (-z for random image, -q for quiet, -D20 to change the picture every 20 seconds and -R 1 to refresh the directory every second, as I erase and insert pictures pretty often).
When I run "free -m" while feh is running, before the high CPU usage starts, I get
              total        used        free      shared  buff/cache   available
Mem:            923         117         474          19         331         735
Swap:            99           0          99
And after several hours I get the same for "mem" but the used amount of "swap" is 99.
The fact that your swap fills up points to a memory leak in some program on your system. Since feh is probably not designed for this use case, I'd bet it is the cause of the system running out of memory.
The "everything runs slowly" symptom is caused by the kernel running out of memory and doing its best to keep the system running. If you insist on running feh, your choices are:
Triage the memory leak bug in feh and create a fix for it.
Try to get somebody else to do the same for you.
Periodically kill feh and run it again. In bash you can do:
while true; do timeout 120m feh -z -q -D20 -R 1; sleep 2s; done
which will kill feh every 120 minutes and restart it after a 2-second delay (the delay lets you interrupt the while loop if needed). Another choice would be to use ulimit to set the maximum amount of memory you want to allow for feh; the process will probably simply die once it uses too much.
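The ulimit variant could look roughly like this. It is a sketch, not a tested fix for feh: run_capped is a made-up helper name, the 300000 KiB figure in the example is arbitrary, and "ulimit -v" caps virtual address space rather than resident memory, so the value may need tuning.

```shell
#!/bin/sh
# run_capped LIMIT_KB CMD [ARGS...]: run CMD with its virtual memory capped.
# The subshell keeps the limit from affecting the calling shell.
run_capped() {
    local limit_kb="$1"
    shift
    (
        ulimit -v "$limit_kb" || exit 1   # cap address space, in KiB
        exec "$@"
    )
}

# Example (commented out): restart feh whenever it dies at the cap.
# while true; do run_capped 300000 feh -z -q -D20 -R 1; sleep 2; done
```

Because the limit is set inside a subshell, the surrounding loop and your login shell keep their normal limits.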
I solved this problem too, though I don't know why it happens.
You can try this command to kill the process:
ps -a | grep Xorg | awk '{print $1}' | xargs kill -9

Not able to load files larger than 100 MB into HDFS

I'm facing a really strange issue with my cluster.
Whenever I try to load any file larger than 100 MB (104857600 bytes) into HDFS, it fails with the following error:
All datanodes are bad... Aborting.
This is really strange, as 100 MB seems to have become a threshold for file size.
Even if I increase the file size by a single byte (104857601 bytes) and try to load it into HDFS, it fails with a long stack trace, essentially saying "All datanodes are bad... Aborting".
Has anybody faced similar situation earlier?
Is it possible that some configuration was changed by mistake and led to this behaviour? If so, is there a setting limiting the size of data that can be ingested that I can change?
Thanks
"Has anybody faced similar situation earlier?"
Yes, I have. You should raise the limits for the user who runs Hadoop.
I installed Hadoop on a Linux box by downloading it from the Apache website, and my system was not tuned for it, so I got the same message. These are the settings from the Cloudera QuickStart; compare your limits with these.
[cloudera@quickstart ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30494
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
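Given the exact 100 MB threshold, the "file size" line is probably the first one to compare, although that is my guess and not something the error message confirms. A small sketch of the check follows. fsize_allows is a made-up helper name, and it assumes bash, where "ulimit -f" reports 1024-byte blocks (other shells, and bash in POSIX mode, use 512-byte blocks).

```shell
#!/bin/sh
# fsize_allows LIMIT BYTES: succeed if a file of BYTES bytes fits under LIMIT.
# LIMIT is the value printed by `ulimit -f` in bash: a count of 1024-byte
# blocks, or the word "unlimited".
fsize_allows() {
    local limit="$1" bytes="$2"
    [ "$limit" = "unlimited" ] && return 0
    [ "$bytes" -le $(( limit * 1024 )) ]
}

# Example (commented out): run as the user that starts the Hadoop daemons.
# fsize_allows "$(ulimit -f)" 104857601 && echo "fits" || echo "file size limit too low"
```

A limit of 102400 blocks would allow exactly 104857600 bytes and reject 104857601, which matches the behaviour described in the question.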
If you can get more details about the datanode from the stack trace, you should be able to stop that datanode, which is one of the daemons.
Killing or stopping the datanode re-instantiates the one that is likely to be corrupted. Basically, you are restarting the JVM of that particular datanode.
Command:
To stop: bin/hadoop-daemon.sh stop datanode
To start: bin/hadoop-daemon.sh start datanode

Write program consumes certain amount of memory

For a recent experiment, I need a program that consumes a certain amount of memory. I want to implement it as a bash script; say, a script that runs as a daemon and consumes around 200 MB of physical memory. How should I design this script?
If possible, I would like it to run without special permissions.
Seems like this is what you are looking for:
mntroot rw
cd /dev
while :
do
dd > /dev/null 2>&1 if=/dev/zero of=myfile1 count=204800 bs=1024 # use 200 MiB of RAM (204800 x 1 KiB)
usleep 1
rm myfile1
done
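If root is not available, a similar effect can be had by writing into a world-writable tmpfs instead of /dev. The following is a sketch under assumptions: fill_tmpfs is a made-up name, /dev/shm is assumed to be a tmpfs mount, and the memory shows up as shared/tmpfs pages, not as the script's own resident memory.

```shell
#!/bin/sh
# fill_tmpfs DIR SIZE_MB: create a SIZE_MB MiB file of zeros in DIR and print its path.
# If DIR is on tmpfs (for example /dev/shm), the pages stay in RAM (or swap)
# until the file is removed.
fill_tmpfs() {
    local dir="$1" size_mb="$2"
    local file="$dir/memhog.$$"
    dd if=/dev/zero of="$file" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    echo "$file"
}

# Example (commented out): hold about 200 MiB until the script is stopped.
# f=$(fill_tmpfs /dev/shm 200) || exit 1
# trap 'rm -f "$f"' EXIT
# trap 'exit' INT TERM
# while true; do sleep 60; done
```

Unlike the loop above, this writes the file once and simply keeps it, so it does not burn CPU rewriting the same data.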

Hadoop start-dfs.sh fail

I am new to Hadoop. I configured a cluster following the instructions.
After configuring it, I started the HDFS daemons with /bin/start-dfs.sh
I checked the log file /home/deploy/hadoop/libexec/../logs/hadoop-deploy-datanode-slave1.out to make sure it is running, but I see only the text below:
ulimit -a for user deploy
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63524
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
I hope someone can help.
@Minh, you are looking at /home/deploy/hadoop/libexec/../logs/hadoop-deploy-datanode-slave1.out; look at /home/deploy/hadoop/libexec/../logs/hadoop-deploy-datanode-slave1.log instead.
Similarly there would be other log files in /home/deploy/hadoop/libexec/../logs/ folder.
Let me explain more about .log and .out files.
Some of the files in logs folder end with .log, and others end with .out.
The .out files are only written to when daemons are starting.
After daemons have started successfully, .out files are truncated.
By contrast, all log messages can be found in the .log files, including the daemon start-up messages that are sent to the .out files.
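To check all the .log files in that folder at once, something like the following might help. It is a sketch: scan_logs is a made-up name, and it assumes the log lines carry the usual ERROR and FATAL level words.

```shell
#!/bin/sh
# scan_logs DIR: print ERROR and FATAL lines from every .log file in DIR,
# prefixed with the file name. The .out files are deliberately skipped.
scan_logs() {
    local dir="$1"
    grep -H -E 'ERROR|FATAL' "$dir"/*.log 2>/dev/null
}

# Example (commented out):
# scan_logs /home/deploy/hadoop/libexec/../logs
```

If it prints nothing, either the daemons started cleanly or the messages use different level names, in which case reading the tail of the datanode .log directly is the fallback.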

rsync suddenly hanging indefinitely during transfers

For the past few years, I have been using an rsync one-liner to back up important folders on my Mac Mini desktop (OSX 10.9, 2.5 GHz i5, 4 GB RAM) to a FreeNAS box (0.7.2 Sabanda revision 5266, Pentium D 2.66 GHz, 822MiB RAM [reported by the system, I think there's 1 GB in there]). I am running an rsync daemon on the FreeNAS box. Recently, these transfers have been hanging indefinitely. I have done the usual Google-fu and am unable to identify the source of the problem or a solution.
The one-liner is:
rsync -rvOlt --exclude '.DS_Store' \
--exclude '.com.apple.timemachine.supported' \
--delete /Volumes/Storage/Music/Albums/ 192.168.1.100::albums
I have tried enabling -vvv and --progress, but there is no pattern that I can discern between what hangs and what doesn't. Heck, if I retry, the same file might hang at a different point during the transfer or not at all. A dry run (-n) does not always succeed either. The only "success" I've had is implementing a timeout (--timeout=10) and rerunning the command over and over. Eventually, I creep along, but with no guarantee of success and at a pace that is unacceptable. I've reached a point where I have one file that I can't get past.
The Mac Mini is connected to my router via 5 GHz. The FreeNAS box is wired into that same router on a 100 Mbit port. When transfers are actually going, rsync --progress reports 2.5-4 MB/s. According to --progress, a hang is literally just that: no data transfer is occurring as far as I can tell.
I need help with both the diagnostics and the solution.
I was having the same problem. Removing -v didn't work for me. My use-case is slightly different in that I'm going from source (EXT4) to ExFAT. The issue for me was that rsync was attempting to preserve device files and permissions, which ExFAT doesn't support. I was using the -hrltDvaP switches. The -D and -a switches seemed to be my problem. The -a switch translates to -rlptgoD (no -H,-A,-X). The -p, -g, and -o switches seemed to be my root cause as rsync was barfing on one or all of those during runtime. Removing -a and specifying -Prltvc switches explicitly is working for me.
bkupcmd="nice -n$nicelevel /usr/bin/rsync -Prltvc --exclude-from=/var/tmp/ignorelist "
I've been running into the same thing again and again and it seems to help if you drop the -v option (which is annoying if you need that output).
Try using --whole-file/-W.
This option disables the rsync delta-transfer algorithm.
That is what worked for us (WSL to OSX).
Our full sync flags were -avWPle
(the e is there because we were using ssh; it has to be the last flag because it takes an argument)
This happened to me when the remote device ran out of space. The error wouldn't show when --verbose option was used; turning that off yielded some STDERR output that explained that the remote device was out of space. When I freed some space, I was able to run rsync again with --verbose and everything went fine.
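A quick way to rule this out before starting a transfer is to look at the free space on the destination. This is a sketch: free_kb is a made-up name, and it relies on the POSIX output format of "df -P" (it would miscount if the filesystem name contains spaces).

```shell
#!/bin/sh
# free_kb PATH: print the free space, in KiB, of the filesystem holding PATH.
# -P forces the portable one-line-per-filesystem format; -k reports KiB.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example (commented out): for a remote destination, run df over ssh instead.
# The host and path here are placeholders.
# ssh user@remote-host df -Pk /path/to/destination
```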
I am using openSUSE 13.2 Linux, rsync version 3.1.1-2.4.1.x86_64, and I experienced similar problems doing an rsync between my laptop and an external hard disk, with the destination device definitely having enough free space.
I thought I got an improvement by omitting option -v, but after 10 minutes it was hanging again. strace said:
select(5, [], [4], [], {60, 0}) = 0 (Timeout)
And with "iotop" I could confirm that the rsync processes were no longer doing any significant disk I/O.
Neither removing the -v option nor limiting the bandwidth using --bwlimit fixed the problem.
Just had a similar problem while doing rsync from harddisk to a FAT32 USB drive. rsync froze already in less than a second in my case and did not react at all after that ... left it with CTRL+C.
Found out that the problem was a combination of usage of hardlinks on the harddisk and having FAT32 filesystem on the USB drive, which does not support hardlinks.
Formatting the USB drive with ext4 solved the problem for me.
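To see whether the source tree contains hard links at all before copying to FAT32, find can list them. This is a sketch, with list_hardlinked as a made-up name; it only reports files whose link count is above one and does not say which flag made rsync try to preserve them.

```shell
#!/bin/sh
# list_hardlinked DIR: print regular files under DIR that have more than one
# hard link, i.e. files that share their data with at least one other name.
list_hardlinked() {
    find "$1" -type f -links +1
}

# Example (commented out):
# list_hardlinked /path/to/source | head
```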
In my situation rsync was not actually failing.
I have regular server backups that transfer large files (500 GB+) over ssh with the --append-verify or --checksum parameters specified.
What I found upon analysis is that once the client side completes its file checks, the server-side checks start. This means that while the server is doing its checks, the client side will appear hung and frozen; run htop on the server to see rsync working away.
This is likely a non-issue if rsync is run in daemon mode on the server and the rsync protocol is used instead of ssh for transfers.
On a related note, this very long wait can trigger an SSH timeout and a "rsync: connection unexpectedly closed (254 bytes received so far) [sender]" error message. The solution is to add ClientAliveInterval 120 and ClientAliveCountMax 720 to /etc/ssh/sshd_config.
I've seen this quite often on 3.0.9 on a directory with hardlinks, but it also happened on 3.1.3.
There is a nice analysis in Debian bug 820916: when its internal sockets are congested with errors, rsync could go into a deadlock.
This might have been fixed in a 3.2 release just a few days ago (Jun 2020):
Avoid a hang when an overabundance of messages clogs up all the I/O buffers.
The only good workaround I can think of, if the problem is not persistent, is to put timeout in front of it: timeout <duration> rsync <args> <source> <destination>, then retry. If it is persistent for you, you're the lucky one who can debug it :D
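The timeout-and-retry idea can be wrapped in a small function. This is a sketch, with retry_with_timeout as a made-up name; it assumes the coreutils timeout command, which exits with status 124 when the time limit is hit.

```shell
#!/bin/sh
# retry_with_timeout TRIES SECONDS CMD [ARGS...]:
# run CMD under `timeout`, retrying until it exits 0 or TRIES attempts are used up.
retry_with_timeout() {
    local tries="$1" secs="$2" attempt=1 status
    shift 2
    while [ "$attempt" -le "$tries" ]; do
        timeout "$secs" "$@" && return 0
        status=$?   # 124 means timeout killed the command
        echo "attempt $attempt failed or timed out (exit $status)" >&2
        attempt=$((attempt + 1))
        sleep 1
    done
    return 1
}

# Example (commented out): up to 20 attempts of 10 minutes each.
# retry_with_timeout 20 600 rsync -a /path/to/source/ /path/to/destination/
```

Since rsync skips files that are already up to date, each retry only has to redo the file that was in flight when the hang occurred.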
It also happens when the user on the target machine does not have write permission on the target folder.
You can try giving others write permission on the target folder:
sudo chmod -R o+w /path/to/target-folder
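Before opening the folder up to everyone, it may be worth confirming that write permission really is the problem. This is a sketch, with can_write as a made-up name; it has to be run on the target machine as the user rsync logs in as.

```shell
#!/bin/sh
# can_write DIR: succeed if the current user can create a file in DIR.
# It creates and removes a small probe file instead of relying on `test -w`,
# which can be misleading on some network filesystems.
can_write() {
    local probe
    probe=$(mktemp "$1/.write-test.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Example (commented out):
# can_write /path/to/target-folder && echo "writable" || echo "not writable"
```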
In my case, it was the IPC (Intrusion Protection Component) in our firewall. It saw all the TCP SYN packets as a flood attack and killed the connection. I left an rsync over NFS session open, turned off the IPC for the server's firewall rule, and it started working again right away.
rsync -ravh /source /destination
When it happened I was not able to kill the rsync session. It locked up the NFS mount and I would have to reboot the client machine to get it to work again. The strange thing is it would copy some files over and then all of a sudden stop. It always seemed to stop on the same file. So I was looking for file issues, permission issues, TCP offloading issues, and tried removing the -v in the rsync call. In my case, at least, the issue even happened with a simple
cp -rp /source /destination
So I knew then to start looking at other factors. So if you have any sort of intrusion protection on a firewall or router between the servers you can try turning that off temporarily to see if it solves your issue as well.
Most likely not "your" problem, but I stumbled upon this question when I was researching a similar behavior:
I'm observing "hanging" when the target site has too much I/O load, e.g. on one of my small business servers, when someone is resyncing their IMAP account and downloading large batches of data while a backup job that writes their data is running.
In this situation I notice a steep drop in performance for rsync. It shows up as a high load value in top on the target machine, even though CPU and memory are fine.
Waiting for the process to finish has helped every time, as has interrupting it and attempting the rsync again at a later time.
I was having the same problem and it was because I was running out of memory during the rsync. Created a swap file and problem solved.
Had an rsync hanging issue on Ubuntu 16. None of the options above helped. The problem was in the source drive (external SSD), which suddenly became faulty. I tried several disk checks, but all of them got stuck. Ended up rebooting the system and the disk suddenly became accessible again.
Holger Ohmacht aka h8ohmh / 8ohmh:
As far as I could investigate, the problem lies in the filesystem buffering and how it interacts with the hard disk hardware.
Temporary solution for local drives (e.g. USB3 <-> HD): a script that polls the free disk space. If the free disk space stops changing, rsync has stalled and has to be restarted.
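cmd="rsync -aW --progress --stats --preallocate --super \
<here your source dir> \
<here your dest dir>"
eval "$cmd" &
# Start from a clean state; -f avoids an error if the files do not exist yet.
rm -f ./ndf.txt ./odf.txt
while true; do
    # Snapshot current disk usage and compare it with the previous snapshot.
    df > ./ndf.txt
    cmp ./odf.txt ./ndf.txt
    res="$?"
    echo "$res"
    if [[ $res == 0 ]]; then
        # Nothing changed since the last check: assume rsync stalled and restart it.
        echo "###########################################"
        ls -al "./ndf.txt"
        ls -al "./odf.txt"
        killall rsync
        eval "$cmd" &
    else
        cp ./ndf.txt ./odf.txt
    fi
    sleep 60
done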
Change <here your source dir> and <here your dest dir> to your paths!
In my case it always stalls when rsync's --preallocate option is used (I normally use it for better disk performance and to keep blocks contiguous), so as long as the disk and filesystem drivers are not reworked, this is the only solution.

Resources