Apparently there is a known problem of XFS locking up the kernel/processes and corrupting volumes under heavy traffic.
Some web pages talk about it, but I was not able to figure out which pages are new and may have a solution.
My company's deployments have Debian with kernel 3.4.107, xfsprogs 3.1.4, and large storage arrays.
We have large data (PB) and high throughput (GB/sec) using async IO to several large volumes.
We constantly experience these unpredictable lockups on several systems.
Kernel logs/dmesg show something like the following:
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986515] INFO: task Sr2dReceiver-5:46829 blocked for more than 120 seconds.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986518] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986520] Sr2dReceiver-5 D ffffffff8105b39e 0 46829 7284 0x00000000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986524] ffff881e71f57b38 0000000000000082 000000000000000b ffff884066763180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986529] 0000000000000000 ffff884066763180 0000000000011180 0000000000011180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986532] ffff881e71f57fd8 ffff881e71f56000 0000000000011180 ffff881e71f56000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986536] Call Trace:
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986545] [<ffffffff814ffe9f>] schedule+0x64/0x66
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986548] [<ffffffff815005f3>] rwsem_down_failed_common+0xdb/0x10d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986551] [<ffffffff81500638>] rwsem_down_write_failed+0x13/0x15
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986555] [<ffffffff8126b583>] call_rwsem_down_write_failed+0x13/0x20
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986558] [<ffffffff814ff320>] ? down_write+0x25/0x27
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986572] [<ffffffffa01f29e0>] xfs_ilock+0xbc/0x12e [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986580] [<ffffffffa01eec71>] xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986586] [<ffffffffa01eec71>] ? xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986593] [<ffffffffa01ef234>] xfs_file_aio_write_checks+0x41/0xfe [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986600] [<ffffffffa01ef358>] xfs_file_buffered_aio_write+0x67/0x179 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986603] [<ffffffff8150099a>] ? _raw_spin_unlock_irqrestore+0x30/0x3d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986611] [<ffffffffa01ef81d>] xfs_file_aio_write+0x163/0x1b5 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986614] [<ffffffff8106f1af>] ? futex_wait+0x22c/0x244
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986619] [<ffffffff8110038e>] do_sync_write+0xd9/0x116
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986622] [<ffffffff8150095f>] ? _raw_spin_unlock+0x26/0x31
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986634] [<ffffffff8106f2f1>] ? futex_wake+0xe8/0xfa
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986637] [<ffffffff81100d1d>] vfs_write+0xae/0x10a
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986639] [<ffffffff811015b3>] ? fget_light+0xb0/0xbf
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986642] [<ffffffff81100dd3>] sys_pwrite64+0x5a/0x79
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986645] [<ffffffff81506912>] system_call_fastpath+0x16/0x1b
Lockups leave the system in a bad state. The processes in D state that hang cannot even be killed with signal 9.
The only way to resume operations is to reboot and repair XFS, after which the system works for a while longer.
But occasionally after a lockup we cannot even repair some volumes, as they are totally corrupted and we need to rebuild them with mkfs.
As a last resort, we now run xfs_repair periodically, which has reduced the frequency of lockups and data loss to some extent.
But the incidents still occur often enough that we need a real solution.
I was wondering if there is a solution for this with kernel 3.4.107, e.g. some patch that we may apply.
Due to the large number of deployments and other software issues, we cannot upgrade the kernel in the near future.
However, we are working towards updating our applications so that we can run on kernel 3.16 in our next releases.
Does anyone know if this XFS lockup problem was fixed in 3.16?
Some people have experienced this, but it was not a problem with XFS: the kernel was unable to flush dirty pages within the 120-second timeout. Have a look here, but check the values in use on your own system rather than assuming the defaults they quote.
http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/
and here
http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
You can see what your dirty cache ratio is by running this
sysctl -a | grep dirty
or
cat /proc/sys/vm/dirty_ratio
The best write up on this I could find is here...
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
Essentially, you need to tune your application or the system so that dirty buffers can be written to disk within the timeout, or increase the timeout itself.
You can also see some interesting parameters as follows
sysctl -a | grep hung
You could increase the timeout permanently using /etc/sysctl.conf as follows...
kernel.hung_task_timeout_secs = 300
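To make the reasoning about dirty pages and the 120-second timeout concrete, here is a minimal sketch of the arithmetic. It assumes that vm.dirty_ratio is applied as a percentage of available (free plus reclaimable) memory and that the storage drains dirty pages at a constant rate; the function names and the example figures (100 GiB available, 100 MiB/s sustained writeback) are illustrative assumptions, not values from the systems described above.

```python
def dirty_limit_bytes(available_bytes: int, dirty_ratio_percent: int) -> int:
    """Upper bound on dirty page cache implied by vm.dirty_ratio (simplified)."""
    return available_bytes * dirty_ratio_percent // 100


def flush_seconds(dirty_bytes: int, writeback_bytes_per_sec: int) -> float:
    """Time to drain the given amount of dirty data at a constant writeback rate."""
    return dirty_bytes / writeback_bytes_per_sec


GIB = 2**30
MIB = 2**20

# Hypothetical machine: 100 GiB available memory, vm.dirty_ratio = 20.
limit = dirty_limit_bytes(100 * GIB, 20)

# Hypothetical sustained writeback rate of 100 MiB/s.
seconds = flush_seconds(limit, 100 * MIB)

print(f"dirty limit: {limit / GIB:.1f} GiB, worst-case flush: {seconds:.1f} s")
if seconds > 120:
    print("a writer could stay blocked longer than hung_task_timeout_secs")
```

If the estimated flush time is well above 120 seconds, lowering vm.dirty_ratio (or vm.dirty_bytes) is likely a better fix than only raising the hung-task timeout, since raising the timeout merely hides the warning.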
Does anyone know if this XFS lockup problem was fixed in 3.16?
It is said so in A Short Guide to Kernel Debugging:
Searching for “xfs splice deadlock” turns up an email thread from 2011 that describes this
problem. However, bisecting the kernel source repository shows that
the bug wasn’t really addressed until April, 2014 (8d02076) for release in Linux 3.16.
Question (TL;DR):
For an SD card with multiple partitions, how can I get at least a "change" event from udev for the mounted nodes when a uSD or SD card is removed from a USB-SD reader, without disconnecting the USB cable from the PC?
Bonus: Get "add/remove" event for "partition" nodes ( e.g "/dev/sdh2" )
What does not work:
When a partition is mounted, udev does not emit any event for the partition node when the card is removed.
Steps to reproduce:
You need a USB-SD reader ( I highly recommend: https://www.kingston.com/en/flash/readers/FCR-HS4 ). I tested many other USB-SD readers (such as GenesysLogic-based ones) and the situation is the same.
You need a uSD or SD card with at least 1 or 2 partitions.
Step1:
Create a new udev rules file named /etc/udev/rules.d/98-test.rules
Content:
KERNEL!="sd*" , SUBSYSTEM!="block", GOTO="END"
LABEL="WORK"
ACTION=="add", RUN+="/usr/bin/logger *****Received add event for %k*****"
ACTION=="remove", RUN+="/usr/bin/logger *****Received remove event for %k*****"
ACTION=="change", RUN+="/usr/bin/logger *****Received change event for %k*****"
LABEL="END"
Step2:
Install ccze ( sudo apt install ccze ). This gives you nicely colored log output for the events.
Open a terminal and run the following command:
journalctl -f | ccze -A
Result:
Mar 09 23:01:32 Ev-Turbo kernel: sd 6:0:0:3: [sdh] 30408704 512-byte logical blocks: (15.6 GB/14.5 GiB)
Mar 09 23:01:32 Ev-Turbo kernel: sdh: sdh1 sdh2
Mar 09 23:01:33 Ev-Turbo root[19519]: *****Received change event for sdh*****
Mar 09 23:01:33 Ev-Turbo root[19523]: *****Received change event for sdh*****
Mar 09 23:01:33 Ev-Turbo root[19545]: *****Received add event for sdh2*****
Mar 09 23:01:33 Ev-Turbo root[19552]: *****Received add event for sdh1*****
Step3:
Now remove the uSD card from the slot, but don't disconnect USB cable from your PC. Watch the log:
Mar 09 23:06:56 Ev-Turbo root[21220]: *****Received change event for sdh*****
Mar 09 23:06:56 Ev-Turbo root[21223]: *****Received remove event for sdh2*****
Mar 09 23:06:56 Ev-Turbo root[21222]: *****Received remove event for sdh1*****
Step4:
Now, insert your uSD card again, you will see:
Mar 09 23:11:21 Ev-Turbo kernel: sd 6:0:0:3: [sdh] 30408704 512-byte logical blocks: (15.6 GB/14.5 GiB)
Mar 09 23:11:21 Ev-Turbo kernel: sdh: sdh1 sdh2
Mar 09 23:11:21 Ev-Turbo root[22652]: *****Received change event for sdh*****
Mar 09 23:11:21 Ev-Turbo root[22653]: *****Received change event for sdh*****
Mar 09 23:11:21 Ev-Turbo root[22679]: *****Received add event for sdh2*****
Mar 09 23:11:21 Ev-Turbo root[22682]: *****Received add event for sdh1*****
Now try mounting one of the partitions somewhere on your system:
sudo mount /dev/sdh2 /media/uSD2
You can double-check that it is really mounted ( run commands such as lsblk, mount, etc. )
Step5:
Now, while the partition is mounted, remove your uSD card from the slot. Watch the log:
Mar 09 23:12:32 Ev-Turbo root[23049]: *****Received change event for sdh*****
Nothing more... Why is there no "remove" event anymore?
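Since the disk-level "change" event for sdh is the only event seen in Step5, a handler for that event would have to work out for itself which partitions of that disk are still mounted. The following is a minimal sketch of that lookup, not a fix for the missing "remove" event. It assumes the /proc/mounts line format (device, mount point, filesystem type, options, ...) and the sd-style naming where partitions are the disk name followed by digits; the function names and the sample text are hypothetical.

```python
import re


def _unescape(field: str) -> str:
    # /proc/mounts encodes spaces and similar characters as octal escapes (e.g. \040).
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def mounted_partitions(mounts_text: str, disk: str) -> list:
    """Return (device, mount_point) pairs for mounted partitions of `disk`.

    `mounts_text` is the content of /proc/mounts; `disk` is a kernel name
    such as "sdh".
    """
    pattern = re.compile(r"^/dev/" + re.escape(disk) + r"\d+$")
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        if pattern.match(device):
            found.append((device, _unescape(mount_point)))
    return found


# Hypothetical sample in /proc/mounts format; on a real system you would
# read the file itself: open("/proc/mounts").read()
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdh2 /media/uSD2 vfat rw,relatime 0 0\n"
)
print(mounted_partitions(sample, "sdh"))
```

A script started on the "change" event (with %k passed as the disk name) could use such a list to decide which mount points to unmount lazily, assuming it has first established by other means that the medium is really gone.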
BONUS NOTES (Irrelevant to above question):
1) Most of the information on the web about udev/systemd/systemd-udevd and mount scripts is obsolete. For systemd v239 in particular, many of the "solved/working" answers are not usable. ( I worked on the issue for 2 weeks, read most of the solutions on the web, and tested on Debian 9.7, Linux Mint 18.3, and Ubuntu 18.04. )
2) For systemd versions > 212, you need service files to mount your removable devices. Example:
https://serverfault.com/questions/766506/automount-usb-drives-with-systemd
3) Especially for systemd v239, you need to disable PrivateMounts to achieve automatic mounting via systemd units. For details: https://unix.stackexchange.com/questions/464959/udev-rule-to-automount-media-devices-stopped-working-after-systemd-was-updated-t
4) Mount unit files do not fit every case, for example when you want to mount to specific directories based on USB host, port, or LUN numbers. But for some cases this approach is very easy: https://dev.to/adarshkkumar/mount-a-volume-using-systemd-1h2f
I run an Apache Storm topology on a cluster of 8+1 machines. The clocks on these machines are not synchronized, and they may differ by more than 5 minutes.
preprod-storm-nimbus-01:
Thu Feb 25 16:20:30 GMT 2016
preprod-storm-supervisor-01:
Thu Feb 25 16:20:32 GMT 2016
preprod-storm-supervisor-02:
Thu Feb 25 16:20:32 GMT 2016
preprod-storm-supervisor-03:
Thu Feb 25 16:14:54 UTC 2016 <<-- this machine is very late :(
preprod-storm-supervisor-04:
Thu Feb 25 16:20:31 GMT 2016
preprod-storm-supervisor-05:
Thu Feb 25 16:20:17 GMT 2016
preprod-storm-supervisor-06:
Thu Feb 25 16:20:00 GMT 2016
preprod-storm-supervisor-07:
Thu Feb 25 16:20:31 GMT 2016
preprod-storm-supervisor-08:
Thu Feb 25 16:19:55 GMT 2016
preprod-storm-supervisor-09:
Thu Feb 25 16:20:30 GMT 2016
Question:
Is the storm topology affected by this non-synchronization?
Note: I know that synchronizing is better, but the sysadmins won't do it unless I give them proof or reasons that they have to. Do they really have to do it, "for the topology's sake" :) ?
Thanks
It depends on the computation you are doing... It might have an effect on your result if you do time based window operations. Otherwise, it doesn't matter.
For Storm as an execution engine it has no effect at all.
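To illustrate the time-based window case, here is a small sketch in plain Python (it does not use the Storm API). It assumes events are stamped with the local clock of the machine that produces them and are then grouped into 60-second tumbling windows; the 338-second lag is roughly the gap visible for preprod-storm-supervisor-03 above, and all names and numbers are illustrative.

```python
from collections import Counter


def window_start(timestamp_s: int, window_s: int) -> int:
    """Start of the tumbling window that contains the timestamp."""
    return timestamp_s - (timestamp_s % window_s)


def count_per_window(timestamps, window_s: int) -> dict:
    """Count events per tumbling window, keyed by window start."""
    counts = Counter(window_start(ts, window_s) for ts in timestamps)
    return dict(counts)


WINDOW_S = 60
SKEW_S = 338  # one machine's clock is about 5.5 minutes behind

# Three events that really happen within the same minute.
true_times = [1000, 1005, 1010]

# The third event is stamped by the machine whose clock lags.
stamped_times = [1000, 1005, 1010 - SKEW_S]

print(count_per_window(true_times, WINDOW_S))
print(count_per_window(stamped_times, WINDOW_S))
```

With synchronized clocks all three events fall into one window; with the lagging clock, one of them is counted in a window more than five minutes earlier, so any per-window aggregate is skewed. Computations that do not depend on timestamps are unaffected.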
I have a single instance of MongoDB 2.4.8 running on Windows Server 2012 R2. MongoDB is installed as a Windows Service. I have journalling enabled.
The MongoDB documentation suggests that the MongoDB service should just be shut down via the Windows Service Control Manager:
net stop MongoDB
When I did this recently, the following was logged and I ended up with a non-zero byte mongod.lock file on disk. (I used the --repair option to fix this but it turns out this probably wasn't necessary as I had journalling enabled.)
Thu Nov 21 11:08:12.011 [serviceShutdown] got SERVICE_CONTROL_STOP request from Windows Service Control Manager, will terminate after current cmd ends
Thu Nov 21 11:08:12.043 [serviceShutdown] now exiting
Thu Nov 21 11:08:12.043 dbexit:
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: going to close listening sockets...
Thu Nov 21 11:08:12.043 [serviceShutdown] closing listening socket: 1492
Thu Nov 21 11:08:12.043 [serviceShutdown] closing listening socket: 1500
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: going to flush diaglog...
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: going to close sockets...
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: waiting for fs preallocator...
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: lock for final commit...
Thu Nov 21 11:08:12.043 [serviceShutdown] shutdown: final commit...
Thu Nov 21 11:08:12.043 [conn1333] end connection 127.0.0.1:51612 (18 connections now open)
Thu Nov 21 11:08:12.043 [conn1331] end connection 127.0.0.1:51610 (18 connections now open)
...snip...
Thu Nov 21 11:08:12.043 [conn1322] end connection 10.1.2.212:53303 (17 connections now open)
Thu Nov 21 11:08:12.043 [conn1337] end connection 127.0.0.1:51620 (18 connections now open)
Thu Nov 21 11:08:12.839 [serviceShutdown] shutdown: closing all files...
Thu Nov 21 11:08:14.683 [serviceShutdown] Progress: 5/163 3% (File Closing Progress)
Thu Nov 21 11:08:16.012 [serviceShutdown] Progress: 6/163 3% (File Closing Progress)
...snip...
Thu Nov 21 11:08:52.030 [serviceShutdown] Progress: 143/163 87% (File Closing Progress)
Thu Nov 21 11:08:54.092 [serviceShutdown] Progress: 153/163 93% (File Closing Progress)
Thu Nov 21 11:08:55.405 [serviceShutdown] closeAllFiles() finished
Thu Nov 21 11:08:55.405 [serviceShutdown] journalCleanup...
Thu Nov 21 11:08:55.405 [serviceShutdown] removeJournalFiles
Thu Nov 21 11:09:05.578 [DataFileSync] ERROR: Client::shutdown not called: DataFileSync
The last line is my main concern.
I'm also interested in how MongoDB is able to take longer to shut down than Windows normally allows for service shutdown? At what point is it safe to shut down the machine without checking the log file?
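The check described above (a non-zero-byte mongod.lock left behind after the service has stopped) can be automated before restarting or repairing. This is only a sketch of that file-size test: the function name is made up for illustration, the path in the comment is a hypothetical dbpath, and the check is only meaningful while mongod is not running.

```python
import os


def unclean_shutdown_suspected(lock_path: str) -> bool:
    """Return True if the lock file exists and is not empty.

    Intended to be called only while mongod is stopped; a running
    instance is expected to hold a non-empty lock file.
    """
    try:
        return os.path.getsize(lock_path) > 0
    except FileNotFoundError:
        # No lock file at all: nothing suggests an unclean shutdown.
        return False


# Hypothetical usage; adjust the path to your own dbpath:
# unclean_shutdown_suspected(r"C:\data\db\mongod.lock")
```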
I have an Arch Linux 64-bit VPS on DigitalOcean. I installed gwan and run it in daemon mode. It stops running every night at midnight.
Here is the log file:
[Wed Apr 24 06:10:28 2013 GMT] memory footprint: 3.78 MiB
[Thu, 25 Apr 2013 00:00:19 GMT] * child abort(8) coredump
[Thu, 25 Apr 2013 00:00:19 GMT] * child abort(8) coredump
[Thu, 25 Apr 2013 00:00:19 GMT] * child abort(8) coredump
[Thu, 25 Apr 2013 00:00:19 GMT] * child died 3 times within 3 seconds
[Thu Apr 25 12:39:39 2013 GMT] memory footprint: 3.77 MiB.
[Thu Apr 25 12:39:56 2013 GMT] loaded maintenance script/opt/gwan_linux64-bit/0.0.0.0_8080/#0.0.0.0/csp/crash.c 43.14 KiB MD5:820cf6b4-2152b838-08a13fcb-5f0dc4be
[Fri, 26 Apr 2013 00:00:10 GMT] * child abort(8) coredump
[Fri, 26 Apr 2013 00:00:10 GMT] * child abort(8) coredump
[Fri, 26 Apr 2013 00:00:10 GMT] * child abort(8) coredump
[Fri, 26 Apr 2013 00:00:10 GMT] * child died 3 times within 3 seconds
This problem does not happen on all platforms, and so far all the user reports we have received involved hypervisors, which alter CPU and OS behavior in erratic and undocumented ways (not to mention the additional bugs they inject into the system).
UPDATE
This new problem, in 4-year-old code that had worked fine so far, is a platform issue for which we have found a workaround, to be published with the next release in a few weeks.
I just installed the latest Xmas gift from the gwan team, but I'm having some problems:
1) Segmentation fault on Arch Linux.
2) Strange behavior on Ubuntu.
3) I can't run any script on it.
About #1, Archlinux is up to date and uses the 2.16 GLIBC.
About #2, I'm loading http://188.165.219.99:8080/100.html and sometimes it displays 100 X, sometimes an error page (with CSS), and sometimes an error page without CSS.
About #3, I can't run any csp script:
http://188.165.219.99:8080/?hello.c
http://188.165.219.99:8080/?hello.rb
http://188.165.219.99:8080/?hello.php
None of the above work. Has the csp url changed?
I have installed php5-cli and ruby on my ubuntu.
For information:
# ldd --version
ldd (Ubuntu EGLIBC 2.11.1-0ubuntu7.11) 2.11.1
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
Here is the log on Arch Linux:
# cat logs/gwan.log
[Thu Dec 27 14:03:44 2012 GMT] ------------------------------------------------
[Thu Dec 27 14:03:44 2012 GMT] G-WAN 3.12.26 64-bit (Dec 26 2012 13:58:12)
[Thu Dec 27 14:03:44 2012 GMT] ------------------------------------------------
[Thu Dec 27 14:03:44 2012 GMT] Local Time: Thu, 27 Dec 2012 16:03:44 GMT+2
[Thu Dec 27 14:03:44 2012 GMT] RAM: (1.60 GiB free + 0 shared + 834.57 MiB buffers) / 23.64 GiB total
[Thu Dec 27 14:03:44 2012 GMT] Physical Pages: 1.60 GiB / 23.64 GiB
[Thu Dec 27 14:03:44 2012 GMT] DISK: 1.71 TiB free / 1.88 TiB total
[Thu Dec 27 14:03:44 2012 GMT] 336 processes, including pid:1545 './gwan'
[Thu Dec 27 14:03:44 2012 GMT] Multi-Core, HT enabled
[Thu Dec 27 14:03:44 2012 GMT] 1x Intel(R) Xeon(R) CPU W3520 # 2.67GHz (4 Core(s)/CPU, 2 thread(s)/Core)
[Thu Dec 27 14:03:44 2012 GMT] using 4 workers 0[1111]3
[Thu Dec 27 14:03:44 2012 GMT] among 8 threads 0[11110000]7
[Thu Dec 27 14:03:44 2012 GMT] 64-bit little-endian (least significant byte first)
1: segmentation fault with archlinux
A lot of information is listed at the top of the /logs/gwan.log file. It would help immensely to see it (that's what log files are for).
2: on ubuntu strange behavior
Not knowing the nature of the "strange behavior" or the Ubuntu version makes it difficult to answer your question (if that's a question).
3: I can't run any script on it
We received more than 50 emails this morning alone. None (but yours) reports that scripts fail to run. So far, people report great results with all languages.
Check your file permissions (can the G-WAN account read the script files?) and verify that the relevant compilers / runtimes are installed.
Ogla's report suggests that JavaScript embedded in HTML (the new release minifies it) might be the cause of the cut files.
Other than that, the fact that the daemon mode is broken in v3.12.25/26 may explain your problems if you run in daemon mode.