Why does udev not trigger a remove event for mounted SD card partitions? - systemd

Question (TL;DR):
For an SD card with multiple partitions, how can I get at least a "change" event from udev for the mounted partition nodes when a uSD or SD card is removed from a USB SD reader, without disconnecting the USB cable from the PC?
Bonus: get "add/remove" events for the partition nodes (e.g. "/dev/sdh2").
What does not work:
When a partition is mounted, udev does not emit any event for the partition node when the card is removed!
Steps to reproduce:
You need a USB SD reader (I highly recommend: https://www.kingston.com/en/flash/readers/FCR-HS4 ). But I tested on many other USB SD readers (such as GenesysLogic-based ones); the situation is the same.
You need a uSD or SD card with at least one or two partitions.
Step 1:
Create a new udev rule file named /etc/udev/rules.d/98-test.rules with the following content:
KERNEL!="sd*" , SUBSYSTEM!="block", GOTO="END"
LABEL="WORK"
ACTION=="add", RUN+="/usr/bin/logger *****Received add event for %k*****"
ACTION=="remove", RUN+="/usr/bin/logger *****Received remove event for %k*****"
ACTION=="change", RUN+="/usr/bin/logger *****Received change event for %k*****"
LABEL="END"
Step 2:
Install ccze (sudo apt install ccze). This gives you nicely colored logging for events.
Open a terminal and run the following command:
journalctl -f | ccze -A
Result:
Mar 09 23:01:32 Ev-Turbo kernel: sd 6:0:0:3: [sdh] 30408704 512-byte logical blocks: (15.6 GB/14.5 GiB)
Mar 09 23:01:32 Ev-Turbo kernel: sdh: sdh1 sdh2
Mar 09 23:01:33 Ev-Turbo root[19519]: *****Received change event for sdh*****
Mar 09 23:01:33 Ev-Turbo root[19523]: *****Received change event for sdh*****
Mar 09 23:01:33 Ev-Turbo root[19545]: *****Received add event for sdh2*****
Mar 09 23:01:33 Ev-Turbo root[19552]: *****Received add event for sdh1*****
Step 3:
Now remove the uSD card from the slot, but don't disconnect the USB cable from your PC. Watch the log:
Mar 09 23:06:56 Ev-Turbo root[21220]: *****Received change event for sdh*****
Mar 09 23:06:56 Ev-Turbo root[21223]: *****Received remove event for sdh2*****
Mar 09 23:06:56 Ev-Turbo root[21222]: *****Received remove event for sdh1*****
Step 4:
Now insert your uSD card again; you will see:
Mar 09 23:11:21 Ev-Turbo kernel: sd 6:0:0:3: [sdh] 30408704 512-byte logical blocks: (15.6 GB/14.5 GiB)
Mar 09 23:11:21 Ev-Turbo kernel: sdh: sdh1 sdh2
Mar 09 23:11:21 Ev-Turbo root[22652]: *****Received change event for sdh*****
Mar 09 23:11:21 Ev-Turbo root[22653]: *****Received change event for sdh*****
Mar 09 23:11:21 Ev-Turbo root[22679]: *****Received add event for sdh2*****
Mar 09 23:11:21 Ev-Turbo root[22682]: *****Received add event for sdh1*****
And now, try mounting one of the partitions somewhere on your system:
sudo mount /dev/sdh2 /media/uSD2
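If /media/uSD2 does not exist yet, create it first:
sudo mkdir -p /media/uSD2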
You can double-check that it is really mounted (run lsblk, mount, etc.).
Step 5:
Now, while the partition is mounted, remove your uSD card from the slot. Watch the log:
Mar 09 23:12:32 Ev-Turbo root[23049]: *****Received change event for sdh*****
Nothing more... Why is there no "remove" event anymore?
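A diagnostic sketch (not a fix): while reproducing Step 5, monitor the kernel and udev layers side by side, to see whether the kernel emits a remove uevent for the mounted partition at all, or whether it only surfaces at one layer:
udevadm monitor --kernel --udev --subsystem-match=block
KERNEL lines are the raw kernel uevents; UDEV lines are the events after udev's rule processing.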
BONUS NOTES (not needed for the question above):
1) Most of the information on the web about udev/systemd/systemd-udevd and mount scripts is obsolete. In particular, for systemd v239 many of the "solved/working" answers no longer apply. (I have been working on this issue for two weeks, read most of the solutions on the web, and tested on Debian 9.7, Linux Mint 18.3, and Ubuntu 18.04.)
2) For systemd versions newer than 212, you need service files to mount your removable devices. Example:
https://serverfault.com/questions/766506/automount-usb-drives-with-systemd
3) Specifically for systemd v239, you need to disable PrivateMounts to achieve automatic mounting via systemd units (see the sketch after this list). For details: https://unix.stackexchange.com/questions/464959/udev-rule-to-automount-media-devices-stopped-working-after-systemd-was-updated-t
4) Mount unit files do not fit every case, for example when you want to mount to specific directories based on USB host, port, or LUN numbers. But for some cases this approach is very easy: https://dev.to/adarshkkumar/mount-a-volume-using-systemd-1h2f
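For note 3, a minimal sketch of disabling PrivateMounts for systemd-udevd via a drop-in (assuming systemd v239; see the linked answer for the details on your distribution):
sudo systemctl edit systemd-udevd
# add in the editor that opens:
[Service]
PrivateMounts=no
# then restart the daemon:
sudo systemctl restart systemd-udevd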

Related

Analysis of Redis error logs "LOADING Redis is loading the dataset in memory" and more

I am frequently seeing these messages in the Redis logs:
1#
602854:M 23 Dec 2022 09:48:54.028 * 10 changes in 300 seconds. Saving...
602854:M 23 Dec 2022 09:48:54.035 * Background saving started by pid 3266364
3266364:C 23 Dec 2022 09:48:55.844 * DB saved on disk
3266364:C 23 Dec 2022 09:48:55.852 * RDB: 12 MB of memory used by copy-on-write
602854:M 23 Dec 2022 09:48:55.938 * Background saving terminated with success
2#
LOADING Redis is loading the dataset in memory
3#
7678:signal-handler (1671738516) Received SIGTERM scheduling shutdown...
7678:M 22 Dec 2022 23:48:36.300 # User requested shutdown...
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.300 * Saving the final RDB snapshot before exiting.
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.720 * DB saved on disk
7678:M 22 Dec 2022 23:48:36.720 * Removing the pid file.
7678:M 22 Dec 2022 23:48:36.720 # Redis is now ready to exit, bye bye...
7901:C 22 Dec 2022 23:48:37.071 # WARNING supervised by systemd - you MUST set appropriate values for TimeoutStartSec and TimeoutStopSec in your service unit.
7901:C 22 Dec 2022 23:48:37.071 # systemd supervision requested, but NOTIFY_SOCKET not found
7914:C 22 Dec 2022 23:48:37.071 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
7914:C 22 Dec 2022 23:48:37.071 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=7914, just started
7914:C 22 Dec 2022 23:48:37.071 # Configuration loaded
Are these messages concerning?
Is there any optimization to be carried out in terms of settings?
The first set of informational messages is related to Redis persistence: it appears your Redis node is configured to save the database to disk when at least 10 write operations have occurred within 300 seconds. You can change that according to your needs through the Redis configuration file.
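For reference, that schedule comes from the save directives in redis.conf. A sketch with the Redis default values (adjust to your durability needs):
save 900 1
save 300 10
save 60 10000
The second line is what produced the "10 changes in 300 seconds. Saving..." message above.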
The message LOADING Redis is loading the dataset in memory, on the other hand, is an error returned while attempting to connect to a Redis instance that is still loading its dataset into memory: that occurs during startup for standalone servers and master nodes, or when replicas reconnect and fully resynchronize with the master. If you are seeing this error too often, and not right after a system restart, I would suggest checking your system log files to learn why your Redis instance is restarting or resynchronizing (depending on your topology).
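To confirm whether an instance is still in that state, a quick check with standard redis-cli:
redis-cli INFO persistence | grep loading
loading:1 means the dataset is still being read from disk; loading:0 means the instance is ready to serve.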

Elastic 2.3.4 Node Startup Quiet Failure

We are using a 5-node cluster hosted in Google Cloud (Ubuntu 16.04 LTS) and we noticed that one node's disk usage was at 90%+, so we shut down the node with:
sudo service elasticsearch stop
Then we stopped the instance in the GCP console.
After upgrading the node's disk space, we tried starting elastic again using:
sudo service elasticsearch start
This command seems to fail silently, and the SSH session terminates after freezing momentarily. Nothing shows in the node's elasticsearch logs, and nothing shows up in the current cluster's master elasticsearch logs either. The only hint we can find of something going wrong is in the node's syslog:
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Cleanup of Temporary Directories.
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Starting Elasticsearch...
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Elasticsearch.
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.597729] kernel tried to execute NX-protected page - exploit attempt? (uid: 113)
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.605545] BUG: unable to handle kernel paging request at 00007f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.612621] IP: 0x7f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615779] PGD 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615780] P4D 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.619199] PUD 30508d067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.622626] PMD 305162067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.625438] PTE 80000003df15b867
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.628245]
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.633174] Oops: 0011 [#1] SMP PTI
The cluster health with 4 nodes is green, and we can't seem to figure out why this may be happening.
Any ideas on why this may be happening would be very helpful.
Here is our config located in /etc/default/elasticsearch:
https://gist.github.com/deppi/58826c38ea8414d301eb034e9a29cd54
Also here is our /etc/elasticsearch/elasticsearch.yml
https://gist.github.com/deppi/17b1f28e649ee528b0fe2ca93a2ff19c
The only thing I can think of that might be causing this issue is discovery.zen.minimum_master_nodes: 2
When maybe it should be configured as
discovery.zen.minimum_master_nodes: 3
(the usual quorum rule is (master_eligible_nodes / 2) + 1, which for 5 master-eligible nodes gives 3). But we are uncertain this is the issue and don't want to risk breaking our elasticsearch cluster further.
From experience, I know that shutting down the cluster using the elasticsearch command was not ideal; we had issues with nodes not going fully down and trying to take over the master role. That may be why you can see 2 nodes, but your node is no longer part of the cluster.
What you should do is shut down the elasticsearch process on each node, unless you are still indexing on the two nodes. In that case, shut your cluster down properly:
Stop the collection first every time you need to stop elasticsearch, so logstash if you are using the stack
Then stop elasticsearch itself (see the sketch after this list): https://www.elastic.co/guide/en/elasticsearch/reference/master/stopping-elasticsearch.html
Start your first node and let the election protocol take place
Start elastic on the other nodes => see if all the nodes join
If not, your config might be the problem; I would use 1 master node and 3 slaves, and use another data path. When you need to shut down your cluster: stop the collection, stop the queuing, then stop the storage (elastic), node by node
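For the shutdown step, a common pattern (a hedged sketch; the authoritative procedure is in the linked stopping-elasticsearch page) is to disable shard allocation before stopping each node, so the cluster does not start rebalancing while the node is down:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
sudo service elasticsearch stop
Re-enable allocation ("all" instead of "none") once the node is back.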
This seems to be an issue with a new kernel that has been deployed on GCP for the Ubuntu 16.04 LTS OS.
Problem Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Proper Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
To fix the issue with the GCP instances, I ran:
sudo apt remove linux-image-4.13.0-1007-gcp
sudo apt install linux-image-4.13.0-1006-gcp
exit
Then, in the Google Cloud console, restart the instance, SSH back in, and run:
sudo service elasticsearch start
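Optionally, to keep automatic upgrades from pulling the problem kernel back in, you can pin the working one (standard apt-mark usage, package name as above):
sudo apt-mark hold linux-image-4.13.0-1006-gcp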

Suspicious Activity in system.log OSX

A Mac user was having some clock errors and thought they had seen someone using remote/VNC control on their screen. I went through system.log, and most of this activity shows at times when the laptop was off and unplugged (no battery) and the user was asleep.
System.log file here- https://ghostbin.com/paste/mcukf
These were the lines that interested me.
Java connection causing clock to be off.
23:54:32 Ushas-Air Java Updater[531]: Original euid:501
Apr 24 23:54:32 Ushas-Air com.apple.xpc.launchd[1] (com.apple.preference.datetime.remoteservice[366]): Service exited due to signal: Killed: 9 sent by com.apple.preference.datetime.re[366]
Apr 24 23:54:32 Ushas-Air Java Updater[531]: Host name is javadl-esd-secure.oracle.com
Apr 24 23:54:32 Ushas-Air Java Updater[531]: Feed URL: https
Apr 24 23:54:32 Ushas-Air Java Updater[531]: Hostname check passed. Valid Oracle hostname
Apr 24 23:54:33 Ushas-Air com.apple.xpc.launchd[1] (com.apple.bsd.dirhelper[523]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.bsd.dirhelper
Apr 24 23:54:36 Ushas-Air java[541]: objc[541]: Class JavaLaunchHelper is implemented in both /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java (0x1023604c0) and /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/lib/jli/./libjli.dylib (0x119327480). One of the two will be used. Which one is undefined.
Instances of IMRemoteURLConnection Agent happening
Apr 25 00:14:11 Ushas-MacBook-Air com.apple.xpc.launchd[1] (com.apple.imfoundation.IMRemoteURLConnectionAgent): Unknown key for integer: _DirtyJetsamMemoryLimit
Apr 25 00:01:22 Ushas-MacBook-Air com.apple.xpc.launchd[1] (com.apple.imfoundation.IMRemoteURLConnectionAgent): Unknown key for integer: _DirtyJetsamMemoryLimit
Apr 25 00:05:57 Ushas-MacBook-Air com.apple.xpc.launchd[1] (com.apple.preferences.users.remoteservice[762]): Service exited due to signal: Killed: 9 sent by com.apple.preferences.users.remo[762]
Multiple cache deletes requested after.
Apr 25 00:01:27 Ushas-MacBook-Air logd[57]: _handle_cache_delete_with_urgency(0x7fdf19412a60, 3, 0)
Apr 25 00:01:27 Ushas-MacBook-Air logd[57]: _handle_cache_delete_with_urgency(0x7fdf19412a60, 3, 0)
Apr 25 00:01:31 Ushas-MacBook-Air com.apple.preferences.icloud.remoteservice[700]: BUG in libdispatch client: kevent[EVFILT_MACHPORT] monitored resource vanished before the source cancel handler was invoked
Apr 25 00:01:33 Ushas-MacBook-Air logd[57]: _handle_cache_delete_with_urgency(0x7fdf19658620, 3, 0)
Apr 25 00:01:33 Ushas-MacBook-Air logd[57]: _volume_contains_cached_data(is /private/var/db/diagnostics/ in /) - YES
Apr 25 00:01:34 Ushas-MacBook-Air logd[57]: 239517600 bytes of purgeable space from log files
Apr 25 00:01:34 Ushas-MacBook-Air logd[57]: _purge_uuidtext only runs at urgency 0 (3)
Apr 25 00:01:34 Ushas-MacBook-Air logd[57]: 0 bytes of purgeable space from uuidtext files
And it appears to be launching the FamilyCircle framework:
Apr 24 23:56:11 Ushas-Air com.apple.xpc.launchd[1] (com.apple.imfoundation.IMRemoteURLConnectionAgent): Unknown key for integer: _DirtyJetsamMemoryLimit
Apr 24 23:56:16 --- last message repeated 1 time ---
Apr 24 23:56:16 Ushas-Air familycircled[615]: objc[615]: Class FAFamilyCloudKitProperties is implemented in both /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/FamilyCircle (0x7fffbe466a60) and /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/Resources/familycircled (0x10aa01178). One of the two will be used. Which one is undefined.
Apr 24 23:56:16 Ushas-Air familycircled[615]: objc[615]: Class FAFamilyMember is implemented in both /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/FamilyCircle (0x7fffbe466880) and /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/Resources/familycircled (0x10aa01268). One of the two will be used. Which one is undefined.
Apr 24 23:56:16 Ushas-Air familycircled[615]: objc[615]: Class FAFamilyCircle is implemented in both /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/FamilyCircle (0x7fffbe466a10) and /System/Library/PrivateFrameworks/FamilyCircle.framework/Versions/A/Resources/familycircled (0x10aa01358). One of the two will be used. Which one is undefined.
Activity related to Find My Friends is happening. The Mac's owner doesn't use Find My Friends or own an iPhone.
Apr 25 00:30:00 Ushas-MacBook-Air syslogd[40]: Configuration Notice:
ASL Module "com.apple.mobileme.fmf1.internal" sharing output destination "/var/log/FindMyFriendsApp/FindMyFriendsApp.asl" with ASL Module "com.apple.mobileme.fmf1".
Output parameters from ASL Module "com.apple.mobileme.fmf1" override any specified in ASL Module "com.apple.mobileme.fmf1.internal".
Apr 25 00:30:00 Ushas-MacBook-Air syslogd[40]: Configuration Notice:
ASL Module "com.apple.mobileme.fmf1.internal" sharing output destination "/var/log/FindMyFriendsApp" with ASL Module "com.apple.mobileme.fmf1".
Output parameters from ASL Module "com.apple.mobileme.fmf1" override any specified in ASL Module "com.apple.mobileme.fmf1.internal".
Apr 25 00:30:00 Ushas-MacBook-Air syslogd[40]: Configuration Notice:
The keybagd log being shared with com.apple.mkb:
Apr 25 00:30:00 Ushas-MacBook-Air syslogd[40]: Configuration Notice:
ASL Module "com.apple.mkb.internal" sharing output destination "/private/var/log/keybagd.log" with ASL Module "com.apple.mkb".

Is there any solution to the XFS lockup in linux?

Apparently there is a known problem of XFS locking up the kernel/processes and corrupting volumes under heavy traffic.
Some web pages talk about it, but I was not able to figure out which pages are new and may have a solution.
My company's deployments have Debian with kernel 3.4.107, xfsprogs 3.1.4, and large storage arrays.
We have large data (PB) and high throughput (GB/sec) using async IO to several large volumes.
We constantly experience these unpredictable lockups on several systems.
Kernel logs/dmesg show something like the following:
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986515] INFO: task Sr2dReceiver-5:46829 blocked for more than 120 seconds.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986518] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986520] Sr2dReceiver-5 D ffffffff8105b39e 0 46829 7284 0x00000000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986524] ffff881e71f57b38 0000000000000082 000000000000000b ffff884066763180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986529] 0000000000000000 ffff884066763180 0000000000011180 0000000000011180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986532] ffff881e71f57fd8 ffff881e71f56000 0000000000011180 ffff881e71f56000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986536] Call Trace:
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986545] [<ffffffff814ffe9f>] schedule+0x64/0x66
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986548] [<ffffffff815005f3>] rwsem_down_failed_common+0xdb/0x10d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986551] [<ffffffff81500638>] rwsem_down_write_failed+0x13/0x15
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986555] [<ffffffff8126b583>] call_rwsem_down_write_failed+0x13/0x20
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986558] [<ffffffff814ff320>] ? down_write+0x25/0x27
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986572] [<ffffffffa01f29e0>] xfs_ilock+0xbc/0x12e [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986580] [<ffffffffa01eec71>] xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986586] [<ffffffffa01eec71>] ? xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986593] [<ffffffffa01ef234>] xfs_file_aio_write_checks+0x41/0xfe [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986600] [<ffffffffa01ef358>] xfs_file_buffered_aio_write+0x67/0x179 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986603] [<ffffffff8150099a>] ? _raw_spin_unlock_irqrestore+0x30/0x3d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986611] [<ffffffffa01ef81d>] xfs_file_aio_write+0x163/0x1b5 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986614] [<ffffffff8106f1af>] ? futex_wait+0x22c/0x244
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986619] [<ffffffff8110038e>] do_sync_write+0xd9/0x116
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986622] [<ffffffff8150095f>] ? _raw_spin_unlock+0x26/0x31
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986634] [<ffffffff8106f2f1>] ? futex_wake+0xe8/0xfa
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986637] [<ffffffff81100d1d>] vfs_write+0xae/0x10a
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986639] [<ffffffff811015b3>] ? fget_light+0xb0/0xbf
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986642] [<ffffffff81100dd3>] sys_pwrite64+0x5a/0x79
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986645] [<ffffffff81506912>] system_call_fastpath+0x16/0x1b
Lockups leave the system in a bad state. The processes in D state that hang cannot even be killed with signal 9.
The only way to resume operations is to reboot, repair XFS and then the system works for another while.
But occasionally after the lockup we cannot even repair some volumes, as they get totally corrupted and we need to rebuild them with mkfs.
As a last resort, we now run xfs-repair periodically and this reduced the frequency of lockups and data loss to a certain extent.
But the incidents still occur often enough, so we need some solution.
I was wondering if there is a solution for this with kernel 3.4.107, e.g. some patch that we may apply.
Due to the large number of deployments and other software issues, we cannot upgrade the kernel in the near future.
However, we are working towards updating our applications so that we can run on kernel 3.16 in our next releases.
Does anyone know if this XFS lockup problem was fixed in 3.16?
Some people have experienced this, but it was not a problem with XFS; it was because the kernel was unable to flush dirty pages within the 120-second window. Have a look here, but please check the numbers they use as defaults against your own system:
http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/
and here
http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
You can see what your dirty cache ratio is by running:
sysctl -a | grep dirty
or
cat /proc/sys/vm/dirty_ratio
The best write up on this I could find is here...
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
Essentially, you need to tune your application to make sure it can write its dirty buffers to disk within the time period, or change the timeout, etc.
You can also see some interesting parameters as follows:
sysctl -a | grep hung
You could increase the timeout permanently using /etc/sysctl.conf as follows...
kernel.hung_task_timeout_secs = 300
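If the underlying cause is writeback falling behind, you can also make the kernel start flushing earlier by lowering the dirty thresholds in /etc/sysctl.conf (illustrative values only; test under your own workload):
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
Apply without a reboot:
sudo sysctl -p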
Does anyone know if this XFS lockup problem was fixed in 3.16?
It is said so in A Short Guide to Kernel Debugging:
Searching for “xfs splice deadlock” turns up an email thread from 2011 that describes this problem. However, bisecting the kernel source repository shows that the bug wasn’t really addressed until April 2014 (commit 8d02076) for release in Linux 3.16.

Aerospike on OpenVZ (2 cores, 4 GB RAM) doesn't start and doesn't give errors

After installing without any trouble, I started aerospike on an OpenVZ VPS with 2 cores and 4 GB RAM.
This is the result:
root@outland:~# /etc/init.d/aerospike start
* Start aerospike: asd [OK]
Then check whether asd is running:
root@outland:~# /etc/init.d/aerospike status
* Halt aerospike: asd [fail]
What is going wrong?
Adding logs:
Mar 03 2015 15:17:57 GMT: INFO (config): (cfg.c::3033) system file descriptor limit: 100000, proto-fd-max: 15000
Mar 03 2015 15:17:57 GMT: WARNING (cf:misc): (id.c::249) Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno 19 No such device
Mar 03 2015 15:17:57 GMT: CRITICAL (config): (cfg.c:3363) could not get unique id and/or ip address
Mar 03 2015 15:17:57 GMT: WARNING (as): (signal.c::120) SIGINT received, shutting down
Mar 03 2015 15:17:57 GMT: WARNING (as): (signal.c::123) startup was not complete, exiting immediately
This is your config problem
Mar 03 2015 15:17:57 GMT: WARNING (cf:misc): (id.c::249) Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno 19 No such device
Mar 03 2015 15:17:57 GMT: CRITICAL (config): (cfg.c:3363) could not get unique id and/or ip address
Basically, the VPS has a non-standard network interface name.
The solution is to add your interface name as network-interface-name to the config; a sketch follows below the link.
http://www.aerospike.com/docs/operations/troubleshoot/startup/#problem-with-network-interface
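A minimal sketch of that change, assuming the OpenVZ interface is venet0 (check with ip link) and the stock config at /etc/aerospike/aerospike.conf; the exact context for the parameter is described in the linked troubleshooting page, so verify placement there:
service {
    ...
    network-interface-name venet0
}
Restart aerospike afterwards (/etc/init.d/aerospike restart) and re-check the status.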
Which OS are you using, by the way?
