Vagrant VirtualBox (Ubuntu Trusty) stops responding after several days

I am running Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64) as a Vagrant guest [vagrant:amd64 1:1.6.3] under VirtualBox. My host system runs the same OS version.
After several days of apparently flawless operation, the Vagrant box stops responding. More specifically:
My Supervisor-managed services (web server etc.) no longer respond.
I can still vagrant ssh into the box and navigate around most directories.
Anything that touches the shared /vagrant directory hangs (including sudo supervisorctl).
Running vagrant halt from the host fails to shut the box down gracefully, and Vagrant eventually forces the shutdown.
Bringing the box back up afterwards gives several more trouble-free days.
The only thing in /var/log/syslog that looks relevant is the following:
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.711678] BUG: unable to handle kernel paging request at 0000006c0000003f
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714065] IP: [<ffffffffa00a10f6>] vbglPhysHeapExcludeBlock+0x16/0x60 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] PGD 3c7c7067 PUD 0
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Oops: 0002 [#1] SMP
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Modules linked in: rpcsec_gss_krb5 nfsv4 vboxsf(OF) nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt ppdev serio_raw parport_pc vboxguest(OF) parport ahci psmouse libahci e1000
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] CPU: 0 PID: 1632 Comm: vminfo Tainted: GF O 3.13.0-32-generic #57-Ubuntu
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] task: ffff88003c202fe0 ti: ffff88003cd20000 task.ti: ffff88003cd20000
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RIP: 0010:[<ffffffffa00a10f6>] [<ffffffffa00a10f6>] vbglPhysHeapExcludeBlock+0x16/0x60 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RSP: 0018:ffff88003cd21d78 EFLAGS: 00010206
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RAX: 0000006c00000027 RBX: ffff88003ce5014c RCX: ffff88003ce60000
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RDX: ffff88003ce50124 RSI: ffff88003ce50124 RDI: ffff88003ce5016c
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RBP: ffff88003cd21d78 R08: 0000000000000292 R09: ffff88003ce5014c
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] R10: ffff88003c6fcc10 R11: 0000000000000246 R12: ffff88003ce50124
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] R13: ffff88003ce5014c R14: 0000000000000020 R15: 0000000000000000
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] FS: 00007fc8b5475700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] CR2: 0000006c0000003f CR3: 000000003ccf0000 CR4: 00000000000006f0
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Stack:
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] ffff88003cd21d98 ffffffffa00a1609 ffff88003ce5014c ffff88003cd21e70
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] ffff88003cd21db0 ffffffffa009faae ffff88003cd21e78 ffff88003cd21e38
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] ffffffffa009de5e ffff880000000000 0000000000000000 0000000000000050
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Call Trace:
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffffa00a1609>] VbglPhysHeapFree+0xc9/0xe0 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffffa009faae>] VbglGRFree+0x1e/0x30 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffffa009de5e>] VBoxGuestCommonIOCtl+0x54e/0x1b90 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff811a103b>] ? kfree+0xab/0x140
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffffa009b7ce>] vboxguestLinuxIOCtl+0x9e/0x200 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff8101b7e9>] ? sched_clock+0x9/0x10
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff8109d1ad>] ? sched_clock_local+0x1d/0x80
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff811cfd10>] do_vfs_ioctl+0x2e0/0x4c0
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff811cff71>] SyS_ioctl+0x81/0xa0
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] [<ffffffff8172c87f>] tracesys+0xe1/0xe6
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] Code: 05 00 00 c7 05 c8 05 02 00 00 00 00 00 5d c3 66 0f 1f 44 00 00 66 66 66 66 90 48 8b 47 10 55 48 89 e5 48 85 c0 74 0c 48 8b 57 18 <48> 89 50 18 48 8b 47 10 48 8b 57 18 48 85 d2 74 19 48 89 42 10
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RIP [<ffffffffa00a10f6>] vbglPhysHeapExcludeBlock+0x16/0x60 [vboxguest]
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] RSP <ffff88003cd21d78>
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.714663] CR2: 0000006c0000003f
Aug 20 22:45:31 vagrant-ubuntu-trusty-64 kernel: [226490.828096] ---[ end trace 1eefe230ded2b9f8 ]---
I am happy to supply any further information that I have left out.
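A note on reading the trace above: the registers and the faulting instruction can be cross-checked by hand. The sketch below is not part of the original post. It assumes the standard x86 page-fault error-code layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and that the marked bytes `<48> 89 50 18` in the Code line encode a store to offset 0x18 from RAX. The helper name `decode_page_fault_error` is hypothetical.

```python
def decode_page_fault_error(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code ('Oops: NNNN')."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


# Values copied from the trace above.
oops_code = 0x0002
rax = 0x0000006C00000027
cr2 = 0x0000006C0000003F

# Assumption: the faulting instruction stores to 0x18(%rax).
store_offset = 0x18

fault = decode_page_fault_error(oops_code)
print(fault)

# The faulting address (CR2) should equal RAX plus the store offset.
print(hex(rax + store_offset), rax + store_offset == cr2)
```

Under these assumptions the oops is a kernel-mode write through a value in RAX that does not look like a kernel pointer. That would be consistent with a corrupted list pointer inside the vboxguest heap, although the trace alone does not prove it.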

If you look at http://www.vagrantbox.es/ you will notice that the VirtualBox Guest Additions are not included in the VM, which is the cause of some of the problems you are having.
You might be able to install the VirtualBox Guest Additions using Puppet or whatever provisioning option you choose. I imagine you will have to set up the /vagrant shared folder manually.

Related

BUG: unable to handle kernel paging request, DPDK

I was trying to run the DPDK KNI application with DPDK version 16.07.2.
To do that, I first unbound the ports from ixgbe and bound them to the igb_uio module with the following commands:
echo 0000:05:00.1 > /sys/bus/pci/drivers/ixgbe/unbind
echo 0000:05:00.0 > /sys/bus/pci/drivers/ixgbe/unbind
echo 0x8086 0x1528 > /sys/bus/pci/drivers/igb_uio/new_id
I compiled the KNI application for the target machine, which runs Linux version 4.4.20 (sushila@dev03) (gcc version 4.9.2 (crosstool-NG 1.20.0) ) #1 SMP Fri Feb 24 14:32:28 CST 2017,
and when I ran the application it hung with the following message:
Feb 28 10:09:37 (none) user.alert kernel: [ 87.029554] BUG: unable to handle kernel paging request at 0000077e1d012900
Feb 28 10:09:37 (none) user.alert kernel: [ 87.029695] IP: [<ffffffffa0033722>] kni_net_rx_normal+0x2e2/0x440 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.029801] PGD 0
Feb 28 10:09:37 (none) user.warn kernel: [ 87.029889] Oops: 0000 [#1] SMP
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030010] Modules linked in: rte_kni(O) igb_uio(O)
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030167] CPU: 7 PID: 709 Comm: kni_single Tainted: G IO 4.4.20 #1
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030242] Hardware name: /DX58SO2, BIOS SOX5820J.86A.0603.2010.1117.1506 11/17/2010
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030320] task: ffff8805a8ad8000 ti: ffff8805a7ae0000 task.ti: ffff8805a7ae0000
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030395] RIP: 0010:[<ffffffffa0033722>] [<ffffffffa0033722>] kni_net_rx_normal+0x2e2/0x440 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030517] RSP: 0018:ffff8805a7ae3d30 EFLAGS: 00010286
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030576] RAX: 0000077e1d012900 RBX: 0000000000000020 RCX: 0000000000000010
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030639] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffa00388a3
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030701] RBP: ffff8805a7ae3e80 R08: 000000000000000a R09: 00000000fffffffe
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030766] R10: 00000000ffff2fea R11: 0000000000000006 R12: ffff8805a8a75000
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030829] R13: ffff8800b8c12800 R14: 0000000000000000 R15: ffff8805a8a75800
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030893] FS: 0000000000000000(0000) GS:ffff88062fce0000(0000) knlGS:0000000000000000
Feb 28 10:09:37 (none) user.warn kernel: [ 87.030971] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031031] CR2: 0000077e1d012900 CR3: 0000000001e0a000 CR4: 00000000000006e0
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031094] Stack:
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031148] ffff88062fcf5940 ffff8805a8ad8560 0000000000000000 ffff88060000054e
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031367] 0000077e1d012900 00000000b8c12800 00000000b8c11ec0 00000000b8c11580
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031587] 00000000b8c10c40 00000000b8c10300 00000000b8c0f9c0 00000000b8c0f080
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031811] Call Trace:
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031871] [<ffffffffa00343af>] kni_net_rx+0xf/0x20 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.031937] [<ffffffffa0032f05>] kni_thread_single+0x45/0xb0 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032004] [<ffffffffa0032ec0>] ? kni_init_net+0x50/0x50 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032067] [<ffffffff8107b7cb>] kthread+0xdb/0x100
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032125] [<ffffffff8107b6f0>] ? kthread_park+0x60/0x60
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032186] [<ffffffff81834c2f>] ret_from_fork+0x3f/0x70
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032246] [<ffffffff8107b6f0>] ? kthread_park+0x60/0x60
Feb 28 10:09:37 (none) user.warn kernel: [ 87.032306] Code: 48 89 85 d0 fe ff ff eb 80 41 f6 c6 0f 75 0e 48 c7 c7 9f 88 03 a0 31 c0 e8 02 e9 11 e1 48 8b 85 d0 fe ff ff 48 c7 c7 a3 88 03 a0 <42> 0f b6 34 30 31 c0 49 83 c6 01 e8 e4 e8 11 e1 e9 5e fe ff ff
Feb 28 10:09:37 (none) user.alert kernel: [ 87.034742] RIP [<ffffffffa0033722>] kni_net_rx_normal+0x2e2/0x440 [rte_kni]
Feb 28 10:09:37 (none) user.warn kernel: [ 87.034844] RSP <ffff8805a7ae3d30>
Feb 28 10:09:37 (none) user.warn kernel: [ 87.034900] CR2: 0000077e1d012900
Feb 28 10:09:37 (none) user.warn kernel: [ 87.034956] ---[ end trace 5b31765eb0372d51 ]---
From this I could see that it was failing somewhere in the kni_net_rx_normal() function in kni_net.c.
I narrowed the failure down to line 169, where the memcpy happens.
Next I printed some addresses in that function, which gave me:
kva data addresses: data_kva 0000077e1d012900 kva->buff_add 00007f7e1d012880 kva->data_off 128 kni->mbuf_va (null) and kni->mbuf_kva ffff880000000000
Next I tried to print the data at the data_kva address, and that failed too, so the crash seems to happen whenever I access data_kva @ 0000077e1d012900. I guess the address is wrong, but I don't know why. Can you give me some ideas on this, or some things to try in order to debug the problem?
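The printed values are enough to reproduce the bad address by hand. The sketch below is an assumption about how the address translation is done, not a copy of the DPDK source: it assumes data_kva is computed as buf_addr + data_off - mbuf_va + mbuf_kva with 64-bit wraparound. The function name `translate` is hypothetical; the input values are copied from the debug print above.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit pointer arithmetic


def translate(buf_addr: int, data_off: int, mbuf_va: int, mbuf_kva: int) -> int:
    """Assumed user-VA to kernel-VA translation: shift by (mbuf_kva - mbuf_va)."""
    return (buf_addr + data_off - mbuf_va + mbuf_kva) & MASK64


# Values from the debug print in the question.
buf_addr = 0x00007F7E1D012880
data_off = 128
mbuf_va = 0  # printed as (null)
mbuf_kva = 0xFFFF880000000000

data_kva = translate(buf_addr, data_off, mbuf_va, mbuf_kva)
print(f"{data_kva:016x}")  # prints 0000077e1d012900
```

With mbuf_va equal to zero the subtraction has no effect and the sum wraps around to exactly the faulting address. Under the assumption above, that suggests mbuf_va was never set for this KNI device, so checking how the mempool addresses are passed in when the device is created would be a reasonable next step.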

soft lockup in user process

I have an issue on a customer machine: a user-space process is hogging a processor (soft lockup) along with two kernel threads, and the stack dumps show RIP at __ticket_spin_lock in all three.
As far as I know, "If an user-space process had caused the soft-lockup, a line identifying the process by its pid would logged, followed by the contents of various CPU-registers without a call-trace of any sorts", but in my case I am getting a stack trace for the user process too.
Is it coming from a misbehaving user-space application?
Is this normal soft-lockup behavior? If so, how can I resolve the issue?
Any help will be highly appreciated.
It is an x86_64 machine and the kernel is 3.1.10. I know all three processes are waiting in __ticket_spin_lock.
See:
Aug 26 09:31:58 at-vie01a-cq21b kernel: [115452.492033] BUG: soft lockup - CPU#3 stuck for 22s! [virtio_shm/5/3:7874]
Aug 26 09:32:00 at-vie01a-cq21b kernel: [115455.404215] BUG: soft lockup - CPU#31 stuck for 23s! [kni_thread:6605]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]
Here gis is my user-space process, yet it has a call trace:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172017] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172098] CPU 0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172099] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172163]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172166] Pid: 14145, comm: gis Tainted: P 3.1.10-gb20-default #1 Intel Corporation S2600CO/S2600CO
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172170] RIP: 0010:[<ffffffff8102064d>] [<ffffffff8102064d>] __ticket_spin_lock+0x15/0x1b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172178] RSP: 0000:ffff88043ee03cf0 EFLAGS: 00000293
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172180] RAX: 00000000000069bf RBX: 00000000020110ac RCX: 000000000000000e
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172182] RDX: 00000000000069bc RSI: 000000000000000e RDI: ffff88041e56a484
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172184] RBP: ffff88041e56a484 R08: ffff88041e56a740 R09: ffff8804154a5840
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172187] R10: 00007f0afce77000 R11: 0000000000000000 R12: ffff88043ee03c68
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172189] R13: ffffffff813f831e R14: ffff88041e56a484 R15: ffff88041e568280
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172192] FS: 00007f0afd70b700(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172194] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172196] CR2: 00007f54f6b88098 CR3: 000000042427e000 CR4: 00000000000406f0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172199] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172201] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172204] Process gis (pid: 14145, threadinfo ffff88037537e000, task ffff88036a8fe180)
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172205] Stack:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172207] ffffffff8106b766 ffffffffa05e3a1e 0000000101b72e68 ffff8808260ae680
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172213] 0000002e1e568280 ffff880420450000 ffff88041f887a00 ffff880420450000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172218] ffffffff8192a870 0000000000000608 0000000000000000 ffffffff81928b00
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172224] Call Trace:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172233] [<ffffffff8106b766>] do_raw_spin_lock+0x5/0x8
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172240] [<ffffffffa05e3a1e>] packet_rcv+0x254/0x2ab [af_packet]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172257] [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172262] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172266] [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172277] [<ffffffffa031adee>] igb_clean_rx_irq+0x30d/0x39e [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172298] [<ffffffffa031aecd>] igb_poll+0x4e/0x74 [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172313] [<ffffffff81339c88>] net_rx_action+0x65/0x178
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172319] [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172324] [<ffffffff813f9aac>] call_softirq+0x1c/0x30
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172329] [<ffffffff81003931>] do_softirq+0x3c/0x7b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172333] [<ffffffff81045f98>] irq_exit+0x3c/0xac
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172337] [<ffffffff81003655>] do_IRQ+0x82/0x98
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172342] [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
From your description it looks like the issue is somewhere in the kernel and not in a user process. You are getting a kernel stack trace, which points in that direction. It just so happens that a user process was active on that CPU at the time.
A soft lockup is reported when a kernel thread of execution holds a processor for a prolonged time. Most of the time that is a sign of a problem in kernel code, for instance in a specific device driver used in your installation. It appears that you have a deadlock, which may happen for a variety of reasons. It is impossible to pinpoint the problem without seeing the code and the lockup's stack trace.
I got another dump. This is what I observed in the trace:
kernel: [115455.404446] [<ffffffff8106b766>] do_raw_spin_lock+0x5/0x8
kernel: [115455.404454] [<ffffffffa05e3a1e>] packet_rcv+0x254/0x2ab [af_packet]
kernel: [115455.404477] [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
kernel: [115455.404482] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
kernel: [115455.404487] [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
kernel: [115455.404497] [<ffffffffa031adee>] igb_clean_rx_irq+0x30d/0x39e [igb]
kernel: [115455.404517] [<ffffffffa031aecd>] igb_poll+0x4e/0x74 [igb]
kernel: [115455.404532] [<ffffffff81339c88>] net_rx_action+0x65/0x178
kernel: [115455.404538] [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
kernel: [115455.404544] [<ffffffff813f9aac>] call_softirq+0x1c/0x30
kernel: [115455.404550] [<ffffffff81003931>] do_softirq+0x3c/0x7b
kernel: [115455.404555] [<ffffffff81045f98>] irq_exit+0x3c/0xac
kernel: [115455.404558] [<ffffffff81003655>] do_IRQ+0x82/0x98
kernel: [115455.404565] [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
kernel: [115455.404573] [<ffffffffa05e0003>] atomic_inc+0x3/0x4 [af_packet]
kernel: [115455.404579] [<ffffffffa05e3a33>] packet_rcv+0x269/0x2ab [af_packet]
kernel: [115455.404589] [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
kernel: [115455.404593] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
kernel: [115455.404610] [<ffffffffa041bd4b>] kni_net_rx_normal+0x12d/0x178 [rte_kni]
kernel: [115455.404690] [<ffffffffa041ae58>] kni_thread+0x39/0x91 [rte_kni]
kernel: [115455.404758] [<ffffffff8105975a>] kthread+0x76/0x7e
kernel: [115455.404763] [<ffffffff813f99b4>] kernel_thread_helper+0x4/0x10
rte_kni runs in a kthread, which is process context much like user space. kni_net_rx_normal() calls netif_receive_skb() as an ordinary function call, whereas netif_receive_skb() is normally called in softirq context. If a softirq for the same socket then arrives on that same core, we deadlock, because rte_kni did not disable softirqs on the core before calling into the kernel function.
So, to disable softirqs here and avoid the race between timers and netif_receive_skb(), the call should either be replaced by netif_rx() or be wrapped in local_bh_disable()/local_bh_enable().
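The re-entry described above can be illustrated with a userspace analogy. This is only a sketch in Python, not kernel code: `threading.Lock` stands in for the non-recursive lock taken in packet_rcv(), and the hypothetical functions `receive_path` and `softirq_handler` stand in for the kthread path and for the softirq that interrupts it on the same CPU. A non-blocking acquire is used so that the example reports the deadlock instead of hanging.

```python
import threading

# Stand-in for the non-recursive lock taken inside packet_rcv().
socket_lock = threading.Lock()


def softirq_handler() -> bool:
    """Try to take the lock as the interrupting softirq would.

    A real spinlock would spin forever if the lock is already held on this
    CPU; here we only probe it and report whether it was available.
    """
    acquired = socket_lock.acquire(blocking=False)
    if acquired:
        socket_lock.release()
    return acquired


def receive_path(bh_disabled: bool) -> str:
    """Model the kthread calling into the receive path while holding the lock."""
    softirq_pending = False
    outcome = "ok"
    with socket_lock:
        if bh_disabled:
            # Analogue of local_bh_disable(): the softirq is deferred.
            softirq_pending = True
        elif not softirq_handler():
            # The softirq interrupts while the lock is held on the same CPU.
            outcome = "deadlock"
    if softirq_pending and not softirq_handler():
        outcome = "deadlock"
    return outcome


print(receive_path(bh_disabled=False))  # prints deadlock
print(receive_path(bh_disabled=True))   # prints ok
```

The second call models the proposed fix: the softirq only runs after the lock has been released, so it can always take the lock.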

why oom-killer with large inactive cache and enough free swap space?

It confuses me that the oom-killer ran even though there was a large inactive file page cache (734812kB) and a large dirty cache (800088kB) that, it seemed, could have been reclaimed. Why did the oom-killer trigger?
vm.swappiness was set to 0. As the Linux documentation says, 0 does not mean avoiding swap completely, and on another server with the same settings and OS I have seen swap usage reach 300MB with swappiness=0.
OS info:
CentOS 6.4, kernel 2.6.32-358.23.2.el6.x86_64
The oom-killer log is as follows:
Aug 26 14:34:48 withivan.me kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Aug 26 14:34:48 withivan.me kernel: java cpuset=/ mems_allowed=0
Aug 26 14:34:48 withivan.me kernel: Pid: 28505, comm: java Not tainted 2.6.32-358.23.2.el6.x86_64 #1
Aug 26 14:34:48 withivan.me kernel: Call Trace:
Aug 26 14:34:48 withivan.me kernel: [<ffffffff810cb641>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8111ce40>] ? dump_header+0x90/0x1b0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8121d4ec>] ? security_real_capable_noaudit+0x3c/0x70
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8111d201>] ? select_bad_process+0xe1/0x120
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8111d700>] ? out_of_memory+0x220/0x3c0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8112c3dc>] ? __alloc_pages_nodemask+0x8ac/0x8d0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff81160c6a>] ? alloc_pages_current+0xaa/0x110
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8148d667>] ? tcp_sendmsg+0x677/0xa20
Aug 26 14:34:48 withivan.me kernel: [<ffffffff81435f33>] ? sock_sendmsg+0x123/0x150
Aug 26 14:34:48 withivan.me kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Aug 26 14:34:48 withivan.me kernel: [<ffffffff810aa43e>] ? futex_wake+0x10e/0x120
Aug 26 14:34:48 withivan.me kernel: [<ffffffff810ac3a0>] ? do_futex+0x100/0xb60
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8119cfdf>] ? destroy_inode+0x2f/0x60
Aug 26 14:34:48 withivan.me kernel: [<ffffffff81436249>] ? sys_sendto+0x139/0x190
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8103b8cc>] ? kvm_clock_read+0x1c/0x20
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
Aug 26 14:34:48 withivan.me kernel: [<ffffffff810a1507>] ? getnstimeofday+0x57/0xe0
Aug 26 14:34:48 withivan.me kernel: [<ffffffff810a15fa>] ? do_gettimeofday+0x1a/0x50
Aug 26 14:34:48 withivan.me kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Aug 26 14:34:48 withivan.me kernel: Mem-Info:
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA per-cpu:
Aug 26 14:34:48 withivan.me kernel: CPU 0: hi: 0, btch: 1 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 1: hi: 0, btch: 1 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 2: hi: 0, btch: 1 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 3: hi: 0, btch: 1 usd: 0
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA32 per-cpu:
Aug 26 14:34:48 withivan.me kernel: CPU 0: hi: 186, btch: 31 usd: 32
Aug 26 14:34:48 withivan.me kernel: CPU 1: hi: 186, btch: 31 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 2: hi: 186, btch: 31 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 3: hi: 186, btch: 31 usd: 1
Aug 26 14:34:48 withivan.me kernel: Node 0 Normal per-cpu:
Aug 26 14:34:48 withivan.me kernel: CPU 0: hi: 186, btch: 31 usd: 4
Aug 26 14:34:48 withivan.me kernel: CPU 1: hi: 186, btch: 31 usd: 38
Aug 26 14:34:48 withivan.me kernel: CPU 2: hi: 186, btch: 31 usd: 0
Aug 26 14:34:48 withivan.me kernel: CPU 3: hi: 186, btch: 31 usd: 57
Aug 26 14:34:48 withivan.me kernel: active_anon:1697553 inactive_anon:373583 isolated_anon:0
Aug 26 14:34:48 withivan.me kernel: active_file:174263 inactive_file:199171 isolated_file:0
Aug 26 14:34:48 withivan.me kernel: unevictable:0 dirty:216860 writeback:1 unstable:0
Aug 26 14:34:48 withivan.me kernel: free:35470 slab_reclaimable:14993 slab_unreclaimable:6945
Aug 26 14:34:48 withivan.me kernel: mapped:3423 shmem:45 pagetables:5263 bounce:0
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA free:15740kB min:148kB low:184kB high:220kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15344kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug 26 14:34:48 withivan.me kernel: lowmem_reserve[]: 0 3512 10077 10077
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA32 free:61200kB min:34800kB low:43500kB high:52200kB active_anon:2479864kB inactive_anon:621800kB active_file:39752kB inactive_file:61872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596500kB mlocked:0kB dirty:67352kB writeback:0kB mapped:20kB shmem:0kB slab_reclaimable:21672kB slab_unreclaimable:1952kB kernel_stack:3792kB pagetables:4344kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:94464 all_unreclaimable? no
Aug 26 14:34:48 withivan.me kernel: lowmem_reserve[]: 0 0 6565 6565
Aug 26 14:34:48 withivan.me kernel: Node 0 Normal free:64940kB min:65048kB low:81308kB high:97572kB active_anon:4310348kB inactive_anon:872532kB active_file:657300kB inactive_file:734812kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:6722560kB mlocked:0kB dirty:800088kB writeback:4kB mapped:13672kB shmem:180kB slab_reclaimable:38300kB slab_unreclaimable:25828kB kernel_stack:4568kB pagetables:16708kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:605088 all_unreclaimable? no
Aug 26 14:34:48 withivan.me kernel: lowmem_reserve[]: 0 0 0 0
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA: 3*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15740kB
Aug 26 14:34:48 withivan.me kernel: Node 0 DMA32: 1247*4kB 1189*8kB 969*16kB 379*32kB 92*64kB 12*128kB 1*256kB 8*512kB 5*1024kB 1*2048kB 0*4096kB = 61076kB
Aug 26 14:34:48 withivan.me kernel: Node 0 Normal: 2047*4kB 1672*8kB 781*16kB 309*32kB 46*64kB 3*128kB 1*256kB 20*512kB 7*1024kB 0*2048kB 0*4096kB = 64940kB
Aug 26 14:34:48 withivan.me kernel: 379124 total pagecache pages
Aug 26 14:34:48 withivan.me kernel: 4685 pages in swap cache
Aug 26 14:34:48 withivan.me kernel: Swap cache stats: add 167082, delete 162397, find 114795/130707
Aug 26 14:34:48 withivan.me kernel: Free swap = 4166416kB
Aug 26 14:34:48 withivan.me kernel: Total swap = 4194296kB
Aug 26 14:34:48 withivan.me kernel: 2621439 pages RAM
Aug 26 14:34:48 withivan.me kernel: 89408 pages reserved
Aug 26 14:34:48 withivan.me kernel: 384993 pages shared
Aug 26 14:34:48 withivan.me kernel: 2116876 pages non-shared
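The figures in the log can be cross-checked with a little arithmetic. The sketch below is not part of the original post: it copies the "Node 0 Normal" values from the log, assumes 4 kB pages, and uses hypothetical variable names. It recomputes the zone's free memory from the per-order free lists, compares it with the zone's min watermark, and converts the global inactive_file page count to kB.

```python
PAGE_KB = 4  # assumed page size in kB

# "Node 0 Normal" free lists from the log: block size in kB -> number of free blocks.
normal_free_lists = {
    4: 2047, 8: 1672, 16: 781, 32: 309, 64: 46, 128: 3,
    256: 1, 512: 20, 1024: 7, 2048: 0, 4096: 0,
}
normal_free_kb = sum(size * count for size, count in normal_free_lists.items())

# Watermark reported for the Normal zone.
normal_min_kb = 65048

# Global inactive_file page count and the per-zone inactive_file values (kB).
inactive_file_pages = 199171
inactive_file_by_zone_kb = {"DMA": 0, "DMA32": 61872, "Normal": 734812}

print(normal_free_kb)                          # prints 64940
print(normal_free_kb < normal_min_kb)          # prints True
print(inactive_file_pages * PAGE_KB)           # prints 796684
print(sum(inactive_file_by_zone_kb.values()))  # prints 796684
```

The recomputed free memory matches the reported 64940kB, which is just below the Normal zone's min watermark of 65048kB, so that zone was already under pressure when the allocation failed. This does not by itself explain why the file cache was not reclaimed.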

Hadoop causes system crash with "soft lock" and "hard lock"

I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crash after a while. /var/log/messages repeatedly shows:
Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: CPU 1
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel:
Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G W --------------- 2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>] [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8 EFLAGS: 00000202
Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
Aug 11 06:30:42 jn4_73_128 kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
Aug 11 06:30:42 jn4_73_128 kernel: Stack:
Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
</em>
and the machine finally crashed:
crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
crash 6.1.0-5.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
please wait... (determining panic task)
WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Sun Aug 10 09:47:32 2014
UPTIME: 7 days, 16:00:19
LOAD AVERAGE: 11.01, 3.11, 1.08
TASKS: 724
NODENAME: master1.otocyon.com
RELEASE: 2.6.32-431.5.1.el6.x86_64
VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
MACHINE: x86_64 (1895 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
PID: 23976
COMMAND: "sh"
TASK: ffff881071850aa0 [THREAD_INFO: ffff880a05c80000]
CPU: 0
STATE: TASK_INTERRUPTIBLE (PANIC)
crash> bt
PID: 23976 TASK: ffff881071850aa0 CPU: 0 COMMAND: "sh"
#0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
#1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
#2 [ffff880028207c80] panic at ffffffff8152751a
#3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
#4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
#5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
#6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
#7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
#8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
#9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff880028207ef0] notify_die at ffffffff810a153e
#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
#12 [ffff880028207f50] nmi at ffffffff8152adb0
[exception RIP: task_rq_unlock_wait+44]
RIP: ffffffff810534fc RSP: ffff880a05c81dc8 RFLAGS: 00000016
RAX: 000000000ec70ebe RBX: ffff881071850040 RCX: ffff8800282d6840
RDX: 0000000000000ec7 RSI: 0000000000000000 RDI: ffff881071850040
RBP: ffff880a05c81dc8 R8: dead000000200200 R9: dead000000200200
R10: ffff8810734a42d0 R11: 0000000000000246 R12: 00000000000114b8
R13: ffff8810734a4180 R14: ffff881071fd3440 R15: ffff881071fd3c48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff880a05c81dc8] task_rq_unlock_wait at ffffffff810534fc
#14 [ffff880a05c81dd0] release_task at ffffffff81075454
#15 [ffff880a05c81e10] wait_consider_task at ffffffff81075fb6
#16 [ffff880a05c81e80] do_wait at ffffffff810763e6
#17 [ffff880a05c81ee0] sys_wait4 at ffffffff810765d3
#18 [ffff880a05c81f80] system_call_fastpath at ffffffff8100b072
RIP: 0000003e1a2ac8be RSP: 00007fffa58c6330 RFLAGS: 00010207
RAX: 000000000000003d RBX: ffffffff8100b072 RCX: 0000003e1a232be0
RDX: 0000000000000000 RSI: 00007fffa58c62ec RDI: ffffffffffffffff
RBP: 00000000ffffffff R8: 000000000203b8d0 R9: 000000000203d590
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000005d00
ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
It has happened on machines from different vendors, and I have tried updating to the latest kernel from Red Hat.
Can anyone who has had the same experience help?

usb_register_dev crashing linux kernel

This is for a class, but we are stumped. We are writing a USB driver for a Logitech camera that uses USBCore. We load the module, and when we then connect the USB camera, the kernel crashes and gives us a kernel trace (below). After a bit of debugging, we are fairly sure it crashes in usb_register_dev within the probe function, but we can't figure out why. We were hoping someone could offer suggestions or point us down the right path. We're not asking for answers, just guidance.
We have looked at all of our variable initializers and, based on our notes and skull examples, they look all right. Below are code snippets of the important functions and the call trace.
Kernel (Custom school, but based on 3.2.34):
Linux ETSELE 3.2.34etsele #1 SMP PREEMPT Tue Jan 22 18:22:05 EST 2013 i686 i686 i386 GNU/Linux
Init:
static int __init
usb_cam_init(void) {
    int result = 0;
    if ((result = usb_register(&cam_driver)))
        printk("usb_register failed. Error number %d", result);
    return result;
}
Probe:
static int
usb_cam_probe(struct usb_interface * intf, const struct usb_device_id * devid) {
    int retval = 0;
    struct usb_host_interface *interface;
    struct usb_endpoint_descriptor *endpoint;
    struct usb_device *dev = interface_to_usbdev(intf);
    struct usb_cam *usbdev = NULL;
    int n, m, altSetNum, activeInterface = -1;
    printk("kmalloc\n");
    usbdev = kmalloc(sizeof(struct usb_ele_cam), GFP_KERNEL); ///////////////
    printk("usb_get_dev\n");
    usbdev->usb_dev = usb_get_dev(dev);
    usbdev->class = (struct usb_class_driver *) kmalloc(sizeof(struct usb_class_driver), GFP_KERNEL);
    usbdev->class->name = "cam";
    usbdev->class->fops = &cam_fops;
    usbdev->class->minor_base = 0;
    // usbdev->class->mode = O_RDWR;
    printk("for\n");
    for (n = 0; n < intf->num_altsetting; n++) {
        interface = &intf->altsetting[n];
        altSetNum = interface->desc.bAlternateSetting;
        for (m = 0; m < interface->desc.bNumEndpoints; m++) {
            endpoint = &interface->endpoint[m].desc;
            if (!usbdev->bulk_in_endpointAddr && (endpoint->bEndpointAddress & USB_DIR_IN)
                && ((endpoint->bmAttributes & USB_ENDPOINT_XFERTYPE_MASK) == USB_ENDPOINT_XFER_BULK)) {
                usbdev->bulk_in_size = endpoint->wMaxPacketSize;
                usbdev->bulk_in_endpointAddr = endpoint->bEndpointAddress;
                usbdev->bulk_in_buffer = kmalloc(usbdev->bulk_in_size, GFP_KERNEL);
                activeInterface = altSetNum;
                break;
            }
        }
        if (activeInterface != -1)
            break;
    }
    printk("usb_set_intfdata\n");
    usb_set_intfdata(intf, usbdev);
    printk("usb_register_dev\n");
    usb_register_dev(intf, usbdev->class);
    //printk("Not able to get a minor for this device");
    printk("usb_set_interface\n");
    usb_set_interface(dev, interface->desc.bInterfaceNumber, activeInterface);
    return retval;
}
Structures and global variables:
struct usb_cam {
    struct usb_device *usb_dev;
    struct usb_interface *usb_inf;
    struct usb_class_driver *class;
    struct semaphore sem;
    unsigned char *bulk_in_buffer;
    size_t bulk_in_size;
    __u8 bulk_in_endpointAddr;
    __u8 bulk_out_endpointAddr;
    int errors;
    int open_count;
    struct kref kref;
};
Logs from kern.log:
Nov 26 11:25:15 ETSELE kernel: [ 123.845972] usbcore: deregistering interface driver uvcvideo
Nov 26 11:25:32 ETSELE kernel: [ 140.234188] kmalloc
Nov 26 11:25:32 ETSELE kernel: [ 140.234192] usb_get_dev
Nov 26 11:25:32 ETSELE kernel: [ 140.234194] for
Nov 26 11:25:32 ETSELE kernel: [ 140.234196] usb_set_intfdata
Nov 26 11:25:32 ETSELE kernel: [ 140.234198] usb_register_dev
Nov 26 11:25:32 ETSELE kernel: [ 140.234450] BUG: unable to handle kernel paging request at 6d742e65
Nov 26 11:25:32 ETSELE kernel: [ 140.234506] IP: [<6d742e65>] 0x6d742e64
Nov 26 11:25:32 ETSELE kernel: [ 140.234539] *pdpt = 000000002bf84001 *pde = 0000000000000000
Nov 26 11:25:32 ETSELE kernel: [ 140.234585] Oops: 0010 [#1] PREEMPT SMP
Nov 26 11:25:32 ETSELE kernel: [ 140.234619] Modules linked in: usb_cam(O+) snd_usb_audio snd_usbmidi_lib videodev vtsspp(O) sep3_10(O) pax(O) autofs4 apwr3_1(O) bnep rfcomm bluetooth parport_pc ppdev tpm_infineon binfmt_misc snd_hda_codec_realtek nfsd nfs snd_hda_intel lockd snd_hda_codec fscache auth_rpcgss snd_hwdep nfs_acl snd_pcm sunrpc snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device hp_wmi sparse_keymap snd dm_multipath psmouse serio_raw tpm_tis mac_hid soundcore snd_page_alloc mei(C) lp parport dm_raid45 xor dm_mirror dm_region_hash dm_log btrfs zlib_deflate libcrc32c usbhid hid e1000e i915 drm_kms_helper drm i2c_algo_bit video wmi zram(C) [last unloaded: uvcvideo]
Nov 26 11:25:32 ETSELE kernel: [ 140.235146]
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] Pid: 3153, comm: insmod Tainted: G C O 3.2.34etsele #1 Hewlett-Packard HP Compaq 6000 Pro MT PC/3048h
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] EIP: 0060:[<6d742e65>] EFLAGS: 00210206 CPU: 0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] EIP is at 0x6d742e65
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] EAX: ea9b8800 EBX: ea9b8800 ECX: 6d742e65 EDX: eb18dc90
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] ESI: eb18dc90 EDI: eb18dc90 EBP: eb18dc30 ESP: eb18dc24
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] Process insmod (pid: 3153, ti=eb18c000 task=eb30d400 task.ti=eb18c000)
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] Stack:
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] c144a1fd ea9b8800 eb18dc4c eb18dc44 c13b05df ea9b8800 00000000 ea9b8808
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] eb18dca0 c13b6be1 00000000 eb3bc0d0 eb18dc94 c11c5560 00000000 00000000
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] 0000000a eb3bffff 00000001 14e7232a eb18dcd5 ffffffff eb18dc80 f7022208
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] Call Trace:
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c144a1fd>] ? usb_devnode+0x2d/0x40
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b05df>] device_get_devnode+0x5f/0xd0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b6be1>] devtmpfs_create_node+0x41/0x100
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c11c5560>] ? sysfs_do_create_link+0xb0/0x1e0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13aff3f>] device_add+0x1ff/0x620
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b9ae0>] ? device_pm_init+0x60/0x80
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b0377>] device_register+0x17/0x20
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b0431>] device_create_vargs+0xb1/0xe0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b048d>] device_create+0x2d/0x30
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c144a093>] usb_register_dev+0x133/0x270
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c15e8afd>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<f98d81dc>] ele784_probe+0x17c/0x1bc [usb_cam]
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c14482ae>] usb_probe_interface+0xce/0x210
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2815>] ? driver_sysfs_add+0x75/0xa0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2a0f>] driver_probe_device+0x8f/0x2e0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c15e7392>] ? mutex_lock_nested+0x42/0x50
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2cf9>] __driver_attach+0x99/0xa0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2c60>] ? driver_probe_device+0x2e0/0x2e0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b1979>] bus_for_each_dev+0x49/0x70
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2661>] driver_attach+0x21/0x30
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b2c60>] ? driver_probe_device+0x2e0/0x2e0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b22b7>] bus_add_driver+0x1c7/0x2e0
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c13b31d6>] driver_register+0x66/0x110
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c12e5912>] ? __raw_spin_lock_init+0x32/0x60
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c1447229>] usb_register_driver+0x79/0x140
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<f90bc01b>] ele784_init+0x1b/0x1000 [usb_cam]
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c103b3ef>] ? set_memory_nx+0x5f/0x70
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c1003035>] do_one_initcall+0x35/0x170
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<f90bc000>] ? 0xf90bbfff
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c10a3aeb>] sys_init_module+0x2db/0x1d60
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] [<c15ef79f>] sysenter_do_call+0x12/0x38
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] Code: Bad EIP value.
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] EIP: [<6d742e65>] 0x6d742e65 SS:ESP 0068:eb18dc24
Nov 26 11:25:32 ETSELE kernel: [ 140.235146] CR2: 000000006d742e65
Nov 26 11:25:32 ETSELE kernel: [ 140.361304] ---[ end trace 3f64a15c3c778575 ]---
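Your usb_class_driver structure must be correctly initialized. kmalloc does not zero the memory it returns, so every member you never assign (in particular the devnode callback pointer) holds garbage; the trace shows the kernel jumping from usb_devnode to the bogus address 0x6d742e65.
You could use kzalloc instead of kmalloc, but having multiple classes for multiple cameras would be wrong, so you should make the camera class a static variable (like in every other driver that uses usb_register_dev).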
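To illustrate why a static definition avoids the problem, here is a minimal userspace sketch, not the driver itself. The type mock_class_driver and the helper mock_devnode_lookup are hypothetical stand-ins invented for this example; they only mirror the idea that the structure carries an optional devnode callback which is skipped when NULL and called otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct usb_class_driver;
 * only the members relevant to the crash are mirrored here. */
struct mock_class_driver {
    const char *name;
    char *(*devnode)(void *dev);   /* optional callback, may be NULL */
    const void *fops;
    int minor_base;
};

/* Static definition with designated initializers: every member that is
 * not named here (devnode) is guaranteed to be zero/NULL. */
static struct mock_class_driver cam_class = {
    .name = "cam",
    .fops = NULL,        /* would be &cam_fops in the real driver */
    .minor_base = 0,
};

/* Mimics the kind of check made before using the callback:
 * a NULL devnode is skipped, any other value is called blindly,
 * which is why an uninitialized pointer is fatal. */
static char *mock_devnode_lookup(struct mock_class_driver *drv, void *dev)
{
    assert(drv != NULL);
    if (!drv->devnode)
        return NULL;
    return drv->devnode(dev);
}
```

In the actual module the same pattern would presumably be a file-scope static struct usb_class_driver with .name, .fops and .minor_base set, passed as usb_register_dev(intf, &cam_class), with no kmalloc for the class at all.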

Resources