I have a Go service doing heavy reads from an NFS / GPFS volume. I do occasionally run into issues at scale during which the underlying mount would not answer to a specific syscall, resulting in the entire service being taken down by the kernel:
[98549.941930] Tainted: G O 4.14.13-1.el7.elrepo.x86_64 #1
[98549.942454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[98549.943422] ls D 0 14884 1 0x00000084
[98549.943968] Call Trace:
[98549.944498] __schedule+0x28d/0x880
[98549.945033] schedule+0x36/0x80
[98549.945552] schedule_preempt_disabled+0xe/0x10
[98549.946095] __mutex_lock.isra.5+0x269/0x500
[98549.946611] __mutex_lock_slowpath+0x13/0x20
[98549.947153] mutex_lock+0x2f/0x40
[98549.947695] fuse_lock_inode+0x2a/0x30 [fuse]
[98549.948248] fuse_readdir+0x113/0x7e0 [fuse]
[98549.948795] iterate_dir+0x16e/0x190
[98549.949323] ? __audit_syscall_entry+0xaf/0x100
[98549.949847] SyS_getdents+0x98/0x120
[98549.950358] ? iterate_dir+0x190/0x190
[98549.950898] do_syscall_64+0x67/0x1b0
[98549.951410] entry_SYSCALL64_slow_path+0x25/0x25
[98549.951948] RIP: 0033:0x7ffff749dcb5
[98549.952454] RSP: 002b:00007fffffffd160 EFLAGS: 00000246 ORIG_RAX: 000000000000004e
[98549.953423] RAX: ffffffffffffffda RBX: 00000000006260a0 RCX: 00007ffff749dcb5
[98549.953985] RDX: 0000000000008000 RSI: 00000000006260a0 RDI: 0000000000000005
[98549.954518] RBP: 00000000006260a0 R08: 0000000000000080 R09: 0000000000008030
[98549.955131] R10: 00007fffffffced0 R11: 0000000000000246 R12: fffffffffffffe90
[98549.955655] R13: 0000000000000000 R14: 0000000000626030 R15: 0000000000626000
I an looking for a way to add a timeout so that no individual failed syscall can take the whole service down, but cannot find a good way to do it in Go.
One usual design I found is to run syscalls from an OS thread and kill this thread on timeout, but that does not seem like a possibility with Golang due to the lack of control on the underlying system threads. The service is typically executing large amounts of syscalls in parallel (possibly hundreds).
you can get the pid of the process using the sycall package
pid, _, _ := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
and latter kill the process by pid
select {
case end = <-endSignal:
fmt.Println("The end!")
case <-time.After(5 * time.Second):
proc, _ := os.FindProcess(pid)
// Kill the process
proc.Kill()
}
Related
An Mbed code is throwing the crash dump below and I wish to find the line corresponding to the given PC. I'm on Windows though, so the simple "addr2line" is not available. I tried addr2line with Ubuntu shell on Windows, but it gives ??:?
What is the best tool on Windows 10 to perform address-to-line resolution from ARM ELF?
++ MbedOS Fault Handler ++
FaultType: HardFault
Context:
R0: 0
R1: 2000A0C8
R2: 1
R3: 14
R4: 20007854
R5: 2000A0
R6: 68
R7: 0
R8: 0
R9: 0
R10: 0
R11: 0
R12: 29FC1
SP : 2000A8B8
LR : 2C007
PC : 2000A0C8
xPSR : 0
PSP : 2000A898
MSP : 2003FFC0
CPUID: 410FC241
HFSR : 40000000
MMFSR: 0
BFSR : 0
UFSR : 2
DFSR : 0
AFSR : 0
Mode : Thread
Priv : Privileged
Stack: PSP
-- MbedOS Fault Handler --
I write a driver for i2c rtc chip for learning purpose. The driver can detect interrupts on GPIO pin from rtc chip. I want to schedule a tasklet into interrupt context and do some usefull work into the tasklet later.
GPIO irq handler:
static irqreturn_t alarm_irq_handler(int irq, void *dev)
{
struct ds3231_state *driver_data;
driver_data = i2c_get_clientdata(to_i2c_client(dev));
if (!driver_data)
return IRQ_NONE;
tasklet_schedule(&driver_data->tasklet);
return IRQ_HANDLED;
}
Tasklet function:
Into the tasklet function I want to do some communucation with i2c-chip, because I need to clear interrupt flags into the chip.
static void bottom_half_ds3231_handler(unsigned long data)
{
struct device *dev = (struct device *)data;
struct i2c_client *client = to_i2c_client(dev);
printk(KERN_INFO "ds3231 gpio alarm interrupt detect\n");
ds3231_disable_onchip_alarm_detect(client); // <--- this functions use I2C
ds3231_clear_onchip_alarm_flags(client);
}
When I try to use i2c, my kernel fails.
I know about i2c layer from non-process context.
I2C functions are so slow, that's why i have wanted to do slow work into bottom half.
But i can't use i2c into bottom half. Why?
How can i reset some flags into the chip after interrupts have been detected?
UPD 1: show stack trace after the kernel was failed
# [ 3160.248272] my-ds3231-rtc 1-0068: ds3231 gpio alarm interrupt detect
[ 3160.254792] BUG: scheduling while atomic: swapper/0/0/0x00000100
[ 3160.260827] Modules linked in: rtc_ds3231(O)
[ 3160.265127] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.5.0 #1
[ 3160.272376] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 3160.278536] [<c0312994>] (unwind_backtrace) from [<c030cc4c>] (show_stack+0x10/0x14)
[ 3160.286325] [<c030cc4c>] (show_stack) from [<c0ef7270>] (dump_stack+0xc0/0xd4)
[ 3160.293589] [<c0ef7270>] (dump_stack) from [<c0370324>] (__schedule_bug+0x68/0x88)
[ 3160.301204] [<c0370324>] (__schedule_bug) from [<c0f10f90>] (__schedule+0x450/0x618)
[ 3160.308983] [<c0f10f90>] (__schedule) from [<c0f111c0>] (schedule+0x68/0xf4)
[ 3160.316066] [<c0f111c0>] (schedule) from [<c0f15110>] (schedule_timeout+0x188/0x30c)
[ 3160.323846] [<c0f15110>] (schedule_timeout) from [<c0f1214c>] (wait_for_completion_timeout+0xd8/0x16c)
[ 3160.333205] [<c0f1214c>] (wait_for_completion_timeout) from [<c0c6f174>] (omap_i2c_xfer_common+0x3e4/0x5a8)
[ 3160.342998] [<c0c6f174>] (omap_i2c_xfer_common) from [<c0c5bfec>] (__i2c_transfer+0x1d8/0x650)
[ 3160.351650] [<c0c5bfec>] (__i2c_transfer) from [<c0c5c4c0>] (i2c_transfer+0x5c/0x104)
[ 3160.359528] [<c0c5c4c0>] (i2c_transfer) from [<bf000244>] (read_reg+0x68/0xb4 [rtc_ds3231])
[ 3160.367924] [<bf000244>] (read_reg [rtc_ds3231]) from [<bf000ec4>] (ds3231_disable_onchip_alarm_detect+0x28/0xb0 [rtc_ds3231])
[ 3160.379371] [<bf000ec4>] (ds3231_disable_onchip_alarm_detect [rtc_ds3231]) from [<bf001220>] (bottom_half_ds3231_handler+0x38/0xcc [rtc_ds3231])
[ 3160.392389] [<bf001220>] (bottom_half_ds3231_handler [rtc_ds3231]) from [<c034dfdc>] (tasklet_action_common.constprop.3+0x70/0x174)
[ 3160.404272] [<c034dfdc>] (tasklet_action_common.constprop.3) from [<c03022d8>] (__do_softirq+0x130/0x3b4)
[ 3160.413881] [<c03022d8>] (__do_softirq) from [<c034e72c>] (irq_exit+0xcc/0xd8)
[ 3160.421141] [<c034e72c>] (irq_exit) from [<c039df38>] (__handle_domain_irq+0x60/0xb4)
[ 3160.429007] [<c039df38>] (__handle_domain_irq) from [<c0301acc>] (__irq_svc+0x6c/0x90)
[ 3160.436956] Exception stack(0xc1801f10 to 0xc1801f58)
[ 3160.442030] 1f00: 00000000 0002077c df9a7d70 c031dcc0
[ 3160.450243] 1f20: ffffe000 c1804e6c c1804eb0 00000001 00000000 c1804e48 c1760628 00000000
[ 3160.458455] 1f40: 00000001 c1801f60 c0309194 c0309198 60000013 ffffffff
[ 3160.465107] [<c0301acc>] (__irq_svc) from [<c0309198>] (arch_cpu_idle+0x38/0x3c)
[ 3160.472540] [<c0309198>] (arch_cpu_idle) from [<c03783f4>] (do_idle+0x1bc/0x2ac)
[ 3160.479970] [<c03783f4>] (do_idle) from [<c03787a4>] (cpu_startup_entry+0x18/0x1c)
[ 3160.487576] [<c03787a4>] (cpu_startup_entry) from [<c1600dac>] (start_kernel+0x480/0x4b0)
[ 3160.495786] bad: scheduling from the idle thread!
I use uio generic driver with HW composed of PCIe device (FPGA) connected to Intel ATOM cpu.
But, on testing, although interrupt is seen in the driver, it is not delivered to userspace.
These are the steps I'm doing:
echo "10ee 0007" > /sys/bus/pci/drivers/uio_pci_generic/new_id
I use userspace application which wait for interrupt, just as described in code example here.
I than trigger an interrupt from FPGA, but no print from the userspace application is given and there is an exception:
irq 23: nobody cared (try booting with the "irqpoll" option)
[ 91.030760] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.16 #6
[ 91.037037] Hardware name: /conga-MA5, BIOS MA50R000 10/30/2019
[ 91.043302] Call Trace:
[ 91.045881] <IRQ>
[ 91.048002] dump_stack+0x5c/0x80
[ 91.051464] __report_bad_irq+0x35/0xaf
[ 91.055465] note_interrupt.cold.9+0xa/0x63
[ 91.059823] handle_irq_event_percpu+0x68/0x70
[ 91.064470] handle_irq_event+0x37/0x57
[ 91.068481] handle_fasteoi_irq+0x97/0x150
...
[ 91.176043] handlers:
[ 91.178419] [<00000000ec05b056>] uio_interrupt
[ 91.183054] Disabling IRQ #23
I started debugging the uio driver, and I see that interrupt handler is called, but not handled:
static irqreturn_t irqhandler(int irq, struct uio_info *info)
{
struct uio_pci_generic_dev *gdev = to_uio_pci_generic_dev(info);
printk("here 1\n"); <<--- we get here interrupt is catched here
if (!pci_check_and_mask_intx(gdev->pdev))
return IRQ_NONE;
printk("here 2\n"); <<--- But we never get here
/* UIO core will signal the user process. */
return IRQ_HANDLED;
}
It seems that pci_check_and_mask_intx() does not detect it as an interrupt from our device!
The device appear as following:
02:00.0 RAM memory: Xilinx Corporation Default PCIe endpoint ID
Subsystem: Xilinx Corporation Default PCIe endpoint ID
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 23
Region 0: Memory at 91200000 (32-bit, non-prefetchable) [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [58] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 1, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: uio_pci_generic
Is it an issue of FPGA device ? or does UIO generic PCI does not support interrupts ?
Eventually after commenting the following lines for irqhandler,
I am able to receive interrupts from FPGA
static irqreturn_t irqhandler(int irq, struct uio_info *info)
{
struct uio_pci_generic_dev *gdev = to_uio_pci_generic_dev(info);
- if (!pci_check_and_mask_intx(gdev->pdev))
- return IRQ_NONE;
/* UIO core will signal the user process. */
return IRQ_HANDLED;
}
Yet, obviously it is still a workaround, and we need later to investigate why FPGA does not deliver bit status change in status register in configuration space of PCIe together with the irq.
Is there some mechanism in Linux which is poisoning addresses by zeroing upper 16 bits?
I am debugging a kernel crash on an Intel x86-64 machine. The instruction which is causing the crash tries to access an address of 0x880139f3da00:
crash> bt
R10: 0000000000000001 R11: 0000000000000001 R12: 0000880139f3da00
~~~~~~~~~~~~~~~~~~~~~
crash> p arp_tbl.nht->hash_buckets[255]
$66 = (struct neighbour *) 0x880139f3da00
crash> p *arp_tbl.nht->hash_buckets[255]
Cannot access memory at address 0x880139f3da00
The hash_buckets table is valid:
crash> p arp_tbl.nht->hash_buckets[253]
$70 = (struct neighbour *) 0xffff88007325ae00
$71 = {
next = 0x0,
tbl = 0xffffffff81abbf20 <arp_tbl>,
Setting upper word to 0xffff makes the address valid and returns a valid data structure:
crash> p *((struct neighbour *)0xffff880139f3da00)
$73 = {
next = 0xffff88006de69a00,
tbl = 0xffffffff81abbf20 <arp_tbl>,
... rest looks reasonable too ...
Structure is updated by RCU operations (e.g. very likely by these in neigh_flush_dev()). So, what could be the reason that the address becomes invalid in such a way?
I can exclude hardware defects (seen on two machines and with different addresses). Systems are running CentOS 7 with kernel 3.10.0-514.6.1.el7.centos.plus.x86_64 till 3.10.0-514.21.2.el7.centos.plus.x86_64.
Update
From another crash dump, I see an skb of an IPv6 packet with
crash> p *((struct sk_buff *)0xffff880070e25e00)
$57 = {
transport_header = 54,
network_header = 14,
mac_header = 0,
...
head = 0xffff880138e28000 "",
data = 0xffff880138e2800e "`",
...
}
This crashes when writing the first 0x8 bytes in
#define HH_DATA_MOD 16
static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
{
if (likely(hh_len <= HH_DATA_MOD)) {
memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD); <<<<<
This would explain why two bytes are overridden (16 - 14).
can you inspect the memory location this address was read from? typically such a "partial zero" read is a result of memset being run on the area. after this cpu triggered a crash there was possibly enough time for whoever else was modifying the area to finish zeroing and possibly even fill it with other data.
so far there is no reason to suspect rcu plays any role here
this is most definitely not "poisoning" done by the kernel (it would be quite weird to do it in this way). however, if the crash is reproducible (you say it occurred on at least 2 different machines?) then running a debug kernel may be of help, especially with slab debug enabled.
We are using FPGA cards with PCI express drivers to move data around with DMA engines. This all works fine for a single card in a machine, however with two cards it fails. As an initial investigation, I have narrowed an error down to the add_timer function that is used to set up the polling mechanism. When insmod adds the driver modules, a stack trace is produced as the poll_timer routine is the same for both instances. The code has been reduced to
static int dat_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
struct timer_list * timer = &poll_timer;
int i;
/* Start polling routine */
log_normal(KERN_INFO "DEBUG ADD TIMER: Starting poll routine with %x\n", pdev);
init_timer(timer);
// random number added so that expires value is different for both instances of timer
get_random_bytes(&i, 1);
timer->expires=jiffies+HZ+i;
timer->data=(unsigned long) pdev;
timer->function = poll_routine;
log_verbose("DEBUG ADD TIMER: Timer expires %x\n", timer->expires);
log_verbose("DEBUG ADD TIMER: Timer data %x\n", timer->data);
log_verbose("DEBUG ADD TIMER: Timer function %x\n", timer->function);
// ***** THIS IS WHERE STACK TRACE OCCURS (WHEN CALLED FOR SECOND TIME)
add_timer(timer);
log_verbose("DEBUG ADD TIMER: Value of HZ is %d\n", HZ);
log_verbose("DEBUG ADD TIMER: End of probe\n");
return 0;
}
the stack trace produces
list_add corruption. prev->next should be next (ffffffff81f76228), but was (null). (prev=ffffffffa050a3c0).
and
list_add double add: new=ffffffffa050a3c0, prev=ffffffffa050a3c0, next=ffffffff81f76228.
Looking at the printk statements, it is clear that the add_timer is trying to add the same routine to the linked list. Is this correct?
DEBUG ADD TIMER: Timer expires fffd9cd3
DEBUG ADD TIMER: Timer data 6c0ac000
DEBUG ADD TIMER: Timer function **a0508150**
DEBUG ADD TIMER: Value of HZ is 1000
DEBUG ADD TIMER: End of probe
DEBUG ADD TIMER: Starting poll routine with 6c0ad000
DEBUG ADD TIMER: Timer expires fffd9c7d
DEBUG ADD TIMER: Timer data 6c0ad000
DEBUG ADD TIMER: Timer function **a0508150**
So my question(s) is(are), how should I configure the timer for multiple instantations of the same driver? (Assuming that is what is happening when multiple boards are inserted into the machine).
full stack trace
DEBUG ADD TIMER: Inserting driver into kernel.
DEBUG ADD TIMER: Starting poll routine with 6c0ac000
DEBUG ADD TIMER: Timer expires fffd9cd3
DEBUG ADD TIMER: Timer data 6c0ac000
DEBUG ADD TIMER: Timer function a0508150
DEBUG ADD TIMER: Value of HZ is 1000
DEBUG ADD TIMER: End of probe
DEBUG ADD TIMER: Starting poll routine with 6c0ad000
DEBUG ADD TIMER: Timer expires fffd9c7d
DEBUG ADD TIMER: Timer data 6c0ad000
DEBUG ADD TIMER: Timer function a0508150
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2201 at lib/list_debug.c:33 __list_add+0xa0/0xd0()
list_add corruption. prev->next should be next (ffffffff81f76228), but was (null). (prev=ffffffffa050a3c0).
Modules linked in: xdma_v7(POE+) xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller crc32c_intel eeepc_wmi ghash_clmulni_intel asus_wmi ftdi_sio iTCO_wdt snd_hda_codec sparse_keymap raid0 iTCO_vendor_support
snd_hda_core rfkill sb_edac ipmi_ssif video mxm_wmi edac_core snd_hwdep mei_me snd_seq snd_seq_device ipmi_msghandler snd_pcm mei acpi_pad tpm_infineon lpc_ich mfd_core snd_timer tpm_tis shpchp tpm snd soundcore i2c_i801 wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb serio_raw ptp pps_core dca i2c_algo_bit
CPU: 0 PID: 2201 Comm: insmod Tainted: P OE 4.1.8-100.fc21.x86_64 #1
Hardware name: ASUSTeK COMPUTER INC. Z10PE-D8 WS/Z10PE-D8 WS, BIOS 1001 03/17/2015
0000000000000000 00000000ec73155d ffff880457123928 ffffffff81792065
0000000000000000 ffff880457123980 ffff880457123968 ffffffff810a163a
0000000000000246 ffffffffa050a3c0 ffffffff81f76228 ffffffffa050a3c0
Call Trace:
[<ffffffff81792065>] dump_stack+0x45/0x57
[<ffffffff810a163a>] warn_slowpath_common+0x8a/0xc0
[<ffffffff810a16c5>] warn_slowpath_fmt+0x55/0x70
[<ffffffff810f8250>] ? vprintk_emit+0x3b0/0x560
[<ffffffff813c7c30>] __list_add+0xa0/0xd0
[<ffffffff81108412>] __internal_add_timer+0xb2/0x130
[<ffffffff811084bf>] internal_add_timer+0x2f/0xb0
[<ffffffff8110a1ca>] mod_timer+0x12a/0x210
[<ffffffff8110a2c8>] add_timer+0x18/0x30
[<ffffffffa050810f>] dat_probe+0xbf/0x100 [xdma_v7]
[<ffffffff813f6da5>] local_pci_probe+0x45/0xa0
[<ffffffff812a8da2>] ? sysfs_do_create_link_sd.isra.2+0x72/0xc0
[<ffffffff813f8109>] pci_device_probe+0xf9/0x150
[<ffffffff814e7e59>] driver_probe_device+0x209/0x4b0
[<ffffffff814e81db>] __driver_attach+0x9b/0xa0
[<ffffffff814e8140>] ? __device_attach+0x40/0x40
[<ffffffff814e5973>] bus_for_each_dev+0x73/0xc0
[<ffffffff814e772e>] driver_attach+0x1e/0x20
[<ffffffff814e72e0>] bus_add_driver+0x180/0x250
[<ffffffffa000a000>] ? 0xffffffffa000a000
[<ffffffff814e89d4>] driver_register+0x64/0xf0
[<ffffffff813f662c>] __pci_register_driver+0x4c/0x50
[<ffffffffa000a02c>] dat_init+0x2c/0x1000 [xdma_v7]
[<ffffffff81002148>] do_one_initcall+0xd8/0x210
[<ffffffff812094f9>] ? kmem_cache_alloc_trace+0x1a9/0x230
[<ffffffff817911bc>] ? do_init_module+0x28/0x1cc
[<ffffffff817911f5>] do_init_module+0x61/0x1cc
[<ffffffff811270bb>] load_module+0x20db/0x2550
[<ffffffff81122990>] ? store_uevent+0x70/0x70
[<ffffffff8122e860>] ? kernel_read+0x50/0x80
[<ffffffff81127766>] SyS_finit_module+0xa6/0xe0
[<ffffffff8179892e>] system_call_fastpath+0x12/0x71
---[ end trace 340e5d7ba2d89081 ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2201 at lib/list_debug.c:36 __list_add+0xcb/0xd0()
list_add double add: new=ffffffffa050a3c0, prev=ffffffffa050a3c0, next=ffffffff81f76228.
Modules linked in: xdma_v7(POE+) xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller crc32c_intel eeepc_wmi ghash_clmulni_intel asus_wmi ftdi_sio iTCO_wdt snd_hda_codec sparse_keymap raid0 iTCO_vendor_support
snd_hda_core rfkill sb_edac ipmi_ssif video mxm_wmi edac_core snd_hwdep mei_me snd_seq snd_seq_device ipmi_msghandler snd_pcm mei acpi_pad tpm_infineon lpc_ich mfd_core snd_timer tpm_tis shpchp tpm snd soundcore i2c_i801 wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ast drm_kms_helper ttm drm igb serio_raw ptp pps_core dca i2c_algo_bit
CPU: 0 PID: 2201 Comm: insmod Tainted: P W OE 4.1.8-100.fc21.x86_64 #1
Hardware name: ASUSTeK COMPUTER INC. Z10PE-D8 WS/Z10PE-D8 WS, BIOS 1001 03/17/2015
0000000000000000 00000000ec73155d ffff880457123928 ffffffff81792065
0000000000000000 ffff880457123980 ffff880457123968 ffffffff810a163a
0000000000000246 ffffffffa050a3c0 ffffffff81f76228 ffffffffa050a3c0
Call Trace:
[<ffffffff81792065>] dump_stack+0x45/0x57
[<ffffffff810a163a>] warn_slowpath_common+0x8a/0xc0
[<ffffffff810a16c5>] warn_slowpath_fmt+0x55/0x70
[<ffffffff810f8250>] ? vprintk_emit+0x3b0/0x560
[<ffffffff813c7c5b>] __list_add+0xcb/0xd0
[<ffffffff81108412>] __internal_add_timer+0xb2/0x130
[<ffffffff811084bf>] internal_add_timer+0x2f/0xb0
[<ffffffff8110a1ca>] mod_timer+0x12a/0x210
[<ffffffff8110a2c8>] add_timer+0x18/0x30
[<ffffffffa050810f>] dat_probe+0xbf/0x100 [xdma_v7]
[<ffffffff813f6da5>] local_pci_probe+0x45/0xa0
[<ffffffff812a8da2>] ? sysfs_do_create_link_sd.isra.2+0x72/0xc0
[<ffffffff813f8109>] pci_device_probe+0xf9/0x150
[<ffffffff814e7e59>] driver_probe_device+0x209/0x4b0
[<ffffffff814e81db>] __driver_attach+0x9b/0xa0
[<ffffffff814e8140>] ? __device_attach+0x40/0x40
[<ffffffff814e5973>] bus_for_each_dev+0x73/0xc0
[<ffffffff814e772e>] driver_attach+0x1e/0x20
[<ffffffff814e72e0>] bus_add_driver+0x180/0x250
[<ffffffffa000a000>] ? 0xffffffffa000a000
[<ffffffff814e89d4>] driver_register+0x64/0xf0
[<ffffffff813f662c>] __pci_register_driver+0x4c/0x50
[<ffffffffa000a02c>] dat_init+0x2c/0x1000 [xdma_v7]
[<ffffffff81002148>] do_one_initcall+0xd8/0x210
[<ffffffff812094f9>] ? kmem_cache_alloc_trace+0x1a9/0x230
[<ffffffff817911bc>] ? do_init_module+0x28/0x1cc
[<ffffffff817911f5>] do_init_module+0x61/0x1cc
[<ffffffff811270bb>] load_module+0x20db/0x2550
[<ffffffff81122990>] ? store_uevent+0x70/0x70
[<ffffffff8122e860>] ? kernel_read+0x50/0x80
[<ffffffff81127766>] SyS_finit_module+0xa6/0xe0
[<ffffffff8179892e>] system_call_fastpath+0x12/0x71
---[ end trace 340e5d7ba2d89082 ]---
DEBUG ADD TIMER: Value of HZ is 1000
DEBUG ADD TIMER: End of probe
The problem is that the second call to dat_probe is clobbering the poll_timer variable that was initialized and queued by the first call to dat_probe. You are clobbering the pointers in the kernel's timer list.
You need to get rid of the poll_timer variable and give each device its own dynamically allocated private data structure containing its own struct timer_list member. Call pci_set_drvdata to set the private data pointer for the PCI device. The other PCI driver functions can call pci_get_drvdata to retrieve that pointer.