Managing interrupt in Linux-based OS - linux-kernel

Context
I have a device tree in which one of the node is:
gpio#41210000 {
#gpio-cells = <0x2>;
#interrupt-cells = <0x2>;
compatible = "xlnx,xps-gpio-1.00.a,generic-uio,BTandSW";
gpio-controller;
interrupt-controller;
interrupt-parent = <0x4>;
//interrupt-parent =<&gic>;
interrupts = <0x0 0x1d 0x4>;
reg = <0x41210000 0x10000>;
xlnx,all-inputs = <0x1>;
xlnx,all-inputs-2 = <0x1>;
xlnx,all-outputs = <0x0>;
xlnx,all-outputs-2 = <0x0>;
xlnx,dout-default = <0x0>;
xlnx,dout-default-2 = <0x0>;
xlnx,gpio-width = <0x4>;
xlnx,gpio2-width = <0x2>;
xlnx,interrupt-present = <0x1>;
xlnx,is-dual = <0x1>;
xlnx,tri-default = <0xffffffff>;
xlnx,tri-default-2 = <0xffffffff>;
};
Once the kernel is running, performing the
cat /proc/interrupts
the results are:
root#linaro-developer:~# cat /proc/interrupts
CPU0 CPU1
16: 0 0 GIC-0 27 Edge gt
17: 0 0 GIC-0 43 Level ttc_clockevent
18: 1588 1064 GIC-0 29 Edge twd
21: 43 0 GIC-0 39 Level f8007100.adc
24: 0 0 GIC-0 35 Level f800c000.ocmc
25: 506 0 GIC-0 59 Level xuartps
26: 0 0 GIC-0 51 Level e000d000.spi
27: 0 0 GIC-0 54 Level eth5
28: 4444 0 GIC-0 56 Level mmc0
29: 0 0 GIC-0 45 Level f8003000.dmac
30: 0 0 GIC-0 46 Level f8003000.dmac
31: 0 0 GIC-0 47 Level f8003000.dmac
32: 0 0 GIC-0 48 Level f8003000.dmac
33: 0 0 GIC-0 49 Level f8003000.dmac
34: 0 0 GIC-0 72 Level f8003000.dmac
35: 0 0 GIC-0 73 Level f8003000.dmac
36: 0 0 GIC-0 74 Level f8003000.dmac
37: 0 0 GIC-0 75 Level f8003000.dmac
38: 0 0 GIC-0 40 Level f8007000.devcfg
45: 0 0 GIC-0 41 Edge f8005000.watchdog
IPI1: 0 0 Timer broadcast interrupts
IPI2: 1731 2206 Rescheduling interrupts
IPI3: 29 36 Function call interrupts
IPI4: 0 0 CPU stop interrupts
IPI5: 0 0 IRQ work interrupts
IPI6: 0 0 completion interrupts
Err: 0
Questions
Once the kernel is running, should it recognize, automatically, and update the data in the file /proc/interrupt?
However, I wrote a .probe function in this way:
static int SWITCH_of_probe(struct platform_device *pdev)
{
int ret=0;
struct irq_data data_tmp;
SWITCH_01_devices->temp_res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!(SWITCH_01_devices->temp_res)) {
dev_err(&pdev->dev, "TRY could not get IO memory\n");
return -ENXIO;
}
PDEBUG("resource : regs.start=%#x,regs.end=%#x\n",SWITCH_01_devices->temp_res->start,SWITCH_01_devices->temp_res->end);
//SWITCH_01_devices->irq_line = platform_get_irq(pdev, 0);
SWITCH_01_devices->irq_line = irq_of_parse_and_map(pdev->dev.of_node, 0);
if (SWITCH_01_devices->irq_line < 0) {
dev_err(&pdev->dev, "could not get IRQ\n");
printk(KERN_ALERT "could not get IRQ\n");
return SWITCH_01_devices->irq_line;
}
PDEBUG(" resource VIRTUAL IRQ NUMBER : irq=%#x\n",SWITCH_01_devices->irq_line);
ret = request_irq((SWITCH_01_devices->irq_line), SWITCH_01_interrupt, IRQF_SHARED , DRIVER_NAME, NULL);
if (ret) {
printk(KERN_ALERT "NEW SWITCH_01: can't get assigned irq, ret= %d\n", SWITCH_01_devices->irq_line, ret);
SWITCH_01_devices->irq_line = -1;
}
SWITCH_01_devices->mem_region_requested = request_mem_region((SWITCH_01_devices->temp_res->start),resource_size(SWITCH_01_devices->temp_res),"SWITCH_01");
if(SWITCH_01_devices->mem_region_requested == NULL){
printk(KERN_WARNING "[LEO] SWITCH: FaiSWITCH request_mem_region(res.start,resource_size(&(SWITCH_01_devices->res)),...);\n");
}
else
PDEBUG(" [+] request_mem_region\n");
return 0; /* Success */
}
When inserting the module in the kernel I have this output from dmesg:
[ 1249.777189] SWITCH_01: loading out-of-tree module taints kernel.
[ 1249.777787] [LEO] SWITCH_01: dinamic allocation of major number
[ 1249.777801] [LEO] cdev initialized
[ 1249.777988] [LEO] resource : regs.start=0x41210000,regs.end=0x4121ffff
[ 1249.777994] [LEO] resource : irq=0x2e
[ 1249.778000] NEW SWITCH_01: can't get assigned irq, ret= -22
[ 1249.782531] [LEO] [+] request_mem_region
What am I doing wrong? Why I cannot perform a correct request_irq?
Note: the interrupts = <0x0 0x1d 0x4> field of the device tree and the irq_number detected are different. As it was point out here, I changed the platform_get_irq(pdev, 0); with irq_of_parse_and_map(pdev->dev.of_node, 0); but the result is the same.

Once the kernel is running, should it recognize, automatically, and
update the data in the file /proc/interrupt?
Yes it will update, once the interrupt is registered.
[ 1249.777189] SWITCH_01: loading out-of-tree module taints kernel.
[ 1249.777787] [LEO] SWITCH_01: dinamic allocation of major number
[ 1249.777801] [LEO] cdev initialized
[ 1249.777988] [LEO] resource : regs.start=0x41210000,regs.end=0x4121ffff
[ 1249.777994] [LEO] resource : irq=0x2e
[ 1249.778000] NEW SWITCH_01: can't get assigned irq, ret= -22
[ 1249.782531] [LEO] [+] request_mem_region
What am I doing wrong? Why I cannot perform a correct request_irq?
A shared interrupt (IRQF_SHARED) must pass the dev_id (which you are passing NULL in request_irq()), if NULL, -EINVAL is returned back, so make sure you pass a non-null valid dev_id

Related

Out of memory: kill process

Does anyone have any ideas on why these logs are giving these errors - Out of memory: kill process ... nginx invoked oom-killer?
Lately, our cms has been going down and we have to manually restart the server in AWS and we're not sure what is happening to cause this behavior. log errors
Here are the exact lines of code that repeated 33 times while the server was down:
Out of memory: kill process 15654 (ruby) score 338490 or a child
Killed process 15654 (ruby) vsz:1353960kB, anon-rss:210704kB, file-rss:0kB
nginx invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
nginx cpuset=/ mems_allowed=0
Pid: 8729, comm: nginx Tainted: G W 2.6.35.14-97.44.amzn1.x86_64 #1
Call Trace:
[<ffffffff8108e638>] ? cpuset_print_task_mems_allowed+0x98/0xa0
[<ffffffff810bb157>] dump_header.clone.1+0x77/0x1a0
[<ffffffff81318d49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
[<ffffffff811ab3af>] ? ___ratelimit+0x9f/0x120
[<ffffffff810bb2f6>] oom_kill_process.clone.0+0x76/0x140
[<ffffffff810bb4d8>] __out_of_memory+0x118/0x190
[<ffffffff810bb5d2>] out_of_memory+0x82/0x1c0
[<ffffffff810beb89>] __alloc_pages_nodemask+0x689/0x6a0
[<ffffffff810e7864>] alloc_pages_current+0x94/0xf0
[<ffffffff810b87ef>] __page_cache_alloc+0x7f/0x90
[<ffffffff810c15e0>] __do_page_cache_readahead+0xc0/0x200
[<ffffffff810c173c>] ra_submit+0x1c/0x20
[<ffffffff810b9f63>] filemap_fault+0x3e3/0x430
[<ffffffff810d023f>] __do_fault+0x4f/0x4b0
[<ffffffff810d2774>] handle_mm_fault+0x1b4/0xb40
[<ffffffff81007682>] ? check_events+0x12/0x20
[<ffffffff81006f1d>] ? xen_force_evtchn_callback+0xd/0x10
[<ffffffff81007682>] ? check_events+0x12/0x20
[<ffffffff8131c752>] do_page_fault+0x112/0x310
[<ffffffff813194b5>] page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 35
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 30
CPU 3: hi: 186, btch: 31 usd: 0
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 202
CPU 1: hi: 186, btch: 31 usd: 30
CPU 2: hi: 186, btch: 31 usd: 59
CPU 3: hi: 186, btch: 31 usd: 140
active_anon:3438873 inactive_anon:284496 isolated_anon:0
active_file:0 inactive_file:62 isolated_file:64
unevictable:0 dirty:0 writeback:0 unstable:0
free:16763 slab_reclaimable:1340 slab_unreclaimable:2956
mapped:29 shmem:12 pagetables:11130 bounce:0
Node 0 DMA free:7892kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15772kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 4024 14836 14836
Node 0 DMA32 free:47464kB min:4224kB low:5280kB high:6336kB active_anon:3848564kB inactive_anon:147080kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:0kB mapped:60kB shmem:0kB slab_reclaimable:28kB slab_unreclaimable:268kB kernel_stack:48kB pagetables:8604kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:82 all_unreclaimable? yes
lowmem_reserve[]: 0 0 10811 10811
Node 0 Normal free:11184kB min:11352kB low:14188kB high:17028kB active_anon:9906928kB inactive_anon:990904kB active_file:0kB inactive_file:1116kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:11071436kB mlocked:0kB dirty:0kB writeback:0kB mapped:56kB shmem:48kB slab_reclaimable:5332kB slab_unreclaimable:11556kB kernel_stack:2400kB pagetables:35916kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 1*8kB 2*16kB 1*32kB 0*64kB 3*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7892kB
Node 0 DMA32: 62*4kB 104*8kB 53*16kB 29*32kB 27*64kB 7*128kB 2*256kB 3*512kB 1*1024kB 3*2048kB 8*4096kB = 47464kB
Node 0 Normal: 963*4kB 0*8kB 5*16kB 4*32kB 7*64kB 5*128kB 6*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 11292kB
318 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
3854801 pages RAM
86406 pages reserved
14574 pages shared
3738264 pages non-shared
It's happening because your server is running out of memory. To solve this problem you have 2 options.
Update your Server's Ram or use SWAP (But upgrading Physical ram is recommended instead of using SWAP)
Limit Nginx ram use.
To limit nginx ram use open the /etc/nginx/nginx.conf file and add client_max_body_size <your_value_here> under the http configuration block. For example:
worker_processes 1;
http {
client_max_body_size 10M;
...
}
Note: use M for MB, G for GB and T for TB

RaspberryPi 3b+ with multiple can buses (MPC2515))

I'm trying to connect 6 mcp2515 over spi0. I have adapted an SPI overlay to add the neccesary chip select lines. My new SPI overlay looks like this:
{
compatible = "brcm,bcm2835", "brcm,bcm2836", "brcm,bcm2708", "brcm,bcm2709";
fragment#0 {
target = <&spi0>;
frag0: __overlay__ {
#address-cells = <1>;
#size-cells = <0>;
pinctrl-0 = <&spi0_pins &spi0_cs_pins>;
status = "okay";
cs-gpios = <&gpio 8 1>, <&gpio 7 1>, <&gpio 22 1>, <&gpio 23 1>, <&gpio 24 1>, <&gpio 25 1>;
spidev#0{
compatible = "spidev";
reg = <0>; /* CE0 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
spidev#1{
compatible = "spidev";
reg = <1>; /* CE1 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
spidev#2{
compatible = "spidev";
reg = <2>; /* CE2 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
spidev#3{
compatible = "spidev";
reg = <3>; /* CE3 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
spidev#4{
compatible = "spidev";
reg = <4>; /* CE4 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
spidev#5{
compatible = "spidev";
reg = <5>; /* CE5 */
#address-cells = <1>;
#size-cells = <0>;
spi-max-frequency = <500000>;
};
};
};
fragment#1 {
target = <&gpio>;
__overlay__ {
spi0_cs_pins: spi0_cs_pins {
brcm,pins = <7 8 22 23 24 25>;
brcm,function = <1>; /* out */
};
};
};
With this SPI overlay i have the 6 spi's in /sys/bus/spi/devices/
spi0.0 spi0.1 spi0.2 spi0.3 spi0.4 spi0.5
I have also made new overlays for the mcp2515 (can0 to can5) in order to bind them with the new chip select lines of spi0.
My /boot/config.txt looks like this:
dtoverlay=spi-gpio-cs-new
dtoverlay=mcp2515-can0,oscillator=8000000,interrupt=5
dtoverlay=mcp2515-can4,oscillator=8000000,interrupt=26
dtoverlay=mcp2515-can5,oscillator=8000000,interrupt=27
dmesg | grep mcp
[ 7.870207] mcp251x spi0.5 can0: MCP2515 successfully initialized.
[ 7.892886] mcp251x spi0.4 can1: MCP2515 successfully initialized.
[ 7.908725] mcp251x spi0.0 can2: MCP2515 successfully initialized.
ifconfig
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 10 (UNSPEC)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
can1: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 10 (UNSPEC)
RX packets 36 bytes 180 (180.0 B)
RX errors 0 dropped 36 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
can2: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 10 (UNSPEC)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I only have 3 mcp2515 boards at my disposal for the moment. I have modified them regarding voltage supply to the CAN transceiver(5V) and can controller(3V) in order no to damage the Raspberry Pi GPIO's, the boards have been individually tested and I was able to send/receive CAN frames with them. They are connected to the Raspberry like this
Out of these 3 interfaces only can1 (spi0.4) is working! Using candump I can see can frames on the network.
My question is why can0 and can2 are muted when trying to send or receive CAN messages (candump and cansend)?
Kernel interrupt table
CPU0 CPU1 CPU2 CPU3
17: 217 0 0 0 ARMCTRL-level 1 Edge 3f00b880.mailbox
18: 47 0 0 0 ARMCTRL-level 2 Edge VCHIQ doorbell
40: 0 0 0 0 ARMCTRL-level 48 Edge bcm2708_fb DMA
42: 352 0 0 0 ARMCTRL-level 50 Edge DMA IRQ
44: 3062 0 0 0 ARMCTRL-level 52 Edge DMA IRQ
45: 0 0 0 0 ARMCTRL-level 53 Edge DMA IRQ
48: 0 0 0 0 ARMCTRL-level 56 Edge DMA IRQ
56: 14104 0 0 0 ARMCTRL-level 64 Edge dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1
78: 0 0 0 0 ARMCTRL-level 86 Edge 3f204000.spi
80: 158 0 0 0 ARMCTRL-level 88 Edge mmc0
81: 7450 0 0 0 ARMCTRL-level 89 Edge uart-pl011
86: 4207 0 0 0 ARMCTRL-level 94 Edge mmc1
161: 0 0 0 0 bcm2836-timer 0 Edge arch_timer
162: 1813 1900 2264 1528 bcm2836-timer 1 Edge arch_timer
165: 0 0 0 0 bcm2836-pmu 9 Edge arm-pmu
166: 0 0 0 0 lan78xx-irqs 17 Edge usb-001:004:01
167: 0 0 0 0 pinctrl-bcm2835 26 Edge spi0.5
168: 6 0 0 0 pinctrl-bcm2835 27 Edge spi0.4
169: 0 0 0 0 pinctrl-bcm2835 5 Level spi0.0
FIQ: usb_fiq
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 1469 2966 3711 4460 Rescheduling interrupts
IPI3: 203 798 542 445 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 55 85 41 24 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0
I can see from this table that the SPI's are assigned with interrupts but only spi0.4 is actually activated. How can activate the other 2 interrupts for spi0.0 and spi0.5?
It's working!!!
As i mentioned in my first post only one board was working (can1 spi0.4), after i rechecked the other two non working boards i discovered that one had a hardware damage causing the other board not to work as well. As a final conclusion my spi and mcp overlays are fully functional!
Regards
Antmar
[ 7.846788] mcp251x spi0.0 can0: MCP2515 successfully initialized.
[ 7.888039] mcp251x spi0.1 can1: MCP2515 successfully initialized.
[ 7.924747] mcp251x spi0.2 can2: MCP2515 successfully initialized.
[ 7.936608] mcp251x spi0.3 can3: MCP2515 successfully initialized.
can0 241 [5] 67 A4 31 F0 C7
can1 2A0 [2] 02 93
can2 241 [5] 67 A4 31 F0 CB
can3 240 [2] 02 6A

Which condition is technically more efficient, i >= 0 or i > -1?

This kind of usage is common while writing loops.
I was wondering if i >=0 will need more CPU cycles as it has two conditions greater than OR equal to when compared to i > -1. Is one known to be better than the other, and if so, why?
This is not correct. The JIT will implement both tests as a single machine language instruction.
And the number of CPU clock cycles is not determined by the number of comparisons to zero or -1, because the CPU should do one comparison and set flags to indicate whether the result of the comparison is <, > or =.
It's possible that one of those instructions will be more efficient on certain processors, but this kind of micro-optimization is almost always not worth doing. (It's also possible that the JIT - or javac - will actually generate the same instructions for both tests.)
On the contrary, comparsions (including non-strict) with zero takes one CPU instruction less. x86 architecture supports conditional jumps after any arithmetic or loading operation. It is reflected in Java bytecode instruction set, there is a group of instructions to compare the value on the top of the stack and jump: ifeq/ifgt/ifge/iflt/ifle/ifne. (See the full list). Comparsion with -1 requires additional iconst_m1 operation (loading -1 constant onto the stack).
The are two loops with different comparsions:
#GenerateMicroBenchmark
public int loopZeroCond() {
int s = 0;
for (int i = 1000; i >= 0; i--) {
s += i;
}
return s;
}
#GenerateMicroBenchmark
public int loopM1Cond() {
int s = 0;
for (int i = 1000; i > -1; i--) {
s += i;
}
return s;
}
The second version is one byte longer:
public int loopZeroCond();
Code:
0: iconst_0
1: istore_1
2: sipush 1000
5: istore_2
6: iload_2
7: iflt 20 //
10: iload_1
11: iload_2
12: iadd
13: istore_1
14: iinc 2, -1
17: goto 6
20: iload_1
21: ireturn
public int loopM1Cond();
Code:
0: iconst_0
1: istore_1
2: sipush 1000
5: istore_2
6: iload_2
7: iconst_m1 //
8: if_icmple 21 //
11: iload_1
12: iload_2
13: iadd
14: istore_1
15: iinc 2, -1
18: goto 6
21: iload_1
22: ireturn
It is slightly more performant on my machine (to my surprise. I expected JIT to compile these loops into identical assembly.)
Benchmark Mode Thr Mean Mean error Units
t.LoopCond.loopM1Cond avgt 1 0,319 0,004 usec/op
t.LoopCond.loopZeroCond avgt 1 0,302 0,004 usec/op
Сonclusion
Compare with zero whenever sensible.

scala implicit performance

This comes up regularly. Functions coded up using generics are signifficnatly slower in scala. See example below. Type specific version performs about a 1/3 faster than the generic version. This is doubly surprising given that the generic component is outside of the expensive loop. Is there a known explanation for this?
def xxxx_flttn[T](v: Array[Array[T]])(implicit m: Manifest[T]): Array[T] = {
val I = v.length
if (I <= 0) Array.ofDim[T](0)
else {
val J = v(0).length
for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
val flt = Array.ofDim[T](I * J)
for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
flt
}
}
def flttn(v: Array[Array[Double]]): Array[Double] = {
val I = v.length
if (I <= 0) Array.ofDim[Double](0)
else {
val J = v(0).length
for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
val flt = Array.ofDim[Double](I * J)
for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
flt
}
}
You can't really tell what you're measuring here--not very well, anyway--because the for loop isn't as fast as a pure while loop, and the inner operation is quite inexpensive. If we rewrite the code with while loops--the key double-iteration being
var i = 0
while (i<I) {
var j = 0
while (j<J) {
flt(i * J + j) = v(i)(j)
j += 1
}
i += 1
}
flt
then we see that the bytecode for the generic case is actually dramatically different. Non-generic:
133: checkcast #174; //class "[D"
136: astore 6
138: iconst_0
139: istore 5
141: iload 5
143: iload_2
144: if_icmpge 191
147: iconst_0
148: istore 4
150: iload 4
152: iload_3
153: if_icmpge 182
// The stuff above implements the loop; now we do the real work
156: aload 6
158: iload 5
160: iload_3
161: imul
162: iload 4
164: iadd
165: aload_1
166: iload 5
168: aaload // v(i)
169: iload 4
171: daload // v(i)(j)
172: dastore // flt(.) = _
173: iload 4
175: iconst_1
176: iadd
177: istore 4
// Okay, done with the inner work, time to jump around
179: goto 150
182: iload 5
184: iconst_1
185: iadd
186: istore 5
188: goto 141
It's just a bunch of jumps and low-level operations (daload and dastore being the key ones that load and store a double from an array). If we look at the key inner part of the generic bytecode, it instead looks like
160: getstatic #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
163: aload 7
165: iload 6
167: iload 4
169: imul
170: iload 5
172: iadd
173: getstatic #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
176: aload_1
177: iload 6
179: aaload
180: iload 5
182: invokevirtual #107; //Method scala/runtime/ScalaRunTime$.array_apply:(Ljava/lang/Object;I)Ljava/lang/Object;
185: invokevirtual #111; //Method scala/runtime/ScalaRunTime$.array_update:(Ljava/lang/Object;ILjava/lang/Object;)V
188: iload 5
190: iconst_1
191: iadd
192: istore 5
which, as you can see, has to call methods to do the array apply and update. The bytecode for that is a huge mess of stuff like
2: aload_3
3: instanceof #98; //class "[Ljava/lang/Object;"
6: ifeq 18
9: aload_3
10: checkcast #98; //class "[Ljava/lang/Object;"
13: iload_2
14: aaload
15: goto 183
18: aload_3
19: instanceof #100; //class "[I"
22: ifeq 37
25: aload_3
26: checkcast #100; //class "[I"
29: iload_2
30: iaload
31: invokestatic #106; //Method scala/runtime/BoxesRunTime.boxToInteger:
34: goto 183
37: aload_3
38: instanceof #108; //class "[D"
41: ifeq 56
44: aload_3
45: checkcast #108; //class "[D"
48: iload_2
49: daload
50: invokestatic #112; //Method scala/runtime/BoxesRunTime.boxToDouble:(
53: goto 183
which basically has to test each type of array and box it if it's the type you're looking for. Double is pretty near the front (3rd of 10), but it's still a pretty major overhead, even if the JVM can recognize that the code ends up being box/unbox and therefore doesn't actually need to allocate memory. (I'm not sure it can do that, but even if it could it wouldn't solve the problem.)
So, what to do? You can try [#specialized T], which will expand your code tenfold for you, as if you wrote each primitive array operation by yourself. Specialization is buggy in 2.9 (should be less so in 2.10), though, so it may not work the way you hope. If speed is of the essence--well, first, write while loops instead of for loops (or at least compile with -optimise which helps for loops out by a factor of two or so!), and then consider either specialization or writing the code by hand for the types you require.
This is due to boxing, when you apply the generic to a primitive type and use containing arrays (or the type appearing plain in method signatures or as member).
Example
In the following trait, after compilation, the process method will take an erased Array[Any].
trait Foo[A]{
def process(as: Array[A]): Int
}
If you choose A to be a value/primitive type, like Double it has to be boxed. When writing the trait in a non-generic way (e.g. with A=Double), process is compiled to take an Array[Double], which is a distinct array type on the JVM. This is more efficient, since in order to store a Double inside the Array[Any], the Double has to be wrapped (boxed) into an object, a reference to which gets stored inside the array. The special Array[Double] can store the Double directly in memory as a 64-Bit value.
The #specialized-Annotation
If you feel adventerous, you can try the #specialized keyword (it's pretty buggy and crashes the compiler often). This makes scalac compile special versions of a class for all or selected primitive types. This only makes sense, if the type parameter appears plain in type signatures (get(a: A), but not get(as: Seq[A])) or as a type paramter to Array. I think you'll receive a warning if speicialization is pointless.

Struggling - yet another memory corruption problem, bad alloc (C++, VS 2008)

I've read a lot of posts on memory corruption and it seems like it can be a considerably difficult problem to solve. When I run my code on my linux machine it executes fine and valgrind doesn't report any leaks or errors. However when I run the code on my lab's windows machine with VS2008, I get a bad alloc error, stopping with _RAISE(nomem). This seems strange to me because I would have expected valgrind to catch it.
void *__CRTDECL operator new(size_t size) _THROW1(_STD bad_alloc)
{ // try to allocate size bytes
void *p;
while ((p = malloc(size)) == 0)
if (_callnewh(size) == 0)
{ // report no memory
static const std::bad_alloc nomem;
_RAISE(nomem);
}
return (p);
}
From what I've read it seems like this problem often comes from writing past the end of an allocated block of memory or after it has been freed but I haven't had any luck identifying where that might be happening. Here's my call stack from after trying to run a Release build.
KernelBase.dll!_RaiseException#16() + 0x58 bytes
msvcr90.dll!_CxxThrowException(void * pExceptionObject=0x0040f6d4, const _s__ThrowInfo * pThrowInfo=0x6f63d604) Line 161 C++
msvcr90.dll!operator new(unsigned int size=4) Line 63 + 0x17 bytes C++
tempMem.exe!std::vector<unsigned char,std::allocator<unsigned char> >::vector<unsigned char,std::allocator<unsigned char> >(const std::vector<unsigned char,std::allocator<unsigned char> > & _Right=[...]()) Line 500 + 0x31 bytes C++
tempMem.exe!DendriteSegment::DendriteSegment(const DendriteSegment & __that={...}) + 0x4a bytes C++
tempMem.exe!std::list<DendriteSegment,std::allocator<DendriteSegment> >::_Buynode(std::_List_nod<DendriteSegment,std::allocator<DendriteSegment> >::_Node * _Next=0x005d84d0, std::_List_nod<DendriteSegment,std::allocator<DendriteSegment> >::_Node * _Prev=0x0093af50, const DendriteSegment & _Val={...}) Line 1208 C++
tempMem.exe!TP::adaptSegments(Cell * & cellToAdapt=0x005d8450, std::list<segUpdate,std::allocator<segUpdate> > & segUpdateList=[1]({segToUpdate=0x00000000 synapseChanges=[0]() newSynapsesToAdd=[2](0x005bc8f8 {DSlist=[19]({sequenceSegment=true synapse=[3](0x005d79a0 {DSlist={...} connected=0x005d7af8 currentState=0x005d79c0 ...},0x005d7ef8 {DSlist={...} connected=0x005d8050 currentState=0x005d7f18 ...},0x005d8450 {DSlist={...} connected=0x005d85a8 currentState=0x005d8470 ...}) permanence=[3](80 'P',80 'P',80 'P') ...},. ), bool posReinforce=true) Line 701 + 0x1b bytes C++
tempMem.exe!Level::TPlearning() Line 236 + 0x26 bytes C++
tempMem.exe!main(int argc=, char * * argv=) Line 96 C++
msvcr90.dll!_encode_pointer(void * ptr=0x6f5d3607) Line 114 + 0x5 bytes C
0069ee20()
msvcr90.dll!_initterm(void (void)* * pfbegin=0x00000001, void (void)* * pfend=0x000a1ef8) Line 903 C
tempMem.exe!__tmainCRTStartup() Line 582 + 0x17 bytes C
kernel32.dll!#BaseThreadInitThunk#12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart#8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart#8() + 0x1b bytes
When running a debugging session I get a different error (it doesn't seem to happen all the time though...)
return HeapAlloc(_crtheap, 0, size ? size : 1);
from here
#ifdef _WIN64
return HeapAlloc(_crtheap, 0, size ? size : 1);
#else /* _WIN64 */
if (__active_heap == __SYSTEM_HEAP) {
return HeapAlloc(_crtheap, 0, size ? size : 1);
} else
if ( __active_heap == __V6_HEAP ) {
if (pvReturn = V6_HeapAlloc(size)) {
return pvReturn;
}
}
In this case the call stack is
ntdll.dll!_RtlpBreakPointHeap#4() + 0x23 bytes
ntdll.dll!#RtlpAllocateHeap#24() + 0x57dbc bytes
ntdll.dll!_RtlAllocateHeap#12() + 0x502a bytes
ntdll.dll!_RtlDebugAllocateHeap#12() + 0xb5 bytes
ntdll.dll!#RtlpAllocateHeap#24() + 0x57c17 bytes
ntdll.dll!_RtlAllocateHeap#12() + 0x502a bytes
msvcr90d.dll!_heap_alloc_base(unsigned int size=38) Line 105 + 0x28 bytes C
msvcr90d.dll!_heap_alloc_dbg_impl(unsigned int nSize=2, int nBlockUse=1, const char * szFileName=0x00000000, int nLine=0, int * errno_tmp=0x0052f284) Line 427 + 0x9 bytes C++
msvcr90d.dll!_nh_malloc_dbg_impl(unsigned int nSize=2, int nhFlag=0, int nBlockUse=1, const char * szFileName=0x00000000, int nLine=0, int * errno_tmp=0x0052f284) Line 239 + 0x19 bytes C++
msvcr90d.dll!_nh_malloc_dbg(unsigned int nSize=2, int nhFlag=0, int nBlockUse=1, const char * szFileName=0x00000000, int nLine=0) Line 296 + 0x1d bytes C++
msvcr90d.dll!malloc(unsigned int nSize=2) Line 56 + 0x15 bytes C++
msvcr90d.dll!operator new(unsigned int size=2) Line 59 + 0x9 bytes C++
tempMem.exe!std::_Allocate(unsigned int _Count=2) Line 43 + 0xc bytes C++
tempMem.exe!std::allocator<uint8_t>::allocate(unsigned int _Count=2) Line 145 + 0x13 bytes C++
tempMem.exe!std::::_Buy(unsigned int _Capacity=2) Line 1115 + 0x14 bytes C++
tempMem.exe!std::::vector(const std::vector<uint8_t, std::allocator<uint8_t> > & _Right=[2](80 'P',80 'P')) Line 501 + 0x2b bytes C++
tempMem.exe!DendriteSegment::DendriteSegment() + 0x8b bytes C++
tempMem.exe!std::_Construct(DendriteSegment * _Ptr=0x007e7490, const DendriteSegment & _Val={...}) Line 52 + 0x97 bytes C++
tempMem.exe!std::allocator<DendriteSegment>::construct(DendriteSegment * _Ptr=0x007e7490, const DendriteSegment & _Val={...}) Line 155 + 0x15 bytes C++
tempMem.exe!std::::_Buynode(std::_List_nod<DendriteSegment, std::allocator<DendriteSegment> >::_Node * _Next=0x00637f60, std::_List_nod<DendriteSegment, std::allocator<DendriteSegment> >::_Node * _Prev=0x00bfcb50, const DendriteSegment & _Val={...}) Line 1199 + 0x47 bytes C++
tempMem.exe!std::::_Insert(std::list<DendriteSegment, std::allocator<DendriteSegment> >::_Const_iterator<1> _Where={sequenceSegment=true synapse=[0]() permanence=[0]() ...}, const DendriteSegment & _Val={...}) Line 718 + 0x65 bytes C++
tempMem.exe!std::::push_back(const DendriteSegment & _Val={...}) Line 670 + 0x6f bytes C++
tempMem.exe!TP::adaptSegments(Cell * & cellToAdapt=0x00637ee8, std::list<segUpdate, std::allocator<segUpdate> > & segUpdateList=[1](...), bool posReinforce=true) Line 701 + 0x16 bytes C++
tempMem.exe!TP::phase3() Line 949 + 0x3e bytes C++
tempMem.exe!Col::TPphase3() Line 398 + 0xd bytes C++
tempMem.exe!Level::TPlearning() Line 236 + 0x4a bytes C++
tempMem.exe!Network::runTPlearning() Line 93 + 0xd bytes C++
tempMem.exe!main(int argc=1, char * * argv=0x006f62a0) Line 93 + 0xd bytes C++
tempMem.exe!__tmainCRTStartup() Line 582 + 0x19 bytes C
tempMem.exe!mainCRTStartup() Line 399 C
kernel32.dll!#BaseThreadInitThunk#12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart#8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart#8() + 0x1b bytes
This is my first time debugging this type of problem and I hope I'm just overlooking/misunderstanding something obvious...
Here's the code for the adaptSegments function that corresponds to this line from the call stack (Release)
tempMem.exe!TP::adaptSegments(Cell * & cellToAdapt=0x005d8450, std::list<segUpdate,std::allocator<segUpdate> > & segUpdateList=[1]({segToUpdate=0x00000000 synapseChanges=[0]() newSynapsesToAdd=[2](0x005bc8f8 {DSlist=[19]({sequenceSegment=true synapse=[3](0x005d79a0 {DSlist={...} connected=0x005d7af8 currentState=0x005d79c0 ...},0x005d7ef8 {DSlist={...} connected=0x005d8050 currentState=0x005d7f18 ...},0x005d8450 {DSlist={...} connected=0x005d85a8 currentState=0x005d8470 ...}) permanence=[3](80 'P',80 'P',80 'P') ...},. ), bool posReinforce=true) Line 701 + 0x1b bytes
bool TP::adaptSegments(Cell *& cellToAdapt,
std::list<segUpdate> & segUpdateList, bool posReinforce)
{
std::list<segUpdate>::iterator curSegUpdate;
std::list<activeSynapsePair>::iterator curSyn;
std::list<Cell *>::iterator synToAdd;
int size = 0;
//for each segUpdate element in the cell's segUpdateList
for (curSegUpdate = segUpdateList.begin();
curSegUpdate != segUpdateList.end(); ++curSegUpdate)
{
//if the segment already exists
if (curSegUpdate->segToUpdate != NULL)
{
//if sequence segment flag is true, set it on DS
if(curSegUpdate->sequenceSegment == true)
{curSegUpdate->segToUpdate->sequenceSegment = true;}
if (posReinforce == true)
{
//for each synapses permanence pair in segUpdate
for (curSyn = (curSegUpdate->
synapseChanges.begin());
curSyn !=(curSegUpdate->synapseChanges.end());
++curSyn)
{
//decrement inactive synapses
if (curSyn->second == false)
{
if (*(curSyn->first)-
permanenceDec < 0)
{*(curSyn->first) = 0;}
else
{*(curSyn->first)-=
permanenceDec;}
}
//increment active synapses
else if (curSyn->second == true)
{
if (*(curSyn->first)+
permanenceInc > 100)
{*(curSyn->first) =100;}
else
{*(curSyn->first)+=
permanenceInc;}
}
}
}
else if (posReinforce == false)
{
//for each synapses permanence pair in segUpdate
for (curSyn = (curSegUpdate->
synapseChanges.begin());
curSyn !=(curSegUpdate->synapseChanges.end());
++curSyn)
{
//decrement active synapses
if (curSyn->second == true)
{
if (*(curSyn->first)-
permanenceDec < 0)
{*(curSyn->first) = 0;}
else
{*(curSyn->first)-=
permanenceDec;}
}
}
}
//if adding synapses to an existing segment
if (curSegUpdate->newSynapsesToAdd.empty()==false)
{
if (curSegUpdate->segToUpdate->synapse.size()
<MAX_NUM_SYN)
{
//for each synapses in newSynapses
for (synToAdd =
curSegUpdate->newSynapsesToAdd.begin();
synToAdd != curSegUpdate->
newSynapsesToAdd.end(); ++synToAdd)
{
//add new synapse to list
curSegUpdate->segToUpdate->
synapse.push_back(*synToAdd);
//and permenance with initialPerm
curSegUpdate->segToUpdate->
permanence.push_back(initialPerm);
}
}//if less than MAX_NUM_SYN
}
}//end if segment already exists
//if segment doesn't exist, create a new segment & add synapses
else if (curSegUpdate->segToUpdate == NULL)
{
size = curSegUpdate->newSynapsesToAdd.size();
if (size != 0)
{
DendriteSegment myNewSeg; //create a new DS
//set sequenceSegment flag if it is true
if (curSegUpdate->sequenceSegment == true)
{myNewSeg.sequenceSegment = true;}
std::copy(curSegUpdate->newSynapsesToAdd.begin(),
curSegUpdate->newSynapsesToAdd.end(),
std::back_inserter(myNewSeg.synapse));
myNewSeg.permanence.resize(size, initialPerm);
//then add it to the cells list of DS
cellToAdapt->DSlist.push_back(myNewSeg);
}//if size not 0
}
}
return true;}
My next step is to try Application Verifier. I tried Intel Inspector XE 2011 but it doesn't seem to detect any relevant memory problems.
Update: Using gflags I found that the cause of my woes was due to having pointers to elements in a std::vector. myVector.push_back(newElem) was used to add elements to the vectors causing pointers to elements in the vector to become bad. I replaced the vectors with std::list which does not have the same problem (see here)
Try gflags in the Microsoft Debugging Tools for Windows (http://www.google.ca/search?sourceid=chrome&ie=UTF-8&q=debugging+tools+for+windows) toolkit. It allows you to run with a debug heap that catches problems as they occur rather than random problems that occur long after the site where the crash occurred.

Resources