This comes up regularly. Functions written using generics are significantly slower in Scala; see the example below. The type-specific version runs about a third faster than the generic version. This is doubly surprising given that the generic component is outside of the expensive loop. Is there a known explanation for this?
def xxxx_flttn[T](v: Array[Array[T]])(implicit m: Manifest[T]): Array[T] = {
  val I = v.length
  if (I <= 0) Array.ofDim[T](0)
  else {
    val J = v(0).length
    for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symmetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
    val flt = Array.ofDim[T](I * J)
    for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
    flt
  }
}
def flttn(v: Array[Array[Double]]): Array[Double] = {
  val I = v.length
  if (I <= 0) Array.ofDim[Double](0)
  else {
    val J = v(0).length
    for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symmetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
    val flt = Array.ofDim[Double](I * J)
    for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
    flt
  }
}
You can't really tell what you're measuring here--not very well, anyway--because the for loop isn't as fast as a pure while loop, and the inner operation is quite inexpensive. If we rewrite the code with while loops--the key double-iteration being
var i = 0
while (i < I) {
  var j = 0
  while (j < J) {
    flt(i * J + j) = v(i)(j)
    j += 1
  }
  i += 1
}
flt
then we see that the bytecode for the generic case is actually dramatically different. Non-generic:
133: checkcast #174; //class "[D"
136: astore 6
138: iconst_0
139: istore 5
141: iload 5
143: iload_2
144: if_icmpge 191
147: iconst_0
148: istore 4
150: iload 4
152: iload_3
153: if_icmpge 182
// The stuff above implements the loop; now we do the real work
156: aload 6
158: iload 5
160: iload_3
161: imul
162: iload 4
164: iadd
165: aload_1
166: iload 5
168: aaload // v(i)
169: iload 4
171: daload // v(i)(j)
172: dastore // flt(.) = _
173: iload 4
175: iconst_1
176: iadd
177: istore 4
// Okay, done with the inner work, time to jump around
179: goto 150
182: iload 5
184: iconst_1
185: iadd
186: istore 5
188: goto 141
It's just a bunch of jumps and low-level operations (daload and dastore being the key ones that load and store a double from an array). If we look at the key inner part of the generic bytecode, it instead looks like
160: getstatic #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
163: aload 7
165: iload 6
167: iload 4
169: imul
170: iload 5
172: iadd
173: getstatic #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
176: aload_1
177: iload 6
179: aaload
180: iload 5
182: invokevirtual #107; //Method scala/runtime/ScalaRunTime$.array_apply:(Ljava/lang/Object;I)Ljava/lang/Object;
185: invokevirtual #111; //Method scala/runtime/ScalaRunTime$.array_update:(Ljava/lang/Object;ILjava/lang/Object;)V
188: iload 5
190: iconst_1
191: iadd
192: istore 5
which, as you can see, has to call methods to do the array apply and update. The bytecode for that is a huge mess of stuff like
2: aload_3
3: instanceof #98; //class "[Ljava/lang/Object;"
6: ifeq 18
9: aload_3
10: checkcast #98; //class "[Ljava/lang/Object;"
13: iload_2
14: aaload
15: goto 183
18: aload_3
19: instanceof #100; //class "[I"
22: ifeq 37
25: aload_3
26: checkcast #100; //class "[I"
29: iload_2
30: iaload
31: invokestatic #106; //Method scala/runtime/BoxesRunTime.boxToInteger:
34: goto 183
37: aload_3
38: instanceof #108; //class "[D"
41: ifeq 56
44: aload_3
45: checkcast #108; //class "[D"
48: iload_2
49: daload
50: invokestatic #112; //Method scala/runtime/BoxesRunTime.boxToDouble:(
53: goto 183
which basically has to test each type of array and box it if it's the type you're looking for. Double is pretty near the front (3rd of 10), but it's still a pretty major overhead, even if the JVM can recognize that the code ends up being box/unbox and therefore doesn't actually need to allocate memory. (I'm not sure it can do that, but even if it could it wouldn't solve the problem.)
So, what to do? You can try [@specialized T], which will expand your code tenfold for you, as if you had written each primitive array operation yourself. Specialization is buggy in 2.9 (it should be less so in 2.10), though, so it may not work the way you hope. If speed is of the essence, first write while loops instead of for loops (or at least compile with -optimise, which helps for loops out by a factor of two or so), and then consider either specialization or writing the code by hand for the types you require.
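For illustration, a minimal sketch of a specialized variant of the flatten method might look like the following (flttnSpec is a made-up name, and whether specialization actually kicks in depends on the compiler version, as noted above):

def flttnSpec[@specialized(Double) T](v: Array[Array[T]])(implicit m: Manifest[T]): Array[T] = {
  val I = v.length
  val J = if (I > 0) v(0).length else 0
  val flt = Array.ofDim[T](I * J)
  var i = 0
  while (i < I) {
    var j = 0
    while (j < J) {
      flt(i * J + j) = v(i)(j)  // in the Double copy this becomes plain daload/dastore
      j += 1
    }
    i += 1
  }
  flt
}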
This is due to boxing: it happens when you instantiate the generic with a primitive type and the type parameter appears inside arrays (or plain in method signatures, or as a member).
Example
In the following trait, after compilation, the process method will take an erased Array[Any].
trait Foo[A] {
  def process(as: Array[A]): Int
}
If you choose A to be a value/primitive type like Double, it has to be boxed. When writing the trait in a non-generic way (e.g. with A = Double), process is compiled to take an Array[Double], which is a distinct array type on the JVM. This is more efficient: to store a Double inside an Array[Any], the Double has to be wrapped (boxed) into an object, and a reference to that object is stored in the array, whereas the dedicated Array[Double] can store the Double directly in memory as a 64-bit value.
The @specialized annotation
If you feel adventurous, you can try the @specialized annotation (it's pretty buggy and crashes the compiler often). This makes scalac compile special versions of a class for all or selected primitive types. It only makes sense if the type parameter appears plain in type signatures (get(a: A), but not get(as: Seq[A])) or as a type parameter to Array. I think you'll receive a warning if specialization is pointless.
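As a small, hedged sketch building on the Foo trait above, specializing it for Double would look like this; the compiler then generates an extra process variant that works on a primitive double[] on the JVM, so no boxing is needed:

trait Foo[@specialized(Double) A] {
  def process(as: Array[A]): Int
}

// implementing it for Double picks up the specialized variant
class DoubleFoo extends Foo[Double] {
  def process(as: Array[Double]): Int = as.length  // toy body, for illustration only
}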
Related
I'm trying to understand the output of the gcov tool. Running it with the -a option makes sense in principle, but I want to understand the block coverage output. Unfortunately it's hard to make sense of what the blocks are and why they aren't taken. The output is below.
I have run the add function in my calculator program once. I have no clue why it shows block 0.
-: 0:Source:calculator.c
-: 0:Graph:calculator.gcno
-: 0:Data:calculator.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include "calculator.h"
-: 2:#include <stdio.h>
-: 3:#include <stdlib.h>
-: 4:
1: 5:int main(int argc, char *argv[])
1: 5-block 0
-: 6:{
-: 7: int a,b, result;
-: 8: char opr;
-: 9:
1: 10: if(argc!=4)
1: 10-block 0
-: 11: {
#####: 12: printf("Invalid arguments...\n");
$$$$$: 12-block 0
#####: 13: return -1;
-: 14: }
-: 15:
-: 16: //get values
1: 17: a = atoi(argv[1]);
1: 18: b = atoi(argv[3]);
-: 19:
-: 20: //get operator
1: 21: opr=argv[2][0];
-: 22:
-: 23: //calculate according to operator
1: 24: switch(opr)
1: 24-block 0
-: 25: {
1: 26: case '+':
1: 27: result = add_(a, b);
1: 27-block 0
-: 28:
1: 29: break;
#####: 30: case '-':
#####: 31: result=sub_(a,b);
$$$$$: 31-block 0
#####: 32: break;
#####: 33: case '_':
#####: 34: result=multiply_(a,b);
$$$$$: 34-block 0
#####: 35: break;
#####: 36: case '/':
#####: 37: result = div_(a,b);
$$$$$: 37-block 0
#####: 38: default:
#####: 39: result=0;
#####: 40: break;
$$$$$: 40-block 0
-: 41: }
-: 42:
1: 43: if(opr=='+' || opr=='-' || opr=='_'|| opr== '/')
1: 43-block 0
$$$$$: 43-block 1
$$$$$: 43-block 2
$$$$$: 43-block 3
1: 44: printf("Result: %d %c %d = %d\n",a,opr,b,result);
1: 44-block 0
-: 45: else
#####: 46: printf("Undefined Operator...\n");
$$$$$: 46-block 0
-: 47:
1: 48: return 0;
1: 48-block 0
-: 49:}
-: 50:
-: 51:/**
-: 52: * Function to add two numbers
-: 53: */
1: 54:float add_(float num1, float num2)
1: 54-block 0
-: 55:{
1: 56: return num1 + num2;
1: 56-block 0
-: 57:}
-: 58:
-: 59:/**
-: 60: * Function to subtract two numbers
-: 61: */
#####: 62:float sub_(float num1, float num2)
$$$$$: 62-block 0
-: 63:{
#####: 64: return num1 - num2;
$$$$$: 64-block 0
-: 65:}
-: 66:
-: 67:/**
-: 68: * Function to multiply two numbers
-: 69: */
#####: 70:float multiply_(float num1, float num2)
$$$$$: 70-block 0
-: 71:{
#####: 72: return num1 * num2;
$$$$$: 72-block 0
-: 73:}
-: 74:
-: 75:/**
-: 76: * Function to divide two numbers
-: 77: */
#####: 78:float div_(float num1, float num2)
$$$$$: 78-block 0
-: 79:{
#####: 80: return num1 / num2;
$$$$$: 80-block 0
-: 81:}
If anyone knows how to decipher the block info, especially lines 5, 12, 13, 43, 64, or knows of any detailed documentation on what it all means, I'd appreciate the help.
Each block is marked by a line with the same line number as the last line of the block, together with the number of branches and calls in the block. A block is essentially the code delimited by a pair of curly braces ({}) or by branch points. Line 5 marks the beginning of the main block; then, as mentioned, a block number is listed for every branch or function call. Your if statement on line 43 has four conditions, which means there are four additional blocks, labeled 0, 1, 2 and 3. All blocks that are not executed are marked $$$$$, which is consistent here: you must have passed '+' as the argument, so the program never takes the path of the other operators, and hence blocks 1, 2 and 3 are marked $$$$$.
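For reference, a typical way to reproduce this kind of output is the following (paths and arguments are just an example):

gcc --coverage -o calculator calculator.c
./calculator 2 + 3
gcov -a -b calculator.c    # -a: per-block counts, -b: branch probabilities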
Hope this helps.
Context
I have a device tree in which one of the nodes is:
gpio@41210000 {
    #gpio-cells = <0x2>;
    #interrupt-cells = <0x2>;
    compatible = "xlnx,xps-gpio-1.00.a,generic-uio,BTandSW";
    gpio-controller;
    interrupt-controller;
    interrupt-parent = <0x4>;
    //interrupt-parent = <&gic>;
    interrupts = <0x0 0x1d 0x4>;
    reg = <0x41210000 0x10000>;
    xlnx,all-inputs = <0x1>;
    xlnx,all-inputs-2 = <0x1>;
    xlnx,all-outputs = <0x0>;
    xlnx,all-outputs-2 = <0x0>;
    xlnx,dout-default = <0x0>;
    xlnx,dout-default-2 = <0x0>;
    xlnx,gpio-width = <0x4>;
    xlnx,gpio2-width = <0x2>;
    xlnx,interrupt-present = <0x1>;
    xlnx,is-dual = <0x1>;
    xlnx,tri-default = <0xffffffff>;
    xlnx,tri-default-2 = <0xffffffff>;
};
Once the kernel is running, performing the
cat /proc/interrupts
the results are:
root@linaro-developer:~# cat /proc/interrupts
CPU0 CPU1
16: 0 0 GIC-0 27 Edge gt
17: 0 0 GIC-0 43 Level ttc_clockevent
18: 1588 1064 GIC-0 29 Edge twd
21: 43 0 GIC-0 39 Level f8007100.adc
24: 0 0 GIC-0 35 Level f800c000.ocmc
25: 506 0 GIC-0 59 Level xuartps
26: 0 0 GIC-0 51 Level e000d000.spi
27: 0 0 GIC-0 54 Level eth5
28: 4444 0 GIC-0 56 Level mmc0
29: 0 0 GIC-0 45 Level f8003000.dmac
30: 0 0 GIC-0 46 Level f8003000.dmac
31: 0 0 GIC-0 47 Level f8003000.dmac
32: 0 0 GIC-0 48 Level f8003000.dmac
33: 0 0 GIC-0 49 Level f8003000.dmac
34: 0 0 GIC-0 72 Level f8003000.dmac
35: 0 0 GIC-0 73 Level f8003000.dmac
36: 0 0 GIC-0 74 Level f8003000.dmac
37: 0 0 GIC-0 75 Level f8003000.dmac
38: 0 0 GIC-0 40 Level f8007000.devcfg
45: 0 0 GIC-0 41 Edge f8005000.watchdog
IPI1: 0 0 Timer broadcast interrupts
IPI2: 1731 2206 Rescheduling interrupts
IPI3: 29 36 Function call interrupts
IPI4: 0 0 CPU stop interrupts
IPI5: 0 0 IRQ work interrupts
IPI6: 0 0 completion interrupts
Err: 0
Questions
Once the kernel is running, should it automatically recognize the interrupt and update the data in the file /proc/interrupts?
However, I wrote a .probe function in this way:
static int SWITCH_of_probe(struct platform_device *pdev)
{
    int ret = 0;
    struct irq_data data_tmp;

    SWITCH_01_devices->temp_res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
    if (!(SWITCH_01_devices->temp_res)) {
        dev_err(&pdev->dev, "TRY could not get IO memory\n");
        return -ENXIO;
    }
    PDEBUG("resource : regs.start=%#x,regs.end=%#x\n", SWITCH_01_devices->temp_res->start, SWITCH_01_devices->temp_res->end);

    //SWITCH_01_devices->irq_line = platform_get_irq(pdev, 0);
    SWITCH_01_devices->irq_line = irq_of_parse_and_map(pdev->dev.of_node, 0);
    if (SWITCH_01_devices->irq_line < 0) {
        dev_err(&pdev->dev, "could not get IRQ\n");
        printk(KERN_ALERT "could not get IRQ\n");
        return SWITCH_01_devices->irq_line;
    }
    PDEBUG(" resource VIRTUAL IRQ NUMBER : irq=%#x\n", SWITCH_01_devices->irq_line);

    ret = request_irq((SWITCH_01_devices->irq_line), SWITCH_01_interrupt, IRQF_SHARED, DRIVER_NAME, NULL);
    if (ret) {
        printk(KERN_ALERT "NEW SWITCH_01: can't get assigned irq, ret= %d\n", ret);
        SWITCH_01_devices->irq_line = -1;
    }

    SWITCH_01_devices->mem_region_requested = request_mem_region((SWITCH_01_devices->temp_res->start), resource_size(SWITCH_01_devices->temp_res), "SWITCH_01");
    if (SWITCH_01_devices->mem_region_requested == NULL) {
        printk(KERN_WARNING "[LEO] SWITCH: FaiSWITCH request_mem_region(res.start,resource_size(&(SWITCH_01_devices->res)),...);\n");
    } else {
        PDEBUG(" [+] request_mem_region\n");
    }

    return 0; /* Success */
}
When inserting the module into the kernel I get this output from dmesg:
[ 1249.777189] SWITCH_01: loading out-of-tree module taints kernel.
[ 1249.777787] [LEO] SWITCH_01: dinamic allocation of major number
[ 1249.777801] [LEO] cdev initialized
[ 1249.777988] [LEO] resource : regs.start=0x41210000,regs.end=0x4121ffff
[ 1249.777994] [LEO] resource : irq=0x2e
[ 1249.778000] NEW SWITCH_01: can't get assigned irq, ret= -22
[ 1249.782531] [LEO] [+] request_mem_region
What am I doing wrong? Why can't I perform a correct request_irq?
Note: the interrupts = <0x0 0x1d 0x4> field of the device tree and the detected IRQ number are different. As was pointed out here, I replaced platform_get_irq(pdev, 0) with irq_of_parse_and_map(pdev->dev.of_node, 0), but the result is the same.
Once the kernel is running, should it automatically recognize the interrupt and update the data in the file /proc/interrupts?
Yes, it will be updated once the interrupt is successfully registered.
[ 1249.777189] SWITCH_01: loading out-of-tree module taints kernel.
[ 1249.777787] [LEO] SWITCH_01: dinamic allocation of major number
[ 1249.777801] [LEO] cdev initialized
[ 1249.777988] [LEO] resource : regs.start=0x41210000,regs.end=0x4121ffff
[ 1249.777994] [LEO] resource : irq=0x2e
[ 1249.778000] NEW SWITCH_01: can't get assigned irq, ret= -22
[ 1249.782531] [LEO] [+] request_mem_region
What am I doing wrong? Why can't I perform a correct request_irq?
A shared interrupt (IRQF_SHARED) must be registered with a non-NULL dev_id, but you are passing NULL to request_irq(). If dev_id is NULL for a shared interrupt, -EINVAL (-22) is returned, which matches the error in your dmesg output. Make sure you pass a valid, non-NULL dev_id.
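As a sketch only, reusing the names from your probe function, the call could look like this; the device structure itself is a reasonable cookie to use, and the same pointer must later be passed to free_irq():

ret = request_irq(SWITCH_01_devices->irq_line, SWITCH_01_interrupt,
                  IRQF_SHARED, DRIVER_NAME, SWITCH_01_devices);  /* non-NULL dev_id */
...
/* on teardown, release it with the same dev_id */
free_irq(SWITCH_01_devices->irq_line, SWITCH_01_devices);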
I'm new to Scala and I wonder whether it is possible to define a generic math function that works with both BigInt and Int, such that in the Int case the arguments are treated as primitives (without any boxing or unboxing in the function body).
So, for example I can do something like
def foo[@specialized(Int) T: Numeric](a: T, b: T) = {
  val n = implicitly[Numeric[T]]
  import n._
  //some code with the use of operators '+-*/'
  a * b - a * a + b * b * b
}
//works for primitive Int
val i1 : Int = 1
val i2 : Int = 2
val i3 : Int = foo(i1, i2)
//works for BigInt
val b1 : BigInt = BigInt(1)
val b2 : BigInt = BigInt(2)
val b3 : BigInt = foo(b1, b2)
Here in foo I can use all math operators for both primitive ints and BigInts (that is what I need). However, function foo(Int, Int) compiles to the following:
public int foo$mIc$sp(int a, int b, Numeric<Object> evidence$1) {
    Numeric n = (Numeric)Predef$.MODULE$.implicitly(evidence$1);
    return BoxesRunTime.unboxToInt((Object)n.mkNumericOps(n.mkNumericOps(n.mkNumericOps((Object)BoxesRunTime.boxToInteger((int)a)).$times((Object)BoxesRunTime.boxToInteger((int)b))).$minus(n.mkNumericOps((Object)BoxesRunTime.boxToInteger((int)a)).$times((Object)BoxesRunTime.boxToInteger((int)a)))).$plus(n.mkNumericOps(n.mkNumericOps((Object)BoxesRunTime.boxToInteger((int)b)).$times((Object)BoxesRunTime.boxToInteger((int)b))).$times((Object)BoxesRunTime.boxToInteger((int)b))));
}
instead of plain:
//this is what I really need and expect from `@specialized(Int)`
public int foo$mIc$sp(int a, int b) {
    return a * b - a * a + b * b * b;
}
which makes @specialized(Int) useless, because the performance is unacceptably low with all these (un)boxings and unnecessary invocations of n.mkNumericOps(...).
So, is there a way to implement a generic function like foo that compiles down to plain primitive code for primitive types?
The problem is that the Numeric typeclass is not specialized.
If you want to do generic math with high performance, I highly recommend the spire math library.
It has a very elaborate mathematical type class hierarchy, instead of just Numeric.
Here is how your example would look using spire:
import spire.implicits._ // typeclass instances etc.
import spire.syntax._ // syntax such as +-*/
import spire.algebra._ // typeclasses such as Field

def foo[@specialized T: Field](a: T, b: T) = {
  //some code with the use of operators '+-*/'
  a * b - a * a + b * b * b
}
Here you are saying that there has to be a Field instance for T. Field refers to the algebraic concept.
Spire is highly modular:
spire.algebra contains many well-known algebraic concepts such as groups, fields, etc., encoded as Scala typeclasses
spire.syntax contains the implicit conversions to add operators and other syntax to types for which typeclass instances are available
spire.implicits contains instances for the typeclasses in spire.algebra for common types such as JVM primitives.
This is why you need the three imports.
Regarding the performance: if your code is specialized, and you are using primitives, the performance will be exactly the same as working with primitives directly.
Here is the code of the foo method when specialized for Int:
public int foo$mIc$sp(int, int, spire.algebra.Field<java.lang.Object>);
Code:
0: aload_3
1: aload_3
2: aload_3
3: iload_1
4: iload_2
5: invokeinterface #116, 3 // InterfaceMethod spire/algebra/Field.times$mcI$sp:(II)I
10: aload_3
11: iload_1
12: iload_1
13: invokeinterface #116, 3 // InterfaceMethod spire/algebra/Field.times$mcI$sp:(II)I
18: invokeinterface #119, 3 // InterfaceMethod spire/algebra/Field.minus$mcI$sp:(II)I
23: aload_3
24: aload_3
25: iload_2
26: iload_2
27: invokeinterface #116, 3 // InterfaceMethod spire/algebra/Field.times$mcI$sp:(II)I
32: iload_2
33: invokeinterface #116, 3 // InterfaceMethod spire/algebra/Field.times$mcI$sp:(II)I
38: invokeinterface #122, 3 // InterfaceMethod spire/algebra/Field.plus$mcI$sp:(II)I
43: ireturn
Note that there is no boxing, and the invokeinterface calls will be inlined by the JVM.
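For completeness, here is a hedged usage sketch (fooRing is a made-up name; since foo only uses +, - and *, the constraint can be relaxed from Field to Ring, which, as far as I know, also has a BigInt instance among Spire's implicits, so the same code covers both Int and BigInt):

def fooRing[@specialized(Int) T: Ring](a: T, b: T) =
  a * b - a * a + b * b * b

val i3: Int = fooRing(1, 2)                     // hits the specialized variant, no boxing
val b3: BigInt = fooRing(BigInt(1), BigInt(2))  // generic path via Spire's Ring[BigInt]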
Comparisons like this come up all the time when writing loops. I was wondering whether i >= 0 needs more CPU cycles than i > -1, since it involves two conditions (greater than OR equal to) compared to a single strict comparison. Is one known to be better than the other, and if so, why?
The premise is not correct: the JIT will implement both tests as a single machine-language instruction.
The number of CPU clock cycles is not determined by whether you compare against zero or -1, because either way the CPU performs one comparison and sets flags indicating whether the result is <, > or =.
It's possible that one of those instructions will be more efficient on certain processors, but this kind of micro-optimization is almost always not worth doing. (It's also possible that the JIT - or javac - will actually generate the same instructions for both tests.)
On the contrary, comparisons with zero (including non-strict ones) take one bytecode instruction less. The x86 architecture supports conditional jumps after any arithmetic or load operation, and this is reflected in the Java bytecode instruction set: there is a group of instructions that compare the value on top of the stack with zero and jump: ifeq/ifgt/ifge/iflt/ifle/ifne (see the full list). Comparison with -1 requires an additional iconst_m1 operation (loading the constant -1 onto the stack).
Here are two loops with the different comparisons:
@GenerateMicroBenchmark
public int loopZeroCond() {
    int s = 0;
    for (int i = 1000; i >= 0; i--) {
        s += i;
    }
    return s;
}

@GenerateMicroBenchmark
public int loopM1Cond() {
    int s = 0;
    for (int i = 1000; i > -1; i--) {
        s += i;
    }
    return s;
}
The second version is one byte longer:
public int loopZeroCond();
Code:
0: iconst_0
1: istore_1
2: sipush 1000
5: istore_2
6: iload_2
7: iflt 20 //
10: iload_1
11: iload_2
12: iadd
13: istore_1
14: iinc 2, -1
17: goto 6
20: iload_1
21: ireturn
public int loopM1Cond();
Code:
0: iconst_0
1: istore_1
2: sipush 1000
5: istore_2
6: iload_2
7: iconst_m1 //
8: if_icmple 21 //
11: iload_1
12: iload_2
13: iadd
14: istore_1
15: iinc 2, -1
18: goto 6
21: iload_1
22: ireturn
The zero-comparison version is also slightly faster on my machine (to my surprise; I expected the JIT to compile these loops into identical assembly):
Benchmark Mode Thr Mean Mean error Units
t.LoopCond.loopM1Cond avgt 1 0,319 0,004 usec/op
t.LoopCond.loopZeroCond avgt 1 0,302 0,004 usec/op
Conclusion
Compare with zero whenever sensible.
I was wondering if anyone could suggest to me how to implement this loop in the following pseudocode:
8:  loop
9:    while f[0] = 0 do
10:     for i = 1 to N do
11:       f[i - 1] = f[i]
12:       c[N + 1 - i] = c[N - i]
13:     end for
14:     f[N] = 0
15:     c[0] = 0
16:     k = k + 1
17:   end while
18:   if deg(f) = 0 then
19:     goto Step 32
20:   end if
      ......... ...... ....
31: end loop
My question is how I should implement the loop that starts on line 8 and ends on line 31. I am comfortable with the statements inside it, but what kind of loop do I use on line 8, and what condition do I give it?
Thanks in advance.
That's an infinite loop. No conditions, just loop forever. The only way out is to get to step 19. In C-like languages you can write that as while (true) or for (;;):
for (;;) {
    // ...
    if (deg(f) == 0) {
        goto afterLoop;
    }
    // ...
}
afterLoop:
// ...
goto is frowned upon, though. It'd be better to replace goto Step 32 with a break statement, which exits a loop immediately:
for (;;) {
    // ...
    if (deg(f) == 0) {
        break;
    }
    // ...
}
For what it's worth, if you didn't have steps 21-30 you could use a do/while loop, where the loop condition goes at the bottom of the loop instead of the top:
do {
    // ...
} while (deg(f) != 0);
That would work if lines 18-20 were the final lines in the loop. Since they're not, it looks like option #2 is the one to go with.
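Putting it together, a rough C sketch of lines 8-20 of your pseudocode might look like this (N, k, the arrays f and c, and deg() are assumed to be defined elsewhere, and the elided steps 21-30 are left as a comment):

for (;;) {                              /* line 8: loop */
    while (f[0] == 0) {                 /* line 9 */
        for (int i = 1; i <= N; i++) {  /* lines 10-13: shift f left, c right */
            f[i - 1] = f[i];
            c[N + 1 - i] = c[N - i];
        }
        f[N] = 0;
        c[0] = 0;
        k = k + 1;
    }
    if (deg(f) == 0)                    /* line 18 */
        break;                          /* replaces "goto Step 32" */
    /* ... steps 21-30 ... */
}                                       /* line 31: end loop */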
If you are going to write pseudocode in such detail, you might as well write it in the target language. Pseudocode should be a much broader brush, something like this (not related to your code):
for each bank account
    check balance as of last month
    if balance greater than promotion limit
        send out valued customer pack
    endif
endfor