Loop vs Double IF statement - performance

What would be more efficient:
For i = 0 to 2
if x[i] == y[i] then do something
//or
if x[0] == y[0] do something
if x[1] == y[1] do something
If I am only doing it twice. Also, ignore the readability.

What would be more efficient:
I think there will be absolutely no difference between the two, as your for loop only runs from 0 to 2, so you may prefer whichever is more readable to you. However, if your for loop were huge (i.e., the index were very large), I would recommend the for loop, as it would be more readable.
Also, ignore the readability.
I would not recommend that; it is always advisable to write code that is readable both for yourself and for others.

It depends. That's almost always the answer to these things.
There will be several effects at play here.
Firstly, having a loop (that is, an actual loop in the machine code; the high-level source is really irrelevant except insofar as it influences the machine code, and a loop of two iterations has an extremely high chance of being unrolled by the compiler) clearly executes more branches in total, and while a correctly predicted branch usually has no latency, branches typically do have a limited throughput.
Secondly, having two distinct branches means that distinct branch prediction histories can be attached to them. That can improve their predictability, particularly if the patterns, taken separately, both fit in a branch history buffer, while taken together the aggregate pattern is too long to fit. That is very machine dependent, does not happen on all microarchitectures, and is very rare in any case, since it requires predictable patterns of behaviour of a carefully balanced "long enough but not too long" length.
Thirdly, unrolling that loop likely leads to more code (unless of course the loop overhead is more than the loop body). That puts more pressure on the code cache and the decoders. This effect, unlike the first two, favours the loop.
Lastly, all of these effects are small. In the presence of just about anything else (such as cache misses), they're likely to completely disappear in the noise.

This isn't exactly the same question, but I had been wondering about the performance difference between doing more work within the same loop and iterating over the set twice. I assumed it would be faster to loop once, but I wasn't sure by how much, or whether optimization would even them out.
I finally did a simple test and, as you would expect, doing more work in the same loop is a bit faster.
The test was pretty simple; it was written in Swift 3.1 and run with optimization ON:
import Foundation

let itr = 10000000
let passes = 10

print("Running \(itr) iterations through \(passes) passes")
for run in 0..<passes {
    print("---------")

    // Variant 1: iterate over the range twice.
    var time = CFAbsoluteTimeGetCurrent()
    var val1 = 0
    for i in 0..<itr {
        val1 += i
    }
    for i in 0..<itr {
        val1 += i
    }
    let t1 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).1 - \(val1) -- \(t1)")

    // Variant 2: iterate once, doing double the work per iteration.
    time = CFAbsoluteTimeGetCurrent()
    var val2 = 0
    for i in 0..<itr {
        val2 += i
        val2 += i
    }
    let t2 = CFAbsoluteTimeGetCurrent() - time
    print("\(run).2 - \(val2) -- \(t2)")
}
And the results:
Running 10000000 iterations through 10 passes
---------
0.1 - 99999990000000 -- 0.127476990222931
0.2 - 99999990000000 -- 0.0763950347900391
---------
1.1 - 99999990000000 -- 0.121748030185699
1.2 - 99999990000000 -- 0.0743749737739563
---------
2.1 - 99999990000000 -- 0.123345971107483
2.2 - 99999990000000 -- 0.0756909847259521
---------
3.1 - 99999990000000 -- 0.11965000629425
3.2 - 99999990000000 -- 0.0711749792098999
---------
4.1 - 99999990000000 -- 0.117263972759247
4.2 - 99999990000000 -- 0.0712859630584717
---------
5.1 - 99999990000000 -- 0.116972029209137
5.2 - 99999990000000 -- 0.0708900094032288
---------
6.1 - 99999990000000 -- 0.121819019317627
6.2 - 99999990000000 -- 0.0748890042304993
---------
7.1 - 99999990000000 -- 0.124098002910614
7.2 - 99999990000000 -- 0.0734890103340149
---------
8.1 - 99999990000000 -- 0.122666001319885
8.2 - 99999990000000 -- 0.07710200548172
---------
9.1 - 99999990000000 -- 0.121197044849396
9.2 - 99999990000000 -- 0.0715969800949097

Related

Dynamic number system in Qlik Sense

My data consists of large numbers. I have a column, say 'amount'; when using it in charts (sum of amount on the Y axis) it shows something like 1.4G. I want to show it as, e.g., 2.8B if it is in the billions, 80M if it is in the millions, or simply 14k if it is in the thousands (14,000).
I have used if(sum(amount)/1000000000 > 1, Num(sum(amount)/1000000000, '#,###B'), Num(sum(amount)/1000000, '#,###M')), but it does not show the M or B at the end of the figure. Also, how do I include thousands in the same code?
EDIT: Updated to include the dual() function.
This worked for me:
=dual(
    if(sum(amount) < 1, Num(sum(amount), '#,##0.00'),
    if(sum(amount) < 1000, Num(sum(amount), '#,##0'),
    if(sum(amount) < 1000000, Num(sum(amount)/1000, '#,##0k'),
    if(sum(amount) < 1000000000, Num(sum(amount)/1000000, '#,##0M'),
        Num(sum(amount)/1000000000, '#,##0B')
    ))))
    , sum(amount)
)
Here are some example outputs using this script to format it:
=sum(amount)     Formatted
2,526,163,764    3B
79,342,364       79M
5,589,255        5M
947,470          947k
583              583
0.6434           0.64
To get more decimals for any of those, like 2.53B instead of 3B, you can format them like '#,##0.00B' by adding more zeroes at the end.
Also make sure that the Number Formatting property is set to Auto or Measure expression.
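If it helps to see the tiering logic outside Qlik syntax, here is a rough Python sketch of the same cascade (purely illustrative; the function name human_amount is made up). It mirrors the nested ifs above and returns a (text, number) pair in the spirit of dual():

def human_amount(amount):
    # Pick a display string based on magnitude, like the nested if() above.
    if amount < 1:
        text = "{:,.2f}".format(amount)
    elif amount < 1000:
        text = "{:,.0f}".format(amount)
    elif amount < 1000000:
        text = "{:,.0f}k".format(amount / 1000)
    elif amount < 1000000000:
        text = "{:,.0f}M".format(amount / 1000000)
    else:
        text = "{:,.0f}B".format(amount / 1000000000)
    # dual() also keeps the underlying numeric value (for sorting); mirror that here.
    return text, amount

print(human_amount(2526163764)[0])  # 3B
print(human_amount(79342364)[0])    # 79M
print(human_amount(947470)[0])      # 947k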

xarray too slow for performance critical code

I planned to use xarray extensively in some numerically intensive scientific code that I am writing. So far it makes the code very elegant, but I think I will have to abandon it, as the performance cost is far too high.
Here is an example, which creates two arrays and multiplies parts of them together using xarray (with several indexing schemes) and NumPy. I used num_comp=2 and num_x=10000:
Line # Hits Time Per Hit % Time Line Contents
4 @profile
5 def xr_timing(num_comp, num_x):
6 1 4112 4112.0 10.1 da1 = xr.DataArray(np.random.random([num_comp, num_x]).astype(np.float32), dims=['component', 'x'], coords={'component': ['a', 'b'], 'x': np.linspace(0, 1, num_x)})
7 1 438 438.0 1.1 da2 = da1.copy()
8 1 1398 1398.0 3.4 da2[:] = np.random.random([num_comp, num_x]).astype(np.float32)
9 1 7148 7148.0 17.6 da3 = da1.isel(component=0).drop('component') * da2.isel(component=0).drop('component')
10 1 6298 6298.0 15.5 da4 = da1[dict(component=0)].drop('component') * da2[dict(component=0)].drop('component')
11 1 7541 7541.0 18.6 da5 = da1.sel(component='a').drop('component') * da2.sel(component='a').drop('component')
12 1 7184 7184.0 17.7 da6 = da1.loc[dict(component='a')].drop('component') * da2.loc[dict(component='a')].drop('component')
13 1 6479 6479.0 16.0 da7 = da1[0, :].drop('component') * da2[0, :].drop('component')
15 @profile
16 def np_timing(num_comp, num_x):
17 1 1027 1027.0 50.2 da1 = np.random.random([num_comp, num_x]).astype(np.float32)
18 1 977 977.0 47.8 da2 = np.random.random([num_comp, num_x]).astype(np.float32)
19 1 41 41.0 2.0 da3 = da1[0, :] * da2[0, :]
The fastest xarray multiplication takes about 150X the time of the numpy version. This is just one of the operations in my code, but I find most of them are many times slower than the numpy equivalent, which is unfortunate as xarray makes the code so much clearer. Am I doing something wrong?
Update: Even da1[0, :].values * da2[0, :].values (which loses many of the benefits of using xarray) takes 2464 time units.
I am using xarray 0.9.6, pandas 0.21.0, numpy 1.13.3, and Python 3.5.2.
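For reference, the profiled functions can be reconstructed as a plain standalone script. This is just a sketch: the @profile decorators (which belong to line_profiler/kernprof) are dropped, and only the sel-based xarray variant and the NumPy variant are kept:

import numpy as np
import xarray as xr

def xr_timing(num_comp, num_x):
    # Build two labelled arrays, then multiply one component slice of each.
    # num_comp is assumed to be 2 to match the 'a'/'b' component labels.
    da1 = xr.DataArray(
        np.random.random([num_comp, num_x]).astype(np.float32),
        dims=['component', 'x'],
        coords={'component': ['a', 'b'], 'x': np.linspace(0, 1, num_x)})
    da2 = da1.copy()
    da2[:] = np.random.random([num_comp, num_x]).astype(np.float32)
    return da1.sel(component='a').drop('component') * da2.sel(component='a').drop('component')

def np_timing(num_comp, num_x):
    # The same computation on bare NumPy arrays.
    da1 = np.random.random([num_comp, num_x]).astype(np.float32)
    da2 = np.random.random([num_comp, num_x]).astype(np.float32)
    return da1[0, :] * da2[0, :]

xr_timing(2, 10000)
np_timing(2, 10000)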
Update 2:
As requested by @Maximilian, here is a re-run with num_x=1000000:
Line # Hits Time Per Hit % Time Line Contents
# xarray
9 5 408596 81719.2 11.3 da3 = da1.isel(component=0).drop('component') * da2.isel(component=0).drop('component')
10 5 407003 81400.6 11.3 da4 = da1[dict(component=0)].drop('component') * da2[dict(component=0)].drop('component')
11 5 411248 82249.6 11.4 da5 = da1.sel(component='a').drop('component') * da2.sel(component='a').drop('component')
12 5 411730 82346.0 11.4 da6 = da1.loc[dict(component='a')].drop('component') * da2.loc[dict(component='a')].drop('component')
13 5 406757 81351.4 11.3 da7 = da1[0, :].drop('component') * da2[0, :].drop('component')
14 5 48800 9760.0 1.4 da8 = da1[0, :].values * da2[0, :].values
# numpy
20 5 37476 7495.2 2.9 da3 = da1[0, :] * da2[0, :]
The performance difference has decreased substantially, as expected (only about 10X slower now), but I am still glad that the issue will be mentioned in the next release of the documentation as even this amount of difference may surprise some people.
Yes, this is a known limitation of xarray. Performance-sensitive code that uses small arrays is much slower with xarray than with NumPy. I wrote a new section about this in our docs for the next version:
http://xarray.pydata.org/en/stable/computation.html#wrapping-custom-computation
You basically have two options:
Write your performance-sensitive code on unwrapped arrays, and then wrap them back in xarray data structures. Xarray v0.10 has a new helper function (apply_ufunc) that makes this a little easier. See the link above if you are interested, and the sketch after these options.
Use something other than xarray/Python to do your computation. This could also make sense because Python itself adds significant overhead. Julia's AxisArrays.jl looks interesting, though I haven't tried it myself.
I suppose option 3 would be to rewrite xarray itself in C++ (e.g., on top of xtensor), but that would be much more involved!
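Here is a minimal sketch of option 1 under the question's shapes (the function name hot_multiply is just for illustration): the performance-sensitive multiplication runs on plain NumPy arrays, and apply_ufunc wraps the result back into a labelled DataArray.

import numpy as np
import xarray as xr

num_comp, num_x = 2, 10000
da1 = xr.DataArray(
    np.random.random([num_comp, num_x]).astype(np.float32),
    dims=['component', 'x'],
    coords={'component': ['a', 'b'], 'x': np.linspace(0, 1, num_x)})
da2 = da1.copy()
da2[:] = np.random.random([num_comp, num_x]).astype(np.float32)

def hot_multiply(a, b):
    # a and b arrive as bare NumPy arrays of shape (component, x);
    # no xarray overhead inside the hot code.
    return a[0, :] * b[0, :]

# apply_ufunc passes the underlying arrays to hot_multiply and rebuilds
# a DataArray with dimension 'x' around the result.
da3 = xr.apply_ufunc(
    hot_multiply, da1, da2,
    input_core_dims=[['component', 'x'], ['component', 'x']],
    output_core_dims=[['x']])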

Optimizing Groovy Performance

I'm working on Groovy code performance optimization. I've used jvisualvm to connect to the running application and gather CPU samples. The samples say that org.codehaus.groovy.reflection.CachedMethod.invoke takes the most CPU time. I don't see any other application methods in the samples.
What is the right way to dig into CachedMethod.invoke and understand which lines of code really cause the performance penalties?
Thanks.
UPD:
I do use indy; it didn't help me.
I didn't try to introduce @CompileStatic, since I want to find my bottlenecks before rewriting Groovy to Java.
My problem is a bit similar to this thread: Call site caching faster than invokedynamic?
I have code that dynamically composes a Groovy script. The script template looks like this:
def evaluateExpression(Map context){
    def user = context.user
    %s
}
where %s is replaced with
user.attr1 == '1' || user.attr2 == '2' || user.attr3 == '3'
There is a set of replacements (20 in total) taken from the database.
The code gets the replacements from the DB, creates a Groovy script, and evaluates it.
I suppose the bottleneck is in the script execution. What is the right way to fix it?
So, I've tried various things:
groovy-indy: doesn't work.
groovy-indy with some code "optimization": doesn't work. BTW, I started to play around with try/catch and, as a result, made my "hotspot" run 4 times faster. I'm not good at JVM internals, but the internet says try/catch prevents optimizations; I took that as ground truth. I need to dig deeper to understand how it really works.
I gave up, turned off invokedynamic and rewrote my "hottest" code with @CompileStatic. It took about 3-4 hours and my code runs 100 times faster now.
Here are initial metrics with "invokedynamic support"
count = 83043
mean rate = 395.52 calls/second
1-minute rate = 555.30 calls/second
5-minute rate = 217.78 calls/second
15-minute rate = 82.92 calls/second
min = 0.29 milliseconds
max = 12.98 milliseconds
mean = 1.59 milliseconds
stddev = 1.08 milliseconds
median = 1.39 milliseconds
75% <= 2.46 milliseconds
95% <= 3.14 milliseconds
98% <= 3.44 milliseconds
99% <= 3.76 milliseconds
99.9% <= 12.19 milliseconds
Here are the @CompileStatic metrics with indy turned off. BTW, there is no reason to use @CompileStatic if "indy" is turned on.
count = 139724
mean rate = 8950.43 calls/second
1-minute rate = 2011.54 calls/second
5-minute rate = 426.96 calls/second
15-minute rate = 143.76 calls/second
min = 0.02 milliseconds
max = 24.18 milliseconds
mean = 0.08 milliseconds
stddev = 0.72 milliseconds
median = 0.06 milliseconds
75% <= 0.08 milliseconds
95% <= 0.11 milliseconds
98% <= 0.15 milliseconds
99% <= 0.20 milliseconds
99.9% <= 1.27 milliseconds

WINBUGS : adding time and product fixed effects in a hierarchical data

I am working on hierarchical panel data using WinBUGS. Assume data on school performance: logs is the dependent variable, with independent variables logp and rank. All schools are divided into three categories (cat) and I need a beta coefficient for each category (thus HLM). I want to account for time-specific and school-specific effects in the model. One way would be to add dummy variables to the list of variables under mu[i], but that would get messy because my number of schools runs up to 60. I am sure there must be a better way to handle this.
My data looks like the following:
school time logs logp cat rank
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school (hierarchies)
My WinBUGS code is given below.
model {
    # N observations
    for (i in 1:n) {
        logs[i] ~ dnorm(mu[i], tau)
        mu[i] <- bcons + bprice*(logp[i]) + brank[cat[i]]*(rank[i])
    }
    # C categories
    for (c in 1:C) {
        brank[c] ~ dnorm(beta, taub)
    }
    # priors
    bcons ~ dnorm(0, 1.0E-6)
    bprice ~ dnorm(0, 1.0E-6)
    bad ~ dnorm(0, 1.0E-6)
    beta ~ dnorm(0, 1.0E-6)
    tau ~ dgamma(0.001, 0.001)
    taub ~ dgamma(0.001, 0.001)
}
As you can see in the data sample above, I have multiple observations per school over time. How can I modify the code to account for time and school-specific fixed effects? I have used Stata in the past, where options like fe, be, and i.time take care of fixed effects in panel data, but here I am lost.

How to deduce from synthesis report

I have coded the 80C51 architecture in VHDL using Xilinx. In an attempt to increase the clock frequency, I pipelined all the 80C51 instructions. The instructions execute as desired; for example, while the first instruction is being processed, the second instruction gets fetched.
However, according to the synthesis report I only get a slightly higher clock frequency (around +/-10 Hz) despite creating a pipeline depth of 3. I figured out that the bottleneck is due to one operation, as specified by the synthesis report, but I could not understand the report.
May I ask what the data path from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7' is trying to do?
(My guess is that it uses a case/when statement to check the 100+ relevant opcodes, but I am not sure whether that is the bottleneck. I am clueless.)
Hence, my only two queries are:
Firstly, is it possible that pipelining does not increase the clock frequency, and that the testbench is the only way to explain the reduction in timing?
Secondly, how can I deduce which path in my code is the bottleneck from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7'?
Thank you to anyone who can help explain my doubts!
Timing Summary:
---------------
Speed Grade: -4
Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
Minimum input arrival time before clock: 10.501ns
Maximum output required time after clock: 5.698ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 12.542ns (frequency: 79.730MHz)
Total number of paths / destination ports: 113114 / 2670
-------------------------------------------------------------------------
Delay: 12.542ns (Levels of Logic = 10)
Source: SEQ/decode_3 (FF)
Destination: SEQ/i_ram_addr_7 (FF)
Source Clock: clk rising
Destination Clock: clk rising
Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FDC:C->Q 102 0.591 1.364 SEQ/decode_3 (SEQ/decode_3)
LUT4_D:I1->O 10 0.643 0.885 SEQ/de_state_cmp_eq002111 (N314)
LUT4:I3->O 7 0.648 0.740 SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
LUT4:I2->O 3 0.648 0.534 SEQ/i_ram_addr_mux0000<0>11111 (N2301)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
MUXF5:I0->O 1 0.276 0.423 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
LUT4_D:I3->O 6 0.648 0.701 SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
LUT3_L:I2->LO 1 0.648 0.103 SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
FDE:D 0.252 SEQ/i_ram_addr_7
----------------------------------------
Total 12.542ns (6.946ns logic, 5.596ns route)
(55.4% logic, 44.6% route)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 154 / 154
-------------------------------------------------------------------------
Offset: 8.946ns (Levels of Logic = 6)
Source: rst (PAD)
Destination: SEQ/i_ram_diByte_1 (FF)
Destination Clock: clk rising
Data Path: rst to SEQ/i_ram_diByte_1
Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
IBUF:I->O 444 0.849 1.392 rst_IBUF (REG/ext_int/fd_out1_0__or0000)
BUF:I->O 445 0.648 1.425 rst_IBUF_1 (rst_IBUF_1)
LUT3:I2->O 4 0.648 0.730 ROM/data<1>1 (i_rom_data<1>)
LUT4:I0->O 1 0.648 0.500 SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
LUT4:I1->O 1 0.643 0.563 SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
LUT4:I0->O 1 0.648 0.000 SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
FDE:D 0.252 SEQ/i_ram_diByte_1
----------------------------------------
Total 8.946ns (4.336ns logic, 4.610ns route)
(48.5% logic, 51.5% route)
=========================================================================
To be more specific, I will give a snippet of example code from the decode phase of one opcode.
The following is one such case when decoding an opcode, in this instance a MOV instruction. There are about 100+ opcodes (100+ instructions), which means this case statement has over 100 when branches.
case OPCODE is
    --MOV A, Rn
    when "11101000" | "11101001" | "11101010" | "11101011" | "11101100" | "11101101" |
         "11101110" | "11101111" =>
        case de_state is
            when E7 =>
                de_state <= E8;
            when E8 =>
                de_state <= E9;
            when E9 =>
                de_state <= E10;
            when E10 =>
                --Draw PSW
                i_ram_addr <= x"D0";
                i_ram_rdByte <= '1';
                de_state <= E11;
            when E11 =>
                --Draw from Rn
                i_ram_addr <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
                i_ram_rdByte <= '1';
                de_state <= E12;
            when E12 =>
                --Place into EDR
                EDR <= i_ram_doByte;
                --close rdByte
                i_ram_rdByte <= '0';
            when others =>
        end case;
I hope this gives you a better idea of my VHDL code. I would appreciate any form of help. Thank you!
Since you're using Xilinx, I presume you also have access to PlanAhead? Try "Analyze Timing / Floorplan Design (PlanAhead)" (under "Implement Design" -> "Place & Route").
PlanAhead should open, and give you a view of your timing results in the bottom. Pick the critical path (the one with the least slack), right click it and choose "Schematic", which will bring up a graphical view of the involved primitives. You can then right-click the primitives and choose "Expand Cone" -> "To Flops" to get a view of the surrounding components too.
This should help you get a much better idea of what signals are involved. Try tracing the input and output signals to your VHDL code, and focus on that path for optimization.
There will be no good answers from this information only; we can only guess what source code produced this hardware.
But it is clear that you need to examine the source, make a hypothesis why it is slow, take action to correct the problem, and test the solution.
And repeat until fast enough.
My guess, given your hint that there is a case statement to decode the opcodes...
one of the arms is something like:
when <some expression involving decode> =>
address <= <some address calculation>;
The problem is that often the two expressions are inter-related so that they are evaluated in the same cycle. An example solution would be to precompute the address expression (i.e. in the previous cycle) into a register, and rewrite the case arm as:
when <some expression involving decode> =>
address <= register;
If you guessed right, the result will be slightly faster and you have another (similar) bottleneck to fix. Repeat until fast enough...
But without the source AND the timing analysis, don't expect a more specific answer.
EDIT: now that a fraction of the source code has been posted, the picture is a little clearer:
you have two nested Case statements, each quite large. You clearly need some simplification...
I note that only 2 of the inner case arms assign to i_ram_addr, yet the timing analysis shows a huge and complex mux on i_ram_addr; clearly there are a lot of other case arms that contribute terms to i_ram_addr...
I would suggest that you might have to treat i_ram_addr separately from the main Case statement and write the simplest machine you can to generate i_ram_addr alone.
For example I would note that the OPCODE case arm is equivalent to:
if OPCODE(7 downto 3) = "11101" then ...
and ask how simple you can get a decoder for i_ram_addr alone.
You may find that a lot of other case arms do very similar things with i_ram_addr (the original 8051 designers would have jumped at the chance to simplify logic!).
Synthesis tools can be quite clever at simplifying logic, but when things get too complex they can miss opportunities.
(At this stage I would comment out the i_ram_addr assignments and leave the rest of the decoder alone)
