gfortran program runs faster than ifort - performance

I have some Fortran code that runs faster when compiled with gfortran than when compiled with ifort. What I usually find on the internet is the opposite case...
I tried running Intel VTune to identify where the hotspots differ between the two executables, but I couldn't draw any conclusions from it.
I'm not sure what could cause this difference. Here is the perf output:
gfortran:
Performance counter stats for 'build/gnuRelease/NBODY inputFile temp' (10 runs):
2,489.36 msec task-clock:u # 0.986 CPUs utilized ( +- 0.21% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
589 page-faults:u # 0.237 K/sec ( +- 0.05% )
10,678,130,527 cycles:u # 4.290 GHz ( +- 0.20% )
31,102,858,644 instructions:u # 2.91 insn per cycle ( +- 0.00% )
3,537,572,458 branches:u # 1421.078 M/sec ( +- 0.00% )
566,054 branch-misses:u # 0.02% of all branches ( +- 5.14% )
2.5235 +- 0.0150 seconds time elapsed ( +- 0.59% )
ifort:
Performance counter stats for 'build/ifortRelease/NBODY inputFile temp' (10 runs):
2,834.44 msec task-clock:u # 0.978 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,600 page-faults:u # 0.917 K/sec ( +- 0.01% )
12,146,500,211 cycles:u # 4.285 GHz ( +- 0.14% )
36,441,911,065 instructions:u # 3.00 insn per cycle ( +- 0.00% )
2,936,917,079 branches:u # 1036.154 M/sec ( +- 0.00% )
339,226 branch-misses:u # 0.01% of all branches ( +- 3.74% )
2.8991 +- 0.0165 seconds time elapsed ( +- 0.57% )
The page-faults metric caught my eye, but I'm not sure what it means...
UPDATE:
gfortran version: 10.2.0
ifort version: 19.1.3.304
CPU: Intel Xeon(R)
UPDATE:
similar example: Puzzling performance difference between ifort and gfortran
From that question:
When the complex IF statement is removed, gfortran takes about 4 times as much time (10-11 seconds). This is to be expected since the statement throws out approximately 75% of the numbers, avoiding the SQRT on them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
This seems to be relevant to my case too.
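For reference, the kind of guarded loop being discussed looks roughly like this (a sketch for illustration, not the linked question's actual code):
program guarded_sqrt
  implicit none
  integer, parameter :: n = 10000000
  real, allocatable :: x(:)
  real :: s
  integer :: i
  allocate(x(n))
  call random_number(x)
  s = 0.0
  do i = 1, n
     ! the IF guard throws out most of the values, skipping the SQRT for them;
     ! how each compiler optimizes this guard may explain the timing gap
     if (x(i) > 0.75) s = s + sqrt(x(i))
  end do
  print *, s
end program guarded_sqrt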

Related

How to do proper performance testing and analysis of alacritty and zsh

I've been working on my setup lately and have been trying to determine where my 2.3s terminal load times are coming from. I'm fairly new to Linux performance testing in general, but I have determined a few things.
The first thing I should mention is that terminal is a shell script with the following contents:
#!/bin/sh
WINIT_X11_SCALE_FACTOR=1.5 alacritty "$@"
The stats on launching the terminal program (alacritty) and its shell (zsh -l):
> perf stat -r 10 -d terminal -e $SHELL -slc exit
Performance counter stats for 'terminal -e /usr/bin/zsh -slc exit' (10 runs):
602.55 msec task-clock # 0.261 CPUs utilized ( +- 1.33% )
957 context-switches # 1.532 K/sec ( +- 0.42% )
92 cpu-migrations # 147.298 /sec ( +- 1.89% )
68,150 page-faults # 109.113 K/sec ( +- 0.13% )
2,188,445,151 cycles # 3.504 GHz ( +- 0.17% )
3,695,337,515 instructions # 1.70 insn per cycle ( +- 0.08% )
791,333,786 branches # 1.267 G/sec ( +- 0.06% )
14,007,258 branch-misses # 1.78% of all branches ( +- 0.09% )
10,893,173,535 slots # 17.441 G/sec ( +- 0.13% )
3,574,546,556 topdown-retiring # 30.5% Retiring ( +- 0.11% )
2,888,937,632 topdown-bad-spec # 24.0% Bad Speculation ( +- 0.41% )
3,125,577,758 topdown-fe-bound # 27.1% Frontend Bound ( +- 0.16% )
2,189,183,796 topdown-be-bound # 18.4% Backend Bound ( +- 0.47% )
924,852,782 L1-dcache-loads # 1.481 G/sec ( +- 0.07% )
38,308,478 L1-dcache-load-misses # 4.16% of all L1-dcache accesses ( +- 0.09% )
3,445,566 LLC-loads # 5.517 M/sec ( +- 0.20% )
725,990 LLC-load-misses # 20.97% of all LL-cache accesses ( +- 0.36% )
2.30683 +- 0.00331 seconds time elapsed ( +- 0.14% )
The stats on launching just the shell (zsh):
Performance counter stats for '/usr/bin/zsh -i -c exit' (10 runs):
1,548.56 msec task-clock # 0.987 CPUs utilized ( +- 3.28% )
525 context-switches # 323.233 /sec ( +- 21.17% )
16 cpu-migrations # 9.851 /sec ( +- 11.33% )
90,616 page-faults # 55.791 K/sec ( +- 2.63% )
6,559,830,564 cycles # 4.039 GHz ( +- 3.18% )
11,317,955,247 instructions # 1.68 insn per cycle ( +- 3.69% )
2,351,473,571 branches # 1.448 G/sec ( +- 3.46% )
46,539,165 branch-misses # 1.91% of all branches ( +- 1.31% )
32,783,001,655 slots # 20.184 G/sec ( +- 3.18% )
10,776,867,769 topdown-retiring # 32.5% Retiring ( +- 3.28% )
5,729,353,491 topdown-bad-spec # 18.2% Bad Speculation ( +- 6.90% )
11,083,567,578 topdown-fe-bound # 33.3% Frontend Bound ( +- 2.34% )
5,458,201,823 topdown-be-bound # 15.9% Backend Bound ( +- 4.51% )
3,180,211,376 L1-dcache-loads # 1.958 G/sec ( +- 3.10% )
126,282,947 L1-dcache-load-misses # 3.85% of all L1-dcache accesses ( +- 2.37% )
14,347,257 LLC-loads # 8.833 M/sec ( +- 1.48% )
2,386,047 LLC-load-misses # 16.33% of all LL-cache accesses ( +- 0.77% )
1.5682 +- 0.0550 seconds time elapsed ( +- 3.51% )
The stats on launching the shell (zsh) with zmodload zsh/zprof:
num calls time self name
-----------------------------------------------------------------------------------
1) 31 78.54 2.53 77.09% 50.07 1.62 49.14% antigen
2) 2 23.24 11.62 22.81% 15.93 7.96 15.63% compinit
3) 2 7.31 3.66 7.18% 7.31 3.66 7.18% compaudit
4) 1 8.27 8.27 8.12% 7.29 7.29 7.16% _autoenv_source
5) 1 6.93 6.93 6.80% 6.93 6.93 6.80% detect-clipboard
6) 1 5.18 5.18 5.08% 5.18 5.18 5.08% _autoenv_hash_pair
7) 1 2.49 2.49 2.45% 2.45 2.45 2.41% _zsh_highlight_load_highlighters
8) 2 1.01 0.51 0.99% 1.01 0.51 0.99% _autoenv_stack_entered_contains
9) 10 0.91 0.09 0.89% 0.91 0.09 0.89% add-zsh-hook
10) 1 0.94 0.94 0.92% 0.87 0.87 0.85% _autoenv_stack_entered_add
11) 1 0.85 0.85 0.84% 0.85 0.85 0.84% async_init
12) 1 0.49 0.49 0.49% 0.49 0.49 0.48% _zsh_highlight__function_callable_p
13) 1 0.45 0.45 0.44% 0.45 0.45 0.44% colors
14) 3 0.38 0.13 0.37% 0.35 0.12 0.35% add-zle-hook-widget
15) 6 0.34 0.06 0.34% 0.34 0.06 0.34% is-at-least
16) 2 15.14 7.57 14.86% 0.27 0.13 0.26% _autoenv_chpwd_handler
17) 1 5.46 5.46 5.36% 0.26 0.26 0.26% _autoenv_authorized_env_file
18) 1 0.23 0.23 0.22% 0.23 0.23 0.22% regexp-replace
19) 11 0.19 0.02 0.19% 0.19 0.02 0.19% _autoenv_debug
20) 2 0.10 0.05 0.10% 0.10 0.05 0.10% wrap_clipboard_widgets
21) 16 0.09 0.01 0.09% 0.09 0.01 0.09% compdef
22) 1 0.08 0.08 0.08% 0.08 0.08 0.08% (anon) [/home/nate-wilkins/.antigen/bundles/zsh-users/zsh-autosuggestions/zsh-autosuggestions.zsh:458]
23) 2 0.05 0.02 0.05% 0.05 0.02 0.05% bashcompinit
24) 1 0.06 0.06 0.06% 0.04 0.04 0.04% _autoenv_stack_entered_remove
25) 1 5.50 5.50 5.40% 0.03 0.03 0.03% _autoenv_check_authorized_env_file
26) 1 0.04 0.04 0.04% 0.03 0.03 0.03% complete
27) 1 0.88 0.88 0.87% 0.03 0.03 0.03% async
28) 1 0.03 0.03 0.02% 0.03 0.03 0.02% (anon) [/usr/share/zsh/functions/Misc/add-zle-hook-widget:28]
29) 2 0.01 0.00 0.01% 0.01 0.00 0.01% env_default
30) 1 0.01 0.01 0.01% 0.01 0.01 0.01% _zsh_highlight__is_function_p
31) 1 0.00 0.00 0.00% 0.00 0.00 0.00% _zsh_highlight_bind_widgets
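For reference, zprof was enabled roughly like this (a sketch of the ~/.zshrc setup):
# first line of ~/.zshrc: load the profiler module
zmodload zsh/zprof
# ... the rest of the configuration (antigen, compinit, plugins, ...) ...
# last line of ~/.zshrc: print the report once startup finishes
zprof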
Lastly I have a perf run with a corresponding flamegraph:
perf-run --out alacritty --command "terminal -e $SHELL -slc exit"
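(perf-run here is a wrapper script; the pipeline it stands in for is roughly the following, as a sketch assuming Brendan Gregg's FlameGraph scripts stackcollapse-perf.pl and flamegraph.pl are on the PATH:)
# record call graphs for just the launched command
perf record -g -o alacritty.data -- terminal -e $SHELL -slc exit
# fold the stacks and render the flamegraph
perf script -i alacritty.data | stackcollapse-perf.pl | flamegraph.pl > alacritty.svg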
But I'm not sure how to interpret the flamegraph since it seems to have everything in it and not just the command that was run.
So my question is:
What is taking up the most time in my terminal setup, and is there another approach I could use to better determine where the problem is coming from?

Time.time in Unity

Hi, I saw a video on how to move a cube the way the snake moves in a snake game.
In this video ( https://www.youtube.com/watch?v=aT2zNLSFQEk&list=PLLH3mUGkfFCVNs51eK8ftCAlI3hZQ95tC&index=11 ) he declares a float named **lastMove** with no value (zero by default), uses it in a condition, subtracts it from Time.time, and then assigns Time.time to **lastMove**.
My question is: what is the effect of lastMove in the condition when it has no value?
If I remove it from the if statement the game runs fast, but if it stays in the if statement, time passes much more slowly.
What he does is continuously check whether time - lastMove is bigger than a given predefined interval (timeBetweenMoves). Time keeps increasing each frame while lastMove stays fixed, so at some point this condition becomes true. When it does, he updates lastMove with the current value of time to "reset the loop", i.e. to make the difference lower than the interval again. The point of doing this is to move only at a fixed interval (0.25 secs) instead of every frame. Like this:
interval = 0.25 (timeBetweenMoves)
time (secs) | lastMove | time - lastMove
-----------------------------------------
0.00 | 0 | 0
0.05 | 0 | 0.05
0.10 | 0 | 0.10
0.15 | 0 | 0.15
0.20 | 0 | 0.20
0.25 | 0 | 0.25
0.30 | 0 | 0.30 ---> bigger than interval: MOVE and set lastMove to this (0.30)
0.35 | 0.30 | 0.05
0.40 | 0.30 | 0.10
0.45 | 0.30 | 0.15
0.50 | 0.30 | 0.20
0.55 | 0.30 | 0.25
0.60 | 0.30 | 0.30 ---> bigger than interval: MOVE and set lastMove to time (0.60)
0.65 | 0.60 | 0.05
0.70 | 0.60 | 0.10
...
This is a kind of throttling.
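A minimal sketch of that pattern in Unity C# (the names lastMove, timeBetweenMoves, and Move are illustrative, not necessarily the video's exact code):
using UnityEngine;

public class SnakeMovement : MonoBehaviour
{
    public float timeBetweenMoves = 0.25f; // fixed interval between moves
    private float lastMove;                // declared with no value, so it starts at 0

    void Update()
    {
        // Time.time keeps growing every frame while lastMove stays fixed,
        // so eventually the difference exceeds the interval
        if (Time.time - lastMove > timeBetweenMoves)
        {
            Move();
            lastMove = Time.time; // "reset" so the difference is small again
        }
    }

    void Move()
    {
        // move the cube one step, e.g. one unit to the right
        transform.position += Vector3.right;
    }
}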

Trouble understanding and comparing CPU performance metrics

When running toplev from pmu-tools on a piece of software (compiled with gcc -g -O3), I get this output:
FE Frontend_Bound: 37.21 +- 0.00 % Slots
BAD Bad_Speculation: 23.62 +- 0.00 % Slots
BE Backend_Bound: 7.33 +- 0.00 % Slots below
RET Retiring: 31.82 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 26.55 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 10.62 +- 0.00 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 23.72 +- 0.00 % Slots
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 1.59 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 5.73 +- 0.00 % Slots below
RET Retiring.Base: 31.54 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.28 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.70 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.62 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.04 +- 0.00 % Clocks_Estimated <==
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.57 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.76 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.36 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 26.79 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 6.53 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.03 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.37 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 2.46 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 0.22 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.01 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.53 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the flag -falign-loops=32 to gcc, the binary now takes around 3.8 seconds to run, and this is the output from toplev:
FE Frontend_Bound: 17.47 +- 0.00 % Slots below
BAD Bad_Speculation: 28.55 +- 0.00 % Slots
BE Backend_Bound: 12.02 +- 0.00 % Slots
RET Retiring: 34.21 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 6.10 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Bandwidth: 11.31 +- 0.00 % Slots below
BAD Bad_Speculation.Branch_Mispredicts: 29.19 +- 0.00 % Slots <==
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 4.58 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 7.44 +- 0.00 % Slots below
RET Retiring.Base: 33.70 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.50 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.55 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.58 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.72 +- 0.00 % Clocks_Estimated below
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.17 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.40 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.68 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 42.01 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 7.60 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.04 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.70 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 0.71 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 1.85 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.02 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 17.38 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
By adding that flag, the frontend latency has improved (as we can see from the toplev output). I understand that with the flag the loops are now aligned to 32 bytes and the DSB is hit more frequently in the tight loops (the code spends most of its time in a couple of small loops).
However, I don't understand why the Frontend_Bound.Frontend_Bandwidth.DSB metric has gone up (its description is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, since better use of the DSB is precisely what I'm improving by adding the gcc flag.
PS: when running toplev I used --no-multiplex to minimize errors caused by multiplexing.
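The invocation was along these lines (a sketch; my_binary stands in for the actual program and the exact options may have differed):
# level-3 top-down breakdown without event multiplexing
./toplev.py -l3 --no-multiplex ./my_binary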
The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
606: eb 15 jmp 61d <main+0x7d>
608: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
60f: 00
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
624: 48 8d 0c 36 lea rcx,[rsi+rsi*1]
628: 48 81 f9 00 20 00 00 cmp rcx,0x2000
62f: 77 20 ja 651 <main+0xb1>
631: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
636: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
63d: 00 00 00
640: 41 c6 04 08 00 mov BYTE PTR [r8+rcx*1],0x0
645: 48 01 f1 add rcx,rsi
648: 48 81 f9 00 20 00 00 cmp rcx,0x2000
64f: 7e ef jle 640 <main+0xa0>
Your assembly code reveals why the bandwidth DSB metric is very high (i.e., in 42.01% of all core cycles in which the DSB is active, the DSB delivers less than 4 uops). The issue seems to exist in the following loop:
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
This loop is aligned on a 16-byte boundary despite passing -falign-loops=32 to the compiler. Also, the last instruction crosses a 32-byte boundary, which means that it will be stored in a different cache set in the DSB. The DSB can only deliver uops to the IDQ from one set in the same cycle. So it will deliver add and cmp/je in one cycle and the second cmp/je in the next cycle. In both cycles, the DSB bandwidth is less than 4 uops.
However, the LSD is supposed to hide such limitations, but it seems that it's not active. The loop contains two jump instructions. The first one seems to check whether the size of the array (0x2001 bytes) has been reached and the second one seems to check whether a non-zero byte-wide element has been reached. A maximum trip count of 0x2001 gives ample time for the LSD to detect the loop and lock it down in the IDQ. On the other hand, if a non-zero element is likely to be found before the LSD detects the loop, then the uops will be delivered from either the DSB path or the MITE path. In this case, it seems that they are being delivered from the DSB path. And because the loop body crosses a 32-byte boundary, it takes 2 cycles to execute one iteration (compared to one cycle if the loop had been 32-byte aligned, since there are two jump execution ports on Broadwell). I think if you align this loop to 32 bytes, the bandwidth DSB metric will improve, not because the DSB will deliver 4 uops per cycle (it will deliver only 3 uops per cycle) but because it may take a smaller number of cycles to execute the loop.
Even if you somehow changed the code so that the uops get delivered from the LSD instead, you still cannot do better than 1 cycle per iteration, despite the fact that the LSD in Broadwell can deliver uops across loop iterations (in contrast to the DSB, I think). That's because you would hit another bottleneck: at most two jumps can be allocated in one cycle (see: Can the LSD issue uOPs from the next iteration of the detected loop?). So the bandwidth LSD metric would become larger while the bandwidth DSB metric would become smaller. This just changes the bottleneck, but does not improve performance (although it may improve power consumption). There is no way to improve the frontend bandwidth of this loop other than moving work from somewhere else into the loop.
For information on the LSD, see Why jnz requires 2 cycles to complete in an inner loop.

Why does ruby-prof list "Kernel#`" as a resource hog?

I'm using ruby-prof to figure out where my CPU time is going for a small 2D game engine I'm building in Ruby. Everything looks normal here aside from the main Kernel#` entry. The Ruby docs here would suggest that this is a function for getting the STDOUT of a command running in a subshell:
Measure Mode: wall_time
Thread ID: 7966920
Fiber ID: 16567620
Total: 7.415271
Sort by: self_time
%self total self wait child calls name
28.88 2.141 2.141 0.000 0.000 476 Kernel#`
10.72 1.488 0.795 0.000 0.693 1963500 Tile#draw
9.35 0.693 0.693 0.000 0.000 1963976 Gosu::Image#draw
6.67 7.323 0.495 0.000 6.828 476 Gosu::Window#_tick
1.38 0.102 0.102 0.000 0.000 2380 Gosu::Font#draw
0.26 4.579 0.019 0.000 4.560 62832 *Array#each
0.15 0.011 0.011 0.000 0.000 476 Gosu::Window#caption=
0.09 6.873 0.007 0.000 6.867 476 PlayState#draw
0.07 0.005 0.005 0.000 0.000 476 String#gsub
0.06 2.155 0.004 0.000 2.151 476 GameWindow#memory_usage
0.06 4.580 0.004 0.000 4.576 1904 Hash#each
0.04 0.003 0.003 0.000 0.000 476 String#chomp
0.04 0.038 0.003 0.000 0.035 476 Gosu::Window#protected_update
0.04 0.004 0.003 0.000 0.001 3167 Gosu::Window#button_down?
0.04 0.005 0.003 0.000 0.002 952 Enumerable#map
0.03 0.015 0.003 0.000 0.012 476 Player#update
0.03 4.596 0.002 0.000 4.593 476 <Module::Gosu>#scale
0.03 0.002 0.002 0.000 0.000 5236 Fixnum#to_s
0.03 7.326 0.002 0.000 7.324 476 Gosu::Window#tick
0.03 0.003 0.002 0.000 0.001 952 Player#coord_facing
0.03 4.598 0.002 0.000 4.597 476 <Module::Gosu>#translate
0.02 0.002 0.002 0.000 0.000 952 Array#reject
Any suggestions as to why this might be happening? I'm fairly confident that I'm not using it in my code - unless it's being called indirectly somehow. Not sure where to start looking for that sort of thing.
I've solved my problem. Though it wasn't exactly clear to me given the ruby documentation I linked in the question, the source of the problem is how ruby-prof categorizes the usage of the #{} shortcut, also known as 'string interpolation'. I had semi-intensive debugging logic being executed within these shortcuts.
Turning off my debugging text solves my problem.
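For anyone hitting the same thing, here is a minimal sketch of the kind of construct that causes it (the memory_usage name comes from the profile above; the exact code is an assumption):
# sketch: the interpolation below shells out via backticks every time it is
# evaluated, and ruby-prof reports each of those subshell calls as Kernel#`
def memory_usage
  `ps -o rss= -p #{Process.pid}`.to_i / 1024   # resident set size, in MB
end

debug_text = "memory: #{memory_usage} MB"      # the backtick call hides inside the #{}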

AWK selection of the specified columns

I have a file with a big number of columns, like:
ASN 1 | R ASN 1 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | 0.045 +/- 0.034 | -0.045 +/- 0.034 | 0.000 +/- 0.000 | 0.000 +/- 0.001
HID 2 | R HID 2 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | 0.001 +/- 0.002 | -0.001 +/- 0.002 | 0.000 +/- 0.000 | 0.000 +/- 0.001
PRO 3 | R PRO 3 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | 0.001 +/- 0.004 | -0.001 +/- 0.004 | 0.000 +/- 0.000 | -0.000 +/- 0.001
LYS 4 | R LYS 4 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | 0.182 +/- 0.073 | -0.176 +/- 0.072 | 0.000 +/- 0.000 | 0.005 +/- 0.003
MET 5 | R MET 5 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.004 +/- 0.004 | 0.006 +/- 0.004 | 0.000 +/- 0.000 | 0.002 +/- 0.001
From this file I need to extract only the first and last columns, removing the error value (+/- value) from the last column, to obtain something like:
ASN 1 0.000
It's strange that the command below works well, except that it cannot remove the error from the last column:
gawk -F'[|]' '{print $1, $NF}' $file
ASN 1 0.000 +/- 0.001
HID 2 -0.000 +/- 0.001
PRO 3 -0.000 +/- 0.001
LYS 4 0.000 +/- 0.001
MET 5 -0.000 +/- 0.001
GLU 6 -0.000 +/- 0.001
MET 7 0.000 +/- 0.001
ILE 8 0.000 +/- 0.001
LEU 9 0.001 +/- 0.001
Alternatively, when I replace it with
gawk -F'[|,+/-]' '{print $1, $(NF-1)}' $file
it doesn't print the column before the last (the value); instead the output looks like 1 subtracted from the last (error) column:
ASN 1 -0.999
HID 2 -0.999
PRO 3 -0.999
LYS 4 -0.997
What should I correct here to fix the script?
Your field-separator regex is wrong. Use it like this:
gawk -F'\\||\\+/-' 'NF>1{print $1, $(NF-1)}' file
ASN 1 0.000
HID 2 0.000
PRO 3 -0.000
LYS 4 0.005
MET 5 0.002
i.e. use double escaping for regex meta characters like | or +.
When you use -F'[|]', you are stating that | is a field separator. Using -F[|+/-] means you're using any of these characters as a field separator: |, +, /, or -.
You have two choices:
Use the default splitting on spaces, but then understand that you need to count your columns a bit differently, since +/- is now a column of its own. Below I print columns 1, 2, and the third from the last.
For example:
$ awk '{printf ("%-5.5s %2d %10.3f\n", $1, $2, $(NF - 2))}' test.txt
ASN 1 0.001
HID 2 0.001
PRO 3 0.001
LYS 4 0.003
MET 5 0.001
Or, you can use a fancier regular expression that separates fields on ' *\| *' or ' *\+/- *'. Note that I include the spaces in my regular expression field separator. This way, spaces are stripped from my columns:
Note my regular expression:
$ awk -F' *\| *| *\+/- *' \
'{printf ("%-5.5s %2d %10.3f\n", $1, $2, $NF)}' file
ASN 1 0.001
HID 2 0.001
PRO 3 0.001
LYS 4 0.003
MET 5 0.001
This works with standard awk on BSD and nawk on Solaris. gawk might do things a bit differently.
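Another option, as a sketch: keep | as the only separator and strip the "+/- error" part from the last field with sub(), which gives the same output as above:
gawk -F'|' '{sub(/ *\+\/-.*$/, "", $NF); print $1, $NF}' file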

Resources