Trouble understanding and comparing CPU performance metrics

When running toplev from pmu-tools on a piece of software (compiled with gcc -g -O3), I get this output:
FE Frontend_Bound: 37.21 +- 0.00 % Slots
BAD Bad_Speculation: 23.62 +- 0.00 % Slots
BE Backend_Bound: 7.33 +- 0.00 % Slots below
RET Retiring: 31.82 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 26.55 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 10.62 +- 0.00 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 23.72 +- 0.00 % Slots
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 1.59 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 5.73 +- 0.00 % Slots below
RET Retiring.Base: 31.54 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.28 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.70 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.62 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.04 +- 0.00 % Clocks_Estimated <==
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.57 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.76 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.36 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 26.79 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 6.53 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.03 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.37 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 2.46 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 0.22 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.01 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.53 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the following flag to gcc: -falign-loops=32, the binary now takes around 3.8 seconds to run, and this is the output from toplev:
FE Frontend_Bound: 17.47 +- 0.00 % Slots below
BAD Bad_Speculation: 28.55 +- 0.00 % Slots
BE Backend_Bound: 12.02 +- 0.00 % Slots
RET Retiring: 34.21 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 6.10 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Bandwidth: 11.31 +- 0.00 % Slots below
BAD Bad_Speculation.Branch_Mispredicts: 29.19 +- 0.00 % Slots <==
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 4.58 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 7.44 +- 0.00 % Slots below
RET Retiring.Base: 33.70 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.50 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.55 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.58 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.72 +- 0.00 % Clocks_Estimated below
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.17 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.40 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.68 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 42.01 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 7.60 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.04 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.70 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 0.71 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 1.85 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.02 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 17.38 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
Adding that flag improved the Frontend Latency, as we can see from the toplev output. I understand that with that flag the loops are now aligned to 32 bytes, so the DSB is hit more frequently when running tight loops (the code spends its time mostly in a couple of small loops).
However, I don't understand why the metric Frontend_Bound.Frontend_Bandwidth.DSB has gone up (the description for that metric is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, as the use of the DSB is precisely what I'm improving by adding the gcc flag.
PS: when running toplev I've used --no-multiplex, to minimize errors caused by multiplexing.
The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
606: eb 15 jmp 61d <main+0x7d>
608: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
60f: 00
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
624: 48 8d 0c 36 lea rcx,[rsi+rsi*1]
628: 48 81 f9 00 20 00 00 cmp rcx,0x2000
62f: 77 20 ja 651 <main+0xb1>
631: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
636: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
63d: 00 00 00
640: 41 c6 04 08 00 mov BYTE PTR [r8+rcx*1],0x0
645: 48 01 f1 add rcx,rsi
648: 48 81 f9 00 20 00 00 cmp rcx,0x2000
64f: 7e ef jle 640 <main+0xa0>

Your assembly code reveals why the bandwidth DSB metric is very high (i.e., in 42.01% of all core cycles in which the DSB is active, the DSB delivers less than 4 uops). The issue seems to exist in the following loop:
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
This loop is aligned on a 16-byte boundary despite passing -falign-loops=32 to the compiler. Also, the loop body crosses a 32-byte boundary, which means that its uops will be stored in two different cache sets in the DSB. The DSB can only deliver uops to the IDQ from one set in the same cycle, so it will deliver the add and the first cmp/je in one cycle and the second cmp/je in the next cycle. In both cycles, the DSB bandwidth is less than 4 uops.
However, the LSD is supposed to hide such limitations, but it seems that it's not active. The loop contains two jump instructions. The first one seems to check whether the size of the array (0x2001 bytes) has been reached and the second one seems to check whether a non-zero byte-wide element has been reached. A maximum trip count of 0x2001 gives ample time for the LSD to detect the loop and lock it down in the IDQ. On the other hand, if a non-zero element is likely to be found before the LSD detects the loop, then the uops will be delivered from either the DSB path or the MITE path. In this case, it seems that they are being delivered from the DSB path. And because the loop body crosses a 32-byte boundary, it takes 2 cycles to execute one iteration (compared to a minimum of one cycle if the loop had been 32-byte aligned, since there are two jump execution ports on Broadwell). I think if you align this loop to 32 bytes, the bandwidth DSB metric will improve, not because the DSB will deliver 4 uops per cycle (it will deliver only 3 uops per cycle) but because it may take a smaller number of cycles to execute the loop.
Even if you somehow changed the code so that the uops get delivered from the LSD instead, you still cannot do better than 1 cycle per iteration, despite the fact that the LSD in Broadwell can deliver uops across loop iterations (in contrast to the DSB, I think). That's because you will hit another bottleneck: at most two jumps can be allocated in one cycle (see: Can the LSD issue uOPs from the next iteration of the detected loop?). So the bandwidth LSD metric would become larger while the bandwidth DSB metric became smaller. This just changes the bottleneck, but does not improve performance (although it may improve power consumption). There is no way to improve the frontend bandwidth of this loop other than moving work from some other place into the loop.
For information on the LSD, see Why jnz requires 2 cycles to complete in an inner loop.
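For reference, the two hot loops look like the marking phase of a sieve. Below is a hypothetical C reconstruction from the disassembly above (the names and the starting value of rsi are guesses; the bounds 0x2001 and 0x2000 come from the cmp immediates), which may make the discussion easier to follow:

#include <stddef.h>

#define N 0x2001

void mark(unsigned char *flags)                /* flags lives in r8 */
{
    for (size_t i = 2; i != N; i++) {          /* 610/614/61b: add, cmp, je */
        if (flags[i] == 0)                     /* 61d: cmp BYTE PTR [r8+rsi*1],0x0 */
            continue;                          /* 622: je 610 */
        for (size_t j = 2 * i;                 /* 624: lea rcx,[rsi+rsi*1] */
             j <= 0x2000;                      /* 628/62f and 648/64f: cmp, ja/jle */
             j += i)                           /* 645: add rcx,rsi */
            flags[j] = 0;                      /* 640: mov BYTE PTR [r8+rcx*1],0x0 */
    }
}

The inner loop that clears flags[j] is the second loop in the disassembly; the outer scan over flags[i] is the 16-byte-aligned loop discussed in the answer above.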

Related

How to do proper performance testing and analysis of alacritty and zsh

I've been working with my setup lately and have been trying to determine where my 2.3s terminal load times have been coming from. I'm fairly new to linux performance testing in general but I have determined a few things.
The first thing I should mention is that terminal is a shell script with the following contents:
#!/bin/sh
WINIT_X11_SCALE_FACTOR=1.5 alacritty "$@"
The stats on launching the terminal program (alacritty) and its shell (zsh -l):
> perf stat -r 10 -d terminal -e $SHELL -slc exit
Performance counter stats for 'terminal -e /usr/bin/zsh -slc exit' (10 runs):
602.55 msec task-clock # 0.261 CPUs utilized ( +- 1.33% )
957 context-switches # 1.532 K/sec ( +- 0.42% )
92 cpu-migrations # 147.298 /sec ( +- 1.89% )
68,150 page-faults # 109.113 K/sec ( +- 0.13% )
2,188,445,151 cycles # 3.504 GHz ( +- 0.17% )
3,695,337,515 instructions # 1.70 insn per cycle ( +- 0.08% )
791,333,786 branches # 1.267 G/sec ( +- 0.06% )
14,007,258 branch-misses # 1.78% of all branches ( +- 0.09% )
10,893,173,535 slots # 17.441 G/sec ( +- 0.13% )
3,574,546,556 topdown-retiring # 30.5% Retiring ( +- 0.11% )
2,888,937,632 topdown-bad-spec # 24.0% Bad Speculation ( +- 0.41% )
3,125,577,758 topdown-fe-bound # 27.1% Frontend Bound ( +- 0.16% )
2,189,183,796 topdown-be-bound # 18.4% Backend Bound ( +- 0.47% )
924,852,782 L1-dcache-loads # 1.481 G/sec ( +- 0.07% )
38,308,478 L1-dcache-load-misses # 4.16% of all L1-dcache accesses ( +- 0.09% )
3,445,566 LLC-loads # 5.517 M/sec ( +- 0.20% )
725,990 LLC-load-misses # 20.97% of all LL-cache accesses ( +- 0.36% )
2.30683 +- 0.00331 seconds time elapsed ( +- 0.14% )
The stats on launching just the shell (zsh):
Performance counter stats for '/usr/bin/zsh -i -c exit' (10 runs):
1,548.56 msec task-clock # 0.987 CPUs utilized ( +- 3.28% )
525 context-switches # 323.233 /sec ( +- 21.17% )
16 cpu-migrations # 9.851 /sec ( +- 11.33% )
90,616 page-faults # 55.791 K/sec ( +- 2.63% )
6,559,830,564 cycles # 4.039 GHz ( +- 3.18% )
11,317,955,247 instructions # 1.68 insn per cycle ( +- 3.69% )
2,351,473,571 branches # 1.448 G/sec ( +- 3.46% )
46,539,165 branch-misses # 1.91% of all branches ( +- 1.31% )
32,783,001,655 slots # 20.184 G/sec ( +- 3.18% )
10,776,867,769 topdown-retiring # 32.5% Retiring ( +- 3.28% )
5,729,353,491 topdown-bad-spec # 18.2% Bad Speculation ( +- 6.90% )
11,083,567,578 topdown-fe-bound # 33.3% Frontend Bound ( +- 2.34% )
5,458,201,823 topdown-be-bound # 15.9% Backend Bound ( +- 4.51% )
3,180,211,376 L1-dcache-loads # 1.958 G/sec ( +- 3.10% )
126,282,947 L1-dcache-load-misses # 3.85% of all L1-dcache accesses ( +- 2.37% )
14,347,257 LLC-loads # 8.833 M/sec ( +- 1.48% )
2,386,047 LLC-load-misses # 16.33% of all LL-cache accesses ( +- 0.77% )
1.5682 +- 0.0550 seconds time elapsed ( +- 3.51% )
The stats on launching the shell (zsh) with zmodload zsh/zprof:
num calls time self name
-----------------------------------------------------------------------------------
1) 31 78.54 2.53 77.09% 50.07 1.62 49.14% antigen
2) 2 23.24 11.62 22.81% 15.93 7.96 15.63% compinit
3) 2 7.31 3.66 7.18% 7.31 3.66 7.18% compaudit
4) 1 8.27 8.27 8.12% 7.29 7.29 7.16% _autoenv_source
5) 1 6.93 6.93 6.80% 6.93 6.93 6.80% detect-clipboard
6) 1 5.18 5.18 5.08% 5.18 5.18 5.08% _autoenv_hash_pair
7) 1 2.49 2.49 2.45% 2.45 2.45 2.41% _zsh_highlight_load_highlighters
8) 2 1.01 0.51 0.99% 1.01 0.51 0.99% _autoenv_stack_entered_contains
9) 10 0.91 0.09 0.89% 0.91 0.09 0.89% add-zsh-hook
10) 1 0.94 0.94 0.92% 0.87 0.87 0.85% _autoenv_stack_entered_add
11) 1 0.85 0.85 0.84% 0.85 0.85 0.84% async_init
12) 1 0.49 0.49 0.49% 0.49 0.49 0.48% _zsh_highlight__function_callable_p
13) 1 0.45 0.45 0.44% 0.45 0.45 0.44% colors
14) 3 0.38 0.13 0.37% 0.35 0.12 0.35% add-zle-hook-widget
15) 6 0.34 0.06 0.34% 0.34 0.06 0.34% is-at-least
16) 2 15.14 7.57 14.86% 0.27 0.13 0.26% _autoenv_chpwd_handler
17) 1 5.46 5.46 5.36% 0.26 0.26 0.26% _autoenv_authorized_env_file
18) 1 0.23 0.23 0.22% 0.23 0.23 0.22% regexp-replace
19) 11 0.19 0.02 0.19% 0.19 0.02 0.19% _autoenv_debug
20) 2 0.10 0.05 0.10% 0.10 0.05 0.10% wrap_clipboard_widgets
21) 16 0.09 0.01 0.09% 0.09 0.01 0.09% compdef
22) 1 0.08 0.08 0.08% 0.08 0.08 0.08% (anon) [/home/nate-wilkins/.antigen/bundles/zsh-users/zsh-autosuggestions/zsh-autosuggestions.zsh:458]
23) 2 0.05 0.02 0.05% 0.05 0.02 0.05% bashcompinit
24) 1 0.06 0.06 0.06% 0.04 0.04 0.04% _autoenv_stack_entered_remove
25) 1 5.50 5.50 5.40% 0.03 0.03 0.03% _autoenv_check_authorized_env_file
26) 1 0.04 0.04 0.04% 0.03 0.03 0.03% complete
27) 1 0.88 0.88 0.87% 0.03 0.03 0.03% async
28) 1 0.03 0.03 0.02% 0.03 0.03 0.02% (anon) [/usr/share/zsh/functions/Misc/add-zle-hook-widget:28]
29) 2 0.01 0.00 0.01% 0.01 0.00 0.01% env_default
30) 1 0.01 0.01 0.01% 0.01 0.01 0.01% _zsh_highlight__is_function_p
31) 1 0.00 0.00 0.00% 0.00 0.00 0.00% _zsh_highlight_bind_widgets
Lastly I have a perf run with a corresponding flamegraph:
perf-run --out alacritty --command "terminal -e $SHELL -slc exit"
But I'm not sure how to interpret the flamegraph since it seems to have everything in it and not just the command that was run.
So my question is:
What is taking up the most time in my terminal setup and is there another approach I could use to better determine where the problem is coming from?

Time.time in Unity

I saw a video on how to move a cube the way the snake moves in a snake game.
In this video ( https://www.youtube.com/watch?v=aT2zNLSFQEk&list=PLLH3mUGkfFCVNs51eK8ftCAlI3hZQ95tC&index=11 ) he declares a float named lastMove with no value (zero by default), uses it in a condition, subtracts it from Time.time, and then assigns Time.time to lastMove.
My question is: what is the effect of lastMove in the condition when it has no initial value?
If I remove it from the if statement the game runs fast, but if it stays in the if statement time passes much more slowly.
What he does is continuously check whether time - lastMove is bigger than a given predefined interval (timeBetweenMoves). Time keeps increasing each frame while lastMove stays fixed, so at some point this condition becomes true. When it does, he updates lastMove with the value of time to "reset the loop", i.e. to make the difference smaller than the interval again. The point of doing this is to move only at a fixed interval (0.25 secs) instead of every frame. Like this:
interval = 0.25 (timeBetweenMoves)
time (secs) | lastMove | time - lastMove
-----------------------------------------
0.00 | 0 | 0
0.05 | 0 | 0.05
0.10 | 0 | 0.10
0.15 | 0 | 0.15
0.20 | 0 | 0.20
0.25 | 0 | 0.25
0.30 | 0 | 0.30 ---> bigger than interval: MOVE and set lastMove to this (0.30)
0.35 | 0.30 | 0.05
0.40 | 0.30 | 0.10
0.45 | 0.30 | 0.15
0.50 | 0.30 | 0.20
0.55 | 0.30 | 0.25
0.60 | 0.30 | 0.30 ---> bigger than interval: MOVE and set lastMove to time (0.60)
0.65 | 0.60 | 0.05
0.70 | 0.60 | 0.10
...
This is kind of a throttling.
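Here is a minimal C sketch of the same pattern (the video does this in C# inside Unity's Update(); the names timeBetweenMoves and lastMove are borrowed from it, and the 50 ms frame time is an assumption just for the demo):

#include <stdio.h>

int main(void)
{
    const double timeBetweenMoves = 0.25; /* the predefined interval */
    double time = 0.0;                    /* stands in for Time.time */
    double lastMove = 0.0;                /* declared with no value -> 0 */

    for (int frame = 1; frame <= 16; frame++) {
        time += 0.05;                     /* pretend every frame takes 50 ms */
        if (time - lastMove > timeBetweenMoves) {
            printf("t=%.2f: move\n", time);
            lastMove = time;              /* reset the window */
        }
    }
    return 0;
}

Running it prints a move at t=0.30, t=0.60, and so on: one move per interval, no matter how many frames happen in between.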

Understanding ruby-prof output

I ran ruby-prof on one of my programs. I'm trying to figure out what each field means. I'm guessing everything is CPU time (and not wall clock time), which is fantastic. I want to understand what the "---" stands for. Is there some sort of stack information in there? What does calls a/b mean?
Thread ID: 81980260
Total Time: 0.28
%total %self total self wait child calls Name
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 5/6 FrameParser#receive_data
100.00% 0.00% 0.28 0.00 0.00 0.28 6 FrameParser#read_frames
0.28 0.00 0.00 0.28 4/4 ChatServerClient#receive_frame
0.00 0.00 0.00 0.00 5/47 Fixnum#+
0.00 0.00 0.00 0.00 1/2 DebugServer#receive_frame
0.00 0.00 0.00 0.00 10/29 String#[]
0.00 0.00 0.00 0.00 10/21 <Class::Range>#allocate
0.00 0.00 0.00 0.00 10/71 String#index
--------------------------------------------------------------------------------
100.00% 0.00% 0.28 0.00 0.00 0.28 5 FrameParser#receive_data
0.28 0.00 0.00 0.28 5/6 FrameParser#read_frames
0.00 0.00 0.00 0.00 5/16 ActiveSupport::CoreExtensions::String::OutputSafety#add_with_safety
--------------------------------------------------------------------------------
0.28 0.00 0.00 0.28 4/4 FrameParser#read_frames
100.00% 0.00% 0.28 0.00 0.00 0.28 4 ChatServerClient#receive_frame
0.28 0.00 0.00 0.28 4/6 <Class::Lal>#safe_call
--------------------------------------------------------------------------------
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 1/6 DebugServer#receive_frame
0.28 0.00 0.00 0.28 4/6 ChatServerClient#receive_frame
100.00% 0.00% 0.28 0.00 0.00 0.28 6 <Class::Lal>#safe_call
0.21 0.00 0.00 0.21 2/4 ChatUserFunction#register
0.06 0.00 0.00 0.06 2/2 ChatUserFunction#packet
0.01 0.00 0.00 0.01 4/130 Class#new
0.00 0.00 0.00 0.00 1/1 DebugServer#profile_stop
0.00 0.00 0.00 0.00 1/33 String#==
0.00 0.00 0.00 0.00 1/6 <Class::Lal>#safe_call
0.00 0.00 0.00 0.00 5/5 JSON#parse
0.00 0.00 0.00 0.00 5/8 <Class::Log>#log
0.00 0.00 0.00 0.00 5/5 String#strip!
--------------------------------------------------------------------------------
Each section of the ruby-prof output is broken up into the examination of a particular function. For instance, look at the first section of your output. The read_frames method on FrameParser is the focus, and the section is basically saying the following:
100% of the execution time that was profiled was spent inside of FrameParser#read_frames.
FrameParser#read_frames was called 6 times.
5 out of the 6 calls to read_frames came from FrameParser#receive_data, and this accounted for 100% of the execution time (this is the line above the read_frames line).
The lines below the read_frames line (but within that first section) are all of the methods that FrameParser#read_frames calls (you should recognize them since this seems to be your code), how many of each method's total calls read_frames is responsible for (the a/b calls column), and how much time those calls took. They are ordered by which of them took up the most execution time. In your case, that is the receive_frame method on the ChatServerClient class.
You can then look down at the section focusing on receive_frame (two sections down, the one whose '100%' line is centered on receive_frame) and see how its performance is broken down. Each section is set up the same way, and usually the function call which took the most time is the focus of the next section down. ruby-prof continues doing this through the full call stack. You can go as deep as you want until you find the bottleneck you'd like to resolve.

Understanding Assembly

I've got some assembly of a sorting algorithm and I want to figure out how exactly it functions.
I'm a little confused on some of the instructions, particularly the cmp and jle instructions, so I'm looking for help. This assembly sorts an array of three elements.
0.00 : 4009f8: 48 8b 07 mov (%rdi),%rax
0.00 : 4009fb: 48 8b 57 08 mov 0x8(%rdi),%rdx
0.00 : 4009ff: 48 8b 4f 10 mov 0x10(%rdi),%rcx
0.00 : 400a03: 48 39 d0 cmp %rdx,%rax
0.00 : 400a06: 7e 2b jle 400a33 <b+0x3b>
0.00 : 400a08: 48 39 c8 cmp %rcx,%rax
0.00 : 400a0b: 7e 1a jle 400a27 <b+0x2f>
0.00 : 400a0d: 48 39 ca cmp %rcx,%rdx
0.00 : 400a10: 7e 0c jle 400a1e <b+0x26>
0.00 : 400a12: 48 89 0f mov %rcx,(%rdi)
0.00 : 400a15: 48 89 57 08 mov %rdx,0x8(%rdi)
0.00 : 400a19: 48 89 47 10 mov %rax,0x10(%rdi)
0.00 : 400a1d: c3 retq
0.00 : 400a1e: 48 89 17 mov %rdx,(%rdi)
0.00 : 400a21: 48 89 4f 08 mov %rcx,0x8(%rdi)
0.00 : 400a25: eb f2 jmp 400a19 <b+0x21>
0.00 : 400a27: 48 89 17 mov %rdx,(%rdi)
0.00 : 400a2a: 48 89 47 08 mov %rax,0x8(%rdi)
0.00 : 400a2e: 48 89 4f 10 mov %rcx,0x10(%rdi)
0.00 : 400a32: c3 retq
0.00 : 400a33: 48 39 ca cmp %rcx,%rdx
0.00 : 400a36: 7e 1d jle 400a55 <b+0x5d>
0.00 : 400a38: 48 39 c8 cmp %rcx,%rax
0.00 : 400a3b: 7e 0c jle 400a49 <b+0x51>
0.00 : 400a3d: 48 89 0f mov %rcx,(%rdi)
0.00 : 400a40: 48 89 47 08 mov %rax,0x8(%rdi)
0.00 : 400a44: 48 89 57 10 mov %rdx,0x10(%rdi)
0.00 : 400a48: c3 retq
0.00 : 400a49: 48 89 07 mov %rax,(%rdi)
0.00 : 400a4c: 48 89 4f 08 mov %rcx,0x8(%rdi)
0.00 : 400a50: 48 89 57 10 mov %rdx,0x10(%rdi)
0.00 : 400a54: c3 retq
0.00 : 400a55: 48 89 07 mov %rax,(%rdi)
0.00 : 400a58: 48 89 57 08 mov %rdx,0x8(%rdi)
0.00 : 400a5c: 48 89 4f 10 mov %rcx,0x10(%rdi)
0.00 : 400a60: c3 retq
0.00 : 400a61: 90 nop
If someone can walk me through it, it'd be very helpful. I kind of get confused around the operands like 0x8(%rdi) and the cmp and jle instructions. Thanks.
Here is what the instructions mean:
mov : move
cmp : compare
jle : jump if less or equal (branch)
ret : return from procedure
nop : no-op
%r** are registers. The 32-bit forms are %e** (e.g. %eax, %edx, ...); the %r** forms used here are their 64-bit counterparts.
As far as de-compiling the whole thing, that will take some more work.
See this: http://www.x86-64.org/documentation/assembly
It helps to replace the register names with proper names to trace the flow of data, and add branch labels for the control flow.
0.00 : 4009f8: 48 8b 07 mov (%argptr),%var1
0.00 : 4009fb: 48 8b 57 08 mov 0x8(%argptr),%var2
0.00 : 4009ff: 48 8b 4f 10 mov 0x10(%argptr),%var3
0.00 : 400a03: 48 39 d0 cmp %var2,%var1
0.00 : 400a06: 7e 2b jle #v1le2
0.00 : 400a08: 48 39 c8 cmp %var3,%var1
0.00 : 400a0b: 7e 1a jle #v1le3
0.00 : 400a0d: 48 39 ca cmp %var3,%var2
0.00 : 400a10: 7e 0c jle #v2le3
# Now we know that var1 > var2 and var1 > var3 and var2 > var3 (remember the AT&T operand order: cmp %b,%a sets flags from a - b). Write them to memory in ascending order.
etc
0.00 : 4009f8: 48 8b 07 mov (%rdi),%rax
The register RDI contains the address of your array in memory. The line above copies the first element of your array into the RAX register. Since each element is 0x8 bytes wide, the following two lines:
0.00 : 4009fb: 48 8b 57 08 mov 0x8(%rdi),%rdx
0.00 : 4009ff: 48 8b 4f 10 mov 0x10(%rdi),%rcx
will copy the second and third elements of your array into the RDX and RCX registers respectively. Now we need to start comparing the values to see where we need to swap.
0.00 : 400a03: 48 39 d0 cmp %rdx,%rax
0.00 : 400a06: 7e 2b jle 400a33 <b+0x3b>
The cmp will compare the value of RAX with the value of RDX (essentially array[0] against array[1]; in AT&T syntax the flags are set from the second operand minus the first). If RAX is less than or equal to RDX then this will jump program execution directly to line 400a33; the fall-through path therefore handles the case array[0] > array[1]. And so it continues, comparing the values.
It appears that the code sorts the three elements into ascending order (using signed comparisons). Very roughly, the C code will probably look something like this:
void sort3(long *array)                  /* array is passed in %rdi */
{
    long rax = array[0];
    long rdx = array[1];
    long rcx = array[2];

    if (rax > rdx)                       /* jle 400a33 not taken */
    {
        if (rax > rcx)                   /* jle 400a27 not taken */
        {
            if (rdx > rcx)               /* jle 400a1e not taken */
            {
                /* rax > rdx > rcx */
                array[0] = rcx;
                array[1] = rdx;
                array[2] = rax;          /* store at 400a19, shared below */
            }
            else
            {
                /* rax > rcx >= rdx */
                array[0] = rdx;
                array[1] = rcx;
                array[2] = rax;          /* reached via jmp 400a19 */
            }
        }
        else
        {
            /* rcx >= rax > rdx */
            array[0] = rdx;
            array[1] = rax;
            array[2] = rcx;
        }
    }
    else                                 /* 400a33: rdx >= rax */
    {
        if (rdx > rcx)                   /* jle 400a55 not taken */
        {
            if (rax > rcx)               /* jle 400a49 not taken */
            {
                /* rdx >= rax > rcx */
                array[0] = rcx;
                array[1] = rax;
                array[2] = rdx;
            }
            else
            {
                /* rdx > rcx >= rax */
                array[0] = rax;
                array[1] = rcx;
                array[2] = rdx;
            }
        }
        else
        {
            /* rcx >= rdx >= rax */
            array[0] = rax;
            array[1] = rdx;
            array[2] = rcx;
        }
    }
}
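A tiny driver (hypothetical, just to sanity-check the reconstruction) confirms the ascending order:

#include <stdio.h>

void sort3(long *array);  /* the reconstruction above */

int main(void)
{
    long a[3] = {3, 1, 2};
    sort3(a);
    printf("%ld %ld %ld\n", a[0], a[1], a[2]);  /* prints: 1 2 3 */
    return 0;
}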

Extract individual column from a HIVE table

Below is a select query from a HIVE table:
select * from test_aviation limit 5;
OK
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0850 -10.00 0.00 0.00 -1 0900-0959 15.00 0905 1202 9.00 1230 1211 -19.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 381.00 357.00 1.00 2475.00 10
2015 1 1 3 6 2015-01-03 AA 19805 AA N788AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 15.00 0908 1138 13.00 1230 1151 -39.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 358.00 330.00 1.00 2475.00 10
2015 1 1 4 7 2015-01-04 AA 19805 AA N791AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 14.00 0907 1159 19.00 1230 1218 -12.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 385.00 352.00 1.00 2475.00 10
2015 1 1 5 1 2015-01-05 AA 19805 AA N783AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 27.00 0920 1158 24.00 1230 1222 -8.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 389.00 338.00 1.00 2475.00 10
Time taken: 0.067 seconds, Fetched: 5 row(s)
Structure of HIVE table
hive> describe test_aviation;
OK
col_value string
Time taken: 0.221 seconds, Fetched: 1 row(s)
I want to segregate the entire table into different columns. I have written the query below to extract the 12th column:
SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 12) from test_aviation;
Output:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437067221195_0008, Tracking URL = http://localhost:8088/proxy/application_1437067221195_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1437067221195_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-17 02:46:56,215 Stage-1 map = 0%, reduce = 0%
2015-07-17 02:47:27,650 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1437067221195_0008 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://localhost:8088/proxy/application_1437067221195_0008/
Examining task ID: task_1437067221195_0008_m_000000 (and more) from job job_1437067221195_0008
Task with the most failures(4):
-----
Task ID:
task_1437067221195_0008_m_000000
URL:
http://localhost:8088/taskdetails.jsp?jobid=job_1437067221195_0008&tipid=task_1437067221195_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract#4def4616 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10 :java.lang.String, ^(?:([^,]*),?){1}:java.lang.String, 12:java.lang.Integer} of size 3
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1243)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:166)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:79)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1219)
... 18 more
Caused by: java.lang.IndexOutOfBoundsException: No group 12
at java.util.regex.Matcher.group(Matcher.java:487)
at org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(UDFRegExpExtract.java:56)
... 23 more
Please help me to extract different columns from a HIVE table.
Try this, assuming you have space delimiters:
select split(col_value,' ')[11] as column_12 from test_aviation;
Use '\\t' if tab-delimited, '\\|' for pipe, ':' for colon, and so on. In your case, the row shown in the error message is separated by \t characters, so the tab variant is the one that should apply:
select split(col_value,'\\t')[11] as column_12 from test_aviation;
