I'm developing (NASM + GCC, targeting ELF64) a PoC that uses a Spectre gadget that measures the time to access a set of cache lines (FLUSH+RELOAD).
How can I make a reliable Spectre gadget?
I believe I understand the theory behind the FLUSH+RELOAD technique, but in practice, despite some noise, I'm unable to produce a working PoC.
Since I'm using the timestamp counter and the loads are very regular, I use this script to disable the prefetchers and Turbo Boost, and to fix/stabilize the CPU frequency:
#!/bin/bash
sudo modprobe msr
#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089
#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf
#Set performance governor
sudo cpupower frequency-set -g performance
#Minimum freq
sudo cpupower frequency-set -d 2.2GHz
#Maximum freq
sudo cpupower frequency-set -u 2.2GHz
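If it helps to double-check that those writes actually took effect, the values can be read back; a minimal sketch in the same vein, assuming rdmsr from the same msr-tools package as wrmsr and the MSR addresses used above:
sudo modprobe msr
#Turbo: IA32_MISC_ENABLE (0x1a0) should come back with bit 38 set
sudo rdmsr -a 0x1a0
#Prefetchers: the low 4 bits of 0x1a4 should read back as f
sudo rdmsr -a 0x1a4
#Frequency policy: both limits should report 2.2 GHz
cpupower frequency-info --policy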
I have a contiguous buffer, aligned on 4KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
I use this function to flush the 256 lines.
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush from being reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush from being reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflushes are ordered with respect to fences ..
;.. and lfence is ordered (locally) with respect to all instructions
ret
The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.
Then I use this function to profile the accesses.
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
An MCVE is given in the appendix and a repository is available to clone.
When assembled with GAP set to 0, linked, and executed with taskset -c 0, the cycles necessary to fetch each line are shown below.
Only 64 lines are loaded from memory.
The output is stable across different runs.
If I set GAP to 1, only 32 lines are fetched from memory. Of course 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?
If a store is executed before the profiling (but after the flush) to one of the first 64 lines, the output changes to this
Any store to the other lines gives the first type of output.
I suspect the math is broken somewhere, but I need another couple of eyes to find out where.
EDIT
Hadi Brais pointed out a misuse of a volatile register; after fixing that, the output is now inconsistent.
I mostly see runs where the timings are low (~50 cycles) and sometimes runs where the timings are higher (~130 cycles).
I don't know where the 130-cycle figure comes from (too low for memory, too high for the cache?).
Code is fixed in the MCVE (and the repository).
If a store to any of the first lines is executed before the profiling, no change is reflected in the output.
APPENDIX - MCVE
BITS 64
DEFAULT REL
GLOBAL main
EXTERN printf
EXTERN exit
;Space between lines in the buffer
%define GAP 0
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
SECTION .data
timings_data: TIMES 256 dd 0
strNewLine db `\n0x%02x: `, 0
strHalfLine db " ", 0
strTiming db `\e[48;5;16`,
.importance db "0",
db `m\e[38;5;15m%03u\e[0m `, 0
strEnd db `\n\n`, 0
SECTION .text
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush from being reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush from being reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflushes are ordered with respect to fences ..
;.. and lfence is ordered (locally) with respect to all instructions
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;SHOW THE RESULTS
;
;
show_results:
lea rbx, [timings_data] ;Pointer to the timings
xor r12, r12 ;Counter (up to 256)
.print_line:
;Format the output
xor eax, eax
mov esi, r12d
lea rdi, [strNewLine] ;Setup for a call to printf
test r12d, 0fh
jz .print ;Test if counter is a multiple of 16
lea rdi, [strHalfLine] ;Setup for a call to printf
test r12d, 07h ;Test if counter is a multiple of 8
jz .print
.print_timing:
;Print
mov esi, DWORD [rbx] ;Timing value
;Compute the color
mov r10d, 60 ;Used to compute the color
mov eax, esi
xor edx, edx
div r10d ;eax = Timing value / 60
;Update the color
add al, '0'
mov edx, '5'
cmp eax, edx
cmova eax, edx
mov BYTE [strTiming.importance], al
xor eax, eax
lea rdi, [strTiming]
call printf WRT ..plt ;Print a 3-digits number
;Advance the loop
inc r12d ;Increment the counter
add rbx, 4 ;Move to the next timing
cmp r12d, 256
jb .print_line ;Advance the loop
xor eax, eax
lea rdi, [strEnd]
call printf WRT ..plt ;Print a new line
ret
.print:
call printf WRT ..plt ;Print a string
jmp .print_timing
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;E N T R Y P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
main:
;Flush all the lines of the buffer
call flush_all
;Test the access times
call profile
;Show the results
call show_results
;Exit
xor edi, edi
call exit WRT ..plt
The buffer is allocated from the bss section, so when the program is loaded, the OS maps all of the buffer's pages to the same copy-on-write (CoW) physical page. After flushing all of the lines, only the accesses to the first 64 lines in the virtual address space miss in all cache levels (1), because all (2) later accesses are to the same 4K page. That's why the latencies of the first 64 accesses fall in the range of the main memory latency and the latencies of all later accesses are equal to the L1 hit latency (3) when GAP is zero.
When GAP is 1, every other line of the same physical page is accessed and so the number of main memory accesses (L3 misses) is 32 (half of 64). That is, the first 32 latencies will be in the range of the main memory latency and all later latencies will be L1 hits. Similarly, when GAP is 63, all accesses are to the same line. Therefore, only the first access will miss all caches.
The solution is to change mov eax, [rdi] in flush_all to mov dword [rdi], 0 to ensure that the buffer is allocated in unique physical pages. (The lfence instructions in flush_all can be removed because the Intel manual states that clflush cannot be reordered with writes (4).) This guarantees that, after initializing and flushing all lines, all accesses will miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).
You can refer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop? for another example where CoW pages can be deceiving.
In the previous version of this answer I suggested removing the call to flush_all and using a GAP value of 63. With these changes, all of the access latencies appeared to be very high and I incorrectly concluded that all of the accesses were missing all cache levels. Like I said above, with a GAP value of 63 all of the accesses go to the same cache line, which is actually resident in the L1 cache. However, the reason that all of the latencies were high is that every access was to a different virtual page, and the TLB didn't have mappings for any of these virtual pages (to the same physical page) because, with the call to flush_all removed, none of the virtual pages had been touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.
I also incorrectly claimed in the previous version of this answer that there is an L3 prefetching logic that cannot be disabled through MSR 0x1A4. If a particular prefetcher is turned off by setting its flag in MSR 0x1A4, then it does fully get switched off. Also there are no data prefetchers other than the ones documented by Intel.
Footnotes:
(1) If you don't disable the DCU IP prefetcher, it will actually prefetch back all the lines into the L1 after flushing them, so all accesses will still hit in the L1.
(2) In rare cases, the execution of interrupt handlers or scheduling other threads on the same core may cause some of the lines to be evicted from the L1 and potentially other levels of the cache hierarchy.
(3) Remember that you need to subtract the overhead of the rdtscp instructions. Note that the measurement method you used actually doesn't enable you to reliably distinguish between an L1 hit and an L2 hit. See: Memory latency measurement with time stamp counter.
(4) The Intel manual doesn't seem to specify whether clflush is ordered with reads, but it appears to me that it is.
Related
In Jonesforth, a dictionary entry is laid out as follows:
<--- DICTIONARY ENTRY (HEADER) ----------------------->
+------------------------+--------+---------- - - - - +----------- - - - -
| LINK POINTER | LENGTH/| NAME | DEFINITION
| | FLAGS | |
+--- (4 bytes) ----------+- byte -+- n bytes - - - - +----------- - - - -
We can take a peek at one of these entries using GDB. (See this question for details on using GDB with Jonesforth.)
Let's display the first 16 bytes of the dictionary entry for SWAP as characters:
>>> x/16cb &name_SWAP
0x105cc: -68 '\274' 5 '\005' 1 '\001' 0 '\000' 4 '\004' 83 'S' 87 'W' 65 'A'
0x105d4: 80 'P' 0 '\000' 0 '\000' 0 '\000' 43 '+' 0 '\000' 1 '\001' 0 '\000'
You can kind of see what's going on here.
The first four bytes are the pointer to the previous word in the dictionary:
-68 '\274' 5 '\005' 1 '\001' 0 '\000'
Then comes the length of the name:
4 '\004'
Then we see the characters of the word name, "SWAP":
83 'S' 87 'W' 65 'A' 80 'P'
And finally some padding to align on a 32-bit boundary:
0 '\000' 0 '\000' 0 '\000'
It would be nice if there was a way to format the word entry in a nicer manner.
If we do the following:
>>> x/1xw &name_SWAP
0x105cc: 0x000105bc
we note that name_SWAP is at 0x105cc.
Let's use GDB's printf to display the word entry:
>>> printf "link: %#010x name length: %i name: %s\n", *(0x105cc), (char)*(0x105cc+4), (0x105cc+5)
link: 0x000105bc name length: 4 name: SWAP
OK, that's not bad! We see the link, the name length, and name, all nicely displayed and labeled.
The downside here is that I have to use the explicit address in the call to printf:
printf "link: %#010x name length: %i name: %s\n", *(0x105cc), (char)*(0x105cc+4), (0x105cc+5)
Ideally, I'd just be able to say something like:
show_forth_word name_SWAP
and it'd display the above.
What's the best way to go about this? Is this doable with a GDB user-defined command? Or is it something more appropriate for the GDB Python interface?
My question is, what's the best way to go about this?
It depends on whether GDB knows about the type of name_SWAP. If it does, the Python pretty-printer is the way to go.
If it doesn't, something as simple as a user-defined command is likely easier. Assuming 32-bit mode:
define print_key
set var $v = (char*)$arg0
printf "link: %#010x name length: %i name: %s\n", *((char**)$v), *($v+4), ($v+5)
end
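With that definition loaded (for example from your .gdbinit), invoking it on name_SWAP should print something like the hand-built output from the question:
>>> print_key &name_SWAP
link: 0x000105bc name length: 4 name: SWAP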
I just used nm, the good old Unix command.
I am seeing some inconsistencies when using hexdump and xxd. When I run the following command:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| xxd
it returns the following results:
00000000: c2a4 2dc2 9dc3 bec2 8fc2 9351 5d0d 5f60 ..-........Q]._`
00000010: c28a 5760 44c3 8e4c 61c3 a61e ..W`D..La...
Note the "c2" characters. This also happens with I run xxd -p
When I run the same command except with hexdump -C:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump -C
I get the same results (including the "c2" characters):
00000000 c2 a4 2d c2 9d c3 be c2 8f c2 93 51 5d 0d 5f 60 |..-........Q]._`|
00000010 c2 8a 57 60 44 c3 8e 4c 61 c3 a6 1e |..W`D..La...|
However, when I run hexdump with no arguments:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump
I get the following [correct] results:
0000000 a4c2 c22d c39d c2be c28f 5193 0d5d 605f
0000010 8ac2 6057 c344 4c8e c361 1ea6
For the purpose of this script, I'd rather use xxd as opposed to hexdump. Thoughts?
The problem that you observe is due to UTF-8 encoding and little-endianness.
First, note that when you try to print any Unicode character in AWK, like 0xA4 (CURRENCY SIGN), it actually produces two bytes of output, like the two bytes 0xC2 0xA4 that you see in your output:
$ echo 1 | awk 'BEGIN { printf("%c", 0xA4) }' | hexdump -C
Output:
00000000 c2 a4 |..|
00000002
This holds for any character code above 0x7F, and it is due to UTF-8 encoding, which is probably the encoding set in your locale. (Note: some AWK implementations will behave differently for the above code.)
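As a quick sanity check (assuming GNU awk; as noted, other implementations may behave differently), forcing the C locale should make the same one-liner emit the single raw byte instead of its UTF-8 encoding:
$ echo 1 | LC_ALL=C awk 'BEGIN { printf("%c", 0xA4) }' | hexdump -C
Output:
00000000  a4                                                |.|
00000001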
Secondly, when you use hexdump without the -C argument, it displays each pair of bytes in swapped order because of the little-endianness of your machine. This is because each pair of bytes is treated as a single 16-bit word rather than each byte being treated separately, as the xxd and hexdump -C commands do. So the xxd output that you get is actually the correct byte-for-byte representation of the input.
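You can see the 16-bit grouping directly by feeding two distinct ASCII bytes (0x41 0x42) to each tool; trimming the spacing and the trailing offset lines, the output looks roughly like:
$ printf 'AB' | hexdump
0000000 4241
$ printf 'AB' | hexdump -C
00000000  41 42  |AB|
$ printf 'AB' | xxd
00000000: 4142  AB
Plain hexdump prints one 16-bit little-endian word (low byte first), while hexdump -C and xxd print the bytes in file order.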
Thirdly, if you want to produce the precise byte string that is encoded in the hexadecimal string that you are feeding to sed, you can use this Python solution:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" | sed 's/\(..\)/0x\1,/g' | python3 -c "import sys;[open('tmp','wb').write(bytearray(eval('[' + line + ']'))) for line in sys.stdin]" && cat tmp | xxd
Output:
00000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
00000010: 4c61 e61e La..
Why not use xxd with -r and -p?
echo a42d9dfe8f93515d0d5f608a576044ce4c61e61e | xxd -r -p | xxd
output
0000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
0000010: 4c61 e61e La..
I'd like to generate a lot of integers between 0 and 1 using bash.
I tried shuf but the generation is very slow. Is there another way to generate numbers?
This will output an infinite stream of bytes, written in binary and separated by a space:
cat /dev/urandom | xxd -b | cut -d" " -f 2-7 | tr "\n" " "
As an example:
10100010 10001101 10101110 11111000 10011001 01111011 11001010 00011010 11101001 01111101 10100111 00111011 10100110 01010110 11101110 01000011 00101011 10111000 01010110 10011101 01000011 00000010 10100001 11000110 11101100 11001011 10011100 10010001 01000111 01000010 01001011 11001101 11000111 11110111 00101011 00111011 10110000 01110101 01001111 01101000 01100000 11011101 11111111 11110001 10001011 11100001 11100110 10101100 11011001 11010100 10011010 00010001 00111001 01011010 00100101 00100100 00000101 10101010 00001011 10101101 11000001 10001111 10010111 01000111 11011000 01111011 10010110 00111100 11010000 11110000 11111011 00000110 00011011 11110110 00011011 11000111 11101100 11111001 10000110 11011101 01000000 00010000 00111111 11111011 01001101 10001001 00000010 10010000 00000001 10010101 11001011 00001101 00101110 01010101 11110101 10111011 01011100 00110111 10001001 00100100 01111001 01101101 10011011 00100001 01101101 01001111 01101000 00100001 10100011 00011000 01000001 00100100 10001101 10110110 11111000 01110111 10110111 11001000 00101000 01101000 01001100 10000001 11011000 11101110 11001010 10001101 00010011^C
If you don't want spaces between bytes (thanks @Chris):
cat /dev/urandom | xxd -b | head | cut -d" " -f 2-7 | tr -d "\n "
1000110001000101011111000010011011011111111001000000011000000100111101000001110110011011000000001101111111011000000100101001001110110001111000010100100100010110110000100111111110111011111100101000011000010010111010010001001001111000010101000110010010011011110000000011100110000000100111010001110000000011001011010101111001
tr -dc '01' < /dev/urandom is a quick and dirty way to do this.
If you're on OSX, tr can work a little weird, so you can use perl instead: perl -pe 'tr/01//dc' < /dev/urandom
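Either way, if you need a fixed number of bits rather than an endless stream, head can cap the output, e.g. for 64 bits:
tr -dc '01' < /dev/urandom | head -c 64; echo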
Just for fun --
A native-bash function to print a specified number of random bits, extracted from the smallest possible number of evaluations of $RANDOM:
randbits() {
local x x_bits num_bits
num_bits=$1
while (( num_bits > 0 )); do
x=$RANDOM
x_bits="$(( x % 2 ))$(( x / 2 % 2 ))$(( x / 4 % 2 ))$(( x / 8 % 2 ))$(( x / 16 % 2 ))$(( x / 32 % 2 ))$(( x / 64 % 2 ))$(( x / 128 % 2 ))$(( x / 256 % 2 ))$(( x / 512 % 2 ))$(( x / 1024 % 2 ))$(( x / 2048 % 2 ))$(( x / 4096 % 2))$(( x / 8192 % 2 ))$(( x / 16384 % 2 ))"
if (( ${#x_bits} < $num_bits )); then
printf '%s' "$x_bits"
(( num_bits -= ${#x_bits} ))
else
printf '%s' "${x_bits:0:num_bits}"
break
fi
done
printf '\n'
}
Usage:
$ randbits 64
1011010001010011010110010110101010101010101011101100011101010010
Because this uses $RANDOM, its behavior can be made reproducible by assigning a seed value to $RANDOM before invoking it. This can be handy if you want to be able to reproduce bugs in software that uses "random" inputs.
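For example, a small sketch (no output shown, since it depends on the seed and on your bash version's generator):
RANDOM=42      #seed bash's generator
randbits 32
RANDOM=42      #re-seed with the same value ...
randbits 32    #... and the same 32 bits are printed again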
Since the question asks for integers between 0 and 1, there is this extremely random and very fast method. A good one-liner for sure:
echo "0.$(printf $(date +'%N') | md5sum | tr -d '[:alpha:][:punct:]')"
This command will give you an output similar to this when thrown inside a for loop with 10 iterations:
0.97238535471032972041395
0.8642459339189067551494
0.18109959700829495487820
0.39135471514800072505703651
0.624084503017958530984255
0.41997456791539740171
0.689027289676627803
0.22698852059605560195614
0.037745437519184791498537
0.428629619193662260133
And if you need to print random strings of 1's and 0's, as others have assumed, you can make a slight change to the command like this:
printf $(date +'%N') | sha512sum | tr -d '[2-9][:alpha:][:punct:]'
Which will yield an output of random 0's and 1's similar to this when thrown into a for loop with 10 iterations:
011101001110
001110011011
0010100010111111
0000001101101001111011111111
1110101100
00010110100
1100101101110010
101100110101100
1100010100
0000111101100010001001
To my knowledge, and from what I have found online, this is about as close to true randomness as we can get in bash. I have even made a dice game (where the die has 10 sides, 0-9) to test the randomness, using this method to generate a single number from 0 to 9. Out of 100 throws, each side lands almost exactly 10 times; with a larger number of throws each side's count stays within roughly 10% of its expected share (around 890-1100 hits per side over 10,000 throws), and the distribution doesn't change much beyond that. So you can be fairly confident that this method is well suited to the job, at least among bash tools for generating pseudo-random numbers.
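If you want to reproduce that dice check, here is a rough, untested sketch of it (assuming GNU date for %N and bash 4; it tallies the first surviving digit of each hash as the die roll):
#!/bin/bash
declare -A count
for ((i = 0; i < 1000; i++)); do
    d=$(printf "$(date +'%N')" | md5sum | tr -d '[:alpha:][:punct:][:space:]' | cut -c1)
    [ -n "$d" ] || continue   #vanishingly unlikely: the hash contained no digits
    count[$d]=$(( ${count[$d]:-0} + 1 ))
done
for d in {0..9}; do
    echo "$d: ${count[$d]:-0}"
done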
And if you need just an absolute mind-blowingly ridiculous amount of randomness, the simple md5sum checksum command can be compounded upon itself many, many times and still be very fast. As an example:
printf $(date +'%N') | md5sum | md5sum | md5sum | tr -d '[:punct:][:space:]'
This takes a not-so-random number, obtained by printing the date command's nanosecond field, and pipes it into md5sum. That md5 hash is then piped into md5sum again, and "that" hash is sent through md5sum one last time. The output is a thoroughly scrambled hash that you can post-process with tools like awk, sed, grep, and tr to control what gets printed.
Hope this helps.
What difference does it make to use a NOP instead of a stall?
Both seem to do the same task in the case of pipelining. I can't understand the difference.
I think you've got your terminology confused.
A stall is injected into the pipeline by the processor to resolve data hazards (situations where the data required to process an instruction is not yet available). A NOP is just an instruction with no side effect.
Stalls
Recall the 5 pipeline stage classic RISC pipeline:
IF - Instruction Fetch (Fetch the next instruction from memory)
ID - Instruction Decode (Figure out which instruction this is and what the operands are)
EX - Execute (Perform the action)
MEM - Memory Access (Store or read from memory)
WB - Write back (Write a result back to a register)
Consider the code snippet:
add $t0, $t1, $t1
sub $t2, $t0, $t0
From here it is obvious that the second instruction relies on the result of the first. This is a data hazard: Read After Write (RAW); a true dependency.
The sub requires the value of the add during its EX phase, but the add will only be in its MEM phase - the value will not be available until the WB phase:
+------------------------------+----+----+----+-----+----+---+---+---+---+
| | CPU Cycles |
+------------------------------+----+----+----+-----+----+---+---+---+---+
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+------------------------------------------------------------------------+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB | | | | |
| 1 | sub $t2, $t0, $t0 | | IF | ID | EX | | | | | |
+---------+--------------------+----+----+----+-----+----+---+---+---+---+
One solution to this problem is for the processor to insert stalls or bubble the pipeline until the data is available.
+------------------------------+----+----+----+-----+----+----+-----+---+----+
| | CPU Cycles |
+------------------------------+----+----+----+-----+----+----+-----+----+---+
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+----------------------------------------------------------------------------+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB | | | | |
| 1 | sub $t2, $t0, $t0 | | IF | ID | S | S | EX | MEM | WB | |
+----------+-------------------+----+----+----+-----+----+---+---+---+-------+
NOPs
A NOP is an instruction that does nothing (has no side effect). MIPS assemblers often support a nop instruction, but in MIPS it is equivalent to sll $zero, $zero, 0.
This instruction will take up all 5 stages of the pipeline. It is most commonly used to fill the branch delay slot of jumps or branches when there is nothing else useful that can be done in that slot.
j label
nop # nothing useful to put here
If you are using a MIPS simulator you may need to enable branch delay slot simulation to see this. (For example, in spim use the -delayed_branches argument)
We should not use a NOP in place of a stall, or vice versa.
We use a stall when a dependency hazard forces a particular pipeline stage to wait until the data it needs becomes available. If we used a NOP instead of stalling, the instruction would just pass through that stage without doing anything; once the required data did become available, we would have to restart the instruction from the beginning, which increases the average CPI of the processor and hurts performance. Also, in some cases the data required by that instruction might be modified by another instruction before the restart, resulting in faulty execution.
The same applies if we use a stall in place of a NOP.
Whenever something non-maskable like a divide-by-zero exception occurs in the execute stage, the instruction must pass through the remaining stages without changing the state of the processor. Here we use NOPs so that the remaining pipeline stages complete without any change to the processor state (such as writing into a register or memory a bogus value produced by the excepting instruction).
Here we cannot use a stall, because the next instruction would wait for the stall to complete, and the stall would never complete since the event is non-maskable (the user cannot control these kinds of instructions), so the pipeline would deadlock.
I have 2 files, each of which I currently manipulate in awk:
======================= File 1: ===================
0x0002 RUNNING EXISTS foo 253 65535
0x0003 RUNNING EXISTS foo 252 5
0x0004 RUNNING EXISTS foo 251 3
I'm interested in the first field and the last 2.
Field 1: vdisk (in hex). The last two fields are the possible cdisks for each vdisk; at least 1 must exist. The values are decimal.
If the number "65535" appears, it means that the 2nd cdisk is non-existent.
I use this awk to display a user friendly table:
awk 'BEGIN {print "vdisk cdisk Mr_cdisk"}
{
if ( $3 ~ /EXISTS|THIS_AGENT_ONLINE/ ) {
sub("65535", "N/A")
printf "%-11s %-6s %s\n",$1,$(NF-1),$(NF)
}
}' ${FILE}
This will produce the following table:
vdisk cdisk Mr_cdisk
0x0002 253 N/A
0x0003 252 5
0x0004 1 3
======================= File 2: ===================
0x0000 Cmp cli Foo 0 SOME 0 0x0 0x0 0x0
0x0001 Cmp own Foo 1 NONE 0 0x0 0x0 0x0
0x0002 Cmp cli Foo 0 SOME 0 0x0 0x1 0x0
0x0003 Cmp own Foo 0 NONE 0 0x0 0x0 0x1
0x0004 Cmp cli Foo 0 SOME 0 0x0 0x0 0x0
0x0005 Cmp own Foo 1 NONE 0 0x1 0x0 0x0
I'm interested in the "Cmp own" lines, in which the first field is the Cdisk (in hex). The 5th field from the end (just before the SOME/NONE text), is the instance number. It's either 0 or 1.
I use this awk to display a user friendly table:
awk 'BEGIN {print "cdisk(hex) RACE_Instance"}
/Cmp own/ {
printf "%-11s %-10s\n",$1,$(NF-5)
}' ${FILE};
This will produce the following table:
cdisk(hex) Instance
0x0001 1
0x0003 0
0x0005 1
++++++++++++++++++++++++++++++++++++++
What I would like is to display a merged table, preferably directly from the original files.
It should spread the data from the first file over 2 lines (if there's more than 1 cdisk); this will be the base for the merge. Then it should print the instance number, if one exists for this cdisk.
vdisk(hex) cdisk(hex) Instance
0x0002 0x00fd N/A
0x0003 0x00fc N/A
0x0003 0x0005 1
0x0004 0x0001 0
0x0004 0x0003 1
I would definitely prefer a solution with awk. :)
Thanks!
EDIT: added some more info and correction to one data table.
EDIT2: Simplified input
I couldn't figure out what the mapping is from your 2 input files to your output but this should point you in the right direction:
$ cat tst.awk
NR==FNR {
v2c[$1] = sprintf("0x%04x",$5)
v2m[$1] = ( $6==65535 ? "N/A" : sprintf("0x%04x",$6) )
next
}
$1 in v2c {
print $1, v2c[$1], $5
print $1, v2m[$1], $5
}
$
$ awk -f tst.awk file1 file2
0x0002 0x00fd 0
0x0002 N/A 0
0x0003 0x00fc 0
0x0003 0x0005 0
0x0004 0x00fb 0
0x0004 0x0003 0
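For what it's worth, here is an untested sketch of one way to get closer to the requested table: read File 2 first to map each "Cmp own" cdisk to its instance, then expand every File 1 line into one row per existing cdisk (field positions are assumed from the samples above, and 65535 is treated as "no second cdisk"):
$ cat merge.awk
BEGIN { printf "%-11s %-11s %s\n", "vdisk(hex)", "cdisk(hex)", "Instance" }
NR==FNR {                            # first file on the command line: File 2
    if ($2 == "Cmp" && $3 == "own") inst[$1] = $(NF-5)   # cdisk(hex) -> instance
    next
}
$3 ~ /EXISTS|THIS_AGENT_ONLINE/ {    # second file: File 1
    for (i = NF-1; i <= NF; i++) {   # the last two fields are the candidate cdisks
        if ($i == 65535) continue    # 65535 means the cdisk does not exist
        c = sprintf("0x%04x", $i)    # decimal cdisk -> zero-padded hex
        printf "%-11s %-11s %s\n", $1, c, ((c in inst) ? inst[c] : "N/A")
    }
}
$
$ awk -f merge.awk file2 file1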