I am trying to understand how GDB works and how memory is being allocated. When I run the following command, it is suppose to write 72 A's into memory, but when I counted in memory, it only writes 68 A's. Then there's 4 bytes of some random memory before it writes memory of B. When I counted the A's in the print statement, it shows 72 A's.
0xbffff080: 0x14 0x84 0x04 0x08 0x41 0x41 0x41 0x41
0xbffff088: 0x42 0x42 0x42 0x42 0x42 0x42 0x42 0x42
Full command below.
(gdb) run $( python -c "print('A'*72+'BBBB')" )
Starting program: /home/ubuntu/Desktop/test $( python -c "print('A'*72+'BBBB')" )
Breakpoint 2, 0x08048473 in getName (
name=0xbffff32c 'A' <repeats 72 times>, "BBBB") at sample1.c:7
7 printf("Your name is: %s \n", myName);
(gdb) c
Continuing.
Your name is: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBB
Program received signal SIGSEGV, Segmentation fault.
0xbffff32c in ?? ()
(gdb) x/150xb $sp-140
0xbffff038: 0x50 0xf0 0xff 0xbf 0x54 0x82 0x04 0x08
0xbffff040: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff048: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff050: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff058: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff060: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff068: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff070: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff078: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff080: 0x14 0x84 0x04 0x08 0x41 0x41 0x41 0x41
0xbffff088: 0x42 0x42 0x42 0x42 0x42 0x42 0x42 0x42
0xbffff090: 0x2c 0xf3 0xff 0xbf 0x00 0xf0 0xff 0xb7
When I did further testing, and add an additional 4 bytes (4 C's), it shows it properly in memory as well as in the print statement.
(gdb) run $( python -c "print('A'*72+'BBBB'+'CCCC')" )
Starting program: /home/ubuntu/Desktop/test $( python -c "print('A'*72+'BBBB'+'CCCC')" )
Breakpoint 2, 0x08048473 in getName (name=0xbffff300 "") at sample1.c:7
7 printf("Your name is: %s \n", myName);
(gdb) c
Continuing.
Your name is: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBCCCC
Program received signal SIGSEGV, Segmentation fault.
0x43434343 in ?? ()
(gdb) x/150xb $sp-140
0xbffff02c: 0x54 0x82 0x04 0x08 0x41 0x41 0x41 0x41
0xbffff034: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff03c: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff044: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff04c: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff054: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff05c: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff064: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff06c: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0xbffff074: 0x41 0x41 0x41 0x41 0x42 0x42 0x42 0x42
0xbffff07c: 0x43 0x43 0x43 0x43 0x00 0xf3 0xff 0xbf
0xbffff084: 0x00 0xf0 0xff 0xb7 0xab 0x84
Here is the code:
#include <stdio.h>
#include <string.h>
void getName (char* name) {
char myName[64];
strcpy(myName, name);
printf("Your name is: %s \n", myName);
}
int main (int argc, char* argv[]) {
getName(argv[1]);
return 0;
}
A disassembly of getName which shows that 88 bytes were added to the buffer:
Reading symbols from test...done.
(gdb) disas getName
Dump of assembler code for function getName:
0x0804844d <+0>: push %ebp
0x0804844e <+1>: mov %esp,%ebp
0x08048450 <+3>: sub $0x58,%esp
0x08048453 <+6>: mov 0x8(%ebp),%eax
0x08048456 <+9>: mov %eax,0x4(%esp)
0x0804845a <+13>: lea -0x48(%ebp),%eax
0x0804845d <+16>: mov %eax,(%esp)
0x08048460 <+19>: call 0x8048320 <strcpy#plt>
0x08048465 <+24>: lea -0x48(%ebp),%eax
0x08048468 <+27>: mov %eax,0x4(%esp)
0x0804846c <+31>: movl $0x8048530,(%esp)
0x08048473 <+38>: call 0x8048310 <printf#plt>
0x08048478 <+43>: leave
0x08048479 <+44>: ret
End of assembler dump.
Unoptimized code may see extra padding on the stack because of inefficiencies, but most often padding is a result of the compiler trying to align data on the stack. GCC generally tries to allocate arrays on addresses evenly divisible by 16.
After EBP is pushed 0x58 bytes (88 bytes) are allocated. We can see that the buffer starts at EBP-0x48 because of this instruction:
lea -0x48(%ebp),%eax
The address EBP-0x48 is then used to set the parameters on the stack for both the call to strcpy and printf. 0x48 = 72 bytes, despite the buffer being 64 bytes. There are an additional 8 bytes of padding. Why the padding there? Because the compiler has tried to ensure that the beginning of the myName buffer is on a 16 byte boundary.
GCC can keep track of what is on the stack, but an important piece of information about alignment is derived from the calling convention (64-bit System V ABI) that says upon a call to a function (in this case getName) the stack must be 16 byte aligned. The call instruction pushes 4 bytes for a return address and then EBP is pushed for an additional 4. The compiler knows after the PUSH EBP it is misaligned by 8 bytes. 64 + 8 bytes of padding + 4 for EBP + 4 return address = 80. 80 is evenly divisible by 16 (16*5=80). The use of 8 bytes wasn't arbitrary.
In the GDB output you can see the myName array starts on a hexadecimal address ending in 0. Any hexadecimal address that ends in 0 is evenly divisible by 16 and you can see the buffer starts at 0xbffff040:
0xbffff038: 0x50 0xf0 0xff 0xbf 0x54 0x82 0x04 0x08
0xbffff040: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
With all that being said if you are looking to overwrite the return address it will be at an offset from the beginning of myName that is equal to 64 (array size) + 8 (padding) + 4(EBP on stack) = 76 bytes. You will have to write 76 bytes of data before reaching the point where you can replace the return address.
Note: You may wonder why the myname array has an additional 16 bytes beneath it on the stack (88-72=16 bytes). That space is where the compiler places values for the function calls like strcpy and printf and ensure that the function calls that are made have a stack 16 byte aligned to conform to the 64-bit System V ABI.
Reason for Unusual Data in Middle of myName
I confirmed the following observations by reproducing exactly what you saw on my own Ubuntu 14.04 system.
You were also wondering about the fact that when you inserted 72 A's and 4 B's that you had 4 unexpected bytes in the buffer:
0xbffff080:[0x14 0x84 0x04 0x08] 0x41 0x41 0x41 0x41
0xbffff088: 0x42 0x42 0x42 0x42 0x42 0x42 0x42 0x42
I've marked the 4 bytes with []. You are right that you might expect those 4 bytes to be 0x41 (The letter A) like the rest. What has happened is that although the input you gave on the command line was 76 characters (72+4) strcpy appended a NUL(\0) on the end as a 77th character. This overwrote the lower byte of the return address with 0! You used the c command to continue running after the breakpoint. The debugger terminated when it hit a segmentation fault. What happened was the RET instruction didn't return back to where you expected in main, it returned to a slightly lower location in memory because of the NUL byte being written into the return address. It just so happened that what you didn't see was all the instructions that executed after the RET that placed data back onto the stack. That included writing 32-bits of data into what was once your myName array.
When you wrote 72 A's, 4 B's, and 4 C's you ended up overwriting the return address with CCCC and you got a segmentation fault when the RET tried to start executing code at 0x43434343 as seen here:
0x43434343 in ?? ()
0x43434343 wasn't a valid address where you had execute permissions so it faulted. Because the RET failed to execute any more code the program didn't have a chance to overwrite the myName array. This explains why the buffer wasn't overwritten like the previous test.
Related
I am trying to implement a minimal hypervisor on ARMv8A (Cortext A53 on QEMU Version 6.2.0).I have written a minimal hypervisor code in EL2 and the Linux boots successfully in EL1. Now I want to enable stage-2 MMU. I have written basic page tables in stage2 (Only the necessary page table entries to map to 1GB RAM). If I disable PCI in DTB the kernel boots successfully.The QEMU command line is given below.
qemu-system-aarch64 -machine virt,gic-version=2,virtualization=on -cpu cortex-a53 -nographic -smp 1 -m 4096 -kernel hypvisor/bin/hypervisor.elf -device loader,file=linux-5.10.155/arch/arm64/boot/Image,addr=0x80200000 -device loader,file=1gb_1core.dtb,addr=0x88000000
When the PCI is enabled in DTB, I am getting a kernel panic as shown below.
[ 0.646801] pci_bus 0000:00: root bus resource [mem 0x8000000000-0xffffffffff]
[ 0.647909] Unable to handle kernel paging request at virtual address 0000000093810004
[ 0.648109] Mem abort info:
[ 0.648183] ESR = 0x96000004
[ 0.648282] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.648403] SET = 0, FnV = 0
[ 0.648484] EA = 0, S1PTW = 0
[ 0.648568] Data abort info:
[ 0.648647] ISV = 0, ISS = 0x00000004
[ 0.648743] CM = 0, WnR = 0
[ 0.648885] [0000000093810004] user address but active_mm is swapper
[ 0.653399] Call trace:
[ 0.653598] pci_generic_config_read+0x38/0xe0
[ 0.653729] pci_bus_read_config_dword+0x80/0xe0
[ 0.653845] pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
[ 0.653974] pci_bus_read_dev_vendor_id+0x4c/0x70
[ 0.654090] pci_scan_single_device+0x80/0x100
I set a GDB breakpoint in 'pci_generic_config_read' and observed that the faulting instruction is
>0xffff80001055d5c8 <pci_generic_config_read+56> ldr w1, [x0]
The value of register X0 is given below
(gdb) p /x $x0
$4 = 0xffff800020000000
The hardware (host) is configured to have 4GB in total and the Linux (guest) is supplied 1GB through command line and DTB. This is a single core system with 'kaslr' disabled.
Excerpt from the DTB containing PCI part is given below.
pcie#10000000 {
interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
interrupt-map = <0x00 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x03 0x04 0x00 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x04 0x04 0x00 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x05 0x04 0x00 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x06 0x04 0x800 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x04 0x04 0x800 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x05 0x04 0x800 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x06 0x04 0x800 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x03 0x04 0x1000 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x05 0x04 0x1000 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x06 0x04 0x1000 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x03 0x04 0x1000 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x04 0x04 0x1800 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x06 0x04 0x1800 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x03 0x04 0x1800 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x04 0x04 0x1800 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x05 0x04>;
#interrupt-cells = <0x01>;
ranges = <0x1000000 0x00 0x00 0x00 0x3eff0000 0x00 0x10000 0x2000000 0x00 0x10000000 0x00 0x10000000 0x00 0x2eff0000 0x3000000 0x80 0x00 0x80 0x00 0x80 0x00>;
reg = <0x40 0x10000000 0x00 0x10000000>;
msi-parent = <0x8002>;
dma-coherent;
bus-range = <0x00 0xff>;
linux,pci-domain = <0x00>;
#size-cells = <0x02>;
#address-cells = <0x03>;
device_type = "pci";
compatible = "pci-host-ecam-generic";
};
If my interpretation of DTB is right, the PCI device is mapped to the address range '0x40_1000_0000' (offset) '0x1000_0000' (size 256MB). that is, it starts from 100GB in the physical address space.
I have written a page table entry mapping to this physical address as well (as a device memory).
Is it right for the PCI to map to such a higher address in the physical address space? Any hints on debugging this issue is greatly appreciated.
Yes, for a 64-bit CPU this is the expected place to find the PCI controller ECAM region. The virt board puts some "large" device memory regions beyond the 4GB mark (specifically, PCIE ECAM, a seconD PCIE MMIO window, and redistributors for CPUs above 123). (You can turn this off with -machine highmem=off if you like, though that will limit the amount of RAM you can give the VM to 3GB.)
Depending on what your hypervisor is doing, you might or might not want it to be talking directly to the host PCI controller anyway.
I am trying to figure out this question. I posted my code below. It doesn't work properly. it seems like its not multiplying the 2 least significant nibbles. I don't know AVR very well.
Write AVR that generates a multiplication table for SRAM addresses 0x0100 to 0x01FF. The value at each address is the product of the two least significant nibbles of the address. For example, at address 0x0123, the multiplicand is 3 and the multiplier is 2. calculate the product (6 in this case) and store it at address 0x0123. The answer should be about 10-12 lines of code with a loop
.include "C:\VMLAB\include\m168def.inc"
ldi r27, 0x01
ldi r26, 0x00
ldi r30, 0xff
main:
mov r16, r26
andi r16,0x0f
mov r17,r27
andi r17,0xf0
swap r17
mul r17, r16
st x+, r16
dec r30
brne main
According to the manual, the MUL instruction stores its result in R0 and R1. So you need to read R0 after MUL, not R16.
You have three errors in your code:
mul stores results to R1:R0 pair registers
if use X as the index register and content mark symbolically as ABCD then CD is located in XL register, not in XH (r27). AB is constant (01)
you loops only 255 times not 256
ldi xh, 0x01
ldi xl, 0x00
ldi r30, 0x00
main:
mov r16, xl ;put D to R16
andi r16,0x0f
mov r17, xl ;put C to R17
andi r17,0xf0
swap r17
mul r17, r16 ;multiply C x D
st x+, r0 ;store result low byte to memory
dec r30 ;repeat 256 times
brne main
macOS 10.14.3, Xcode 9.4.1:
I am trying to use podofo in my project. I get the following error:
ld: warning: ignoring file /usr/local/lib/libcrypto.dylib, file was built for unsupported file format ( 0x62 0x6F 0x6F 0x6B 0x00 0x00 0x00 0x00 0x6D 0x61 0x72 0x6B 0x00 0x00 0x00 0x00 ) which is not the architecture being linked (x86_64): /usr/local/lib/libcrypto.dylib
ld: warning: ignoring file /usr/local/lib/libcrypto.dylib, file was built for unsupported file format ( 0x62 0x6F 0x6F 0x6B 0x00 0x00 0x00 0x00 0x6D 0x61 0x72 0x6B 0x00 0x00 0x00 0x00 ) which is not the architecture being linked (x86_64): /usr/local/lib/libcrypto.dylib
Undefined symbols for architecture x86_64:
"_EVP_CIPHER_CTX_free", referenced from:
PoDoFo::RC4CryptoEngine::~RC4CryptoEngine() in libpodofo.a(PdfEncrypt.o)
PoDoFo::AESCryptoEngine::~AESCryptoEngine() in libpodofo.a(PdfEncrypt.o)
"_EVP_CIPHER_CTX_new", referenced from:
PoDoFo::RC4CryptoEngine::RC4CryptoEngine() in libpodofo.a(PdfEncrypt.o)
PoDoFo::AESCryptoEngine::AESCryptoEngine() in libpodofo.a(PdfEncrypt.o)
What am I doing wrong? What does that mean?
Thank you so much for your time.
0x62 0x6F 0x6F 0x6B 0x00 0x00 0x00 0x00 0x6D 0x61 0x72 0x6B 0x00 0x00 0x00 0x00
To Ascii is BOOK MARK .
Is /usr/local/lib/libcrypto.dylib here actually the library you want to link in the program?
I'm currently in the process of writing a Gameboy emulator, and I've noticed something that seems strange to me.
My emulator is hitting a jump instruction 0xCD, for example CD B6 FF, but my understanding was that a jump should only be jumping to an address within cartridge ROM (0x7FFF maximum), because I'm assuming the CPU can only execute instructions from ROM, not RAM. The ROM in question is Dr. Mario, which I'd expect to only be carrying out valid operations. 0xFFB6 is in high RAM, which seems odd to me.
Am I correct in my thinking? If I am, presumably that means my program counter is somehow ending up at the wrong address and that the CB is actually part of another instruction's data, and not an instruction itself?
I'd be grateful for some clarification, thanks.
For reference, I've been using Gameboy Opcodes and CPU docs to implement the instructions. I know they contain a few errors, and I think I've accounted for them (for example, 0xE2 being listed as a two-byte instruction, when it's only one)
Just checked Dr. Mario 1.1, it copies the VBlank int routine at hFFB6 at startup, then when VBlank happens, the routine at 0:01A6 is called, which calls the OAM DMA transfer routine.
During OAM DMA transfer, the CPU can only access HRAM, so writing a short routine in HRAM that will wait for the transfer to be completed is required. The OAM DMA transfer takes 160 µs, so you usually make a loop that will wait this amount of time after specifying the OAM transfer source.
This is the part of the initialization routine run at startup that copies the DMA transfer routine to HRAM:
...
ROM0:027E 0E B6 ld c,B6 ;destination hFFB6
ROM0:0280 06 0A ld b,0A ;length 0xA
ROM0:0282 21 86 23 ld hl,2386 ;source 0:2386
ROM0:0285 2A ldi a,(hl) ;copy OAM DMA transfer routine from source
ROM0:0286 E2 ld (ff00+c),a ;paste to destination
ROM0:0287 0C inc c ;destination++
ROM0:0288 05 dec b ;length--
ROM0:0289 20 FA jr nz,0285 ;loop until DMA transfer routine is copied
...
When VBlank happens, it jumps to the routine at 0:01A6:
ROM0:0040 C3 A6 01 jp 01A6
Which contains a call to our OAM DMA transfer routine, waiting for DMA to be completed:
ROM0:01A6 F5 push af
ROM0:01A7 C5 push bc
ROM0:01A8 D5 push de
ROM0:01A9 E5 push hl
ROM0:01AA F0 B1 ld a,(ff00+B1)
ROM0:01AC A7 and a
ROM0:01AD 28 0B jr z,01BA
ROM0:01AF FA F1 C4 ld a,(C4F1)
ROM0:01B2 A7 and a
ROM0:01B3 28 05 jr z,01BA
ROM0:01B5 F0 EF ld a,(ff00+EF)
ROM0:01B7 A7 and a
ROM0:01B8 20 09 jr nz,01C3
ROM0:01BA F0 E1 ld a,(ff00+E1)
ROM0:01BC FE 03 cp a,03
ROM0:01BE 28 03 jr z,01C3
ROM0:01C0 CD B6 FF call FFB6 ;OAM DMA transfer routine is in HRAM
...
OAM DMA transfer routine:
HRAM:FFB6 3E C0 ld a,C0
HRAM:FFB8 E0 46 ld (ff00+46),a ;source is wC000
HRAM:FFBA 3E 28 ld a,28 ;loop start
HRAM:FFBC 3D dec a
HRAM:FFBD 20 FD jr nz,FFBC ;wait for the OAM DMA to be completed
HRAM:FFBF C9 ret ;ret to 0:01C3
Here is my analysis:
Looking for CD B6 FF in the raw ROM I can only find it in one place of the memory which is 0x01C0 (448 in decimal).
So I decided to disassemble the ROM, to see if it is a valid instruction.
I used gb-disasm to disassemble the ROM. Here are the values from 0x150 (ROM start) to address 0x201.
[0x00000100] 0x00 NOP
[0x00000101] 0xC3 0x50 0x01 JP $0150
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
[0x00000153] 0x01 0x0E 0xD0 LD BC,$D00E
[0x00000156] 0x0A LD A,[BC]
[0x00000157] 0xA7 AND A
[0x00000158] 0x20 0x0D JR NZ,$0D ; 0x167
[0x0000015A] 0xF0 0xCF LDH A,[$CF] ; HIMEM
[0x0000015C] 0xFE 0xFE CP $FE
[0x0000015E] 0x20 0x04 JR NZ,$04 ; 0x164
[0x00000160] 0x3E 0x01 LD A,$01
[0x00000162] 0x18 0x01 JR $01 ; 0x165
[0x00000164] 0xAF XOR A
[0x00000165] 0x02 LD [BC],A
[0x00000166] 0xC9 RET
[0x00000167] 0xFA 0x46 0xD0 LD A,[$D046]
[0x0000016A] 0xE0 0x01 LDH [$01],A ; SB
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
[0x000001E8] 0xAF XOR A
[0x000001E9] 0x21 0xFF 0xDF LD HL,$DFFF
[0x000001EC] 0x0E 0x10 LD C,$10
[0x000001EE] 0x06 0x00 LD B,$00
[0x000001F0] 0x32 LD [HLD],A
[0x000001F1] 0x05 DEC B
[0x000001F2] 0x20 0xFC JR NZ,$FC ; 0x1F0
[0x000001F4] 0x0D DEC C
[0x000001F5] 0x20 0xF9 JR NZ,$F9 ; 0x1F0
[0x000001F7] 0x3E 0x0D LD A,$0D
[0x000001F9] 0xF3 DI
[0x000001FA] 0xE0 0x0F LDH [$0F],A ; IF
[0x000001FC] 0xE0 0xFF LDH [$FF],A ; IE
[0x000001FE] 0xAF XOR A
[0x000001FF] 0xE0 0x42 LDH [$42],A ; SCY
[0x00000201] 0xE0 0x43 LDH [$43],A ; SCX
The way we have to disassemble a ROM is by following the flow of instructions. For example, we know that the main program starts at position 0x150. So we should start disassembling there. Then we follow instruction by instruction until we hit any JUMP instruction (JP, JR, CALL, RET, etc). From that moment on the flow of the program is forked in two and we should follow both paths to disassemble.
The think to understand here is that if I show you a random memory position in a ROM, you cannot tell me if it is data or instructions. The only way to find out is by following the program flow. We need to define blocks of code that start in a jump destination and end in another jump instruction.
gb-disasm skips any memory position that is not inside a code block. 0x16C marks the end of a block.
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
The next block starts on 0x1E8. We know that because it is the destination address of a jump located on 0x150.
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
Memory block from 0x16E until 0x1E8 is not consider a code block. That's why you don't see the memory position 0x01C0 listed as an instruction.
So there you are, it is very likely that you are interpreting the instructions in a wrong way. If you want to be 100% sure, you can disassemble the whole room and check if any instruction points to 0x16E-0x1E8 and reads it as raw data, such as a tile or something.
Please leave a comment if you agree with the analysis.
I have two 128 bit numbers in memory in hexadecimal, for example (little endian):
x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
I've to perform the unsigned multiplication between these two numbers so my new number will be:
z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Now, I'm aware that I can move the half x and y number into rax and rbx registers and, for example, do the mul operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and <-> shl,1).
Can you give me some input about this problem?
I think the best solution is to take one byte par time.
Let μ = 264, then we can decompose your 128 bit numbers a and b into a = a1μ + a2 and b = b1μ + b2. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:
q1μ + q2 = a2b2
r1μ + r2 = a1b2
s1μ + s2 = a2b1
t1μ + t2 = a1b1
and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):
c = t1μ3 + (t2 + s1 + r1) μ2 + (s2 + r2 + q1) μ + q2
As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.
__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }
compiles to (gcc6.2 -O3 on Godbolt)
imul rsi, rdx # a_hi * b_lo
mov rax, rdi
imul rcx, rdi # b_hi * a_lo
mul rdx # a_lo * b_lo widening multiply
add rcx, rsi # add the cross products ...
add rdx, rcx # ... into the high 64 bits.
ret
Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.
Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.
With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.
Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.
Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).
Is there a 128 bit integer in gcc? for GNU C unsigned __int128
https://learn.microsoft.com/en-us/cpp/intrinsics/umul128?view=msvc-170 MSVC's _umul128 that does 64x64 => 128-bit multiply (on 64-bit CPUs only). Takes args as 64-bit halves, returns two halves.
Getting the high part of 64 bit integer multiplication - Including with MSVC intrinsics, but still only for 64-bit CPUs.
An efficient way to do basic 128 bit integer calculations in C++?