How can I multiply two hex 128 bit numbers in assembly - algorithm

I have two 128 bit numbers in memory in hexadecimal, for example (little endian):
x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
I've to perform the unsigned multiplication between these two numbers so my new number will be:
z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Now, I'm aware that I can move half of x and half of y into the rax and rbx registers, do the mul operation, and then do the same with the other halves. The problem is that by doing so I lose the carry, and I have no idea how to avoid that. I've been stuck on this problem for about 4 hours, and the only solution I can see is converting to binary and doing shift-and-add (and <-> shl,1).
Can you give me some input about this problem?
I think the best solution is to take one byte at a time.

Let $\mu = 2^{64}$. Then we can decompose your 128-bit numbers $a$ and $b$ into $a = a_1\mu + a_2$ and $b = b_1\mu + b_2$. We can then compute $c = ab$ with 64 · 64 → 128 bit multiplications by first computing the partial products:
$q_1\mu + q_2 = a_2 b_2$
$r_1\mu + r_2 = a_1 b_2$
$s_1\mu + s_2 = a_2 b_1$
$t_1\mu + t_2 = a_1 b_1$
and then accumulating them into a 256-bit result (watch the overflow when doing the additions!):
$c = t_1\mu^3 + (t_2 + s_1 + r_1)\mu^2 + (s_2 + r_2 + q_1)\mu + q_2$
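The accumulation can be sketched in C as follows. This is only an illustration, not part of the original answer: the function name mul128x128 and the limb layout of out are assumptions, and it relies on the GNU C unsigned __int128 type (GCC or Clang on a 64-bit target). Summing each column in a 128-bit temporary is one way to keep the carries that the plain additions would lose.

```c
#include <stdint.h>

/* Hypothetical helper (not from the answer): 128 x 128 -> 256-bit unsigned
 * multiply. a = a1*2^64 + a2 and b = b1*2^64 + b2, as in the text.
 * out[0] is the least significant 64-bit limb of the result. */
void mul128x128(uint64_t a1, uint64_t a2, uint64_t b1, uint64_t b2,
                uint64_t out[4])
{
    /* The four 64 x 64 -> 128-bit partial products, named as in the text. */
    unsigned __int128 q = (unsigned __int128)a2 * b2;
    unsigned __int128 r = (unsigned __int128)a1 * b2;
    unsigned __int128 s = (unsigned __int128)a2 * b1;
    unsigned __int128 t = (unsigned __int128)a1 * b1;

    /* mu^0 column: q2 */
    out[0] = (uint64_t)q;

    /* mu^1 column: s2 + r2 + q1; the 128-bit sum keeps the carry. */
    unsigned __int128 col = (unsigned __int128)(uint64_t)s + (uint64_t)r
                          + (uint64_t)(q >> 64);
    out[1] = (uint64_t)col;

    /* mu^2 column: t2 + s1 + r1 plus the carry out of the mu^1 column. */
    col = (unsigned __int128)(uint64_t)t + (uint64_t)(s >> 64)
        + (uint64_t)(r >> 64) + (uint64_t)(col >> 64);
    out[2] = (uint64_t)col;

    /* mu^3 column: t1 plus the final carry (this cannot overflow 64 bits). */
    out[3] = (uint64_t)(t >> 64) + (uint64_t)(col >> 64);
}
```

With the question's operands (x = 0x4512, y = 0xa136, upper halves zero) the lowest limb comes out as 0x2b7ee3cc, matching the expected z. In assembly the same column sums map naturally onto add/adc chains.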

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.
__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }
compiles to (gcc6.2 -O3 on Godbolt)
imul rsi, rdx # a_hi * b_lo
mov rax, rdi
imul rcx, rdi # b_hi * a_lo
mul rdx # a_lo * b_lo widening multiply
add rcx, rsi # add the cross products ...
add rdx, rcx # ... into the high 64 bits.
ret
Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.
Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.
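To make the strategy in that compiler output explicit, here is a rough C model of it working on 64-bit halves. The name mul128_lo is made up for this sketch, and it assumes a compiler with unsigned __int128 (GCC or Clang on a 64-bit target). Only the low 128 bits of the product are produced, so the cross products are needed modulo 2^64 only.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the compiler's strategy:
 * returns the low 128 bits of (a_hi:a_lo) * (b_hi:b_lo). */
unsigned __int128 mul128_lo(uint64_t a_hi, uint64_t a_lo,
                            uint64_t b_hi, uint64_t b_lo)
{
    /* Widening multiply of the low halves (the MUL instruction). */
    unsigned __int128 lo = (unsigned __int128)a_lo * b_lo;

    /* Cross products, truncated to 64 bits (the two IMULs plus one ADD).
     * Their upper halves would land above bit 127, so they are dropped. */
    uint64_t cross = a_hi * b_lo + b_hi * a_lo;

    /* Add the cross products into the high 64 bits of the result. */
    uint64_t hi = (uint64_t)(lo >> 64) + cross;

    return ((unsigned __int128)hi << 64) | (uint64_t)lo;
}
```

The a_hi * b_hi term never appears because it only contributes at bit 128 and above.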
With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.
Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.
Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).
Is there a 128 bit integer in gcc? for GNU C unsigned __int128
https://learn.microsoft.com/en-us/cpp/intrinsics/umul128?view=msvc-170 MSVC's _umul128 that does 64x64 => 128-bit multiply (on 64-bit CPUs only). Takes args as 64-bit halves, returns two halves.
Getting the high part of 64 bit integer multiplication - Including with MSVC intrinsics, but still only for 64-bit CPUs.
An efficient way to do basic 128 bit integer calculations in C++?

Related

ARMv8A hypervisor - PCI MMU fault

I am trying to implement a minimal hypervisor on ARMv8-A (Cortex-A53 on QEMU version 6.2.0). I have written minimal hypervisor code in EL2, and Linux boots successfully in EL1. Now I want to enable the stage-2 MMU. I have written basic stage-2 page tables (only the page table entries necessary to map 1GB of RAM). If I disable PCI in the DTB, the kernel boots successfully. The QEMU command line is given below.
qemu-system-aarch64 -machine virt,gic-version=2,virtualization=on -cpu cortex-a53 -nographic -smp 1 -m 4096 -kernel hypvisor/bin/hypervisor.elf -device loader,file=linux-5.10.155/arch/arm64/boot/Image,addr=0x80200000 -device loader,file=1gb_1core.dtb,addr=0x88000000
When the PCI is enabled in DTB, I am getting a kernel panic as shown below.
[ 0.646801] pci_bus 0000:00: root bus resource [mem 0x8000000000-0xffffffffff]
[ 0.647909] Unable to handle kernel paging request at virtual address 0000000093810004
[ 0.648109] Mem abort info:
[ 0.648183] ESR = 0x96000004
[ 0.648282] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.648403] SET = 0, FnV = 0
[ 0.648484] EA = 0, S1PTW = 0
[ 0.648568] Data abort info:
[ 0.648647] ISV = 0, ISS = 0x00000004
[ 0.648743] CM = 0, WnR = 0
[ 0.648885] [0000000093810004] user address but active_mm is swapper
[ 0.653399] Call trace:
[ 0.653598] pci_generic_config_read+0x38/0xe0
[ 0.653729] pci_bus_read_config_dword+0x80/0xe0
[ 0.653845] pci_bus_generic_read_dev_vendor_id+0x34/0x1b0
[ 0.653974] pci_bus_read_dev_vendor_id+0x4c/0x70
[ 0.654090] pci_scan_single_device+0x80/0x100
I set a GDB breakpoint in 'pci_generic_config_read' and observed that the faulting instruction is
>0xffff80001055d5c8 <pci_generic_config_read+56> ldr w1, [x0]
The value of register X0 is given below
(gdb) p /x $x0
$4 = 0xffff800020000000
The hardware (host) is configured to have 4GB in total and the Linux (guest) is supplied 1GB through command line and DTB. This is a single core system with 'kaslr' disabled.
Excerpt from the DTB containing PCI part is given below.
pcie#10000000 {
interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
interrupt-map = <0x00 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x03 0x04 0x00 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x04 0x04 0x00 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x05 0x04 0x00 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x06 0x04 0x800 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x04 0x04 0x800 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x05 0x04 0x800 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x06 0x04 0x800 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x03 0x04 0x1000 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x05 0x04 0x1000 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x06 0x04 0x1000 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x03 0x04 0x1000 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x04 0x04 0x1800 0x00 0x00 0x01 0x8001 0x00 0x00 0x00 0x06 0x04 0x1800 0x00 0x00 0x02 0x8001 0x00 0x00 0x00 0x03 0x04 0x1800 0x00 0x00 0x03 0x8001 0x00 0x00 0x00 0x04 0x04 0x1800 0x00 0x00 0x04 0x8001 0x00 0x00 0x00 0x05 0x04>;
#interrupt-cells = <0x01>;
ranges = <0x1000000 0x00 0x00 0x00 0x3eff0000 0x00 0x10000 0x2000000 0x00 0x10000000 0x00 0x10000000 0x00 0x2eff0000 0x3000000 0x80 0x00 0x80 0x00 0x80 0x00>;
reg = <0x40 0x10000000 0x00 0x10000000>;
msi-parent = <0x8002>;
dma-coherent;
bus-range = <0x00 0xff>;
linux,pci-domain = <0x00>;
#size-cells = <0x02>;
#address-cells = <0x03>;
device_type = "pci";
compatible = "pci-host-ecam-generic";
};
If my interpretation of the DTB is right, the PCI device is mapped at offset '0x40_1000_0000' with size '0x1000_0000' (256MB); that is, it starts just above the 256GB mark in the physical address space.
I have written a page table entry mapping to this physical address as well (as a device memory).
Is it right for the PCI to map to such a high address in the physical address space? Any hints on debugging this issue are greatly appreciated.
Yes, for a 64-bit CPU this is the expected place to find the PCI controller ECAM region. The virt board puts some "large" device memory regions beyond the 4GB mark (specifically, PCIE ECAM, a second PCIE MMIO window, and redistributors for CPUs above 123). (You can turn this off with -machine highmem=off if you like, though that will limit the amount of RAM you can give the VM to 3GB.)
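As a cross-check of the reg property in the question, the cells can be combined by hand. The sketch below assumes the parent node uses #address-cells = <2> and #size-cells = <2>, so that reg holds one 64-bit base followed by one 64-bit size; dt_cells_to_u64 is a made-up helper name, not a libfdt function.

```c
#include <stdint.h>

/* Illustrative helper: combine two 32-bit device-tree cells
 * (most significant cell first) into one 64-bit value. */
uint64_t dt_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* reg = <0x40 0x10000000 0x00 0x10000000>, read as <base_hi base_lo size_hi size_lo>. */
uint64_t ecam_base(void) { return dt_cells_to_u64(0x40, 0x10000000); }
uint64_t ecam_size(void) { return dt_cells_to_u64(0x00, 0x10000000); }
```

Under that assumption the ECAM window is 0x4010000000 with a size of 0x10000000, that is 256 MiB located 256 MiB above the 256 GiB boundary, so a stage-2 mapping has to cover an intermediate physical address of at least 39 bits.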
Depending on what your hypervisor is doing, you might or might not want it to be talking directly to the host PCI controller anyway.
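As a debugging aid, the ESR value from the panic log can also be decoded by hand. This is a minimal sketch using the architectural ESR_ELx field positions; the helper names are invented for illustration.

```c
#include <stdint.h>

/* Field extraction for an AArch64 ESR_ELx value (illustrative helper names). */
unsigned esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3f; } /* exception class, bits [31:26] */
unsigned esr_il(uint32_t esr)   { return (esr >> 25) & 0x1; }  /* instruction length, bit 25 */
uint32_t esr_iss(uint32_t esr)  { return esr & 0x1ffffff; }    /* syndrome, bits [24:0] */
unsigned esr_dfsc(uint32_t esr) { return esr & 0x3f; }         /* data fault status code, ISS[5:0] */
```

For ESR = 0x96000004 this gives EC = 0x25 (data abort without a change of exception level), IL = 1, and DFSC = 0x04, which encodes a level 0 translation fault. Since the abort was reported by the kernel at EL1 rather than trapped to EL2, it may be worth checking which stage-1 mapping the kernel set up for the ECAM window before looking at the stage-2 tables.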

Multiplication table in VMLab / AVR

I am trying to figure out this question. I posted my code below. It doesn't work properly: it seems like it's not multiplying the 2 least significant nibbles. I don't know AVR very well.
Write AVR code that generates a multiplication table for SRAM addresses 0x0100 to 0x01FF. The value at each address is the product of the two least significant nibbles of the address. For example, at address 0x0123, the multiplicand is 3 and the multiplier is 2. Calculate the product (6 in this case) and store it at address 0x0123. The answer should be about 10-12 lines of code with a loop.
.include "C:\VMLAB\include\m168def.inc"
ldi r27, 0x01
ldi r26, 0x00
ldi r30, 0xff
main:
mov r16, r26
andi r16,0x0f
mov r17,r27
andi r17,0xf0
swap r17
mul r17, r16
st x+, r16
dec r30
brne main
According to the manual, the MUL instruction stores its result in R0 and R1. So you need to read R0 after MUL, not R16.
You have three errors in your code:
mul stores its result in the R1:R0 register pair, not in R16
if X is used as the index register and its contents are written symbolically as ABCD, then CD is in the XL register (r26), not in XH (r27); AB is constant (01)
your loop runs only 255 times, not 256
ldi xh, 0x01
ldi xl, 0x00
ldi r30, 0x00
main:
mov r16, xl ;put D to R16
andi r16,0x0f
mov r17, xl ;put C to R17
andi r17,0xf0
swap r17
mul r17, r16 ;multiply C x D
st x+, r0 ;store result low byte to memory
dec r30 ;repeat 256 times
brne main
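To check the SRAM contents in the simulator, a small reference model of the expected table can help. This C sketch is not part of the assignment, and build_table is a hypothetical name; table[i] corresponds to SRAM address 0x0100 + i.

```c
#include <stdint.h>

/* Reference model: table[i] holds the byte expected at SRAM address 0x0100 + i. */
void build_table(uint8_t table[256])
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t d = i & 0x0f;          /* low nibble  (r16 in the assembly) */
        uint8_t c = (i >> 4) & 0x0f;   /* high nibble (r17 after andi + swap) */
        table[i] = (uint8_t)(c * d);   /* at most 15 * 15 = 225, so it fits in r0 */
    }
}
```

Because the largest product is 225, the high result byte in r1 is always zero and storing r0 alone is sufficient.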

What do the constraints "Rah" and "Ral" mean in extended inline assembly?

This question is inspired by a question asked by someone on another forum. In the following code, what do the extended inline assembly constraints Rah and Ral mean? I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular, I am asking about the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC Documentation doesn't have an l or h constraint for either simple constraints or x86/x86 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?
What you are looking at is code that is intended to be run in real mode on an x86 based PC with a BIOS. Int 0x10 is a BIOS service that has the ability to write to the console. In particular Int 0x10/AH=0x0e is to write a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral you have to understand that this code isn't being compiled by a standard version of GCC/CLANG. It is being compiled by a GCC port called ia16-gcc. It is a special port that targets 8086/80186 and 80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC is to target 16-bit environments like DOS (FreeDOS, MSDOS), and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC doesn't understand the R constraint by itself anymore. The inline assembly you are looking at matches that of the parameters for Int 0x10/Ah=0xe:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
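The register layout that these operands end up in can be modelled in portable C. This is only an illustration of how the 8-bit values share the 16-bit AX and BX registers; pack_ax and pack_bx are invented names and nothing here calls the BIOS.

```c
#include <stdint.h>

/* Illustrative helpers: model how two 8-bit values share one 16-bit register. */
uint16_t pack_ax(uint8_t ah, uint8_t al)
{
    /* "Rah" places its operand in AH (upper byte), "Ral" in AL (lower byte). */
    return (uint16_t)((ah << 8) | al);
}

uint16_t pack_bx(uint8_t page_num, uint8_t fg_color)
{
    /* Same expression as the "b" operand in the question: BH = page, BL = color. */
    return (uint16_t)((uint16_t)page_num << 8 | fg_color);
}
```

For the teletype call with the character 'A', page 1, and color 7, AX would hold 0x0E41 and BX would hold 0x0107.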
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular the a constraint for the AX register doesn't work like other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL with the a constraint if the values being passed are 8 bit values. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086 compatible code by omitting the -march=i186 option. The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!", it should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs especially when higher optimization levels are used. I wouldn't use it for mission critical applications at this point in time.

Gameboy emulation - Clarification needed on CD instruction

I'm currently in the process of writing a Gameboy emulator, and I've noticed something that seems strange to me.
My emulator is hitting a jump instruction 0xCD, for example CD B6 FF, but my understanding was that a jump should only be jumping to an address within cartridge ROM (0x7FFF maximum), because I'm assuming the CPU can only execute instructions from ROM, not RAM. The ROM in question is Dr. Mario, which I'd expect to only be carrying out valid operations. 0xFFB6 is in high RAM, which seems odd to me.
Am I correct in my thinking? If I am, presumably that means my program counter is somehow ending up at the wrong address and that the CD is actually part of another instruction's data, and not an instruction itself?
I'd be grateful for some clarification, thanks.
For reference, I've been using Gameboy Opcodes and CPU docs to implement the instructions. I know they contain a few errors, and I think I've accounted for them (for example, 0xE2 being listed as a two-byte instruction, when it's only one)
I just checked Dr. Mario 1.1: at startup it copies a routine used by the VBlank interrupt handler to hFFB6. When VBlank happens, the handler at 0:01A6 runs, and it calls that routine, which is the OAM DMA transfer routine.
During OAM DMA transfer, the CPU can only access HRAM, so writing a short routine in HRAM that will wait for the transfer to be completed is required. The OAM DMA transfer takes 160 µs, so you usually make a loop that will wait this amount of time after specifying the OAM transfer source.
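For an emulator, the practical point is that the program counter may legitimately point into RAM. A small address classifier like the following can help when logging where code executes. It is a sketch based on the standard DMG memory map; the enum and function names are made up.

```c
#include <stdint.h>

typedef enum {
    GB_ROM, GB_VRAM, GB_EXT_RAM, GB_WRAM, GB_ECHO,
    GB_OAM, GB_UNUSABLE, GB_IO, GB_HRAM, GB_IE
} gb_region;

/* Classify a 16-bit address according to the DMG memory map. */
gb_region gb_classify(uint16_t addr)
{
    if (addr <= 0x7fff) return GB_ROM;       /* cartridge ROM */
    if (addr <= 0x9fff) return GB_VRAM;      /* video RAM */
    if (addr <= 0xbfff) return GB_EXT_RAM;   /* cartridge RAM */
    if (addr <= 0xdfff) return GB_WRAM;      /* work RAM */
    if (addr <= 0xfdff) return GB_ECHO;      /* echo of work RAM */
    if (addr <= 0xfe9f) return GB_OAM;       /* sprite attribute table */
    if (addr <= 0xfeff) return GB_UNUSABLE;
    if (addr <= 0xff7f) return GB_IO;        /* I/O registers */
    if (addr <= 0xfffe) return GB_HRAM;      /* high RAM */
    return GB_IE;                            /* interrupt enable register */
}
```

Here gb_classify(0xFFB6) yields GB_HRAM, so the target of CD B6 FF is a valid place for code to run from.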
This is the part of the initialization routine run at startup that copies the DMA transfer routine to HRAM:
...
ROM0:027E 0E B6 ld c,B6 ;destination hFFB6
ROM0:0280 06 0A ld b,0A ;length 0xA
ROM0:0282 21 86 23 ld hl,2386 ;source 0:2386
ROM0:0285 2A ldi a,(hl) ;copy OAM DMA transfer routine from source
ROM0:0286 E2 ld (ff00+c),a ;paste to destination
ROM0:0287 0C inc c ;destination++
ROM0:0288 05 dec b ;length--
ROM0:0289 20 FA jr nz,0285 ;loop until DMA transfer routine is copied
...
When VBlank happens, it jumps to the routine at 0:01A6:
ROM0:0040 C3 A6 01 jp 01A6
Which contains a call to our OAM DMA transfer routine, waiting for DMA to be completed:
ROM0:01A6 F5 push af
ROM0:01A7 C5 push bc
ROM0:01A8 D5 push de
ROM0:01A9 E5 push hl
ROM0:01AA F0 B1 ld a,(ff00+B1)
ROM0:01AC A7 and a
ROM0:01AD 28 0B jr z,01BA
ROM0:01AF FA F1 C4 ld a,(C4F1)
ROM0:01B2 A7 and a
ROM0:01B3 28 05 jr z,01BA
ROM0:01B5 F0 EF ld a,(ff00+EF)
ROM0:01B7 A7 and a
ROM0:01B8 20 09 jr nz,01C3
ROM0:01BA F0 E1 ld a,(ff00+E1)
ROM0:01BC FE 03 cp a,03
ROM0:01BE 28 03 jr z,01C3
ROM0:01C0 CD B6 FF call FFB6 ;OAM DMA transfer routine is in HRAM
...
OAM DMA transfer routine:
HRAM:FFB6 3E C0 ld a,C0
HRAM:FFB8 E0 46 ld (ff00+46),a ;source is wC000
HRAM:FFBA 3E 28 ld a,28 ;loop start
HRAM:FFBC 3D dec a
HRAM:FFBD 20 FD jr nz,FFBC ;wait for the OAM DMA to be completed
HRAM:FFBF C9 ret ;ret to 0:01C3
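In emulator terms, 0xCD is CALL a16 rather than a plain jump: it pushes the address of the next instruction and then loads the little-endian operand into PC. A minimal sketch of that behaviour is shown below; the gb_cpu struct and gb_exec_call are hypothetical names, and the flat 64 KiB array ignores banking and timing.

```c
#include <stdint.h>

typedef struct {
    uint16_t pc, sp;
    uint8_t  mem[0x10000];   /* flat 64 KiB address space, no banking */
} gb_cpu;

/* Execute CALL a16 (opcode 0xCD) located at cpu->pc. */
void gb_exec_call(gb_cpu *cpu)
{
    /* The operand is stored low byte first: CD B6 FF -> 0xFFB6. */
    uint16_t target = (uint16_t)(cpu->mem[(uint16_t)(cpu->pc + 1)]
                    | (cpu->mem[(uint16_t)(cpu->pc + 2)] << 8));
    uint16_t ret = (uint16_t)(cpu->pc + 3);   /* address after the 3-byte CALL */

    /* Push the return address: high byte first, then low byte. */
    cpu->sp = (uint16_t)(cpu->sp - 1);
    cpu->mem[cpu->sp] = (uint8_t)(ret >> 8);
    cpu->sp = (uint16_t)(cpu->sp - 1);
    cpu->mem[cpu->sp] = (uint8_t)(ret & 0xff);

    cpu->pc = target;
}
```

With CD B6 FF at 0x01C0, the pushed return address is 0x01C3 and PC becomes 0xFFB6, which matches the "ret to 0:01C3" note in the listing above.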
Here is my analysis:
Looking for CD B6 FF in the raw ROM I can only find it in one place of the memory which is 0x01C0 (448 in decimal).
So I decided to disassemble the ROM, to see if it is a valid instruction.
I used gb-disasm to disassemble the ROM. Here is the output from the entry point at 0x100 up to address 0x201.
[0x00000100] 0x00 NOP
[0x00000101] 0xC3 0x50 0x01 JP $0150
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
[0x00000153] 0x01 0x0E 0xD0 LD BC,$D00E
[0x00000156] 0x0A LD A,[BC]
[0x00000157] 0xA7 AND A
[0x00000158] 0x20 0x0D JR NZ,$0D ; 0x167
[0x0000015A] 0xF0 0xCF LDH A,[$CF] ; HIMEM
[0x0000015C] 0xFE 0xFE CP $FE
[0x0000015E] 0x20 0x04 JR NZ,$04 ; 0x164
[0x00000160] 0x3E 0x01 LD A,$01
[0x00000162] 0x18 0x01 JR $01 ; 0x165
[0x00000164] 0xAF XOR A
[0x00000165] 0x02 LD [BC],A
[0x00000166] 0xC9 RET
[0x00000167] 0xFA 0x46 0xD0 LD A,[$D046]
[0x0000016A] 0xE0 0x01 LDH [$01],A ; SB
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
[0x000001E8] 0xAF XOR A
[0x000001E9] 0x21 0xFF 0xDF LD HL,$DFFF
[0x000001EC] 0x0E 0x10 LD C,$10
[0x000001EE] 0x06 0x00 LD B,$00
[0x000001F0] 0x32 LD [HLD],A
[0x000001F1] 0x05 DEC B
[0x000001F2] 0x20 0xFC JR NZ,$FC ; 0x1F0
[0x000001F4] 0x0D DEC C
[0x000001F5] 0x20 0xF9 JR NZ,$F9 ; 0x1F0
[0x000001F7] 0x3E 0x0D LD A,$0D
[0x000001F9] 0xF3 DI
[0x000001FA] 0xE0 0x0F LDH [$0F],A ; IF
[0x000001FC] 0xE0 0xFF LDH [$FF],A ; IE
[0x000001FE] 0xAF XOR A
[0x000001FF] 0xE0 0x42 LDH [$42],A ; SCY
[0x00000201] 0xE0 0x43 LDH [$43],A ; SCX
The way we have to disassemble a ROM is by following the flow of instructions. For example, we know that the main program starts at position 0x150. So we should start disassembling there. Then we follow instruction by instruction until we hit any JUMP instruction (JP, JR, CALL, RET, etc). From that moment on the flow of the program is forked in two and we should follow both paths to disassemble.
The thing to understand here is that if I show you a random memory position in a ROM, you cannot tell me if it is data or instructions. The only way to find out is by following the program flow. We need to define blocks of code that start at a jump destination and end at another jump instruction.
gb-disasm skips any memory position that is not inside a code block. 0x16C marks the end of a block.
[0x0000016C] 0x18 0xF6 JR $F6 ; 0x164
The next block starts on 0x1E8. We know that because it is the destination address of a jump located on 0x150.
[0x00000150] 0xC3 0xE8 0x01 JP $01E8
The memory block from 0x16E to 0x1E8 is not considered a code block. That's why you don't see the memory position 0x01C0 listed as an instruction.
So there you are: it is very likely that you are interpreting the instructions in the wrong way. If you want to be 100% sure, you can disassemble the whole ROM and check whether any instruction points to 0x16E-0x1E8 and reads it as raw data, such as a tile or something.
Please leave a comment if you agree with the analysis.

Access specific bit in embedded X86 assembly

I am trying to access a specific bit and modify it.
I have moved 0x01ABCDEF (hex value) into ecx and want to be able to check bit values at specific position.
For example I must take byte 0 of 0x01ABCDEF (0xEF)
check if bit at position 7 is 1
set the middle 4 bits to 1 and the rest to 0.
On x86 the simplest solution is to use the bit manipulation instructions BT (bit test), BTC (bit test and complement), BTR (bit test and reset) and BTS (bit test and set).
Bit test example:
mov dl, 7 ; bit index 7
bt ecx, edx ; test bit 7 of ecx
Remember: with a register destination, only the low 5 bits of edx are used.
or
bt ecx, 7
In both cases the result is stored in carry flag.
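For comparison, the effect of BT with a 32-bit register destination can be modelled in C. The function name bit_test is made up for this sketch; the return value plays the role of the carry flag.

```c
#include <stdint.h>

/* C model of "bt reg32, index" with a register destination:
 * returns the value that BT would leave in the carry flag. */
int bit_test(uint32_t value, uint32_t index)
{
    /* Only the low 5 bits of the index are used, as with the instruction. */
    return (int)((value >> (index & 31)) & 1u);
}
```

For the value 0x01ABCDEF, bit_test(value, 7) returns 1 because the low byte 0xEF has bit 7 set.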
It's been years since I've done asm, but you want to AND your value with 0x80; if the result is zero, your bit is not set, so you jump out. Otherwise continue along and set your eax to the value you want (I assume the four bits you mean are the 00111100 in the fourth byte).
For example (treat this as pseudo code as it's been far too long):
test eax, 0x80 ; AND with 0x80, setting flags without changing eax
jz exit ; bit 7 is clear: leave eax untouched
mov eax, 0x3C ; bit 7 is set: load 00111100
exit:
Most CPUs do not support bit-wise access, so you have to use OR to set and AND to clear bits.
As I'm not really familiar with assembly I will just give you C-ish pseudocode, but you should easily be able to transform that to assembly instructions.
value = 0x01ABCDEF;
last_byte = value & 0xFF; // = 0xEF
if (last_byte & 0x40) { // is the 7th bit set? (0x01 = 1st, 0x02 = 2nd, 0x04 = 3rd, 0x08 = 4th, 0x10 = 5th, 0x20 = 6th, 0x40 = 7th, 0x80 = 8th)
value = value & 0xFFFFFF00; // clear last byte
value = value | 0x3C; // set the byte with 00111100 bits (0x3C is the hex representation of these bits)
}
Note that you can remove the last_byte assignment and check value & 0x40 directly. However, if you want to check something that is not in the least significant part, you have to shift first. For example, to extract ABCD you would use the following:
middle_bytes = (value & 0xFFFF00) >> 8;
value & 0xFFFF00 gets rid of the more significant byte(s) (0x01), and >> 8 shifts the result right by one byte and thus gets rid of the last byte (0xEF).
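Putting the pieces together, a compilable version might look like the sketch below. It assumes that "position 7" is counted from zero (mask 0x80), as the BT-based answer does, rather than the one-based numbering used in the pseudocode above; both function names are invented for illustration.

```c
#include <stdint.h>

/* If bit 7 (counting from 0) of the low byte is set, replace the low byte
 * with 00111100 (0x3C); otherwise return the value unchanged. */
uint32_t set_middle_bits_if_bit7(uint32_t value)
{
    if (value & 0x80u) {
        value = (value & 0xFFFFFF00u) | 0x3Cu;   /* clear byte 0, then set the middle 4 bits */
    }
    return value;
}

/* Extract bytes 1 and 2 (ABCD in 0x01ABCDEF): mask, then shift right. */
uint32_t middle_bytes(uint32_t value)
{
    return (value & 0x00FFFF00u) >> 8;
}
```

For 0x01ABCDEF the low byte 0xEF has bit 7 set, so the first function returns 0x01ABCD3C, and middle_bytes returns 0xABCD.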

Resources