explanation of assembly code - syntax

Can anybody explain this piece of assembly code to me?
LINEAR_DATA_SEL equ $-gdt
dw 0FFFFh
dw 0
db 0
db 92h ; present, ring 0, data, expand-up, writable
db 0CFh ; page-granular (4 gig limit), 32-bit
db 0
I have already googled the directives equ, dw and db, but I can't understand what this code actually does (especially the first line). What is this $-gdt, and what do the parameters of dw and db mean? Kindly explain in detail if possible. Thanks in advance.

It's actually an 8-byte entry in the global descriptor table. It creates a descriptor addressing the entire 4G address space as a selector.
The equ $-gdt sets up a value in the assembler equal to the difference between this location ($) and the gdt label. In other words, it's the offset of this entry within the GDT itself.
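The structure of a GDT entry is as follows, listed from the lowest byte (the letters match the fields in the listing further down):
(a) bytes 0-1: limit, bits 0-15
(b) bytes 2-3: base, bits 0-15
(c) byte 4: base, bits 16-23
(d) byte 5: access byte
(e) byte 6: limit, bits 16-19 (low 4 bits) and flags (high 4 bits)
(f) byte 7: base, bits 24-31
The individual parts are explained below.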
For your specific values:
(a) dw 0FFFFh
(b) dw 0
(c) db 0
(d) db 92h ; present, ring 0, data, expand-up, writable
(e) db 0CFh ; page-granular (4 gig limit), 32-bit
(f) db 0
The base address is calculated from the f, c and b fields, from most significant to least - because these are all zero, the base is at zero.
The selector limit is calculated from the rightmost 4 bits of e and all of a, to give 0xfffff in this case. This has 1 added to it to give 0x100000. The next paragraph explains what this means.
The top 4 bits of e (the flags) set the granularity (4K rather than 1 byte) and the operand size (32-bit). With a granularity of 4K (12 bits) and page count of 0x100000 (20 bits), that gives you your full 32-bit (4G) address space.
The d field is the access byte and sets the following properties based on 0x92:
Pr: present (in-memory) bit is 1 (true).
Privl: privilege level is 0 (you need to be in ring 0 to get access).
S: descriptor type bit is 1 (a code/data segment rather than a system segment).
Ex: executable bit is 0 (data selector).
DC: direction bit is 0, so the segment grows up.
RW: is 1, so the memory is writable.
Ac: accessed bit is 0.
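The field arithmetic described above can be checked with a small C sketch. This is only an illustration: the struct and function names (gdt_fields, gdt_decode) are made up here, and the input is assumed to be the 8 descriptor bytes in memory order (a through f).

```c
#include <stdint.h>

/* Decoded view of one 8-byte GDT descriptor (names are illustrative). */
struct gdt_fields {
    uint32_t base;    /* 32-bit segment base, from fields b, c and f      */
    uint32_t limit;   /* raw 20-bit limit, from field a and low 4 bits of e */
    uint8_t  access;  /* access byte, field d                             */
    uint8_t  flags;   /* high 4 bits of field e                           */
    uint64_t size;    /* addressable bytes after applying granularity     */
};

struct gdt_fields gdt_decode(const uint8_t e[8])
{
    struct gdt_fields f;

    f.limit  = (uint32_t)e[0] | ((uint32_t)e[1] << 8)
             | ((uint32_t)(e[6] & 0x0F) << 16);
    f.base   = (uint32_t)e[2] | ((uint32_t)e[3] << 8)
             | ((uint32_t)e[4] << 16) | ((uint32_t)e[7] << 24);
    f.access = e[5];
    f.flags  = (uint8_t)(e[6] >> 4);

    /* limit + 1 units; each unit is 4 KiB when the granularity flag is set */
    f.size = (uint64_t)f.limit + 1;
    if (f.flags & 0x8)
        f.size <<= 12;
    return f;
}
```

For the bytes FF FF 00 00 00 92 CF 00 this gives a base of 0, a raw limit of 0xFFFFF and a size of 0x100000000 bytes, i.e. the full 4G address space.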

db/dw mean data byte/data word. This is some data; without context it could mean anything, which is why there are comments. equ means "equal" and is used to define constants. I guess gdt is defined somewhere else as the address of (pointer to) the Global Descriptor Table.
There's a GDT tutorial here that uses the same constants for a function call:
/* Setup a descriptor in the Global Descriptor Table */
void gdt_set_gate(int num, unsigned long base, unsigned long limit, unsigned char access, unsigned char gran)
[...]
/* The third entry is our Data Segment. It's EXACTLY the
* same as our code segment, but the descriptor type in
* this entry's access byte says it's a Data Segment */
gdt_set_gate(2, 0, 0xFFFFFFFF, 0x92, 0xCF);

http://en.wikibooks.org/wiki/X86_Assembly/Global_Descriptor_Table#GDT
dw and db are 'define word' and 'define byte', respectively, but NOT 'define' in the C-style sense. They emit data of word and byte size at the current position (in x86 assemblers a word is 16 bits and a byte is 8 bits).

Finding the address of bits in a byte memory location

I have a memory location at 0x31011 that stores 16, 1 bit symbols' data. Each of these symbols has a different starting bit location of where their information is stored:
1st Symbol: 0x31011 --> Starting bit at bit 0
2nd Symbol: 0x31011 --> Starting bit at bit 1
3rd Symbol: 0x31011 --> Starting bit at bit 2
...
14th Symbol: 0x31011 --> Starting bit at bit 13
15th Symbol: 0x31011 --> Starting bit at bit 14
16th Symbol: 0x31011 --> Starting bit at bit 15
The hex to binary conversion of 0x31011 is: 110001000000010001
I would love some enlightenment on my understanding of how one can retrieve the address for each of these symbols, if that is even possible. My understanding is that I will have to use the base address at 0x31011 for each of these symbols and then extract the value from this location and then use bit masking with the AND operator and then shift right. So my original thought of being able to call a subaddress of the base address is not possible.
For example, if I want to find the value of the 1st symbol at bit 0, assume value at 0x31011 is 0000001100110001:
1: Retrieve the value at 0x31011--> 0000001100110001
2: Apply a mask of: C7 --------------> 0000000011000111
-------------------------------------------------------------------------------- &
3: Find value of 1st symbol 0000000000000001
If I want to find the value of the 2nd symbol at bit 1, assume value at 0x31011 is 0000001100110001:
1: Retrieve the value at 0x31011--> 0000001100110001
2: Apply a mask of: C6 --------------> 0000000011000110
-------------------------------------------------------------------------------- &
3: Find value of 2nd symbol ------> 0000000000000000
4: Right shift result >> 1 ----------> 0000000000000000
If I want to find the value of the 3rd symbol at bit 2, assume value at 0x31011 is 0000001100110001:
1: Retrieve the value at 0x31011--> 0000001100110001
2: Apply a mask of: C6 --------------> 0000000011000110
-------------------------------------------------------------------------------- &
3: Find value of 3rd symbol ------> 0000000000000000
4: Right shift result >> 2 ----------> 0000000000000000
....
If I want to find the value of the 16th symbol at bit 15, assume value at 0x31011 is 0000001100110001:
1: Retrieve the value at 0x31011--> 0000001100110001
2: Apply a mask of: 80C6 -----------> 1000000011000110
-------------------------------------------------------------------------------- &
3: Find value of 16th symbol ------> 0000000000000000
4: Right shift result >> 15 ---------> 0000000000000000
Memory is normally only byte-addressable, not bit-addressable. To represent the address of a single bit, you need a regular address and a bit offset.
However, some ISAs do have some capacity for bit-addressable memory. The 8051 microcontroller's bit-addressable memory is one notable example. But the bit set/clear/complement and branch-on-bit instructions are only available with "direct" addressing modes, so you can't pass around an address of a bit, unless you use self-modifying code. Addresses 00-7F are full bytes when used with byte load/store instructions, but refer to the individual bits of a 16-byte region (bytes 20h-2Fh) when used with bit instructions.
Outside of special ISA features like instructions that can use bit addressing, you have a software problem. A single bit is not directly addressable.
You can only read it as part of the byte or word that contains it.
You can certainly represent its location as a normal byte-address + a bit offset, though. For example, a C function like this is easy to implement on any normal ISA that has pointer registers and a right shift:
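#include <stdbool.h>

/* Read one bit, given the address of its byte and its offset (0..7) in that byte. */
bool get_bit(const char *location, int bit_offset) {
    unsigned char tmp = *location;       /* load the whole byte */
    return (tmp >> bit_offset) & 1;      /* move the wanted bit to bit 0 and isolate it */
}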
For example on x86-64:
movzx eax, byte [rdi] ; load the byte
bt eax, esi ; CF = bit at position ESI
setc al ; set AL = CF
ret
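Applied to the 16 one-bit symbols from the question, the same shift-and-mask idea might look like the sketch below. The function names are made up for illustration, and the 16-bit word is assumed to have been loaded from the base address already. Note that the mask for bit n is simply 1 << n (0x0001, 0x0002, ..., 0x8000).

```c
#include <assert.h>
#include <stdint.h>

/* Shift-first form: move bit n down to bit 0, then keep only that bit. */
unsigned get_symbol(uint16_t word, unsigned n)
{
    assert(n < 16);
    return (word >> n) & 1u;
}

/* Mask-first form: isolate bit n with the mask (1 << n), then shift it down. */
unsigned get_symbol_masked(uint16_t word, unsigned n)
{
    assert(n < 16);
    return (unsigned)((word & (1u << n)) >> n);
}
```

With the example value 0000001100110001 (0x0331), symbol 1 (bit 0) is 1, symbol 2 (bit 1) is 0 and symbol 16 (bit 15) is 0.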
On a machine where real addresses don't need as many bits as a register is wide, you could encode your bit-address into a single integer value with the bit-offset in the low 3 bits. (Or more bits for machines with larger bytes). Then to use it, you'd right shift to get an actual machine address and also mask to extract the bit-offset for use as a right-shift count.
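A minimal sketch of that encoding in C, assuming 8-bit bytes. To keep it portable it packs a byte index relative to a base pointer rather than a raw machine address, and the names make_bit_addr and read_bit are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a byte index and a bit offset (0..7) into one integer:
   the bit offset lives in the low 3 bits. */
size_t make_bit_addr(size_t byte_index, unsigned bit)
{
    assert(bit < 8);
    assert(byte_index <= (SIZE_MAX >> 3));   /* the top 3 bits must be free */
    return (byte_index << 3) | bit;
}

/* Unpack: right shift gives the byte index, masking gives the shift count. */
int read_bit(const unsigned char *base, size_t bit_addr)
{
    size_t   byte_index = bit_addr >> 3;
    unsigned bit        = (unsigned)(bit_addr & 7);
    return (base[byte_index] >> bit) & 1;
}
```

For example, make_bit_addr(1, 1) yields 9, and read_bit(buf, 9) reads bit 1 of buf[1].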

algorithm of addressing a triangle matrices memory using assembly

I am working on an assembly project (using NASM) about Pascal's triangle.
The project has to calculate Pascal's triangle from line 0 to line 63.
My first problem is where to store the results of the calculation: in memory.
My second problem is what memory layout to use. I can see three options. The first is to declare a full matrix, like this:
memoryLabl: resd 64*64 ; 64 rows of 64 columns each
The problem with this option is that half of the matrix is never used, which makes my program inefficient. So let's look at the second option,
which is to declare a separate label for every line,
for example:
line0: dd 1
line1: dd 1,1
line2: dd 1,2,1 ; with pre-filled data for example purposes
...
line63: resd 64 ; reserve space for 64 dword entries
but this amounts to writing everything out by hand.
Some of my classmates tried to use a macro for this, as you can see here, but I don't understand it.
The last option, which is the one I used, is like the first one but with a triangular matrix: I reserve only the amount of memory that I need.
Storing lines 0 to 63 of Pascal's triangle gives a triangular matrix, because every new line has one more cell than the line before.
I allocated 2080 dwords for the triangular matrix. Here is where the 2080 comes from:
line0 has 1 dword /* 1 number in the first line */
line1 has 2 dwords /* 2 numbers in the second line */
line2 has 3 dwords /* 3 numbers in the third line */
...
line63 has 64 dwords /* 64 numbers in the final line */
so the total is 1 + 2 + ... + 64 = 2080, with one dword per number.
Now that the memory for the results is reserved, let's start the calculation.
First, every cell in column 0 of Pascal's triangle has the value 1.
Here is pseudocode showing how I put 1 in the first cell of every line:
s=0
for(i=0;i<64;i++):
    s = s+i
    mov dword[x+s*4],1 /* x is the address of the triangular matrix */
Second, the last cell of each line in Pascal's triangle is also 1.
Again in pseudocode, to keep it simple:
s=0
for(i=2;i<=64;i++):
    s = s+i
    mov dword[x+s*4],1
Here i is the length of the line being finished, so i=2 writes the last cell of line1. I start from i=2 because line0 holds only one value, which the first loop already set.
After these two loops the triangle looks like this in memory:
1
1 1
1 . 1
1 . . 1
1 . . . 1
1 . . . . 1
...
1 . . . . . . . . 1
(a dot marks a cell that has not been filled in yet)
Now comes the hard part: using these values to fill in all the remaining cells of the triangle.
The idea is as follows.
Writing a cell as cell[line][col],
we have cell[2][1] = cell[1][0] + cell[1][1]
and cell[3][1] = cell[2][0] + cell[2][1]
cell[3][2] = cell[2][1] + cell[2][2]
and in general we have
cell[line][col] = cell[line-1][col-1] + cell[line-1][col]
My problem is that I could not turn this relation into ASM instructions, because the
triangular matrix is awkward to index. Can anyone help me express it as an index formula, very basic pseudocode, or asm code?
TL:DR: you just need to traverse the array sequentially, so you don't have to work out the indexing. See the 2nd section.
To random-access index into a (lower) triangular matrix, note that row r starts after a triangle of size r-1. A triangle of size n has n*(n+1)/2 total elements, using Gauss's formula for the sum of the numbers from 1 to n. So a triangle of size r-1 has (r-1)*r/2 elements. Indexing a column within a row is of course trivial once we know the address of the start of the row.
Each DWORD element is 4 bytes wide, and we can take care of that scaling as part of the multiply, because lea lets us shift and add as well as put the result in a different register. We simplify n*(n-1)/2 elements * 4 bytes / elem to n*(n-1) * 2 bytes.
The above reasoning works for 1-based indexing, where row 1 has 1 element. We have to adjust for that if we want zero-based indexing by adding 1 to row indices before the calculation, so we want the size of a triangle with r+1 - 1 rows, thus r*(r+1)/2 * 4 bytes. It helps to put the linear byte offsets into a triangle to quickly double-check the formula:
0
4 8
12 16 20
24 28 32 36
40 44 48 52 56
60 64 68 72 76 80
84 88 92 96 100 104 108
The 4th row, which we're calling "row 3", starts 24 bytes from the start of the whole array. That's (3+1)*(3+1-1) * 2 = (3+1)*3 * 2; yes the r*(r+1)/2 formula works.
;; given a row number in EDI, and column in ESI (zero-extended into RSI)
;; load triangle[row][col] into eax
lea ecx, [2*rdi + 2]
imul ecx, edi ; ecx = r*(r+1) * 2 bytes
mov eax, [triangle + rcx + rsi*4]
This assumes 32-bit absolute addressing is OK (see "32-bit absolute addresses no longer allowed in x86-64 Linux?"). If not, use a RIP-relative LEA to get the triangle base address in a register, and add that to rsi*4. x86 addressing modes can only have 3 components when one of them is a constant. But that is the case here for your static triangle, so we can take full advantage by using a scaled index for the column, the calculated row offset as the base, and the actual array address as the displacement.
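For comparison, the zero-based indexing formula can be written in C as below. This is just a sketch to sanity-check the arithmetic against the offset triangle shown earlier; the function names are made up, and the element size of 4 bytes is the dword size used in this question.

```c
#include <assert.h>
#include <stddef.h>

/* Linear element index of triangle[row][col], zero-based.
   Row r starts after a triangle of r rows, i.e. after r*(r+1)/2 elements. */
size_t tri_index(size_t row, size_t col)
{
    assert(col <= row);                 /* only the lower triangle exists */
    return row * (row + 1) / 2 + col;
}

/* Byte offset for 4-byte (dword) elements: r*(r+1)*2 + col*4. */
size_t tri_byte_offset(size_t row, size_t col)
{
    return tri_index(row, col) * 4;
}
```

tri_byte_offset(3, 0) is 24, matching the start of "row 3" in the triangle of offsets, and tri_index(63, 63) is 2079, the last of the 2080 elements.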
Calculating the triangle
The trick here is that you only need to loop over it sequentially; you don't need random access to a given row/column.
You read one row while writing the one below. When you get to the end of a row, the next element is the start of the next row. The source and destination pointers will get farther and farther from each other as you go down the rows, because the destination is always 1 whole row ahead. And you know the length of a row = row number, so you can actually use the row counter as the offset.
global _start
_start:
mov esi, triangle ; src = address of triangle[0,0]
lea rdi, [rsi+4] ; dst = address of triangle[1,0]
mov dword [rsi], 1 ; triangle[0,0] = 1 special case: there is no source
.pascal_row: ; do {
mov rcx, rdi ; RCX = one-past-end of src row = start of dst row
xor eax, eax ; EAX = triangle[row-1][col-1] = 0 for first iteration
;; RSI points to start of src row: triangle[row-1][0]
;; RDI points to start of dst row: triangle[row ][0]
.column:
mov edx, [rsi] ; tri[r-1, c] ; will load 1 on the first iteration
add eax, edx ; eax = tri[r-1, c-1] + tri[r-1, c]
mov [rdi], eax ; store to triangle[row, col]
add rdi, 4 ; ++dst
add rsi, 4 ; ++src
mov eax, edx ; becomes col-1 src value for next iteration
cmp rsi, rcx
jb .column ; }while(src < end_src)
;; RSI points to one-past-end of src row, i.e. start of next row = src for next iteration
;; RDI points to last element of dst row (because dst row is 1 element longer than src row)
mov dword [rdi], 1 ; [r,r] = 1 end of a row
add rdi, 4 ; this is where dst-src distance grows each iteration
cmp rdi, end_triangle
jb .pascal_row
;;; triangle is constructed. Set a breakpoint here to look at it with a debugger
xor edi,edi
mov eax, 231
syscall ; Linux sys_exit_group(0), 64-bit ABI
section .bss
; you could just as well use resd 64*65/2
; but put a label on each row for debugging convenience.
ALIGN 16
triangle:
%assign i 0
%rep 64
row %+ i: resd i + 1
%assign i i+1
%endrep
end_triangle:
I tested this and it works: correct values in memory, and it stops at the right place. But note that integer overflow happens before you get down to the last row. This would be avoided if you used 64-bit integers (a simple change to register names and offsets, and don't forget to change resd to resq). 64 choose 32 is 1832624140942590534 = 2^60.66, so even the next row after the last one would still fit.
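The same sequential traversal, written as a C sketch with 64-bit elements so that nothing overflows. It is not a translation of the exact register usage above, only of the idea: one source pointer walks the previous row while the destination pointer writes the next one. ROWS, CELLS and build_pascal are names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS  64
#define CELLS (ROWS * (ROWS + 1) / 2)   /* 2080 elements for rows 0..63 */

/* Fill a packed lower-triangular array with Pascal's triangle. */
void build_pascal(uint64_t tri[CELLS])
{
    const uint64_t *src = tri;          /* start of the row being read    */
    uint64_t       *dst = tri + 1;      /* start of the row being written */

    tri[0] = 1;                         /* row 0 has no source row        */
    for (size_t row = 1; row < ROWS; row++) {
        uint64_t left = 0;              /* tri[row-1][col-1], 0 at col 0  */
        for (size_t col = 0; col < row; col++) {
            uint64_t above = *src++;    /* tri[row-1][col]                */
            *dst++ = left + above;
            left = above;               /* reuse it as the next "left"    */
        }
        *dst++ = 1;                     /* last cell of the row           */
    }
}
```

After the outer loop, src points at the start of row 63 and dst is one past the end of the array, mirroring the pointer invariants noted in the assembly comments.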
The %rep block to reserve space and label each row as row0, row1, etc. is from my answer to the question you linked about macros, much more sane than the other answer IMO.
You tagged this NASM, so that's what I used because I'm familiar with it. The syntax you used in your question was MASM (until the last edit). The main logic is the same in MASM, but remember that you need OFFSET triangle to get the address as an immediate, instead of loading from it.
I used x86-64 because 32-bit is obsolete, but I avoided too many registers, so you can easily port this to 32-bit if needed. Don't forget to save/restore call-preserved registers if you put this in a function instead of a stand-alone program.
Unrolling the inner loop could save some instructions copying registers around, as well as the loop overhead. This is a somewhat optimized implementation, but I mostly limited it to optimizations that make the code simpler as well as smaller / faster. (Except maybe for using pointer increments instead of indexing.) It took a while to make it this clean and simple. :P
Different ways of doing the array indexing would be faster on different CPUs. e.g. perhaps use an indexed addressing mode (relative to dst) for the loads in the inner loop, so only one pointer increment is needed. But if you want it to run fast, SSE2 or AVX2 vpaddd could be good. Shuffling with palignr might be useful, but probably also unaligned loads instead of some of the shuffling, especially with AVX2 or AVX512.
But anyway, this is my version; I'm not trying to write it the way you would, you need to write your own for your assignment. I'm writing for future readers who might learn something about what's efficient on x86. (See also the performance section in the x86 tag wiki.)
How I wrote that:
I started writing the code from the top, but quickly realized that off-by-one errors were going to be tricky, and I didn't want to just write it the stupid way with branches inside the loops for special cases.
What ended up helping was writing the comments for the pre and post conditions on the pointers for the inner loop. That made it clear I needed to enter the loop with eax=0, instead of with eax=1 and storing eax as the first operation inside the loop, or something like that.
Obviously each source value only needs to be read once, so I didn't want to write an inner loop that reads [rsi] and [rsi+4] or something. Besides, that would have made it harder to get the boundary condition right (where a non-existent value has to read as 0).
It took some time to decide whether I was going to have an actual counter in a register for row length or row number, before I ended up just using an end-pointer for the whole triangle. It wasn't obvious before I finished that using pure pointer increments / compares was going to save so many instructions (and registers when the upper bound is a build-time constant like end_triangle), but it worked out nicely.

How to find the physical address of interrupts in interrupt vector table?

How do i calculate the physical address of any given interrupt (INT22H or INT15H for instance) in the interrupt vector table for 8086 microprocessor?
...calculate the physical address of any given interrupt (INT22H or INT15H for instance) in the interrupt vector table...
Physical address where the int 15h instruction finds the far pointer that it should call.
This is an offset within the Interrupt Vector Table, and so gives a physical address aka linear address from the list {0,4,8,12, ... , 1016,1020}.
Since each vector is 4 bytes long, all it takes is multiplying the interrupt number by 4.
mov ax,0415h ;AL=Interrupt number, AH=4
mul ah ; -> Product in AX
cwd ;(*) -> Result in DX:AX=[0,1023]
(*) I like all my linear addresses expressed as DX:AX. That's why I used the seemingly unnecessary cwd instruction.
Physical address where int 15h ultimately gets handled.
This can be anywhere in the 1MB memory. (On 8086 there's no memory beyond 1MB).
Each 4 byte vector consists of an offset word followed by a segment word. The order is important.
The linear address is calculated from multiplying the segment value by 16 and adding the offset value.
;Assumes DS=0 so that the memory operands address the vector table
mov ax,16
mul word ptr [0015h * 4 + 2] ;Segment in high word -> Product in DX:AX
add ax, [0015h * 4] ;Offset in low word
adc dx, 0 ; -> Result in DX:AX=[0,1048575]
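Both calculations can be cross-checked with a short C sketch. The function names are invented for this example, and the table is passed in as a plain byte array standing in for the 1024 bytes at physical address 0. The result is masked to 20 bits on the assumption that the 8086's 20 address lines make larger sums wrap around.

```c
#include <stdint.h>

/* Address of the 4-byte vector for a given interrupt number: n * 4. */
uint32_t ivt_slot_address(uint8_t int_no)
{
    return (uint32_t)int_no * 4u;
}

/* Real-mode segment:offset to linear address: segment * 16 + offset.
   Masked to 20 bits (assumption: 8086 address wrap-around). */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

/* Read a vector (offset word first, then segment word) and resolve it. */
uint32_t handler_address(const uint8_t *ivt, uint8_t int_no)
{
    const uint8_t *v = ivt + ivt_slot_address(int_no);
    uint16_t offset  = (uint16_t)(v[0] | (v[1] << 8));
    uint16_t segment = (uint16_t)(v[2] | (v[3] << 8));
    return linear_address(segment, offset);
}
```

For INT 15h the vector sits at 0x54 and for INT 22h at 0x88. A vector holding the made-up far pointer F000:F859 would resolve to linear address 0xFF859.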

Shift elements to the left of a SIMD register based on boolean mask [duplicate]

This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector
I would like to create an optimal function with this signature:
__m256i PackLeft(__m256i inputVector, __m256i boolVector);
The desired behaviour is that on an input of 64bit int like this:
inputVector = {42, 17, 13, 3}
boolVector = {true, false, true, false}
It masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value should be:
{42, 13, X, X}
... Where X is "I don't care".
An obvious way to do this is to use _mm_movemask_epi8 to get an integer bitmask out of the bool vector, look up the shuffle mask in a table and then do a shuffle with the mask.
However, I would like to avoid a lookup table if possible. Is there a faster solution?
This is covered quite well by Andreas Fredriksson in his 2015 GDC talk: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf
Starting on slide 104, he covers how to do this using only SSSE3 and then using just SSE2.
I just saw this problem. Perhaps you have already solved it, but I am still writing up the logic for other programmers who may need to handle this situation.
The solution (in Intel ASM format) is given below. It consists of three steps:
Step 0: convert the 8-bit mask into a 64-bit mask, with each set bit in the original mask represented as 8 set bits in the expanded mask.
Step 1: use this expanded mask to extract the relevant bits from the source data.
Step 2: since you require the data to be left packed, shift the output by the appropriate number of bits.
The code is as below :
; Step 0 : convert the 8 bit mask into a 64 bit mask
xor r8,r8
movzx rax,byte ptr mask_pattern
mov r9,rax ; save a copy of the mask - avoids a memory read in Step 2
mov rcx,8 ; size of mask in bit count
outer_loop :
shr al,1 ; get the least significant bit of the mask into CY
setnc dl ; set DL to 0 if CY=1, else 1
dec dl ; if mask lsb was 1, then DL is 0FFh, else it is 00h
shrd r8,rdx,8 ; shift that byte into the top of R8
loop outer_loop
; R8 now holds the mask expanded to one byte (00h or 0FFh) per original mask bit
; Step 1 : extract the selected bytes, compressed into the low-order end of RAX
mov rax,qword ptr data_pattern
pext rax,rax,r8
; Step 2 : shift left, as left-packed output is required
popcnt r9,r9 ; get the count of bits set in the mask
mov rcx,8
sub cl,r9b ; compute 8-(count of bits set to 1 in the mask)
shl cl,3 ; convert the count of bytes to a count of bits
shl rax,cl
;The required data is in RAX
Trust this helps
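As a portable reference for what the three steps compute, here is a scalar C sketch. It is not the SIMD or pext-based code itself, just the same result for 8 one-byte elements held in a 64-bit word, with "left" taken to mean the most significant end as in the assembly above. The name pack_bytes_left is made up for this example.

```c
#include <stdint.h>

/* Keep the bytes of `data` whose bit is set in `mask` (bit i selects byte i),
   pack them together in order, then move them to the most significant end. */
uint64_t pack_bytes_left(uint64_t data, uint8_t mask)
{
    uint64_t packed = 0;
    unsigned kept = 0;                       /* number of bytes kept so far */

    for (unsigned i = 0; i < 8; i++) {
        if ((mask >> i) & 1u) {
            uint64_t byte = (data >> (8 * i)) & 0xFF;
            packed |= byte << (8 * kept);    /* same effect as pext on bytes */
            kept++;
        }
    }
    if (kept == 0)
        return 0;                            /* avoid an undefined 64-bit shift */
    return packed << (8 * (8 - kept));       /* left-pack the result */
}
```

For example, with data 0x8877665544332211 and mask 00000101b, bytes 0x11 and 0x33 are kept and the result is 0x3311000000000000.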

Strange pointer arithmetic

I came across some very strange pointer-arithmetic behaviour. I am developing a program to access an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector on my SD card contains data (in hex) like this (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
Printing the individual bytes works perfectly.
char *ch = buffer; /* char buffer[512]; */
for(i=0; i<12; i++)
debug("%x ", *ch++);
Here the debug function sends its output to the UART.
However, pointer arithmetic that adds a number which is not a multiple of 4 gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -O0 and -Os, but nothing changed.
I thought of little/big endian issues, but here I am getting unexpected data (from the previous bytes) and no order reversal.
Your CPU architecture must support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using STM32, which is an ARM-based cortex).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what's the address of your buffer, but at least one of the three uint32_t read attempts that you describe in your question, requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t x = *(uint8_t*)y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
typedef struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
} myStruct_t;
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (the hardware may have to split them into several aligned accesses). Ideally, one should strive to follow rule #1 even if the CPU does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
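Both rules can be checked mechanically. The sketch below assumes a C11 compiler (for _Static_assert); the type example_t, the macro FIELD_IS_ALIGNED and the function is_aligned are names invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t a;   /* offset 0 */
    uint8_t  b;   /* offset 2 */
    uint8_t  c;   /* offset 3 */
    uint32_t d;   /* offset 4 */
    uint64_t e;   /* offset 8 */
} example_t;

/* Rule #2: a field's offset must be divisible by the field's own size. */
#define FIELD_IS_ALIGNED(type, field) \
    (offsetof(type, field) % sizeof(((type *)0)->field) == 0)

/* Compile-time checks: the build fails if the layout ever breaks the rule. */
_Static_assert(FIELD_IS_ALIGNED(example_t, d), "field d is misaligned");
_Static_assert(FIELD_IS_ALIGNED(example_t, e), "field e is misaligned");

/* Rule #1: run-time check that an address is divisible by n. */
int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p % n) == 0;
}
```

A call such as is_aligned(ptr, sizeof(uint64_t)) can then guard a cast to a structure pointer before it is dereferenced.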
LDR is the ARM instruction to load data. You have lied to the compiler by claiming that the pointer points to a 32-bit value; it is not aligned properly, and you pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically your code is like,
p = (uint32_t*)((char*)buffer + 0);
uint32_t v = *p; // aligned load of the containing word
v = (v>>16)|(v<<16); // rotate right by 8 * 2 bits
debug("%x ", v);
but has encoded to one instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen*.
In this case, the reason two of your examples behaved as you expected is that you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
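Another option that avoids both alignment and host byte-order assumptions is to assemble the value from individual bytes. The sketch below does that, and also models the "aligned read plus rotation" behaviour described in the LDR documentation quoted earlier so the two results can be compared. All function names are made up for this example, and the data is assumed to be little-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from any address, one byte at a time. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

uint32_t rotate_right32(uint32_t v, unsigned bits)
{
    bits &= 31;
    return bits ? (v >> bits) | (v << (32 - bits)) : v;
}

/* Model of a rotating LDR: load the aligned word that contains `offset`,
   then rotate right by 8 * (offset & 3). `aligned_base` is assumed to be
   word-aligned. */
uint32_t rotated_ldr(const unsigned char *aligned_base, size_t offset)
{
    uint32_t word = load_le32(aligned_base + (offset & ~(size_t)3));
    return rotate_right32(word, 8u * (unsigned)(offset & 3));
}
```

With the sector bytes from the question, load_le32(buffer + 2) assembles the bytes 01 34 21 45 into 0x45213401, while the rotation model applied to offset 2 yields the first word rotated by 16 bits.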
