To read and then write a byte from and to memory, you can choose between a few options, among them:
Load a 32-bit value, write out a byte:
mov eax, [rdi]
mov byte [rsi], al
Load a byte, write a byte:
mov al, byte [rdi]
mov byte [rsi], al
Load a byte, zero extended, write a byte:
movzx eax, byte [rdi]
mov byte [rsi], al
All the approaches are the same on the write side, but they use different ways of reading the value. Is there any reason to prefer one over the others, particularly performance-wise?
Related
My goal in this code is to find the smallest number in the list. I used bubble sort method in this case; unfortunately, the code is not giving me the smallest/minimum number. Please take a look, Thanks:
include irvine32.inc
.data
input byte 100 dup(0)
stringinput byte "Enter any string: ",0
totallength byte "The total length is: ",0
minimum byte "The minimum value is: ",0
.code
stringLength proc
push ebp
mov ebp, esp
push ebx
push ecx
mov eax, 0
mov ebx, [ebp+8]
L1:
mov ecx, [ebx] ;you can use ecx, cx, ch, cl
cmp ecx, 0 ;you can use ecx, cx, ch, cl
JE L2
add ebx, 1
add eax, 1
jmp L1
L2:
pop ecx
pop ebx
mov ebp, esp
pop ebp
ret 4
stringLength endp
BubbleSort PROC uses ECX
push edx
xor ecx,ecx
mov ecx, 50
OUTER_LOOP:
push ecx
xor ecx,ecx
mov ecx,14
mov esi, OFFSET input
COMPARE:
xor ebx,ebx
xor edx,edx
mov bl, byte ptr ds:[esi]
mov dl, byte ptr ds:[esi+1]
cmp bl,dl
jg SWAP1
CONTINUE:
add esi,2
loop COMPARE
mov esi, OFFSET input
pop ecx
loop OUTER_LOOP
jmp FINISHED
SWAP1:
xchg bl,dl
mov byte ptr ds:[esi+1],dl
mov byte ptr ds:[esi],bl
jmp CONTINUE
FINISHED:
pop edx
ret 4
BubbleSort ENDP
main proc
call clrscr
mov edx, offset stringinput
call writeString
mov edx, offset input
call writeString
call stringLength
mov edx, offset input
mov ecx, sizeof input
call readstring
call crlf
mov edx,offset totallength
call writestring
call writedec
call crlf
mov edx, offset minimum
call crlf
call writeString
push offset input
call BubbleSort
mov edx, offset input
call writeString
call crlf
exit
main endp
end main
I haven't looked over your code, because sorting is an over complicated method for what you want to do. Not only that, but most of us don't pay too much attention to uncommented code. Just takes to long to figure out what you're trying to do.
Simply iterate through the entire list and start with 255 (FFH) in AL let's say. Each time you come across a number that is smaller than the one in AL, then replace it with that value and then when loop is finished, AL will have the lowest value.
If you need to know where it is in the list, you could maybe use AH which would be the difference between start address and current address. Knowledge of the instruction set is essential as finding the length of the string can be simplified by;
mov di, input ; Point to beginning of buffer
mov cx, -1 ; for a maximum of 65535 characters
xor al, al ; Looking for NULL
rep scasb
neg cx
dec cx ; CX = length of string.
Remember, ES needs to point to #DATA
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
I've got the following code but I can't work out why if the number I enter is too high does it return the wrong number. It might be because of the data types and dividing and multiplying but I can't work out exactly why. if you know of why I would be grateful for the help.
.586
.model flat, stdcall
option casemap :none
.stack 4096
extrn ExitProcess#4: proc
GetStdHandle proto :dword
ReadConsoleA proto :dword, :dword, :dword, :dword, :dword
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
STD_INPUT_HANDLE equ -10
STD_OUTPUT_HANDLE equ -11
.data
bufSize = 80
inputHandle DWORD ?
buffer db bufSize dup(?)
bytes_read DWORD ?
sum_string db "The number was ",0
outputHandle DWORD ?
bytes_written dd ?
actualNumber dw 0
asciiBuf db 4 dup (0)
.code
main:
invoke GetStdHandle, STD_INPUT_HANDLE
mov inputHandle, eax
invoke ReadConsoleA, inputHandle, addr buffer, bufSize, addr bytes_read,0
sub bytes_read, 2 ; -2 to remove cr,lf
mov ebx,0
mov al, byte ptr buffer+[ebx]
sub al,30h
add [actualNumber],ax
getNext:
inc bx
cmp ebx,bytes_read
jz cont
mov ax,10
mul [actualNumber]
mov actualNumber,ax
mov al, byte ptr buffer+[ebx]
sub al,30h
add actualNumber,ax
jmp getNext
cont:
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov outputHandle, eax
mov eax,LENGTHOF sum_string ;length of sum_string
invoke WriteConsoleA, outputHandle, addr sum_string, eax, addr bytes_written, 0
mov ax,[actualNumber]
mov cl,10
mov bl,3
nextNum:
xor edx, edx
div cl
add ah,30h
mov byte ptr asciiBuf+[ebx],ah
dec ebx
mov ah,0
cmp al,0
ja nextNum
mov eax,4
invoke WriteConsoleA, outputHandle, addr asciiBuf, eax, addr bytes_written, 0
mov eax,0
mov eax,bytes_written
push 0
call ExitProcess#4
end main
Yes, it is plausible that your return value is capped by a maximum value. This maximum is either the BYTE boundary of 255 or the WORD boundary of 65536. Let me explain why, part by part:
mov inputHandle, eax
invoke ReadConsoleA, inputHandle, addr buffer, bufSize, addr bytes_read,0
sub bytes_read, 2 ; -2 to remove cr,lf
mov ebx,0
mov al, byte ptr buffer+[ebx]
sub al,30h
add [actualNumber],ax
In this part you are calling a Win32 API function, which always returns the return value in the register EAX. After it has returned, you assign the lower 8-bits of the 32-bit return value to byte ptr buffer+[ebx], subtract 30h from it. Then you MOV the 8-bit you just modified in AL and the 8-bit from the return-value preserved in AH as a block AX to a WORD variable by add [actualNumber],ax. So AH stems from the EAX return value and is quite of undefined. You may be lucky if it's 0, but that should not be assumed.
The next problem is the following sub-routine:
getNext:
inc bx
cmp ebx,bytes_read
jz cont
mov ax,10
mul [actualNumber]
mov actualNumber,ax
mov al, byte ptr buffer+[ebx]
sub al,30h
add actualNumber,ax
jmp getNext
You are moving the decimal base 10 to the WORD register AX and multiply it by the WORD variable [actualNumber]. So far, so good. But the result of a 16-bit*16-bit MUL is returned in the register pair AX:DX(lower:higher). So your mov actualNumber,ax solely MOVs the lower 16-bits to your variable (DX is ignored, limiting your result to result % 65536). So your maximum possible result is MAX_WORD = 65535. Everything else would just give you the modulo in AX.
After your mov al, byte ptr buffer+[ebx] your overwrite the lower 8-bits of this result with the BYTE pointed to by buffer[ebx] and then subtract 30h from it. Remember: the higher 8-bits of the result still remain in AH, the higher 8-bits of AX.
Then you (re)add this value to the variable actualNumber with add actualNumber,ax. Let me condense these last two paragraphs:
Operation | AX |
| AL AH |
mov actualNumber,ax | ................ |
mov al, byte ptr buffer+[ebx] | ........ AH |
sub al,30h | ....-30h AH |
add actualNumber,ax | ................ |
So, you are modifying the lower 8-bits of AX through AL and then add the higher 8-bits of actualNumber/AH to itself - effectively doubling AH and then adding this to actualNumber like this:
actualNumber = 256 * (2 * AH) + (byte ptr buffer[ebx]-30h) ; I doubt you want that ;-)
These problems may cause several deviations from the desired result.
If I do step through the debugger in Ollydbg I see
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
and register ESI = 0040855C and EBP = 00000000.
My problem is I dont know 2 register * 8
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
MOV - move
EAX - to EAX (generally this will be a value you just calculated)
DWORD PTR - from the value pointed at by
[DS: - in the data segment]
[ESI+EBP*8] - ESI plus 8 times EBP.
Move the value in EAX into the address pointed at by ESI + EBP*8 (ESI plus 8 times EBP, it means exactly how it's written)
This is probably being used to load data from an array, where the 8 is there to scale up the counter (which is EBP) to the size of the thing being stored (8 bytes), and ESI contains the address of the start of the array. So if EBP is zero, you store the data in ESI+0, if EBP=1, you end up storing at ESI+8, etc.
In normal INTEL syntax this instruction moves a value from memory into EAX.
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
It is usually used to extract a value from an array.
The array is situated in memory at DS:ESI.
The elements are indexed through EBP.
The scale of 8 means that every element is 64 bit long and this instruction only reads the low dword.
Why this following nasm code does not compile :
cmp byte [rdi], byte [rsi]
whereas this code compiles:
mov al, byte [rsi]
cmp byte [rdi], al
Two indirect arguments are not supported. And why not use cmpsb, which compares these two bytes directly?
I am using the following ASM routine to bubble sort an array. I want to know of the inefficiencies of my code:
.386
.model flat, c
option casemap:none
.code
public sample
sample PROC
;[ebp+0Ch]Length
;[ebp+08h]Array
push ebp
mov ebp, esp
push ecx
push edx
push esi
push eax
mov ecx,[ebp+0Ch]
mov esi,[ebp+08h]
_bubbleSort:
push ecx
push esi
cmp ecx,1
je _exitLoop
sub ecx,01h
_miniLoop:
push ecx
mov edx,DWORD PTR [esi+4]
cmp DWORD PTR [esi],edx
ja _swap
jmp _continueLoop
_swap:
lodsd
mov DWORD PTR [esi-4],edx
xchg DWORD PTR [esi],eax
jmp _skipIncrementESI
_continueLoop:
add esi,4
_skipIncrementESI:
pop ecx
loop _miniLoop
_exitLoop:
pop esi
pop ecx
loop _bubbleSort
pop eax
pop esi
pop edx
pop ecx
pop ebp
ret
sample ENDP
END
Basically I have two loops, as usual for the bubble sort algorithm. The value of ecx for the outer loop is 10, and for the inner loop it is [ecx-1]. I have tried the routine and it compiles and runs successfully, but I am not sure if it is efficient.
There are several things you can do to speed up your assembly code:
don't do things like ja label_1 ; jmp label_2. Just do jbe label_2 instead.
loop is a very slow instruction. dec ebx; jnz loopstart is much faster
use all registers instead of repeatedly push/pop ecx and esi. Use ebx and edi too.
jmp-targets should be well aligned. Use align 4 before the two loop-starts and after the jbe
Get yourself a manual for your cpu from Intel (you can download it as pdf), it has the timings for the opcodes, maybe it has other hints too.
Several simple tips:
1) Try to minimize the number of conditional jumps, because they are very expensive. Unroll if possible.
2) Reorder instructions to minimize stalls because of data depencency:
cmp DWORD PTR [esi],edx ;// takes some time to compute,
mov edx,DWORD PTR [esi+4] ;
ja _swap ;// waits for results of cmp
3) Avoid old composite instructions (dec, jnz pair is faster than loop and is not bound to ecx register)
It would be quite difficult to write assembly code that is faster than the code generated by optimizing C compiler, because you should consider lots of factors: size of data and instruction caches, alignments, pipeline, instruction timings. You can find some good documentation about this here. I especially recommend the first book: Optimizing software in C++
Substitute for "add esi,4" if we do not need a flag for this instruction:
_continueLoop:
lea esi,[esi+4]