how to optimise double dereferencing?

how to optimise double dereferencing? - algorithm

Very specific optimisation task.
I have 3 arrays:
const char* inputTape
const int* inputOffset, organised in a group of four
char* outputTapeoutput
which i must assemble output tape from input, according to following 5 operations:
int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte
and then advance counter.
All iterations are same and could be done all in parallel. Format of inputOffset could be a subject for change, but until same input will produce same output.
OpenCL on GPU fails on this algorithm (works same or even slower that cpu)
Assembly the best i got 5 mov, 1 lea, 1 dec instructions. Upd:
thanks to Peter Cordes little hint
loop_start:
mov eax,dword ptr [rdx-10h] ; selector offset
movzx r10d,byte ptr [rax+r8] ; selector value
mov eax,dword ptr [rdx+r10*4-0Ch] ; output offset
movzx r10d,byte ptr [r8+rax] ; output value
mov byte ptr [r9+rcx-1],r10b ; store to outputTape
lea rdx, [rdx-10h] ; pointer to inputOffset for current
dec ecx ; loop counter, sets zero flag if (ecx == 0)
jne loop_start ; continue looping while non zero iterations left: ( ecx != 0 )
How could i optimise this for SSE/AVX operation? i am stumbled...
UPD: better to see it than to hear it..

Related

How to print signed integer in x86 assembly (NASM) on Mac

I found an implementation of unsigned integer conversion in x86 assembly, and I tried plugging it in but being new to assembly and not having a debugging env there yet, it's difficult to understand why it's not working. I would also like it to work with signed integers so it can capture error messages from syscalls.
Wondering if one could show how to fix this code to get the signed integer to print, without using printf but using strprn provided by this answer.
%define a rdi
%define b rsi
%define c rdx
%define d r10
%define e r8
%define f r9
%define i rax
%define EXIT 0x2000001
%define EXIT_STATUS 0
%define READ 0x2000003 ; read
%define WRITE 0x2000004 ; write
%define OPEN 0x2000005 ; open(path, oflag)
%define CLOSE 0x2000006 ; CLOSE
%define MMAP 0x2000197 ; mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t offset)
; szstr computes the lenght of a string.
; rdi - string address
; rdx - contains string length (returned)
strsz:
xor rcx, rcx ; zero rcx
not rcx ; set rcx = -1 (uses bitwise id: ~x = -x-1)
xor al,al ; zero the al register (initialize to NUL)
cld ; clear the direction flag
repnz scasb ; get the string length (dec rcx through NUL)
not rcx ; rev all bits of negative -> absolute value
dec rcx ; -1 to skip the null-term, rcx contains length
mov rdx, rcx ; size returned in rdx, ready to call write
ret
; strprn writes a string to the file descriptor.
; rdi - string address
; rdx - contains string length
strprn:
push rdi ; push string address onto stack
call strsz ; call strsz to get length
pop rsi ; pop string to rsi (source index)
mov rax, WRITE ; put write/stdout number in rax (both 1)
mov rdi, 1 ; set destination index to rax (stdout)
syscall ; call kernel
ret
; mov ebx, 0xCCCCCCCD
itoa:
xor rdi, rdi
call itoal
ret
; itoa loop
itoal:
mov ecx, eax ; save original number
mul ebx ; divide by 10 using agner fog's 'magic number'
shr edx, 3 ;
mov eax, edx ; store quotient for next loop
lea edx, [edx*4 + edx] ; multiply by 10
shl rdi, 8 ; make room for byte
lea edx, [edx*2 - '0'] ; finish *10 and convert to ascii
sub ecx, edx ; subtract from original number to get remainder
lea rdi, [rdi + rcx] ; store next byte
test eax, eax
jnz itoal
exit:
mov a, EXIT_STATUS ; exit status
mov i, EXIT ; exit
syscall
_main:
mov rdi, msg
call strprn
mov ebx, -0xCCCCCCCD
call itoa
call strprn
jmp exit
section .text
msg: db 0xa, " Hello StackOverflow!!!", 0xa, 0xa, 0
With this working it will be possible to properly print signed integers to STDOUT, so you can log the registers values.
https://codereview.stackexchange.com/questions/142842/integer-to-ascii-algorithm-x86-assembly
How to print a string to the terminal in x86-64 assembly (NASM) without syscall?
How do I print an integer in Assembly Level Programming without printf from the c library?
https://baptiste-wicht.com/posts/2011/11/print-strings-integers-intel-assembly.html
How to get length of long strings in x86 assembly to print on assertion

My answer on How do I print an integer in Assembly Level Programming without printf from the c library? which you already linked shows that serializing an integer into memory as ASCII decimal gives you a length, so you have no use for (a custom version of) strlen here.
(Your msg has an assemble-time constant length, so it's silly not to use that.)
To print a signed integer, implement this logic:
if (x < 0) {
print('-'); // or just was_negative = 1
x = -x;
}
unsigned_intprint(x);
Unsigned covers the abs(most_negative_integer) case, e.g. in 8-bit - (-128) overflows to -128 signed. But if you treat the result of that conditional neg as unsigned, it's correct with no overflow for all inputs.
Instead of actually printing a - by itself, just save the fact that the starting number was negative and stick the - in front of the other digits after generating the last one. For bases that aren't powers of 2, the normal algorithm can only generate digits in reverse order of printing,
My x86-64 print integer with syscall answer treats the input as unsigned, so you should simply use that with some sign-handling code around it. It was written for Linux, but replacing the write system call number will make it work on Mac. They have the same calling convention and ABI.
And BTW, xor al,al is strictly worse than xor eax,eax unless you specifically want to preserve the upper 7 bytes of RAX. Only xor-zeroing of full registers is handled efficiently as a zeroing idiom.
Also, repnz scasb is not fast; about 1 compare per clock for large strings.
For strings up to 16 bytes, you can use a single XMM vector with pcmpeqb / pmovmskb / bsf to find the first zero byte, with no loop. (SSE2 is baseline for x86-64).

Assembly program crashes on call or exit

Debugging my code in VS2015, I get to the end of the program. The registers are as they should be, however, on call ExitProcess, or any variation of that, causes an "Access violation writing location 0x00000004." I am utilizing Irvine32.inc from Kip Irvine's book. I have tried using call DumpRegs, but that too throws the error.
I have tried using other variations of call ExitProcess, such as exit and invoke ExitProcess,0 which did not work either, throwing the same error. Before, when I used the same format, the code worked fine. The only difference between this code and the last one is utilizing the general purpose registers.
include Irvine32.inc
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;ESI will be used for the array
;EDI will be used for the array value
;ESP will be used for the array counting
;EAX will be used for the accumulating sum
;EBX will be used for the average
;ECX will be used for the remainder of avg
;EBP will be used for calculating remaining sum
mov eax,0 ;Set EAX register to 0
mov ebx,0 ;Set EBX register to 0
mov esp,0 ;Set ESP register to 0
mov esi,OFFSET ary ;Set ESI register to array
sum: mov edi,[esi] ;Set value to array value
cmp edi,0 ;Check value to temination value 0
je finsum ;If equal, jump to finsum
add esp,1 ;Add 1 to array count
add eax,edi ;Add value to sum
add esi,4 ;Increment to next address in array
jmp sum ;Loop back to sum array
finsum: mov ebp,eax ;Set remaining sum to the sum
cmp ebp,0 ;Compare rem sum to 0
je finavg ;Jump to finavg if sum is 0
cmp ebp,esp ;Check sum to array count
jl finavg ;Jump to finavg if sum is less than array count
avg: add ebx,1 ;Add to average
sub ebp,esp ;Subtract array count from rem sum
cmp ebp,esp ;Compare rem sum to array count
jge avg ;Jump to avg if rem sum is >= to ary count
finavg: mov ecx,ebp ;Set rem sum to remainder of avg
call ExitProcess
main ENDP
END MAIN
Registers before call ExitProcess
EAX = 00000163 EBX = 0000002C ECX = 00000003 EDX = 00401055
ESI = 004068C0 EDI = 00000000 EIP = 0040366B ESP = 00000008
EBP = 00000003 EFL = 00000293
OV = 0 UP = 0 EI = 1 PL = 1 ZR = 0 AC = 1 PE = 0 CY = 1

mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.
Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, looks like you still have EDX unused.
In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.
call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.
Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)
Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.
But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)
Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)
It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;; ESI = array count (in elements)
;; clobbers: none (other than the outputs)
; EAX = sum accumulator
; ESI = array pointer
; EDX = array element temporary
xor eax, eax ; sum = 0
mov esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something. So this is good.
sumloop: ; do {
mov edx, [esi]
add edx, 4
add eax, edx ; sum += *p++ without checking for 0, because + 0 is a no-op
test edx, edx ; sets FLAGS the same as cmp edx,0
jnz sumloop ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average
sub esi, OFFSET ary + 4 ; (end+4) - (start+4) = array size in bytes
shr esi, 2 ; esi = array length = (end-start)/element_size
cdq ; sign-extend sum into EDX:EAX as an input for idiv
idiv esi ; EAX = sum/length EDX = sum%length
call ExitProcess
main ENDP
I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.
Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.
If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:
;; UNSIGNED division (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX, count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX (reverse of the DIV instruction)
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb subloop_end ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
cmp eax, ecx
jae repeat_subtraction ; while( dividend >= divisor );
; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:
Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?
sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb division_done ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
jnc repeat_subtraction ; while( dividend -= divisor doesn't wrap (carry) );
add eax, ecx ; correct for the overshoot
dec edx
division_done:
(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)
Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div/idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)
If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.
e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)

Adobe Type 1 Encryption Algorithm in Assembly

I am trying to encrypt a text file using the adobe type 1 font encryption algorithm. However, I don't know how to properly implement the algorithm in assembly language. Please, help me if you can.
Here is the adobe type 1 font encryption algorithm:
unsigned short int r;
unsigned short int c1 =52845;
unsigned short int c2 = 22719;
unsigned char eencrypt(char plain) unsigned char plain;
{ unsigned char cipher;
cipher = (plain ^ (r >> 8));
r = (cipher + r) * c1 + c2;
return cipher;
}
Here is my code:
.model tiny
.data
filename db "file.txt", 0
bufferSize = 512
filehandle dw ?
buffer db bufferSize dup (0)
r dw 0
c1 dw 52845
c2 dw 22719
cipher dw ?
message1 db 'Cannot open file. $'
message2 db 'Cannot read file. $'
message3 db 'Cannot close file. $'
.code
org 100h
start:
call open
call read
call close
call Exit
;procedures
open:
mov ah,3DH
mov al,0
mov dx, offset filename
int 21h
jc openErr
mov filehandle, ax
ret
read: ;reads file
mov ah, 3Fh
mov bx, filehandle
mov cx, bufferSize
mov dx, offset buffer
int 21h
cmp ax,0
jc readErr
;displays content of file
call clear
mov ah, 9
mov dx, offset buffer
int 21h
ret
close:
mov ah, 3Eh
mov bx, filehandle
int 21h
jc closeErr
ret
encrypt:
; need loop to loop through each char, don't know how to do that
mov ax, [r]
shr ax, 8
mov bl, [buffer]
xor bh,bh
xor bx, ax
mov cipher, bx
mov dx, cipher
add dx, [r] ;get error: extra characters on line
imul dx, c1
add dx, c2
mov [r], dx
;decrypt:
clear: ;clears the screen
mov ax,003h
int 10h
ret
Exit:
mov ax, 4C00h
int 21h
newline: ;prints a newline
mov ah, 2
mov dl, 0DH
int 21h
mov dl, 0AH
int 21h
ret
;error messages
openErr :
call newline
lea DX,message1 ;set up pointer to error message
mov AH,9 ;display string function
int 21H ;DOS call
stc ;set error flag
ret
readErr :
call newline
lea DX,message2 ;set up pointer to error message
mov AH,9 ;display string function
int 21H ;DOS call
stc ;set error flag
ret
closeErr :
call newline
lea DX,message3 ;set up pointer to error message
mov AH,9 ;display string function
int 21H ;DOS call
stc ;set error flag
ret
end start

;encrypt:
; need loop to loop through each char, don't know how to do that
;mov ax, r
;shr ax, 8 r>>8
;mov bl, buffer bl = buffer
buffer is symbolic address, not value in the buffer. Use mov bl,[buffer] to read the value from memory. And if you want to loop over full buffer content, put that address into some register, si is often favoured for keeping pointer to the source of data (because it is hardcoded in lods instructions, and "s" may be read as "source"). If you are C++ compiler, you don't care, and you pick any spare register you have, without considering it's name.
;xor bx, ax bx = buffer ^(r>>8)
bh was not set and may be anything, ruining your calculation. Either do xor bx,bx before loading the bl from buffer, or xor bh,bh to clear bh only (tiny fraction slower on modern x86 because it must then combine bh+bl into bx), or use movzx bx,byte ptr [buffer] in previous step to zero-extend the char value into word value.
;mov cipher, bx cipher = buffer^(r>>8)
Label cipher was defined ahead of db ?, but here you are writing two bytes into memory, so it will also overwrite first byte of message1 string. And I strongly suggest to use [] every time you write into memory, even when TASM/MASM allows this kind of syntax, but I find it difficult to read (seeing [cipher] will make my brain automatically think "memory access" even with quick glance on source).
;mov cx, c1
c1 and c2 can be EQU, so they will compile as intermediate values directly into code.
;add cx, c2
That's not, what the C encrypt does (multiplication has priority over addition).
;mov dx, cipher
Again you treat cipher as word, which makes sense, but it's against the db ? (so the db needs fixing, not code here).
;add dx, r
;imul dx, cx
I would use [] brackets around r again, and the multiplication/addition order is of course wrong.
Not a bad try, but don't hesitate to uncomment that piece of code, pre-set registers with some debug values, and jump straight to it, open the binary in debugger, and single-step over it to verify it works as you did want. You would probably quickly find all those things I wrote above.
Then just turn it into loop (use any ASM book/tutorial with some example about working with arrays/strings, learn to use pointers, then it should be not that hard to finish).

Converting quaternary to octal. ASM 8086

I have to prepare program for 8086 processor which converts quaternary to octal number.
My idea:
Multiplying every digit by exponents of 4 and add to register. Later check the highest exponent of 8 not higher than sum from first step.
Divide by exponents of 8 until remainder equals 0. Every result of dividing is one digit in octal.
But for 16-digits number last exponent of 4 is 4^15. I suppose It isn't optimal algorithm.
Is there any other way? Maybe to binary and group by 3 digits.

Turns out you can indeed process values 3 digits at a time. Done this way, you can process strings of arbitrary length, without being limited by the size of a register. Not sure why you might need to, unless aliens try to communicate with us using ascii strings of quaternary digits with huge lengths. Might happen.
It's possible to do the translation either way (Right to Left or Left to Right). However, there's a bit of a challenge to either:
If you are processing RtL, you need to know the length of the output string before you start (so that you know where to write the digits as you compute them). This is do-able, but a bit tricky. In simplest terms, the length is ((strlen(Q) + 2) / 3) * 2. That almost gets it. However, you can end up with a blank space at the beginning for a number of cases. "1" as well as "10" will give the blank space. "20" won't. The correct value can be computed, but it's annoying.
Likewise, processing LtR has a similar problem. You don't have the problem of figuring out where to write digits, but consider: If the string to convert it "123", then the conversion is simple (33 octal). But what if you start processing, and the complete string is "1231" (155 octal)? In that case what you need to process it like "001231" (01 55). IOW, digits can be processed in groups of 3, but you need to handle the initial case where the number of digits doesn't evenly divide by 3.
Posting solutions to homework is usually something I avoid. However I doubt you are going to turn this in as your 'solution,' and it's (barely) possible that google might send someone here who needs something similar.
A few things to note:
This code is intended to be called from C using Microsoft's fastcall (it made testing easier) and compiled with masm.
While it is written in 32bit (my environment), there's nothing that particularly requires 32bit in it. Since you said you were targeting 8086, I've tried to avoid any 'advanced' instructions. Converting to 16bit or even 64bit should not present much of a challenge.
It processes from left to right.
As with any well-written routine, it validates its parameters. It outputs a zero length string on error, such as invalid digits in the input string.
It will crash if the output buffer is NULL. I suppose I could return a bool on error (currently returns void), but, well, I didn't.
I'm sure the code could be tighter (couldn't it always?), but for "homework project quality," it seems reasonable.
Other than that, that comments should explain the code.
.386
.model flat
.code
; Call from C via:
; extern "C" void __fastcall PrintOct(const char *pQuat, char *pOct);
; On Entry:
; ecx: pQuat
; edx: pOct
; On Exit:
; eax, ecx, edx clobbered
; all others preserved
; If pOct is zero bytes long, an error occurred (probably invalid digits)
#PrintOct#8 PROC
; -----------------------
; If pOct is NULL, there's nothing we can do
test edx, edx
jz Failed
; -----------------------
; Save the registers we modify (except for
; eax, edx and ecx which we treat as scratch).
push esi
push ebx
push edi
mov esi, ecx
mov edi, edx
xor ebx, ebx
; -----------------------
; esi: pQuat
; edi: pOct
; ebx: zero (because we use lea)
; ecx: temp pointer to pQuat
; Reject NULL pQuat
test esi, esi
jz WriteNull
; -----------------------
; Reject 0 length pQuat
mov bl, BYTE PTR [esi]
test bl, bl
jz WriteNull
; -----------------------
; How many chars in pQuat?
mov dl, bl ; bl is first digit as ascii. Preserve it.
CountLoop:
inc ecx ; One more valid char
; While we're counting, check for invalid digits
cmp dl, '0'
jl WriteNull
cmp dl, '3'
jg WriteNull
mov dl, BYTE PTR [ecx] ; Read the next char
test dl, dl ; End of string?
jnz CountLoop
sub ecx, esi
; -----------------------
; At this point, there is at least 1 valid digit, and
; ecx contains # digits
; bl still contains first digit as ascii
; Normally we process 3 digits at a time. But the number of
; digits to process might not be an even multiple of 3.
; This code finds the 'remainder' when dividing ecx by 3.
; It might seem like you could just use 'div' (and you can),
; but 'div' is so insanely expensive, that doing all these
; lines is *still* cheaper than a single div.
mov eax, ecx
mov edx, 0AAAAAAABh
mul edx
shr edx, 1
lea edx, [edx+edx*2]
sub ecx, edx ; This gives us the remainder (0-2).
; If the remainder is zero, use the normal 3 digit load
jz LoadTriplet
; -----------------------
; Build a triplet from however many leading 'odd' digits
; there are (1 or 2). Result is in al.
lea eax, DWORD PTR [ebx-48] ; This get us the first digit
; If there was only 1 digit, don't try to load 2
cmp cl, 1
je OneDigit
; Load the other digit
shl al, 2
mov bl, BYTE PTR [esi+1]
sub bl, 48
or al, bl
OneDigit:
add esi, ecx ; Update our pQuat pointer
jmp ProcessDigits
; -----------------------
; Build a triplet from the next 3 digits.
; Result is in al.
; bl contains the first digit as ascii
LoadTriplet:
lea eax, DWORD PTR [ebx-48]
shl al, 4 ; Make room for the other 2 digits.
; Second digit
mov cl, BYTE PTR [esi+1]
sub cl, '0'
shl cl, 2
or al, cl
; Third digit
mov bl, BYTE PTR [esi+2]
sub bl, '0'
or al, bl
add esi, 3 ; Update our pQuat pointer
; -----------------------
; At this point
; al: Triplet
; ch: DigitWritten (initially zeroed when computing remainder)
ProcessDigits:
mov dl, al
shr al, 3 ; left digit
and dl, 7 ; right digit
; If we haven't written any digits, and we are
; about to write a zero, skip it. This deals
; with both "000123" and "2" (due to OneDigit,
; the 'left digit' might be zero).
; If we haven't written any digits yet (ch == 0), and the
; value we are are about to write is zero (al == 0), skip
; the write.
or ch, al
jz Skip1
add al, '0' ; Convert to ascii
mov BYTE PTR [edi], al ; Write a digit
inc edi ; Update pointer to output buffer
jmp Skip1a ; No need to check again
Skip1:
or ch, dl ; Both check and update DigitWritten
jz Skip2
Skip1a:
add dl, '0' ; Convert to ascii
mov BYTE PTR [edi], dl ; Write a digit
inc edi ; Update pointer to output buffer
Skip2:
; Load the next digit.
mov bl, BYTE PTR [esi]
test bl, bl
jnz LoadTriplet
; -----------------------
; All digits processed. We know there is at least 1 valid digit
; (checked on entry), so if we never wrote anything, the value
; must have been zero. Since we skipped it to avoid
; unnecessary preceding zeros, deal with it now.
test ch, ch
jne WriteNull
mov BYTE PTR [edi], '0'
inc edi
; -----------------------
; Write the trailing NULL. Note that if the returned string is
; 0 bytes long, an error occurred (probably invalid digits).
WriteNull:
mov BYTE PTR [edi], 0
; -----------------------
; Cleanup
pop edi
pop ebx
pop esi
Failed:
ret
#PrintOct#8 ENDP
end
I've run a string with 1,000,000,000 quaternary digit thru it as well as all the values from 0-4,294,967,295. Seems to work.
I for one welcome our new 4-digited alien overlords.

How to echo memory location use NASM [duplicate]

I am looking for a way to print an integer in assembler (the compiler I am using is NASM on Linux), however, after doing some research, I have not been able to find a truly viable solution. I was able to find a description for a basic algorithm to serve this purpose, and based on that I developed this code:
global _start
section .bss
digit: resb 16
count: resb 16
i: resb 16
section .data
section .text
_start:
mov dword[i], 108eh ; i = 4238
mov dword[count], 1
L01:
mov eax, dword[i]
cdq
mov ecx, 0Ah
div ecx
mov dword[digit], edx
add dword[digit], 30h ; add 48 to digit to make it an ASCII char
call write_digit
inc dword[count]
mov eax, dword[i]
cdq
mov ecx, 0Ah
div ecx
mov dword[i], eax
cmp dword[i], 0Ah
jg L01
add dword[i], 48 ; add 48 to i to make it an ASCII char
mov eax, 4 ; system call #4 = sys_write
mov ebx, 1 ; file descriptor 1 = stdout
mov ecx, i ; store *address* of i into ecx
mov edx, 16 ; byte size of 16
int 80h
jmp exit
exit:
mov eax, 01h ; exit()
xor ebx, ebx ; errno
int 80h
write_digit:
mov eax, 4 ; system call #4 = sys_write
mov ebx, 1 ; file descriptor 1 = stdout
mov ecx, digit ; store *address* of digit into ecx
mov edx, 16 ; byte size of 16
int 80h
ret
C# version of what I want to achieve (for clarity):
static string int2string(int i)
{
Stack<char> stack = new Stack<char>();
string s = "";
do
{
stack.Push((char)((i % 10) + 48));
i = i / 10;
} while (i > 10);
stack.Push((char)(i + 48));
foreach (char c in stack)
{
s += c;
}
return s;
}
The issue is that it outputs the characters in reverse, so for 4238, the output is 8324. At first, I thought that I could use the x86 stack to solve this problem, push the digits in, and pop them out and print them at the end, however when I tried implementing that feature, it flopped and I could no longer get an output.
As a result, I am a little bit perplexed about how I can implement a stack in to this algorithm in order to accomplish my goal, aka printing an integer. I would also be interested in a simpler/better solution if one is available (as it's one of my first assembler programs).

One approach is to use recursion. In this case you divide the number by 10 (getting a quotient and a remainder) and then call yourself with the quotient as the number to display; and then display the digit corresponding to the remainder.
An example of this would be:
;Input
; eax = number to display
section .data
const10: dd 10
section .text
printNumber:
push eax
push edx
xor edx,edx ;edx:eax = number
div dword [const10] ;eax = quotient, edx = remainder
test eax,eax ;Is quotient zero?
je .l1 ; yes, don't display it
call printNumber ;Display the quotient
.l1:
lea eax,[edx+'0']
call printCharacter ;Display the remainder
pop edx
pop eax
ret
Another approach is to avoid recursion by changing the divisor. An example of this would be:
;Input
; eax = number to display
section .data
divisorTable:
dd 1000000000
dd 100000000
dd 10000000
dd 1000000
dd 100000
dd 10000
dd 1000
dd 100
dd 10
dd 1
dd 0
section .text
printNumber:
push eax
push ebx
push edx
mov ebx,divisorTable
.nextDigit:
xor edx,edx ;edx:eax = number
div dword [ebx] ;eax = quotient, edx = remainder
add eax,'0'
call printCharacter ;Display the quotient
mov eax,edx ;eax = remainder
add ebx,4 ;ebx = address of next divisor
cmp dword [ebx],0 ;Have all divisors been done?
jne .nextDigit
pop edx
pop ebx
pop eax
ret
This example doesn't suppress leading zeros, but that would be easy to add.

I think that maybe implementing a stack is not the best way to do this (and I really think you could figure out how to do that, saying as how pop is just a mov and a decrement of sp, so you can really set up a stack anywhere you like by just allocating memory for it and setting one of your registers as your new 'stack pointer').
I think this code could be made clearer and more modular if you actually allocated memory for a c-style null delimited string, then create a function to convert the int to string, by the same algorithm you use, then pass the result to another function capable of printing those strings. It will avoid some of the spaghetti code syndrome you are suffering from, and fix your problem to boot. If you want me to demonstrate, just ask, but if you wrote the thing above, I think you can figure out how with the more split up process.

; Input
; EAX = pointer to the int to convert
; EDI = address of the result
; Output:
; None
int_to_string:
xor ebx, ebx ; clear the ebx, I will use as counter for stack pushes
.push_chars:
xor edx, edx ; clear edx
mov ecx, 10 ; ecx is divisor, devide by 10
div ecx ; devide edx by ecx, result in eax remainder in edx
add edx, 0x30 ; add 0x30 to edx convert int => ascii
push edx ; push result to stack
inc ebx ; increment my stack push counter
test eax, eax ; is eax 0?
jnz .push_chars ; if eax not 0 repeat
.pop_chars:
pop eax ; pop result from stack into eax
stosb ; store contents of eax in at the address of num which is in EDI
dec ebx ; decrement my stack push counter
cmp ebx, 0 ; check if stack push counter is 0
jg .pop_chars ; not 0 repeat
mov eax, 0x0a
stosb ; add line feed
ret ; return to main

; eax = number to stringify/output
; edi = location of buffer
intToString:
push edx
push ecx
push edi
push ebp
mov ebp, esp
mov ecx, 10
.pushDigits:
xor edx, edx ; zero-extend eax
div ecx ; divide by 10; now edx = next digit
add edx, 30h ; decimal value + 30h => ascii digit
push edx ; push the whole dword, cause that's how x86 rolls
test eax, eax ; leading zeros suck
jnz .pushDigits
.popDigits:
pop eax
stosb ; don't write the whole dword, just the low byte
cmp esp, ebp ; if esp==ebp, we've popped all the digits
jne .popDigits
xor eax, eax ; add trailing nul
stosb
mov eax, edi
pop ebp
pop edi
pop ecx
pop edx
sub eax, edi ; return number of bytes written
ret

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio