String Loop in Assembly - visual-studio-2010

I have to create a loop whose end result prints "LoopLoopLoopLoopLoop" (note: only 5 times). What I have so far is this:
.data
STRING_1 DB "Loop"
.code
mov ecx, 5 ; Perform loop 5 times
printLoop:
LEA DX, STRING_1
loop printLoop
call DumpRegs
I'm not even 100% sure that what I have is correct, but my main question is: how do I make all of the output print on the same line?

Related

What does movsbl (%rax, %rcx, 1),%eax and $0xf, %eax do?

movsbl (%rax, %rcx, 1),%eax
and $0xf, %eax
I have:
%rax=93824992274748
%eax=1431693628
%rcx=0
I really don't understand why I get these results:
How does the first instruction give me %eax=97?
Why does the and between the binary representations of 97 and 1111 give me 1?
Bitwise AND compares the bits of both operands. In this case, 97 AND 15:
0110 0001 ;97
0000 1111 ;15
For each column of bits, if both bits in the column are 1, the resulting bit in that column is 1. Otherwise, it's zero.
0110 0001 ;97
0000 1111 ;15
---------------
0000 0001 ;1
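As for where the 97 comes from: movsbl (%rax, %rcx, 1), %eax loads one byte from the address %rax + %rcx*1 and sign-extends it to 32 bits, so %eax=97 means the byte stored at that address is 97 (ASCII 'a'). Here is a minimal C sketch of the two instructions. The function name and the base/index parameters are made up for illustration; they stand in for %rax and %rcx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models movsbl (%rax,%rcx,1),%eax followed by and $0xf,%eax.
   base and index are hypothetical stand-ins for %rax and %rcx. */
int32_t load_and_mask(const int8_t *base, size_t index)
{
    assert(base != NULL);
    int32_t eax = base[index]; /* movsbl: load one byte, sign-extend to 32 bits */
    eax &= 0xf;                /* and: keep only the low 4 bits */
    return eax;
}
```

With a byte of 97 in memory the function returns 1, matching the worked example above.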
You might be wondering what purpose this serves. There are a lot of things you can do with AND, many of which aren't obvious at first glance. It's very helpful to think of one of the operands as your data and the other as a "filter."
For example, let's say we have a function called rand that returns a random 32-bit unsigned integer in %eax every time you call it. Assume that all possible values are equally likely. Now, let's say that we have some function called myEvent, and whether we call it or not shall be based on the outcome of rand. If we want this event to have a 1 in 16 chance of occurring, we can do this:
call rand
and $0xf, %eax
jnz skip
call myEvent
skip:
This works because every multiple of 16 has its bottom 4 bits clear. Those are the only bits we're interested in, so we can use and $0xf, %eax to ignore everything to the left of those 4 bits, since they all turn into zeroes after the and. Once we've done the and, we can check whether %eax contains zero. If it does, %eax contained a multiple of 16 prior to the and.
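The same filter can be sketched in C to check the 1 in 16 claim by counting. This is only an illustration: low_nibble_is_zero and count_hits are hypothetical names, and instead of a random source it walks every value in a range.

```c
#include <stdint.h>

/* Returns 1 when the low 4 bits of r are all zero, i.e. r is a multiple of 16. */
int low_nibble_is_zero(uint32_t r)
{
    return (r & 0xfu) == 0;
}

/* Counts how many values in [0, limit) pass the filter. */
uint32_t count_hits(uint32_t limit)
{
    uint32_t hits = 0;
    for (uint32_t r = 0; r < limit; ++r)
        hits += (uint32_t)low_nibble_is_zero(r);
    return hits;
}
```

Over the 65536 values from 0 to 65535, exactly 4096 pass the filter, which is one in sixteen.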
Here's another example. This tells you if %eax is odd or even:
and $1, %eax
jnz isOdd
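This works because a number is odd if its rightmost binary digit is 1. When you test n % 2 in a high-level language, an optimizing compiler will typically emit a bitwise AND with 1 rather than doing an actual division. Note that and overwrites %eax; if you need to keep the value, test $1, %eax sets the same flags without modifying the register.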

General algorithm for bit permutations

Is there a general strategy to create an efficient bit permutation algorithm? The goal is to create a fast, branch-less and, if possible, LUT-less algorithm. I'll give an example:
A 13 bit code is to be transformed into another 13 bit code according to the following rule table:
BIT  INPUT (DEC)  INPUT (BIN)    OUTPUT (BIN)   OUTPUT (DEC)
0    1            0000000000001  0000100000000  256
1    2            0000000000010  0010000000000  1024
2    4            0000000000100  0100000000000  2048
3    8            0000000001000  1000000000000  4096
4    16           0000000010000  0000001000000  64
5    32           0000000100000  0000000100000  32
6    64           0000001000000  0001000000000  512
7    128          0000010000000  0000000010000  16
8    256          0000100000000  0000000001000  8
9    512          0001000000000  0000000000010  2
10   1024         0010000000000  0000000000001  1
11   2048         0100000000000  0000000000100  4
12   4096         1000000000000  0000010000000  128
Example: If the input code is 1+2+4096=4099 the resulting output would be 256+1024+128=1408
A naive approach would be:
OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...
That is 3 instructions per bit (AND, SHIFT, OR), so 39 - 1 = 38 instructions for the above example (the last OR is omitted). We could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
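As a reference point, the naive approach can be written as a loop in C. This is a sketch, not an optimized solution: the DEST table and the name permute_naive are introduced here for illustration, with DEST read off the rule table (input bit i moves to output position DEST[i]). It is mainly useful for checking faster variants against.

```c
#include <assert.h>
#include <stdint.h>

/* DEST[i] is the output position of input bit i, read off the rule table. */
static const uint8_t DEST[13] = { 8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7 };

/* Moves each of the 13 input bits to its destination, one bit at a time. */
uint16_t permute_naive(uint16_t input)
{
    assert(input < (1u << 13)); /* only 13-bit codes are defined */
    uint16_t output = 0;
    for (int bit = 0; bit < 13; ++bit)
        output |= (uint16_t)(((input >> bit) & 1u) << DEST[bit]);
    return output;
}
```

For the example input 4099 (bits 0, 1 and 12 set) this yields 1408.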
When inspecting the example table, you will of course notice a few obvious possibilities for optimization, for example rows 2/3/4 (bits 1-3), which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.
Are there general strategies? I think using Karnaugh-Veitch maps to simplify the expression could be one approach. However, that is pretty difficult for 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.
For bit permutations, several strategies are known that work in certain cases. There's a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. However, in this case it essentially finds only the strategy you suggested, which indicates that it is hard to find any pattern to exploit in this permutation.
If one big lookup table is too much, you can try to use two smaller ones.
1. Take the 7 lower bits of the input and look up a 16-bit value in table A.
2. Take the 6 higher bits of the input and look up a 16-bit value in table B.
3. OR the values from steps 1 and 2 to produce the result.
Table A needs 128*2 bytes and table B needs 64*2 bytes; that's 384 bytes for the lookup tables.
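A possible C sketch of the two-table idea follows. The names DEST, permute_bitwise, init_tables and permute_two_luts are made up for this example, and the tables are filled at startup from a bit-by-bit reference; in real code you would more likely precompute them as constants.

```c
#include <assert.h>
#include <stdint.h>

/* DEST[i] is the output position of input bit i, read off the rule table. */
static const uint8_t DEST[13] = { 8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7 };

static uint16_t table_a[128]; /* indexed by input bits 0-6  (128 * 2 bytes) */
static uint16_t table_b[64];  /* indexed by input bits 7-12 (64 * 2 bytes)  */

/* Slow bit-by-bit reference, used only to fill the tables. */
static uint16_t permute_bitwise(uint16_t input)
{
    uint16_t output = 0;
    for (int bit = 0; bit < 13; ++bit)
        output |= (uint16_t)(((input >> bit) & 1u) << DEST[bit]);
    return output;
}

/* Must be called once before permute_two_luts. */
void init_tables(void)
{
    for (uint16_t i = 0; i < 128; ++i)
        table_a[i] = permute_bitwise(i);
    for (uint16_t i = 0; i < 64; ++i)
        table_b[i] = permute_bitwise((uint16_t)(i << 7));
}

/* Two lookups and one OR per permutation. */
uint16_t permute_two_luts(uint16_t input)
{
    assert(input < (1u << 13));
    return (uint16_t)(table_a[input & 0x7f] | table_b[input >> 7]);
}
```

Because the two tables cover disjoint input bits, and a permutation sends disjoint input bits to disjoint output bits, the OR of the two lookups is the full result.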
This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.
Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy to find the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target but I don't see a good way to divide bit ranges around it.
The three look-up tables have single-byte entries because the range of the transformations for each range is just one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.
Here's the code:
unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2) \
    { 0, (BIT0), (BIT1), (BIT1)+(BIT0), \
      (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[] = LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[] = LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return (   (x&1) << 8)        // Bit 0
         + (  (x&14) << 9)        // Bits 1-3, simple shift
         + (t4[(x>>4)&7] << 3)    // Bits 4-6, see below
         + (t7[(x>>7)&7]     )    // Bits 7-9, three-bit lookup for LOB
         + (t10[(x>>10)&7]   );   // Bits 10-12, ditto
}
Note on bits 4-6
Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, so the transformed bits span only 5 bit positions (5 through 9). Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on the x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)
The intermediate results are added instead of using boolean OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR have the same result; using add can take advantage of chip features like LEA.
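The ADD/OR equivalence follows from the identity a + b = (a | b) + (a & b): when no bit is set in both operands, the a & b term is zero. A tiny C check, with a hypothetical helper name:

```c
#include <stdint.h>

/* a + b == (a | b) + (a & b) in unsigned arithmetic, so ADD and OR
   agree exactly when a and b have no set bits in common. */
int add_equals_or(uint32_t a, uint32_t b)
{
    return (a + b) == (a | b);
}
```

For example, 0x100 and 0x1C00 share no bits, so adding and OR-ing them give the same value, whereas 3 and 1 overlap in bit 0 and do not.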
Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:
permute_bits(unsigned short):
mov edx, edi
mov ecx, edi
movzx eax, di
shr di, 4
shr dx, 7
shr cx, 10
and edi, 7
and edx, 7
and ecx, 7
movzx ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
movzx edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
add edx, ecx
mov ecx, eax
sal eax, 9
sal ecx, 8
and ax, 7168
and cx, 256
or eax, ecx
add edx, eax
movzx eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
lea eax, [rdx+rax*8]
ret
I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.
I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. It is quite a bit shorter, though, possibly enough to compensate for the 24 bytes of lookup data.

Write assembly language program to sort student names according to their grades

I want to write a simple assembly language program to sort student names according to their grades.
I am just using:
.data
.code
I tried this bubble sort, but it only handles numbers. How can I add names for the students?
.data
array db 9,6,5,4,3,2,1
count dw 7
.code
mov cx,count
dec cx
nextscan:
mov bx,cx
mov si,0
nextcomp:
mov al,array[si]
mov dl,array[si+1]
cmp al,dl
jnc noswap
mov array[si],dl
mov array[si+1],al
noswap:
inc si
dec bx
jnz nextcomp
loop nextscan
Long ago, one of the most common ways to represent data was with what were called fixed-length fields. It wasn't uncommon to find all related data in one place, like this:
Student: db 72, 'Marie          ' ; grade byte + name padded to 15 bytes = 16 bytes
db 91, 'Barry          '
db 83, 'Constantine    '
db 59, 'Wil-Alexander  '
db 97, 'Jake           '
db 89, 'Ceciel         '
This is doable because each record is 16 bytes long; data used to be laid out in powers of 2, so the record length was 2, 4, 8, 16, 32, 64 and so on. It didn't have to be this way, and a lot of times it wasn't, but sizes like that made the code simpler.
The problem is that each time we want to sort, all of the data has to be moved, and so the relational database was born. Here we separate the variable data from the static data.
Student: db 'Marie           ' ; each name padded to 16 bytes
db 'Barry           '
db 'Constantine     '
db 'Wil-Alexander   '
db 'Jake            '
db 'Ceciel          '
Grades: db 72, 0
db 91, 1
db 83, 2
db 59, 3
db 97, 4
db 89, 5
dw -1 ; Marks end of list
Not only is this easier to manage in the program, it also makes it easier to add more grades, even several grades for the same person. Here is an example of how the comparison code would work.
mov si, Grades
mov bl, 0
push si
L0: lodsw
cmp ax, -1
jz .done
cmp [si-4], al
jae L0
.... Exchange Data Here ....
bts bx, 0
jmp L0
.done:
pop si
btc bx, 0
jc L0 - 1
ret
After the routine has executed, the contents of Grades are as follows:
61 04 5B 01 59 05 53 02 48 00 3B 03
I do have a working copy of this program, tested in DOSBox, but because this is a homework assignment I'm not going to hand it to you on a silver platter. Still, 95% of the work is done. All you need to do before handing it in is make sure you can explain why BTS and BTC make the bubble sort work, and implement something that will exchange the data.
If you need to display this data, you'd have to devise a conversion routine from binary to decimal. For the names, simply multiply the index number associated with each grade by 16 and add the address of Student; that gives you a pointer to the appropriate name.
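To show the idea outside of assembly, here is a rough C sketch of sorting the 2-byte grade/index entries and then finding a name at index * 16. The struct and function names are invented for this example, and the names are NUL-padded to 16 bytes here rather than space-padded, purely to keep the C short.

```c
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 16 /* fixed field width, as in the Student table */

/* Static data: one 16-byte field per student (unused bytes are zero). */
static const char students[6][NAME_LEN] = {
    "Marie", "Barry", "Constantine", "Wil-Alexander", "Jake", "Ceciel"
};

/* Variable data: grade plus the index of the matching name. */
struct grade_entry { uint8_t grade; uint8_t index; };

/* Bubble sort, highest grade first; only the 2-byte entries move. */
void sort_grades(struct grade_entry *g, size_t n)
{
    int swapped = 1;
    while (swapped) {
        swapped = 0;
        for (size_t i = 1; i < n; ++i) {
            if (g[i - 1].grade < g[i].grade) {
                struct grade_entry tmp = g[i - 1];
                g[i - 1] = g[i];
                g[i] = tmp;
                swapped = 1; /* same role as the flag bit in BX */
            }
        }
    }
}

/* Pointer to the name: base address of the table + index * 16. */
const char *student_name(const struct grade_entry *e)
{
    return (const char *)students + (size_t)e->index * NAME_LEN;
}
```

The names never move; sorting only shuffles the small entries, and the index carried with each grade still points at the right name afterwards.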
Sort pointers to {name, grade} structs, or sort indices into separate name and grade arrays.
That costs one extra level of indirection in the compare, but not in the swap.

Creating array and adding values

I am working on an assignment and am having trouble understanding arrays in this kind of code (keep in mind that my knowledge of this is limited). My code is supposed to ask the user for the number of values that will be put in an array of SDWORDs, and then call a procedure that has the user input the numbers. I have finished the part below that asks the user for the amount (saved in "count"), but I am struggling with the procedure. For example, with my code below, if they enter 5 then the procedure I have to write would require them to input 5 numbers that go into an array.
The problem I am facing is that I'm not sure how to actually set up the array. It can contain anywhere between 2 and 12 numbers, which is why I have the compare set up in the code below. Say the user indicates that they will enter 5 numbers and I set it up like this...
.data
array SDWORD 5
The problem I am having is that I'm not sure whether that says the array will hold 5 values or that just one value in the array is 5. I need the number of values in the array to be equal to "count". "count", as I have set it up below, is the amount that the user is going to enter.
I also know how to set up the procedure like this...
EnterValues PROC
return
EnterValues ENDP
I just don't know how to implement something like this. All of the research I have done online only confuses me more, and none of the examples I have found ask the user how many numbers will be in the array before any numbers are entered into it. I hope what I described makes sense. Any input on what I could do would be great!
INCLUDE Irvine32.inc
.data
count SDWORD ?
prompt1 BYTE "Enter the number of values to sort",0
prompt2 BYTE "Error. The number must be between 2 and 12",0
.code
Error PROC
mov edx, OFFSET prompt2
call WriteString
exit ; exit ends program after error occurs
Error ENDP
main PROC
mov edx, OFFSET prompt1
call WriteString ; prints out prompt1
call ReadInt
mov count, eax ; save returned value from eax to count
cmp count, 12
jle Loop1 ; If count is less than or equal to 12 jump to Loop1, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop1: cmp count, 2
jge Loop2 ; If count is greater than or equal to 2 jump to Loop2, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop2: exit
main ENDP
END main
============EDIT==============
I came up with this...
EnterValues PROC
mov ecx, count
mov edx, 0
Loop3:
mov eax, ArrayOfInputs[edx * 4]
call WriteInt
call CrLf
inc edx
dec ecx
jnz Loop3
ret
EnterValues ENDP
.data
array SDWORD 5
defines one SDWORD with the initial value 5 in the DATA section and gives it the name "array".
You might want to use the DUP operator:
.data
array SDWORD 12 DUP (5)
This defines twelve SDWORDs and initializes each of them with the value 5. If the initial value doesn't matter, i.e. you want an uninitialized array, change the initial value to '?':
array SDWORD 12 DUP (?)
For an uninitialized array like this, MASM may place it in a _BSS segment. To force that decision, declare it in the .data? section:
.data?
array SDWORD 12 DUP (?)
The symbol array is used in a MASM program as a constant offset to the address of the first entry. Use an additional index to address subsequent entries, for example:
mov eax, [array + 4] ; second SDWORD
mov eax, [array + esi]
Pointer arithmetic:
lea esi, array ; copy address into register
add esi, 8 ; move pointer to the third entry
mov eax, [esi] ; load eax with the third entry
lea esi, array + 12 ; copy the address of the fourth entry
mov eax, [esi] ; load eax with the fourth entry
In every case you get an array with a fixed size. It's up to you to fill only the first count entries of it.
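The same pattern, a fixed capacity of 12 with only the first count slots in use, can be sketched in C. This is an illustration only: fill_values and the source parameter are made up here, and in the assignment each value would come from ReadInt rather than from another array.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_VALUES 2
#define MAX_VALUES 12

static int32_t array[MAX_VALUES]; /* like: array SDWORD 12 DUP (?) */

/* Copies count values from source into the fixed-size array.
   Returns 0 on success, -1 if count is outside 2..12. */
int fill_values(const int32_t *source, int32_t count)
{
    if (count < MIN_VALUES || count > MAX_VALUES)
        return -1;
    for (int32_t i = 0; i < count; ++i)
        array[i] = source[i]; /* element i lives at array + i * 4 bytes */
    return 0;
}
```

The array never changes size; count only decides how many of its entries are meaningful.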

gcc compiled binary in x86 assembly

A gcc-compiled binary has the following assembly:
8049264: 8d 44 24 3e lea 0x3e(%esp),%eax
8049268: 89 c2 mov %eax,%edx
804926a: bb ff 00 00 00 mov $0xff,%ebx
804926f: b8 00 00 00 00 mov $0x0,%eax
8049274: 89 d1 mov %edx,%ecx
8049276: 83 e1 02 and $0x2,%ecx
8049279: 85 c9 test %ecx,%ecx
804927b: 74 09 je 0x8049286
At first glance, I had no idea what it is doing at all. My best guess is some sort of memory alignment and clearing of a local variable (because rep stos is filling the local variable's location with 0). The first few lines load an address into eax, move it to ecx and test whether the address is aligned, but I'm lost as to why this is happening. I want to know exactly what happens here.
It looks like this is initialising a local variable located at [ESP + 0x3E] to zeroes. First, EDX is initialised to hold the address and EBX is initialised to hold the size in bytes. Then it is checked whether EDX & 2 is nonzero; in other words, whether EDX as a pointer is wyde-aligned (2-byte aligned) but not tetra-aligned (4-byte aligned). (Assuming ESP is tetrabyte aligned, as it generally should be, EDX, which was initialised to 0x3E bytes above ESP, would not be tetrabyte aligned. But this is slightly beside the point.) If this is the case, the wyde from AX, which is zero, is stored at [EDX], EDX is incremented by two, and the counter EBX is decremented by two.
Now, assuming ESP was at least wyde-aligned, EDX is guaranteed to be tetra-aligned. ECX is calculated to hold the number of tetrabytes remaining by shifting EBX right two bits, EDI is loaded from EDX, and the REP STOS stores that many zero tetrabytes at [EDI], incrementing EDI in the process. Then EDX is loaded from EDI to get the pointer just past the space initialised so far.
Finally, if there were at least two bytes remaining uninitialised, a zero wyde is stored at [EDX] and EDX is incremented by two, and if there was at least one byte remaining uninitialised, a zero byte is stored at [EDX] and EDX is incremented by one.
The point of this extra complexity is apparently to store most of the zeroes as four-byte values rather than single-byte values, which may, under certain circumstances and on certain CPU architectures, be slightly faster.
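The sequence described above corresponds roughly to the following C. This is a sketch under assumptions: zero_fill is a made-up name, the pointer is assumed to be at least 2-byte aligned, small memset calls stand in for the 16-bit and 32-bit stores, and the size >= 2 guard is added here for safety (the compiled code does not need it because the size is a constant).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zeroes size bytes at p, assuming p is at least 2-byte aligned. */
void zero_fill(unsigned char *p, size_t size)
{
    /* Head: pointer is 2-aligned but not 4-aligned, so store one zero wyde. */
    if (((uintptr_t)p & 2) != 0 && size >= 2) {
        memset(p, 0, 2);
        p += 2;
        size -= 2;
    }
    /* Body: the REP STOS part, size / 4 four-byte stores. */
    for (size_t n = size >> 2; n > 0; --n) {
        memset(p, 0, 4);
        p += 4;
    }
    /* Tail: at most one 2-byte store and one 1-byte store remain. */
    if (size & 2) {
        memset(p, 0, 2);
        p += 2;
    }
    if (size & 1)
        *p = 0;
}
```

With a start address 2 bytes past a 4-byte boundary and a size of 255, this performs one 2-byte store, 63 four-byte stores and a final 1-byte store.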

Resources