I wanted to know the extent to which ChatGPT could generate assembly language, so I gave it the following prompt:
Write an optimized ARM assembly code that bubble sorts an array.
I will not include the full output as it's rather boring and standard, but I will provide snippets below. Ellipses indicate lines I've cut out. All comments are mine.
.data
array: .word 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
arrayLength = 10
...
outer_loop:
...
inner_loop:
add r3, r0, r2 ##r0=0 and r2=0 on first iteration
ldr r4, [array, r3, lsl #2]
add r3, r3, #4
ldr r5, [array, r3, lsl #2]
cmp r4, r5 ##N not set, so bgt branches on first iter
bgt swap
...
swap:
str r5, [array, r3, lsl #2]
subs r3, r3, #4
str r4, [array, r3, lsl #2]
b inner_loop
...
I am currently trying to get into MCUs. The code is rather easy to understand, but if we focus on just the first iteration, it seems that we're loading 9 into r4 and 8 into r5, then putting these numbers right back into the same places in the array from which we retrieved them. At length, you can see what happens from the following sequence:
array + r3 points to the first value in the array.
r4 holds the value 9.
r3 increments by 4.
array + r3 points to the second value in the array.
r5 holds the value 8.
branch to swap block
we store the value in r5, which is 8, in the place array + r3 points, which is the second array slot.
r3 decrements by 4.
array + r3 points to the first value in the array.
we store the value r4, which is 9, in the place array + r3 points, which is the first slot.
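For what it's worth, the sequence can be checked mechanically. Here is a small C model of one inner-loop iteration (the function name is mine, and each #4 address step is modeled as one array slot, as the trace does):

```c
#include <stdint.h>

/* A C walk-through of the trace above: it performs the loads, the r3
   adjustments, and the stores in exactly the order the generated
   assembly does them. */
void first_iteration(int32_t array[])
{
    int32_t r3 = 0;

    int32_t r4 = array[r3];  /* ldr r4: first element (9)           */
    r3 += 1;                 /* add r3, r3, #4 (one slot)           */
    int32_t r5 = array[r3];  /* ldr r5: second element (8)          */

    if (r4 > r5) {           /* cmp r4, r5 / bgt swap               */
        array[r3] = r5;      /* str r5: 8 goes back into slot 1     */
        r3 -= 1;             /* subs r3, r3, #4                     */
        array[r3] = r4;      /* str r4: 9 goes back into slot 0     */
    }
    /* net effect: both values land exactly where they were loaded from */
}
```

Running this on {9, 8} leaves the array unchanged, which matches the intuition in the trace: the stores undo the loads rather than swapping.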
So my questions are: am I somehow misunderstanding the opcodes? If so, why does my intuition about the sequence I expect to execute fail? Or is the bot just wrong? Also, the subs in the swap block may set a flag if our address goes negative, but ChatGPT doesn't do anything with that fact, so would a simple sub be better?
I also had a question about the logical left shift. I first assumed it had to be due to endianness, but realized that doesn't quite explain it. Why is the array indexed at a scaled offset like that? I assume I'm missing some context where this sort of thing is normal, but I have yet to encounter it in my experience.
ChatGPT insists this sorts the array, but always with the context of what a bubble sort is. I don't know if it is actually running the code against the input data I feed it, or if it is just doing a sort and then telling me it used a bubble sort. For instance, I gave it the following prompt:
Imagine you have the array [10,5,4,1] stored in memory and used the above code to sort this array. After running the program, what would the resultant array be?
and received the following explanation
After running the above ARM assembly code on the array [10, 5, 4, 1], the resultant sorted array would be [1, 4, 5, 10].
The bubble sort algorithm works by repeatedly swapping adjacent elements that are in the wrong order until the entire array is sorted in ascending order. In each iteration of the outer loop, the length of the unsorted portion of the array is decremented, and in each iteration of the inner loop, adjacent elements are compared and swapped if necessary. This process continues until the entire array is sorted.
The tense here is important: "the resultant array would be..."
I've been studying algorithm efficiency, and one part of a course said that counting the number of operations (as opposed to timing the algorithm) depends on how a function is implemented (which is presented as a downside), and also depends on the algorithm (which is presented as an upside).
What exactly does this mean? What does it mean for two pieces of code to be two different implementations of the same algorithm? Am I missing some subtlety, or do two functions that do the same thing but vary slightly in syntax count as two separate implementations of the same algorithm? And how is the fact that the count depends on the algorithm good, while the fact that it depends on the implementation is bad?
Neither is correct; the truth lies somewhere in the middle. The algorithm isn't everything. Maybe it was 30+ years ago, but today a compiler can deconstruct your algorithm and reconstruct it differently (if it has been programmed to recognize what you are trying to do).
Mathematically: you have probably heard the elementary-school story about adding up all the numbers from 1 to 100 (it's easier to run from 0 to 100). That is 100 addition operations, yes? Plus a loop, which is a counter and a compare. Well, what if you were to realize that 0+100 = 100, 99+1 = 100, 98+2 = 100? There are 50 pairs that add up to 100, and then 50 is left by itself, so we could reduce the 100 additions plus the loop's 100 additions and compares down to 50*100+50, or 50*101: one multiplication. You could probably make a generic algorithm out of this with some constraints: add up all the numbers from 0 to N, with N positive as the constraint. Even vs. odd values of N might produce different formulas, perhaps not, but it probably has an N/2 in there, a multiply, and maybe an add. Far cheaper than doing N additions in a loop, where the loop variable itself has to do that many additions and compares.
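The arithmetic above can be sketched directly (function names are mine; the closed form assumes an even n, matching the pairing story and the fun2 further down):

```c
#include <stdint.h>

/* Sum 0..n two ways: n additions in a loop vs. the pairing trick,
   n/2 pairs that each sum to n+1 (for n = 100: 50*101 = 5050). */
unsigned int sum_loop(unsigned int n)
{
    unsigned int total = 0;
    for (unsigned int i = 0; i <= n; i++)
        total += i;              /* n additions, n compares, n increments */
    return total;
}

unsigned int sum_pairs(unsigned int n)   /* n assumed even */
{
    return (n / 2) * (n + 1);    /* one shift, one add, one multiply */
}
```

Same answer, wildly different operation counts, which is the whole point of the chalkboard story.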
But what about implementation:
00000000 <fun1>:
0: e59f0000 ldr r0, [pc] ; 8 <fun1+0x8>
4: e12fff1e bx lr
8: 000013ba ; <UNDEFINED> instruction: 0x000013ba
0000000c <fun2>:
c: e59f0000 ldr r0, [pc] ; 14 <fun2+0x8>
10: e12fff1e bx lr
14: 000013ba ; <UNDEFINED> instruction: 0x000013ba
00000018 <fun3>:
18: e59f0000 ldr r0, [pc] ; 20 <fun3+0x8>
1c: e12fff1e bx lr
20: d783574e strle r5, [r3, lr, asr #14]
The algorithm was irrelevant in this case; note that the compiler even reduced the summation loop into the answer (0x13ba = 5050).
unsigned int fun1 ( unsigned int x )
{
return(x*10);
}
unsigned int fun2 ( unsigned int x )
{
return((x<<3)+(x<<1));
}
unsigned int fun3 ( unsigned int x )
{
return(((x<<2)+x)<<1);
}
I was hoping for a multiply, but of course didn't get one; maybe I needed to specify the CPU.
00000000 <fun1>:
0: e0800100 add r0, r0, r0, lsl #2
4: e1a00080 lsl r0, r0, #1
8: e12fff1e bx lr
0000000c <fun2>:
c: e1a03080 lsl r3, r0, #1
10: e0830180 add r0, r3, r0, lsl #3
14: e12fff1e bx lr
00000018 <fun3>:
18: e0800100 add r0, r0, r0, lsl #2
1c: e1a00080 lsl r0, r0, #1
20: e12fff1e bx lr
It didn't need to recognize fun2, and the others are the same. I have seen the MIPS backend actually have one function call another midway, so fun3 would branch to address 0 in this case, for example, which is more costly than just running the instructions. It didn't do that for me on this one, so perhaps I need a more complicated function.
Now, assuming x is an even number:
unsigned int fun1 ( unsigned int x )
{
unsigned int ra;
unsigned int rb;
rb=0;
for(ra=0;ra<=x;ra++) rb+=ra;
return(rb);
}
unsigned int fun2 ( unsigned int x )
{
return((x/2)*(x+1));
}
we should get a different result; the compiler is not that smart...
00000000 <fun1>:
0: e3a02000 mov r2, #0
4: e1a03002 mov r3, r2
8: e0822003 add r2, r2, r3
c: e2833001 add r3, r3, #1
10: e1500003 cmp r0, r3
14: 2afffffb bcs 8 <fun1+0x8>
18: e1a00002 mov r0, r2
1c: e12fff1e bx lr
00000020 <fun2>:
20: e1a030a0 lsr r3, r0, #1
24: e2802001 add r2, r0, #1
28: e0000293 mul r0, r3, r2
2c: e12fff1e bx lr
We assume the multiply is cheap. Docs will say one clock, but that is not necessarily true: there is a pipe, and designers can save a ton of chip real estate by consuming more clocks and burying the time in the pipe; or, as you see in a non-pipelined processor, the clocks for a multiply are longer. We can assume here it is buried in the pipe, and if you can keep the pipe moving smoothly it is really fast.
Anyway, we can safely assume with the last example that the loop of additions is much slower than the optimized algorithm. So both the algorithm and the implementation help us here.
unsigned int fun1 ( unsigned int x )
{
return(x/10);
}
00000000 <fun1>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <fun1+0x10>
4: e0821390 umull r1, r2, r0, r3
8: e1a001a2 lsr r0, r2, #3
c: e12fff1e bx lr
10: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd
This is a fun one. I can show (and have shown) that the multiply-by-1/5th-or-1/10th solution is slower than a straight divide if your processor has the divide: there is the additional load, and the shift as well as the multiply, where the divide might be just a load and a divide. You have to have the memory be slow enough that the extra load and the extra fetch swallow the difference. Then again, divides are slower than multiplies in general, so the compiler is still right: most of the time the multiply is faster, and this solution is okay. But it didn't implement the operation we asked for directly, so the algorithm changed from the desired one to something else. The implementation saved the algorithm, or at least didn't hurt it.
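To spell out what the compiler did there (a sketch; the function name is mine, but the constant and shifts mirror the listing's umull plus lsr #3 on the high word):

```c
#include <stdint.h>

/* Divide by 10 the way the listing does: multiply by the "magic"
   reciprocal 0xCCCCCCCD (roughly 2^35/10), keep the high word of the
   64-bit product (the umull result), then shift right by 3 more. */
uint32_t div10(uint32_t x)
{
    uint64_t p = (uint64_t)x * 0xCCCCCCCDu; /* umull: full 64-bit product */
    return (uint32_t)(p >> 32) >> 3;        /* high word, then lsr #3     */
}
```

This constant is chosen so the result is exact for every 32-bit unsigned input, which is why the compiler can substitute it for the divide.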
Look up the FFT. It is a classic example: you start with the elementary algorithm that has some amount of math (you can count the operations), then find various ways to re-arrange the data and operations to reduce that math, and reduce it further. And that is great; in that case you are quite likely helping the compiler. But implementation can help further if you let it, and specifically, how you write your code can take a great algorithm and make it worse.
unsigned int fun1 ( unsigned int x )
{
return(x*10.0);
}
00000000 <fun1>:
0: ee070a90 vmov s15, r0
4: ed9f6b05 vldr d6, [pc, #20] ; 20 <fun1+0x20>
8: eeb87b67 vcvt.f64.u32 d7, s15
c: ee277b06 vmul.f64 d7, d7, d6
10: eefc7bc7 vcvt.u32.f64 s15, d7
14: ee170a90 vmov r0, s15
18: e12fff1e bx lr
1c: e1a00000 nop ; (mov r0, r0)
20: 00000000 andeq r0, r0, r0
24: 40240000 eormi r0, r4, r0
unsigned int fun1 ( unsigned int x )
{
return(x*10.0F);
}
00000000 <fun1>:
0: ee070a90 vmov s15, r0
4: ed9f7a04 vldr s14, [pc, #16] ; 1c <fun1+0x1c>
8: eef87a67 vcvt.f32.u32 s15, s15
c: ee677a87 vmul.f32 s15, s15, s14
10: eefc7ae7 vcvt.u32.f32 s15, s15
14: ee170a90 vmov r0, s15
18: e12fff1e bx lr
1c: 41200000 ; <UNDEFINED> instruction: 0x41200000
Subtle: this needed a 32-bit constant vs. a 64-bit one, and the math is single vs. double precision; in a more complicated algorithm that will add up. And in the end, could we have just done a fixed-point multiply and gotten the same result?
unsigned int fun1 ( unsigned int x )
{
return((((x<<1)*20)+1)>>1);
}
00000000 <fun1>:
0: e0800100 add r0, r0, r0, lsl #2
4: e1a00180 lsl r0, r0, #3
8: e1a000a0 lsr r0, r0, #1
c: e12fff1e bx lr
Would there have been any rounding anyway with x being an integer?
There is no fact either way. It is not a fact that the implementation does not matter (even in a classroom: one small chalkboard vs. several wide ones, or a whiteboard whose markers last longer and erase just as easily). It is not a fact that the algorithm does not matter, that the programming language does not matter, that the compiler does not matter, that the compiler options do not matter, or that the processor does not matter.
Timing your algorithm's execution is not the end-all, be-all either; I can easily demonstrate that the same machine code runs slower or faster on the same processor and system, without doing things like changing the clock speed.
It is not uncommon for the method of timing the algorithm to add error into the result. Want to make it fast on one system? Time, tweak, time, tweak; the tweaking at times involves trying different algorithms. For a family of similar systems it is the same deal, but understand where the performance gains came from and adjust based on how those factors vary across the family of targets.
That the algorithm matters is a fact. That the implementation matters is a fact.
Note there is no reason to get into an argument with your professor (I would call that a fact): get through the class, pass it, and move on. Pick your battles, just as you would with your boss or co-workers in the real world. But unlike the real world, once you finish the semester you are done with that class, and perhaps that professor, forever; in the real world you may have those coworkers and that boss for a long time, and one bad or lost battle can affect you for a long time, even if you are right.
I can't speak for what the course authors meant, but perhaps I can clear up your second issue.
An algorithm is a description of the actions needed in order to achieve a certain goal or computation, most often given in the language of mathematics. Computer programs are one way of implementing an algorithm[1], and the most common. Even though they are quite abstract things, they're still far more concrete than the mathematical description. They're tied to the programming language and environment they are written in and its various quirks, to the specifics of the problem you're trying to solve[2], and even to the particular engineer who is writing them. So it's natural for two programs (or parts of programs) which implement a certain algorithm to be different, and even to have different performance properties. The number of instructions executed for a certain input thus definitely falls into the bucket of properties which vary between implementations.
[1] Another way might be in hardware, like a digital circuit or an analog computer, or through some mechanical process, like a clock or one of those mechanical automatons from the 19th century, or even some biological or chemical process.
[2] To clarify, a general purpose sorting routine might be written in a different way than a 16-bit integer sorting routine, even if both of them implement QuickSort.
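As a concrete illustration of [2], here is one algorithm (insertion sort, picked for brevity) implemented twice: once generically through a comparator, once specialized to 16-bit integers. Both are "the" algorithm, yet the work done per element differs considerably (the code and names are mine, not from the course):

```c
#include <stdint.h>
#include <string.h>

/* Generic implementation: element size and ordering supplied at run time. */
void isort_generic(void *base, int n, int size,
                   int (*cmp)(const void *, const void *))
{
    unsigned char tmp[16];            /* sketch assumes size <= 16 */
    unsigned char *a = base;
    for (int i = 1; i < n; i++) {
        memcpy(tmp, a + i * size, size);
        int j = i - 1;
        while (j >= 0 && cmp(a + j * size, tmp) > 0) {
            memcpy(a + (j + 1) * size, a + j * size, size);
            j--;
        }
        memcpy(a + (j + 1) * size, tmp, size);
    }
}

/* An example comparator for the generic version. */
int cmp_u16(const void *a, const void *b)
{
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Specialized implementation: the same algorithm, no indirection. */
void isort_u16(uint16_t *a, int n)
{
    for (int i = 1; i < n; i++) {
        uint16_t tmp = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > tmp) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = tmp;
    }
}
```

Counting "operations" per comparison gives different numbers for the two versions (indirect calls and memcpy vs. plain loads and stores), even though the algorithmic step count is identical.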
On a Discoboard (ARM7) I'm attempting to implement Fletcher's checksum from https://en.wikipedia.org/wiki/Fletcher%27s_checksum, and the input is a single 32-bit word.
I couldn't implement the 32-bit version of Fletcher's, as it required loading a huge number into memory, so:
I'm splitting the 32-bit word into two 16-bit halfwords and then running the Fletcher-16 algorithm.
However, the output is always just the sum of the numbers, which seems very wrong to me.
e.g.,
Input: 0x1b84ccc / 1101110000100110011001100
Expected Output:
Checksum value
Real Output:
The sum of the 2 16 bit half words. Wut
Could anyone help if this is the actual algorithm, or have i made an error?
# Input:
# r0: 32 bit message
# Output:
# r0: checksum value
fletchers_checksum:
push {r1-r4,lr}
mov r3, #0 # store the sum
mov r4, r0 # store message
#split to 2 16 bit messages:
##take frequency
ldr r1, =#0xFFFF0000
and r0, r1, r4
lsr r0, #16
bl compute_checksum_for_16_bit_number
##amplitude
ldr r1, =#0xFFFF
and r0, r1, r4
bl compute_checksum_for_16_bit_number
mov r0, r3
pop {r1-r3,lr}
bx lr
compute_checksum_for_16_bit_number:
push {lr}
ldr r1, =#65536
add r0, r3 #add current sum to it.
bl mod
mov r3, r0 #store new sum
pop {lr}
bx lr
Thank you!
From the linked Wikipedia page:
Usually, the second sum will be multiplied by 2^16 and added to the
simple checksum, effectively stacking the sums side-by-side in a
32-bit word with the simple checksum at the least significant end.
Your code appears to calculate the two 16-bit checksums, but not to shift the second checksum by 16 bits as required.
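A plain-C sketch of the intended computation may make the fix concrete. Note that Wikipedia's variant with 16-bit blocks reduces both sums modulo 65535 (not 65536), and the second sum gets shifted up as quoted above; the function and variable names here are mine:

```c
#include <stdint.h>

/* Fletcher-style checksum over the two 16-bit halves of a 32-bit word:
   sum1 accumulates the data, sum2 accumulates sum1, both mod 65535,
   and the result stacks sum2 above sum1 in a 32-bit word. */
uint32_t fletcher_halves(uint32_t msg)
{
    uint16_t blocks[2] = { (uint16_t)(msg >> 16), (uint16_t)msg };
    uint32_t sum1 = 0, sum2 = 0;

    for (int i = 0; i < 2; i++) {
        sum1 = (sum1 + blocks[i]) % 65535;
        sum2 = (sum2 + sum1) % 65535;
    }
    return (sum2 << 16) | sum1;  /* the shift the posted code is missing */
}
```

Without that final shift-and-combine, the result degenerates to just the running sum, which is exactly the symptom described.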
I have looked up the insrwi instruction and nothing explains it well. The docs say that rlwimi can be used as an equivalent, but I don't know that instruction either.
Code with it in there:
andi. r0, r6, 3 # while(r6 != 3)
bdnzf eq, loc_90014730 # if(CTR != 0) loc_90014730();
insrwi r4, r4, 8,16 # ????
srwi. r0, r5, 4 # r0 = r5 >> 4;
insrwi r4, r4, 16,0
(r4 == 0)
I've been stuck on this instruction for a while. Please don't just give me the result; give me a detailed explanation.
I think you need to do some experiments with rlwimi to fully explain it to yourself, but here is what I find helpful.
There is a programming note in Book 1 of the PowerPC Programming Manual for rlwimi that provides a little more detail on inslwi and insrwi:
rlwimi can be used to insert an n-bit field that is left-justified in
the low-order 32 bits of register RS, into RAL starting at bit
position b, by setting SH=32-b, MB=b, and ME=(b+n)-1. It can be used
to insert an n-bit field that is right-justified in the low-order 32
bits of register RS, into RAL starting at bit position b, by setting
SH=32-(b+n), MB=b, and ME=(b+n)-1.
It also helps to compare the results of insrwi and inslwi. Here are two examples tracing through the rlwimi procedure, where r4=0x12345678.
insrwi r4,r4,8,16 is equivalent to rlwimi r4,r4,8,16,23
Rotate left 8 bits and notice it puts the last 8 bits of the original r4 in those positions that match the generated mask: 0x34567812
Generate the mask: 0x0000FF00
Insert the last 8 bits, which were those 8 bits that were right justified in r4, under the control of the generated mask: 0x12347878
So insrwi takes n bits from the right side (starting at bit 32) and inserts them into the destination register starting at bit b.
inslwi r4,r4,8,16 is equivalent to rlwimi r4,r4,16,16,23
Rotate left 16 bits and notice it puts the first 8 bits of the original r4 in those positions that match the generated mask: 0x56781234
Generate the mask: 0x0000FF00
Insert the first 8 bits, which were those 8 bits that were left justified in r4, under the control of the generated mask: 0x12341278
So inslwi takes n bits from the left side (starting at bit 0) and inserts them into the destination register starting at bit b.
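Both traces can be reproduced with a small C model of rlwimi (big-endian bit numbering as in the Power manuals, bit 0 = MSB; this sketch assumes a non-wrapping mask, MB <= ME):

```c
#include <stdint.h>

/* rlwimi RA,RS,SH,MB,ME: rotate RS left by SH bits, then insert the
   rotated value into RA under a mask of ones running from bit MB
   through bit ME, where bit 0 is the most significant bit. */
uint32_t rlwimi(uint32_t ra, uint32_t rs, int sh, int mb, int me)
{
    uint32_t rot  = sh ? (rs << sh) | (rs >> (32 - sh)) : rs;
    uint32_t mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31 - me));
    return (rot & mask) | (ra & ~mask);
}
```

With r4 = 0x12345678, `rlwimi(r4, r4, 8, 16, 23)` (i.e. insrwi r4,r4,8,16) yields 0x12347878, and `rlwimi(r4, r4, 16, 16, 23)` (i.e. inslwi r4,r4,8,16) yields 0x12341278, matching the two worked examples.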
PowerISA 2.07 [1] states that insrwi is an extended mnemonic of rlwimi, showing the equivalent rlwimi instruction and how the two are related.
PowerISA probably has the detail level you want. :)
[1] https://www.power.org/documentation/power-isa-version-2-07/ (or google, pdf)
My processor, a small 16-bit microcontroller with no FPU and integer-only math, has 16/16 division and 32/16 division, both of which take 18 cycles. At the moment I'm using a very slow software routine (~7,500 cycles) to do 64/32 division. Is there any way to use these division engines to calculate a 64/32 division, similar to how I'm already using the 16x16 multiplier and adder to calculate 32x32 multiplies? I'm using C but can work with any general explanation of how it can be done... I'm hoping to target <200 cycles (if it's at all possible).
See "Hacker's Delight", multiword division (pages 140-145).
The basic concept (going back to Knuth) is to think of your problem in base-65536 terms. Then you have a 4 digit by 2 digit division problem, with 2/1 digit division as a primitive.
The C code is here: https://github.com/hcs0/Hackers-Delight/blob/master/divmnu.c.txt
My copy of Knuth (The Art of Computer Programming) is at work, so I can't check it until Monday, but that would be my first source. It has a whole section on arithmetic.
edit: regarding your note about "16/16 division and 32/16 division which both take 18 cycles" -- dsPICs have a conditional-subtract operation in assembly. Consider using it as your computational primitive.
Also note that if X = XH * 2^32 + XL and D = DH * 2^16 + DL, then if you are looking for
(Q, R) = X/D where X = Q * D + R
with Q = QH * 2^16 + QL and R = RH * 2^16 + RL, then
XH * 2^32 + XL = DH * QH * 2^32 + (DL * QH + DH * QL) * 2^16 + (DL * QL) + RH * 2^16 + RL
This suggests (by looking at the terms that make up the high bits) the following procedure, akin to long division:
1. (QH, R0) = XH / (DH+1) -> XH = QH * (DH+1) + R0 [32/16 divide]
2. R1 = X - (QH * 2^16) * D [requires a 16*32 multiply, a shift-left by 16, and a 64-bit subtract]
3. calculate R1' = R1 - D * 2^16
4. while R1' >= 0: adjust QH upwards by 1, set R1 = R1', and go to step 3
5. (QL, R2) = (R1 >> 16) / (DH+1) -> R1 >> 16 = QL * (DH+1) + R2 [32/16 divide]
6. R3 = R1 - (QL * D) [requires a 16*32 multiply and a 48-bit subtract]
7. calculate R3' = R3 - D
8. while R3' >= 0: adjust QL upwards by 1, set R3 = R3', and go to step 7
Your 32-bit quotient is the pair (QH, QL), and the 32-bit remainder is R3.
(This assumes that the quotient is not larger than 32-bit, which you need to know ahead of time, and can easily check ahead of time.)
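The procedure above can be sketched in C, with ordinary C division standing in for the hardware 32/16 divide primitive (the function name is mine; it assumes d != 0 and a quotient that fits in 32 bits):

```c
#include <stdint.h>

/* Direct transcription of the long-division steps above.  Note the
   correction loops can iterate many times when DH is small, so a real
   implementation would normalize d first (as Knuth's Algorithm D does). */
uint32_t div64_32(uint64_t x, uint32_t d, uint32_t *rem)
{
    uint32_t dh1 = (d >> 16) + 1;              /* DH + 1               */
    uint32_t qh  = (uint32_t)(x >> 32) / dh1;  /* step 1: 32/16 divide */

    uint64_t r1 = x - ((uint64_t)qh << 16) * d;  /* step 2             */
    while (r1 >= ((uint64_t)d << 16)) {          /* steps 3-4: adjust  */
        qh++;
        r1 -= (uint64_t)d << 16;
    }

    uint32_t ql = (uint32_t)(r1 >> 16) / dh1;  /* step 5: 32/16 divide */
    uint64_t r3 = r1 - (uint64_t)ql * d;       /* step 6               */
    while (r3 >= d) {                          /* steps 7-8: adjust    */
        ql++;
        r3 -= d;
    }

    *rem = (uint32_t)r3;
    return (qh << 16) | ql;                    /* Q = (QH, QL)         */
}
```

Using DH+1 as the trial divisor guarantees the estimates never overshoot, which is why the corrections only ever adjust upward.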
Starting point would be:
D. Knuth, The Art of Computer Programming Vol.2, Section 4.3.1, Algorithm D
But I suppose you may need to optimize the algorithm.
You may want to look at Booth's Algorithm (http://www.scribd.com/doc/3132888/Booths-Algorithm-Multiplication-Division).
The part you want is about halfway down the page.
I haven't looked at this since my VLSI class, but this may be your best bet. If possible, you may want to do it in assembly to optimize it as much as possible, especially if you will be calling it often.
Basically it involves shifting and adding or subtracting.
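The shifting-and-subtracting idea, in its simplest (restoring) form, looks like this in C; it produces one quotient bit per step, which is exactly the loop that a conditional-subtract instruction accelerates (this is a generic sketch, not the linked page's exact algorithm):

```c
#include <stdint.h>

/* Restoring division: shift in one dividend bit at a time, and at each
   step conditionally subtract the divisor from the running remainder.
   Assumes d != 0 and that the quotient fits in 32 bits. */
uint32_t div_shift_sub(uint64_t x, uint32_t d, uint32_t *rem)
{
    uint64_t r = 0;
    uint32_t q = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((x >> i) & 1);  /* bring down the next bit */
        q <<= 1;
        if (r >= d) {                   /* the conditional subtract */
            r -= d;
            q |= 1;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

At 64 iterations this is far slower than the divide-engine approach above, but it shows where the per-bit cost comes from.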