I've been studying algorithm efficiency, and one part of a course said that counting the number of operations (as opposed to timing the algorithm) depends on how a function is implemented (which is presented as a downside), and that it also depends on the algorithm (which is presented as an upside).
What exactly does this mean? What does it mean for two pieces of code to be two different implementations of the same algorithm? Am I missing some subtlety here, or does it just mean that two functions that do the same thing but vary slightly in their syntax count as two separate implementations of the same algorithm? And how is it good that the count depends on the algorithm, but bad that it depends on the implementation?
Neither view is entirely correct; the truth lies somewhere in the middle. The algorithm isn't everything. Maybe 30+ years ago it was, but today a compiler can deconstruct your algorithm and reconstruct it differently (if it has been programmed to recognize what you are trying to do).
Mathematically: you have probably heard the elementary-school story about adding up all the numbers from 1 to 100 (or from 0 to 100, which is the same sum). Done naively that is 99 or 100 addition operations, plus a loop, which means a counter increment and a compare on every pass. Well, what if you were to realize that 0+100 = 100, 1+99 = 100, 2+98 = 100, and so on? There are 50 pairs that each add up to 100, and then 50 is left by itself. So we can reduce the 100 additions plus the loop overhead down to 50*100+50, or 50*101: one multiplication. You could probably make a generic algorithm out of this with some constraints, say adding up all the numbers from 0 to N with N positive; even vs. odd values of N might need slightly different handling, but the closed form probably has an N/2 in there, a multiply, and maybe an add. Far cheaper than doing N additions in a loop, where the loop variable itself has to do that many more additions and compares.
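To make the operation-count difference concrete, here is a minimal C sketch (my own, not from any course) of both approaches, the naive loop and the closed form:

```c
#include <assert.h>

/* Naive summation of 0..n: n additions, plus a counter increment and a
   compare on every trip around the loop. */
static unsigned int sum_loop(unsigned int n)
{
    unsigned int total = 0;
    for (unsigned int i = 0; i <= n; i++)
        total += i;
    return total;
}

/* Gauss's pairing trick as a closed form: n*(n+1) is always even, so
   the division is exact. One add, one multiply, one divide-by-2,
   regardless of n (beware overflow of n*(n+1) for very large n). */
static unsigned int sum_closed(unsigned int n)
{
    return n * (n + 1) / 2;
}
```

Both return 5050 for n = 100, which is the 0x13ba you will see in the disassembly below.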
But what about implementation:
00000000 <fun1>:
0: e59f0000 ldr r0, [pc] ; 8 <fun1+0x8>
4: e12fff1e bx lr
8: 000013ba ; <UNDEFINED> instruction: 0x000013ba
0000000c <fun2>:
c: e59f0000 ldr r0, [pc] ; 14 <fun2+0x8>
10: e12fff1e bx lr
14: 000013ba ; <UNDEFINED> instruction: 0x000013ba
00000018 <fun3>:
18: e59f0000 ldr r0, [pc] ; 20 <fun3+0x8>
1c: e12fff1e bx lr
20: d783574e strle r5, [r3, lr, asr #14]
The algorithm was irrelevant in this case; note the compiler even reduced the pseudo-random summation loop into the answer (0x13ba is 5050).
unsigned int fun1 ( unsigned int x )
{
return(x*10);
}
unsigned int fun2 ( unsigned int x )
{
return((x<<3)+(x<<1));
}
unsigned int fun3 ( unsigned int x )
{
return(((x<<2)+x)<<1);
}
I was hoping for a multiply but of course didn't get one; maybe I needed to specify the CPU.
00000000 <fun1>:
0: e0800100 add r0, r0, r0, lsl #2
4: e1a00080 lsl r0, r0, #1
8: e12fff1e bx lr
0000000c <fun2>:
c: e1a03080 lsl r3, r0, #1
10: e0830180 add r0, r3, r0, lsl #3
14: e12fff1e bx lr
00000018 <fun3>:
18: e0800100 add r0, r0, r0, lsl #2
1c: e1a00080 lsl r0, r0, #1
20: e12fff1e bx lr
It didn't recognize that fun2 and the others are the same. I have seen the MIPS backend actually have one function call into another midway, so fun3 would branch to address 0 in this case, for example, which is more costly than just running the instructions. It didn't do that for me on this one, so perhaps I need a more complicated function.
Now, assuming x is an even number:
unsigned int fun1 ( unsigned int x )
{
unsigned int ra;
unsigned int rb;
rb=0;
for(ra=0;ra<=x;ra++) rb+=ra;
return(rb);
}
unsigned int fun2 ( unsigned int x )
{
return((x/2)*(x+1));
}
we should get different code; the compiler is not smart enough to turn the loop into the closed form...
00000000 <fun1>:
0: e3a02000 mov r2, #0
4: e1a03002 mov r3, r2
8: e0822003 add r2, r2, r3
c: e2833001 add r3, r3, #1
10: e1500003 cmp r0, r3
14: 2afffffb bcs 8 <fun1+0x8>
18: e1a00002 mov r0, r2
1c: e12fff1e bx lr
00000020 <fun2>:
20: e1a030a0 lsr r3, r0, #1
24: e2802001 add r2, r0, #1
28: e0000293 mul r0, r3, r2
2c: e12fff1e bx lr
We assume the multiply is cheap. The docs will say one clock, but that is not necessarily true: there is a pipeline, and the designers can save a ton of chip real estate by taking more clocks and burying the time in the pipe, or, as you see on a non-pipelined processor, the clocks for a multiply are simply longer. We can assume here it is buried in the pipe, and if you can keep the pipe moving smoothly it is really fast.
Anyway, we can safely assume that in the last example the loop of additions is much slower than the optimized algorithm. So both the algorithm and the implementation help us here.
unsigned int fun1 ( unsigned int x )
{
return(x/10);
}
00000000 <fun1>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <fun1+0x10>
4: e0821390 umull r1, r2, r0, r3
8: e1a001a2 lsr r0, r2, #3
c: e12fff1e bx lr
10: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd
This is a fun one. I can (and have) shown that the multiply-by-1/5th-or-1/10th solution is slower than a straight divide if your processor has a divide: there is the additional constant load, and there is the shift as well as the multiply, where the divide might be just a load and a divide. You have to have memory slow enough that the extra load and the extra fetch swallow the difference. Then again, divides are slower than multiplies in general, so the compiler is still right most of the time, and this solution is okay. But it didn't implement the operation we asked for directly, so the algorithm changed from the desired one to something else; implementation saved the algorithm, or at least didn't hurt it.
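For the curious, here is a C sketch (mine, not from the compiler) of what that umull/lsr pair is doing: x/10 computed as the high bits of a 64-bit product with the magic constant 0xCCCCCCCD, which is roughly 2^35/10.

```c
#include <stdint.h>

/* Divide by 10 without a divide instruction, mirroring the listing
   above: multiply by 0xCCCCCCCD (the ldr-ed constant, about 2^35/10),
   keep the high 32 bits (the umull), then shift right 3 more (the
   lsr #3), for a total shift of 35. Exact for every 32-bit unsigned x. */
static uint32_t div10(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;
    return (uint32_t)(product >> 35);
}
```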
Look up FFT, this is a classic example of starting with the elementary algorithm that has some amount of math, you can count the operations, then various ways to re-arrange the data/operations to reduce that math, and further reduce it. And that is great, in that case you are quite likely helping the compiler. But implementation could help further if you let it and specifically how you write your code can take a great algorithm and make it worse.
unsigned int fun1 ( unsigned int x )
{
return(x*10.0);
}
00000000 <fun1>:
0: ee070a90 vmov s15, r0
4: ed9f6b05 vldr d6, [pc, #20] ; 20 <fun1+0x20>
8: eeb87b67 vcvt.f64.u32 d7, s15
c: ee277b06 vmul.f64 d7, d7, d6
10: eefc7bc7 vcvt.u32.f64 s15, d7
14: ee170a90 vmov r0, s15
18: e12fff1e bx lr
1c: e1a00000 nop ; (mov r0, r0)
20: 00000000 andeq r0, r0, r0
24: 40240000 eormi r0, r4, r0
unsigned int fun1 ( unsigned int x )
{
return(x*10.0F);
}
00000000 <fun1>:
0: ee070a90 vmov s15, r0
4: ed9f7a04 vldr s14, [pc, #16] ; 1c <fun1+0x1c>
8: eef87a67 vcvt.f32.u32 s15, s15
c: ee677a87 vmul.f32 s15, s15, s14
10: eefc7ae7 vcvt.u32.f32 s15, s15
14: ee170a90 vmov r0, s15
18: e12fff1e bx lr
1c: 41200000 ; <UNDEFINED> instruction: 0x41200000
Subtle: it needed a 32-bit constant instead of a 64-bit one, and the math is single precision instead of double. Take a more complicated algorithm and that will add up. And in the end, could we have just done a fixed-point multiply and gotten the same result?
unsigned int fun1 ( unsigned int x )
{
return((((x<<1)*20)+1)>>1);
}
00000000 <fun1>:
0: e0800100 add r0, r0, r0, lsl #2
4: e1a00180 lsl r0, r0, #3
8: e1a000a0 lsr r0, r0, #1
c: e12fff1e bx lr
Would there have been any rounding anyway, with x being an integer?
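No: a quick C check (my own sketch) shows the +1 can never round anything here, because (x<<1)*20 = 40x is always even, so the 1 lands in the bit that the final >>1 throws away; that is presumably why the compiler dropped it entirely.

```c
#include <assert.h>

/* The fixed-point expression from above. Since 40*x is always even,
   the +1 sits in the discarded low bit and the result is exactly 20*x,
   matching the add/lsl/lsr sequence the compiler emitted. */
static unsigned int times20_fixed(unsigned int x)
{
    return (((x << 1) * 20) + 1) >> 1;
}
```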
There is no fact either way. It is not a fact that implementation does not matter (even in a classroom: one small chalkboard vs. several wide ones, or a whiteboard whose markers last longer and erase just as easily). It is not a fact that the algorithm does not matter, nor that the programming language does not matter, nor the compiler, nor the compiler options, nor the processor.
Timing your algorithm's execution is not the be-all and end-all either; I can easily demonstrate that the same machine code runs slower or faster on the same processor and system, without doing things like changing the clock speed.
It is not uncommon for the method of timing the algorithm itself to add error into the result. Want to make it fast on one system? Time, tweak, time, tweak; the tweaking at times involves trying different algorithms. For a family of similar systems it is the same deal, but understand where the performance gains came from and adjust based on how those factors vary across the family of targets.
That the algorithm matters is a fact. That the implementation matters is a fact.
Note there is no reason to get into an argument with your professor (I would call that a fact); get through the class, pass it, and move on. Pick your battles, just like you would with your boss or co-workers in the real world. But unlike the real world, once you finish the semester you are done with that class, and perhaps that professor, forever; in the real world you may have those coworkers and that boss for a long time, and one bad battle, or one lost battle, can affect you for a long time. Even if you are right.
I can't speak for what the course authors meant, but perhaps I can clear up your second question.
An algorithm is a description of the actions needed in order to achieve a certain goal or computation, most often given in the language of mathematics. Computer programs are one way of implementing an algorithm[1], and the most common. Even though they are quite abstract things, they are still far more concrete than the mathematical description. They are tied to the programming language and environment they are written in and its various quirks, to the specifics of the problem you are trying to solve[2], and even to the particular engineer who is writing them. So it is natural for two programs (or parts of programs) which implement a certain algorithm to be different, and even to have different performance properties. The number of instructions executed for a certain input definitely falls into the bucket of properties which vary between two implementations.
[1] Another way might be in hardware, like a digital circuit or an analog computer, or through some mechanical process, like a clock or one of those mechanical automatons from the 19th century, or even some biological or chemical process.
[2] To clarify, a general purpose sorting routine might be written in a different way than a 16-bit integer sorting routine, even if both of them implement QuickSort.
On a discoboard (ARM7) I'm attempting to implement Fletcher's checksum from https://en.wikipedia.org/wiki/Fletcher%27s_checksum, and the input is a single 32-bit word.
I couldn't implement the 32-bit version of Fletcher's, as it required loading a huge number into memory, so:
I'm splitting the 32-bit word into two 16-bit half-words and then running the Fletcher-16 algorithm.
However, the output is always just the sum of the numbers, which seems very wrong to me.
eg,
Input: 0x1b84ccc / 1101110000100110011001100
Expected Output:
Checksum value
Real Output:
The sum of the two 16-bit half-words. Wut
Could anyone tell me whether this is the actual algorithm, or have I made an error?
# Input:
# r0: 32 bit message
# Output:
# r0: checksum value
fletchers_checksum:
push {r1-r4,lr}
mov r3, #0 # store the sum
mov r4, r0 # store message
#split to 2 16 bit messages:
##take frequency
ldr r1, =#0xFFFF0000
and r0, r1, r4
lsr r0, #16
bl compute_checksum_for_16_bit_number
##amplitude
ldr r1, =#0xFFFF
and r0, r1, r4
bl compute_checksum_for_16_bit_number
mov r0, r3
pop {r1-r3,lr}
bx lr
compute_checksum_for_16_bit_number:
push {lr}
ldr r1, =#65536
add r0, r3 #add current sum to it.
bl mod
mov r3, r0 #store new sum
pop {lr}
bx lr
Thank you!
From the linked Wikipedia page:
Usually, the second sum will be multiplied by 2^16 and added to the
simple checksum, effectively stacking the sums side-by-side in a
32-bit word with the simple checksum at the least significant end.
Your code appears to calculate the two 16-bit checksums, but not to shift the second checksum by 16 bits as required.
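For reference, here is a minimal C sketch of Fletcher-16 as described on the linked Wikipedia page (byte inputs, sums modulo 255): the (sum2 << 8) | sum1 combine at the end is the shift step your assembly is missing, and without a separate second sum all you can ever get back is a plain sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 per the Wikipedia description: two running sums mod 255,
   with the second sum stacked beside the first in the result. */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255; /* simple checksum */
        sum2 = (sum2 + sum1) % 255;    /* sum of the running sums */
    }
    return (uint16_t)((sum2 << 8) | sum1); /* the missing shift */
}
```

The standard test vector "abcde" gives 0xC8F0, which you can use to check an assembly port.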
So I have an LC3 coding assignment where we have to implement and test user subroutines for input and output of unsigned integers in decimal format. For input we have to take a sequence of keystrokes and construct a single integer value by applying a repeated-multiplication algorithm, where each multiplication by 10 is done via 4 additions. I am not really understanding this concept of multiplying by means of 4 additions. Could anyone please explain?
x is number you want to multiply by 10
a = x+x = 2x
b = a+a = 4x
c = b+b = 8x
d = a+c = 10x
If your value is in R1 you can try the following:
ADD R2, R1, R1 ;R2 = 2x
ADD R4, R2, R2 ;R4 = 4x
ADD R1, R4, R4 ;R1 = 8x
ADD R1, R1, R2 ;R1 = 8x + 2x = 10x
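The same sequence written out in C, one line per LC3 ADD, in case that helps:

```c
/* Multiply by 10 using only four additions, mirroring the LC3 ADDs:
   2x, then 4x, then 8x, then 8x + 2x = 10x. */
static unsigned int times10(unsigned int x)
{
    unsigned int a = x + x; /* a = 2x  (ADD R2, R1, R1) */
    unsigned int b = a + a; /* b = 4x  (ADD R4, R2, R2) */
    unsigned int c = b + b; /* c = 8x  (ADD R1, R4, R4) */
    return a + c;           /* 10x     (ADD R1, R1, R2) */
}
```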
I have looked insrwi up and nothing explains it well. It says that rlwimi can be used as an equivalent, but I don't know that instruction either.
Code with it in there:
andi. r0, r6, 3 # while(r6 != 3)
bdnzf eq, loc_90014730 # if(CTR != 0) loc_90014730();
insrwi r4, r4, 8,16 # ????
srwi. r0, r5, 4 # r0 = r5 >> 4;
insrwi r4, r4, 16,0
(r4 == 0)
I've been stuck on this instruction for a while. Please, don't just give me the result, please give me a detailed explanation.
I think you need to do some experiments with rlwimi to fully explain it to yourself, but here is what I find helpful.
There is a programming note in Book 1 of the PowerPC Programming Manual for rlwimi that provides a little more detail on inslwi and insrwi:
rlwimi can be used to insert an n-bit field that is left-justified in
the low-order 32 bits of register RS, into RAL starting at bit
position b, by setting SH=32-b, MB=b, and ME=(b+n)-1. It can be used
to insert an n-bit field that is right-justified in the low-order 32
bits of register RS, into RAL starting at bit position b, by setting
SH=32-(b+n), MB=b, and ME=(b+n)-1.
It also helps to compare the results of insrwi and inslwi. Here are two examples tracing through the rlwimi procedure, where r4=0x12345678.
insrwi r4,r4,8,16 is equivalent to rlwimi r4,r4,8,16,23
Rotate left 8 bits and notice it puts the last 8 bits of the original r4 in those positions that match the generated mask: 0x34567812
Generate the mask: 0x0000FF00
Insert the last 8 bits, which were those 8 bits that were right justified in r4, under the control of the generated mask: 0x12347878
So insrwi takes n bits from the right side (starting at bit 32) and inserts them into the destination register starting at bit b.
inslwi r4,r4,8,16 is equivalent to rlwimi r4,r4,16,16,23
Rotate left 16 bits and notice it puts the first 8 bits of the original r4 in those positions that match the generated mask: 0x56781234
Generate the mask: 0x0000FF00
Insert the first 8 bits, which were those 8 bits that were left justified in r4, under the control of the generated mask: 0x12341278
So inslwi takes n bits from the left side (starting at bit 0) and inserts them into the destination register starting at bit b.
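If it helps to experiment, here is a small C model of rlwimi (my own sketch, covering only the simple case 0 < SH < 32 and MB <= ME): rotate RS left by SH, build the MB..ME mask, and insert under that mask. Remember that PowerPC numbers bit 0 as the most significant bit.

```c
#include <stdint.h>

/* Model of rlwimi RA,RS,SH,MB,ME for 0 < SH < 32 and MB <= ME.
   Bit 0 is the MSB, so the mask has ones in big-endian bit
   positions MB through ME inclusive. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs, int sh, int mb, int me)
{
    uint32_t rotated = (rs << sh) | (rs >> (32 - sh));
    uint32_t mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31 - me));
    return (rotated & mask) | (ra & ~mask);
}
```

Tracing the two examples above with r4 = 0x12345678: rlwimi(0x12345678, 0x12345678, 8, 16, 23) gives 0x12347878 (insrwi), and rlwimi(0x12345678, 0x12345678, 16, 16, 23) gives 0x12341278 (inslwi).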
PowerISA 2.07 [1] states that insrwi is an extended mnemonic of rlwimi, and gives the equivalent rlwimi instruction and how the two are related.
Probably PowerISA has the detail level you want. :)
[1] https://www.power.org/documentation/power-isa-version-2-07/ (or google, pdf)
I was wondering how to display/output the value in a register.
ex: R3 has the value of 2 stored into it. I want to display that number to the screen.
The code below doesn't work because it tells me I need a label. I've also tried storing the value of R3 into R0, but when I display it I get some funky symbol(s).
LEA R0, R3
PUTS
Use the OUT instruction, and make sure you output ASCII codes:
LD R0, ZERO    ; R0 = x30, the ASCII code for '0'
ADD R0, R0, R3 ; R0 = '0' + the value in R3
OUT            ; print the character in R0
HALT
ZERO: .fill x30 ; ASCII code for '0'
Note: this will only work for single-digit numbers (0-9). If you want to display a number with more than one digit, you have to loop through all of its digits.
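Here is a sketch in C of that digit loop (the names are mine): peel off decimal digits lowest first, add x30 to each, then emit them in reverse. On the LC3 you would replace the division with repeated subtraction and OUT each character.

```c
/* Convert an unsigned value to its decimal ASCII digits. Returns the
   digit count; buf must have room for up to 10 digits plus the NUL. */
static int to_decimal(unsigned int value, char *buf)
{
    char tmp[10];
    int n = 0, len = 0;
    do {
        tmp[n++] = (char)('0' + value % 10); /* lowest digit first */
        value /= 10;
    } while (value != 0);
    while (n > 0)
        buf[len++] = tmp[--n]; /* reverse into printing order */
    buf[len] = '\0';
    return len;
}
```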