Most Efficient way to set Register to 1 or (-1) on original 8086 - performance

I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programing but I'm not an optimization expert, so I need your help in something (might be a very stupid question but I'll ask anyway):
if I need to set a register value to 1 or (-1) is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. ( I need to optimize both time and code size)

A quick google for 8086 instructions timings size turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through Pentium.
Although you should note that this probably doesn't include code fetch memory bottlenecks which can be very significant, especially on an 8088. This usually makes optimization for code-size a better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison that indicates the latter is better (less cycles, and same space):
Instructions
Clock cycles
Bytes
xor ax, axinc ax
33---6
21---3
mov ax, 1
4
3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1)No, don't really do that , but it's fun to vent occasionally :-)

You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.

Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipe-lining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions slightly to have another instruction between them (one that does not use ax).
Hope this helps.

I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make most difference. But any where else: executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction fetch bandwidth (moot for 16-bit code, both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); as others have said, sbb ax,ax could work too in circumstances - it's also shorter at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?

Related

What is the purpose of the 40h REX opcode in ASM x64?

I've been trying to understand the purpose of the 0x40 REX opcode for ASM x64 instructions. Like for instance, in this function prologue from Kernel32.dll:
As you see they use push rbx as:
40 53 push rbx
But using just the 53h opcode (without the prefix) also produces the same result:
According to this site, the layout for the REX prefix is as follows:
So 40h opcode seems to be not doing anything. Can someone explain its purpose?
the 04xh bytes (i.e. 040h, 041h... 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. That means that adding this byte doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.
Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed and your experience show that it is.
It is there because the people at Microsoft apparently generated the above code. I'd speculate that for the extra registers it is needed, so they generate it always and didn't bother to remove it when it is not needed. Another possibility is that the lengthening of the instruction has a subtle effect on scheduling and or aligning and can make the code faster. This of course requires detailed knowledge of the particular processor.
I'm working at an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are less cases to consider. Then as a last step superfluous prefixes can be removed among other things.

Assembly registers [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
Improve this question
My question WAS about getting as much info as I could about registers...No luck :/
Everyone got everything so wrong [Probably because English is not my native language].
So, the question will be more general... ;(
I need a tutorial with the BASICS!
Ah...Could I be more not-specific?
Also, thanks for the help in advance!
In general you can use any of eax, ebx, ecx, edx, esi and edi pretty much as you want. They can each hold any 32-bit value.
Keep in mind that if you call any Win32 API functions that they are free to modify eax, ecx and edx. So if you need to preserve the values of those registers across a function call you'll have to save them somewhere temporarily (e.g. on the stack).
Similarly, if you write a function that is to be called by another function (e.g. a Windows callback) you should preserve ebx, esi,edi and ebp within that function.
Some instructions are hardcoded to use certain registers. For example, the loop instruction uses (e)cx, the string instructions use esi/edi, the div instruction uses eax/edx, etc. You can find all such cases by going through the descriptions for all the instructions in Intel's manual.
The "fixed uses" of the registers derive from the ancient roots back in the 8086 days (and in some ways, even from before that).
The 8086 was an accumulator machine, you were supposed to do math mostly with ax (there was no eax yet), and a bit with dx. You can see this back in many instructions, for example most ALU ops have a smaller form for op ax, imm (also op al, imm) than for op other, imm, and the ancient decimal math instructions (daa and friends) operate only on al. There are instructions that always reference (e)ax and maybe (e)dx as "high half", see the "old multiplication" (with the single explicit operand), imul with an immediate was added in the 80186, imul reg, r/m was added in the 80386 which added a whole lot of stuff including 32bit mode. With 32bit mode also came the modern ModRM/SIB structure, here are the old 16bit version and the modern 32/64bit version. In the old version, there are only 4 registers that could ever be used in a memory operand, so there's a bit of the "fixed roles for registers" again. 32bit mode mostly removed that, except that esp can never be the index register (that wouldn't normally make sense anyway).
More recently, Haswell introduced shlx which removes the restriction that shifting by a variable amount could only be done using cl as the count, and mulx partially removed the fixed roles of registers for "wide multiplication" (80186 and 80386 only added the "general" forms for multiplication without the high half), mulx still gives edx a fixed role though.
More strangely, the relatively recently added pblendvb assigned a fixed role to xmm0, previous to that the vector registers weren't encumbered by such old-fashioned restrictions. That fixed role disappeared with AVX though, which allowed the extra operand to be encoded. pcmpistri and friends still assign a fixed role to ecx though.
With x64 came a change to 8 bit register operands, if a REX prefix is present it is now possible to use spl, bpl, sil and dil, previously unencodable, but at the cost of being able to address ah, ch, dh or bh. That's probably a symptom of moving away from special roles too, since previously it wouldn't have made much sense to be able to use bpl, but now that it's "more general purpose" it might have some uses (it's still often used as a base pointer though).
The general pattern is towards fewer restrictions/fixed roles. But much of the history of x86 is still visible today.
As a general comment, before you go much further, I recommend adopting a programming style, or you'll find it very hard to follow your own code. Below is a formatted example of your code, maybe not everything is correctly formatted but it gives you an idea. Once in the habit, it's easier than making higgledy-piggledy code. One of its main advantages, is with practice you can cast your eye down the code and follow it far quicker than if you have to read every line.
.386
.model flat, stdcall
option casemap :none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib
.data
ProgramText db "Hello World!", 0
BadText db "Error: Sum is incorrect value", 0
GoodText db "Excellent! Sum is 6", 0
Sum sdword 0
.code
start:
; eax
mov ecx, 6 ; set the counter to 6 ?
xor eax, eax ; set eax to 0
_label:
add eax, ecx ; add the numbers ?
dec ecx ; from 0 to 6 ?
jnz _label ; 21
mov edx, 7 ; 21
mul edx ; multiply by 7 147
push eax ; pushes eax into the stack
pop Sum ; pops eax and places it in Sum
cmp Sum, 147 ; compares Sum to 147
jz _good ; if they are equal, go to _good
_bad:
invoke StdOut, addr BadText
jmp _quit
_good:
invoke StdOut, addr GoodText
_quit:
invoke ExitProcess, 0
end start
I'll single out one line:
push eax ; pushes eax into the stack
Don't use comments to explain what an instruction does: use them to say what you are trying to acheive, or what the register represents, to give added value to the code.
Good luck to you: plenty of practice and midnight oil!

Usefulness of LOOPNE

I am unable to understand the usefulness of LOOPNE. Even if LOOPNE was not there and only LOOP was there, it would have done the same thing here. Please help me out.
MOV CX, 80
MOV AH,1
INT 21H
CMP AL, ' '
LOOPNE BACK
CMP is more or less a SUB instruction without changing the value, which means that it sets flags such as ZF (the zero flag).
LOOPNE has 2 conditions to loop: cx > 0 and ZF = 0
LOOP has 1 condition to loop: cx > 0
So, a normal LOOP would go through all characters, whereas LOOPNE will go through all characters, or until a space is encountered. Whichever comes first
LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and thus effectively equivalent to JNE/DEC CX/JNE.
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. It may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.

Pseudo Random Number Generator Project

I am required to design and build an 8 bit Pseudo Random Number Generator. I have looked at possible methods; using background noise, user input etc. I was wondering if anyone could give me some advice on where to start as this would be of great help to me.
random.org is perhaps the best place to start your investigation.
Below should get you started with the basics
howstuffworks.com
Construct your own random number generator
For a simple 8 bit PRNG you could ry something like a Linear Feedback Shift Register. This is very simple to implement in either software or hardware.
My plan is to use a temperature sensor. When the temps are being processed in the ADC, I am going to amplify the noise generated. This will then give me the random 8 bit number I require which will be used as the 'seed' for the PRNG in stdlib (C programming).
What do you's think?
I've found that the following works very well. This is implemented in MSP430 assembly, but would be easy enough to port to another processor. I've used this to generate 'white' noise for a synthesizer project, and there were no audible patterns in the output. Depending on what your requirements are, this might be sufficient. It uses two state variables, the previous output (8 bits), and a 16-bit state register. I found this online, http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=95614&highlight=radbrad, where it's listed in AVR assembly, and ported it to MSP.
Because it uses shifts and shifts the top bit out of one register into the bottom of another, it doesn't really lend itself to efficient implementation in C. Hence the assembly. I hope you find this as useful as I did.
mov.b &rand_out, r13
mov.b r13,r12
and.b #66, r13
jz ClearCarry
cmp.b #66, r13
xor.w #1, sr ; invert carry flag
jmp SkipClearCarry
ClearCarry:
clrc
SkipClearCarry:
rlc.w &rand_state
rlc.b r12
mov.b r12,&rand_out
ret

Why differ(!=,<>) is faster than equal(=,==)?

I've seen comments on SO saying "<> is faster than =" or "!= faster than ==" in an if() statement.
I'd like to know why is that so. Could you show an example in asm?
Thanks! :)
EDIT:
Source
Here is what he did.
function Check(var MemoryData:Array of byte;MemorySignature:Array of byte;Position:integer):boolean;
var i:byte;
begin
Result := True; //moved at top. Your function always returned 'True'. This is what you wanted?
for i := 0 to Length(MemorySignature) - 1 do //are you sure??? Perhaps you want High(MemorySignature) here...
begin
{!} if MemorySignature[i] <> $FF then //speedup - '<>' evaluates faster than '='
begin
Result:=memorydata[i + position] <> MemorySignature[i]; //speedup.
if not Result then
Break; //added this! - speedup. We already know the result. So, no need to scan till end.
end;
end;
end;
I'd claim that this is flat out wrong except perhaps in very special circumstances. Compilers can refactor one into the other effortlessly (by just switching the if and else cases).
It could have something to do with branch prediction on the CPU. Static branch prediction would predict that a branch simply wouldn't be taken and fetch the next instruction. However, hardly anybody uses that anymore. Other than that, I'd say it's bull because the comparisons should be identical.
I think there's some confusion in your previous question about what the algorithm was that you were trying to implement, and therefore in what the claimed "speedup" purports to do.
Here's some disassembly from Delphi 2007. optimization on. (Note, optimization off changed the code a little, but not in a relevant way.
Unit70.pas.31: for I := 0 to 100 do
004552B5 33C0 xor eax,eax
Unit70.pas.33: if i = j then
004552B7 3B02 cmp eax,[edx]
004552B9 7506 jnz $004552c1
Unit70.pas.34: k := k+1;
004552BB FF05D0DC4500 inc dword ptr [$0045dcd0]
Unit70.pas.35: if i <> j then
004552C1 3B02 cmp eax,[edx]
004552C3 7406 jz $004552cb
Unit70.pas.36: l := l + 1;
004552C5 FF05D4DC4500 inc dword ptr [$0045dcd4]
Unit70.pas.37: end;
004552CB 40 inc eax
Unit70.pas.31: for I := 0 to 100 do
004552CC 83F865 cmp eax,$65
004552CF 75E6 jnz $004552b7
Unit70.pas.38: end;
004552D1 C3 ret
As you can see, the only difference between the two cases is a jz vs. a jnz instruction. These WILL run at the same speed. what's likely to affect things much more is how often the branch is taken, and if the entire loop fits into cache.
For .Net languages
If you look at the IL from the string.op_Equality and string.op_Inequality methods, you will see that both internall call string.Equals.
But the op_Inequality inverts the result. This is two IL-statements more.
I would say they the performance is the same, with maybe a small (very small, very very small) better performance for the == statement. But I believe that the optimizer & JIT compiler will remove this.
Spontaneous though; most other things in your code will affect performance more than the choice between == and != (or = and <> depending on language).
When I ran a test in C# over 1000000 iterations of comparing strings (containing the alphabet, a-z, with the last two letters reversed in one of them), the difference was between 0 an 1 milliseconds.
It has been said before: write code for readability; change into more performant code when it has been established that it will make a difference.
Edit: repeated the same test with byte arrays; same thing; the performance difference is neglectible.
It could also be a result of misinterpretation of an experiment.
Most compilers/optimizers assume a branch is taken by default. If you invert the operator and the if-then-else order, and the branch that is now taken is the ELSE clause, that might cause an additional speed effect in highly calculating code (*)
(*) obviously you need to do a lot of operations for that. But it can matter for the tightest loops in e.g. codecs or image analysis/machine vision where you have 50MByte/s of data to trawl through.
.... and then I even only stoop to this level for the really heavily reusable code. For ordinary business code it is not worth it.
I'd claim this was flat out wrong full stop. The test for equality is always the same as the test for inequality. With string (or complex structure testing), you're always going to break at exactly the same point. Until that break point is reached, then the answer for equality is unknown.
I strongly doubt there is any speed difference. For integral types for example you are getting a CMP instruction and either JZ (Jump if zero) or JNZ (Jump if not zero), depending on whether you used = or ≠. There is no speed difference here and I'd expect that to hold true at higher levels too.
If you can provide a small example that clearly shows a difference, then I'm sure the Stack Overflow community could explain why. However, I think you might have difficulty constructing a clear example. I don't think there will be any performance difference noticeable at any reasonable scale.
Well it could be or it couldn't be, that is the question :-)
The thing is this is highly depending on the programming language you are using.
Since all your statements will eventually end up as instructions to the CPU, the one that uses the least amount of instruction to achieve the result will be the fastest.
For example if you say bits x is equal to bits y, you could use the instruction that does an XOR using both bits as an input, if the result is anything but 0 it is not the same. So how would you know that the result is anything but 0? By using the instruction that returns true if you say input a is bigger than 0.
So this is already 2 instructions you use to do it, but since most CPU's have an instruction that does compare in a single cycle it is a bad example.
The point I am making is still the same, you can't make this generally statements without providing the programming language and the CPU architecture.
This list (assuming it's on x86) of ASM instructions might help:
Jump if greater
Jump on equality
Comparison between two registers
(Disclaimer, I have nothing more than very basic experience with writing assembler so I could be off the mark)
However it obviously depends purely on what assembly instructions the Delphi compiler is producing. Without seeing that output then it's guesswork. I'm going to keep my Donald Knuth quote in as caring about this kind of thing for all but a niche set of applications (games, mobile devices, high performance server apps, safety critical software, missile launchers etc.) is the thing you worry about last in my view.
"We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil."
If you're writing one of those or similar then obviously you do care, but you didn't specify it.
Just guessing, but given you want to preserve the logic, you cannot just replace
if A = B then
with
if A <> B then
To conserve the logic, the original code must have been something like
if not (A = B) then
or
if A <> B then
else
and that may truely be a little bit slower than the test on inequality.

Resources