I'm currently working on a project that requires me to write a bubble sort algorithm in Harvard Machine 16 Bit Assembly Code. I tried searching for it online; however, most assembly code snippets use the CMP and MOV instructions.
I have the following instructions available:
ADD, SUB, AND, Copy, ADDI, SUBI, ANDI, LOADI, BZ, BEQ, BRA, SW, LW.
Could anyone please give me a nudge in the proper direction?
Thanks in advance,
You can always implement an equivalent of CMP using SUB (or even ADD if SUB isn't available).
MOV can always be constructed out of a load and a store. You could also simulate it using a load and ADD to a zero-initialized register or memory location.
Don't search. Write the algorithm in pseudo-code and see how you can construct each step with the instructions you've got.
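As a rough illustration of that advice, here is a Python sketch of bubble sort in which the comparison is built only from a 16-bit subtraction and a sign-bit test (roughly SUB, then ANDI with 0x8000, then BZ). The helper names `sub16`, `is_less` and `bubble_sort` are made up for this example, and it assumes the values are small enough (0 to 0x7FFF) that the difference never overflows 16 signed bits.

```python
MASK = 0xFFFF   # registers are assumed to be 16 bits wide
SIGN = 0x8000   # sign bit of a 16-bit word


def sub16(a, b):
    """Model of SUB: 16-bit wrap-around subtraction."""
    return (a - b) & MASK


def is_less(a, b):
    """CMP replacement: a < b if the sign bit of a - b is set.

    Only valid while the true difference fits in 16 signed bits
    (an assumption of this sketch).
    """
    return (sub16(a, b) & SIGN) != 0   # ANDI with 0x8000, then BZ


def bubble_sort(mem):
    """Sort a list of 16-bit words in place, smallest first."""
    n = len(mem)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            x, y = mem[i], mem[i + 1]          # two LW
            if is_less(y, x):                  # SUB + ANDI + BZ
                mem[i], mem[i + 1] = y, x      # two SW (swap)
    return mem
```

Each Python statement in the inner loop maps onto one or two of the available instructions, which is the point of writing the pseudo-code first.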
I've been trying to understand the purpose of the 0x40 REX opcode for ASM x64 instructions. Like for instance, in this function prologue from Kernel32.dll:
As you see they use push rbx as:
40 53 push rbx
But using just the 53h opcode (without the prefix) also produces the same result:
According to this site, the layout for the REX prefix is as follows:
So 40h opcode seems to be not doing anything. Can someone explain its purpose?
The 04xh bytes (i.e. 040h, 041h... 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. That means that adding this byte doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.
Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
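To make the bit layout concrete, here is a small Python sketch that splits a REX byte into its four flag bits. The function name `decode_rex` is made up for this example, and it assumes the byte is being decoded in 64-bit mode (elsewhere 40h to 4Fh are not REX prefixes).

```python
def decode_rex(byte):
    """Split a REX prefix (0100WRXB) into its W, R, X and B bits.

    Returns None if the high nibble is not 0100, i.e. the byte is
    not a REX prefix.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends ModRM.reg
        "X": (byte >> 1) & 1,  # extends SIB.index
        "B": byte & 1,         # extends ModRM.rm, SIB.base or opcode reg
    }
```

For 40h every field comes back as 0, which is why the prefix leaves push rbx unchanged, while 41h sets only B (turning 41 53 into push r11).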
I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed, and your experience shows that it is.
It is there because the people at Microsoft apparently generated the above code. I'd speculate that it is needed for the extra registers, so they always generate it and didn't bother to remove it when it is not needed. Another possibility is that lengthening the instruction has a subtle effect on scheduling and/or alignment and can make the code faster. This of course requires detailed knowledge of the particular processor.
I'm working on an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are fewer cases to consider. Then, as a last step, superfluous prefixes can be removed, among other things.
Are there any good tutorials that explain the first and second passes of an assembler, along with their algorithms? I searched a lot but haven't found satisfying results.
Please link the tutorials if any.
I don't know of any tutorials; there really isn't much to it.
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
The assembler/software, like a human, is going to read the source file from top to bottom, from byte 0 in the file to the end. There are no hard and fast rules as to what you complete in each pass, and it is not necessarily a pass "on the file" but a pass "on the data".
First pass:
As you read each line, you parse it. You are building some sort of data structure that holds the instructions in file order. When you come across a label like one:, you keep track of which instruction it sits in front of, or perhaps you put a marker between instructions, however you choose to implement it. When you come across an instruction that uses a label, you have two choices: go look for that label right now, or leave it for later. If it is a backward reference, like the jnz one instruction, you should have seen the label already. If you have been keeping track of the number of instructions so far (and their sizes, if the word length is variable), you can choose to encode this instruction now if it is a relative branch; if the instruction set uses absolute addresses, you might have to leave a placeholder anyway.
Now the call fun and jmp more_fun instructions pose a problem. When you get to these instructions you cannot resolve them yet: you don't know whether these labels are local to this file or are in another file. So you cannot encode these instructions on the first pass; you have to save them for later, and this is the reason for the second pass.
The second pass is likely to be a pass across your data structures and not actually over the file, and this is heavily implementation specific. For example, you might have a one-dimensional array of structures with everything in it. You may choose to make many passes over that data. For example, run one index through the array looking for unresolved labels. When you find an unresolved label, send a second index through the array looking for its definition. If you don't find it, what happens next is application specific: does your assembler create objects to be linked later, or does it create a binary and need everything resolved in this one assembly-to-binary step? If it creates objects, you assume the label is external, unless your assembler requires external labels to be declared as external. So whether or not the missing label is an error is application specific. If it is not an error, you should encode the longest/farthest type of branch, leaving the address or distance details for the linker to fill in.
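The two passes described so far can be sketched in a few lines of Python. This is only an illustration, not how any particular assembler does it: it assumes every instruction occupies exactly one address unit, uses the example source from above, and the names `first_pass`, `second_pass` and `BRANCHES` are invented for the sketch.

```python
SOURCE = """
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
"""

BRANCHES = {"jmp", "jnz", "call"}  # mnemonics whose operand is a label


def first_pass(lines):
    """Read the source top to bottom, recording instructions and labels."""
    labels, instrs = {}, []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):
            # A label names the address of the next instruction.
            labels[text[:-1]] = len(instrs)
        else:
            instrs.append(text)
    return labels, instrs


def second_pass(labels, instrs):
    """Walk the data (not the file) and resolve label operands."""
    resolved, external = [], []
    for addr, text in enumerate(instrs):
        op, _, arg = text.partition(" ")
        arg = arg.strip()
        if op in BRANCHES:
            if arg in labels:
                resolved.append((addr, op, labels[arg]))
            else:
                # Unknown label: treated as external here; a real tool
                # might instead report an error.
                external.append((addr, arg))
                resolved.append((addr, op, None))
        else:
            resolved.append((addr, op, arg))
    return resolved, external
```

Running `first_pass(SOURCE.splitlines())` gives the label table, and `second_pass` then fills in the forward references (call fun, jmp more_fun) that could not be resolved while reading the file.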
For the labels you have found, you now have a rough idea of how far away they are. Now, depending on the instruction set and/or the features of your assembler, you need to make several more passes over the data. You need to start encoding the instructions. Assuming you have at least one flavor of relative-distance call or branch instruction, you have to decide on the first encoding pass whether to hope for the relative branch (which I assume is the shorter/smaller instruction) or to assume the larger one. You can't really determine whether the smaller one will reach until you have made one or a few encoding passes across the instructions.
top:
...
jmp down
...
jnz top
...
down:
As you encode the jmp down, you might optimistically choose to encode it as a smaller (in number of bytes/words, if variable word length) relative branch, leaving the distance to be determined. When you get to the jnz top, let's say it is, to the byte, just close enough to top to encode as a relative branch. On the second pass, though, when you go back to finish the jmp down, you find that it won't reach; you need more bytes/words to encode it as a long branch. Now the jnz top has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, computing each distance as far/short, until you make a pass with no changes. Be careful not to get caught in an infinite loop, where on one pass you get to shorten an instruction, which causes another to lengthen, and on the next pass the lengthened one causes the first to lengthen again while the second shortens, and this repeats forever.
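Here is a Python sketch of that "keep passing until nothing changes" loop. The sizes and reach are assumptions chosen for illustration (a 2-byte short branch reaching -128..127 from the end of the instruction, a 4-byte long branch), and `relax` is a made-up name. One way to sidestep the infinite loop is to only ever grow a branch and never shrink it again, which is what this sketch does.

```python
SHORT, LONG = 2, 4                 # assumed branch sizes in bytes
SHORT_MIN, SHORT_MAX = -128, 127   # assumed reach of the short form


def relax(program):
    """Pick a short or long encoding for every branch.

    program is a list of ("label", name), ("op", size_in_bytes) or
    ("branch", target_label) tuples. Returns the chosen size of each
    branch, in program order.
    """
    is_long = [False] * len(program)   # optimistic start: all short
    while True:
        # Lay out addresses using the current size guesses.
        addr, labels, starts = 0, {}, []
        for i, (kind, arg) in enumerate(program):
            starts.append(addr)
            if kind == "label":
                labels[arg] = addr
            elif kind == "op":
                addr += arg
            else:
                addr += LONG if is_long[i] else SHORT
        # Grow every short branch that cannot reach its target.
        changed = False
        for i, (kind, arg) in enumerate(program):
            if kind == "branch" and not is_long[i]:
                offset = labels[arg] - (starts[i] + SHORT)
                if not SHORT_MIN <= offset <= SHORT_MAX:
                    is_long[i] = True
                    changed = True
        if not changed:   # a full pass with no changes: done
            return [LONG if is_long[i] else SHORT
                    for i, (kind, _) in enumerate(program)
                    if kind == "branch"]
```

With filler sizes of 100, 24 and 120 bytes around the two branches, the jnz top reaches exactly (offset -128) on the first layout, the jmp down does not, and growing the jmp then pushes the jnz out of reach as well, so both end up long. Because a branch never shrinks, the loop terminates, at the cost of sometimes leaving a branch longer than strictly necessary.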
We could go back to the top of this: in your first pass you might build more than one data structure. Maybe as you go you build a list of found labels and a list of missing labels, and on the second pass you look through the missing list, see whether each label is in the found list, and resolve them that way. Or maybe on the first pass (and some might argue this is a single-pass assembler), when you find a label, before continuing through the file you look back to see whether anyone was looking for that label (or whether that label had already been defined, to declare an error). I would call this a multi-pass assembler, because it still passes through the data many times.
And now let's make it much worse. Look at the ARM instruction set as an example, or any other fixed-length instruction set. Your relative branches are usually encoded in one instruction, hence fixed-length instruction set. A far branch normally involves loading the pc from data found at some address, meaning you really need two items: the instruction, and then somewhere within the relative reach of that instruction a data word containing the absolute address to branch to. You can choose to force the user to create these, but the ARM assemblers, for example, can and will do this for you. The simplest example is:
ldr r0,=0x12345678
...
b somewhere
That syntax means load r0 with the value 0x12345678, which does not fit in an ARM instruction. What the assembler does with that syntax is try to find a dead spot in the code, within reach of that instruction, where it can place the data value; then it encodes the instruction as a load from a pc-relative address. For example, right after an unconditional branch is a good place to hide data. Sometimes you have to use directives like .pool to tell the assembler about good places to stick this data. r0 is not the program counter, r15 is, and you could use r15 there to connect this to the branching discussion above.
Take a look at the assembler I created for this project, http://github.com/dwelch67/lsasim. It has a fixed-length instruction set, but I force the user to allocate the word and load from it; I don't allow the shortcut that the ARM assemblers tend to allow.
I hope this helps explain things. The bottom line is that you cannot resolve labels in one linear pass through the data; you have to go back and connect the dots to the forward-referenced labels. And I argue you have to do many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs. relative branch, and some do: rjmp vs. jmp or rjmp vs. ljmp, rcall vs. call, etc.). Making one pass over the "file"? Sure, not a problem. If you allow include-type directives, some tools will create a temporary file that pulls all the includes in, producing a single file with no includes in it, and then the tool makes one pass over that. This is how gcc manages includes, for example; save the intermediate files sometime and see what files are produced. If you report line numbers with warnings/errors, you then have to map the temp file lines back to the original file name and line.
A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.
You can download a PDF of the book.
Is it possible to write a sequence of instructions that will place a 1 in the least significant bit of the memory cell at address B3 without disturbing the other bits in the memory cell?
The machine instructions I am referring to are STOP, ADD, SWITCH, LOAD, ROTATE, etc.
Clarification: this question was originally tagged C#; since it wasn't the OP that re-tagged it, I'll leave this here until the OP's intentions are clearer.
C# is a high-level programming language, which compiles down to IL, not machine code. As such: no, there is absolutely no supported mechanism for performing specific machine code operations (and even if there were, it couldn't possibly port between languages).
You can do high level bit operations, using the operators on the integer-based types; and if you really want you can write IL, either building it manually (ilasm), or at runtime via DynamicMethod / ILGenerator - but these still only deal with CIL opcodes, not machine codes.
I think ORing it with 1 will do the job:
algo:
byte = [data at 0xB3]
byte = byte | 0x01
[data at 0xB3] = byte
This works fine for me when developing for 8051 MCUs.
I am required to design and build an 8 bit Pseudo Random Number Generator. I have looked at possible methods; using background noise, user input etc. I was wondering if anyone could give me some advice on where to start as this would be of great help to me.
random.org is perhaps the best place to start your investigation.
Below should get you started with the basics
howstuffworks.com
Construct your own random number generator
For a simple 8 bit PRNG you could try something like a Linear Feedback Shift Register. This is very simple to implement in either software or hardware.
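A minimal Python sketch of such a generator, assuming a Galois-style LFSR with the feedback mask 0xB8 (taps 8, 6, 5, 4, a commonly tabulated maximal-length choice for 8 bits). The function names are made up for this example.

```python
def lfsr8_step(state):
    """Advance an 8-bit Galois LFSR by one step (feedback mask 0xB8)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB8   # apply the taps when a 1 is shifted out
    return state


def lfsr8_period(seed=1):
    """Count the steps until the sequence returns to the seed."""
    state, count = lfsr8_step(seed), 1
    while state != seed:
        state = lfsr8_step(state)
        count += 1
    return count
```

Any nonzero seed cycles through all 255 nonzero values before repeating; a seed of 0 stays stuck at 0, so it has to be avoided. The sequence is fully deterministic, so it is only as unpredictable as its seed.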
My plan is to use a temperature sensor. When the temps are being processed in the ADC, I am going to amplify the noise generated. This will then give me the random 8 bit number I require which will be used as the 'seed' for the PRNG in stdlib (C programming).
What do you think?
I've found that the following works very well. This is implemented in MSP430 assembly, but would be easy enough to port to another processor. I've used this to generate 'white' noise for a synthesizer project, and there were no audible patterns in the output. Depending on what your requirements are, this might be sufficient. It uses two state variables, the previous output (8 bits), and a 16-bit state register. I found this online, http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=95614&highlight=radbrad, where it's listed in AVR assembly, and ported it to MSP.
Because it uses shifts and shifts the top bit out of one register into the bottom of another, it doesn't really lend itself to efficient implementation in C. Hence the assembly. I hope you find this as useful as I did.
mov.b &rand_out, r13
mov.b r13,r12
and.b #66, r13
jz ClearCarry
cmp.b #66, r13
xor.w #1, sr ; invert carry flag
jmp SkipClearCarry
ClearCarry:
clrc
SkipClearCarry:
rlc.w &rand_state
rlc.b r12
mov.b r12,&rand_out
ret
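For readers who don't know MSP430 assembly, here is a Python model of what the routine appears to do, as I read it. It assumes the cmp/xor pair leaves the carry equal to the XOR of bits 6 and 1 of the previous output (the comment in the listing says the xor is there to invert the carry), and the name `rand_step` is made up for this sketch.

```python
def rand_step(rand_out, rand_state):
    """One step of the generator: returns (new rand_out, new rand_state).

    rand_out is the previous 8-bit output, rand_state the 16-bit state.
    """
    taps = rand_out & 0x42                    # and.b #66: bits 6 and 1
    # Carry is set when exactly one of the two tap bits is set.
    carry = 1 if taps in (0x02, 0x40) else 0
    carry_out = (rand_state >> 15) & 1        # bit rotated out by rlc.w
    rand_state = ((rand_state << 1) | carry) & 0xFFFF   # rlc.w &rand_state
    rand_out = ((rand_out << 1) | carry_out) & 0xFF     # rlc.b r12
    return rand_out, rand_state
```

Seen this way, the output byte and the state word form one 24-bit shift register, with feedback taken from two bits of the output byte.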
I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programming, but I'm not an optimization expert, so I need your help with something (it might be a very stupid question, but I'll ask anyway):
If I need to set a register value to 1 or (-1), is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. (I need to optimize both time and code size.)
A quick google for 8086 instructions timings size turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through Pentium.
Note, though, that this probably doesn't include code-fetch memory bottlenecks, which can be very significant, especially on an 8088. This usually makes optimizing for code size a better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison that indicates mov ax, 1 is better (fewer cycles, and the same space):
Instructions          Clock cycles   Bytes
xor ax, ax; inc ax    3 + 3 = 6      2 + 1 = 3
mov ax, 1             4              3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1) No, don't really do that, but it's fun to vent occasionally :-)
You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.
Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
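The arithmetic behind this is that sbb computes destination minus source minus carry, so with both operands equal the result is just minus the carry. A tiny Python model of the 16-bit case (the name `sbb16` is made up for this sketch, and it models only the result, not the flags):

```python
def sbb16(dst, src, carry):
    """Result of a 16-bit SBB: dst - src - carry, wrapped to 16 bits."""
    return (dst - src - carry) & 0xFFFF
```

So `sbb16(ax, ax, 0)` is 0 and `sbb16(ax, ax, 1)` is 0xFFFF (that is, -1) whatever ax held before.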
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipelining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions to have another instruction between them (one that does not use ax).
Hope this helps.
I would use mov [e]ax, 1 under any circumstances. In 16-bit code, its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. The 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make the most difference there. But anywhere else, executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction fetch bandwidth (moot for 16-bit code, both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); as others have said, sbb ax,ax could work too in some circumstances, and it's also shorter at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?