Why isn't everything SSE (or "SIMD") by default? [duplicate] - parallel-processing

This question already has answers here:
Is there any architecture that uses the same register space for scalar integer and floating point operations?
(4 answers)
Why floating point registers are different than general purpose ones
(1 answer)
Closed 1 year ago.
I don't know much about the internal working of the CPU, and my understanding of SSE is equally basic; it works in the form of additional long registers that pack some number of data types you want a perform a single operation on (in parallel) using a single instruction.
Great, but why isn't every register and every operation like that by default? If I want to add two integers, why would I need to place each in two separate registers and do the operation through multiple instructions when I could just do it through SSE? does it interfere with concurrency somehow? is it a hardware limitation?
Thanks! If there are somewhat easy to follow sources as well, I would gladly appreciate it

Related

How are Opcodes & Operands Defined in a CPU?

Extensive searching has sent me in a loop over the course of 3 days, so I'm depending on you guys to help me catch a break.
Why exactly does one 8-bit sequence of high's and low's perform this action, and 8-bit sequence performs that action.
My intuition tells me that the CPU's circuitry hard-wired one binary sequence to do one thing, and another to do another thing. That would mean different Processor's with potentially different chip circuitry wouldn't define one particular binary sequence as the same action as another?
Is this why we have assembly? I need someone to confirm and/or correct my hypothesis!
Opcodes are not always 8 bits but yes, it is hardcoded/wired in the logic to isolate the opcode and then send you down a course of action based on that. Think about how you would do it in an instruction set simulator, why would logic be any different? Logic is simpler than software languages, there is no magic there. ONE, ZERO, AND, OR, NOT thats as complicated as it gets.
Along the same lines if I was given an instruction set document and you were given an instruction set document and told to create a processor or write an instruction set simulator. Would we produce the exact same code? Even if the variable names were different? No. Ideally we would have programs that are functionally the same, they both parse the instruction and execute it. Logic is no different you give the spec to two engineers you might get two different processors that functionally are the same, one might perform better, etc. Look at the long running processor families, x86 in particular, they re-invent that every couple-three years being instruction set compatible for the legacy instructions while sometimes adding new instructions. Same for ARM and others.
And there are different instruction sets ARM is different from x86 is different from MIPS, the opcodes and/or bits you examine in the instruction vary, for none of these can you simply look at 8 bits, each you have some bits then if that is not enough to uniquely identify the instruction/operation then you need to examine some more bits, where those bits are what the rules are are very specific to each architecture. Otherwise what would be the point of having different names for them if they were the same.
And this information was out there you just didnt look in the right places, there are countless open online courses on the topic, books that google should hit some pages on, as well as open source processor cores you can look at and countless instruction set simulators with source code.

Machine Code: How many Read and/or Write cycles are involved in the Fetch and Execute cycles of the following instructions execution?

Okay, I'm going through past exam questions for a module, Computer Architecture, and I've come across the following question and I have no idea how to do it? If anyone can tell/show me how I would answer this or send me a link where I could learn how to answer this type of question that would be ideal. Thanks.
Q: How many Read and/or Write cycles are involved in the Fetch and Execute cycles of the
following instructions execution of the following :
a) LDA B $10EF corresponding machine code A6 10 EF, Extended addressing.
b) LDA B #$ 3B corresponding machine code C6 3B, Immediate addressing.
c) STA B $6020 corresponding machine code 57 60 20, Extended addressing.
Without the information regarding which CPU it is, all we can give is general advice on how to work it out.
Those opcodes look like rather simple ones from the early days of the PC industry but they don't match the more popular chips of that time frame.
The basic approach would be to look up the instructions in the CPU reference/guide and it would tell you what read and write cycles would occur for a given instruction/addressing-mode combination.
For example, immediate addressing is usually just the extraction of a value at or near the program counter (PC) so would involve a simple read.
Extended addressing depends, of course, on what they mean by extended. It may be a single de-reference which would involve reading a word at or near the PC followed by the use of that value to read another. Or it may be two levels of indirection. Or their definition of extended could be some bizarre combination of indexed, based and indirect addressing combined, which would result in even more cycles.
Without the chip specs, it's difficult to be certain.
My advice is to comb through the course material (if available) to try and discern what CPU is being used, then look that up with your favourite search engine. It doesn't appear to be any of the usual suspects like Mostek 6502 and derivatives, the Motorola 680x series, or the TI chips.
The other thing you could try is to post all of the questions (or a link to them) up here, the extra information may provide a clue to the architecture in use.

Can Registers inside a CPU do Arithmetics

I read in many detailed articles that Data from the Registers are used as Operands for the ALU to add two 32-bit integers, and this is only one small part of what the ALU can actually do.
However I also read the Register can even do arithmetic too? The difference between the two is quite blurred to me, what is the clear cut difference between a Register and the actual ALU component?
I know ALU doesn't store values, rather it receives it, and is instructed to simply do the Logic part, but the Register can both store and do general purpose stuff?
If the latter is true, then when does one use the ALU, and when does when use the General Purpose Registers?
Registers can't do arithmetic. A "register" is just a term for a place where you can stick a value. You can do arithmetic on the values stored in registers and have the results saved back into the register. This arithmetic would be done by the "ALU," which is the generic term for the portion of the processor that does number-crunching.
If you're still confused by something specific that you read, please quote it here or post a citation and someone can try to clarify. Note that "register" and "ALU" are very generic terms and are implemented and used differently in every architecture.
Although it is true that registers, properly speaking, don't do arithmetic, it IS possible to construct a circuit which both stores a number and, when a particular input line is set, increments that number. One could interpret this as a register which does a (very limited amount of) arithmetic. As it happens, there is generally no use for such a circuit in a general-purpose CPU (except maybe as the Program Counter), but such circuits could be useful in very simple digital controllers.
[note: historically, the first such circuits were made from vacuum tubes and used in Geiger Counters to count how many radioactive decay events occurred over short periods of time]
Registers don't do arithmetic. Modern cores have several “execution units” or “functional units” rather than an “ALU“. A reasonable rule of thumb is to discard any text (online or hardcopy) that speaks in terms of ALUs in the context of mainstream CPUs. “ALU” is still meaningful in the context of embedded systems, where a µC might actually have an ALU, but otherwise it's essentially an anachronism. If you see it, usually all it tells you is that the material is seriously out-of-date.

When to use emplace* and when to use push/insert [duplicate]

This question already has answers here:
push_back vs emplace_back
(7 answers)
Closed 9 years ago.
I know of general idea of emplace functions on containers("construct new element inplace").
My question is not what it does,
but more of like Effective C++11 one.
What are good rules for deciding when to use (for eg when it comes to std::vector)
emplace_back() and when to use push_back() and in general emplace* vs "old" insert functions?
emplace_back() only really makes sense when you have to construct the object from scratch just before you put it into the container. If you hand it a pre-constructed object it basically degrades into push_back(). You'll mostly see a difference if the object is expensive to copy or you have to create a lot of them in a tight loop.
I tend to replace the following code:
myvector.push_back(ContainedObject(hrmpf));
with
myvector.emplace_back(hrmpf);
if the former shows up on the profiler output. For new code, I'll probably use emplace_back if I can (we still mainly us VS2010 at work and its implementation of emplace_back() is a bit hobbled).

Building a custom machine code from the ground up

I have recently begun working with logic level design as an amateur hobbyist but have now found myself running up against software, where I am much less competent. I have completed designing a custom 4 bit CPU in Logisim loosely based on the paper "A Very Simple Microprocessor" by Etienne Sicard. Now that it does the very limited functions that I've built into it (addition, logical AND, OR, and XOR) without any more detectable bugs (crossing fingers) I am running into the problem of writing programs for it. Logisim has the functionality of importing a script of Hex numbers into a RAM or ROM module so I can write programs for it using my own microinstruction code, but where do I start? I'm quite literally at the most basic possible level of software design and don't really know where to go from here. Any good suggestions on resources for learning about this low level of programming or suggestions on what I should try from here? Thanks much in advance, I know this probably isn't the most directly applicable question ever asked on this forum.
I'm not aware of the paper you mention. But if you have designed your own custom CPU, then if you want to write software for it, you have two choices: a) write it in machine code, or b) write your own assembler.
Obviously I'd go with b. This will require that you shift gear a bit and do some high-level programming. What you are aiming to write is an assembler program that runs on a PC, and converts some simple assembly language into your custom machine code. The assembler itself will be a high-level program and as such, I would recommend writing it in a high-level programming language that is good at both string manipulation and binary manipulation. I would recommend Python.
You basically want your assembler to be able to read in a text file like this:
mov a, 7
foo:
mov b, 20
add a, b
cmp a, b
jg foo
(I just made this program up; it's nonsense.)
And convert each line of code into the binary pattern for that instruction, outputting a binary file (or perhaps a hex file, since you said your microcontroller can read in hex values). From there, you will be able to load the program up onto the CPU.
So, I suggest you:
Come up with (on paper) an assembly language that is a simple written representation for each of the opcodes your machine supports (you may have already done this),
Learn simple Python,
Write a Python script that reads one line at a time (sys.stdin.readline()), figures out which opcode it is and what values it takes, and outputs the corresponding machine code to stdout.
Write some assembly code in your assembly language that will run on your CPU.
Sounds like a fun project.
I have done something similar that you might find interesting. I also have created from scratch my own CPU design. It is an 8-bit multi-cycle RISC CPU based on Harvard architecture with variable length instructions.
I started in Logisim, then coded everything in Verilog, and I have synthesized it in an FPGA.
To answer your question, I have done a simple and rudimentary assembler that translates a program (instructions, ie. mnemonics + data) to the corresponding machine language that can then be loaded into the PROG memory. I've written it in shell script and I use awk, which is what I was confortable with.
I basically do two passes: first translate mnemonics to their corresponding opcode and translate data (operands) into hex, here I keep track of all the labels addresses. second pass will replace all labels with their corresponding address.
(labels and addresses are for jumps)
You can see all the project, including the assembler, documented here: https://github.com/adumont/hrm-cpu
Because your instruction set is so small, and based on the thread from the mguica answer, I would say the next step is to continue and/or fully test your instruction set. do you have flags? do you have branch instructions. For now just hand generate the machine code. Flags are tricky, in particular the overflow (V) bit. You have to examine carry in and carry out on the msbit adder to get it right. Because the instruction set is small enough you can try the various combinations of back to back instructions and followed by or and followed by xor and followed by add, or followed by and or followed by xor, etc. And mix in the branches. back to flags, if the xor and or for example do not touch carry and overflow then make sure you see carry and overflow being a zero and not touched by logical instructions and carry and overflow being one and not touched, and also independently show carry and overflow are separate, one on one off, not touched by logical, etc. make sure all the conditional branches only operate on that one condition, lead into the various conditional branches with flag bits that are ignored in both states insuring that the conditional branch ignores them. Also verify that if the conditional branch is not supposed to modify them that it doesnt. likewise if the condition doesnt cause a branch that the conditional flags are not touched...
I like to use randomization but it may be more work than you are after. I like to independently develop a software simulator of the instruction set, which I find easier to use that the logic also sometimes easier to use in batch testing. you can then randomize some short list of instructions, varying the instruction and the registers, naturally test the tester by hand computing some of the results, both state of registers after test complete and state of flag bits. Then make that randomized list longer, at some point you can take a long instruction list and run it on the logic simulator and see if the logic comes up with the same register results and flag bits as the instruction set simulator, if they vary figure out why. If the do not try another random sequence, and another. Filling registers with prime numbers before starting the test is a very good idea.
back to individual instruction testing and flags go through all the corner cases 0xFFFF + 0x0000 0xFFFF+1, things like that places just to the either side of and right on operands and results that are one count away from where a flag changes at the point where the flag changes and just the other side of that. for the logicals for example if they use the zero flag, then have various data patterns that test results that are on either side of and at zero 0x0000, 0xFFFF 0xFFFE 0x0001 0x0002, etc. Probably a walking ones result as well 0x0001 result 0x0002, ox0004, etc.
hopefully I understood your question and have not pointed out the obvious or what you have already done thus far.

Resources