I'm doing a theoretical assignment where I design my own ISA. It's a memory-memory design, where the ALU receives its inputs from memory and writes its output back to memory without using any registers. This is an outdated approach and register-based designs are more effective now, but that doesn't matter for my assignment.
My question:
If the encoding of one of my instructions looks like this
opcode|destination|value1|value2|function
00 0001 0011 1100 00
the function "00" stands for addition and the opcode 00 stands for an ALU operation.
My RTN looks like this for that function:
Mem[0001] <--- Mem[0011] + Mem[1100]
0001, 0011, and 1100 are memory addresses. What I'm trying to accomplish is to sum the values INSIDE those memory addresses and then store the result at address 0001 (overwriting it).
So if the value in memory address 0011 was '2' and the value in memory address 1100 was '3', my instruction would store '5' in memory address 0001.
Also, let's say I want to overwrite the value '3' that's at address 1100 with '4'. Can I just do Mem[1100] <--- 0100 (binary for 4)?
Is what I'm implementing correct? Or am I approaching memory addressing completely wrong?
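To make my intent concrete, here's a minimal C sketch of how I picture the execute step (the 16-entry memory and the function name are just my illustration, not part of any real design):

    #include <stdint.h>
    #include <stdio.h>

    /* 16 memory locations, matching my 4-bit address fields. */
    static uint8_t mem[16];

    /* ALU add: Mem[dest] <- Mem[src1] + Mem[src2]. */
    static void alu_add(uint8_t dest, uint8_t src1, uint8_t src2) {
        mem[dest] = mem[src1] + mem[src2];
    }

    int main(void) {
        mem[0x3] = 2;             /* value at address 0011 */
        mem[0xC] = 3;             /* value at address 1100 */
        alu_add(0x1, 0x3, 0xC);   /* Mem[0001] <- Mem[0011] + Mem[1100] */
        printf("%d\n", mem[0x1]); /* prints 5 */
        mem[0xC] = 4;             /* Mem[1100] <- 0100 */
        return 0;
    }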
These architectures usually have one accumulator; otherwise you'd need dual-port RAM to access two operands at the same time.
You could latch one memory value, but that's just a less versatile accumulator.
Memory writes are done on a different clock cycle or clock edge than reads.
Memory-const operations use a different opcode than memory-memory operations of the same type.
Finally, if your constant is too big to fit in the instruction, you first need to copy the constant to a memory address, then use it in a memory-memory operation.
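A sketch of that opcode split in C, with all names hypothetical (a memory-memory add fetches both operands from memory; a memory-const add takes the second operand straight from the instruction word):

    #include <stdint.h>

    static uint8_t mem[16];

    enum opcode { OP_ADD_MEM_MEM, OP_ADD_MEM_CONST }; /* hypothetical */

    /* Same ALU function, two opcodes: one fetches both operands from
       memory, the other uses field b as an immediate constant. */
    static void execute(enum opcode op, uint8_t dest, uint8_t a, uint8_t b) {
        switch (op) {
        case OP_ADD_MEM_MEM:   mem[dest] = mem[a] + mem[b]; break;
        case OP_ADD_MEM_CONST: mem[dest] = mem[a] + b;      break;
        }
    }

    int main(void) {
        mem[0x3] = 2;
        execute(OP_ADD_MEM_CONST, 0xC, 0x3, 4); /* Mem[1100] <- Mem[0011] + 4 */
        return 0;
    }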
My teacher has asked me to differentiate between a microprocessor with a maximum memory space of 1 MB and one with 4 GB. Does anyone know how to answer this question beyond the obvious difference in size?
https://i.stack.imgur.com/Q4Ih7.png
A 32-bit microprocessor can address up to 4 GB of memory, because its registers can contain an address that is 32 bits in size. (A 32-bit number ranges from 0 to 4,294,967,295). Each of those values can represent a unique memory location.
The 16-bit 8086, on the other hand, has 16-bit registers, which only range from 0 to 65,535. However, the 8086 has a trick up its sleeve: it can use memory segments to increase this range up to one megabyte (20 bits). There are segment registers whose values are automatically shifted left by 4 bits and then added to the regular registers to form the final address.
For example, let's look at video mode 13h on the 8086. This is the 256-color VGA standard with a resolution of 320x200 pixels. Each pixel is represented by a single byte, and the desired color is stored in that byte. The video memory is located at address 0xA0000, but since this value is wider than 16 bits, the programmer will typically load 0xA000 into a segment register like ds or es, then load 0000 into si or di. Once that is done, the program can read from [ds:si] and write to [es:di] to access the video memory. It's important to keep in mind that with this memory addressing scheme, not all combinations of segment and offset represent unique memory locations: es = A100, di = 0000 is the same location as es = A000, di = 1000.
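A quick C sketch of the real-mode address calculation, showing the aliasing from the last sentence:

    #include <stdint.h>
    #include <stdio.h>

    /* 8086 real mode: segment shifted left by 4 bits, plus offset. */
    static uint32_t physical(uint16_t seg, uint16_t off) {
        return ((uint32_t)seg << 4) + off;
    }

    int main(void) {
        printf("%05X\n", (unsigned)physical(0xA000, 0x1000)); /* A1000 */
        printf("%05X\n", (unsigned)physical(0xA100, 0x0000)); /* A1000 -- same byte */
        return 0;
    }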
In the 8051's register-bank area (00h to 1Fh), the 8051 provides 32 registers as 8 registers (R0 to R7) in each of 4 banks.
Why are these registers not named R0 to R31?
Thanks in advance.
Many instruction opcodes are only 8 bits long; if all 32 registers were accessible in one of these instructions then there would be only 3 bits left to encode the instruction length and the operation. Similarly, two byte instructions often use the second byte to encode a full 8-bit operand (eg., an address), and have effectively the same constraint.
In many instances it is possible to refer to the register you need by its absolute address, using a longer instruction, but if you will access it frequently then it may be better to change the active bank so that you can use more short opcodes.
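To make the absolute-address point concrete, here is the mapping sketched in C (on a real 8051 you would encode the address into the instruction, not compute it at run time):

    #include <stdint.h>
    #include <stdio.h>

    /* Internal RAM address of Rn in bank b: banks 0-3 occupy 00h-1Fh,
       eight registers per bank, so Rn of bank b sits at b*8 + n. */
    static uint8_t reg_addr(uint8_t bank, uint8_t n) {
        return (uint8_t)(bank * 8 + n);
    }

    int main(void) {
        printf("%02Xh\n", (unsigned)reg_addr(2, 5)); /* R5 of bank 2 is at 15h */
        return 0;
    }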
As far as I remember, you can only access 8 of those registers at a time. To access one of the other groups you need to switch the bank. I guess it has something to do with the register operand of an instruction being only 3 bits long (and not 5 bits).
My apologies if this is the wrong Stack Exchange for this; it just seemed like the closest one for computer architecture questions. For a homework problem in computer systems I was asked:
Consider three direct mapped caches C1, C2, and C3, each interpreting an 8-bit address slightly differently according to the {tag:setIdx:byteOffset} format specified. For each address in the reference stream, indicate whether the access will hit (H) or miss (M) in each cache.
    Cache:            C1       C2       C3
    Address format:  {2:2:4}  {2:3:3}  {2:4:2}
Address References in Binary: 00000010, 00000100...
I'm supposed to say whether each of the address references will result in a hit or miss but I don't know where to begin.
For the formats, I thought that tag meant the tag of the data in a block of the cache, setIdx meant the number of bits used to select among the different blocks in the cache, and byteOffset meant the particular byte within a block that you can choose from.
I feel like I don't understand what a hit or miss is. I thought there were 3 types of misses: compulsory, capacity, and conflict. How would I know which miss is a compulsory miss if I don't know what is already in the cache? How can I tell the capacity of the cache given the address formats?
Thanks for any hints or tips.
Take C1 for example: it has 2 bits for setIdx and 4 bits for byteOffset.
So this cache will have 2^2 = 4 blocks (00, 01, 10, and 11), and each block will have 2^4 = 16 bytes.
The address reference can now be split into C1 format: {00 00 0010}
Assuming the cache starts out empty, the first lookup will result in a miss. However, the cache will now have the block at set "00" loaded, with tag "00".
The next reference, {00 00 0100}, looks up set "00", sees that the tag is also "00", and so we have a hit.
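Here is that lookup for C1 as a minimal C sketch (the variable names are mine; -1 marks an empty block):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* C1: 8-bit address split as {tag:2, setIdx:2, byteOffset:4}. */
    #define NUM_SETS 4
    static int tags[NUM_SETS] = { -1, -1, -1, -1 }; /* -1 = invalid */

    static bool lookup(uint8_t addr) {
        uint8_t set = (addr >> 4) & 0x3;   /* bits 5..4 */
        uint8_t tag = (addr >> 6) & 0x3;   /* bits 7..6 */
        if (tags[set] == tag) return true; /* hit */
        tags[set] = tag;                   /* miss: fill the block */
        return false;
    }

    int main(void) {
        printf("%c\n", lookup(0x02) ? 'H' : 'M'); /* 00000010 -> M */
        printf("%c\n", lookup(0x04) ? 'H' : 'M'); /* 00000100 -> H */
        return 0;
    }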
I don't think that you would have a hit. Even though the address 00 00 0100 would look up the same block, it would be looking up a different address in memory. In a direct mapped cache, hits are only caused by trying to access the same address in memory, in the same block, in consecutive instructions. The address in memory is given by the byte address and not the block number. A direct mapped cache will replace the contents of a block so long as it isn't trying to access the same address in the block. If 00 00 0100 were to precede another 00 00 0100, then there would be a hit in a direct mapped cache.
In an associative cache, the memory address is given by the block number and not the byte address, so a hit would be generated here.
The block number is given by floor(byte address / bytes per block) mod (number of blocks).
I know that some processors fault on misaligned data, and others, like the oh-so-common x86, are just slower with it.
My question is why? Why is it harder for an x86 processor to get the data from the pointer 0x12345679 than it is from the pointer 0x12345678? Just to be clear, I'm aware that page faults may happen if the data is in multiple pages, and I understand that more data may need to be fetched from memory (one part for the start of the value and one for the end), but that isn't always true and this isn't what my question is about. I'm asking, why is it always slower?
Suppose the memory starts at 0x10000000. Why is it harder for the processor to get a 2-byte short from 0x10000001 than it is from 0x10000002? Why is it harder to get a 4-byte int from 0x10000001 than it is from 0x10000000? And so forth.
Because the data bus is wider than eight bits.
Let's assume that the data bus is 32 bits wide. To get 16 bits from address 0x10000001, the processor has to fetch the four bytes that start at 0x10000000 and shift the value to get the two bytes in the middle.
To get 16 bits from address 0x10000003, it has to fetch the words that start at 0x10000000 and 0x10000004 and use one byte from each value.
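Sketched in C, with load32 standing in for one aligned 32-bit bus transaction (a made-up helper; little-endian assumed):

    #include <stdint.h>

    /* Stand-in for one aligned 32-bit bus read (little-endian). */
    static uint32_t load32(const uint8_t *mem, uint32_t aligned) {
        return (uint32_t)mem[aligned]
             | (uint32_t)mem[aligned + 1] << 8
             | (uint32_t)mem[aligned + 2] << 16
             | (uint32_t)mem[aligned + 3] << 24;
    }

    /* 16-bit load at offset 1: one bus read, shift out the middle. */
    static uint16_t load16_at_1(const uint8_t *mem) {
        return (uint16_t)(load32(mem, 0) >> 8);
    }

    /* 16-bit load at offset 3: two bus reads, one byte from each. */
    static uint16_t load16_at_3(const uint8_t *mem) {
        uint32_t lo = load32(mem, 0);
        uint32_t hi = load32(mem, 4);
        return (uint16_t)((lo >> 24) | ((hi & 0xFF) << 8));
    }

    int main(void) {
        uint8_t mem[8] = {0x10, 0x32, 0x54, 0x76, 0x98, 0xBA, 0xDC, 0xFE};
        /* load16_at_1 -> 0x5432, load16_at_3 -> 0x9876 */
        return (load16_at_1(mem) == 0x5432 && load16_at_3(mem) == 0x9876) ? 0 : 1;
    }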
The processor can only access memory in an aligned fashion. This is a consequence of how the interconnect between the processor and memory functions.
When a processor supports unaligned reads, what's really happening is that the processor issues two separate reads (or one read of a larger size) and stitches the parts together, which is why unaligned reads are slower than aligned ones.
One example: if the data bus is 32 bits wide and a 32-bit value is not on a 32-bit boundary, the bytes have to be fetched in more than one operation and moved around to load the value properly into a processor register.
The usual answers to why data is aligned are that it can be accessed more efficiently and that it simplifies the design of the CPU.
A relevant question and its answers are here, and another source is here, but neither resolves my question.
Suppose a CPU has an access granularity of 4 bytes. That means the CPU reads 4 bytes at a time. Both materials I listed above say that if I access misaligned data, say at address 0x1, then the CPU has to do 2 accesses (one of addresses 0x0, 0x1, 0x2 and 0x3, one of addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read data from 0x1, 0x2, 0x3 and 0x4 when I ask it to access address 0x1? It will not degrade the performance and incur much complexity in circuitry.
Thank you in advance!
"It will not degrade the performance and incur much complexity in circuitry."
It's the false assumptions we take as fact that really throw off further understanding.
Your comment in the other question used much more appropriate wording ("I don't think it would degrade"...)
Did you consider that the memory architecture uses many memory chips in parallel in order to maximize bandwidth? And that a particular data item lives in only one chip? You can't just read whatever chip happens to be most convenient and expect it to have the data you want.
Right now, the CPU and memory can be wired together such that bits 0-7 are wired only to chip 0, 8-15 to chip 1, 16-23 to chip 2, 24-31 to chip 3. And for all integers N, memory location 4N is stored in chip 0, 4N+1 in chip 1, etc. And it is the Nth byte in each of those chips.
Let's look at the memory addresses stored at each offset of each memory chip:

    offset   chip 0   chip 1   chip 2   chip 3
      0        0        1        2        3
      1        4        5        6        7
      2        8        9       10       11
      N       4N      4N+1     4N+2     4N+3
So if you load from memory bytes 0-3, N=0, each chip reports its internal byte 0, the bits all end up in the right places, and everything is great.
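In code form, that wiring fixes both coordinates of every byte; this little C loop just prints the table above:

    #include <stdio.h>

    int main(void) {
        /* With four chips wired as above, byte address A lives in
           chip A % 4 at internal offset A / 4 -- and nowhere else. */
        for (unsigned a = 0; a < 12; a++)
            printf("address %2u -> chip %u, offset %u\n", a, a % 4, a / 4);
        return 0;
    }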
Now, if you try to load a word starting at memory location 1, what happens?
First, let's look at the way it is done. Memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that's where those memory chips are attached, even though you asked for them to be in bits 0-23. This isn't a big deal because the CPU can swizzle them internally, using the same circuitry used for logical shift left. Then, on the next transaction, memory byte 4, which is stored in memory chip 0 at offset 1, gets read into bits 0-7 and swizzled into bits 24-31, where you wanted it to be.
Notice something here. The word you asked for is split across offsets: the first memory transaction read from offset 0 of three chips, and the second memory transaction read from offset 1 of the other chip. Here's where the problem lies. You have to tell the memory chips the offset so they can send you the right data back, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals that connects to all the memory chips; to do a single transaction for an unaligned memory access, you would need an independent offset (called the address bus, BTW) running to each memory chip. For a 64-bit processor, you'd change from one address bus to eight, an increase of almost 300 pins. In a world where CPUs use between 700 and 1300 pins, this can hardly be called "not much increase in circuitry". Not to mention the huge increase in noise and crosstalk from that many extra high-speed signals.
OK, it isn't quite that bad, because there can only be a maximum of two different offsets out on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either "read the offset listed on the address bus" or "read the following offset", which is two states. But now there's an extra adder in each memory chip, which means it has to calculate the offset before actually doing the memory access, which slows down the maximum clock rate of the memory. That means aligned access gets slower if you want unaligned access to be faster. Since 99.99% of accesses can be made aligned, this is a net loss.
So that's why unaligned access gets split into two steps: because the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets, you also have different cache lines involved, so all the cache coherency logic would have to double to handle twice the communication between CPU cores.
In my opinion that's a very simplistic assumption. The circuitry can involve many layers of pipelining and caching optimisation to ensure that particular bits of memory get read. Memory reads are also delegated to memory subsystems that may be built from components with orders-of-magnitude differences in performance and design complexity, rather than reading memory in the way you imagine.
However, I'll add the caveat that I'm not a CPU or memory designer, so I could be talking a crock.
The answer to your question is in the question itself.
The CPU has an access granularity of 4 bytes, so it can only slurp up data in 4-byte chunks that start on 4-byte boundaries.
If you had accessed the address 0x0, the CPU would give you the 4 bytes from 0x0 to 0x3.
When you issue an instruction to access data at address 0x1, the CPU takes that as a request for 4 bytes of data starting at 0x1 (i.e. 0x1 to 0x4). This can't be interpreted in any other way, essentially because of the granularity of the CPU. Hence, the CPU slurps up the data from 0x0 to 0x3 and from 0x4 to 0x7 (ergo, 2 accesses), then puts the data from 0x1 to 0x4 together as the final result.
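The "putting together" step, sketched in C for a little-endian machine (fetch4 is a hypothetical stand-in for one aligned 4-byte access):

    #include <stdint.h>

    /* Stand-in for one aligned 4-byte access (little-endian). */
    static uint32_t fetch4(const uint8_t *mem, uint32_t aligned) {
        return (uint32_t)mem[aligned]
             | (uint32_t)mem[aligned + 1] << 8
             | (uint32_t)mem[aligned + 2] << 16
             | (uint32_t)mem[aligned + 3] << 24;
    }

    /* Unaligned 4-byte read at 0x1: two accesses, then combine.
       Bytes 0x1-0x3 come from the first chunk, byte 0x4 from the second. */
    static uint32_t read_at_1(const uint8_t *mem) {
        uint32_t lo = fetch4(mem, 0x0); /* bytes 0x0-0x3 */
        uint32_t hi = fetch4(mem, 0x4); /* bytes 0x4-0x7 */
        return (lo >> 8) | (hi << 24);
    }

    int main(void) {
        uint8_t mem[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        return read_at_1(mem) == 0x04030201 ? 0 : 1; /* bytes 1,2,3,4 */
    }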
Addressing 4 bytes starting at 0x1 instead of 0x0 means the access does not start on a word boundary and spills over into the next adjacent word. The first access grabs the 3 bytes up to the word boundary (assuming a 32-bit word), and the second access grabs byte 0x4 to complete the 4-byte, 32-bit word. The hardware, or on some machines the generated code, effectively does the second access and the concatenation transparently for the programmer. It's best to keep to word boundaries when possible, usually in units of 4 bytes.