Endianness only important when accessing identical data both as a doubleword and as eight bytes - endianness

I think I understand at least the basics of endianness, but I can't understand the last sentence of this paragraph in my computer organization text. Can someone provide an example that illustrates the point being made here?
Computers divide into those that use the address of the leftmost or “big end” byte as the doubleword address versus those that use the rightmost or “little end” byte. LEGv8 can work either as big-endian or little-endian. Since the order matters only if you access the identical data both as a doubleword and as eight bytes, few need to be aware of the “endianness”.
I should point out that "LEGv8" is a subset of the ARMv8 instruction set.

For the first part, consider how a processor accesses data in memory. It transfers memory contents to and from registers using load and store instructions of a given width. You don't need to consider the endianness: you just instruct the processor to load, perform other operations, and then store the result. Because a load and a store of the same data type use the same byte order, the round trip is symmetric and you never observe what that order is.
Now suppose you need to send that same doubleword across a network or save it to a file: these are examples of where you might need to access a doubleword as individual bytes. Maybe the other end of the network connection runs a different endianness. Maybe the file format specifies a certain endianness for interoperability. In these cases you need to know the byte order so you can keep it correct.

Related

How does store-to-load forwarding happen in the case of unaligned memory access?

I know that load/store queues facilitate store-to-load forwarding and the disambiguation of out-of-order speculative loads. This is accomplished by matching load and store addresses.
This matching-address technique will not work if the earlier store is to an unaligned address and the load depends on it. My question is: if this load is issued out-of-order, how does it get disambiguated against earlier stores? And what policies do modern architectures use to handle this condition?
Short
The short answer is that it depends on the architecture, but in principle unaligned operations don't necessarily prevent the architecture from performing store forwarding. As a practical matter, however, the much larger number of forwarding possibilities that unaligned operations represent means that forwarding from such locations may not be supported at all, or may be less well supported than the aligned cases.
Long
The long answer is that any particular architecture will have various scenarios it can handle efficiently, and others it cannot.
Old or very simple architectures may not have any store-forwarding capabilities at all. These architectures may not execute out of order at all, or may have some out-of-order capability but may simply wait until all prior stores have committed before executing any load.
The next level of sophistication is an architecture that at least has some kind of CAM to check prior store addresses. This architecture may not have store forwarding, but may allow loads to execute in-order or out-of-order once the load address and all prior store addresses are known (and there is no match). If there is a match with a prior store, the architecture may wait until the store commits before executing the load (which will read the stored value from the L1, if any).
Next up, we have architectures like the above that wait until prior store addresses are known and also do store forwarding. The behavior is the same as above, except that when a load address hits a prior store, the store data is forwarded to the load without waiting for it to commit to L1.
A big problem with the above designs is that loads still can't execute until all prior store addresses are known. This inhibits out-of-order execution. So next up, we add speculation: if a load at a particular IP has been observed to not depend on prior stores, we just let it execute (read its value) even if prior store addresses aren't known. At retirement there is a second check to ensure that the assumption of no hit to a prior store was correct, and if not, there is some type of pipeline flush and recovery. Loads that are predicted to hit a prior store wait until the store data (and possibly address) is available, since they'll need store-forwarding.1
That's kind of where we are at today. There are yet more advanced techniques, many of which fall under the banner of memory renaming, but as far as I know they are not widely deployed.
Finally, we get to answer your original question: how all of this interacts with unaligned loads. Most of the above doesn't change; we only need to be more precise about the definition of a hit, i.e., the cases in which a load reads data from a previous store.
You have several scenarios:
A later load is totally contained within a prior store. This means that all the bytes read by a load come from the earlier store.
A later load is partially contained within a prior store. This means that one or more bytes of the load come from an earlier store, but one or more bytes do not.
A later load is not contained at all within any earlier store.
On most platforms, all three possible scenarios exist regardless of alignment. However, for aligned values, the second case (partial overlap) can only occur when a larger load follows a smaller store, and if the platform supports only one size of load and store, situation (2) cannot occur at all.
Theoretically, direct2 store-to-load forwarding is possible in scenario (1), but not in scenarios (2) or (3).
To catch many practical cases of (1), you only need to check that the store and load addresses are the same, and that the load is not larger than the store. This still misses cases where a small load is fully contained in a larger store, whether aligned or not.
Where alignment helps is that the checks above are easier: you need to compare fewer bits of the addresses (e.g., a 32-bit load can ignore the bottom two bits of the address), and there are fewer possibilities to compare: a 4-byte load can only be contained in an 8-byte store in two possible ways (at the store address or the store address + 4), while misaligned operations can be fully contained in five different ways (at a load address offset any of 0,1,2,3 or 4 bytes from the store).
These differences are important in hardware, where the store queue has to look something like a fully-associative CAM implementing these comparisons. The more general the comparison, the more hardware is needed (or the longer the latency to do a lookup). Early hardware may have only caught the "same address" cases of (1), but the trend is towards catching more cases, both aligned and unaligned. Here is a great overview.
1 How best to do this type of memory-dependence speculation is something on which WARF holds patents, and on the basis of which it is actively suing all sorts of CPU manufacturers.
2 By direct I mean forwarding from a single store to a single following load. In principle, you might also have more complex forms of store forwarding that take parts of multiple prior stores and forward them to a single load, but it isn't clear to me whether current architectures implement this.
For ISAs like x86, loads and stores can be multiple bytes wide and unaligned. A store may write to one or more bytes where a load may read a subset of them. Or on the other hand the entire data written by a store can be a subset of the data read by a following load.
Now, to detect ordering violations (stores checking younger executed loads), this kind of overlap must be detected. I believe this can be accomplished even when unaligned accesses are involved if the LSQ stores both the address and the size (in bytes) of each access (along with other information such as its value, PC, etc.). With this information the LSQ can compute every address that each load reads from or store writes to (address <= "addresses involved in the instruction" < address + size) and check whether any of them match.
To add a little more on this topic: in reality, store-to-load forwarding (stores checking younger non-executed loads) might be restricted to certain cases, such as perfectly aligned accesses (I'm sorry, I'm not aware of the details implemented in modern processors). A forwarding mechanism that correctly handles most cases (and thus improves performance accordingly) comes at the cost of increased complexity. Remember that for unsupported forwarding cases you can always stall the load until the matching store(s) write back to the cache, at which point the load can read its value directly from the cache.

Storing a 32 byte object, on Ivy Bridge?

I am trying to find out whether, on Ivy Bridge, it's possible to write a 256-bit object consisting of various data types (int, double, float, etc.) in a single operation.
I have had a look at the Intel Manual and ctrl+f for "32-byte" but the results were all discussing 256-bits of the same data type (so 4x doubles or 8x floats etc).
I am doing this as part of a lock-free design to ensure data consistency: load all 256 bits of data together, then extract each of the various components separately.
I did a Web search, and it appears that Intel does not guarantee that a 32 byte write is atomic. I found this which suggests that not even regular 8 byte writes are guaranteed atomic.
Intel provides the compare and exchange 8 byte instruction which is atomic.
Bottom line is that I think you will need to take another approach.
EDIT: I forgot about the x86 lock prefix. Looking at this, it says that byte memory operations are guaranteed atomic, while larger operations are not unless the LOCK prefix is used on the read/write instruction.

Which of a misaligned store and misaligned load is more expensive?

Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
Actually, that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load or store unaligned 16-byte or 32-byte chunks may you see a small performance degradation.
How much data are we talking about? Two things unaligned at the ends of a large block of data (in the noise), or one item (a word, etc.) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance difference, if any.
Memories, modules, chips, on-die blocks, etc. are usually organized with a fixed access size; at least somewhere along the way there is a fixed access size. Let's say 64 bits wide, not an uncommon size these days. So at that layer, wherever it is, you can only write or read in aligned 64-bit units.
If you think about a write vs. a read: with a read you send out an address, it has to go to the memory, and data comes back, so a full round trip has to happen. With a write, everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire-and-forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not yet reached the memory. It does take time, but not as long as a read (not talking about flash/PROMs, just RAM here), since a read requires both paths. So for aligned, full-width accesses a write CAN BE faster; some systems may wait for the data to make it all the way to the memory and then return a completion, which takes perhaps about the same time as a read. It depends on your system, though: the memory technology can make one or the other faster or slower right at the memory itself. The first write after a quiet period can do this fire-and-forget thing, but the second or third or sixteenth in a row eventually fills up a buffer somewhere along the path, and the processor has to wait for the oldest write to make it all the way to memory before the most recent one has a place in the queue. So for bursty traffic writes may be faster than reads, but for large movements of data they approach each other.
Now, alignment. The whole memory width will be read on a read; in this case let's say 64 bits. If you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 56 bits are discarded; where that happens depends on the system. Writes that are not a whole, aligned memory width mean that you have to read the full width of the memory (64 bits), modify the new bits (say 8 of them), then write the whole 64 bits back: a read-modify-write. A read needs only a read; a write needs a read-modify-write, and the farther away from the memory the read-modify-write happens, the longer it takes. In any case the read-modify-write can't be faster than the read alone, so the read will be faster. The trimming of bits off a read generally won't take any time, so reading a byte compared to reading 16, 32, or 64 bits from the same location, so long as the busses and destination are that width all the way, takes the same time, in general, or should.
Unaligned access simply multiplies the problem. Worst case, if you want to read 16 bits such that 8 bits are in one 64-bit location and the other 8 are in the next, you need to read 128 bits to satisfy that 16-bit read. How exactly that happens, and how much of a penalty it is, depends on your system. Some busses set up the transfer in X clocks but deliver data at one clock per bus width after that, so a 128-bit read might be only one clock longer than the dozens to hundreds of clocks it takes to read 64 bits; or, worst case, it could take twice as long to get the 128 bits needed for this 16-bit read. A write is a read-modify-write, so take the read time, then modify the two 64-bit items, then write them back: the same deal, anywhere from X+1 clocks in each direction to as bad as 2X clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory: you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64-bit writes, etc. The way that happens, though, is that the cache performs same-sized or larger reads. So reading 8 bits may result in one or many 64-bit reads of the slow memory for the first byte; if right after that you read the next byte location, and that location is in the same cache line, it doesn't go out to slow memory, it reads from the cache, which is much faster. And so on, until you cross into another cache line or other reads cause that line to be evicted. If the location being written is in the cache, then the read-modify-write happens in the cache; if it is not, it depends on the system. A write doesn't necessarily mean the read-modify-write causes a cache-line fill; it could happen on the back side as if the cache were not there. But if you modified one byte in a cache line, that line now has to be written back; it cannot simply be discarded, so you have one to a few memory widths to write back as a result. Your modification was fast, but eventually the write happens to the slow memory, and that affects the overall performance.
You could have situations where you do a (byte) read and the cache line, being bigger than the external memory width, makes that read slower than if the cache weren't there; but then you do a byte write to some item in that cache line, and that is fast, since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading, say, 16 bits unaligned such that they not only cross a 64-bit memory-width boundary but also cross a cache-line boundary, so that two cache lines have to be read. Instead of reading 128 bits, that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks in your computer, for example, are actually multiple memories: say eight 8-bit-wide chips making a 64-bit overall width, or sixteen 4-bit-wide chips, etc. That doesn't necessarily mean you can isolate writes to one lane, but maybe; I don't know those modules very well, but there are systems where you can or could do this. For the purposes of this discussion, though, I would consider such systems to be 8 or 4 bits wide as far as the smallest addressable size goes, not 64. ECC makes things worse. First you need an extra memory chip or more, basically more width: 72 bits to support 64, for example. You must do full-width writes with ECC, since the whole 72 bits has to be self-checking, so you can't do fractions. If there is a correctable (single-bit) error, the read suffers no real penalty; it gets the corrected 64 bits (wherever in the path that checking happens). Ideally you want the system to write back the corrected value, but that is not how all systems work, so a read could turn into a read-modify-write, aligned or not. The primary penalty is that if you were able to do fractional writes before, you can't with ECC: they have to be whole-width writes.
Now to my question: say you use memcpy to move this data. Many C libraries are tuned to do aligned transfers, at least where possible. If the source and destination are unaligned in different ways, that can be bad, and you might want to manage part of the copy yourself. Say they are unaligned in the same way: the memcpy will copy the unaligned bytes first until it reaches an aligned boundary, then shift into high gear, copying aligned blocks until it gets near the end, then downshift and copy the last few bytes, if any, in an unaligned fashion. So if the memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends, then yes, it will cost you some extra reads, as much as two extra cache-line fills, but that may be in the noise. Even at smaller sizes, even if aligned on, say, 32-bit boundaries, if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved; aligned or not, you might only suffer an extra cache line's worth of reading and later writing.
The pure traditional, non-cached memory view of this, all other things held constant, is as Doug wrote. An unaligned read across one of these boundaries, like the 16 bits spanning two 64-bit words, costs you an extra read: 2R vs. 1R. A similar write costs you 2R+2W vs. 1W, much more expensive. Caches and other things just complicate the problem greatly, making the answer "it depends". You need to know your system pretty well and what other activity is going on around it, if any. Caches help and hurt: with any cache, a test can be crafted to show the cache makes things slower, and on the same system a test can be written to show the cache makes things faster.
For further reading, look at the data books/sheets and technical reference manuals, or whatever the vendor calls their docs. For ARM, get the AXI/AMBA documentation on their busses and the documentation for their cache controllers (the PL310, for example). Information on DDR memory, down to the individual chips used in the modules you plug into your computer, is all out there, with lots of timing diagrams, etc. (Note that just because you think you are buying gigahertz memory, you are not; DRAM has not gotten faster in ten years or more. It is pretty slow, around 133 MHz; it is just that the bus is faster and can queue more transfers. It still takes hundreds to thousands of processor cycles for a DDR memory cycle; read one byte that misses all the caches and your processor waits an eternity.) So docs on the processors' memory interfaces and on the various memories may help, along with textbooks on caches in general.

bit-wise matrix transposition in VHDL using blockram

I've been trying to figure out a nice way of transposing a large amount of data in VHDL using a block ram (or similar).
Using a vector of vectors it's relatively easy, but it gets icky for large amounts of data.
I want to use a dual-port block RAM so that I can write to one port and read out the other: write in 8-bit std_logic_vectors, read out 32-bit std_logic_vectors, where the 32 bits are (for the first rotation at least) the MSBs of input vectors 0 - 31, then 32 - 63, all the way to 294911; then MSB-1, etc.
The case described above is my ideal scenario. Is this even possible? I can't seem to find a nice way of doing this.
In this answer, I'm assuming 18kbit Xilinx-style BRAMs. I'm most familiar with the Virtex-4, so I'm referring to UG070 for the following. The answer will map trivially to other Xilinx FPGAs, and probably other vendors' parts as well.
The first thing to note is that Virtex-4 BRAMs can be used in dual-port mode with different geometries for each port. However, with 1-bit-wide ports, these RAMs' parity bits aren't useful. Thus, an 18kbit BRAM is only effectively 16kbits here.
Next, consider the number of BRAMs you need just for storage (irrespective of your design.) 294912x8 bits maps to 144 BRAMs, which is a pretty large resource commitment. There's always a balance between throughput, design complexity, and resource requirements; if you need to squeeze every MBit/sec out of the design, maybe a BRAM-based approach is ideal. If not, you should consider whether your throughput and latency requirements allow you to use off-chip RAM instead of BRAM.
If you do plan on using BRAMs, then you should consider an array of 18x8 BRAMs. The 8 columns each store a single input bit. Each BRAM is written via a 1-bit port, and read out with a 32-bit port.
Every 8-bit write is mapped to eight one-bit writes (one write to a BRAM in each column.)
Every 32-bit read is mapped to a single 32-bit read from a single BRAM.
You should need very little sequencing logic (around a RAMB16 primitive) to get this working. The major complexity is how to map address bits between the 1-bit port and the 32-bit port.
After a bit of research, and thought, this is my answer to this problem:
Due to the nature of block RAM addressing, the ideal scenario mentioned in the OP is not possible with the current block RAM addressing implementation. To perform a bit-wise matrix transposition in the manner described, the block RAM addressing would need to be able to switch between horizontal and vertical: the RAM would have to be accessible both row-wise and column-wise, with the addressing mode switchable in real time. Since a bit-wise data transposition is not a particularly "useful" transform, there isn't really a reason to implement such a switching scheme, especially since the whole point of block RAM is to store data in chunks of more than 1 bit, and such a transform would scramble the data.
I have discovered a way of changing my design so that the 294912 x 8 bits do not need to be transposed all at once, but can instead be handled in stages using a process. This does not rely on block RAM to perform the transform.

introduction to CS - stored-program concept - can't understand concept

I have really tried to understand the von Neumann architecture, but there is one thing I can't understand: how can the user know whether a number in the computer's memory is a command or data?
I know there is a 'stored-program concept', but I understood nothing...
Can someone explain it to me in two sentences?
Thanks!
Put simply, the user cannot look at a memory address and determine whether it holds a command or data. It can be both.
It's all in the interpretation: if the program counter points to a memory address, its contents will be interpreted as a command. If it is referenced by a read instruction, it is data.
The point of this is flexibility. A program can write (or re-write) programs into memory, that can then be executed by setting the program counter to the start address.
Modern operating systems limit this behaviour by data execution prevention, keeping parts of the memory from being interpreted as commands.
The basic idea of the stored-program concept is storing data and instructions together in main memory.
NOTE: This is a vastly oversimplified answer. I intentionally left a lot of things out for the sake of making the point
Remember that all computer memory is, for all intents and purposes on modern machines, a long list of bytes. The numbers are meaningless unless the thing that put them there has a specific purpose for them.
I could put the number 5 at address 0. It could represent the 5th instruction in my CPU's instruction-set manual. It could represent the number of hours of sleep I had last week. It's meaningless unless it's assigned some meaning.
So how do computers know what to actually "do" with the numbers?
It's a large combination of standards and specifications, which are documents or code that specify which data should go where, what each piece of data means, what the acceptable values for the data are, etc. Such standards are (usually) agreed upon by the masses.
Standards exist everywhere. Your BIOS has specifications as to where to look for the main operating system entry point on the boot media (your hard disk, a live CD, a bootable USB stick, etc.).
From there, the operating system adheres to standards that dictate where in memory the VGA buffer exists (0xb8000 on x86 machines, for example) in order to output all of that boot up text you see when you start your machine.
So on and so forth.
A Portable Executable (Windows), an ELF image (Linux), or a Mach-O image (macOS) is just a file that also follows a specification, usually mandated by the operating-system manufacturer, that puts pieces of code at specific positions in the file. That file is simply loaded into memory, given a specific virtual address in user space, and then the operating system knows exactly where the entry point of your program is.
From there, it sets up the instruction pointer (IP) to point to the current instruction byte. On most CPUs, the current byte pointed to by the IP activates specific circuits in the CPU to perform some action.
For example, on x86 CPUs, byte 0x04 is the ADD instruction that takes the next byte (so IP + 1), reads it as an unsigned 8 bit number, and adds it to the al register. This is mandated by the x86 specification, which all x86 CPUs have agreed to implement.
That means that when the IP register is pointing to a byte with the value 0x04, the CPU will perform the add and increase the IP by 2 - one to skip the ADD opcode itself, and one to skip its "argument" (operand).
The IP advances as fast as the CPU (and the operating system's scheduler) will allow it to - which amounts to a "running" program.
What the data mean is defined entirely by what's creating the data and what's using it. In the best of circumstances, the two parties agree, usually via a standard or specification of some sort.
