Are bytes real?

I know that this question may sound stupid, but let me explain. So...
Everyone knows that a byte is 8 bits. Simple, right? But where exactly is that specified? I mean, physically you don't really use bytes, but bits. For example, drives. As I understand it, a drive is just a really long string of ones and zeros, NOT bytes. Sure, there are sectors, but as far as I know those are implemented at the software level (at least in SSDs, I think). Also RAM, which is again a long stream of ones and zeros. Another example is the CPU. It doesn't process 8 bits at a time, but only one.
So where exactly is it specified? Or is it just a general rule that everyone follows? If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or would I not have to? Also, why can't you use less than a byte of memory? Or maybe you can? For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)? And last but not least: do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits, and bits 12-20 belong to something different?
I know that's a lot of questions, and knowing the answers doesn't change anything, but I was just wondering.
EDIT: OK, I might not have expressed myself clearly enough. I know that a byte is just a concept (well, even a bit is just a concept that we make real). I'm NOT asking why there are 8 bits in a byte or why bytes exist as a term. What I'm asking is where in a computer a byte is defined, or if it even is defined. If bytes really are defined somewhere, at what level (hardware level, OS level, programming language level, or just application level)? I'm also asking whether computers even care about bytes (in that concept that we've made real) and whether they use bytes consistently (like, between two bytes, can there be some 3 random bits?).

Yes, they’re real insofar as they have a definition and a standardised use/understanding. The Wikipedia article for byte says:
The modern de-facto standard of eight bits, as documented in ISO/IEC 2382-1:1993, is a convenient power of two permitting the values 0 through 255 for one byte (2^8 = 256, where zero counts as a value as well).[7] The international standard IEC 80000-13 codified this common meaning. Many types of applications use information representable in eight or fewer bits and processor designers optimize for this common usage. The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the eight-bit size.[8] Modern architectures typically use 32- or 64-bit words, built of four or eight bytes.
The full article is probably worth reading. No one set out a stall 50+ years ago, banged a fist on the desk and said ‘a byte shalt be 8 bits’; it became that way over time, as popular microprocessors were able to carry out operations on 8 bits at a time, and subsequent processor architectures carry out ops on multiples of this. While I’m sure Intel could make their next chip a 100-bit-capable one, I think the next bitness revolution we’ll encounter will be 128.
Everyone knows that byte is 8 bits?
These days, yes
But where exactly is it specified?
See above for the ISO standard.
I mean, physically you don't really use bytes, but bits.
Physically we don’t use bits either, but a threshold of detectable magnetic field strength on a rust-coated sheet of aluminium, or an amount of stored electrical charge.
As I understand it, it's just a really long string of ones and zeros and NOT bytes.
True, everything to a computer is a really long stream of 0 and 1. What is important in defining anything else is where to stop counting this group of 0 or 1, and start counting the next group, and what you call the group. A byte is a group of 8 bits. We group things for convenience. It’s a lot more inconvenient to carry 24 tins of beer home than a single box containing 24 tins
Sure, there are sectors, but as far as I know those are implemented at the software level (at least in SSDs, I think)
Sectors and bytes are analogous in that they represent a grouping of something, but they aren’t necessarily directly related in the way that bits and bytes are because sectors are a level of grouping on top of bytes. Over time the meaning of a sector as a segment of a track (a reference to a platter number and a distance from the centre of the platter) has changed as the march of progress has done away with positional addressing and later even rotational storage. In computing you’ll typically find that there is a base level that is hard to use, so someone builds a level of abstraction on top of it, and that becomes the new “hard to use”, so it’s abstracted again, and again.
Also RAM, which is again - a long stream of ones and zeros
Yes, and is consequently hard to use, so it’s abstracted, and abstracted again. Your program doesn’t concern itself with raising the charge level of some capacitive area of a memory chip; it uses the abstractions it has access to, and that abstraction fiddles the next level down, and so on until the magic happens at the bottom of the hierarchy. Where you stop on this downward journey is largely a question of definition and arbitrary choice. I don’t usually consider my RAM chips as something like ice cube trays full of electrons, or as subatomic quanta, but I suppose I could. We normally stop when it ceases to be useful in solving the problem.
Another example is CPU. It doesn't process 8 bits at a time, but only one.
That largely depends on your definition of ‘at a time’ - most of this question is about the definitions of various things. If we arbitrarily decide that ‘at a time’ is the unit block of the few hundred picoseconds it takes the CPU to complete a single cycle, then yes, a CPU can operate on multiple bits of information at once - that’s the whole idea of having a multiple-bit CPU that can add two 32-bit numbers together and not forget bits. If you want to slice time up so precisely that you can determine that enough charge has flowed to here but not there, then you could say which bit the CPU is operating on at this pico- (or smaller) second, but it’s not useful to go so fine-grained because nothing will happen until the end of the time slice the CPU is waiting for.
Suffice it to say, when we divide time just finely enough to observe a single CPU cycle from start to finish, we can say the CPU is operating on more than one bit.
If you write at one letter per second, and I close my eyes for 2 out of every 3 seconds, I’ll see you write a whole 3-letter word “at the same time” - you write “the cat sat on the mat” and, to the observer, you generated each word simultaneously.
CPUs run cycles for similar reasons, they operate on the flow and buildup of electrical charge and you have to wait a certain amount of time for the charge to build up so that it triggers the next set of logical gates to open/close and direct the charge elsewhere. Faster CPUs are basically more sensitive circuitry; the rate of flow of charge is relatively constant, it’s the time you’re prepared to wait for input to flow from here to there, for that bucket to fill with just enough charge, that shortens with increasing MHz. Once enough charge has accumulated, bump! Something happens, and multiple things are processed “at the same time”
So where exactly is it specified? Or is it just a general rule that everyone follows?
It was the general rule; then it was specified to make sure it carried on being the general rule.
If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or would I not have to?
You could, but you’d essentially have to write an adaptation (abstraction) on top of an existing processor architecture, and you’d use nine 8-bit bytes to achieve your presentation of eight 9-bit bytes. You’re creating an abstraction on top of an abstraction, and the boundaries of the basic building blocks don’t align. You’d have a lot of work to do to see the system out to completion, and you wouldn’t bother.
In the real world, if ice cube trays made 8 cubes at a time but you thought the optimal number for a person to have in the freezer was 9, you’d buy 9 trays, freeze them and make 72 cubes, then divvy them up into 8 bags of 9, and sell them that way. If someone turned up with 9 cubes’ worth of water (it melted), you’d have to split it over 2 trays, freeze it, and give it back... This constant adaptation between the industry-provided 8-slot trays and your desire to process 9 cubes is the adaptive abstraction.
If you do do it, maybe call it a nyte? :)
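To make the tray-juggling concrete, here is a minimal sketch of a “nyte” layer built on ordinary bytes: it stores the i-th 9-bit value across a plain byte array, doing exactly the boundary-straddling bookkeeping described above. The function names `put9`/`get9` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Store the i-th 9-bit "nyte" into a byte array (bit offset i*9).
 * Eight nytes fit exactly in nine 8-bit bytes. */
static void put9(uint8_t *buf, unsigned i, uint16_t v) {
    unsigned bit = i * 9;
    for (unsigned b = 0; b < 9; ++b, ++bit) {
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        if ((v >> b) & 1u) buf[bit / 8] |= mask;   /* set the bit   */
        else               buf[bit / 8] &= (uint8_t)~mask; /* clear it */
    }
}

/* Read the i-th 9-bit value back out. */
static uint16_t get9(const uint8_t *buf, unsigned i) {
    uint16_t v = 0;
    unsigned bit = i * 9;
    for (unsigned b = 0; b < 9; ++b, ++bit)
        v |= (uint16_t)(((buf[bit / 8] >> (bit % 8)) & 1u) << b);
    return v;
}
```

Note how every access crosses byte boundaries sooner or later - that loop over individual bits is the “constant adaptation” cost in code form.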
Also - why can't you use less than a byte of memory? Or maybe you can?
You can, you just have to work within the limitations of the existing abstraction being 8 bits. If you have 8 Boolean values to store, you can code things up so you flip individual bits of the byte on and off - even though you’re stuck with your 8-cube ice tray, you can selectively fill and empty each cube. If your program only ever needs 7 Booleans, you might have to accept the wastage of the other bit. Or maybe you’ll use it in combination with a regular 32-bit int to keep track of a 33-bit integer value. A lot of work, though, writing an adaptation that knows to progress onto the 33rd bit rather than just throw an overflow error when you try to add 1 to 4,294,967,295. Memory is plentiful enough that you’d waste the bit - or waste another 31 bits by using a 64-bit integer to hold your 4,294,967,296 value.
Generally, resources are so plentiful these days that we don’t care about wasting a few bits. It isn’t always so, of course: take credit card terminals sending data over slow lines. Every bit counts for speed, so the ancient protocols for information interchange with the bank might well use different bits of the same byte to code up multiple things.
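The selective filling and emptying of cubes is just bit masking. A minimal sketch (names are illustrative), packing eight Booleans into one byte:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Eight Boolean flags packed into a single byte. */
static uint8_t flags = 0;

static void set_flag(int i, bool on) {
    if (on) flags |= (uint8_t)(1u << i);    /* fill cube i  */
    else    flags &= (uint8_t)~(1u << i);   /* empty cube i */
}

static bool get_flag(int i) {
    return (flags >> i) & 1u;
}
```

Each flag costs one bit instead of the full byte a `bool` variable would normally occupy.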
For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)?
No, because hardware and OS memory management these days keep programs separate for security and stability. In the olden days, though, one program could write to another program’s memory (it’s how we cheated at games: watch the lives counter go down, just overwrite a new value). So in those days, if two programs could behave, and one would only write to the 4 high bits and the other to the 4 low bits, then yes, they could have shared a byte. Access would probably be whole-byte though, so each program would have to read the whole byte, change only its own bits, then write the entire result back.
And last, but not least - do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits, and bits 12-20 belong to something different?
Probably not, but you’ll never know, because you don’t get to peek far enough down the levels of abstraction to see the disk laid out as a sequence of bits and know where the byte boundaries are, or the sector boundaries, or whether this logical sector follows that logical sector, or whether a defect in the disk surface means the sectors don’t follow on from each other. You don’t typically care, though, because you treat the drive as a contiguous array of bytes (etc.) and let its controller worry about where the bits are.

Related

Xilinx FPGA resource estimation

I am trying to understand how to estimate the FPGA resource requirements for a design/application.
Let's say a Spartan-7 part has:
Logic Cells - 52160
DSP Slices - 120
Memory - 2700
How do I find out the number of CLBs, RAM, and Flash available?
Let's say my design needs a SPI interface in the FPGA.
How do I estimate the CLB, RAM, and Flash requirements for this design?
Thanks
Estimation of a block of logic can be done in a couple of ways. One method is to actually pen out the logic on paper and look at what registers you are planning on creating. Then you need to look at the part you're working with. In this case, the Spartan-7 has the CLB configuration below:
This is from the Xilinx UG474 7 Series document, pg 17. So now you can see the quantity of flops and memory per CLB. Once you look at the registers in the code and count up the memory in the design, you can figure out the number of CLBs. You can generally share memory and flops in a single CLB without issue; however, if you have multiple memories, quantization takes over. Two separate memories generally can't occupy the same CLB. There are other quantization effects too: memories come in perfect binary sizes, so if you build a 33-bit-wide memory x 128K locations, you will really absorb 64 x 128K bits of memory, where 31 bits x 128K are unused and untouchable for other uses.
The second method of estimating size is more experience-based, as practiced by larger FPGA teams: previous designs are looked at, and engineers make basic comparisons of logic to identify previous blocks that are similar to what you are designing next. You might argue that an I2C interface isn't 100% like a SPI interface, but they are similar enough that you could say 125% of I2C would be a good estimate of a SPI with some margin for error. You then just throw that number into a spreadsheet along with estimates for the 100 other modules in the design, and you call that the rough estimate.
If the estimate needs a second pass to make it more accurate, then you should throw a little code together and validate that it is functional enough to NOT have flops, gates, and memory optimized away, and then use that to shore up the estimate. This is tougher because optimization (read: dropping of unused flops) can happen all too easily, so you need to be certain that flops and gates are twiddle-able enough not to be interpreted as unused, always 1, or always 0.
To figure out the number of CLBs, you can use the CLB slice configuration table above. Take the number of flops and divide by 16 (for the 7 Series devices); this gives you the flop-based CLB count. Take the memory bits and divide each memory by 256 (again for 7 Series); this gives you the total CLBs based on memory. Then just take the larger of the two CLB counts, and that is your CLB estimate.
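That arithmetic can be sketched in a few lines (the per-CLB constants 16 and 256 are the 7 Series figures quoted above; the helper name is made up):

```c
#include <assert.h>

/* Rough CLB estimate for 7 Series parts: 16 flops and 256 memory
 * bits per CLB.  Both divisions round up, and the larger of the two
 * counts wins, per the estimation procedure described above. */
static unsigned clb_estimate(unsigned flops, unsigned mem_bits) {
    unsigned by_flops = (flops + 15) / 16;        /* ceil(flops/16)    */
    unsigned by_mem   = (mem_bits + 255) / 256;   /* ceil(mem_bits/256) */
    return by_flops > by_mem ? by_flops : by_mem;
}
```

For example, 100 flops alone already pin you to 7 CLBs, regardless of how little memory you use.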

How to test algorithm performance on devices with low resources?

I am interested in using Atmel AVR controllers to read data from a LIN bus. Unfortunately, messages on this bus have no beginning or end indicator, and the only reasonable solution seems to be brute-force parsing. Available data from the bus is loaded into a circular buffer, and the brute-force method finds valid messages in the buffer.
Working with a 64-byte buffer and a 20 MHz ATtiny, how can I test the performance of my code to see if a buffer overflow is likely to occur? Added: My concern is that the algorithm will run slowly, thus buffering even more data.
A bit about the brute-force algorithm: the second element in the buffer is assumed to be the message size. For example, if the assumed length is 22, the first 21 bytes are XORed and tested against the 22nd byte in the buffer. If the checksum passes, the code checks whether the first (SRC) and third (DST) bytes are what they are supposed to be.
AVR is one of the easiest microcontroller families for performance analysis, because it is a RISC machine with a simple instruction set and a well-known execution time for each instruction.
So, the basic procedure is that you take the assembly code and start calculating different scenarios. Basic register operations take one clock cycle, branches usually two cycles, and memory accesses three cycles. A XOR-and-compare pass would take maybe 5-10 cycles per byte, so it is relatively cheap. How you get your hands on the assembly code depends on the compiler, but all compilers tend to give you the end result in reasonably legible form.
Usually, without seeing the algorithm and knowing the timing requirements, it is quite impossible to give a definite answer to this kind of question. However, as the LIN bus speed is limited to 20 kbit/s, you will have around 10,000 clock cycles for each byte. That is enough for almost anything.
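For reference, the brute-force scan described in the question might look something like this sketch (the buffer layout and names are assumptions based on the question's description; the SRC/DST validation step is omitted):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BUF_SIZE 64  /* circular buffer, as in the question */

/* Try every start position; return the first offset holding a
 * message whose XOR checksum matches, or -1.  Per the question,
 * byte[1] of a candidate message is its length, and the last byte
 * is the XOR of all preceding bytes. */
static int find_message(const uint8_t *buf, size_t fill) {
    for (size_t s = 0; s < fill; ++s) {
        size_t len = buf[(s + 1) % BUF_SIZE];
        if (len < 3 || len > fill) continue;  /* implausible length */
        uint8_t x = 0;
        for (size_t i = 0; i < len - 1; ++i)
            x ^= buf[(s + i) % BUF_SIZE];     /* XOR first len-1 bytes */
        if (x == buf[(s + len - 1) % BUF_SIZE])
            return (int)s;                    /* checksum passed */
    }
    return -1;
}
```

The inner XOR loop is the 5-10-cycles-per-byte part; the outer loop over start positions is what the "all possible starting positions" approach below costs you until you are in sync.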
A more difficult question is what to do with the LIN framing, which depends on timing. It is not a very nice design, as it requires some extra effort from the microcontroller. (What on earth is wrong with using the 9th bit?)
The LIN frame consists of a
break (at least 13 bit times)
synch delimiter (0x55)
message id (8 bits)
message (0..8 x 8 bits)
checksum (8 bits)
There are at least four possible approaches with their ups and downs:
(Your approach.) Start at all possible starting positions and try to figure out where the checksummed message is. Once you are in sync, this is not needed. (Easy, but returns ghost messages with a probability of 1/256. Remember to discard the synch field.)
Use the internal UART and look for the synch field; try to figure out whether the data after the delimiter makes any sense. (This has lower probability of errors than the above, but requires the synch delimiter to come through without glitches and may thus miss messages.)
Look for the break. The easiest way to do this is to timestamp all arriving bytes. It is quite probably not required to buffer the incoming data in any way, as the data rate is very low (max. 2000 bytes/s). Nominally, the distance between the end of the last character of a frame and the start of the first character of the next frame is at least 13 bits. As receiving a character takes 10 bits, the delay between receiving the end of the last character in the previous message and the end of the first character of the next message is nominally at least 23 bits. In order to allow some tolerance for the bit timing, the limit could be set to, e.g., 17 bits. If the distance in time between "character received" interrupts exceeds this limit, the characters belong to different frames. Once you have detected the break, you may start collecting a new message. (This works almost according to the official spec.)
Do-it-yourself bit-by-bit. If you do not have good synchronization between the slave and the master, you will have to determine the master clock using this method. The implementation is not very straightforward, but one example is: http://www.atmel.com/images/doc1637.pdf (I do not claim that one is foolproof; it is rather simplistic.)
I would go with #3. Create an interrupt for incoming data, and whenever data arrives, compare the current timestamp (for which you need a counter) to the timestamp of the previous interrupt. If the inter-character time is too long, start a new message; otherwise append to the old message. Then you may need double buffering for the messages (one you are collecting, another you are analyzing) to avoid very long interrupt routines.
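The gap-based frame splitting from approach #3 can be sketched as pure logic (the 17-bit-time threshold comes from the discussion above; timestamps are assumed to be in bit times, and the function name is made up):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define GAP_LIMIT_BITS 17  /* gap longer than this = new frame */

/* Given per-byte arrival timestamps (in bit times), count how many
 * frames the stream splits into.  A gap exceeding GAP_LIMIT_BITS
 * between consecutive "byte received" events starts a new frame. */
static int count_frames(const uint32_t *t, size_t n) {
    if (n == 0) return 0;
    int frames = 1;
    for (size_t i = 1; i < n; ++i)
        if (t[i] - t[i - 1] > GAP_LIMIT_BITS)
            ++frames;
    return frames;
}
```

On the real ATtiny this comparison would live in the receive interrupt, with a free-running timer supplying the timestamps.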
The actual implementation depends on the overall structure of your code. This shouldn't take much time.
And if you cannot make sure your clock is well enough synchronized (±4%) to the master clock, then you'll have to look at #4, which is probably much more instructive but quite tedious.
Your fundamental question is this (as I see it):
how can I test the performance of my code in order to see if buffer overflow is likely to occur?
Set a pin high at the start of the algorithm and set it low at the end. Look at it on an oscilloscope (I assume you have one of these - embedded development is very difficult without one). You'll be able to measure the maximum time the algorithm takes, and also get some idea of the variability.

How to select the most powerful OpenCL device?

My computer has both an Intel GPU and an NVIDIA GPU. The latter is much more powerful and is my preferred device when performing heavy tasks. I need a way to programmatically determine which one of the devices to use.
I'm aware of the fact that it is hard to know which device is best suited for a particular task. What I need is to (programmatically) make a qualified guess using the variables listed below.
How would you rank these two devices? Intel HD Graphics 4400 on the left of each "vs", GeForce GT 750M on the right.
GlobalMemoryCacheLineSize 64 vs 128
GlobalMemoryCacheSize 2097152 vs 32768
GlobalMemorySize 1837105152 vs 4294967296
HostUnifiedMemory true vs false
Image2DMaxHeight 16384 vs 32768
Image2DMaxWidth 16384 vs 32768
Image3DMaxDepth 2048 vs 4096
Image3DMaxHeight 2048 vs 4096
Image3DMaxWidth 2048 vs 4096
LocalMemorySize 65536 vs 49152
MaxClockFrequency 400 vs 1085
MaxComputeUnits 20 vs 2
MaxConstantArguments 8 vs 9
MaxMemoryAllocationSize 459276288 vs 1073741824
MaxParameterSize 1024 vs 4352
MaxReadImageArguments 128 vs 256
MaxSamplers 16 vs 32
MaxWorkGroupSize 512 vs 1024
MaxWorkItemSizes [512, 512, 512] vs [1024, 1024, 64]
MaxWriteImageArguments 8 vs 16
MemoryBaseAddressAlignment 1024 vs 4096
OpenCLCVersion 1.2 vs 1.1
ProfilingTimerResolution 80 vs 1000
VendorId 32902 vs 4318
Obviously, there are hundreds of other devices to consider. I need a general formula!
You cannot have a simple formula to calculate an index from those parameters.
Explanation
First of all, let me assume you can trust the collected data; of course, if you read 2 for MaxComputeUnits but in reality it's 80, then there is nothing you can do (unless you have your own database of cards with all their specifications).
How can you guess if you do not know the task you have to perform? It may be something highly parallel (then more units may be better) or raw brute calculation (then a higher clock frequency or a bigger cache may be better). As with a normal CPU, the number of threads isn't the only factor to consider for parallel tasks. Just to mention a few things you have to consider:
Cache: how much local data each task works with?
Memory: shared with CPU? How many concurrent accesses compared to parallel tasks?
Instruction set: do you need something specific that increases speed even if other parameters aren't so good?
Misc stuff: do you have some specific requirement, for example size of something that must be supported and a fallback method makes everything terribly slow?
To make it short: you cannot calculate an index in a reliable way because the factors are too many and they are strongly correlated (for example, high parallelism may be slowed by a small cache or slow memory access, but a specific instruction, if supported, may give you great performance even if all other parameters are poor).
One Possible Solution
If you need a raw comparison you may simply do MaxComputeUnits * MaxClockFrequency (and it may even be enough for many applications), but if you need a more accurate index, don't think it'll be an easy task or that you'll end up with a general-purpose formula like (a + b / 2)^2: it's not, and the results will be very specific to the task you have to accomplish.
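To make the crude index concrete, here it is applied to the two devices from the question. Note that with the listed values it actually ranks the Intel part higher, which illustrates why this is only a raw comparison (and why trusting MaxComputeUnits blindly, as mentioned above, can mislead):

```c
#include <assert.h>

/* Crude device index: MaxComputeUnits * MaxClockFrequency (MHz). */
static long crude_index(long compute_units, long clock_mhz) {
    return compute_units * clock_mhz;
}
```

With the question's numbers: Intel HD 4400 scores 20 * 400 = 8000, while the GeForce GT 750M scores 2 * 1085 = 2170, so the crude formula picks the weaker device here.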
Write a small test (as similar as possible to your actual task; take a look at this post on SO) and run it with many cards; with big enough statistics you may extrapolate an index from an unknown set of parameters. The algorithms can become pretty complex, and there is a vast literature on this topic, so I won't even try to repeat it here. I would start with the Wikipedia article as a summary pointing to more specific papers. If you need an example of what you have to do, you may read Exploring the Multiple-GPU Design Space.
Remember that the more variables you add to your study, the more unstable the quality of your results will be; the fewer parameters you use, the less accurate your results will be. To better support extrapolation:
After you have collected enough data, you should first select and reduce the variables with some pre-analysis to a subset including only what most influences your benchmark results (for example, MaxWorkGroupSize may not be so relevant). This phase is really important, and decisions should be made with statistical tools (you may, for example, calculate the p-value).
Some parameters may have great variability (memory size, number of units), but analysis is easier with fewer values (for example [0..5) units, [5..10) units, [10..*) units), so you should partition the data (watching their distribution). Different partitions may lead to very different results, so you should try different combinations.
There are many other things to consider; a good book about data mining would help you more than 1000 words written here.
As @Adriano has pointed out, there are many things to take into consideration... too many things.
But I can think of a few (easier) things that could be done to help you out (not to completely solve your problem):
OCL Version
First things first: which version of OCL do you need (not really related to performance)? If you use some feature of OCL 1.2... well, problem solved.
Memory or computation bound
You can usually (and crudely) categorize your algorithms into one of two categories: memory bound or computation bound. If it's memory bound (with a lot of transfers between host and device), probably the most interesting info is which device has HostUnifiedMemory. If not, the device with the most powerful processors would most probably be more interesting.
Rough benchmark
But most probably it won't be that easy to choose which category to put your application in.
In that case you could make a small benchmark. Roughly, this benchmark would test different sizes of data (if your app has to deal with that) on dummy computations that more or less match the amount of computation your application requires (estimated by you after you have completed the development of your kernels). You could log the point where the amount of data is so big that it cancels out the advantage of the most powerful device when it is connected via PCIe.
GPU Occupancy
Another very important thing when programming on GPUs is GPU occupancy. The higher, the better. NVIDIA provides an Excel file that calculates the occupancy based on some inputs. Based on these concepts, you could more or less reproduce the calculation of the occupancy (some adjustment will most probably be needed for other vendors) for both GPUs and choose the one with the highest.
Of course, you need to know the values of these inputs. Some are based on your code, so you can calculate them beforehand. Some are linked to the specs of the GPU. You can query some of them as you already did; for others you might need to hardcode the values in some files after some googling (but at least you don't need to have these GPUs on hand to test on them). Last but not least, don't forget that OCL provides clGetKernelWorkGroupInfo(), which can give you info such as the amount of local or private memory needed by a specific kernel.
Regarding the info about the local memory, please note this remark from the standard:
If the local memory size, for any pointer argument to the kernel declared with the __local address qualifier, is not specified, its size is assumed to be 0.
So this info could be useless if you first have to dynamically compute the size from the host side. A work-around is to use the fact that the kernels are compiled JIT. The idea is to use the preprocessor option -D when calling clBuildProgram(), as I explained here. This would give you something like:
// SIZE is supplied at build time, e.g.:
//   clBuildProgram(program, 1, &device, "-D SIZE=64", NULL, NULL);
__kernel void mykernel(__global float *args) {
    __local float myLocalMem[SIZE];
    ....
}
And what if the easiest solution was:
After all the blabla: I'm guessing that you worry about this because you might want to ship your application to users without knowing what hardware they have. Would it be very inconvenient (at install time, or maybe afterwards via a command or a button) to simply run your application with dummy generated data to measure which device performs better, and simply log it in a config file?
Or maybe:
Sometimes, depending on your specific problem (one that doesn't involve too many syncs), you don't have to choose. Sometimes you can simply split the work between the two devices and use both...
Why guess? Choose dynamically on the hardware of the day: take the code you wish to run on the "best" GPU and run it, on a small amount of sample data, on each available GPU. Whichever finishes first: use it for the rest of your calculations.
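That dynamic selection can be sketched generically like this (the dummy functions stand in for enqueueing the same kernel on each OpenCL device; plain `clock()` is used for illustration, where a real harness would use OpenCL event profiling):

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*sample_fn)(void);

/* Run each candidate's sample workload once and return the index
 * of the one that finished first. */
static size_t pick_fastest(sample_fn *fns, size_t n) {
    size_t best = 0;
    double best_t = 1e300;
    for (size_t i = 0; i < n; ++i) {
        clock_t t0 = clock();
        fns[i]();
        double dt = (double)(clock() - t0);
        if (dt < best_t) { best_t = dt; best = i; }
    }
    return best;
}

/* Stand-ins for "same kernel on two devices". */
static void slow_dev(void) { volatile long s = 0; for (long i = 0; i < 20000000; ++i) s += i; }
static void fast_dev(void) {}
```

The same skeleton works whether the "devices" are OpenCL queues, threads, or remote workers; only the sample function changes.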
I'm loving all of the solutions so far. If it is important to make the best device selection automatically, that's how to do it (weight the values based on your usage needs and take the highest score).
Alternatively, and much simpler: just take the first GPU device, but also give the user a way to see the list of compatible devices and change the selection (either right away or on the next run).
This alternative is reasonable because most systems only have one GPU.

How is more than one bit read at a time?

This seems like it should be incredibly easy to grasp, but I am going insane trying to understand it. I understand that the computer only understands on or off, so what I am trying to understand is how the computer reads more than one on or off value at a time - as in byte-addressable memory - where it would need to read 8 on/off values to get the byte value. If the computer read these 8 values sequentially - one after the other - that would make sense to me. But from what I gather, it reads all of these on and off states at the same time. I assume this is accomplished through the circuit design? Would someone be able to either explain this in simple terms to my simple mind or direct me to a resource that could do so? I would love to not have to be committed to the mental hospital over this - thanks.
If you mean reading data off a CD, it does happen sequentially. Each bit is gathered individually by a laser, but this operation is still treated as occurring all at once conceptually, because the drive doesn't expose sufficiently fine-grained control to other parts of the computer for it to be instructed to read just one bit. Simplified, the way this works is that a laser bounces off a reflective layer which, depending on whether it is pitted or not in that particular place, either bounces the laser back to hit a photo-transistor (unpitted), setting a single bit to off, or deflects it away from the photo-transistor (pitted), setting the bit to on. Even though the drive loads these bits sequentially, it will always load complete bytes into its buffer before sending them to the computer's memory. Multiple bits travel from the drive's buffer to memory at once. This is possible because the bus connecting them has multiple wires. Each wire can only hold one value at a given time, either high voltage or low (on or off). Eight adjacent wires can hold a byte at once.
A hard drive works by the same principle, except rather than a laser and a pitted disc, you have a disk that is magnetized in specific locations, and a very sensitive inductor in which current flows one way or the other depending on the polarity of the moving magnetic field below it (i.e. the spinning disk).
Other parts of computer hardware actually do their operations in parallel. For sending information between different parts of the memory hierarchy, you can still think of this as just having multiple wires adjacent to each other.
Processing is actually built around the fact that multiple bits are accessed at once, i.e. that there are different wires with on and off values right next to each other. For example, in modern computers, a register can be thought of as 64 adjacent wires. (Calling them wires is a pretty big simplification at this point, but it still gets across the appropriate idea.) A processor is built up of logic gates, and logic gates are made up of transistors. A transistor is a physical object with four places to attach wires: (1), (2), (3), and (4). It allows current to flow from (1) to (2) if and only if the voltage at (3) is higher than the voltage at (4). Basically, a logic gate sets the voltages at (3) and (4) on a bunch of transistors, has a constant ON supply at (1), and its output at (2) is only powered if the voltages at (3) and (4) match the configuration the gate is built to respond to.
From there it is pretty easy to see how input of some pre-specified size can be operated on at once.
If each of the gates in the drawing below is an OR gate, the drawing shows a composite gate that calculates the OR of 8 bits at once instead of just two.
       ^          (calculated end -- registers get written to here)
   ^       ^      (everything that happens in here is "invisible" to the computer)
 ^   ^   ^   ^
a b c d e f g h   (registers get read here)
Processors are built with a bunch of such operations of some specified depth in number of transistors, so that every 2^-32 of a second or so (a clock period corresponding to a few GHz) the processor can flip a switch that allows the voltage stored at the calculated end of the gates to flow back to the registers, where it can be used in new computations or pushed over wires back to memory.
That, in a nutshell, is what it means for a computer to deal with multiple bits at once.

Which of a misaligned store and misaligned load is more expensive?

Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
Actually that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store unaligned 16-byte or 32-byte chunks might you see a small performance degradation.
How much data? Are we talking about two unaligned accesses at the ends of a large block of data (lost in the noise), or one item (a word, etc.) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance difference, if any.
Memories, modules, chips, on-die blocks, etc. are usually organized with a fixed access size; at least somewhere along the way there is a fixed access size. Let's just say 64 bits wide, not an uncommon size these days. So at that layer, wherever it is, you can only write or read in aligned 64-bit units.
If you think about a write vs. a read: with a read you send out an address, that has to go to the memory, and data has to come back, so a full round trip has to happen. With a write, everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire-and-forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not yet reached the memory. It does take time, but not as long as a read (not talking about flash/PROMs here, just RAM), since a read requires both paths. So for aligned, full-width accesses a write CAN BE faster; some systems may instead wait for the data to make it all the way to the memory and then return a completion, which takes perhaps about the same amount of time as a read. It depends on your system, though; the memory technology can make one or the other faster or slower right at the memory itself. Now, the first write after nothing has been happening can do this fire-and-forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path, and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads, but for large movements of data they approach each other.
Now, alignment. The whole memory width will be read on a read; in this case let's say 64 bits. If you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 56 bits are discarded; where exactly depends on the system. Writes that are not a whole, aligned memory width mean that you have to read the width of the memory (say 64 bits), modify the new bits (say 8 bits), then write the whole 64 bits back: a read-modify-write. A read only needs a read; a write needs a read-modify-write, and the farther from the memory the read-modify-write has to happen, the longer it takes and the slower it is. No matter what, the read-modify-write can't be any faster than the read alone, so the read will be faster. The trimming of bits off the read generally won't take any time, so reading a byte compared to reading 16, 32, or 64 bits from the same location takes the same time, so long as the buses and destination are that width all the way, in general, or should.
Unaligned simply multiplies the problem. Say, worst case, you want to read 16 bits such that 8 bits are in one 64-bit location and the other 8 are in the next 64-bit location: you need to read 128 bits to satisfy that 16-bit read. How exactly that happens, and how much of a penalty it is, depends on your system. Some buses set up the transfer in X clocks but move one bus width of data per clock after that, so a 128-bit read might be only one clock longer than the dozens to hundreds of clocks it takes to read 64 bits; or, worst case, it could take twice as long to get the 128 bits needed for this 16-bit read. A write is a read-modify-write, so take the read time, then modify the two 64-bit items, then write them back; same deal, it could be X+1 clocks in each direction or as bad as 2X clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory: you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64-bit writes, etc. How that happens, though, is that the cache performs same-sized or larger reads. So reading 8 bits may result in one or many 64-bit reads of the slow memory for the first byte; if you perform a second read right after that of the next byte location, and that location is in the same cache line, then it doesn't go out to slow memory, it reads from the cache, which is much faster. And so on, until you cross over into another cache line or other reads cause that cache line to be evicted. If the location being written is in the cache, then the read-modify-write happens in the cache; if it is not in the cache, then it depends on the system. A write doesn't necessarily mean the read-modify-write causes a cache-line fill; it could happen on the back side, as if the cache were not there. But if you modified one byte in the cache line, that line now has to be written back, it simply cannot be discarded, so you have one to a few widths of the memory to write back as a result. Your modification was fast, but eventually the write happens to the slow memory, and that affects the overall performance.
You could have situations where you do a (byte) read, and the cache line, if bigger than the external memory width, can make that read slower than if the cache weren't there; but then you do a byte write to some item in that cache line, and that is fast since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading, say, 16 bits unaligned such that not only do they cross a 64-bit memory-width boundary but they also cross a cache-line boundary, so that two cache lines have to be read. Instead of reading 128 bits, that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks in your computer, for example, are actually multiple memories: say maybe eight 8-bit-wide chips to make a 64-bit overall width, or sixteen 4-bit-wide chips to make an overall 64-bit width, etc. That doesn't mean you can isolate writes to one lane, but maybe; I don't know those modules very well, but there are systems where you can/could do this. Those systems, though, I would consider to be 8 or 4 bits wide as far as the smallest addressable size goes, not 64 bits, as far as this discussion is concerned. ECC makes things worse, though. First you need an extra memory chip or more, basically more width: 72 bits to support 64, for example. You must do full writes with ECC, as the whole 72 bits, let's say, has to be self-checking, so you can't do fractions. If there is a correctable (single-bit) error, the read suffers no real penalty: it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want a system to write back that corrected value, but that is not how all systems work, so a read could turn into a read-modify-write, aligned or not. The primary penalty is that whatever fractional writes you were able to do before, you can't do now; with ECC they have to be whole-width writes.
Now to my question: let's say you use memcpy to move this data. Many C libraries are tuned to do aligned transfers, at least where possible. If the source and destination are unaligned in different ways, that can be bad, and you might want to manage part of the copy yourself. Say they are unaligned in the same way: the memcpy will copy the unaligned bytes first until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, where it downshifts and copies the last few bytes, if any, in an unaligned fashion. So if this memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends, then yes, it will cost you some extra reads, as much as two extra cache-line fills, but that may be in the noise. Even at smaller sizes, even if aligned on, say, 32-bit boundaries, if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved; aligned or not, you might only suffer an extra cache line's worth of reading and later writing.
The pure traditional, non-cached memory view of this, all other things held constant, is as Doug wrote. An unaligned read across one of these boundaries, like the 16 bits across two 64-bit words, costs you an extra read: 2R vs. 1R. A similar write costs you 2R+2W vs. 1W, much more expensive. Caches and other things just complicate the problem greatly, making the answer "it depends"... You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt: with any cache, a test can be crafted to show the cache makes things slower, and on the same system a test can be written to show the cache makes things faster.
For further reading, go look at the databooks/datasheets and technical reference manuals, or whatever the vendor calls their docs, for the various parts. For ARM, get the AXI/AMBA documentation on their buses and the cache documentation for their cache (the PL310, for example). Information on DDR memory, the individual chips used in the modules you plug into your computer, is all out there: lots of timing diagrams, etc. (Note: just because you think you are buying gigahertz memory, you are not. DRAM has not gotten faster in 10 years or more; it is pretty slow, around 133 MHz. It is just that the bus is faster and can queue more transfers. It still takes hundreds to thousands of processor cycles for a DDR memory cycle; read one byte that misses all the caches and your processor waits an eternity.) So the memory-interface docs for the processors and docs on the various memories may help, along with textbooks on caches in general, etc.