vhdl 32bit counter on 16bit bus - vhdl

What is the best practice for accessing a changing 32-bit register (like a counter) through a 16-bit data bus?
I suppose I have to 'freeze' or copy the 32-bit value on a read of the LSB until the MSB is also read, and vice versa on a write, to avoid data corruption if the LSB overflows into the MSB between the two accesses.
Is there a standard approach to this?

As suggested in both the question and Morten's answer, a second register that holds the value at the time the first half is read is a common method. In some MCUs this register is shared by multiple devices, meaning you need to either disable interrupts across the two accesses or ensure ISRs don't touch the extra register. Writes are handled similarly, frequently in the opposite order (write the second word to temporary storage, then write the first word to the device, which triggers the device to take in the second word at the same time).
There have also been cases where you just can't access the register atomically. In such cases, you might need to implement additional logic to figure out the true value. An example of such an algorithm assuming three reads take much less than 1<<15 counter ticks might be:
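uint16_t earlyMSB = highreg;  /* high word sampled before the low word */
uint16_t midLSB = lowreg;
uint16_t lateMSB = highreg;   /* high word sampled after the low word */
/* A small low word means any carry happened before it was read (use lateMSB); */
/* a large low word means any carry happened after it was read (use earlyMSB). */
uint32_t fullword = ((uint32_t)(midLSB<0x8000 ? lateMSB : earlyMSB)<<16) | midLSB;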
Other variants might use an overflow flag to signal the more significant word needs an increment (frequently used to implement that part of the counter in software).

There is no standard way, but an often-used approach is to make a read of one address return the first 16 bits while the remaining 16 bits are captured at the same time, to be read later from another address.
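The capture-on-read idea can be sketched as a small software model. This is only an illustration under assumptions: the struct, the function names, and the choice to latch on the low-word read are all hypothetical, and real hardware would do the capture in the same bus cycle (in VHDL, a register loaded when the low address is selected).

```c
#include <stdint.h>

/* Hypothetical software model of a 32-bit counter behind a 16-bit bus. */
typedef struct {
    uint32_t count;      /* free-running counter */
    uint16_t shadow_hi;  /* upper half, captured when the lower half is read */
} counter32_t;

/* Reading the low word also latches the high word at the same instant. */
static uint16_t read_low(counter32_t *c) {
    c->shadow_hi = (uint16_t)(c->count >> 16);
    return (uint16_t)(c->count & 0xFFFFu);
}

/* Reading the high word returns the latched copy, not the live value. */
static uint16_t read_high(const counter32_t *c) {
    return c->shadow_hi;
}

/* Bus-master side: two 16-bit accesses, low word first.
 * ticks_between simulates the counter advancing between the accesses. */
static uint32_t read_counter(counter32_t *c, uint32_t ticks_between) {
    uint16_t lo = read_low(c);
    c->count += ticks_between;
    uint16_t hi = read_high(c);
    return ((uint32_t)hi << 16) | lo;
}
```

Even if the low half rolls over between the two accesses, the combined value is the one the counter held when the low word was read.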

Related

An algorithm to determine if a number belongs to a group or not

I'm not even sure if this is possible but I think it's worth asking anyway.
Say we have 100 devices in a network. Each device has a unique ID.
I want to tell a group of these devices to do something by broadcasting only one packet (A packet that all the devices receive).
For example, if I wanted to tell devices 2,5,75,116 and 530 to do something, I have to broadcast this : 2-5-75-116-530
But this packet can get pretty long if I wanted (for example) 95 of the devices to do something!!!
So I need a method to reduce the length of this packet.
After thinking for a while, I came up with an idea:
what if I used only prime numbers as device IDs? Then I could send the product of device IDs of the group I need, as the packet and every device will check if the remainder of the received number and its device ID is 0.
For example if I wanted devices 2,3,5 and 7 to do something, I would broadcast 2*3*5*7 = 210 and then each device will calculate "210 mod self ID" and only devices with IDs 2,3,5 and 7 will get 0 so they know that they should do something.
But this method is not efficient because the 100th prime number is 541, so the broadcast number may get really big and the "mod" calculation may get really hard (the devices have 8-bit processors).
So I just need a method for the devices to determine if they should do something or ignore the received packet. And I need the packet to be as short as possible.
I tried my best to explain the question. If it's still vague, please tell me and I'll explain more.
You can just use a bit string in which every bit represents a device. Then, you just need a bitwise AND to tell if a given machine should react.
You'd need one bit per device, which would be, for example, 32 bytes for 256 devices. Admittedly, that's a little wasteful if you only need one machine to react, but it's pretty compact if you need, say, 95 devices to respond.
You mentioned that you need the device id to be <= 4 bytes, but that's no problem: 4 bytes = 32 bits = enough space to store 2^32 device ids. For example, the device id for the 101st machine (if you start at 0) could just be 100 (0b01100100) = 1 byte. You would use that id to figure out which byte of the packet to look at (floor(100 / 8) = 12, i.e. the 13th byte) and bitwise AND that byte against a mask with bit 100 % 8 = 4 set, i.e. 1 << 4 = 0b00010000.
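A minimal sketch of that check in C, assuming at most 256 devices with ids starting at 0 and bit 0 of each byte being the lowest-numbered device in that byte (the names and the bit ordering are my own assumptions, not something fixed by the question):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_DEVICES 256
#define MASK_BYTES  (MAX_DEVICES / 8)   /* 32 bytes for 256 devices */

/* Sender side: mark device `id` as a target of the broadcast. */
static void select_device(uint8_t mask[MASK_BYTES], unsigned id) {
    mask[id / 8] |= (uint8_t)(1u << (id % 8));
}

/* Receiver side: one index, one shift and one AND, cheap on an 8-bit CPU. */
static bool is_selected(const uint8_t mask[MASK_BYTES], unsigned id) {
    return (mask[id / 8] & (1u << (id % 8))) != 0;
}
```

Each device only needs to know its own id; it never has to parse the ids of the other targets.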
As cobarzan said, you also can use a hybrid scheme allowing for individual addressing. In that scenario, you could use the first bit as a signal to indicate multiple- or single-machine addressing. As cobarzan said, that requires more processing, and it means the first byte can only store 7 machine signals, rather than 8.
Like Ed Cottrell suggested, a bit string would do the job. If the machines are labeled {1,..,n}, there are 2^n - 1 possible subsets (assuming you do not send requests with no intended target). So you need a data structure able to hold every possible signature of such a subset, whatever you decide the signature to be. And n bits (one for each machine) is the best one can do regarding the size of such a data structure. The evaluation performed on the machines takes constant time (the machine with label l just looks at the l-th bit).
But one could go for some hybrid scheme. Say you have a task for one device only; then it would be a pity to send n bits (all 0s, except one). So you can take one additional bit T which indicates the type of packet. The value of T is set to 0 if you are sending a bit string of length n as described above, or set to 1 if you are using a more appropriate scheme (i.e. fewer bits). In the case of just one machine that needs to perform the task, you could directly send the label of the machine (which is O(log n) bits long). This approach reduces the size of the packet if fewer than O(n/log n) machines need to perform the task. Evaluation on the machines is more expensive though.
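Here is a rough sketch of how a device might evaluate such a hybrid packet. To keep it simple it spends a whole byte on the type field T instead of a single bit, and it only supports "one label" or "full bit string"; the packet layout and all names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet types, stored in the first byte of the packet. */
enum { PKT_BITMASK = 0, PKT_SINGLE = 1 };

/* Returns true if the device with label `id` is addressed by the packet. */
static bool packet_targets(const uint8_t *pkt, size_t len, uint8_t id) {
    if (len < 2) {
        return false;                       /* too short to carry a target */
    }
    if (pkt[0] == PKT_SINGLE) {
        return pkt[1] == id;                /* payload is one device label */
    }
    if (pkt[0] == PKT_BITMASK) {
        size_t byte = 1 + id / 8u;          /* payload is one bit per device */
        if (byte >= len) {
            return false;                   /* mask does not cover this id */
        }
        return (pkt[byte] >> (id % 8u)) & 1u;
    }
    return false;                           /* unknown packet type: ignore */
}
```

A single-target packet is then 2 bytes regardless of n, at the cost of one extra comparison on every device.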

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. A 64-byte cache line has room for 16 four-byte ints (arr[0] to arr[15]), so the whole array can fit in one line.
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.
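As a software analogy of the "rotate and cut" step (not a description of any particular CPU), the following sketch pulls a value of a given size out of a 64-byte line at a given offset. It assumes little-endian byte order and an access that does not cross the end of the line; the function name is made up.

```c
#include <stdint.h>

#define LINE_SIZE 64u

/* Extract `size` bytes (1..8) starting at `offset` from a full cache line.
 * In hardware this would be a shifter/mux, not a loop. */
static uint64_t extract(const uint8_t line[LINE_SIZE], unsigned offset, unsigned size) {
    uint64_t value = 0;
    for (unsigned i = 0; i < size; i++) {
        value |= (uint64_t)line[offset + i] << (8u * i);
    }
    return value;
}
```

Fetching arr[5] from a line that starts at arr[0] would correspond to extract(line, 5 * 4, 4), assuming 4-byte ints.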

How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?

The reason this gets me confused is that all addresses hold a sequence of 1's and 0's. So how does the CPU differentiate, let's say, 00000100(integer) from 00000100(CPU instruction)?
First of all, different commands have different values (opcodes). That's how the CPU knows what to do.
The question remains: what's a command, and what's data?
Modern PCs use the von Neumann architecture ( https://en.wikipedia.org/wiki/John_von_Neumann) where data and opcodes are stored in the same memory space. (There are architectures that separate these two kinds of data, such as the Harvard architecture.)
Explaining everything in detail would be far beyond the scope of Stack Overflow; most likely the number of characters allowed per post would not be sufficient.
To answer the question in as few words as possible (everyone actually working at this level would kill me for the shortcuts in the explanation):
Data in the memory is stored at certain addresses.
Each CPU instruction basically consists of 3 different addresses (NOT values, just addresses!):
Address of what to do
Address of a value
Address of an additional value
So, assuming an addition should be performed and you have 3 addresses available in memory, the application would store the following (for 5+7; I use "verbs" for the instructions):
Address | Stored Value
1 | ADD
2 | 5
3 | 7
Finally the CPU receives the instruction 1 2 3, which then means ADD 5 7 (these things are order-sensitive: [Command] [v1] [v2])... And now things are getting complicated.
The CPU will move these values (actually not the values, just the addresses of the values) into its registers and then process them. The exact registers to choose depend on data type, data size and opcode.
In the case of the command #1 #2 #3, the CPU will first read these memory addresses and then know that ADD 5 7 is desired.
Based on the opcode for ADD the CPU will know:
Put Address #2 into r1
Put Address #3 into r2
Read Memory-Value Stored at the address stored in r1
Read Memory-Value stored at the address stored in r2
Add both values
Write result somewhere in memory
Store Address of where I put the result into r3
Store Address stored in r3 into the Memory-Address stored in r1.
Note that this is simplified. Actually the CPU needs exact instructions on whether it's handling a value or an address. In assembly this is done by using
eax (means the value stored in register eax)
[eax] (means the value stored in memory at the address stored in the register eax)
The CPU cannot perform calculations on values stored in memory, so it is quite busy moving values from memory to registers and from registers to memory.
For example, if you have
eax = 0x2
and in memory
0x2 = 110011
and the instruction
MOV ebx, [eax]
this means: move the value currently stored at the address that is currently stored in eax into the register ebx. So finally
ebx = 110011
(This is happening EVERY TIME the CPU does a single calculation! Memory -> Register -> Memory)
Finally, the requesting application can read its predefined memory address #2, resulting in address #2568, and then knows that the outcome of the calculation is stored at address #2568. Reading that address will result in the value 12 (5+7).
This is just a tiny, tiny example of what's going on. For a more detailed introduction, refer to http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
One cannot really grasp the amount of data movement and calculation done for a simple addition of 2 values. Doing what a CPU does (on paper) would take you several minutes just to calculate "5+7", since there is no "5" and no "7": everything is hidden behind an address in memory, pointing to some bits, resulting in different values depending on what the bits at address 0x1 are instructing...
Short form: The CPU does not know what's stored there, but the instructions tell the CPU how to interpret it.
Let's have a simplified example.
If the CPU is told to add a word (let's say, a 32-bit integer) stored at location X, it fetches the content of that address and adds it.
If the program counter reaches the same location, the CPU will again fetch this word and execute it as a command.
The CPU (other than security stuff like the NX bit) is blind to whether it's data or code.
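To make that concrete, here is a toy machine sketched in C. Everything about it (the opcode numbers, the 16-byte memory, the accumulator design) is invented for illustration and does not correspond to any real instruction set. The point is that the same byte, 00000100, gets used once as an integer and once as an instruction, depending only on how it is reached.

```c
#include <stdint.h>

/* Hypothetical opcodes for a toy machine; the numbers are arbitrary. */
enum { OP_HALT = 0, OP_INC = 4, OP_ADD_MEM = 7 };

typedef struct {
    uint8_t mem[16];  /* code and data share this memory */
    uint8_t pc;       /* program counter */
    uint8_t acc;      /* accumulator */
} toy_cpu;

/* Runs until a HALT or unknown opcode is fetched; returns the accumulator. */
static uint8_t run(toy_cpu *cpu) {
    for (;;) {
        uint8_t op = cpu->mem[cpu->pc++ & 15];       /* reached via pc: code */
        switch (op) {
        case OP_INC:
            cpu->acc++;
            break;
        case OP_ADD_MEM: {
            uint8_t addr = cpu->mem[cpu->pc++ & 15]; /* operand: an address */
            cpu->acc += cpu->mem[addr & 15];         /* reached via operand: data */
            break;
        }
        default:
            return cpu->acc;
        }
    }
}
```

With memory holding {OP_ADD_MEM, 2, OP_INC, OP_HALT}, the byte at address 2 is first added to the accumulator as the number 4 and then, when the program counter reaches it, executed as OP_INC.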
The only way data doesn't accidentally get executed as code is by carefully organizing the code to never refer to a location holding data with an instruction meant to operate on code.
When a program is started, the processor starts executing it at a predefined spot. The author of a program written in machine language will have intentionally put the beginning of their program there. From there, each instruction will always end up setting the next location the processor will execute to somewhere there is an instruction. This continues to be the case for all of the instructions that make up the program, unless there is a serious bug in the code.
There are two main ways instructions can set where the processor goes next: jumps/branches, and not explicitly specifying. If the instruction doesn't explicitly specify where to go next, the CPU defaults to the location directly after the current instruction. Contrast that with jumps and branches, which have space to specifically encode the address of the next instruction. Jumps always jump to the place specified. Branches check if a condition is true. If it is, the CPU will jump to the encoded location. If the condition is false, it will simply go to the instruction directly after the branch.
Additionally, a machine language program should never write data to a location that is meant for instructions, or some other instruction at some future point in the program could try to run what was overwritten with data. Having that happen could cause all sorts of bad things. The data there could have an "opcode" that doesn't match anything the processor knows how to execute. Or the data there could tell the computer to do something completely unintended. Either way, you're in for a bad day. Be glad that your compiler never messes up and accidentally inserts something that does this.
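The rule for picking the next location can be written down in a few lines. This is a simplified sketch with made-up types, assuming each instruction knows its own encoded length and, for jumps and branches, an absolute target address:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical instruction categories, just for this illustration. */
typedef enum { KIND_PLAIN, KIND_JUMP, KIND_BRANCH } kind_t;

typedef struct {
    kind_t   kind;
    uint32_t length;  /* size of the encoded instruction in bytes */
    uint32_t target;  /* used by jumps and branches only */
} insn_t;

/* Where does the processor go after executing `in` located at `pc`? */
static uint32_t next_pc(uint32_t pc, insn_t in, bool condition) {
    switch (in.kind) {
    case KIND_JUMP:
        return in.target;                              /* always taken */
    case KIND_BRANCH:
        return condition ? in.target : pc + in.length; /* taken only if true */
    default:
        return pc + in.length;                         /* fall through */
    }
}
```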
Unfortunately, sometimes the programmer using the compiler messes up, and does something that tells the CPU to write data outside of the area they allocated for data. (A common way this happens in C/C++ is to allocate an array L items long, and use an index >=L when writing data.) Having data written to an area set aside for code is what buffer overflow vulnerabilities are made of. Some program may have a bug that lets a remote machine trick the program into writing data (which the remote machine sent) beyond the end of an area set aside for data, and into an area set aside for code. Then, at some later point, the processor executes that "data" (which, remember, was sent from a remote computer). If the remote computer/attacker was smart, they carefully crafted the "data" that went past the boundary to be valid instructions that do something malicious. (To give them more access, destroy data, send back sensitive data from memory, etc).
This is because an ISA must define what the valid set of instructions is and how operands are encoded: memory addresses, registers and literals.
See this for more general info on how an ISA is designed:
https://en.wikipedia.org/wiki/Instruction_set
In short, the operating system tells it where the next instruction is. In the case of x64 there is a special register called rip (instruction pointer) which holds the address of the next instruction to be executed. It will automatically read the data at this address, decode and execute it, and automatically increment rip by the number of bytes of the instruction.
Generally, the OS can mark regions of memory (pages) as holding executable code or not. If an error or exploit tries to modify executable memory, an error should occur. Similarly, if the CPU finds itself trying to execute non-executable memory, it will/should also signal an error and terminate the program. Now you're into the wonderful world of software viruses!

How do bits become a byte? [closed]

Possibly the most basic computer question of all time, but I have not been able to find a straightforward answer and it is driving me crazy. When a computer 'reads' a byte, does it read it as a sequential series of ones and zeros one after the other, or does it somehow read all 8 ones and zeros at once?
A computer system reads the data in both ways, depending on the type of operation and how the digital system is designed. I'll explain this with the very simple example of a full adder circuit.
A full adder adds binary numbers and accounts for values carried in as well as out (Wikipedia)
Example of Parallel operation
Suppose in some task we need to add two 8-bit (1-byte) numbers, and all bits are available at the time of addition.
In that case we can design a digital system with 8 full adders (one for each bit).
Example of Serial Operation
In some other task you observe that all 8 bits will not be available simultaneously.
Or you decide that having 8 separate adders is too costly because you also need to implement other mathematical operations (like subtraction, multiplication and division). So instead of 8 separate units you have 1 unit which processes the bits individually. In this scenario we need three storage units (shift registers): two store the two 8-bit numbers and one stores the result. On each clock pulse, a single bit is transmitted from each of the two input registers to the full adder, which performs the addition and transfers the 1-bit result to the result shift register within that same clock pulse.
The figure that accompanied this answer contains some additional material that is not needed for this thread, but you can study digital logic design and computer architecture if you want to go deeper into this.
Shift register
Shift register operations demo
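The serial scheme from this answer can be imitated in software. The sketch below is only a model under assumptions: the function names are made up, the three shift registers are plain variables, and each loop iteration stands in for one clock pulse.

```c
#include <stdint.h>

/* One full adder: returns the sum bit and writes the carry-out bit. */
static unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | (cin & (a ^ b));
    return a ^ b ^ cin;
}

/* Bit-serial addition: a single full adder reused for 8 clock pulses.
 * x and y act as the two input shift registers, result as the third. */
static uint8_t serial_add(uint8_t x, uint8_t y) {
    uint8_t result = 0;
    unsigned carry = 0;
    for (int clock = 0; clock < 8; clock++) {
        unsigned sum = full_adder(x & 1u, y & 1u, carry, &carry);
        x >>= 1;                                       /* shift next bit into place */
        y >>= 1;
        result = (uint8_t)((result >> 1) | (sum << 7)); /* shift sum bit in */
    }
    return result;  /* the final carry-out is dropped, so the sum wraps at 256 */
}
```

The parallel version would be the same full_adder instantiated 8 times with the carries chained, producing all bits in one step instead of eight.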
This is really kind of outside the scope of Stackoverflow, but it brings back such fond memories from college.
It depends. Sometimes a computer reads bits one at a time. For example, older Ethernet uses Manchester code. However, on old parallel printer cables there were 8 pins, each one signaling a bit, and an entire octet (byte) was sent at once.
In serial (one-bit-at-a-time) encodings, you're typically measuring transitions in the line or transitions against some well-defined clock source.
In parallel encodings, you're typically reading all the bits into a register at a time and latching the register.
Look up flipflops, registers, and logic gates for information on the low-level parts of this.
Bits are transmitted one at a time in serial transmission, and
multiple numbers of bits in parallel transmission. A bitwise operation
optionally processes bits one at a time. Data transfer rates are usually
measured in decimal SI multiples of the unit bit per second (bit/s),
such as kbit/s.
Wikipedia's article on Bit
The processor works with a defined register length: 8, 16, 32, 64... Think of a register as a set of connections, one for each bit. That is the number of bits that will be processed at once in one processor core, one register at a time. The processor has different kinds of registers; examples are the private instruction register and the public data or address registers.
Think of it this way, at least at a physical level: In a transmission cable from point A to B (A and B can be anything, hard drive, CPU, RAM, USB, etc.) each wire in that cable can transmit one bit at a time. Both A and B have a clock pulsing at the same rate. On each pulse, the sender changes the amount of power going down each wire to signify the value of the new bit(s). So, the # of wires in the cable = the # of bits that can be transmitted each "pulse". (Note: This is a very simplified and theoretical explanation).
At a software level, in the CPU, you can never address anything smaller than a byte. You can "access" and manipulate specific bits by using the bitwise operators (& (AND), | (OR), << (Left Shift), >> (Right Shift), ^ (XOR)).
In hardware, the number of bits being sent each pulse is completely dependent on the hardware itself.
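For instance, single bits inside a byte are usually reached with small helpers like these (the function names are mine; the operators are the standard C ones listed above):

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit n is counted from 0 (least significant) to 7 (most significant). */
static bool get_bit(uint8_t byte, unsigned n) {
    return (byte >> n) & 1u;
}

static uint8_t set_bit(uint8_t byte, unsigned n) {
    return (uint8_t)(byte | (1u << n));
}

static uint8_t clear_bit(uint8_t byte, unsigned n) {
    return (uint8_t)(byte & ~(1u << n));
}

static uint8_t flip_bit(uint8_t byte, unsigned n) {
    return (uint8_t)(byte ^ (1u << n));
}
```

In every case the CPU still loads and stores the whole byte; the operators only pick out or change the bit of interest.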

Why is alignment important?

I know that some processors fail with misaligned data, and others like the oh-so-common x86, would just be slower with that.
My question is why? Why is it harder for an x86 processor to get the data from the pointer 0x12345679 than it is from the pointer 0x12345678? Just to be clear, I'm aware that page faults may happen if the data is in multiple pages, and I understand that more data may need to be fetched from memory (one part for the start of the value and one for the end), but that isn't always true and this isn't what my question is about. I'm asking, why is it always slower?
Suppose the memory starts at 0x10000000. Why is it harder for the processor to get a 2-byte short from 0x10000001 than it is from 0x10000002? Why is it harder to get a 4-byte int from 0x10000001 than it is from 0x10000000? And so forth.
Because the data bus is wider than eight bits.
Let's assume that the data bus is 32 bits wide. To get 16 bits from address 0x10000001, it has to get the four bytes that start at 0x10000000 and shift the value to get the two bytes in the middle.
To get 16 bits from the address 0x10000003, it has to get the words that start at 0x10000000 and 0x10000004, and use one byte from each value.
The processor can only access memory in an aligned fashion. This is a consequence of how the interconnect between the processor and memory functions.
When a processor supports unaligned reads, what's really happening is the processor issuing two separate reads (or one read of larger size) and stitching the parts together, which is why it's slower than an aligned read.
One example: if the databus is 32 bits and a 32 bit value is not on a 32 bit boundary, the bytes will have to be fetched in more than one operation and moved around to load the value properly into a processor register.
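The stitching can be modelled in software. The sketch below assumes a 32-bit bus that only delivers aligned words, little-endian byte order, and addresses counted from the start of a small array; the function names and the read counter are hypothetical and exist only to show when a second bus access becomes necessary.

```c
#include <stdint.h>

/* The only access the "bus" offers: one aligned 32-bit word per read. */
static uint32_t read_word_aligned(const uint32_t *mem, uint32_t addr, unsigned *bus_reads) {
    (*bus_reads)++;
    return mem[addr / 4];
}

/* Read a 16-bit value from any byte address using only aligned word reads. */
static uint16_t read_u16(const uint32_t *mem, uint32_t addr, unsigned *bus_reads) {
    unsigned shift = (addr % 4) * 8;
    uint32_t lo = read_word_aligned(mem, addr & ~3u, bus_reads);
    if (addr % 4 != 3) {
        return (uint16_t)(lo >> shift);            /* both bytes in one word */
    }
    /* The value straddles two words: fetch the next word and stitch. */
    uint32_t hi = read_word_aligned(mem, (addr & ~3u) + 4, bus_reads);
    return (uint16_t)((lo >> 24) | (hi << 8));
}
```

Offsets 0 to 2 within a word need one bus read plus a shift; offset 3 needs two reads, which mirrors the 0x10000003 case described above.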
