I was reading the datasheet for the Atmel ATmega16 microcontroller and I came across this passage in the USART section:
The two Buffer Registers operate as a circular
FIFO buffer. Therefore the UDR must only be read once for each incoming data! More
important is the fact that the Error Flags (FE and DOR) and the 9th data bit (RXB8) are
buffered with the data in the receive buffer. Therefore the status bits must always be read
before the UDR Register is read. Otherwise the error status will be lost since the buffer state
is lost.
I have no idea what buffering the Error Flags and RXB8 means. Any help will be appreciated.
The main caveat here is that UDR must only be read once per incoming byte and that in reading it, the error flags mentioned get wiped out. So, if one is interested in the error flags or the "9th bit" in RXB8, then they must be read before reading UDR.
However, in ten years of AVR and serial communications designs I've never had to resort to using RXB8. Why? It's only needed when you're using 9 data bits for your communications. See page 155 of the datasheet for examples in C and assembler. Most data communications use 7 or (more commonly) 8 bits, so this extra bit is not necessary most of the time. If you need it, simply follow the example on p. 155.
More important is the fact that the Error Flags (FE and DOR) and the 9th data bit (RXB8) are buffered with the data in the receive buffer. Therefore the status bits must always be read before the UDR Register is read. Otherwise the error status will be lost since the buffer state is lost.
This just points out that the Error Flags and the 9th data bit are (obviously) coupled to the data in the UDR FIFO and are lost as soon as you read UDR.
Example:
If you use 9 data bits, you have to read that 9th bit before reading UDR. Otherwise, the next byte in the FIFO (including its status bits) would overwrite the information of the 9th bit that belonged to the previous byte. The same applies to the error bits.
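For illustration, here is a minimal C sketch in the spirit of the datasheet's 9-bit receive example (p. 155). It assumes avr-gcc/avr-libc register and bit names for the ATmega16 (UCSRA, UCSRB, UDR, RXC, FE, DOR, PE, RXB8); the exact names may differ slightly depending on your toolchain headers.

    #include <avr/io.h>

    /* Read one 9-bit frame. Status flags and RXB8 are sampled BEFORE UDR,
     * because reading UDR advances the receive FIFO and discards them. */
    unsigned int usart_receive_9bit(void)
    {
        unsigned char status, resh, resl;

        while (!(UCSRA & (1 << RXC)))          /* wait for a received frame   */
            ;

        status = UCSRA;                        /* FE, DOR, PE for THIS frame  */
        resh   = UCSRB;                        /* contains RXB8, the 9th bit  */
        resl   = UDR;                          /* reading UDR pops the FIFO   */

        if (status & ((1 << FE) | (1 << DOR) | (1 << PE)))
            return 0xFFFF;                     /* signal an error             */

        resh = (resh >> RXB8) & 0x01;
        return ((unsigned int)resh << 8) | resl;
    }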
I am a bit confused about how the start and stop bits are differentiated from the actual data bits. For example, say "data", whose binary is 01100100 01100001 01110100 01100001, is being sent from System A to System B as a single packet (because it's less than 64 Kibibytes), bit by bit. Please let me know how the start and stop bits are added to these data bits. There were two related threads on Stack Overflow with only one answer, which was not accepted and is very confusing. Can someone explain it in simple terms, please? Thank you.
When you want to send data over a serial line, you need to synchronize the transmitter and receiver. The start bit simply marks the beginning of the data chunk (typically one byte, with or without a parity bit), and the stop bit marks the end of the data chunk.
In the beginning, there's no data being transmitted - let's say the line idles at '1' for some time. The receiver is waiting for the start bit (the start bit is always '0', the stop bit always '1'). When the start bit arrives, it starts an internal timer and on every tick it reads the value from the line, until all data and parity bits are read. Then it checks for the stop bit and goes back to waiting for a new start bit.
Without the start bit, the receiver would not know when to start reading the data bits. Imagine sending a byte of all ones (0xFF) without parity: the line would just stay in the '1' (idle) state the whole time.
The stop bit is not strictly necessary; it's there mainly to improve reliability: it forces the line back to the idle ('1') level for at least one bit time, so the edge of the next start bit can be detected, and it gives some margin for clock mismatch (receiver and transmitter must use nominally the same frequency).
So, the start and stop bits don't need to be distinguished from data bits. Quite the opposite: they allow the receiver to properly identify the data bits.
When sending your data, you would take it byte by byte, and for each byte you would send the start bit ('0') first, then the individual data bits, then maybe a parity bit, and then a '1' - the stop bit - everything at the agreed frequency. Then you would wait at least one timer tick (one bit time).
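To make this concrete, here is a minimal bit-banged transmit sketch in C for the 8N1 case. The helpers set_tx_line() and wait_one_bit_time() are hypothetical placeholders for whatever drives the TX pin and delays for one bit period on your hardware.

    #include <stdint.h>

    /* Hypothetical helpers: drive the TX line high/low and wait one bit
     * period (e.g. about 104 us at 9600 baud). Not part of any real API. */
    void set_tx_line(int level);
    void wait_one_bit_time(void);

    /* Send one byte as 8N1: idle high, start bit low, 8 data bits LSB
     * first, no parity, stop bit high. */
    void uart_send_byte(uint8_t byte)
    {
        set_tx_line(0);                     /* start bit                    */
        wait_one_bit_time();

        for (int i = 0; i < 8; i++) {       /* data bits, LSB first         */
            set_tx_line((byte >> i) & 1);
            wait_one_bit_time();
        }

        set_tx_line(1);                     /* stop bit, line back to idle  */
        wait_one_bit_time();
    }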
Usually you don't need to do all of this yourself, because there are specialized chips for it on the board. You just put your data in a buffer and wait until it has been sent, or you wait for data to be received.
I have some questions related to the SPIxCON registers of the SPI module. I am using a PIC18F26K83.
1) There is SPIxTCNTH: SPI TRANSFER COUNTER MSB REGISTER. I can set the first 3 bits in it, which count the bits to be transmitted, and according to the datasheet they are writable. If it counts the bits that will be transmitted, why is it writable? Do I need to write it according to the number of bits that I will send? Or is it only there to inform the user?
2) There is SPIxTWIDTH: SPI TRANSFER WIDTH REGISTER. In case of BMODE=1, it is
Size (in bits) of each transfer counted by the transfer counter
I will be sending values such as 1.1 or 2.3 to a DAC. In this case what should I set it to? Is there a standard value for this register?
3) I couldn't work out what the FIFO registers are for; according to the datasheet we cannot control them by software. Aren't they like a buffer? So if I write to the transmit register faster than the transmission speed, the transmit data will be put into the FIFO and transmitted one by one. Am I correct? Do I need to do anything other than write to the transmit buffer?
4) I read about the polarity bits in SPIxCON1 but could not understand them. Is it okay if I do not touch these bits in the control register? I do not want to mess things up.
5) How can I select slaves? There is an SSET (slave select enable) bit in the SPIxCON2 register. I can set it to 1, but then how do I select the slave?
Thank you for your answers. I am a newbie, so sorry for the simple and maybe nonsensical questions. I could simply show my configuration code, but I believe it would be harder to analyse.
1) The transfer counter (when in use) is written with the number of bytes, or partial bytes, to send or receive (depending on mode). So, if you are using it (BMODE=0 or TXR=0), you'd set it to the number of bytes that you expect to send or receive.
2) You'd need to look at the binary representation of those numbers to see how many bits you'd be sending in each case. The standard value is a full byte (8 bits).
3) The FIFOs are hidden elements; writing to the SPIxTXB register or reading from the SPIxRXB register accesses the respective FIFO. The FIFOs are only two bytes deep, so you'd still need to check for overrun (the TXWE bit, if I recall correctly) if you are sending fast. If you have lots of data to transfer quickly, I'd recommend using DMA to do the transfer: you just set it up, let it go, and can do other things until it is finished.
4) I think the polarity bits just set the idle level of the lines to either high or low. They should be set the same for everyone (masters and slaves).
5) If you only have one slave, you can tie that line to the slave's enable line. If you have more than one slave, you'll need to set up a GPIO line for each and, for each one, OR the signals together and attach the OR output to that slave's enable (if it is active low, which it usually is). Make sure only one slave is active at a time. Daisy-chaining slaves can be done as well; I haven't worked with that kind of setup.
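As a rough illustration of the GPIO chip-select approach (not a verified configuration), here is a sketch for XC8. The pin RC2 and the helper spi1_exchange_byte() are assumptions made up for this example; the helper, which would write SPIxTXB, wait for the receive flag, and read SPIxRXB, is left as a prototype to be filled in from the datasheet.

    #include <xc.h>
    #include <stdint.h>

    /* Hypothetical helper: performs one full-duplex SPI1 byte exchange
     * (write SPI1TXB, wait for the receive flag, read SPI1RXB). Implement
     * it per the PIC18F26K83 datasheet and your configuration. */
    uint8_t spi1_exchange_byte(uint8_t out);

    /* Manual slave select on a GPIO pin (RC2 here, purely as an example,
     * active low). With several slaves, give each its own pin and make
     * sure only one is driven low at a time. */
    uint8_t dac_write_byte(uint8_t value)
    {
        LATCbits.LATC2 = 0;                    /* assert this slave's CS  */
        uint8_t reply = spi1_exchange_byte(value);
        LATCbits.LATC2 = 1;                    /* deassert CS             */
        return reply;
    }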
I am currently writing a small debugger in assembly on the Windows platform.
I open the debuggee process as follows:
invoke CreateProcess, addr buffer, NULL, NULL, NULL, FALSE, DEBUG_PROCESS+DEBUG_ONLY_THIS_PROCESS, NULL, NULL, addr startinfo, addr pi
It works well; I can get EIP by looking at the context of the debuggee, and so I can get the first byte of the instruction that will be executed.
However, I need to get the size in bytes of the previously executed instruction.
Instructions do not all have the same size: sometimes an instruction is just 1 byte, other times 6 bytes or more.
I tried subtracting the previous EIP from the current EIP in order to get the number of bytes that were executed, but it doesn't work if there is a jmp or a call, because control jumps somewhere else and the difference is no longer the instruction size.
I planned to build a map of all opcodes and do some comparisons, but that seems like a huge amount of work.
If you have any idea how to get the number of bytes of the previously executed instruction (maybe by looking into a cache or something like that), please let me know.
Best regards
TL;DR
Keep it simple: single step and decode only the branch instructions and use EIP - last EIP unless the last instruction was a branch (in that case use the decoding to find the length).
If an unknown instruction is found, back off and don't provide its size.
It's impossible to decode an x86 instruction stream backward because x86 encoding is not symmetric (w.r.t. address growth); to see this, consider mov eax, 90909090h or similar.
So you need to disassemble each instruction as you single step through the program (a debugger needs this anyway) and record its size.
The control-transfer instructions are only a small fraction of the total number of instructions, so you could decode just those and otherwise use the EIP - EIP' trick (where EIP' is the EIP of the last instruction).
Intel processors support Last Branch Recording, but it requires OS support and you'd need to post-process the data anyway; it seems too burdensome.
A similar argument can be made for the Intel Processor Trace technology.
I can't think of any event for the performance counters (granted that you can use them) that would give the number of bytes of an instruction.
Actually, in the back-end the concept of an "instruction" has been reduced to a sequence of uOPs (probably with a bit to say that a uOP is the last one of an instruction), and the front-end is mostly decoupled from the architectural value of eip (it works almost always with a speculative value of eip), so it may be several instructions ahead of the back-end.
I believe each uOP probably has a field to record how to update eip at retirement, but not the size of the instruction in bytes.
Similarly, in the front-end the instruction length in bytes is recorded only in the pre-decode stage; after that I think it's discarded (I can't think of any use for it).
Instructions in the L1 instruction cache are not yet decoded, so even if there was a way to inspect their content and metadata there would be nothing there.
The usual way this is done is by making a trace: single step through the program, disassemble the instruction at eip (see below), record its size, resume the program, and repeat until a stop condition.
This gives you a list of addresses and instruction sizes.
If you find an instruction you can't decode, either don't record a size for it or try to estimate it with some heuristic (its length must be less than 16 bytes, and you could in theory cross-check the data with the count from a PMC like BR_INST_RETIRED.ALL_BRANCHES).
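Here is a minimal sketch, for illustration only, of such a single-step trace loop using the Windows debug API in C. It assumes a 32-bit target (so CONTEXT has an Eip field), skips the setup that arms the trap flag after the initial CREATE_PROCESS_DEBUG_EVENT, and leaves out the branch decoding.

    #include <windows.h>
    #include <stdio.h>

    /* Minimal single-step trace loop. The case where the previous
     * instruction was a branch must be detected by decoding it (omitted
     * here); EIP - last EIP is only valid for the non-branch case. */
    void trace_loop(void)
    {
        DEBUG_EVENT ev;
        DWORD lastEip = 0;
        int haveLast = 0;

        while (WaitForDebugEvent(&ev, INFINITE)) {
            if (ev.dwDebugEventCode == EXCEPTION_DEBUG_EVENT &&
                ev.u.Exception.ExceptionRecord.ExceptionCode == EXCEPTION_SINGLE_STEP) {

                HANDLE hThread = OpenThread(THREAD_GET_CONTEXT | THREAD_SET_CONTEXT,
                                            FALSE, ev.dwThreadId);
                CONTEXT ctx;
                ZeroMemory(&ctx, sizeof ctx);
                ctx.ContextFlags = CONTEXT_CONTROL;     /* enough for Eip/EFlags */
                GetThreadContext(hThread, &ctx);

                if (haveLast)
                    printf("previous instruction: %lu bytes\n",
                           (unsigned long)(ctx.Eip - lastEip));

                lastEip = ctx.Eip;
                haveLast = 1;

                ctx.EFlags |= 0x100;                    /* set TF: step again  */
                SetThreadContext(hThread, &ctx);
                CloseHandle(hThread);
            }
            ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
        }
    }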
It's possible to detect the size of an instruction at runtime but that's totally not feasible in this context.
On x64, if you first write the full contents of a cache line at a previously uncached address within a short period of time, and then soon after read from that address again, can the CPU avoid having to read the old contents of that address from memory?
Effectively it shouldn't matter what the contents of that memory were previously, because a full cache line's worth of data was completely overwritten. I can understand that a partial cache line write to an uncached address, followed by a read, would incur the overhead of having to synchronise with main memory etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time while an RFO request pulls that line into the cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
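For illustration, here is a tiny C sketch of the two overlap cases just described (this is not a benchmark, and a compiler may fold these accesses away at higher optimization levels; memcpy is used just to express the byte-level stores and loads).

    #include <stdint.h>
    #include <string.h>

    /* buf must point to at least 16 bytes. */
    void forwarding_cases(char *buf)
    {
        uint32_t v = 0x11223344, a, b;

        memcpy(buf + 10, &v, 4);      /* 4-byte store to "address 10"        */

        memcpy(&a, buf + 10, 4);      /* same address and size: the load can
                                         be forwarded from the store buffer  */
        memcpy(&b, buf + 9, 4);       /* partial overlap: the byte at 9 is
                                         not in the store, so the load must
                                         wait for the store to complete      */
        (void)a; (void)b;
    }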
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well covered, with pretty pictures, on stuffedcow, and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.
The last case, where the read is bigger than the write, is definitely a case where store forwarding stalls. The quote of 11 cycles probably applies to the case where all of the involved bytes are in L1 - but in the case where some bytes aren't cached at all (your scenario), it could of course take on the order of a miss all the way to DRAM, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.
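If you want to experiment with that, here is a small sketch using the SSE2 non-temporal store intrinsics; dst is assumed to be 64-byte aligned so the four 16-byte stores cover exactly one cache line. (Whether a later load gets forwarded is, as noted above, doubtful; this only shows how a full line can be written while skipping the RFO.)

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Write one full 64-byte cache line with non-temporal stores.
     * dst must be 64-byte aligned (16-byte alignment is required by
     * _mm_stream_si128; 64 makes the four stores cover one whole line). */
    void fill_line_nt(void *dst)
    {
        __m128i zero = _mm_setzero_si128();
        _mm_stream_si128((__m128i *)dst + 0, zero);
        _mm_stream_si128((__m128i *)dst + 1, zero);
        _mm_stream_si128((__m128i *)dst + 2, zero);
        _mm_stream_si128((__m128i *)dst + 3, zero);
        _mm_sfence();        /* make the NT stores visible before later accesses */
    }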
I know that some processors fail on misaligned data accesses, and others, like the oh-so-common x86, are just slower with them.
My question is why? Why is it harder for an x86 processor to get the data from the pointer 0x12345679 than it is from the pointer 0x12345678? Just to be clear, I'm aware that page faults may happen if the data is in multiple pages, and I understand that more data may need to be fetched from memory (one part for the start of the value and one for the end), but that isn't always true and this isn't what my question is about. I'm asking, why is it always slower?
Suppose the memory starts at 0x10000000. Why is it harder for the processor to get a 2-byte short from 0x10000001 than it is from 0x10000002? Why is it harder to get a 4-byte int from 0x10000001 than it is from 0x10000000? And so forth.
Because the data bus is wider than eight bits.
Let's assume that the data bus is 32 bits wide. To get 16 bits from address 0x10000001, it has to fetch the four bytes that start at 0x10000000 and shift the value to get the two bytes in the middle.
To get 16 bits from the address 0x10000003, it has to get the words that start at 0x10000000 and 0x10000004, and use one byte from each value.
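As a rough illustration of that stitching, here is a C sketch of what the hardware effectively does for an unaligned 32-bit load on a little-endian machine. (Illustration only: real code should simply use memcpy or a plain load and let the compiler and hardware handle alignment; the aliasing casts here would be questionable in production C.)

    #include <stdint.h>

    /* Load a 32-bit value from a possibly unaligned address by fetching
     * the aligned word(s) it straddles and stitching the bytes together
     * (little-endian byte order assumed). */
    uint32_t load32_unaligned(const uint8_t *p)
    {
        uintptr_t addr   = (uintptr_t)p;
        uintptr_t base   = addr & ~(uintptr_t)3;   /* aligned word holding the first byte */
        unsigned  offset = (unsigned)(addr & 3);   /* 0..3 bytes into that word           */
        const uint32_t *w = (const uint32_t *)base;

        if (offset == 0)
            return w[0];                           /* aligned: a single fetch             */

        uint32_t lo = w[0];                        /* first aligned word                  */
        uint32_t hi = w[1];                        /* next aligned word                   */
        return (lo >> (8 * offset)) | (hi << (8 * (4 - offset)));
    }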
The processor can only access memory in an aligned fashion. This is a consequence of how the interconnect between the processor and memory functions.
When a processor supports unaligned reads, what's really happening is the processor issuing two separate reads (or one read of larger size) and stitching the parts together, which is why it's slower than an aligned read.
One example: if the data bus is 32 bits wide and a 32-bit value is not on a 32-bit boundary, the bytes will have to be fetched in more than one operation and moved around to load the value properly into a processor register.