How are instructions differentiated from data? - cpu

While reading an ARM core document, I ran into this question: how does the CPU tell whether the data it reads from the data bus is an instruction to execute or a data item to operate on?
Refer to the excerpt from the document -
"Data enters the processor core
through the Data bus. The data may be
an instruction to execute or a data
item."
Thanks in advance for enlightening me!
/MS

Simple answer - it doesn't. Machine code instructions are just binary numbers, as are data. More complicated answer - your processor may (or may not) provide segmentation of memory, meaning that attempting to execute what has been specified as data causes a trap of some sort. This is one of the meanings of a "segmentation fault" - the processor tried to execute something that was not labelled as being executable code.

Each opcode consists of an instruction of N bytes, which then expects the subsequent M bytes to be data (memory pointers, etc.). So the CPU uses each opcode to determine how many of the following bytes are data.
Certainly for old processors (e.g. old 8-bit types such as 6502 and the like) there was no differentiation. You would normally point the program counter to the beginning of the program in memory and that would reference data from somewhere else in memory, but program/data were stored as simple 8-bit values. The processor itself couldn't differentiate between the two.
It was perfectly possible to point the program counter at what had been deemed data, and in fact I remember an old college tutorial where my professor did exactly that, and we had to point the mistake out to him. His response was "but that's data! It can't execute that! Can it?", at which point I populated our data with valid opcodes to prove that, indeed, it could.
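To make the anecdote concrete, here is a minimal sketch (assuming x86-64 Linux; the byte values are hand-assembled machine code chosen just for this illustration) of the same trick: the bytes sit in an ordinary array as "data" until the program counter is pointed at them.

/* The bytes B8 2A 00 00 00 C3 encode "mov eax, 42; ret" on x86-64. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char bytes[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* Copy the "data" into a page the OS will let us execute. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    memcpy(page, bytes, sizeof bytes);

    /* Point the "program counter" at the bytes: now they are instructions. */
    int (*func)(void) = (int (*)(void))page;
    printf("%d\n", func());   /* prints 42 */
    return 0;
}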

The original ARM design had a three-stage pipeline for executing instructions:
FETCH the instruction into the CPU
DECODE the instruction to configure the CPU for execution
EXECUTE the instruction.
The CPU's internal logic ensures that it knows whether it is fetching data in stage 1 (i.e. an instruction fetch), or in stage 3 (i.e. a data fetch due to a "load" instruction).
Modern ARM processors have a separate bus for fetching instructions (so the pipeline doesn't stall while fetching data), and a longer pipeline (to allow faster clock speeds), but the general idea is still the same.

Each read by the processor is known to be either a data fetch or an instruction fetch; all processors, old and new, know their instruction fetches from their data fetches. From the outside you may or may not be able to tell (usually not, except for Harvard architecture processors, which the ARM is not). I have been working with the MPCore (ARM11) lately, and there are bits on the external interface that tell you a little about what kind of read it is, mostly so you can hook up an external cache; combine that with knowledge of whether the MMU and L1 cache are on, and you can tell data from instruction, but that is the exception to the rule. From a memory bus perspective it is just data bits, and you can't tell data from instruction there, but the logic that initiated that memory cycle and is waiting for the result knew before it started the cycle what kind of fetch it was and what it was going to do with the data when it arrived.

I think it comes down to where the data is stored in the program image and to OS support for informing the CPU whether it is code or data.
All code is placed in a different segment of the image (along with static data like constant character strings) from the storage for variables. The OS (and the memory management unit) need to know this because they can swap code out of memory by simply discarding it and reloading it from the original disk file (at least that's how Windows does it).
So, I think the CPU 'knows' whether memory is data or code. No doubt the modern pipelining CPUs we have now also have instructions to read this memory differently, to help the CPU process it as fast as possible (e.g. code may not be cached, data will always be accessed randomly rather than as a stream).
It's still possible to point your program counter at data, but the OS can tell the CPU to prevent this - see the NX bit and Windows' "Data Execution Prevention" setting (in the System control panel).
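On the OS side this is just a page protection flag. A minimal sketch (Linux assumed; mprotect is the user-space knob over the NX bit here) of the mechanism:

#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char code[] = { 0xC3 };   /* x86-64 "ret" */

    /* Map a page WITHOUT execute permission: it is data only. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    memcpy(page, code, sizeof code);

    void (*func)(void) = (void (*)(void))page;
    /* func();  <- calling this now would raise SIGSEGV (NX/DEP in action) */

    /* Flip the page to executable and the same call becomes legal. */
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0)
        return 1;
    func();
    return 0;
}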

Related


What is the cache's role when writing to memory?

I have a function that does very little reading, but a lot of writing to RAM. When I run it multiple times on the same core (the main thread), it runs about 5x as fast as when I launch the function on a new thread every run (which doesn't guarantee the same core is used between runs), since I launch and join between runs.
This suggests the cache is being used heavily for the write process, but I don't understand how. I thought the cache was only useful for reads.
Modern processors usually have write-buffers. The reason is that writes are, to a first approximation, pure sinks. The processor doesn't usually have to wait for the store to reach the coherent memory hierarchy before it executes the next instruction.
(Aside: Obviously stores are not pure sinks. A later read from the written-to memory location should return the written value, so the processor must snoop the write-buffer, and either stall the read or forward the written value to it)
Obviously such buffer(s) are of finite size, so when the buffers are full the next store in the program can't be executed and stalls until a slot in the buffer is made available by an older store becoming architecturally visible.
Ordinarily, a write leaves the buffer when the value is written to cache (since a lot of writes are actually read back again quickly; think of the program stack as an example). If the write only sets part of the cacheline, the rest of the cacheline must remain unmodified, so it must be loaded from the memory hierarchy.
There are ways to avoid loading the old cacheline, like non-temporal stores, write-combining memory or cacheline-zeroing instructions.
Non-temporal stores and write-combining memory combine adjacent writes to fill a whole cacheline, sending the new cacheline to the memory hierarchy to replace the old one.
POWER has an instruction that zeroes a full cacheline (dcbz), which also removes the need to load the old value from memory.
x86 with AVX512 has cacheline-sized registers, which suggests that an aligned zmm-register store could avoid loading the old cacheline (though I do not know whether it does).
Note that many of these techniques are not consistent with the usual memory-ordering of the respective processor architectures. Using them may require additional fences/barriers in multi-threaded operation.
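A hedged sketch (x86 with SSE2 assumed) of the non-temporal-store idea mentioned above: a write-only fill where full cachelines go straight toward memory, avoiding both the read of the old cacheline and the eviction of useful cached data. The sfence at the end relates to the ordering caveat just mentioned.

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32 */
#include <stdint.h>
#include <stdlib.h>

void fill_streaming(int32_t *dst, size_t n, int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    size_t i = 0;
    /* dst is assumed 16-byte aligned; each streaming store writes 4 ints. */
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);
    for (; i < n; ++i)       /* handle any tail with ordinary stores */
        dst[i] = value;
    _mm_sfence();            /* order the streaming stores with later ones */
}

int main(void) {
    size_t n = 1 << 20;
    int32_t *buf = aligned_alloc(64, n * sizeof *buf);
    if (!buf)
        return 1;
    fill_streaming(buf, n, 42);
    free(buf);
    return 0;
}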

Are one-sided RDMA reads atomic for single cache lines?

My group (a project called Isis2) is experimenting with RDMA. We're puzzled by the lack of documentation for the atomicity guarantees of one-sided RDMA reads. I've spent the past hour and a half hunting for any kind of information at all on this to no avail. This includes close reading of the blog at rdmamojo.com, famous for having answers to every RDMA question...
In the case we are focused on, we want to have writers doing atomic writes for objects that will always fit within a single cache line. Say this happens on machine A. Then we plan to have a one-sided atomic RDMA reader on machine B, who might read chunks of memory from A, spanning many of these objects (but again, no object would ever be written non-atomically, and all will fit within some single cache line). So B reads X, Y and Z, and each of those objects lives in one cache line on A, and was written with atomic writes.
Thus the atomic writes will be local, but the RDMA reads will arrive from remote machines and are done with no local CPU involvement.
Are our one-sided reads "semantically equivalent" to atomic local reads despite being initiated on the remote machine? (I suspect so: otherwise, one-sided RDMA reads would be useless for data that is ever modified...). And where are the "rules" documented?
Ok, meanwhile I seem to have found the correct answer, and I believe that Roland's response is not quite right -- partly right but not entirely.
In http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf, which is the Intel architecture manual (I'll need to check again for AMD...), I found this: "Atomic memory operation in Intel 64 and IA-32 architecture is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations are described in Section 8.1.1 of IA-32 Intel® Architecture Software Developer's Manual, Volumes 3A."
Then in that section, which is entitled MULTIPLE-PROCESSOR MANAGEMENT, one finds a lot of information about guaranteed atomic operations (page 2210). In particular, Intel guarantees that its memory subsystems will be atomic for native types (bit, byte, integers of various sizes, float). These objects must be aligned so as to fit within a cache line (64 bytes on the current Intel platforms), not crossing a cache line boundary. But then Intel guarantees that no matter what device is using the memory bus, stores and fetches will be atomic.
For more complex objects, locking is required if you want to be sure you will get a safe execution. Further, if you are doing multicore operations you have to use the locked (atomic) variants of the Intel instructions to be sure of coherency for concurrent writes. You get this automatically for variables marked volatile in C++ or C# (Java too?).
What this adds up to is that local writes to native types can be paired with remotely initiated RDMA reads safely.
But notice that strings, byte arrays -- those would not be atomic because they could easily cross a cache line. Also, operations on complex objects with more than one data field might not be atomic -- for such things you would need a more complex approach, such as the one in the FaRM paper (Fast Remote Memory) by MSR. My own need is simpler and won't require the elaborate version numbering scheme FaRM implements...
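For what it's worth, here is a minimal sketch of the writer-side discipline described above (C11 assumed; the struct layout is an illustration, not Isis2's actual code): keep each object inside a single cacheline and publish it with one aligned native-width store, so a concurrent single-line read, local or RDMA, sees either the old value or the new one and never a torn mix.

#include <stdatomic.h>
#include <stdalign.h>
#include <stdint.h>

typedef struct {
    alignas(64) _Atomic uint64_t value;          /* fits within one cacheline */
    char pad[64 - sizeof(_Atomic uint64_t)];     /* keep neighbours off the line */
} slot_t;

void publish(slot_t *s, uint64_t v) {
    /* A single aligned 64-bit store: atomic per the Intel guarantees above. */
    atomic_store_explicit(&s->value, v, memory_order_release);
}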
The cache coherence protocol implemented in the PCIe controller should guarantee atomicity for single cache line RDMA reads. The PCIe controller has to snoop the caches of CPU cores and take ownership of the cache line (RFO) before returning data to the RDMA adapter. So it should see some snapshot of the cache line.
I don't know of any such guarantee of atomicity. Of course RDMA reads are executed by the remote adapter, and cacheline size is a CPU concept. I don't believe anything ensures that the granularity of reads used by remote RDMA adapter matches the size of writes performed by the remote CPU.
In practice it is likely to work since the remote adapter will probably issue a single PCI transaction etc. but I don't think there is anything architectural that guarantees you don't get "torn" data.

Techniques available to control data/instructions in/out of the cache?

I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-AF42A867-B796-4D29-8FED-C20193FD87E0.htm
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?
This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The series of instructions that will write data to memory, avoiding cache evictions are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly as, or possibly worse than, any other kind of memory store instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write buffer then essentially your operation becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations then write combining will apply, and make your data writes much more efficient.
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data goes into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be evicted first before any other cache evictions. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetching instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.
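As a rough illustration of the prefetching side (x86 assumed; PREFETCH_AHEAD is a made-up tuning constant, since the right distance depends on cacheline size, memory latency, and how much work each iteration does), here is a column walk over a row-major matrix, a stride pattern hardware prefetchers often handle poorly:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* in elements; an assumed, tunable value */

long column_sum(const int *matrix, size_t rows, size_t cols, size_t col) {
    long sum = 0;
    for (size_t r = 0; r < rows; ++r) {
        /* Request a future element of the column while working on this one. */
        if (r + PREFETCH_AHEAD < rows)
            _mm_prefetch((const char *)&matrix[(r + PREFETCH_AHEAD) * cols + col],
                         _MM_HINT_T0);
        sum += matrix[r * cols + col];
    }
    return sum;
}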

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a Linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.

On a multi-CPU host, it seems to me it might be implementation dependent as to what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value; it would only matter if the read was corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.

Googling for shmat stuff hasn't led me to anything that's definitive in this area.

I suspect strongly it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you explicitly arrange for it. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens correctly.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have all 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those accesses. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs, both of which are still in use and don't necessarily work the way you think.
So, actually, you can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". You should write your code as if this type of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers, especially if you are interested in portability (which you should be, even if you never port your code).
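As one concrete form of "leave it to the library", here is a hedged sketch (POSIX assumed; names and error handling are illustrative only) that guards the shmid slot with an unnamed POSIX semaphore placed inside the master segment, so nobody relies on a plain 32-bit write happening to be atomic:

#include <semaphore.h>
#include <sys/ipc.h>
#include <sys/shm.h>

struct master_seg {
    sem_t lock;        /* must live inside the shared segment itself */
    int   data_shmid;  /* the value the two processes exchange */
};

/* Run once by the process that creates the master segment. */
struct master_seg *master_create(key_t key) {
    int id = shmget(key, sizeof(struct master_seg), IPC_CREAT | 0600);
    struct master_seg *m = shmat(id, NULL, 0);
    sem_init(&m->lock, /*pshared=*/1, /*value=*/1);
    return m;
}

void master_publish(struct master_seg *m, int new_shmid) {
    sem_wait(&m->lock);
    m->data_shmid = new_shmid;
    sem_post(&m->lock);
}

int master_read(struct master_seg *m) {
    sem_wait(&m->lock);
    int id = m->data_shmid;
    sem_post(&m->lock);
    return id;
}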
The problem's actually worse than some people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be enough: memory writes are only well-ordered from the point of view of a single core, and other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location 'b' modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor), the writes are not atomic, so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits. This can complicate things immensely.
Use a memory barrier and you'll be ok.
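A minimal sketch of that advice (C11 atomics assumed; variable names follow the example above): a release store on the writer paired with an acquire load on the reader rules out exactly the reordering described, where 'b' becomes visible before 'a'.

#include <stdatomic.h>

int a;              /* ordinary data */
_Atomic int b;      /* the value published last */

void writer(void) {
    a = 1;
    atomic_store_explicit(&b, 2, memory_order_release);   /* barrier before publish */
}

int reader(void) {
    if (atomic_load_explicit(&b, memory_order_acquire) == 2)  /* barrier after read */
        return a;   /* guaranteed to observe a == 1 here */
    return -1;      /* the writer hasn't published yet */
}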
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then it's probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get corrupted (half-written) values. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you left out the memory barrier. (In practice, there were probably memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use one.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
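A minimal sketch of that suggestion (NPTL-era glibc assumed; names and error handling are illustrative): a pthreads mutex initialised as PTHREAD_PROCESS_SHARED and stored inside the shared mapping, so both processes lock the same object around the shmid slot.

#include <pthread.h>

struct shared_master {
    pthread_mutex_t lock;
    int shmid;          /* the value the two processes exchange */
};

/* Call once, in the process that sets up the shared region. */
void shared_master_init(struct shared_master *m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void shared_master_set(struct shared_master *m, int shmid) {
    pthread_mutex_lock(&m->lock);
    m->shmid = shmid;
    pthread_mutex_unlock(&m->lock);
}

int shared_master_get(struct shared_master *m) {
    pthread_mutex_lock(&m->lock);
    int id = m->shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}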
I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed. I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumers and producers usually need queues for synchronization anyway.
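A rough sketch of that FIFO idea (POSIX message queues assumed; the queue name "/master_shmids" is made up for the example): the producer announces each new shmid through the queue and the consumer blocks on it, leaving the synchronization to the kernel.

#include <mqueue.h>
#include <fcntl.h>

mqd_t open_queue(void) {
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = sizeof(int) };
    return mq_open("/master_shmids", O_RDWR | O_CREAT, 0600, &attr);
}

/* Producer: announce a freshly created segment. */
void announce_shmid(mqd_t q, int shmid) {
    mq_send(q, (const char *)&shmid, sizeof shmid, /*prio=*/0);
}

/* Consumer: block until the next shmid arrives. */
int next_shmid(mqd_t q) {
    int shmid = -1;
    mq_receive(q, (char *)&shmid, sizeof shmid, NULL);
    return shmid;
}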
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". The discussion there essentially describes a check-then-act failure: with non-locked access to shared resources, even atomic ones, the resource can be modified externally in the window between checking whether it can be modified and actually modifying it, so the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; it discusses some alternative scheduling algorithms which reduce the overall demand for exclusive access over the bus, which increases total throughput in some cases by over 60% compared with a naive scheduler (when considering the cost of an explicit lock prefix instruction or an implicit xchg/cmpxchg). The paper is not the most recent work and not much in the way of real code (dang academics), but it's worth the read and worth considering for this problem.
More recent CPU instruction sets provide alternatives to simply locking everything.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, which in a simple test case showed an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB:
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H        ; hand-encoded PAUSE instruction (bytes F3 90)
add ecx,-1       ; ecx holds the requested number of pause iterations
jne L1           ; spin until the count reaches zero
ret
end
The Art of Assembly also covers synchronization without the use of the LOCK prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then, while the writing CPU is writing, you may well find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
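For reference, a minimal sketch of the compare-and-swap idea in C11 (the function and variable names are made up for the example): the shared word is updated only if nobody else changed it since it was read, retrying otherwise.

#include <stdatomic.h>

void cas_update(_Atomic int *shared_shmid, int new_id) {
    int expected = atomic_load(shared_shmid);
    /* If another writer slipped in, 'expected' is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(shared_shmid, &expected, new_id))
        ;   /* spin until our swap wins */
}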
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with the memory bus internally.

Resources