Are all processor cores in a cache-coherent system required to see the same value of shared data at any point in time?

From what I've learnt, cache coherence is defined by the following 3 requirements:
A read R from an address X on a core C returns the value written by the most recent write W to X on C, provided no other core has written to X between W and R.
If a core C1 writes to X and a core C2 reads after a sufficient time, and there are no other writes in between, C2's read returns the value from C1's write.
Writes to the same location are serialized: any two writes to X must be seen to occur in the same order on all cores.
As far as I understand these rules, they basically require all threads to see updates made by other threads within some reasonable time and in the same order, but there seems to be no requirement about seeing the same data at any point in time. For example, say thread A wrote a value to a shared memory location X, then thread B wrote another value to X. Threads C and D reading from X must see the same order of updates: A, B. Imagine that thread C has already seen both updates A and B, while thread D has only observed A (the event B is yet to be seen). Provided that the time interval between writes to X and reads from X is small enough (less than what we consider a sufficient time), this situation doesn't violate any rules of coherence, does it?
On the other hand, coherence protocols, e.g. MSI, use write invalidation to guarantee that all cores have an up-to-date value of a shared variable. Wikipedia says: "The intention is that two clients must never see different values for the same shared data". If what I wrote about the coherence rules is true, I don't understand where this point comes from. I mean, I realize it's useful, but I don't see where it is defined.
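For what it's worth, those three rules are essentially the guarantee you get from relaxed atomics in C11, so the A/B/C/D scenario above can be sketched directly. The thread and variable names below are mine, not from the question; this is only an illustration of what coherence does and does not promise:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static _Atomic int x = 0;                 /* the shared location "X" */

static void *writer_a(void *arg) {        /* thread A */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    return NULL;
}
static void *writer_b(void *arg) {        /* thread B */
    atomic_store_explicit(&x, 2, memory_order_relaxed);
    return NULL;
}
static void *reader(void *name) {         /* threads C and D */
    int first  = atomic_load_explicit(&x, memory_order_relaxed);
    int second = atomic_load_explicit(&x, memory_order_relaxed);
    /* Coherence allows C to have already seen both writes while D has seen
     * only the first at the same instant.  What it forbids is any single
     * reader observing the writes out of order: if the writes to x were
     * serialized as 1 then 2, no reader can load 2 and then later load 1. */
    printf("%s: %d then %d\n", (char *)name, first, second);
    return NULL;
}

int main(void) {
    pthread_t a, b, c, d;
    pthread_create(&a, NULL, writer_a, NULL);
    pthread_create(&b, NULL, writer_b, NULL);
    pthread_create(&c, NULL, reader, (void *)"C");
    pthread_create(&d, NULL, reader, (void *)"D");
    pthread_join(a, NULL); pthread_join(b, NULL);
    pthread_join(c, NULL); pthread_join(d, NULL);
    return 0;
}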

Data races with MESI optimization

I don't really understand what exactly is causing the problem in this example.
Here is a snippet from my book:
Based on the discussion of the MESI protocol in the preceding section, it would seem that the problem of data sharing between L1 caches in a multicore machine has been solved in a watertight way. How, then, can the memory ordering bugs we've hinted at actually happen?
There's a one-word answer to that question: Optimization. On most hardware, the MESI protocol is highly optimized to minimize latency. This means that some operations aren't actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time. As with compiler optimizations and CPU out-of-order execution optimizations, MESI optimizations are carefully crafted so as to be undetectable by a single thread. But, as you might expect, concurrent programs once again get the raw end of this deal.
For example, our producer (running on Core 1) writes 42 into g_data and then immediately writes 1 into g_ready. Under certain circumstances, optimizations in the MESI protocol can cause the new value of g_ready to become visible to other cores within the cache coherency domain before the updated value of g_data becomes visible. This can happen, for example, if Core 1 already has g_ready's cache line in its local L1 cache, but does not have g_data's line yet. This means that the consumer (on Core 2) can potentially see a value of 1 for g_ready before it sees a value of 42 in g_data, resulting in a data race bug.
Here is the code:
int32_t g_data = 0;
int32_t g_ready = 0;

void ProducerThread() // running on Core 1
{
    g_data = 42;
    // assume no instruction reordering across this line
    g_ready = 1;
}

void ConsumerThread() // running on Core 2
{
    while (!g_ready)
        PAUSE();
    // assume no instruction reordering across this line
    ASSERT(g_data == 42);
}
How can g_data be computed but not present in the cache?
This can happen, for example, if Core 1 already has g_ready's cache line in its local L1 cache, but does not have g_data's line yet.
If g_data is not in the cache, then why does the previous sentence end with "yet"? Would the CPU load the cache line containing g_data after it has been computed?
If we read this sentence:
This means that some operations aren’t actually performed immediately when messages are received over the ICB. Instead, they are deferred to save time.
Then what operation is deferred in our example with producer and consumer threads?
So basically I don't understand how, under the MESI protocol, some operations become visible to other cores in the wrong order despite being performed in the right order by a specific core.
PS: This example is from the book "Game Engine Architecture, Third Edition" by Jason Gregory; it's on page 309.
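As a side note, the deferral the book alludes to typically comes from structures such as store buffers and invalidation queues sitting alongside the MESI state machine, and the usual cure is to give the flag release/acquire semantics so the hardware (and compiler) are forced to order the two stores. A minimal C11 sketch of the same producer/consumer follows; the names are kept from the book's example, but the code itself is mine, not the book's:

#include <stdatomic.h>
#include <stdint.h>

int32_t g_data = 0;
static _Atomic int32_t g_ready = 0;

void ProducerThread(void)                 /* running on Core 1 */
{
    g_data = 42;
    /* release store: any consumer that observes g_ready == 1 is also
     * guaranteed to observe the earlier write g_data = 42 */
    atomic_store_explicit(&g_ready, 1, memory_order_release);
}

void ConsumerThread(void)                 /* running on Core 2 */
{
    /* acquire load: pairs with the release store above */
    while (!atomic_load_explicit(&g_ready, memory_order_acquire))
        ;                                 /* spin (PAUSE() in the original) */
    /* ASSERT(g_data == 42) can no longer fire */
}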

Dirty bit value after changing data to original state

If the value in some part of the cache is 4 and we change it to 5, that sets the dirty bit for that data to 1. But what if we set the value back to 4: will the dirty bit stay 1, or change back to 0?
I am interested in this because it would allow a higher-level optimization of the read-write traffic between main memory and the cache.
For a cache to work the way you describe, it would need to reserve half of its data space to store the old values.
Caches are expensive precisely because they have a high cost per bit, and consider that:
That mechanism would only detect a two-level write history: A -> B -> A, and nothing deeper (like A -> B -> C -> A).
Every write would imply copying the current values into the stored old values.
The minimum amount of taggable data in a cache is the line, and the whole line would need to return to its original value. Given that a line is on the order of 64 bytes, that is very unlikely to happen.
The hierarchical structure of the caches (L1, L2, L3, ...) is there precisely to mitigate the cost of evictions.
So the solution you propose has little benefit compared to its costs, and thus is not implemented.
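Just to make the storage cost concrete, here is a purely illustrative C sketch; the struct names and field layout are hypothetical and do not correspond to any real cache design:

#include <stdint.h>

/* Conventional write-back line: the dirty bit is set on any write and
 * cleared only when the line is written back. */
struct cache_line {
    uint64_t tag;
    uint8_t  dirty;
    uint8_t  data[64];
};

/* Hypothetical line that could notice A -> B -> A reverts: it must keep a
 * copy of the line's contents from when it was first dirtied. */
struct cache_line_with_undo {
    uint64_t tag;
    uint8_t  dirty;
    uint8_t  data[64];
    uint8_t  original[64];   /* snapshot taken on the first write */
};

/* The second layout roughly doubles the data storage per line, just to
 * catch the rare case where all 64 bytes return to their old values. */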

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that performs lots of operations over a large data set, and the operations on each data element are independent, OpenCL can be a good choice for making it faster. I have a program like the following:
while( function(b,c) != TRUE )
{
    [X,Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each element of BigData and produces two more big data sets (X, Y). function2 and function3 are then applied individually to each element of these X and Y data sets, respectively.
Since all of the functions operate on each element of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c) != TRUE )
{
    //[X,Y] = function1(BigData);
    1. Load kernel1 and BigData onto the GPU. Each thread works on one data
       element and saves its result into X and Y on the GPU.
    //M = functionA(X);
    2a. Load kernel2 onto the GPU. Each thread works on one element of X and
        saves its result into M on the GPU.
        (workItems = n1, workgroup size = y1)
    //b = function2(M);
    2b. Load kernel2 (same kernel) onto the GPU. Each thread works on one
        element of M and saves its result into B on the GPU.
        (workItems = n2, workgroup size = y2)
    3. Read the data B into the host variable b.
    //N = functionB(Y);
    4a. Load kernel3 onto the GPU. Each thread works on one element of Y and
        saves its result into N on the GPU.
        (workItems = n1, workgroup size = y1)
    //c = function3(N);
    4b. Load kernel3 (same kernel) onto the GPU. Each thread works on one
        element of N and saves its result into C on the GPU.
        (workItems = n2, workgroup size = y2)
    5. Read the data C into the host variable c.
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?), or on what hardware (compute card? laptop IGPU?). "Significant overhead" can mean a lot of different things: 5 ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?
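To illustrate the "combine the kernels and iterate inside them" suggestion above, here is a rough, hypothetical OpenCL C sketch. The kernel name, arguments, and per-element arithmetic are all made up, and it assumes your per-element updates can legally be repeated without the host-side convergence check in between:

// Hypothetical fused kernel: each work item owns one element and performs
// 'iters' passes locally before the host checks convergence again.
__kernel void fused_step(__global const float *bigData,
                         __global float *M,
                         __global float *N,
                         const int iters)
{
    size_t gid = get_global_id(0);
    float x = 0.0f, y = 0.0f;

    for (int i = 0; i < iters; ++i) {
        /* stand-ins for function1 / functionA / functionB on one element */
        x = bigData[gid] * 0.5f + x;   /* "function1" producing X */
        y = bigData[gid] - y;          /* "function1" producing Y */
        M[gid] = x * x;                /* "functionA" */
        N[gid] = y + 1.0f;             /* "functionB" */
    }
    /* b and c (the inputs to the convergence test) would still be computed
     * on the host, or with a separate reduction kernel, once per 'iters'
     * inner iterations instead of once per outer iteration. */
}

This only pays off if the convergence check really can be performed less often; otherwise the per-iteration kernel-launch and read-back overhead is what you should profile first.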

Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Whether vectors help depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them, because they will be filled by a single read (the values in a vector are stored sequentially).
On the other hand, when you need only some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU time to fetch the data).
Basically, these data reads behave like a buffer: if you have enough instances, the number of reads is the same in the optimal case, and what really counts is how well those reads are used.
The compiler often unpacks these structures, so the only speedup is that all the variables are stored together: one read fills them all, and the rest of the fetched data serves another instance.
As an example, I will use a 128-bit-wide bus and 4 floats (32 bits each).
Vector case: 4 x 32 b = 128 b, so one read fills one whole instance (1 instance per read).
For scalar data types there are N reads (N = number of variables), each read filling the same variable for as many instances as fit in one fetch:
128 b / 32 b = 4 instances per read.
So in my example, if you have 4 instances there will always be at least 4 reads no matter what, and the only thing you can do about it is cover the fetch time with some computation, if that's even possible.
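As a rough illustration (the kernel and buffer names are made up), the packed and unpacked access patterns might look like this in OpenCL C:

// Packed: one float4 load brings in all four values for this work item.
__kernel void packed(__global const float4 *in, __global float *out)
{
    size_t gid = get_global_id(0);
    float4 v = in[gid];                  /* single 128-bit read */
    out[gid] = v.x + v.y + v.z + v.w;
}

// Unpacked: four separate float loads from four separate buffers.
__kernel void unpacked(__global const float *a, __global const float *b,
                       __global const float *c, __global const float *d,
                       __global float *out)
{
    size_t gid = get_global_id(0);
    /* four reads, but each one is coalesced across neighbouring work items */
    out[gid] = a[gid] + b[gid] + c[gid] + d[gid];
}

Which layout wins then comes down to the access pattern and how well the reads are used, as described above.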

MPI: shared variable value for all processors

Here's a question about MPI. I need two processors that keep modifying one variable, and I want both processors to have access to its most up-to-date value.
from mpi4py import MPI
from time import sleep

comm = MPI.COMM_WORLD
rank = comm.rank
assert comm.size == 2

msg = 0
sec = 10
if comm.rank == 0:
    for i in range(sec):
        print msg
        sleep(1)
        msg = comm.bcast(msg, root=1)
else:
    for i in range(sec*2):
        msg += 1
        sleep(0.5)
        comm.bcast(msg, root=1)
So I'm expecting the program to print something like: 0 2 4 ...
But the program turns out to print: 0 1 2 3 4 5 6 7 8 9
I'm curious whether there's a mechanism in mpi4py such that the variable msg is shared by both processors? That is, whenever msg is modified by processor 1, the new value becomes immediately available to processor 0. In other words, I want processor 0 to access the most up-to-date value of msg instead of waiting for every change that processor 1 makes to msg.
I think you're getting confused about how distributed memory programming works. In MPI, each process (or rank) has its own memory, and therefore when it changes values via load/store operations (like what you're doing with msg += 1), it will not affect the value of the variable on another process. The only way to update remote values is by sending messages, which you are doing with the comm.bcast() call. This sends the local value of msg from rank 1 to all other ranks. Until this point, there's no way for rank 0 to know what's been happening on rank 1.
If you want to have shared values between processes, then you probably need to take a look at something else, perhaps threads. You'll lose the distributed abilities of MPI if you switch to OpenMP, but that might not be what you needed MPI for in the first place. There are ways of doing this with distributed memory models (such as PGAS languages like Unified Parallel C, Global Arrays, etc.), but you will always run into the issue of latency which means that there will be some time that the values on ranks 0 and 1 are not synchronized unless you have some sort of protection to enforce it.
As mentioned by Wesley Bland, this isn't really possible in a pure distributed memory environment, as memory isn't shared.
However, MPI has allowed something like this for some time (since 1997) in the form of MPI-2 one-sided communications; these were updated significantly in MPI-3 (2012). This approach can have real advantages, but one has to be a little careful: since memory isn't really shared, every update requires expensive communication, and it's easy to accidentally introduce significant scalability/performance bottlenecks by over-relying on shared state.
The Using MPI-2 book has an example of implementing a counter using MPI-2 one-sided communications; a simple version of that counter is described and implemented in this answer in C. In the mpi4py distribution, under 'demos', there are implementations of these same counters in the 'nxtval' demo: the simple counter as nxtval-onesided.py, and a more complicated but more scalable implementation (also described in the Using MPI-2 book) as nxtval-scalable.py. You should be able to use either of those implementations more or less as-is in the above code.
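For completeness, here is a minimal, hypothetical sketch of such a one-sided counter in C, assuming an MPI-3 implementation; it uses MPI_Fetch_and_op, so it is simpler than, and not the same as, the nxtval code referenced above:

#include <mpi.h>
#include <stdio.h>

/* One-sided shared counter sketch: rank 0 exposes an int in a window, and
 * every rank atomically reads-and-increments it, so there is a single
 * authoritative value rather than per-rank copies. */
int main(int argc, char **argv)
{
    int rank, counter = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only rank 0 contributes memory to the window. */
    MPI_Win_create(rank == 0 ? (void *)&counter : NULL,
                   (MPI_Aint)(rank == 0 ? sizeof(int) : 0),
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int one = 1, fetched;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    /* Atomically fetch the current value from rank 0 and add 1 to it. */
    MPI_Fetch_and_op(&one, &fetched, MPI_INT, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    printf("rank %d saw counter value %d\n", rank, fetched);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}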

Resources