Cache locality considerations - caching

I have been trying to get better awareness of cache locality. I produced the 2 code snippets to gain better understanding of the cache locality characteristics of both.
vector<int> v1(1000, some random numbers);
vector<int> v2(1000, some random numbers);
for(int i = 0; i < v1.size(); ++i)
{
auto min = INT_MAX;
for(int j = 0; j < v2.size(); ++j)
{
if(v2[j] < min)
{
v1[i] = j;
}
}
}
vs
vector<int> v1(1000, some random numbers);
vector<int> v2(1000, some random numbers);
for(int i = 0; i < v1.size(); ++i)
{
auto min = INT_MAX;
auto loc = -1;
for(int j = 0; j < v2.size(); ++j)
{
if(v2[j] < min)
{
loc = j;
}
}
v1[i] = loc;
}
In the first code v1 is being updated directly inside the if statement. Does this cause cache locality issues because during the update it'll replace the current cache line with some contiguous segment of data from v1[i] to v1[cache_line_size/double_size]? If this analysis is correct, then this would seem to slow down the loop over j, since for each iteration of the j loop, it'll likely have cache misses. It seems the second implementation alleviates this issue by using a local variable and not updating v1[i] until after the j loop is complete?
In practice, I think the compiler might optimize the cache locality issue away? For discussion, how about we assume no compiler optimizations.

Related

2-D cache blocking Optimization

I would like to know if there is a way to optimize the speed of time to implementing this code in C. This part is for initializing the matrix in rowwise.
For other parts of my whole code are basically calculate current time and Main function, so I guess this initialization is the most time-cost part
Some hint for me is we can use cache blocking. BTW this code is also for imitate the process which CPU scratch data from cache. I had been figuring it all day but had limited idea.
Thanks!!
void InitializeMatrixRowwise() {
int i, j;
double x;
x = 0.0;
for (i = 0; i < DIMENSION; i++) {
for (j = 0; j < DIMENSION; j++) {
if (i >= j) {
Matrix[i][j] = x;
x += 1.0;
} else
Matrix[i][j] = 1.0;
}
}
}

Google Foobar, maximum unique visits under a resource limit, negative weights in graph

I'm having trouble figuring out the type of problem this is. I'm still a student and haven't taken a graph theory/linear optimization class yet.
The only thing I know for sure is to check for negative cycles, as this means you can rack the resource limit up to infinity, allowing for you to pick up each rabbit. I don't know the "reason" to pick the next path. I also don't know when to terminate, as you could keep using all of the edges and make the resource limit drop below 0 forever, but never escape.
I'm not really looking for code (as this is a coding challenge), only the type of problem this is (Ex: Max Flow, Longest Path, Shortest Path, etc.) If you an algorithm that fits this already that would be extra awesome. Thanks.
The time it takes to move from your starting point to all of the bunnies and to the bulkhead will be given to you in a square matrix of integers. Each row will tell you the time it takes to get to the start, first bunny, second bunny, ..., last bunny, and the bulkhead in that order. The order of the rows follows the same pattern (start, each bunny, bulkhead). The bunnies can jump into your arms, so picking them up is instantaneous, and arriving at the bulkhead at the same time as it seals still allows for a successful, if dramatic, escape. (Don't worry, any bunnies you don't pick up will be able to escape with you since they no longer have to carry the ones you did pick up.) You can revisit different spots if you wish, and moving to the bulkhead doesn't mean you have to immediately leave - you can move to and from the bulkhead to pick up additional bunnies if time permits.
In addition to spending time traveling between bunnies, some paths interact with the space station's security checkpoints and add time back to the clock. Adding time to the clock will delay the closing of the bulkhead doors, and if the time goes back up to 0 or a positive number after the doors have already closed, it triggers the bulkhead to reopen. Therefore, it might be possible to walk in a circle and keep gaining time: that is, each time a path is traversed, the same amount of time is used or added.
Write a function of the form answer(times, time_limit) to calculate the most bunnies you can pick up and which bunnies they are, while still escaping through the bulkhead before the doors close for good. If there are multiple sets of bunnies of the same size, return the set of bunnies with the lowest prisoner IDs (as indexes) in sorted order. The bunnies are represented as a sorted list by prisoner ID, with the first bunny being 0. There are at most 5 bunnies, and time_limit is a non-negative integer that is at most 999.
It's a planning problem, basically. The generic approach to planning is to identify the possible states of the world, the initial state, transitions between states, and the final states. Then search the graph that this data imply, most simply using breadth-first search.
For this problem, the relevant state is (1) how much time is left (2) which rabbits we've picked up (3) where we are right now. This means 1,000 clock settings (I'll talk about added time in a minute) times 2^5 = 32 subsets of bunnies times 7 positions = 224,000 possible states, which is a lot for a human but not a computer.
We can deal with added time by swiping a trick from Johnson's algorithm. As Tymur suggests in a comment, run Bellman--Ford and either find a negative cycle (in which case all rabbits can be saved by running around the negative cycle enough times first) or potentials that, when applied, make all times nonnegative. Don't forget to adjust the starting time by the difference in potential between the starting position and the bulkhead.
There you go. I started Google Foobar yesterday. I'll be starting Level 5 shortly. This was my 2nd problem here at level 4. The solution is fast enough as I tried memoizing the states without using the utils class. Anyway, loved the experience. This was by far the best problem solved by me since I got to use Floyd-Warshall(to find the negative cycle if it exists), Bellman-Ford(as a utility function to the weight readjustment step used popularly in algorithms like Johnson's and Suurballe's), Johnson(weight readjustment!), DFS(for recursing over steps) and even memoization using a self-designed hashing function :)
Happy Coding!!
public class Solution
{
public static final int INF = 100000000;
public static final int MEMO_SIZE = 10000;
public static int[] lookup;
public static int[] lookup_for_bunnies;
public static int getHashValue(int[] state, int loc)
{
int hashval = 0;
for(int i = 0; i < state.length; i++)
hashval += state[i] * (1 << i);
hashval += (1 << loc) * 100;
return hashval % MEMO_SIZE;
}
public static boolean findNegativeCycle(int[][] times)
{
int i, j, k;
int checkSum = 0;
int V = times.length;
int[][] graph = new int[V][V];
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
{
graph[i][j] = times[i][j];
checkSum += times[i][j];
}
if(checkSum == 0)
return true;
for(k = 0; k < V; k++)
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
if(graph[i][j] > graph[i][k] + graph[k][j])
graph[i][j] = graph[i][k] + graph[k][j];
for(i = 0; i < V; i++)
if(graph[i][i] < 0)
return true;
return false;
}
public static void dfs(int[][] times, int[] state, int loc, int tm, int[] res)
{
int V = times.length;
if(loc == V - 1)
{
int rescued = countArr(state);
int maxRescued = countArr(res);
if(maxRescued < rescued)
for(int i = 0; i < V; i++)
res[i] = state[i];
if(rescued == V - 2)
return;
}
else if(loc > 0)
state[loc] = 1;
int hashval = getHashValue(state, loc);
if(tm < lookup[hashval])
return;
else if(tm == lookup[hashval] && countArr(state) <= lookup_for_bunnies[loc])
return;
else
{
lookup_for_bunnies[loc] = countArr(state);
lookup[hashval] = tm;
for(int i = 0; i < V; i++)
{
if(i != loc && (tm - times[loc][i]) >= 0)
{
boolean stateCache = state[i] == 1;
dfs(times, state, i, tm - times[loc][i], res);
if(stateCache)
state[i] = 1;
else
state[i] = 0;
}
}
}
}
public static int countArr(int[] arr)
{
int counter = 0;
for(int i = 0; i < arr.length; i++)
if(arr[i] == 1)
counter++;
return counter;
}
public static int bellmanFord(int[][] adj, int times_limit)
{
int V = adj.length;
int i, j, k;
int[][] graph = new int[V + 1][V + 1];
for(i = 1; i <= V; i++)
graph[i][0] = INF;
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
graph[i + 1][j + 1] = adj[i][j];
int[] distance = new int[V + 1] ;
for(i = 1; i <= V; i++)
distance[i] = INF;
for(i = 1; i <= V; i++)
for(j = 0; j <= V; j++)
{
int minDist = INF;
for(k = 0; k <= V; k++)
if(graph[k][j] != INF)
minDist = Math.min(minDist, distance[k] + graph[k][j]);
distance[j] = Math.min(distance[j], minDist);
}
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
adj[i][j] += distance[i + 1] - distance[j + 1];
return times_limit + distance[1] - distance[V];
}
public static int[] solution(int[][] times, int times_limit)
{
int V = times.length;
if(V == 2)
return new int[]{};
if(findNegativeCycle(times))
{
int ans[] = new int[times.length - 2];
for(int i = 0; i < ans.length; i++)
ans[i] = i;
return ans;
}
lookup = new int[MEMO_SIZE];
lookup_for_bunnies = new int[V];
for(int i = 0; i < V; i++)
lookup_for_bunnies[i] = -1;
times_limit = bellmanFord(times, times_limit);
int initial[] = new int[V];
int res[] = new int[V];
dfs(times, initial, 0, times_limit, res);
int len = countArr(res);
int ans[] = new int[len];
int counter = 0;
for(int i = 0; i < res.length; i++)
if(res[i] == 1)
{
ans[counter++] = i - 1;
if(counter == len)
break;
}
return ans;
}
}

shell sort in openmp

Is anyone familiar with openmp, I don't get a sorted list. what am I doing wrong. I am using critical at the end so only one thread can access that section when it's been sorted. I guess my private values are not correct. Should they even be there or am I better off with just #pragma omp for.
void shellsort(int a[])
{
int i, j, k, m, temp;
omp_set_num_threads(10);
for(m = 2; m > 0; m = m/2)
{
#pragma omp parallel for private (j, m)
for(j = m; j < 100; j++)
{
#pragma omp critical
for(i = j-m; i >= 0; i = i-m)
{
if(a[i+m] >= a[i])
break;
else
{
temp = a[i];
a[i] = a[i+m];
a[i+m] = temp;
}
}
}
}
}
So there's a number of issues here.
So first, as has been pointed out, i and j (and temp) need to be private; m and a need to be shared. A useful thing to do with openmp is to use default(none), that way you are forced to think through what each variable you use in the parallel section does, and what it needs to be. So this
#pragma omp parallel for private (i,j,temp) shared(a,m) default(none)
is a good start. Making m private in particular is a bit of a disaster, because it means that m is undefined inside the parallel region. The loop, by the way, should start with m = n/2, not m=2.
In addition, you don't need the critical region -- or you shouldn't, for a shell sort. The issue, we'll see in a second, is not so much multiple threads working on the same elements. So if you get rid of those things, you end up with something that almost works, but not always. And that brings us to the more fundamental problem.
The way a shell sort works is, basically, you break the array up into many (here, m) subarrays, and insertion-sort them (very fast for small arrays), and then reassemble; then continue by breaking them up into fewer and fewer subarrays and insertion sort (very fast, because they're partly sorted). Sorting those many subarrays is somethign that can be done in parallel. (In practice, memory contention will be a problem with this simple approach, but still).
Now, the code you've got does that in serial, but it can't be counted on to work if you just wrap the j loop in an omp parallel for. The reason is that each iteration through the j loop does one step of one of the insertion sorts. The j+m'th loop iteration does the next step. But there's no guarantee that they're done by the same thread, or in order! If another thread has already done the j+m'th iteration before the first does the j'th, then the insertion sort is messed up and the sort fails.
So the way to make this work is to rewrite the shell sort to make the parallelism more explicit - to not break up the insertion sort into a bunch of serial steps.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
void insertionsort(int a[], int n, int stride) {
for (int j=stride; j<n; j+=stride) {
int key = a[j];
int i = j - stride;
while (i >= 0 && a[i] > key) {
a[i+stride] = a[i];
i-=stride;
}
a[i+stride] = key;
}
}
void shellsort(int a[], int n)
{
int i, m;
for(m = n/2; m > 0; m /= 2)
{
#pragma omp parallel for shared(a,m,n) private (i) default(none)
for(i = 0; i < m; i++)
insertionsort(&(a[i]), n-i, m);
}
}
void printlist(char *s, int a[], int n) {
printf("%s\n",s);
for (int i=0; i<n; i++) {
printf("%d ", a[i]);
}
printf("\n");
}
int checklist(int a[], int n) {
int result = 0;
for (int i=0; i<n; i++) {
if (a[i] != i) {
result++;
}
}
return result;
}
void seedprng() {
struct timeval t;
/* seed prng */
gettimeofday(&t, NULL);
srand((unsigned int)(1000000*(t.tv_sec)+t.tv_usec));
}
int main(int argc, char **argv) {
const int n=100;
int *data;
int missorted;
data = (int *)malloc(n*sizeof(int));
for (int i=0; i<n; i++)
data[i] = i;
seedprng();
/* shuffle */
for (int i=0; i<n; i++) {
int i1 = rand() % n;
int i2 = rand() % n;
int tmp = data[i1];
data[i1] = data[i2];
data[i2] = tmp;
}
printlist("Unsorted List:",data,n);
shellsort2(data,n);
printlist("Sorted List:",data,n);
missorted = checklist(data,n);
if (missorted != 0) printf("%d missorted nubmers\n",missorted);
return 0;
}
Variables "j" and "i" need to be declared private on the parallel region. As it is now, I am surprised anything is happening, because "m" can not be private. The critical region is allowing it to work for the "i" loop, but the critical region should be able to be reduced - though I haven't done a shell sort in a while.

OpenMP C parallelisation algorithm

in the book "Using OpenMP" is an example for bad memory access in C and I think this is the main problem in my attempt to parallelism the gaussian algorithm.
The example looks something like this:
k= 0 ;
for( int j=0; j<n ; j++)
for(int i = 0; i<n; i++)
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
So, I do understand why this causes a bad memory access. In C a 2d array is stored by rows and here in every i step a new row will be copied from memory to cache.
I am trying to find a solution for this, but im not getting a good speed up. The effects of my attempts are minor.
Can someone give me a hint what I can do?
The easiest way would be to swap the for loops, but I want to do it columnwise.
The second attempt:
for( int j=0; j<n-1 ; j+=2)
for(int i = 0; i<n; i++)
{
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
a[i][j+1] = a[i][j+1] - a[i][k]*a[k][j+1] ;
}
didn't make a difference at all.
The third attempt:
for( int j=0; j<n ; j++)
{
d= a[k][j] ;
for(int i = 0; i<n; i++)
{
e = a[i][k] ;
a[i][j] = a[i][j] - e*d ;
}
}
Thx alot
Greets Stepp
use flat array instead, eg:
#define A(i,j) A[i+j*ldA]
for( int j=0; j<n ; j++)
{
d= A(k,j) ;
...
}
Your loop order will cause a cache miss on every iteration, as you point out. So just swap the order of the loop statements:
for (int i = 0; i < n; i++) // now "i" is first
for (int j = 0; j < n; j++)
a[i][j] = a[i][j] - a[i][k]*a[k][j];
This will fix the row in a and vary just the columns, which means your memory accesses will be contiguous.
This memory access problem is just related to CACHE usage not to Openmp.
To make a good use of cache in general you should access contiguous memory locations. Remember also that if two or more threads are accessing the same memory area then you can have a "false shearing" problem forcing cache to be reloaded unnecessarily.
See for example:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/

InsertionSort vs. InsertionSort vs. BinaryInsertionSort

I have a couple of questions concerning different implementations of insertion sort.
Implementation 1:
public static void insertionSort(int[] a) {
for (int i = 1; i < a.length; ++i) {
int key = a[i];
int j = i - 1;
while (j >= 0 && a[j] > key) {
a[j + 1] = a[j];
--j;
}
a[j + 1] = key;
}
}
Implementation 2:
public static void insertionSort(int[] a) {
for (int i = 1; i < a.length; ++i) {
for (int j = i; j > 0 && a[j - 1] > a[j]; --j) {
swap(a, j, j - 1);
}
}
}
private static void swap(int[] a, int i, int j) {
int tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
Here's my first question: One should think that the first version should be a little faster that the second version (because of lesser assignments) but it isn't (or at least the difference it's negligible). But why?
Second, I was wondering that Java's Arrays.sort() method also uses the second approach (maybe because of code reuse because the swap method is used in different places, maybe because it's easier to understand).
Implementation 3 (binaryInsertionSort):
public static void binaryInsertionSort(int[] a) {
for (int i = 1; i < a.length; ++i) {
int pos = Arrays.binarySearch(a, 0, i, a[i]);
int insertionPoint = (pos >= 0) ? pos : -pos - 1;
if (insertionPoint < i) {
int key = a[i];
// for (int j = i; i > insertionPoint; --i) {
// a[j] = a[j - 1];
// }
System.arraycopy(a, insertionPoint, a, insertionPoint + 1, i - insertionPoint);
a[insertionPoint] = key;
}
}
}
Is the binary insertion sort of any practical use, or is it more of a theoretical thing? On small arrays, the other approaches are much faster, and on bigger arrays mergesort/quicksort has a much better performance.
delete false claim
The number of comparisons in the first two is 1/2*n(n-1), excluding those for the outer loops.
None of these programs make much sense for real work as they stand, because they don't make use of the information at their disposal. For instance, it is easy to add a check to the inner loop to see if any swaps have been made: if not then the array is sorted, and you can finish, perhaps saving most of the work. In practice, these kinds of consideration can dominate the average case.
Postscript
Missed the question about Java: I understand that Java's sort is a pretty complex algorithm, which uses a lot of special cases, such as specialised sorting cases for small arrays, and using quicksort to do its heavy lifting.

Resources