2-D cache blocking Optimization


Is there a way to speed up this C code? This part initializes the matrix row-wise.
The rest of my program basically just measures the current time and runs the main function, so I guess this initialization is the most time-consuming part.
One hint I was given is to use cache blocking. By the way, this code is also meant to imitate how the CPU fetches data from the cache. I have been working on it all day but have few ideas.
Thanks!!
void InitializeMatrixRowwise() {
    int i, j;
    double x;
    x = 0.0;
    for (i = 0; i < DIMENSION; i++) {
        for (j = 0; j < DIMENSION; j++) {
            if (i >= j) {
                Matrix[i][j] = x;
                x += 1.0;
            } else {
                Matrix[i][j] = 1.0;
            }
        }
    }
}
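For reference, here is a minimal sketch of what a cache-blocked (tiled) version of this initialization could look like. It is not a definitive answer: the function names, the flat `double *` layout, and the `block` parameter are assumptions made for illustration. Because the running counter `x` ties every element to the ones before it, the sketch replaces it with the closed form `i*(i+1)/2 + j` for the lower triangle, which lets tiles be filled in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: same logic as the row-wise loop in the question,
 * but on a flat n*n row-major array (an assumed layout). */
void init_rowwise(double *m, int n)
{
    assert(m != NULL && n > 0);
    double x = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (i >= j) {
                m[(size_t)i * n + j] = x;
                x += 1.0;
            } else {
                m[(size_t)i * n + j] = 1.0;
            }
        }
    }
}

/* Tiled version: visits the matrix in block x block tiles.
 * Row i of the lower triangle starts at counter value i*(i+1)/2,
 * so element (i, j) with i >= j holds i*(i+1)/2 + j. */
void init_blocked(double *m, int n, int block)
{
    assert(m != NULL && n > 0 && block > 0);
    for (int ii = 0; ii < n; ii += block) {
        for (int jj = 0; jj < n; jj += block) {
            int i_end = (ii + block < n) ? ii + block : n;
            int j_end = (jj + block < n) ? jj + block : n;
            for (int i = ii; i < i_end; i++) {
                double row_start = (double)i * (i + 1) / 2.0;
                for (int j = jj; j < j_end; j++) {
                    m[(size_t)i * n + j] = (i >= j) ? row_start + j : 1.0;
                }
            }
        }
    }
}
```

Note that the plain row-wise loop already writes memory sequentially in a row-major array, so tiling alone may give little or no gain here. Blocking tends to pay off when the access pattern strides across rows, for example in a column-wise pass or a transpose. Measuring both versions is the only reliable way to tell.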

Related

Cache locality considerations

I have been trying to get a better understanding of cache locality. I wrote the following two code snippets to compare the cache locality characteristics of each.
vector<int> v1(1000); // filled with random numbers
vector<int> v2(1000); // filled with random numbers
for(int i = 0; i < v1.size(); ++i)
{
    auto min = INT_MAX;
    for(int j = 0; j < v2.size(); ++j)
    {
        if(v2[j] < min)
        {
            min = v2[j]; // track the smallest value seen so far
            v1[i] = j;
        }
    }
}
vs
vector<int> v1(1000); // filled with random numbers
vector<int> v2(1000); // filled with random numbers
for(int i = 0; i < v1.size(); ++i)
{
    auto min = INT_MAX;
    auto loc = -1;
    for(int j = 0; j < v2.size(); ++j)
    {
        if(v2[j] < min)
        {
            min = v2[j]; // track the smallest value seen so far
            loc = j;
        }
    }
    v1[i] = loc;
}
In the first snippet, v1 is updated directly inside the if statement. Does this cause cache locality issues, because the update replaces the current cache line with a contiguous segment of v1 (roughly v1[i] to v1[i + cache_line_size/sizeof(int)])? If this analysis is correct, it would seem to slow down the loop over j, since each iteration of the j loop would likely incur cache misses. The second implementation seems to alleviate this by using a local variable and not updating v1[i] until the j loop is complete. Is that right?
In practice, I think the compiler might optimize the cache locality issue away. For the sake of discussion, let's assume no compiler optimizations.
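To make the comparison concrete, here is a minimal C sketch of the two variants. The function names and signatures are hypothetical, and the sketch assumes the intent is to record the index of the smallest element, so `min` is updated on each hit. In the first variant every store goes to the same address `out[i]` for the whole inner loop, and a cache normally holds many lines at once. The line holding `out[i]` can therefore stay resident alongside the lines streaming in from `v2`, and it does not have to evict them.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Variant 1: store to out[i] inside the inner loop on every new minimum.
 * If no element is below INT_MAX, out[i] is left unchanged. */
void argmin_store_in_loop(int *out, size_t n_out, const int *v2, size_t n2)
{
    assert(out != NULL || n_out == 0);
    assert(v2 != NULL || n2 == 0);
    for (size_t i = 0; i < n_out; i++) {
        int min = INT_MAX;
        for (size_t j = 0; j < n2; j++) {
            if (v2[j] < min) {
                min = v2[j];
                out[i] = (int)j;
            }
        }
    }
}

/* Variant 2: keep the index in a local and store once after the loop.
 * If no element is below INT_MAX, out[i] becomes -1. */
void argmin_store_after_loop(int *out, size_t n_out, const int *v2, size_t n2)
{
    assert(out != NULL || n_out == 0);
    assert(v2 != NULL || n2 == 0);
    for (size_t i = 0; i < n_out; i++) {
        int min = INT_MAX;
        int loc = -1;
        for (size_t j = 0; j < n2; j++) {
            if (v2[j] < min) {
                min = v2[j];
                loc = (int)j;
            }
        }
        out[i] = loc;
    }
}
```

Both functions produce the same indices whenever `v2` contains a value below `INT_MAX`. Whether the extra stores in the first variant are measurable depends on the compiler and hardware, so timing both is the only way to know.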

performance of dart code in browser vs. VM

I was surprised by the performance of my Dart code in the browser vs. the Dart VM. Here is a simple example that reproduces the issue.
test('speed test', () {
var n = 10000;
var rand = Random(0);
var x = List.generate(n, (i) => rand.nextDouble());
var res = <num>[];
var sw = Stopwatch()..start();
for (int i=0; i<1000; i++) {
for (int j=0; j<n; j++) {
x[j] += i;
}
res.add(x.reduce((a, b) => a + b));
}
sw.stop();
print('Milliseconds: ${sw.elapsedMilliseconds}');
});
If I run this code with dart, I get somewhere around 140 milliseconds. If I run the same code as a browser test with pub run test -p "chrome" ... I get times around 8000 milliseconds.
I am willing to wait for a 0.1 s calculation, but waiting 8 s for something in the browser makes it basically unusable. In release mode, the performance in the browser improves, but it's still 10x slower.
Am I missing something? Do I have to avoid any calculations in the browser?
Thanks,
Tony

It's interesting how slow this is.
The corresponding JavaScript code:
(function() {
"use strict";
var n = 10000;
var x = [];
var res = [];
for (var i = 0; i < n; i++) x.push(Math.random());
var t0 = Date.now();
for (var i = 0; i < 1000; i++) {
for (var j = 0; j < n; j++) {
x[j] += i;
}
res.push(x.reduce((a, b) => a + b));
}
var t1 = Date.now();
console.log("Milliseconds: " + (t1 - t0));
}());
runs in as little as ~20 milliseconds.
So, it looks like Dart is somehow triggering "slow mode" for its generated JavaScript.
If you look at the generated code, it contains:
for (i = 0; i < 1000; ++i) {
for (j = 0; j < 10000; ++j) {
if (j >= x.length)
return H.ioore(x, j);
t1 = x[j];
if (typeof t1 !== "number")
return t1.$add();
C.JSArray_methods.$indexSet(x, j, t1 + i);
}
C.JSArray_methods.add$1(res, C.JSArray_methods.reduce$1(x, new A.main_closure0()));
}
You can try to tweak this code, but the big cost comes from C.JSArray_methods.$indexSet(x, j, t1 + i);. If you change that to x[j] = t1 + i;, the time drops to a few hundred milliseconds. So, this is the problem with the current code.
(You can improve performance a little, ~20%, by making x a List<num> instead of a List<double>. I have no idea why that makes a difference, the generated code is almost the same, the add closure uses checkDouble to check the type instead of checkNum, but they have exactly the same body).
You don't have to avoid any computation in the browser. You may have to optimize a little for slow cases like this (or report them to the compiler developers, because this probably can be recognized and optimized, it just fails to be so for now). For example, you can change your list x of doubles to a Float64List from dart:typed_data:
var x = Float64List.fromList([for (var i = 0; i < n; i++) rand.nextDouble()]);
Then speed increases quite a lot.

The Dart tracking issue for this is https://github.com/dart-lang/sdk/issues/38705.
The performance of this kind of code has recently improved considerably and is much closer to the Dart VM.

Google Foobar, maximum unique visits under a resource limit, negative weights in graph

I'm having trouble figuring out the type of problem this is. I'm still a student and haven't taken a graph theory/linear optimization class yet.
The only thing I know for sure is to check for negative cycles, as this means you can rack the resource limit up to infinity, allowing for you to pick up each rabbit. I don't know the "reason" to pick the next path. I also don't know when to terminate, as you could keep using all of the edges and make the resource limit drop below 0 forever, but never escape.
I'm not really looking for code (as this is a coding challenge), only the type of problem this is (e.g. Max Flow, Longest Path, Shortest Path, etc.). If you know an algorithm that already fits this, that would be extra awesome. Thanks.
The time it takes to move from your starting point to all of the bunnies and to the bulkhead will be given to you in a square matrix of integers. Each row will tell you the time it takes to get to the start, first bunny, second bunny, ..., last bunny, and the bulkhead in that order. The order of the rows follows the same pattern (start, each bunny, bulkhead). The bunnies can jump into your arms, so picking them up is instantaneous, and arriving at the bulkhead at the same time as it seals still allows for a successful, if dramatic, escape. (Don't worry, any bunnies you don't pick up will be able to escape with you since they no longer have to carry the ones you did pick up.) You can revisit different spots if you wish, and moving to the bulkhead doesn't mean you have to immediately leave - you can move to and from the bulkhead to pick up additional bunnies if time permits.
In addition to spending time traveling between bunnies, some paths interact with the space station's security checkpoints and add time back to the clock. Adding time to the clock will delay the closing of the bulkhead doors, and if the time goes back up to 0 or a positive number after the doors have already closed, it triggers the bulkhead to reopen. Therefore, it might be possible to walk in a circle and keep gaining time: that is, each time a path is traversed, the same amount of time is used or added.
Write a function of the form answer(times, time_limit) to calculate the most bunnies you can pick up and which bunnies they are, while still escaping through the bulkhead before the doors close for good. If there are multiple sets of bunnies of the same size, return the set of bunnies with the lowest prisoner IDs (as indexes) in sorted order. The bunnies are represented as a sorted list by prisoner ID, with the first bunny being 0. There are at most 5 bunnies, and time_limit is a non-negative integer that is at most 999.

It's a planning problem, basically. The generic approach to planning is to identify the possible states of the world, the initial state, transitions between states, and the final states. Then search the graph that this data imply, most simply using breadth-first search.
For this problem, the relevant state is (1) how much time is left (2) which rabbits we've picked up (3) where we are right now. This means 1,000 clock settings (I'll talk about added time in a minute) times 2^5 = 32 subsets of bunnies times 7 positions = 224,000 possible states, which is a lot for a human but not a computer.
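As a rough illustration of that state count, the triple (time left, bunnies picked up, position) can be packed into a single array index, which makes a visited table for breadth-first search cheap. This is only a sketch: the constant names and the `state_index` helper are hypothetical and not part of the challenge.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TIME_SLOTS  = 1000,   /* clock values 0..999 */
    BUNNY_SETS  = 1 << 5, /* every subset of up to 5 bunnies */
    POSITIONS   = 7,      /* start, 5 bunnies, bulkhead */
    STATE_COUNT = TIME_SLOTS * BUNNY_SETS * POSITIONS
};

/* Maps one state to a unique index in [0, STATE_COUNT). */
size_t state_index(int time_left, unsigned picked_mask, int position)
{
    assert(time_left >= 0 && time_left < TIME_SLOTS);
    assert(picked_mask < (unsigned)BUNNY_SETS);
    assert(position >= 0 && position < POSITIONS);
    return ((size_t)time_left * BUNNY_SETS + picked_mask) * POSITIONS
           + (size_t)position;
}
```

A search could then keep something like `unsigned char visited[STATE_COUNT]` and mark `visited[state_index(t, mask, pos)]` as each state is expanded.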
We can deal with added time by swiping a trick from Johnson's algorithm. As Tymur suggests in a comment, run Bellman-Ford and either find a negative cycle (in which case all rabbits can be saved by running around the negative cycle enough times first) or potentials that, when applied, make all times nonnegative. Don't forget to adjust the starting time by the difference in potential between the starting position and the bulkhead.
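The reweighting step could be sketched as follows. This is a minimal sketch under stated assumptions, not a full solution: the name `reweight`, the `MAX_V` bound, and the convention that vertex 0 is the start and vertex `n - 1` is the bulkhead are all choices made for this example. The potentials `h` come from Bellman-Ford run from an implicit extra source that reaches every vertex at cost 0, and each edge becomes `w[u][v] + h[u] - h[v]`.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 7 /* start + at most 5 bunnies + bulkhead */

/*
 * Returns false if the graph contains a negative cycle.
 * Otherwise rewrites w in place so that every edge is nonnegative
 * and shifts *time_limit by h[start] - h[bulkhead].
 */
bool reweight(int n, int w[MAX_V][MAX_V], int *time_limit)
{
    assert(n >= 1 && n <= MAX_V);
    int h[MAX_V] = {0}; /* all zeros: the implicit source's 0-cost edges */

    for (int pass = 0; pass < n; pass++) {
        bool changed = false;
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                if (h[u] + w[u][v] < h[v]) {
                    h[v] = h[u] + w[u][v];
                    changed = true;
                }
            }
        }
        if (!changed) {
            break; /* potentials are final */
        }
        if (pass == n - 1) {
            return false; /* still improving on pass n: negative cycle */
        }
    }

    for (int u = 0; u < n; u++) {
        for (int v = 0; v < n; v++) {
            w[u][v] += h[u] - h[v];
        }
    }
    *time_limit += h[0] - h[n - 1];
    return true;
}
```

After a successful call, any route from the start to the bulkhead costs its original time plus `h[0] - h[n - 1]`, which is why the limit is shifted by the same amount. The search over states can then treat all remaining times as nonnegative.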

There you go. I started Google Foobar yesterday and will be starting Level 5 shortly. This was my second problem at Level 4. The solution is fast enough because I memoize the states, without using the utils class. Anyway, I loved the experience. This was by far the best problem I have solved, since I got to use Floyd-Warshall (to find a negative cycle if one exists), Bellman-Ford (as a utility for the weight readjustment step used in algorithms like Johnson's and Suurballe's), Johnson (weight readjustment!), DFS (for recursing over steps), and even memoization using a self-designed hash function :)
Happy Coding!!
public class Solution
{
public static final int INF = 100000000;
public static final int MEMO_SIZE = 10000;
public static int[] lookup;
public static int[] lookup_for_bunnies;
public static int getHashValue(int[] state, int loc)
{
int hashval = 0;
for(int i = 0; i < state.length; i++)
hashval += state[i] * (1 << i);
hashval += (1 << loc) * 100;
return hashval % MEMO_SIZE;
}
public static boolean findNegativeCycle(int[][] times)
{
int i, j, k;
int checkSum = 0;
int V = times.length;
int[][] graph = new int[V][V];
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
{
graph[i][j] = times[i][j];
checkSum += times[i][j];
}
if(checkSum == 0)
return true;
for(k = 0; k < V; k++)
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
if(graph[i][j] > graph[i][k] + graph[k][j])
graph[i][j] = graph[i][k] + graph[k][j];
for(i = 0; i < V; i++)
if(graph[i][i] < 0)
return true;
return false;
}
public static void dfs(int[][] times, int[] state, int loc, int tm, int[] res)
{
int V = times.length;
if(loc == V - 1)
{
int rescued = countArr(state);
int maxRescued = countArr(res);
if(maxRescued < rescued)
for(int i = 0; i < V; i++)
res[i] = state[i];
if(rescued == V - 2)
return;
}
else if(loc > 0)
state[loc] = 1;
int hashval = getHashValue(state, loc);
if(tm < lookup[hashval])
return;
else if(tm == lookup[hashval] && countArr(state) <= lookup_for_bunnies[loc])
return;
else
{
lookup_for_bunnies[loc] = countArr(state);
lookup[hashval] = tm;
for(int i = 0; i < V; i++)
{
if(i != loc && (tm - times[loc][i]) >= 0)
{
boolean stateCache = state[i] == 1;
dfs(times, state, i, tm - times[loc][i], res);
if(stateCache)
state[i] = 1;
else
state[i] = 0;
}
}
}
}
public static int countArr(int[] arr)
{
int counter = 0;
for(int i = 0; i < arr.length; i++)
if(arr[i] == 1)
counter++;
return counter;
}
public static int bellmanFord(int[][] adj, int times_limit)
{
int V = adj.length;
int i, j, k;
int[][] graph = new int[V + 1][V + 1];
for(i = 1; i <= V; i++)
graph[i][0] = INF;
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
graph[i + 1][j + 1] = adj[i][j];
int[] distance = new int[V + 1] ;
for(i = 1; i <= V; i++)
distance[i] = INF;
for(i = 1; i <= V; i++)
for(j = 0; j <= V; j++)
{
int minDist = INF;
for(k = 0; k <= V; k++)
if(graph[k][j] != INF)
minDist = Math.min(minDist, distance[k] + graph[k][j]);
distance[j] = Math.min(distance[j], minDist);
}
for(i = 0; i < V; i++)
for(j = 0; j < V; j++)
adj[i][j] += distance[i + 1] - distance[j + 1];
return times_limit + distance[1] - distance[V];
}
public static int[] solution(int[][] times, int times_limit)
{
int V = times.length;
if(V == 2)
return new int[]{};
if(findNegativeCycle(times))
{
int ans[] = new int[times.length - 2];
for(int i = 0; i < ans.length; i++)
ans[i] = i;
return ans;
}
lookup = new int[MEMO_SIZE];
lookup_for_bunnies = new int[V];
for(int i = 0; i < V; i++)
lookup_for_bunnies[i] = -1;
times_limit = bellmanFord(times, times_limit);
int initial[] = new int[V];
int res[] = new int[V];
dfs(times, initial, 0, times_limit, res);
int len = countArr(res);
int ans[] = new int[len];
int counter = 0;
for(int i = 0; i < res.length; i++)
if(res[i] == 1)
{
ans[counter++] = i - 1;
if(counter == len)
break;
}
return ans;
}
}

cvDilate/cvErode: How to avoid connection between separated objects?

I would like to separate objects in OpenCV as shown in the following image:
But if I use cvDilate or cvErode, the objects grow together. How can I do this with OpenCV?

It looks like you will need to write your own dilate function and then add xor functionality yourself.
Per the OpenCV documentation, here is the rule that cvDilate uses:
dst = dilate(src, element): dst(x,y) = max over (x',y') in element of src(x+x', y+y')
Here is pseudocode for a starting point (this does not include xor code):
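// Minimal grayscale image: row-major, one byte per pixel.
typedef struct {
    int width, height;
    unsigned char *data;
} Image;

// Returns 1 if (row, col) is outside of the image, 0 otherwise.
int out_of_bounds(const Image *img, int row, int col) {
    return row < 0 || row >= img->height || col < 0 || col >= img->width;
}

// Largest pixel value in the window centered on (center_row, center_col).
int get_max_pixel_in_window(const Image *img, int center_row, int center_col) {
    int radius = 1; // half-width of the window: 1 gives a 3x3 window
    int cur_max = 0;
    for (int i = -radius; i <= radius; i++) {
        for (int j = -radius; j <= radius; j++) {
            int cur_row = center_row + i;
            int cur_col = center_col + j;
            if (out_of_bounds(img, cur_row, cur_col)) {
                continue;
            }
            int cur_pix = img->data[cur_row * img->width + cur_col];
            if (cur_pix > cur_max) {
                cur_max = cur_pix;
            }
        }
    }
    return cur_max;
}

// Writes the dilation of src into dst (same size as src).
// A separate output is used so already-dilated pixels are never read back.
void my_dilate(const Image *src, Image *dst) {
    for (int i = 0; i < src->height; i++) {
        for (int j = 0; j < src->width; j++) {
            dst->data[i * src->width + j] =
                (unsigned char)get_max_pixel_in_window(src, i, j);
        }
    }
}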

As far as I know OpenCV does not have "dilation with XOR" (although that would be very nice to have).
To get similar results you might try eroding (as in 'd'), and using the eroded centers as seeds for a Voronoi segmentation which you could then AND with the original image.

After erosion and dilation, try thresholding the image to eliminate weak elements. Only strong regions should remain, which should improve the object separation. By the way, could you be a little clearer about your problem with cvDilate or cvErode?

mean absolute difference method for video stabilisation project in C++/CLI

I am trying to implement a video stabilization project in C++/CLI. I have a sequence of BMP images, and I compute motion vectors that show how far a specific pixel region moves between consecutive frames. For example, with 256*256 images, I select a 200*200 region in the first and second frames and find how many pixels the region moved between them. The program finishes when the algorithm reaches the last image in the sequence, at which point I have all the motion vectors. I did this using the mean absolute difference method.
It works, but too slowly. My example code is below; it finds only the first motion vector (x and y directions):
//M: image height = 256
//N: image width = 256
//BS: block size = 218
//selecting and reading first and second image frame
frame = 1;
s1 = "C:\\bike\\" + frame + ".bmp";
image = gcnew System::Drawing::Bitmap(s1, true);
s2 = "C:\\bike\\" + (frame + 1) + ".bmp";
image2 = gcnew System::Drawing::Bitmap(s2, true);
// convert both frames to grayscale
for (b = 0; b < M; b++) {
    for (a = 0; a < N; a++) {
        System::Drawing::Color BitmapColor = image->GetPixel(a, b);
        I1[b][a] = (double)((BitmapColor.R * 0.3) + (BitmapColor.G * 0.59) + (BitmapColor.B * 0.11));
    }
}
for (b = 0; b < M; b++) {
    for (a = 0; a < N; a++) {
        System::Drawing::Color BitmapColor = image2->GetPixel(a, b);
        I2[b][a] = (double)((BitmapColor.R * 0.3) + (BitmapColor.G * 0.59) + (BitmapColor.B * 0.11));
    }
}
//finding blocks: copy the central BS x BS region of the second frame
a = 0;
for (i = 19; i < 237; i++) {
    b = 0;
    for (j = 19; j < 237; j++) {
        Blocks[a][b] = I2[i][j];
        b++;
    }
    a++;
}
//finding motion vectors according to the mean absolute differences
//MAD method
for (m = 0; m < (M - BS); m++) {
    for (n = 0; n < (N - BS); n++) {
        toplam = 0; // running sum of absolute differences
        for (i = 0; i < BS; i++) {
            for (j = 0; j < BS; j++) {
                toplam += fabs(I1[m + i][n + j] - Blocks[i][j]);
            }
        }
        // mean absolute difference for this offset
        difference = toplam / (BS * BS);
        // finding vectors
        if (difference < mindifference) {
            mindifference = difference;
            MV_x = m;
            MV_y = n;
        }
    }
}
This code works, but it is very slow, so I need to optimize it.
How can I do this without using for loops, for example by indexing in C++/CLI the way MATLAB does (e.g. I1(1:20)=100)?
Could you help me please?
Best Regards...

A couple things you should note:
First, loops in C++ are not slow compared to built-in functions. In MATLAB, the fewer operations the better, so it's best to call built-in functions that are a single operation done with optimized code. In C++, YOUR code gets optimized equally with the built-in functions.
Next, GetPixel is extremely slow. Try Bitmap.LockBits instead. This may seem to contradict my previous statement, but it doesn't: the gain comes not from LockBits looping faster than your own loop, but from GetPixel using a different access method which is much, much slower.
Once you switch to LockBits, you can probably double or triple your speed again by unrolling the loop somewhat, if the compiler isn't already doing so.
Finally, make sure you're making good use of cache locality. Try both looping orders (e.g. for (a...) for (b...) and for (b...) for (a...)) and measure the time of each to find out which is faster.
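To illustrate the cache-locality point in plain C, here is a minimal sketch of the block search over flat row-major arrays, with the innermost loop walking consecutive memory. The `Match` struct and `find_best_match` function are hypothetical names for this example. The sketch also adds one technique that is not part of the advice above: it abandons a candidate offset as soon as its partial sum can no longer beat the best one found so far.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

typedef struct {
    int row, col; /* offset of the best match inside the frame */
    double mad;   /* mean absolute difference at that offset */
} Match;

/* Searches a rows x cols frame for the bs x bs block with the lowest
 * mean absolute difference. Both arrays are row-major. */
Match find_best_match(const double *frame, int rows, int cols,
                      const double *block, int bs)
{
    assert(frame != NULL && block != NULL);
    assert(bs > 0 && bs <= rows && bs <= cols);

    Match best = {0, 0, 0.0};
    double best_sum = DBL_MAX;

    for (int m = 0; m + bs <= rows; m++) {
        for (int n = 0; n + bs <= cols; n++) {
            double sum = 0.0;
            /* Stop adding rows once this offset cannot win anymore. */
            for (int i = 0; i < bs && sum < best_sum; i++) {
                const double *f = frame + (size_t)(m + i) * cols + n;
                const double *b = block + (size_t)i * bs;
                for (int j = 0; j < bs; j++) {
                    double d = f[j] - b[j];
                    sum += (d < 0.0) ? -d : d;
                }
            }
            if (sum < best_sum) {
                best_sum = sum;
                best.row = m;
                best.col = n;
            }
        }
    }
    best.mad = best_sum / ((double)bs * bs);
    return best;
}
```

Unlike the loops in the question, this version also tests the last valid offset in each direction (`m + bs == rows`). How much the early exit helps depends on the image content, so it is worth timing with and without it.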
