Mean absolute difference method for a video stabilisation project in C++/CLI

I am trying to implement a video stabilisation project in C++/CLI. I start from a sequence of BMP images and compute motion vectors that describe how much a specific pixel region moves between consecutive frames. For example, with 256*256 images I select a 200*200 region in the first frame and in the second frame, and measure how many pixels the region shifts between them. The algorithm repeats this for every pair of frames until the last image in the sequence, so in the end I obtain the motion vectors. I did this using the mean absolute difference (MAD) method.
It works, but far too slowly. My example code block is below; it finds only the motion vector for the first frame pair (x direction and y direction):
// M:  image height = 256
// N:  image width  = 256
// BS: block size   = 218

// select and read the first and second image frames
frame = 1;
s1 = "C:\\bike\\" + frame.ToString() + ".bmp";
image = gcnew System::Drawing::Bitmap(s1, true);
s2 = "C:\\bike\\" + (frame + 1).ToString() + ".bmp";
image2 = gcnew System::Drawing::Bitmap(s2, true);
// convert both frames to grayscale luminance arrays
for (b = 0; b < M; b++) {
    for (a = 0; a < N; a++)
    {
        System::Drawing::Color BitmapColor = image->GetPixel(a, b);
        I1[b][a] = (double)(BitmapColor.R * 0.3 + BitmapColor.G * 0.59 + BitmapColor.B * 0.11);
    }
}
for (b = 0; b < M; b++) {
    for (a = 0; a < N; a++)
    {
        System::Drawing::Color BitmapColor = image2->GetPixel(a, b);
        I2[b][a] = (double)(BitmapColor.R * 0.3 + BitmapColor.G * 0.59 + BitmapColor.B * 0.11);
    }
}
// copy the 218x218 reference block from the second frame (rows/columns 19..236)
a = 0;
for (i = 19; i < 237; i++) {
    b = 0;
    for (j = 19; j < 237; j++) {
        Blocks[a][b] = I2[i][j];
        b++;
    }
    a++;
}
// find the motion vector with the mean absolute difference (MAD) method
mindifference = 1e30;   // start with a huge value so the first candidate wins
for (m = 0; m < (M - BS); m++) {
    for (n = 0; n < (N - BS); n++) {
        toplam = 0;   // sum of absolute differences for this candidate offset
        for (i = 0; i < BS; i++) {
            for (j = 0; j < BS; j++) {
                toplam += fabs(I1[m + i][n + j] - Blocks[i][j]);
            }
        }
        // keep the offset with the smallest difference as the motion vector
        if (toplam < mindifference) {
            mindifference = toplam;
            MV_x = m;
            MV_y = n;
        }
    }
}
This code example works, but it is very slow and I need to optimise it.
How can I do this without for loops, for example with MATLAB-style indexing in C++/CLI (e.g. I1(1:20) = 100)?
Could you help me, please?
Best Regards...

A couple of things you should note:
First, loops in C++ are not slow compared to built-in functions. In MATLAB, the fewer operations the better, so it's best to call built-in functions that do the whole job as a single operation in optimized code. In C++, YOUR code gets optimized just like the built-in functions.
Next, GetPixel is extremely slow. Try Bitmap.LockBits instead. Ironically, this seems to contradict the previous statement, but it doesn't: the gain is not because looping inside LockBits is faster than doing the loop yourself, but because GetPixel uses a different access method that is much, much slower.
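To make that concrete, here is a rough sketch of how the grayscale conversion above might look with LockBits. It is a sketch under assumptions (a 24 bpp source bitmap, the same M, N and native I1 array as in the question, and the made-up name ToGray), not a drop-in replacement:

// Sketch only: read pixel data through LockBits instead of GetPixel.
// Assumes the bitmap is 24 bpp RGB; I1 is the caller's M x N luminance array.
void ToGray(System::Drawing::Bitmap^ bmp, double I1[256][256], int M, int N)
{
    System::Drawing::Rectangle rect(0, 0, N, M);
    System::Drawing::Imaging::BitmapData^ data =
        bmp->LockBits(rect, System::Drawing::Imaging::ImageLockMode::ReadOnly,
                      System::Drawing::Imaging::PixelFormat::Format24bppRgb);
    unsigned char* base = (unsigned char*)data->Scan0.ToPointer();
    for (int b = 0; b < M; b++)
    {
        unsigned char* p = base + b * data->Stride;   // start of scanline b
        for (int a = 0; a < N; a++)
        {
            // 24 bpp pixel data is stored in B, G, R order
            I1[b][a] = p[2] * 0.3 + p[1] * 0.59 + p[0] * 0.11;
            p += 3;
        }
    }
    bmp->UnlockBits(data);
}

The same pattern (lock once, walk the raw scanlines, unlock) applies to the second frame as well.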
Once you switch to LockBits, you can probably double or triple your speed again by unrolling the loop somewhat, if the compiler isn't already doing so.
Finally, make sure you're making good use of cache locality. Try both looping orders (e.g. for (a...) for (b...) and for (b...) for (a...)) and measure the time of each to find out which is faster.

Related

2-D cache blocking Optimization

I would like to know if there is a way to optimize the execution time of this code in C. This part initializes the matrix row-wise.
The other parts of my whole program basically just measure the current time, plus the main function, so I guess this initialization is the most time-consuming part.
One hint I was given is that we can use cache blocking. By the way, this code is also meant to imitate the process by which the CPU fetches data from the cache. I have been puzzling over it all day but have had limited ideas.
Thanks!!
void InitializeMatrixRowwise() {
    int i, j;
    double x;
    x = 0.0;
    for (i = 0; i < DIMENSION; i++) {
        for (j = 0; j < DIMENSION; j++) {
            if (i >= j) {
                Matrix[i][j] = x;
                x += 1.0;
            } else
                Matrix[i][j] = 1.0;
        }
    }
}
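Since the hint mentions cache blocking, below is a sketch of what a blocked (tiled) traversal could look like. Two caveats: the original row-wise loop already walks memory sequentially, so for a pure initialization the blocking itself buys very little; and because x in the original depends on the visiting order, a blocked version has to compute the written value directly, which works out to i*(i+1)/2 + j for the lower-triangle cells. DIMENSION and BLOCK values here are placeholders:

#define DIMENSION 1024   /* placeholder: use the same value as the rest of the program */
#define BLOCK     64     /* tile size to experiment with */

double Matrix[DIMENSION][DIMENSION];

void InitializeMatrixBlocked(void) {
    int ib, jb, i, j;
    for (ib = 0; ib < DIMENSION; ib += BLOCK) {
        for (jb = 0; jb < DIMENSION; jb += BLOCK) {
            for (i = ib; i < ib + BLOCK && i < DIMENSION; i++) {
                for (j = jb; j < jb + BLOCK && j < DIMENSION; j++) {
                    if (i >= j)
                        /* same value the row-wise version writes at (i, j) */
                        Matrix[i][j] = (double)i * (i + 1) / 2 + j;
                    else
                        Matrix[i][j] = 1.0;
                }
            }
        }
    }
}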

Fast pixel drawing c++ windows forms

I am trying to make a C++ emulator for the original Space Invaders arcade machine. Everything works just fine, but it is slow. After timing several functions in my project I found that the biggest drain is my draw-screen function. It takes 100+ ms, but I need 16 ms/frame, i.e. 60 Hz. This is my function:
unsigned char *fb = &state->memory[0x2400];
Bitmap ^ bmp = gcnew Bitmap(512, 448);
for (int j = 1; j < 224; j++)
{
    for (int i = 0; i < 256; i += 8)
    {
        unsigned char pix = fb[(j * 256 / 8) + i / 8];
        for (int p = 0; p < 8; p++)
        {
            if (0 != (pix & (1 << p)))
            {
                bmp->SetPixel(i + p, j, Color::White);
            }
            else
            {
                bmp->SetPixel(i + p, j, Color::Black);
            }
        }
    }
}
bmp->RotateFlip(RotateFlipType::Rotate270FlipNone);
this->pictureBox1->Image = bmp;
fb is my framebuffer, a byte array with 8 pixels per byte: 1 for white, 0 for black.
Browsing the internet I found that the "slow part" is the SetPixel method, but I couldn't find a better way to do this.
Thanks to Loathing's comment I solved the problem. Instead of redrawing every pixel, I first clear the image to black and then draw only the white pixels. I had to use a buffer to prevent flickering.
unsigned char *fb = &state->memory[0x2400];
Bitmap ^ bmp = gcnew Bitmap(512, 448);
Image ^ buffer = this->pictureBox1->Image;
Graphics ^ g = Graphics::FromImage(buffer);
for (int j = 1; j < 224; j++)
{
    for (int i = 0; i < 256; i += 8)
    {
        unsigned char pix = fb[(j * 256 / 8) + i / 8];
        for (int p = 0; p < 8; p++)
        {
            if (0 != (pix & (1 << p)))
            {
                bmp->SetPixel(i + p, j, Color::White);
            }
        }
    }
}
bmp->RotateFlip(RotateFlipType::Rotate270FlipNone);
g->Clear(Color::Black);
buffer = bmp;
this->pictureBox1->Image = buffer;
and at initialisation I added the following code to prevent a NullReferenceException from being thrown,
Bitmap ^def = gcnew Bitmap(20, 20);
this->pictureBox1->Image = def;
Update: This method works for my project because the screen is mostly composed of black pixels. In general, GetPixel and SetPixel are slow and shouldn't be used in performance-sensitive programs; LockBits is the way to go. I didn't implement it because I got the performance I needed just by clearing the graphics surface.
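For reference, here is a hedged sketch of what the LockBits route could look like for this framebuffer. It assumes a 256x224 frame, a 32 bpp destination bitmap, the same using-directives as the code above, and the illustrative name DrawFrame; treat it as a starting point rather than a tested implementation:

// Sketch only: expand the 1-bit framebuffer into a 32 bpp bitmap via LockBits.
Bitmap^ DrawFrame(unsigned char* fb)
{
    Bitmap^ bmp = gcnew Bitmap(256, 224, System::Drawing::Imaging::PixelFormat::Format32bppRgb);
    System::Drawing::Rectangle rect(0, 0, 256, 224);
    System::Drawing::Imaging::BitmapData^ data =
        bmp->LockBits(rect, System::Drawing::Imaging::ImageLockMode::WriteOnly,
                      System::Drawing::Imaging::PixelFormat::Format32bppRgb);
    for (int j = 0; j < 224; j++)
    {
        // one 32-bit word per pixel on scanline j
        unsigned int* row = (unsigned int*)((unsigned char*)data->Scan0.ToPointer() + j * data->Stride);
        for (int i = 0; i < 256; i += 8)
        {
            unsigned char pix = fb[j * 32 + i / 8];
            for (int p = 0; p < 8; p++)
                row[i + p] = (pix & (1 << p)) ? 0x00FFFFFF : 0x00000000;   // white or black
        }
    }
    bmp->UnlockBits(data);
    bmp->RotateFlip(RotateFlipType::Rotate270FlipNone);
    return bmp;
}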

How many moves to reach a destination? Efficient flood filling

I want to compute the distance of cells from a destination cell, using the number of four-way movements needed to reach it. So the four cells immediately adjacent to the destination have a distance of 1, the cells in the four cardinal directions of each of those have a distance of 2, and so on. There is a maximum distance that might be around 16 or 20, and there are cells that are occupied by barriers; the distance can flow around them but not through them.
I want to store the output into a 2D array, and I want to be able to compute this 'distance map' for any destination on a bigger maze map very quickly.
I am successfully doing it with a variation on a flood fill, where I place the incremental distances of the adjacent unfilled cells in a priority queue (using the C++ STL).
I am happy with the functionality and now want to focus on optimizing the code, as it is very performance sensitive.
What cunning and fast approaches might there be?
I think you have done everything right. If you coded it correctly, it takes O(n) time and O(n) memory to compute the flood fill, where n is the number of cells, and it can be proven that it's impossible to do better (in the general case). And after the fill is complete you just return the distance for any destination in O(1); it is easy to see that this also can't be done better.
So if you want to optimize performance, you can only focus on LOCAL CODE OPTIMIZATION, which will not affect the asymptotics but can significantly improve your real execution time. It's hard to give you any advice for code optimization without actually seeing the source, though.
So if you really want to see optimized code, see the following (pure C):
#include <stdlib.h>

int* BFS()
{
    int N, M;                      // Assume we have an N x M grid.
    int X, Y;                      // Start position. X, Y are unit based.
    int i, j;
    int movex[4] = {0, 0, 1, -1};  // Move on x dimension.
    int movey[4] = {1, -1, 0, 0};  // Move on y dimension.

    // TO DO: Read N, M, X, Y

    // To reduce redundant function calls and memory reallocation,
    // allocate all needed memory once and use simple arrays.
    int* map = (int*)malloc(sizeof(int) * (N + 2) * (M + 2));
    int leadDim = M + 2;
    // Our map. We use a one dimensional array. map[x][y] = map[leadDim * x + y];
    // If (x,y) is occupied then map[leadDim*x + y] = -1;
    // If (x,y) is not visited map[leadDim*x + y] = -2;

    int* queue = (int*)malloc(sizeof(int) * N * M);
    int first = 0, last = 1;

    // Fill the borders to simplify the code and reduce conditions
    for (i = 0; i < N + 2; ++i)
    {
        map[i * leadDim + 0] = -1;
        map[i * leadDim + M + 1] = -1;
    }
    for (j = 0; j < M + 2; ++j)
    {
        map[j] = -1;
        map[(N + 1) * leadDim + j] = -1;
    }

    // TO DO: Read the map.

    queue[first] = X * leadDim + Y;
    map[X * leadDim + Y] = 0;

    // Very simple optimized process loop.
    while (first < last)
    {
        int current = queue[first];
        int step = map[current];
        for (i = 0; i < 4; ++i)
        {
            int temp = current + movex[i] * leadDim + movey[i];
            if (map[temp] == -2)   // only one condition in the inner loop
            {
                map[temp] = step + 1;
                queue[last++] = temp;
            }
        }
        ++first;
    }

    free(queue);
    return map;
}
The code may seem tricky, and of course it doesn't look like OOP (I actually think OOP fans will hate it), but if you want something really fast, that's what you need.
It's a common task for BFS. The complexity is O(cellsCount).
My C++ implementation:
#include <vector>
#include <queue>
#include <utility>
using namespace std;

// BARRIER is assumed to be defined elsewhere as the cell value for obstacles.
vector<vector<int> > GetDistance(int x, int y, vector<vector<int> > cells)
{
    const int INF = 0x7FFFFF;
    vector<vector<int> > distance(cells.size());
    for (int i = 0; i < distance.size(); i++)
        distance[i].assign(cells[i].size(), INF);

    queue<pair<int, int> > q;
    q.push(make_pair(x, y));
    distance[x][y] = 0;

    while (!q.empty())
    {
        pair<int, int> curPoint = q.front();
        q.pop();
        int curDistance = distance[curPoint.first][curPoint.second];
        for (int i = -1; i <= 1; i++)
            for (int j = -1; j <= 1; j++)
            {
                if ((i + j) % 2 == 0) continue;   // keep only the four axis-aligned moves
                pair<int, int> nextPoint(curPoint.first + i, curPoint.second + j);
                if (nextPoint.first >= 0 && nextPoint.first < cells.size()
                    && nextPoint.second >= 0 && nextPoint.second < cells[nextPoint.first].size()
                    && cells[nextPoint.first][nextPoint.second] != BARRIER
                    && distance[nextPoint.first][nextPoint.second] > curDistance + 1)
                {
                    distance[nextPoint.first][nextPoint.second] = curDistance + 1;
                    q.push(nextPoint);
                }
            }
    }
    return distance;
}
Start with a recursive implementation: (untested code)
int visit(int xy, int dist) {
    int ret = 1;
    if (array[xy] <= dist) return 0;
    array[xy] = dist;
    if (dist == maxdist) return ret;
    ret += visit(RIGHT(xy), dist + 1);
    ...
    same for left, up, down
    ...
    return ret;
}
You'll need to handle the initialisation and the edge cases, and you have to decide whether you want a two-dimensional array or a one-dimensional array.
A next step could be to use a to-do list and remove the recursion, and a third step could be to add some bit masking.
8-bit computers in the 1970s did this with an optimization that has the same algorithmic complexity, but in the typical case is much faster on actual hardware.
Starting from the initial square, scan to the left and right until "walls" are found. Now you have a "span" that is one square tall and N squares wide. Mark the span as "filled," in this case each square with the distance to the initial square.
For each square above and below the current span, if it's not a "wall" or already filled, pick it as the new origin of a span.
Repeat until no new spans are found.
Since horizontal rows tend to be stored contiguously in memory, this algorithm tends to thrash the cache far less than one that has no bias for horizontal searches.
Also, since in the most common cases far fewer items are pushed and popped from a stack (spans instead of individual blocks) there is less time spent maintaining the stack.
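For illustration, here is a minimal sketch of that span-based fill in C++, shown for the plain "mark reachable cells" case rather than the distance-labelled variant the question needs; the grid layout and names are assumptions:

#include <vector>
#include <stack>
#include <utility>

// Minimal scanline (span) flood fill: replaces connected 'empty' cells with
// 'fill', starting from (sx, sy). Names and grid layout are assumptions.
void spanFill(std::vector<std::vector<int> >& grid, int sx, int sy, int empty, int fill)
{
    int h = (int)grid.size(), w = (int)grid[0].size();
    if (grid[sy][sx] != empty) return;
    std::stack<std::pair<int, int> > seeds;          // pending span seeds (x, y)
    seeds.push(std::make_pair(sx, sy));
    while (!seeds.empty())
    {
        int x = seeds.top().first, y = seeds.top().second;
        seeds.pop();
        if (grid[y][x] != empty) continue;           // already filled via another span
        // Scan left and right from the seed to find the extent of the span.
        int left = x, right = x;
        while (left > 0 && grid[y][left - 1] == empty) --left;
        while (right + 1 < w && grid[y][right + 1] == empty) ++right;
        for (int i = left; i <= right; ++i) grid[y][i] = fill;
        // Any empty cell directly above or below the span becomes a new seed.
        for (int i = left; i <= right; ++i)
        {
            if (y > 0 && grid[y - 1][i] == empty) seeds.push(std::make_pair(i, y - 1));
            if (y + 1 < h && grid[y + 1][i] == empty) seeds.push(std::make_pair(i, y + 1));
        }
    }
}

Because each span is written as one contiguous run and far fewer seeds go on the stack than with a per-cell fill, this matches the cache and stack arguments above.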

A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        destination[j + i*n] = source[i + j*n];
but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification to the original. Any ideas?
Edit: I have a 2000x2000 matrix, and I want to know how I can change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks or 40x40 blocks, and see which block size is most efficient.
Edit2: The matrices are stored in column major order, that is to say for a matrix
a1 a2
a3 a4
is stored as a1 a3 a2 a4.
You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:
for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}
An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.
Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
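A rough sketch of that recursive halving, kept consistent with the indexing in the blocked code above (the 16x16 base case and the names are arbitrary choices):

#include <cstddef>

// Recursive, cache-oblivious transpose of a square n x n matrix: split the
// larger dimension in half until the tile is small, then copy directly.
void transposeRec(double* dst, const double* src, std::size_t n,
                  std::size_t r0, std::size_t r1, std::size_t c0, std::size_t c1)
{
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {
        for (std::size_t k = r0; k < r1; ++k)
            for (std::size_t l = c0; l < c1; ++l)
                dst[k + l * n] = src[l + k * n];
    } else if (r1 - r0 >= c1 - c0) {
        std::size_t rm = r0 + (r1 - r0) / 2;   // split the taller dimension
        transposeRec(dst, src, n, r0, rm, c0, c1);
        transposeRec(dst, src, n, rm, r1, c0, c1);
    } else {
        std::size_t cm = c0 + (c1 - c0) / 2;   // split the wider dimension
        transposeRec(dst, src, n, r0, r1, c0, cm);
        transposeRec(dst, src, n, r0, r1, cm, c1);
    }
}

// Usage: transposeRec(dst, src, n, 0, n, 0, n);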
I had the exact same problem yesterday.
I ended up with this solution:
void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
    THROWS();
    size_t block = 32;
    for (size_t i = 0; i < n; i += block) {
        for (size_t j = 0; j < p; ++j) {
            for (size_t b = 0; b < block && i + b < n; ++b) {
                dst[j*n + i + b] = src[(i + b)*p + j];
            }
        }
    }
}
This is 4 times faster than the obvious solution on my machine.
This solution takes care of a rectangular matrix whose dimensions are not a multiple of the block size.
If dst and src are the same square matrix, an in-place function should really be used instead:
void transpose(double* m, size_t n) noexcept {
    size_t block = 0, size = 8;
    for (block = 0; block + size - 1 < n; block += size) {
        // transpose the diagonal size x size tile in place
        for (size_t i = block; i < block + size; ++i) {
            for (size_t j = i + 1; j < block + size; ++j) {
                std::swap(m[i*n + j], m[j*n + i]);
            }
        }
        // swap the strip to the right of the diagonal tile with the strip below it
        for (size_t i = block + size; i < n; ++i) {
            for (size_t j = block; j < block + size; ++j) {
                std::swap(m[i*n + j], m[j*n + i]);
            }
        }
    }
    // handle the remaining bottom-right corner that doesn't fill a whole tile
    for (size_t i = block; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            std::swap(m[i*n + j], m[j*n + i]);
        }
    }
}
I used C++11, but this could easily be translated into other languages.
Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
Steve Jessop mentioned a cache-oblivious matrix transpose algorithm.
For the record, I want to share a possible implementation of a cache-oblivious matrix transpose.
public class Matrix {
    protected double data[];
    protected int rows, columns;

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public Matrix transpose() {
        Matrix C = new Matrix(columns, rows);
        cachetranspose(0, rows, 0, columns, C);
        return C;
    }

    public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
        int r = re - rb, c = ce - cb;
        if (r <= 16 && c <= 16) {
            for (int i = rb; i < re; i++) {
                for (int j = cb; j < ce; j++) {
                    T.data[j * rows + i] = data[i * columns + j];
                }
            }
        } else if (r >= c) {
            cachetranspose(rb, rb + (r / 2), cb, ce, T);
            cachetranspose(rb + (r / 2), re, cb, ce, T);
        } else {
            cachetranspose(rb, re, cb, cb + (c / 2), T);
            cachetranspose(rb, re, cb + (c / 2), ce, T);
        }
    }
}
More details on cache oblivious algorithms can be found here.
Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.
With matrix transpose, you are reading in a single linear pass and there's no way to optimize that. But you can simultaneously process several rows so that you write several columns and so fill complete cache lines. You will only need three loops.
Or do it the other way around and read in columns while writing linearly.
With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller cache friendly chunks (Say, 4x4 sub matrices). You can also flag sub matrices as identity which will help you in creating optimized code paths.

cvDilate/cvErode: How to avoid connection between separated objects?

I would like to separate objects in OpenCV, as the following image shows:
But if I use cvDilate or cvErode the objects grow together... How can I do that with OpenCV?
It looks like you will need to write your own dilate function and then add the XOR functionality yourself.
Per the OpenCV documentation, here is the rule that cvDilate uses:
dst = dilate(src, element): dst(x, y) = max over (x', y') in element of src(x + x', y + y')
Here is pseudocode for a starting point (this does not include the XOR code):
void my_dilate(img) {
    // work on a copy: dilating in place would let freshly written maxima
    // leak into the windows of pixels processed later
    src = copy_of(img);
    for (i = 0; i < img.height; i++) {
        for (j = 0; j < img.width; j++) {
            max_pixel = get_max_pixel_in_window(src, i, j);
            img.pixel(i, j) = max_pixel;
        }
    }
}

int get_max_pixel_in_window(img, center_row, center_col) {
    int window_size = 3;   // 3 gives a 7x7 window
    int cur_max = 0;
    for (i = -window_size; i <= window_size; i++) {
        for (j = -window_size; j <= window_size; j++) {
            int cur_row = center_row + i;
            int cur_col = center_col + j;
            if (out_of_bounds(img, cur_col, cur_row)) {
                continue;
            }
            int cur_pix = img.pixel(cur_row, cur_col);
            if (cur_pix > cur_max) {
                cur_max = cur_pix;
            }
        }
    }
    return cur_max;
}

// returns true if the x, y coordinate is outside of the image
int out_of_bounds(img, x, y) {
    if (x >= img.width || x < 0 || y >= img.height || y < 0) {
        return 1;
    }
    return 0;
}
As far as I know OpenCV does not have "dilation with XOR" (although that would be very nice to have).
To get similar results you might try eroding (as in 'd'), and using the eroded centers as seeds for a Voronoi segmentation which you could then AND with the original image.
After erosion and dilation, try thresholding the image to eliminate weak elements. Only strong regions should remain, which should improve the object separation. By the way, could you be a little clearer about your problem with cvDilate or cvErode?
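In the same spirit as the erode-and-reseed suggestion above, a common OpenCV recipe for separating touching blobs combines the distance transform with the watershed. Below is a hedged sketch using the newer C++ API rather than the old cvDilate/cvErode calls; the function name, the 0.6 peak-threshold fraction, and the input layout (a binary mask plus an 8-bit 3-channel image for the watershed) are assumptions to adapt. The grown labels can then be ANDed with the original mask, as suggested above, to keep only object pixels.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Sketch: separate touching blobs in a binary mask (CV_8UC1, objects white)
// using the distance transform + watershed. Names and thresholds are illustrative.
void separateObjects(const cv::Mat& mask, const cv::Mat& colorImage, cv::Mat& labels)
{
    // Distance of every object pixel to the background: peaks sit near object centres.
    cv::Mat dist;
    cv::distanceTransform(mask, dist, cv::DIST_L2, 3);

    // Keep only strong peaks as seeds, roughly one per object (tune the fraction).
    double maxVal = 0;
    cv::minMaxLoc(dist, 0, &maxVal);
    cv::Mat seeds;
    cv::threshold(dist, seeds, 0.6 * maxVal, 255, cv::THRESH_BINARY);
    seeds.convertTo(seeds, CV_8U);

    // Label each seed, then let the watershed grow the labels back out;
    // the watershed lines between labels keep touching objects separated.
    cv::connectedComponents(seeds, labels);   // labels becomes CV_32S
    cv::watershed(colorImage, labels);        // colorImage must be CV_8UC3
}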
