inserting zeros between the elements of vector with high performance and speed ( preferred to use STL) - visual-studio-2010

I have extracted raster data of a geotiff image using RasterIO of the GDAL library. Since the image shown by OpenGL needs to have width and height both a multiple of 4, I have used this code after extracting the data.
the first switch block evaluates the rest of RasterXSize(width) divided by 4 and for example if it is 1, it means that we should add 3 columns meaning that we should add 3 zeros at the end of each row. This is done by the code:
for ( int i = 1; i <= RasterYSize; i++)
pRasterData.insert(pRasterData.begin()+i*RasterXSize*depthOfPixel+(i-1)*3,3,0);
and the second switch block evaluates the rest of RasterYSize(height) divided by 4 and for example if it is 1, it means that we should easily add 3 rows to the end of the data which is done by this code:
pRasterData.insert(pRasterData.end(),3*RasterXSize,0);
This is the whole code that I have used for extracting the data and preparing it to be displayed by OpenGL:
void FilesWorkFlow::ReadRasterData(GDALDataset* poDataset)
{
RasterXSize = poDataset -> GetRasterXSize();
RasterYSize = poDataset -> GetRasterYSize();
RasterCount = poDataset -> GetRasterCount();
CPLErr error = CE_None;
GDALRasterBand *poRasterBand;
poRasterBand = poDataset -> GetRasterBand(1);
eType = poRasterBand -> GetRasterDataType();
BytesPerPixel = GDALGetDataTypeSize(eType) / 8;
depthOfPixel = RasterCount * BytesPerPixel;
pRasterData.resize(RasterXSize * RasterYSize * RasterCount * BytesPerPixel);
error = poDataset -> RasterIO(GF_Read,0,0,RasterXSize,RasterYSize,&pRasterData[0],RasterXSize,RasterYSize,eType,RasterCount,0,0,0,0);
int modRasterXSize = RasterXSize % 4;
switch (modRasterXSize)
{
case 1:
{
for ( int i = 1; i <= RasterYSize; i++)
pRasterData.insert(pRasterData.begin()+i*RasterXSize*depthOfPixel+(i-1)*3,3,0);
RasterXSize = RasterXSize+3;
break;
}
case 2:
{
for ( int i = 1; i <= RasterYSize; i++)
pRasterData.insert(pRasterData.begin()+i*RasterXSize*depthOfPixel+(i-1)*2,2,0);
RasterXSize = RasterXSize+2;
break;
}
case 3:
{
for ( int i = 1; i <= RasterYSize; i++)
pRasterData.insert(pRasterData.begin()+i*RasterXSize*depthOfPixel+(i-1)*1,1,0);
RasterXSize = RasterXSize+1;
break;
}
}
int modRasterYSize = RasterYSize % 4;
switch (modRasterYSize)
{
case 1:
{
pRasterData.insert(pRasterData.end(),3*RasterXSize,0);
RasterYSize = RasterYSize+3;
break;
}
case 2:
{
pRasterData.insert(pRasterData.end(),2*RasterXSize,0);
RasterYSize = RasterYSize+2;
break;
}
case 3:
{
pRasterData.insert(pRasterData.end(),1*RasterXSize,0);
RasterYSize = RasterYSize+1;
break;
}
}
}
the first switch block is where my code gets slow and because I am working with a 16997*15931 image it takes a lot of time for the program to run through the for loop.
Note that pRasterData is a member variable of the class FilesWorkFlow and because of the problems I had in sending this variable to the COpenGLControl class written by Brett Fowle in codeguru and used by me in the project with some slight changes, decided to use vector<unsigned char> instead of unsighned char*.
Now I am wondering is there anyway to implement these part of code faster using vectors?
Is there anyway to insert zero in certain parts of vector without using for loops and wasting too much time?
something like std::transform? I don't know!
Remember that I'm using MFC in Visual Studio 2010 and it's better for me to use STL but if you have another suggestions besides using vectors or STL, I'd be glad to hear that?

The reason it is slow is because the members of the vector are getting moved multiple times. Think about the members in the last row of your image. They all have to be moved once for every row of the image. It would be faster to create a whole new image, copying just the pixels you need from the original image and adding zeros where appropriate.
Here's an example:
void
padColumns(
std::vector<unsigned char> &old_image,
size_t old_width,
size_t new_width
)
{
size_t height = image.size() / old_width;
assert(image.size() == old_width*height);
std::vector<unsigned char> new_image(new_width * height);
for (size_t row=0; row!=height; ++row) {
std::copy(
old_image.begin() + row*old_width,
old_image.begin() + row*old_width + old_width,
new_image.begin() + row*new_width
);
std::fill(
new_image.begin() + row*new_width + old_width,
new_image.begin() + row*new_width + new_width,
0
);
}
old_image = new_image;
}

Related

c code is running to slow from nested for loops

my c program is running to slow (right now it is around 40 seconds without parallelization). I have tried using openmp which has brought the timing down significantly but I am looking to use simple and natural ways to make my code run faster other than using parallel for loops. The basic structure of the code is that is takes some command line arguments as inputs and then saves those inputs as variables. Then it recursively computes a variable called Rplus1 using the math.h library and the complex.h library. The problem of the code and where it is taking most of it's time is at the bottom where there are nested for loops. My goal is to get the whole code running in under 5 seconds but as of now it runs in about 40 seconds without using parallel for loops. Please Help!
#include "time.h"
#include "stdio.h"
#include "stdlib.h"
#include "complex.h"
#include "math.h"
#include "string.h"
#include "unistd.h"
#include "omp.h"
#define PI 3.14159265
int main (int argc, char *argv[]){
if(argc >= 8){
double start1 = omp_get_wtime();
// command line arguments are aligned in the following order: [theta] [number of layers in superlattice] [material_1] [lat const_1] [number of unit cells_1] [material_2] [lat const_2] [number of unit cells_2] .... [material_N] [lat const_N] [number of unit cells_N] [Log/Linear] [number of repeating superlattice layers] [yes/no]
int N;
sscanf(argv[2],"%d",&N); // Number of layers in superlattice specified by second input argument
if(strcmp(argv[argc-1],"yes") == 0) //If the substrate is included then add one more layer to the N variable
{
N = N+1;
}
int total;
sscanf(argv[argc-2],"%d",&total); // Number of repeating superlattice layers specified by second to last argument
double layers[N][6], horizangle[1001], vertangle[1001];
double complex (*F_hkl)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*F_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)),SF_table[10];// this array will hold the unit cell structure factors for all of the materials selected for each wavevector in the beam spectrum
double real, real2, lam, c_light = 299792458, h_pl = 4.135667516e-15,E = 10e3, r_0 = 2.818e-15, Lccd = 1.013;// just a few variables to hold values through calculations and constants, speed of light, plancks const, photon energy, and detector distance from sample
double angle;
double complex z;// just a variable to hold complex numbers throughout calculations
int i,j,m,n,t; // integers to index through arrays
lam = (h_pl*c_light)/E;
sscanf(argv[1],"%lf",&angle); //first argument is the angle of incidence, read it
angle = angle*(PI/180.0);
angle2 = -angle;
double (*table)[10] = malloc(10*9*sizeof(double)); // this array holds all the coefficients to calculate the atomic scattering factor below
double (*table2)[10] = malloc(10*2*sizeof(double));
FILE*datfile1 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/coef_table.bin","rb"); // read the binary file containg all the coefficients
fread(table,sizeof(double),90,datfile1);
fclose(datfile1);
FILE*datfile2 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/dispersioncs.bin","rb");
fread(table2,sizeof(double),20,datfile2);
fclose(datfile2);
// Calculate scattering factors for all elements
double a,b;
double k_z = (sin(angle)/lam)*1e-10; // incorporate angular dependence of SF but neglect 0.24 degree divergence because of approximation
for(i = 0;i<10;i++) // for each element...
{
SF_table[i] = 0;
for(j = 0;j<4;j++) // summation
{
a = table[2*j][i];
b = table[2*j+1][i];
SF_table[i] = SF_table[i] + a * exp(-b*k_z*k_z);
}
SF_table[i] = SF_table[i] + table[8][i] + table2[0][i] + table2[1][i]*I;
}
free(table);
double mm = 4.0, (*phi)[1001][1001] = malloc(N*1001*1001*sizeof(double));
for(i = 1; i < N+1; i++) // for each layer of material...
{
sscanf(argv[i*3+1],"%lf",&layers[i-1][1]); // get out of plane lattice constant
sscanf(argv[i*3+2],"%lf",&layers[i-1][2]); // get the number of unit cells in the layer
layers[i-1][1] = layers[i-1][1]*1e-10; // convert lat const input to meters
// Define reciprocal space positions at the incident angle h, k, l
layers[i-1][3] = 0; // h
layers[i-1][4] = 0; // k
double l; // l calculated for each wavevector in the spectrum because l changes with angle of incidence
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
l = 4;
phi[i-1][m][n] = 2*PI*layers[i-1][1]*sin(angle)/lam; // Caculate phi for each layer
if(strcmp(argv[i*3],"GaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*(SF_table[2]+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.5*8.0*(31 + table2[0][2] + table2[1][2]*I) + 0.5*8.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
if(strcmp(argv[i*3],"AlGaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*((0.76*SF_table[2]+ 0.24*SF_table[4])+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.24*4.0*(13 + table2[0][4] + table2[1][4]*I) + 0.76*4.0*(31 + table2[0][2] + table2[1][2]*I) + 4.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
}
}
}
double complex (*Rplus1)[1001] = malloc(1001*1001*sizeof(double complex));
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
Rplus1[m][n] = 0.0;
}
}
double stop1 = omp_get_wtime();
for(i=1;i<N;i++) // For each layer of the film
{
for(j=0;j<layers[i][2];j++) // For each unit cell
{
for (m = 0; m < 1001; m++) // For each row of the diffraction pattern
{
for (n = 0; n <1001; n++) // For each column of the diffraction pattern
{
Rplus1[m][n] = -I*g[i][m][n] + ((1-I*g_0[i][m][n])*(1-I*g_0[i][m][n]))/(I*g[i][m][n] + (cos(-2*phi[i][m][n])+I*sin(-2*phi[i][m][n]))/Rplus1[m][n]);
}
}
}
}
double stop2 = omp_get_wtime();
double elapsed1 = (double)(stop1 - start1);// Second user defined function to use Durbin and Follis recursive formula
double elapsed2 = (double)(stop2 - start1);// Second user defined function to use Durbin and Follis recursive formula
printf("main() through before diffraction function took %f seconds to run\n\n",elapsed1);
printf("main() through after diffraction function took %f seconds to run\n\n",elapsed2);
}
}

Mesh Simplification with Assimp and OpenMesh

For days ago, I ask a question on how to use the edge collapse with Assimp. Smooth the obj and remove duplicated vertices in software are sloved the basic problem that could make edge collapse work, I mean it work because it could be simplicated by MeshLab like this:
It looks good in MeshLab, but I then do it in my engine which used Assimp and OpenMesh. The problem is Assimp imported the specified vertices and Indices, that could let the halfedge miss the opposite pair (Is this called non-manifold?).
The result snapshot use OpenMesh's Quadric Decimation:
To clear to find the problem, I do it without decimation and parse the OpenMesh data structure back directly. Everything is work fine as expect (I mean the result without decimation).
The code that I used to decimate the mesh:
Loader::BasicData Loader::TestEdgeCollapse(float vertices[], int vertexLength, int indices[], int indexLength, float texCoords[], int texCoordLength, float normals[], int normalLength)
{
// Mesh type
typedef OpenMesh::TriMesh_ArrayKernelT<> OPMesh;
// Decimater type
typedef OpenMesh::Decimater::DecimaterT< OPMesh > OPDecimater;
// Decimation Module Handle type
typedef OpenMesh::Decimater::ModQuadricT< OPMesh >::Handle HModQuadric;
OPMesh mesh;
std::vector<OPMesh::VertexHandle> vhandles;
int iteration = 0;
for (int i = 0; i < vertexLength; i += 3)
{
vhandles.push_back(mesh.add_vertex(OpenMesh::Vec3f(vertices[i], vertices[i + 1], vertices[i + 2])));
if (texCoords != nullptr)
mesh.set_texcoord2D(vhandles.back(),OpenMesh::Vec2f(texCoords[iteration * 2], texCoords[iteration * 2 + 1]));
if (normals != nullptr)
mesh.set_normal(vhandles.back(), OpenMesh::Vec3f(normals[i], normals[i + 1], normals[i + 2]));
iteration++;
}
for (int i = 0; i < indexLength; i += 3)
mesh.add_face(vhandles[indices[i]], vhandles[indices[i + 1]], vhandles[indices[i + 2]]);
OPDecimater decimater(mesh);
HModQuadric hModQuadric;
decimater.add(hModQuadric);
decimater.module(hModQuadric).unset_max_err();
decimater.initialize();
//decimater.decimate(); // without this, everything is fine as expect.
mesh.garbage_collection();
int verticesSize = mesh.n_vertices() * 3;
float* newVertices = new float[verticesSize];
int indicesSize = mesh.n_faces() * 3;
int* newIndices = new int[indicesSize];
float* newTexCoords = nullptr;
int texCoordSize = mesh.n_vertices() * 2;
if(mesh.has_vertex_texcoords2D())
newTexCoords = new float[texCoordSize];
float* newNormals = nullptr;
int normalSize = mesh.n_vertices() * 3;
if(mesh.has_vertex_normals())
newNormals = new float[normalSize];
Loader::BasicData data;
int index = 0;
for (v_it = mesh.vertices_begin(); v_it != mesh.vertices_end(); ++v_it)
{
OpenMesh::Vec3f &point = mesh.point(*v_it);
newVertices[index * 3] = point[0];
newVertices[index * 3 + 1] = point[1];
newVertices[index * 3 + 2] = point[2];
if (mesh.has_vertex_texcoords2D())
{
auto &tex = mesh.texcoord2D(*v_it);
newTexCoords[index * 2] = tex[0];
newTexCoords[index * 2 + 1] = tex[1];
}
if (mesh.has_vertex_normals())
{
auto &normal = mesh.normal(*v_it);
newNormals[index * 3] = normal[0];
newNormals[index * 3 + 1] = normal[1];
newNormals[index * 3 + 2] = normal[2];
}
index++;
}
index = 0;
for (f_it = mesh.faces_begin(); f_it != mesh.faces_end(); ++f_it)
for (fv_it = mesh.fv_ccwiter(*f_it); fv_it.is_valid(); ++fv_it)
{
int id = fv_it->idx();
newIndices[index] = id;
index++;
}
data.Indices = newIndices;
data.IndicesLength = indicesSize;
data.Vertices = newVertices;
data.VerticesLength = verticesSize;
data.TexCoords = nullptr;
data.TexCoordLength = -1;
data.Normals = nullptr;
data.NormalLength = -1;
if (mesh.has_vertex_texcoords2D())
{
data.TexCoords = newTexCoords;
data.TexCoordLength = texCoordSize;
}
if (mesh.has_vertex_normals())
{
data.Normals = newNormals;
data.NormalLength = normalSize;
}
return data;
}
Also provide the tree obj I tested, and the face data that generated by Assimp, I fetch out from visual studio debugger, that shows the problem that some of the indices could not find the index pair.
Few weeks thinking about this and fails, I thought I want some Academic/Mathematical solution for automatically generating these decimated mesh, but now I'm trying to find the simple way to implement this, the way I am able to do is changing the structure for loading multi-object (file.obj) in single custom object (class obj), and switch the object when needed it. The benefit of this is I could manage what should present and ignore any algorithm problem.
By the way, I list some obstacles that push me back to simple way.
Assimp Unique Indices and Vertices, this is nothing wrong, but for the algorithm, no way to make the adjacency half-edge structure for this.
OpenMesh for reading only object file(*.obj), this could be done when using read_mesh function, but the disadvantage is the lack example of document and hard to using in my engine.
Write a custom 3d model importer for any format is hard.
In conclusion, there are two ways to make level of details work in engine, one is using the mesh simplication algorithm and more test to ensure quality, the other is just switch the 3dmodel that made by 3d software, It is not automatic but stable. I use the second method, and I show the result here :)
However, this is not a real solution with my question, so I won't assign me an answer.

Improving the Efficiency of Compact/Scatter in CUDA

Summary:
Any ideas about how to further improve upon the basic scatter operation in CUDA? Especially if one knows it will only be used to compact a larger array into a smaller one? or why the below methods of vectorizing memory ops and shared memory didn't work? I feel like there may be something fundamental I am missing and any help would be appreciated.
EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!
EDIT 01/04/16: I realized I never wrote about my results. Unfortunately in that Parallel for All Blog post they compared the global atomic method for compact to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregrate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method as while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2%) at best for dramatically increased code complexity.
Details:
I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling it seems that the scatter op takes up the most amount of time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:
__global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
freq_Index is the number of elements in the old array. The flag array is the result from the filter. Scan_ID is the result from the prefix sum on the flag array.
Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:
const int compact_threads = 256;
__global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int gID = blockIdx.x*blockDim.x + threadIdx.x; //global ID
int tID = threadIdx.x; //thread ID within block
__shared__ float row[4*compact_threads];
__shared__ int start_index[1];
__shared__ int end_index[1];
float4 myResult;
int st_index;
int4 myFlag;
int4 index;
for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
if(tID == 0){
index = reinterpret_cast<const int4*>(scan_Index)[id];
myFlag = reinterpret_cast<const int4*>(flag)[id];
start_index[0] = index.x;
st_index = index.x;
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[0] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
}
__syncthreads();
if(tID > 0){
myFlag = reinterpret_cast<const int4*>(flag)[id];
st_index = start_index[0];
index = reinterpret_cast<const int4*>(scan_Index)[id];
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[index.x-st_index] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
}
__syncthreads();
int count = end_index[0] - st_index;
int rem = st_index & 0x3; //equivalent to modulo 4
int offset = 0;
if(rem){ offset = 4 - rem; }
if(tID < offset && tID < count){
new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
}
int tempID = 4*tID+offset;
if((tempID+3) < count){
reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
}
tempID = tID + offset + (count-offset)/4*4;
if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
}
int id = gID + freq_Index/4 * 4;
if(id < freq_Index){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.
But regardless, neither method (shared memory or vectorization) together or alone improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if maybe it helped because it increased Instruction-Level-Parallelism in the calculation steps of those other functions rather than the fewer memory ops.
I found the algorithm mentioned in this poster (similar algorithm also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, can simply sum the elements as they are calculated, 2) giving the function the ability to work over more than just arrays sized as a multiple of 1024 + adding grid-strided loops, and 3) allowing each thread to load their registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for Inclusive sum for faster scans. There may be more tweaks I can make, but for now this is good.
//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int mask;
unsigned int cnt=0;//;//
for(int j = 0; j < 32; j++){
int index = (warpID<<10)+(j<<5)+lnID;
bool pred;
if(index > true_length) pred = false;
else pred = predicate(input[index]);
mask = __ballot(pred);
if(lnID == 0) {
flag[(warpID<<5)+j] = mask;
cnt += __popc(mask);
}
}
if(lnID == 0) counter[warpID] = cnt; //store sum
}
//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array
//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int predmask;
unsigned int cnt;
predmask = flag[(warpID<<5)+lnID];
cnt = __popc(predmask);
//parallel prefix sum
#pragma unroll
for(int offset = 1; offset < 32; offset<<=1){
unsigned int n = __shfl_up(cnt, offset);
if(lnID >= offset) cnt += n;
}
unsigned int global_index = 0;
if(warpID > 0) global_index = scan_Index[warpID - 1];
for(int i = 0; i < 32; i++){
unsigned int mask = __shfl(predmask, i); //broadcast from thread i
unsigned int sub_group_index = 0;
if(i > 0) sub_group_index = __shfl(cnt, i-1);
if(mask & (1 << lnID)){
compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
}
}
}
}
EDIT: There is a newer article by a subset of the poster authors where they examine a faster variation of compact than what is written above. However, their new version is not order preserving, so not useful for myself and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.

how to avoid number repeation by using random class in c#?

hi i am using Random class for getting random numbers but my requirement is once it generate one no that should not be repeate again pls help me.
Keep a list of the generated numbers and check this list before returning the next random.
Since you have not specified a language, I'll use C#
List<int> generated = new List<int>;
public int Next()
{
int r;
do { r = Random.Next() } while generated.Contains(r);
generated.Add(r);
return r;
}
The following C# code shows how to obtain 7 random cards with no duplicates. It is the most efficient method to use when your random number range is between 1 and 64 and are integers:
ulong Card, SevenCardHand;
int CardLoop;
const int CardsInDeck = 52;
Random RandObj = new Random(Seed);
for (CardLoop = 0; CardLoop < 7; CardLoop++)
{
do
{
Card = (1UL << RandObj.Next(CardsInDeck));
} while ((SevenCardHand & Card) != 0);
SevenCardHand |= Card;
}
If the random number range is greater than 64, then the next most efficient way to get random numbers without any duplicates is as follows from this C# code:
const int MaxNums = 1000;
int[] OutBuf = new int[MaxNums];
int MaxInt = 250000; // Reps the largest random number that should be returned.
int Loop, Val;
// Init the OutBuf with random numbers between 1 and MaxInt, which is 250,000.
BitArray BA = new BitArray(MaxInt + 1);
for (Loop = 0; Loop < MaxNums; Loop++)
{
// Avoid duplicate numbers.
for (; ; )
{
Val = RandObj.Next(MaxInt + 1);
if (BA.Get(Val))
continue;
OutBuf[Loop] = Val;
BA.Set(Val, true);
break;
}
}
The drawback with this technique is that it tends to use more memory, but it should be significantly faster than other approaches since it does not have to look through a large container each time a random number is obtained.

Find local maxima in grayscale image using OpenCV

Does anybody know how to find the local maxima in a grayscale IPL_DEPTH_8U image using OpenCV? HarrisCorner mentions something like that but I'm actually not interested in corners ...
Thanks!
A pixel is considered a local maximum if it is equal to the maximum value in a 'local' neighborhood. The function below captures this property in two lines of code.
To deal with pixels on 'plateaus' (value equal to their neighborhood) one can use the local minimum property, since plateaus pixels are equal to their local minimum. The rest of the code filters out those pixels.
void non_maxima_suppression(const cv::Mat& image, cv::Mat& mask, bool remove_plateaus) {
// find pixels that are equal to the local neighborhood not maximum (including 'plateaus')
cv::dilate(image, mask, cv::Mat());
cv::compare(image, mask, mask, cv::CMP_GE);
// optionally filter out pixels that are equal to the local minimum ('plateaus')
if (remove_plateaus) {
cv::Mat non_plateau_mask;
cv::erode(image, non_plateau_mask, cv::Mat());
cv::compare(image, non_plateau_mask, non_plateau_mask, cv::CMP_GT);
cv::bitwise_and(mask, non_plateau_mask, mask);
}
}
Here's a simple trick. The idea is to dilate with a kernel that contains a hole in the center. After the dilate operation, each pixel is replaced with the maximum of it's neighbors (using a 5 by 5 neighborhood in this example), excluding the original pixel.
Mat1b kernelLM(Size(5, 5), 1u);
kernelLM.at<uchar>(2, 2) = 0u;
Mat imageLM;
dilate(image, imageLM, kernelLM);
Mat1b localMaxima = (image > imageLM);
Actually after I posted the code above I wrote a better and very very faster one ..
The code above suffers even for a 640x480 picture..
I optimized it and now it is very very fast even for 1600x1200 pic.
Here is the code :
void localMaxima(cv::Mat src,cv::Mat &dst,int squareSize)
{
if (squareSize==0)
{
dst = src.clone();
return;
}
Mat m0;
dst = src.clone();
Point maxLoc(0,0);
//1.Be sure to have at least 3x3 for at least looking at 1 pixel close neighbours
// Also the window must be <odd>x<odd>
SANITYCHECK(squareSize,3,1);
int sqrCenter = (squareSize-1)/2;
//2.Create the localWindow mask to get things done faster
// When we find a local maxima we will multiply the subwindow with this MASK
// So that we will not search for those 0 values again and again
Mat localWindowMask = Mat::zeros(Size(squareSize,squareSize),CV_8U);//boolean
localWindowMask.at<unsigned char>(sqrCenter,sqrCenter)=1;
//3.Find the threshold value to threshold the image
//this function here returns the peak of histogram of picture
//the picture is a thresholded picture it will have a lot of zero values in it
//so that the second boolean variable says :
// (boolean) ? "return peak even if it is at 0" : "return peak discarding 0"
int thrshld = maxUsedValInHistogramData(dst,false);
threshold(dst,m0,thrshld,1,THRESH_BINARY);
//4.Now delete all thresholded values from picture
dst = dst.mul(m0);
//put the src in the middle of the big array
for (int row=sqrCenter;row<dst.size().height-sqrCenter;row++)
for (int col=sqrCenter;col<dst.size().width-sqrCenter;col++)
{
//1.if the value is zero it can not be a local maxima
if (dst.at<unsigned char>(row,col)==0)
continue;
//2.the value at (row,col) is not 0 so it can be a local maxima point
m0 = dst.colRange(col-sqrCenter,col+sqrCenter+1).rowRange(row-sqrCenter,row+sqrCenter+1);
minMaxLoc(m0,NULL,NULL,NULL,&maxLoc);
//if the maximum location of this subWindow is at center
//it means we found the local maxima
//so we should delete the surrounding values which lies in the subWindow area
//hence we will not try to find if a point is at localMaxima when already found a neighbour was
if ((maxLoc.x==sqrCenter)&&(maxLoc.y==sqrCenter))
{
m0 = m0.mul(localWindowMask);
//we can skip the values that we already made 0 by the above function
col+=sqrCenter;
}
}
}
The following listing is a function similar to Matlab's "imregionalmax". It looks for at most nLocMax local maxima above threshold, where the found local maxima are at least minDistBtwLocMax pixels apart. It returns the actual number of local maxima found. Notice that it uses OpenCV's minMaxLoc to find global maxima. It is "opencv-self-contained" except for the (easy to implement) function vdist, which computes the (euclidian) distance between points (r,c) and (row,col).
input is one-channel CV_32F matrix, and locations is nLocMax (rows) by 2 (columns) CV_32S matrix.
int imregionalmax(Mat input, int nLocMax, float threshold, float minDistBtwLocMax, Mat locations)
{
Mat scratch = input.clone();
int nFoundLocMax = 0;
for (int i = 0; i < nLocMax; i++) {
Point location;
double maxVal;
minMaxLoc(scratch, NULL, &maxVal, NULL, &location);
if (maxVal > threshold) {
nFoundLocMax += 1;
int row = location.y;
int col = location.x;
locations.at<int>(i,0) = row;
locations.at<int>(i,1) = col;
int r0 = (row-minDistBtwLocMax > -1 ? row-minDistBtwLocMax : 0);
int r1 = (row+minDistBtwLocMax < scratch.rows ? row+minDistBtwLocMax : scratch.rows-1);
int c0 = (col-minDistBtwLocMax > -1 ? col-minDistBtwLocMax : 0);
int c1 = (col+minDistBtwLocMax < scratch.cols ? col+minDistBtwLocMax : scratch.cols-1);
for (int r = r0; r <= r1; r++) {
for (int c = c0; c <= c1; c++) {
if (vdist(Point2DMake(r, c),Point2DMake(row, col)) <= minDistBtwLocMax) {
scratch.at<float>(r,c) = 0.0;
}
}
}
} else {
break;
}
}
return nFoundLocMax;
}
The first question to answer would be what is "local" in your opinion. The answer may well be a square window (say 3x3 or 5x5) or circular window of a certain radius. You can then scan over the entire image with the window centered at each pixel and pick the highest value in the window.
See this for how to access pixel values in OpenCV.
This is very fast method. It stored founded maxima in a vector of
Points.
vector <Point> GetLocalMaxima(const cv::Mat Src,int MatchingSize, int Threshold, int GaussKernel )
{
vector <Point> vMaxLoc(0);
if ((MatchingSize % 2 == 0) || (GaussKernel % 2 == 0)) // MatchingSize and GaussKernel have to be "odd" and > 0
{
return vMaxLoc;
}
vMaxLoc.reserve(100); // Reserve place for fast access
Mat ProcessImg = Src.clone();
int W = Src.cols;
int H = Src.rows;
int SearchWidth = W - MatchingSize;
int SearchHeight = H - MatchingSize;
int MatchingSquareCenter = MatchingSize/2;
if(GaussKernel > 1) // If You need a smoothing
{
GaussianBlur(ProcessImg,ProcessImg,Size(GaussKernel,GaussKernel),0,0,4);
}
uchar* pProcess = (uchar *) ProcessImg.data; // The pointer to image Data
int Shift = MatchingSquareCenter * ( W + 1);
int k = 0;
for(int y=0; y < SearchHeight; ++y)
{
int m = k + Shift;
for(int x=0;x < SearchWidth ; ++x)
{
if (pProcess[m++] >= Threshold)
{
Point LocMax;
Mat mROI(ProcessImg, Rect(x,y,MatchingSize,MatchingSize));
minMaxLoc(mROI,NULL,NULL,NULL,&LocMax);
if (LocMax.x == MatchingSquareCenter && LocMax.y == MatchingSquareCenter)
{
vMaxLoc.push_back(Point( x+LocMax.x,y + LocMax.y ));
// imshow("W1",mROI);cvWaitKey(0); //For gebug
}
}
}
k += W;
}
return vMaxLoc;
}
Found a simple solution.
In this example, if you are trying to find 2 results of a matchTemplate function with a minimum distance from each other.
cv::Mat result;
matchTemplate(search, target, result, CV_TM_SQDIFF_NORMED);
float score1;
cv::Point displacement1 = MinMax(result, score1);
cv::circle(result, cv::Point(displacement1.x+result.cols/2 , displacement1.y+result.rows/2), 10, cv::Scalar(0), CV_FILLED, 8, 0);
float score2;
cv::Point displacement2 = MinMax(result, score2);
where
cv::Point MinMax(cv::Mat &result, float &score)
{
double minVal, maxVal;
cv::Point minLoc, maxLoc, matchLoc;
minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc, cv::Mat());
matchLoc.x = minLoc.x - result.cols/2;
matchLoc.y = minLoc.y - result.rows/2;
return minVal;
}
The process is:
Find global Minimum using minMaxLoc
Draw a filled white circle around global minimum using min distance between minima as radius
Find another minimum
The the scores can be compared to each other to determine, for example, the certainty of the match,
To find more than just the global minimum and maximum try using this function from skimage:
http://scikit-image.org/docs/dev/api/skimage.feature.html#skimage.feature.peak_local_max
You can parameterize the minimum distance between peaks, too. And more. To find minima, use negated values (take care of the array type though, 255-image could do the trick).
You can go over each pixel and test if it is a local maxima. Here is how I would do it.
The input is assumed to be type CV_32FC1
#include <vector>//std::vector
#include <algorithm>//std::sort
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/core/core.hpp"
//structure for maximal values including position
struct SRegionalMaxPoint
{
SRegionalMaxPoint():
values(-FLT_MAX),
row(-1),
col(-1)
{}
float values;
int row;
int col;
//ascending order
bool operator()(const SRegionalMaxPoint& a, const SRegionalMaxPoint& b)
{
return a.values < b.values;
}
};
//checks if pixel is local max
bool isRegionalMax(const float* im_ptr, const int& cols )
{
float center = *im_ptr;
bool is_regional_max = true;
im_ptr -= (cols + 1);
for (int ii = 0; ii < 3; ++ii, im_ptr+= (cols-3))
{
for (int jj = 0; jj < 3; ++jj, im_ptr++)
{
if (ii != 1 || jj != 1)
{
is_regional_max &= (center > *im_ptr);
}
}
}
return is_regional_max;
}
void imregionalmax(
const cv::Mat& input,
std::vector<SRegionalMaxPoint>& buffer)
{
//find local max - top maxima
static const int margin = 1;
const int rows = input.rows;
const int cols = input.cols;
for (int i = margin; i < rows - margin; ++i)
{
const float* im_ptr = input.ptr<float>(i, margin);
for (int j = margin; j < cols - margin; ++j, im_ptr++)
{
//Check if pixel is local maximum
if ( isRegionalMax(im_ptr, cols ) )
{
cv::Rect roi = cv::Rect(j - margin, i - margin, 3, 3);
cv::Mat subMat = input(roi);
float val = *im_ptr;
//replace smallest value in buffer
if ( val > buffer[0].values )
{
buffer[0].values = val;
buffer[0].row = i;
buffer[0].col = j;
std::sort(buffer.begin(), buffer.end(), SRegionalMaxPoint());
}
}
}
}
}
For testing the code you can try this:
cv::Mat temp = cv::Mat::zeros(15, 15, CV_32FC1);
temp.at<float>(7, 7) = 1;
temp.at<float>(3, 5) = 6;
temp.at<float>(8, 10) = 4;
temp.at<float>(11, 13) = 7;
temp.at<float>(10, 3) = 8;
temp.at<float>(7, 13) = 3;
vector<SRegionalMaxPoint> buffer_(5);
imregionalmax(temp, buffer_);
cv::Mat debug;
cv::cvtColor(temp, debug, cv::COLOR_GRAY2BGR);
for (auto it = buffer_.begin(); it != buffer_.end(); ++it)
{
circle(debug, cv::Point(it->col, it->row), 1, cv::Scalar(0, 255, 0));
}
This solution does not take plateaus into account so it is not exactly the same as matlab's imregionalmax()
I think you want to use the
MinMaxLoc(arr, mask=NULL)-> (minVal, maxVal, minLoc, maxLoc)
Finds global minimum and maximum in array or subarray
function on you image

Resources