I have an image, read using "cv::imread". I have to flatten it so that I can use CUDA on the GPU to accelerate my image processing algorithms.
My problem: when I read my image, I can show it correctly using imshow. However, when I flatten it and wrap the result in a Mat object for imshow, only part of my image is displayed. The size of the output image is also wrong, so some data really is lost. What's the problem with my for loop?
// The problematic part of my code
// The Camera Man gray test image
const char* img_gray_name = "../../Test_Images/cameraman.tiff";
const char* img_blur_name = "../cameraman-blur.tiff";
const char* image_general_name = "cameraman_blur";
cv::Mat img = cv::imread(img_gray_name);
unsigned long int img_gray_size = img.rows * img.cols * sizeof(uchar);
uchar *h_img_in; // input image, converted to a flat array to be
                 // processed by GPU
h_img_in = (uchar *)malloc(img_gray_size);
//*************** The bug should be here! ***************//
for (int i = 0; i < img.rows; ++i) {
    for (int j = 0; j < img.cols; ++j) {
        h_img_in[i*img.cols+j] = img.at<uchar>(i, j);
    }
}
cv::Mat img_test;
img_test = cv::Mat(cv::Size(img.cols, img.rows), CV_8U, h_img_in);
cv::imwrite(img_blur_name, img_test);
// create image window named "camera man"
// show the image on window
cv::imshow(image_general_name, img_test);
P.S.: I also tested with a new 2D array instead of the 1D h_img_in, and the result is the same. This means that something goes wrong with my usage of "img.at(i, j)".
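One thing worth checking, offered as a likely cause rather than a confirmed diagnosis: cv::imread called without a flag returns a 3-channel BGR image by default, so img.at<uchar>(i, j) walks over only the first img.cols bytes of each interleaved row. Passing cv::IMREAD_GRAYSCALE to cv::imread would give a single-channel image instead. The sketch below uses no OpenCV at all; flattenChannel and its parameters are hypothetical names that only illustrate how an interleaved, row-major buffer has to be indexed when one channel is copied into a flat array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an OpenCV function): copies one channel of a
// row-major, interleaved image (B,G,R,B,G,R,...) into a flat array.
// `step` is the number of bytes per row, which may include padding.
std::vector<std::uint8_t> flattenChannel(const std::uint8_t* data,
                                         int rows, int cols, int channels,
                                         std::size_t step, int channel)
{
    std::vector<std::uint8_t> flat(static_cast<std::size_t>(rows) * cols);
    for (int i = 0; i < rows; ++i) {
        const std::uint8_t* row = data + i * step;
        for (int j = 0; j < cols; ++j) {
            // Pixel j starts at byte j * channels, not at byte j.
            flat[static_cast<std::size_t>(i) * cols + j] =
                row[j * channels + channel];
        }
    }
    return flat;
}
```

With channels set to 1 this reduces to the loop in the question, which is why that loop is only correct for a single-channel image.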
I have a skeleton audio app which uses kAudioUnitSubType_HALOutput to play audio via an AURenderCallback. I'm generating a simple pure tone just to test things out, but the tone changes pitch noticeably from time to time, sometimes drifting up or down and sometimes changing rapidly. It can be up to a couple of tones out at ~500Hz. Here's the callback:
static OSStatus outputCallback(void *inRefCon, AudioUnitRenderActionFlags *ioActionFlags,
                               const AudioTimeStamp *inTimeStamp, UInt32 inOutputBusNumber,
                               UInt32 inNumberFrames, AudioBufferList *ioData) {
    static const float frequency = 1000;
    static const float rate = 48000;
    static float phase = 0;
    SInt16 *buffer = (SInt16 *)ioData->mBuffers[0].mData;
    for (int s = 0; s < inNumberFrames; s++) {
        buffer[s] = (SInt16)(sinf(phase) * INT16_MAX);
        phase += 2.0 * M_PI * frequency / rate;
    }
    return noErr;
}
I understand that audio devices drift over time (especially cheap ones like the built-in IO), but this is a lot of drift — it's unusable for music. Any ideas?
Recording http://files.danhalliday.com/stackoverflow/audio.png
You're never resetting phase, so its value will increase indefinitely. Since it's stored in a floating-point type, the precision of the stored value will be degraded as the value increases. This is probably the cause of the frequency variations you're describing.
Adding the following lines to the body of the for() loop should significantly mitigate the problem:
if (phase > 2.0 * M_PI)
phase -= 2.0 * M_PI;
Changing the type of phase from float to double will also help significantly.
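To make both suggestions concrete, here is a minimal sketch of a phase accumulator that keeps the phase in a double and wraps it into [0, 2π) after every sample. It is plain C++ with no Core Audio calls; renderTone and its parameters are hypothetical names, and the caller is assumed to keep the phase variable alive between buffers, as the static variable does in the callback above.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: renders `frames` samples of a sine tone.
// `phase` is carried across calls by the caller and always stays in [0, 2*pi).
std::vector<std::int16_t> renderTone(double frequency, double sampleRate,
                                     std::size_t frames, double& phase)
{
    const double twoPi = 2.0 * std::acos(-1.0);
    const double increment = twoPi * frequency / sampleRate;
    std::vector<std::int16_t> out(frames);
    for (std::size_t s = 0; s < frames; ++s) {
        out[s] = static_cast<std::int16_t>(std::sin(phase) * INT16_MAX);
        phase += increment;
        // Wrap so the stored value never grows and never loses precision.
        if (phase >= twoPi) {
            phase -= twoPi;
        }
    }
    return out;
}
```

Because the stored phase never exceeds 2π, the spacing between representable values stays constant no matter how long the tone plays.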
I have a performance problem when using LDS memory with an AMD Radeon HD 6850.
I have two kernels as parts of an N-particle simulation. Each work-item has to calculate the force acting on its corresponding particle based on its position relative to the other particles. The problematic kernel is:
//Verlet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
                              __global const float4 *pos,
                              __global float4 *vel,
                              __global float4 *accel,
                              __local float4 *pblock,
                              const float bound)
{
    const int gid = get_global_id(0); //global id of work item
    const int id = get_local_id(0); //local id of work item within work group
    const int s_wg = get_local_size(0); //work group size
    const int n_wg = get_num_groups(0); //number of work groups
    const float4 myPos = pos[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;
    for (int jw = 0; jw < n_wg; ++jw)
    {
        pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
        barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
        for (int i = 0; i < s_wg; )
        {
            #pragma unroll UNROLL_FACTOR
            for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
            {
                float4 r = myPos - pblock[i];
                float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
                float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
                acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
            }
        }
    }
    acc *= 24.0f / myPos.w;
    //update velocity only
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
    //write to global memory
    vel[gid] = newVel;
    accel[gid] = acc;
}
The simulation produces correct results, but performance suffers when I use local memory to cache the particle positions, which was meant to reduce the large number of global memory reads. In fact, if the line
float4 r = myPos - pblock[i];
is replaced by
float4 r = myPos - pos[jw * s_wg + i];
the kernel runs faster. I don't really get that since reading from global should be much slower than reading from local.
Moreover, when the line
float4 r = myPos - pblock[i];
is removed completely and all following occurrences of r are replaced by myPos - pblock[i], the speed is the same as before, as if the line had never been there. This puzzles me even more, since accessing r in private memory should be the fastest option, yet the compiler somehow "optimizes" this line out.
Global work size is 4608, local worksize is 192. It is run with AMD APP SDK v2.9 and Catalyst drivers 13.12 in Ubuntu 12.04.
Can anyone please help me with this? Is that my fault or is that a problem of the GPU / drivers / ... ? Or is it a feature? :-)
I'm gonna make a wild guess:
When using float4 r = myPos - pos[jw * s_wg + i]; the compiler is smart enough to notice that the barrier put after the initialization of pblock[id] is not necessary anymore and remove it. Very likely all these barriers (in the for loop) impact your performances, so removing them is very noticeable.
Yes, global access costs a lot too, so I'm guessing that the cache is being used well behind the scenes. You also use vector types, and the AMD Radeon HD 6850 architecture uses VLIW processors; maybe that also helps make better use of the cache... maybe.
I've just found an article benchmarking GPU/APU cache and memory latencies. Your GPU is in the list. You might get some more answers there (sorry, I didn't really read it, too tired).
After some more digging it turned out that the code causes LDS bank conflicts. On AMD hardware there are 32 banks, each 4 bytes wide, but a float4 covers 16 bytes, so the work-items of a half-wavefront access different addresses in the same banks. The solution was to use separate __local float* arrays for the x and y coordinates and to read them separately, with the array index shifted as (id + i) % s_wg. Nevertheless, the overall performance gain is small, most likely because of the latencies described in the link provided by @CaptainObvious (one then has to increase the global work size to hide them).
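The bank arithmetic behind this explanation can be sketched on the host. The model below is a deliberate simplification: it assumes bank = (byte address / 4) % 32, assumes work-item k reads element k of the local array, and looks only at the first 4-byte word of each element. distinctBanks is a hypothetical helper, and real hardware schedules 128-bit reads in its own way, so treat the numbers as an illustration of the stride effect rather than a hardware measurement.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper: counts how many different banks are touched when
// `workItems` consecutive work-items each read one element of size
// `elementBytes` (work-item k reads element k).
std::size_t distinctBanks(std::size_t workItems, std::size_t elementBytes,
                          std::size_t numBanks = 32,
                          std::size_t bankWidthBytes = 4)
{
    std::set<std::size_t> banks;
    for (std::size_t k = 0; k < workItems; ++k) {
        const std::size_t byteAddress = k * elementBytes;
        banks.insert((byteAddress / bankWidthBytes) % numBanks);
    }
    return banks.size();
}
```

In this model distinctBanks(32, 4) is 32, so an array of float spreads 32 work-items over all banks, while distinctBanks(32, 16) is 8, so an array of float4 squeezes the same work-items into 8 banks. That is the motivation for splitting x and y into separate float arrays.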
I'm trying to detect an object using cvblob, so I use the cvRenderBlob() method. The program compiles successfully, but at run time it throws an unhandled exception. When I break into the debugger, the arrow points to the statement CvLabel *labels = (CvLabel *)imgLabel->imageData + imgLabel_offset + (blob->miny * stepLbl); in the cvRenderBlob() definition in cvblob.cpp. If I use cvRenderBlobs() instead, it works fine. I need to detect only one blob, the largest one. Could someone please help me handle this exception?
Here is my VC++ code:
CvCapture* capture = 0;
IplImage* frame = 0;
int key = 0;
CvBlobs blobs;
CvBlob *blob;
capture = cvCaptureFromCAM(0);
if (!capture) {
    printf("Could not initialize capturing....\n");
    return 1;
}
int screenx = GetSystemMetrics(SM_CXSCREEN);
int screeny = GetSystemMetrics(SM_CYSCREEN);
while (key!='q') {
    frame = cvQueryFrame(capture);
    if (!frame) break;
    IplImage* imgHSV = cvCreateImage(cvGetSize(frame), 8, 3);
    cvCvtColor(frame, imgHSV, CV_BGR2HSV);
    IplImage* imgThreshed = cvCreateImage(cvGetSize(frame), 8, 1);
    cvInRangeS(imgHSV, cvScalar(61, 156, 205),cvScalar(161, 256, 305), imgThreshed); // for light blue color
    IplImage* imgThresh = imgThreshed;
    cvSmooth(imgThresh, imgThresh, CV_GAUSSIAN, 9, 9);
    cvShowImage("Thresh", imgThresh);
    IplImage* labelImg = cvCreateImage(cvGetSize(imgHSV), IPL_DEPTH_LABEL, 1);
    unsigned int result = cvLabel(imgThresh, labelImg, blobs);
    blob = blobs[cvGreaterBlob(blobs)];
    cvRenderBlob(labelImg, blob, frame, frame);
    /*cvRenderBlobs(labelImg, blobs, frame, frame);*/
    /*cvFilterByArea(blobs, 60, 500);*/
    cvFilterByLabel(blobs, cvGreaterBlob(blobs));
    cvShowImage("Video", frame);
    key = cvWaitKey(1);
}
First off, I'd like to point out that you are actually using the old C API. The C++ API uses the class Mat. I've been working on some blob extraction based on green objects in the picture. Once the image is thresholded properly, which means we have a "binary" background/foreground image, I use
findContours() //this function expects quite a bit, read documentation
It is described more clearly in the documentation on structural analysis. It gives you the contours of all the blobs in the image, as a vector of vectors of image points, like so
vector<vector<Point>> contours;
I too need to find the biggest blob, and though my approach may be faulty to some extent, I don't need anything more precise. I use
minAreaRect() // expects a set of points (contained in the vector or Mat classes)
which is also described under structural analysis.
Then access the size of the rect
int sizeOfObject = 0;
int idxBiggestObject = 0; // will track the biggest object
if (contours.size() != 0) // only runs if there are any blobs / contours in the image
{
    for (int i = 0; i < contours.size(); i++) // runs once per "blob" in the image
    {
        RotatedRect rect = minAreaRect(contours[i]);
        if (rect.size.area() > sizeOfObject)
        {
            sizeOfObject = rect.size.area(); // saves area to compare with further blobs
            idxBiggestObject = i; // saves index, so you know which is biggest; alternatively, .push_back into another vector
        }
    }
}
So okay, we really only measure a rotated bounding box, but in most cases it will do. I hope that you will either switch to the C++ syntax or get some inspiration from the basic algorithm.
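If the bounding box turns out to be too coarse, one alternative (a different technique from the one above) is to compare the area enclosed by each contour; in OpenCV that is what cv::contourArea computes. The sketch below is a library-free stand-in that illustrates the same selection loop. Pt, polygonArea and indexOfLargest are hypothetical names, and the area is computed with the shoelace formula on a plain vector of points.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an image point.
struct Pt { double x; double y; };

// Area enclosed by a closed polygon, using the shoelace formula.
double polygonArea(const std::vector<Pt>& contour)
{
    double twiceArea = 0.0;
    const std::size_t n = contour.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Pt& a = contour[i];
        const Pt& b = contour[(i + 1) % n];
        twiceArea += a.x * b.y - b.x * a.y;
    }
    return std::fabs(twiceArea) / 2.0;
}

// Index of the contour with the largest enclosed area, or -1 if there is none.
int indexOfLargest(const std::vector<std::vector<Pt>>& contours)
{
    int best = -1;
    double bestArea = -1.0;
    for (std::size_t i = 0; i < contours.size(); ++i) {
        const double area = polygonArea(contours[i]);
        if (area > bestArea) {
            bestArea = area;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Returning -1 for an empty list makes the "no blobs in this frame" case explicit, which matters when the result is used to index into the contour list.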
Is there a good way of displaying Unicode text in OpenGL under Windows? For example, when you have to deal with different languages. The most common approach, like
GLuint list;
list = glGenLists(FONTLISTRANGE);
wglUseFontBitmapsW(hDC, 0, FONTLISTRANGE, list);
just won't do because you can't create enough lists for all Unicode characters.
You should also check out the FTGL library.
FTGL is a free cross-platform Open Source C++ library that uses Freetype2 to simplify rendering fonts in OpenGL applications. FTGL supports bitmaps, pixmaps, texture maps, outlines, polygon mesh, and extruded polygon rendering modes.
This project was dormant for a while, but has recently come back under development. I haven't updated my project to use the latest version, but you should check it out.
It allows for using any True Type Font via the FreeType font library.
I recommend reading this OpenGL font tutorial. It's for the D programming language but it's a nice introduction to various issues involved in implementing a glyph caching system for rendering text with OpenGL. The tutorial covers Unicode compliance, antialiasing, and kerning techniques.
D is pretty comprehensible to anyone who knows C++ and most of the article is about the general techniques, not the implementation language.
I'd recommend FTGL as already recommended above; however, I have implemented a FreeType/OpenGL renderer myself and thought you might find the code handy if you want to reinvent this wheel yourself. I'd really recommend FTGL though, it's a lot less hassle to use. :)
/*
 * glTextRender class by Semi Essessi
 * FreeType2 empowered text renderer
 */
#include "glTextRender.h"
#include "jEngine.h"
#include "glSystem.h"
#include "jMath.h"
#include "jProfiler.h"
#include "log.h"
#include <windows.h>
FT_Library glTextRender::ftLib = 0;
//TODO::maybe fix this so it use wchar_t for the filename
glTextRender::glTextRender(jEngine* j, const char* fontName, int size = 12)
#ifdef _DEBUG
jProfiler profiler = jProfiler(L"glTextRender::glTextRender");
char fontName2[1024];
#ifdef _DEBUG
wchar_t fn[128];
LogWriteLine(L"\x25CB\x25CB\x25CF Font: %s was requested before FreeType was initialised", fn);
// constructor code for glTextRender
gl = j->gl;
face = 0;
// remember that for some weird reason below font size 7 everything gets scrambled up
height = max(6,(int)floorf((float)size*((float)gl->getHeight())*0.001666667f));
aHeight = ((float)height)/((float)gl->getHeight());
// look in base fonts dir
if(FT_New_Face(ftLib, fontName2, 0, &face ))
// if we dont have it look in windows fonts dir
char buf[1024];
strcat(buf, "\\fonts\\");
strcat(buf, fontName);
if(FT_New_Face(ftLib, buf, 0, &face ))
//TODO::check in mod fonts directory
#ifdef _DEBUG
wchar_t fn[128];
LogWriteLine(L"\x25CB\x25CB\x25CF Request for font: %s has failed", fn);
face = 0;
// FreeType uses 64x size and 72dpi for default
// doubling size for ms
FT_Set_Char_Size(face, mulPow2(height,7), mulPow2(height,7), 96, 96);
// set up cache table and then generate the first 256 chars and the console prompt character
for(int i=0;i<65536;i++)
for(unsigned short i = 0; i < 256; i++) getChar((wchar_t)i);
#ifdef _DEBUG
wchar_t fn[128];
LogWriteLine(L"\x25CB\x25CB\x25CF Font: %s loaded OK", fn);
// destructor code for glTextRender
for(int i=0;i<65536;i++)
// TODO:: work out stupid freetype crashz0rs
static int foo = 0;
if(face && foo < 1)
face = 0;
face = 0;
// return true if init works, or if already initialised
bool glTextRender::initFreeType()
{
    if(!ftLib)
    {
        if(!FT_Init_FreeType(&ftLib)) return true;
        else return false;
    } else return true;
}
void glTextRender::shutdownFreeType()
{
    ftLib = 0;
}
void glTextRender::print(const wchar_t* str)
{
    // store old stuff to set start position
    // get viewport size
    GLint viewport[4];
    glGetIntegerv(GL_VIEWPORT, viewport);
    float color[4];
    glGetFloatv(GL_CURRENT_COLOR, color);
    // set blending for AA
    // call display lists to render text
    for(unsigned int i=0;i<wcslen(str);i++) glCallList(getChar(str[i]));
    // restore old states
}
void glTextRender::printf(const wchar_t* str, ...)
{
    if(!str) return;
    wchar_t* buf = 0;
    va_list parg;
    va_start(parg, str);
    // allocate buffer
    int len = (_vscwprintf(str, parg)+1);
    buf = new wchar_t[len];
    if(!buf) return;
    vswprintf(buf, str, parg);
    delete[] buf;
}
GLuint glTextRender::getChar(const wchar_t c)
int i = (int)c;
if(cached[i]) return listID[i];
// load glyph and get bitmap
if(FT_Load_Glyph(face, FT_Get_Char_Index(face, i), FT_LOAD_DEFAULT )) return 0;
FT_Glyph glyph;
if(FT_Get_Glyph(face->glyph, &glyph)) return 0;
FT_Glyph_To_Bitmap(&glyph, FT_RENDER_MODE_NORMAL, 0, 1);
FT_BitmapGlyph bitmapGlyph = (FT_BitmapGlyph)glyph;
FT_Bitmap& bitmap = bitmapGlyph->bitmap;
int w = roundPow2(bitmap.width);
int h = roundPow2(bitmap.rows);
// convert to texture in memory
GLubyte* texture = new GLubyte[2*w*h];
for(int j=0;j<h;j++)
bool cond = j>=bitmap.rows;
for(int k=0;k<w;k++)
texture[2*(k+j*w)] = 0xFFu;
texture[2*(k+j*w)+1] = ((k>=bitmap.width)||cond) ? 0x0u : bitmap.buffer[k+bitmap.width*j];
// store char width and adjust max height
// note .5f
float ih = 1.0f/((float)gl->getHeight());
width[i] = ((float)divPow2(face->glyph->advance.x, 7))*ih;
aHeight = max(aHeight,(.5f*(float)bitmap.rows)*ih);
// create gl texture
glGenTextures(1, &(texID[i]));
glBindTexture(GL_TEXTURE_2D, texID[i]);
delete[] texture;
// create display list
listID[i] = glGenLists(1);
glNewList(listID[i], GL_COMPILE);
glBindTexture(GL_TEXTURE_2D, texID[i]);
// adjust position to account for texture padding
glTranslatef(.5f*(float)bitmapGlyph->left, 0.0f, 0.0f);
glTranslatef(0.0f, .5f*(float)(bitmapGlyph->top-bitmap.rows), 0.0f);
// work out texcoords
float tx=((float)bitmap.width)/((float)w);
float ty=((float)bitmap.rows)/((float)h);
// render
// note .5f
glTexCoord2f(0.0f, 0.0f);
glVertex2f(0.0f, .5f*(float)bitmap.rows);
glTexCoord2f(0.0f, ty);
glVertex2f(0.0f, 0.0f);
glTexCoord2f(tx, ty);
glVertex2f(.5f*(float)bitmap.width, 0.0f);
glTexCoord2f(tx, 0.0f);
glVertex2f(.5f*(float)bitmap.width, .5f*(float)bitmap.rows);
// move position for the next character
// note extra div 2
glTranslatef((float)divPow2(face->glyph->advance.x, 7), 0.0f, 0.0f);
// char is succesfully cached for next time
cached[i] = true;
return listID[i];
void glTextRender::setPosition(float x, float y)
{
    float fac = ((float)gl->getHeight());
    xPos = fac*x+FONT_BORDER_PIXELS; yPos = fac*(1-y)-(float)height-FONT_BORDER_PIXELS;
}
float glTextRender::getAdjustedWidth(const wchar_t* str)
{
    float w = 0.0f;
    for(unsigned int i=0;i<wcslen(str);i++)
    {
        if(cached[str[i]]) w+=width[str[i]];
    }
    return w;
}
You may have to generate your own "glyph cache" in texture memory as you go, potentially with some sort of LRU policy to avoid exhausting all of the texture memory. It's not nearly as easy as your current method, but it may be the only way given the number of Unicode characters.
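A minimal sketch of such an LRU glyph cache, assuming the actual texture upload and release are supplied by the caller as callbacks. GlyphCache, Handle, upload and release are hypothetical names; in a real renderer the callbacks would wrap whatever creates and deletes the glyph texture. No OpenGL calls are made here, so the eviction logic can be read on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical LRU cache mapping a code point to a texture handle.
class GlyphCache {
public:
    using Handle = std::uint32_t;

    GlyphCache(std::size_t capacity,
               std::function<Handle(char32_t)> upload,
               std::function<void(Handle)> release)
        : capacity_(capacity),
          upload_(std::move(upload)),
          release_(std::move(release)) {}

    // Returns the handle for `cp`, uploading the glyph on a cache miss and
    // evicting the least recently used glyph when the cache is full.
    Handle get(char32_t cp)
    {
        auto hit = index_.find(cp);
        if (hit != index_.end()) {
            // Move the entry to the front: it is now the most recently used.
            order_.splice(order_.begin(), order_, hit->second);
            return hit->second->second;
        }
        if (!order_.empty() && order_.size() >= capacity_) {
            const Entry& victim = order_.back();
            release_(victim.second);
            index_.erase(victim.first);
            order_.pop_back();
        }
        order_.emplace_front(cp, upload_(cp));
        index_[cp] = order_.begin();
        return order_.front().second;
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<char32_t, Handle>;

    std::size_t capacity_;
    std::function<Handle(char32_t)> upload_;
    std::function<void(Handle)> release_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<char32_t, std::list<Entry>::iterator> index_;
};
```

The list keeps the usage order and the map gives constant-time lookup, so both a hit and an eviction avoid scanning the cache. The capacity would be chosen from how much texture memory the application can spare for glyphs.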
You should consider using a Unicode rendering library (e.g. Pango) to render the text into a bitmap and put that bitmap on the screen or into a texture.
Rendering Unicode text is not simple, so you cannot simply load 64K rectangular glyphs and use them.
Characters may overlap. Eg in this smiley:
( ͡° ͜ʖ ͡°)
Some code points stack accents on the previous character. Consider this excerpt from this notable post:
...he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all
enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e
liquid pain, the song of re̸gular expression parsing will
extinguish the voices of mortal man from the sphere I can see it
can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of
the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes
he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no
NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s
͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳
TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘
If you truly want to render Unicode correctly you should be able to render this one correctly too.
UPDATE: I looked at the Pango engine, and it's a case of the banana, the gorilla, and the entire jungle. First, it depends on GLib because it uses GObjects; second, it cannot render directly into a byte buffer. It has Cairo and FreeType backends, so you must use one of them to render the text and eventually export it into bitmaps. That doesn't look good so far.
In addition, if you want to store the result in a texture, call pango_layout_get_pixel_extents after setting the text to get the sizes of the rectangles to render the text into. The ink rectangle is the rectangle that contains the entire text; its top-left position is relative to the top-left of the logical rectangle. (The bottom line of the logical rectangle is the baseline.) Hope this helps.
Queso GLC is great for this, I've used it to render Chinese and Cyrillic characters in 3D.
The Unicode text sample it comes with should get you started.
You could also group the characters by language. Load each language table as needed, and when you need to switch languages, unload the previous language table and load the new one.
Unicode is supported in the title bar. I have just tried this on a Mac, and it ought to work elsewhere too. If you have (say) some imported data including text labels, and some of the labels just might contain unicode, you could add a tool that echoes the label in the title bar.
It's not a great solution, but it is very easy to do.