OpenCL: possible reason a clGetEventInfo call would cause a segfault?

I have a pretty complicated OpenCL app. It fires up 5 different contexts on 5 different GPUs, and executes the same kernel on all of them, splitting up the work into 1024 "chunks" to be processed.
Each time a kernel finishes, a result is checked for and the GPU is given a new chunk. Sometimes, as the app is starting (very rarely mid-run), it will immediately segfault on the clGetEventInfo call.
This is done in a loop using callbacks and clGetEventInfo calls to ensure something is finished before moving on to the next step.
GDB output:
(gdb) back
#0 0x00007fdc686ab525 in clGetEventInfo () from /usr/lib/libOpenCL.so.1
#1 0x00000000004018c1 in ready (event=0x26a00000267) at gputest.c:165
#2 0x0000000000404b5a in main (argc=9, argv=0x7fffdfe3b268) at gputest.c:544
The ready function:
int ready(cl_event event) {
    int rdy;

    if (!event)
        return 0;

    clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &rdy, NULL);

    if (rdy == CL_COMPLETE)
        return 1;
    return 0;
}
Here is how the kernel is run, the event set, and checked (some pseudocode inserted for brevity):
while (test if loop is complete) {
    for (j = 0; j < GPUS; j++) {
        if (gpu[j].waiting && loops < 9999) {
            gpu[j].waiting = 0;
            offset[j] = loops * 1024 * 1024;
            loops++;
            EC("kernel init", clEnqueueNDRangeKernel(queues[j], kernel_init[j], 1, &(offset[j]), &global_work_size, &work128, 0, NULL, &events[j]));
            gpu[j].readsearch = events[j];
            gpu[j].reading = 1;
        }
    }
    for (j = 0; j < GPUS; j++) {
        if (gpu[j].reading && ready(gpu[j].readsearch)) {
            gpu[j].reading = 0;
            gpu[j].waiting = 1;
            // unrelated reporting and other code here
        }
    }
}
It's pretty simple. There is more to the code, but it's unrelated, and the ready/checking function is very simple. I even added debugging to the ready function to printf the event # to see what was happening when it crashed: nothing really, no pattern I could see.
What could be causing this?

Ugh. Found the problem. Since malloc does not initialize the memory it returns, I was using some values uninitialized: I malloc'ed the gpu structs and then just started using them. The if(gpu[x].reading && ...) check was reading random, completely uninitialized data, so sometimes it was non-zero, which allowed the ready() function to fire. Since the gpu[x].readsearch event was never set in the first place, clGetEventInfo bombed trying to use whatever happened to be at that memory location.
This would be time number 482,847 that accidentally using uninitialized variables has burned me.
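For what it's worth, here is a minimal sketch of the fix (the gpu_state layout is assumed from the fields used above, not taken from the real gputest.c): zero the structs at allocation time so the flags start at 0 and readsearch starts as NULL, which ready() already checks for.

/* Hypothetical sketch: zero-initialize the per-GPU state before use. */
#include <stdlib.h>
#include <CL/cl.h>

#define GPUS 5

struct gpu_state {            /* assumed layout, based on the fields referenced above */
    int      waiting;
    int      reading;
    cl_event readsearch;
};

int main(void)
{
    /* calloc zeroes the allocation, unlike malloc */
    struct gpu_state *gpu = (struct gpu_state *)calloc(GPUS, sizeof *gpu);
    if (!gpu) return 1;
    /* ... create contexts/queues, enqueue kernels, poll ready(gpu[j].readsearch) ... */
    free(gpu);
    return 0;
}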

Related

Why is this an infinite loop?

I have declared a map below using the STL and inserted some values in it.
#include <bits/stdc++.h>
using namespace std;

int main()
{
    map<int, int> m;
    m[1] = 1;
    m[2] = 1;
    m[3] = 1;
    m[4] = 1;
    m[5] = 1;
    m[6] = 1;
    for (auto it = m.begin(); it != m.end();)
    {
        cout << it->first << " " << it->second << endl;
        it = it++;
    }
    return 0;
}
When I executed the code written above, it ended up in an infinite loop. Can someone tell me why it does so?
I am incrementing the iterator it and then storing the result back in it, which should get incremented the next time the loop is executed, and eventually the loop should terminate normally. Am I wrong?
The bad line is it = it++;. It is undefined behavior! Because the order of evaluation is not defined, in your case it is incremented before the assignment back to itself, so the value that it had before the increment is assigned back to it, and the iterator stays stuck at the first position. The correct line would be it = ++it;, or simply ++it; / it++;, because the increment already changes it itself.
Edit
That is only undefined behavior with the built-in types; here, the behavior is defined by the implementation of the map iterator in the STL.
If you try doing something similar with an int, you'll get a warning:
int nums[] = { 1, 2, 3, 4, 5 };
for (int i = 0; i < sizeof nums / sizeof *nums; ) {
    cout << nums[i] << '\n';
    i = i++;
}
warning: operation on 'i' may be undefined [-Wsequence-point]
However, when you're using a class (std::map::iterator) which overloads these operators, the compiler probably isn't smart enough to detect this.
In other words, what you're doing is a sequence point violation, so the behavior is undefined.
The post-increment operation would behave like this:
iterator operator++(int) {
    auto copy = *this;
    ++*this;
    return copy;
}
So, what happens to your increment step is that iterator it would get overwritten by the copy of its original value. If the map isn't empty, your loop would remain stuck on the first element.
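For completeness, a minimal corrected version of the loop from the question (same map m as above) would be:

// Pre-increment the iterator (or just use it++ as the loop expression);
// do not assign the post-increment's return value back to it.
for (auto it = m.begin(); it != m.end(); ++it)
{
    cout << it->first << " " << it->second << endl;
}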

rw_semaphore's negative count value

I am debugging a kernel crash dump. There seems to be a problem with one process that was trying to memory-map a new region; the problem is that it was not able to acquire the memory-map semaphore.
When I looked into the process's mm_struct and printed its contents, I saw that the struct rw_semaphore mmap_sem was as shown below. Now, does the value of count seem suspicious? It has a negative value, as if there had been a race condition where it was decremented twice by two different threads after checking for zero.
mmap_sem = {
    count = -4294967295,
    wait_lock = {
        {
            rlock = {
                raw_lock = {
                    slock = 262148
                }
            }
        }
    },
    wait_list = {
        next = 0xffff8801f0113e48,
        prev = 0xffff8801f0113e48
    }
},
Sorry for the confusion. I thought crash pulls in the correct data types and uses them properly when printing out all the values...
It looks like the crash utility is not reading the count member as an int...
When I print it as an int, I get the correct value.
crash> p (int) (((struct mm_struct *) 0xffff8801f15fa540)->mmap_sem).count
$13 = 1
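To illustrate (my own example, not output from the dump): -4294967295 stored in a 64-bit word has the bit pattern 0xFFFFFFFF00000001, and its low 32 bits, read as an int, are 1, which matches what the cast above prints.

#include <cstdio>
#include <cstdint>

int main()
{
    // The raw word that crash printed as a signed 64-bit value.
    std::int64_t raw = -4294967295LL;                   // bit pattern 0xFFFFFFFF00000001
    std::int32_t low = static_cast<std::int32_t>(raw);  // keep only the low 32 bits
    std::printf("as signed 64-bit: %lld\n", static_cast<long long>(raw));
    std::printf("as 32-bit int:    %d\n", low);         // prints 1
    return 0;
}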

Issues with CAPlayThrough Example

I am trying to learn Xcode Core Audio and stumbled upon this example:
https://developer.apple.com/library/mac/samplecode/CAPlayThrough/Introduction/Intro.html#//apple_ref/doc/uid/DTS10004443
My intention is to capture the raw audio. Every time I hit a breakpoint, I lose the audio, since it is using CARingBuffer.
How would you remove the time factor? I don't need real-time audio.
Since it is using CARingBuffer, shouldn't it keep writing to the same memory locations? So why don't I hear the audio when I stop at a breakpoint?
I am reading the Learning Core Audio book, but so far I cannot figure out this part of the code:
CARingBufferError CARingBuffer::Store(const AudioBufferList *abl, UInt32 framesToWrite, SampleTime startWrite)
{
    if (framesToWrite == 0)
        return kCARingBufferError_OK;

    if (framesToWrite > mCapacityFrames)
        return kCARingBufferError_TooMuch; // too big!

    SampleTime endWrite = startWrite + framesToWrite;

    if (startWrite < EndTime()) {
        // going backwards, throw everything out
        SetTimeBounds(startWrite, startWrite);
    } else if (endWrite - StartTime() <= mCapacityFrames) {
        // the buffer has not yet wrapped and will not need to
    } else {
        // advance the start time past the region we are about to overwrite
        SampleTime newStart = endWrite - mCapacityFrames; // one buffer of time behind where we're writing
        SampleTime newEnd = std::max(newStart, EndTime());
        SetTimeBounds(newStart, newEnd);
    }

    // write the new frames
    Byte **buffers = mBuffers;
    int nchannels = mNumberChannels;
    int offset0, offset1, nbytes;
    SampleTime curEnd = EndTime();

    if (startWrite > curEnd) {
        // we are skipping some samples, so zero the range we are skipping
        offset0 = FrameOffset(curEnd);
        offset1 = FrameOffset(startWrite);
        if (offset0 < offset1)
            ZeroRange(buffers, nchannels, offset0, offset1 - offset0);
        else {
            ZeroRange(buffers, nchannels, offset0, mCapacityBytes - offset0);
            ZeroRange(buffers, nchannels, 0, offset1);
        }
        offset0 = offset1;
    } else {
        offset0 = FrameOffset(startWrite);
    }

    offset1 = FrameOffset(endWrite);
    if (offset0 < offset1)
        StoreABL(buffers, offset0, abl, 0, offset1 - offset0);
    else {
        nbytes = mCapacityBytes - offset0;
        StoreABL(buffers, offset0, abl, 0, nbytes);
        StoreABL(buffers, 0, abl, nbytes, offset1);
    }

    // now update the end time
    SetTimeBounds(StartTime(), endWrite);

    return kCARingBufferError_OK; // success
}
Thanks!
If I understood the question well, the signal is lost while the input unit (producer) is halted on a breakpoint. I presume this may be the expected behavior. Core Audio is a pull-model engine running off a real-time thread. This means that when your producer hits a breakpoint, the ring buffer empties; the output unit (consumer) keeps on running but gets nothing from the buffer while the playthrough chain is interrupted, hence the silence.
Perhaps this code from the example is not really the simplest one to learn from: AFAICT it also zeroes the audio buffers if the ring buffer gets overrun or underrun.
The term "raw audio" in the question is also not self-explanatory; I'm not sure what it means. I would suggest first learning async I/O using simpler circular buffers (see the sketch below). There are a few of them (without obligatory time values) on GitHub.
Please also be so kind to format the source code for easier reading.
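For illustration, here is a minimal sketch of such a plain circular buffer (my own example, not from the Apple sample): single producer / single consumer, power-of-two capacity, no time values.

// Minimal single-producer/single-consumer ring buffer of samples.
#include <atomic>
#include <cstddef>
#include <vector>

class SimpleRingBuffer {
public:
    explicit SimpleRingBuffer(std::size_t capacityPow2)
        : mData(capacityPow2), mMask(capacityPow2 - 1) {}

    // Returns the number of samples actually written (fewer if the buffer is full).
    std::size_t Write(const float *src, std::size_t count) {
        std::size_t head = mHead.load(std::memory_order_relaxed);
        std::size_t tail = mTail.load(std::memory_order_acquire);
        std::size_t space = mData.size() - (head - tail);
        if (count > space) count = space;
        for (std::size_t i = 0; i < count; ++i)
            mData[(head + i) & mMask] = src[i];
        mHead.store(head + count, std::memory_order_release);
        return count;
    }

    // Returns the number of samples actually read (fewer if the buffer is empty).
    std::size_t Read(float *dst, std::size_t count) {
        std::size_t tail = mTail.load(std::memory_order_relaxed);
        std::size_t head = mHead.load(std::memory_order_acquire);
        std::size_t avail = head - tail;
        if (count > avail) count = avail;
        for (std::size_t i = 0; i < count; ++i)
            dst[i] = mData[(tail + i) & mMask];
        mTail.store(tail + count, std::memory_order_release);
        return count;
    }

private:
    std::vector<float> mData;
    std::size_t mMask;                 // capacity - 1 (capacity must be a power of two)
    std::atomic<std::size_t> mHead{0}; // total samples written
    std::atomic<std::size_t> mTail{0}; // total samples read
};

With something like this, a consumer that finds the buffer empty can simply output silence instead of reading stale data, which also makes the breakpoint behaviour easier to reason about.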

Cannot get OpenAL to play sound

I've searched the net, and I've searched here. I've found code that I could compile and that works fine, but for some reason my code won't produce any sound. I'm porting an old game to the PC (Windows), and I'm trying to make it as authentic as possible, so I want to use generated wave forms. I've pretty much copied and pasted the working code (only adding in multiple voices), and it still won't work, even though the exact same code for a single voice works fine. I know I'm missing something obvious, but I just cannot figure out what. Any help would be appreciated, thank you.
First, some notes. I was looking for something that would let me use the original methodology. The original system used paired bytes for music (sound effects, of which there are only two, were handled in code): a time byte that counted down every time the routine was called, and a note byte that was played until the time reached zero. This was done by patching into the interrupt vector; Windows doesn't allow that, so I set up a timer routine that accomplishes the same thing. The timer kicks in, updates the display, and then runs the music sequence. I set this up with a defined time so that I only have one place to adjust the timing (to get it as close as possible to the original sequence). The music is a generated wave form (I've double-checked the math and even examined the generated data in debug mode), and it looks good. The sequence looks good, but doesn't actually produce sound.

I tried SDL2 first, and its method of only playing one sound doesn't work for me; also, unless I make the sample duration extremely short (and the sound produced this way is awful), I can't match the timing (it plays the entire sample through its own interrupt without letting me make adjustments). Also, blending the 3 voices together (when they all run with different timings) is a mess. Most of the other engines I examined work in much the same way: they want to use their own callback interrupt and won't let me tweak it appropriately. This is why I started working with OpenAL. It allows multiple voices (sources) and lets me set the timings myself. On advice from several forums, I set it up so that the sample lengths are all multiples of full cycles.
Anyway, here's the code.
int main(int argc, char* argv[])
{
    FreeConsole();                  // Get rid of the DOS console, don't need it
    if (InitLog() < 0) return -1;   // Start logging
    UINT_PTR tim = NULL;
    SDL_Event event;
    InitVideo(false);               // Set to window for now, will put options in later
    curmusic = 5;
    InitAudio();
    SetTimer(NULL, tim, _FREQ_, TimerProc);
    SDL_PollEvent(&event);
    while (event.type != SDL_KEYDOWN) SDL_PollEvent(&event);
    SDL_Quit();
    return 0;
}

void CALLBACK TimerProc(HWND hWind, UINT Msg, UINT_PTR idEvent, DWORD dwTime)
{
    RenderOutput();
    PlayMusic();
    //UpdateTimer();
    //RotateGate();
    return;
}

void InitAudio(void)
{
    ALCdevice *dev;
    ALCcontext *cxt;
    Log("Initializing OpenAL Audio\r\n");
    dev = alcOpenDevice(NULL);
    if (!dev) {
        Log("Failed to open an audio device\r\n");
        exit(-1);
    }
    cxt = alcCreateContext(dev, NULL);
    alcMakeContextCurrent(cxt);
    if (!cxt) {
        Log("Failed to create audio context\r\n");
        exit(-1);
    }
    alGenBuffers(4, Buffer);
    if (alGetError() != AL_NO_ERROR) {
        Log("Error during buffer creation\r\n");
        exit(-1);
    }
    alGenSources(4, Source);
    if (alGetError() != AL_NO_ERROR) {
        Log("Error during source creation\r\n");
        exit(-1);
    }
    return;
}
void PlayMusic()
{
    static int oldsong, ofset, mtime[4];
    double freq;
    ALuint srate = 44100;
    ALuint voice, i, note, len, hold;
    short buf[4][_BUFFSIZE_];
    bool test[4] = {false, false, false, false};

    if (curmusic != oldsong) {
        oldsong = (int)curmusic;
        if (curmusic > 0)
            ofset = moffset[(curmusic - 1)];
        for (voice = 1; voice < 4; voice++)
            alSourceStop(Source[voice]);
        mtime[voice] = 0;
        return;
    }

    if (curmusic == 0) return;

    // Only 3 voices for music; the 4th is set aside for eventual sound effects
    for (voice = 0; voice < 3; voice++) {
        if (mtime[voice] == 0) {                              // Is the note finished?
            alSourceStop(Source[voice]);                      // It is, so stop the channel (source)
            mtime[voice] = music[ofset++];                    // Get the next duration
            if (mtime[voice] == 0) { oldsong = 0; return; }   // Zero marks the end, so restart
            note = music[ofset++];                            // Get the next note
            if (note > 127) {                                 // The old HW the data was designed for could only
                if (note == 255) note = 127;                  //   use values 128 - 255 (255 = 127)
                freq = (15980 / (voice + (int)(voice / 3))) / (256 - note);  // Freq of the note
                len = (ALuint)(srate / freq);                 // A single cycle of that freq
                hold = len;
                while (len < (srate / (1000 / _FREQ_))) len += hold;  // Multiply till 1 interrupt cycle
                while (len > _BUFFSIZE_) len -= hold;         // Don't overload the buffer
                if (len == 0) len = _BUFFSIZE_;               // Just to be safe
                for (i = 0; i < len; i++)                     // Calculate sine wave and put in buffer
                    buf[voice][i] = (short)((32760 * sin((2 * M_PI * i * freq) / srate)));
                alBufferData(Buffer[voice], AL_FORMAT_MONO16, buf[voice], len, srate);
                alSourcei(openAL.Source[i], AL_LOOPING, AL_TRUE);
                alSourcei(Source[i], AL_BUFFER, Buffer[i]);
                alSourcePlay(Source[voice]);
            }
        } else --mtime[voice];
    }
}
Well, it turns out there were three problems with my code. First, you have to fill the generated wave data into the AL buffer "before" you attach the buffer to the source:
alBufferData(buffer, AL_FORMAT_MONO16, &wave_sample, sample_length * sizeof(short), frequency);
alSourcei(source, AL_BUFFER, buffer);
Also, in the example above, I multiplied sample_length by how many bytes are in each sample (in this case sizeof(short)), since the size argument is in bytes rather than samples; that was the second problem.
The final problem was that you need to detach the buffer from the source before you change the buffer data:
alSourcei(source, AL_BUFFER, NULL);
The music would play, but not correctly, until I added that line to the note-change code.
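Putting the three fixes together, the per-note update would look roughly like this (a sketch reusing the names from the code above, with the ordering taken from the fixes just described, not from the actual final source):

// Sketch of the corrected per-note sequence for one voice.
alSourceStop(Source[voice]);                        // stop playback on this voice
alSourcei(Source[voice], AL_BUFFER, 0);             // 3) detach the old buffer before refilling it (0 = no buffer)
alBufferData(Buffer[voice], AL_FORMAT_MONO16,       // 1) fill the AL buffer first...
             buf[voice], len * sizeof(short),       // 2) ...with the size given in bytes, not samples
             srate);
alSourcei(Source[voice], AL_BUFFER, Buffer[voice]); // ...then attach it to the source
alSourcei(Source[voice], AL_LOOPING, AL_TRUE);
alSourcePlay(Source[voice]);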

Synchronized Block takes more time after instrumenting with ASM

I am trying to instrument a Java synchronized block using ASM. The problem is that after instrumenting, the execution time of the synchronized block increases: here it goes from 2 ms to 200 ms on a Linux box.
I am implementing this by identifying the MonitorEnter and MonitorExit opcodes.
I try to instrument at three points: 1. just before the MonitorEnter, 2. just after the MonitorEnter, 3. just before the MonitorExit.
1 and 3 together work fine, but when I do 2, the execution time increases dramatically.
Even if I instrument only a single additional SOP (System.out.println) statement, which is intended to execute just once, it gives much higher values.
Here is the sample code (prime-number loop, run 10 times):
for (int w = 0; w < 10; w++) {
    synchronized (s) {
        long t1 = System.currentTimeMillis();
        long num = 2000;
        for (long i = 1; i < num; i++) {
            long p = i;
            int j;
            for (j = 2; j < p; j++) {
                long n = p % i;
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time>>>>>>>>>>>> " + (t2 - t1));
    }
}
Here is the code for the instrumentation (the System.currentTimeMillis() calls here give the time at which the instrumentation happened; they are not the measure of execution time, which comes from the SOP statement above):
public void visitInsn(int opcode)
{
    switch (opcode)
    {
        // Scenario 1: just before MONITORENTER (opcode 194)
        case 194:
            visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
            visitLdcInsn("TIME Arrive: " + System.currentTimeMillis());
            visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
            break;
        // Scenario 3: just before MONITOREXIT (opcode 195)
        case 195:
            visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
            visitLdcInsn("TIME exit : " + System.currentTimeMillis());
            visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
            break;
    }

    super.visitInsn(opcode);

    // Scenario 2: just after MONITORENTER
    if (opcode == 194)
    {
        visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        visitLdcInsn("TIME enter: " + System.currentTimeMillis());
        visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
    }
}
I am not able to find the reason why this is happening or how to correct it.
Thanks in advance.
The reason lies in the internals of the JVM that you were using to run the code. I assume that this was a HotSpot JVM, but the explanation below applies equally to most other implementations.
If you trigger the following code:
int result = 0;
for (int i = 0; i < 1000; i++) {
    result += i;
}
This will be translated directly into Java byte code by the Java compiler but at run time the JVM will easily see that this code is not doing anything. Executing this code will have no effect on the outside (application) world, so why should the JVM execute it? This consideration is exactly what compiler optimization does for you.
If you however trigger the following code:
int result = 0;
for (int i = 0; i < 1000; i++) {
    System.out.println(result);
}
the Java runtime cannot optimize your code away anymore. The whole loop must always run, since the System.out.println(int) method always does something real (it has an observable side effect), so your code will run slower.
Now let's look at your example. In your first example, you basically write this code:
synchronized(s) {
    // do nothing useful
}
This entire code block can easily be removed by the Java run time. This means: There will be no synchronization! In the second example, you are writing this instead:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    // do nothing useful
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2 - t1));
}
This means that the effective code might look like this:
synchronized(s) {
    long t1 = System.currentTimeMillis();
    long t2 = System.currentTimeMillis();
    System.out.println("Time>>>>>>>>>>>> " + (t2 - t1));
}
What is important here is that this optimized code is still effectively synchronized, which is an important difference with respect to execution time. Basically, you are measuring the time it costs to synchronize something (and even that might be optimized away after a couple of runs if the JVM realizes that s is not locked anywhere else in your code; buzzword: speculative optimization with the possibility of deoptimization if code loaded in the future also synchronizes on s).
You should really read this:
http://www.ibm.com/developerworks/java/library/j-jtp02225/
http://www.ibm.com/developerworks/library/j-jtp12214/
Your test, for example, misses a warm-up phase, so you are also measuring how much time the JVM spends compiling byte code to machine code.
On a side note: synchronizing on a String is almost always a bad idea. Your strings might or might not be interned, which means that you cannot be absolutely sure about their identity. This means that synchronization might or might not work, and you might even inadvertently synchronize with other parts of your code.
