std::copy runtime_error when working with uint16_t's - c++11

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400 bytes.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    std::copy(depth, depth + depthBufferSize(), depth_buffer.begin()); /// the error
    new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!

The way std::copy works is that you provide start and end iterators for your input sequence and the location to begin copying to. The end point you're providing is off the end of your sequence, because your depthBufferSize function is giving a size in bytes rather than the number of elements: depth + 614400 means 614400 uint16_t elements past depth, i.e. twice as far as you intend.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
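For concreteness, a minimal sketch of both variants (assuming, as in the question, that depth_buffer already holds 640*480 elements):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

void copyDepth(const uint16_t* depth, std::vector<uint16_t>& depth_buffer) {
    const std::size_t elements = 640 * 480;   // element count, NOT a byte count
    std::copy(depth, depth + elements, depth_buffer.begin());
    // or, with an explicit element count:
    // std::copy_n(depth, elements, depth_buffer.begin());
}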
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.


OpenCL kernel - every work-item overwrites global memory?

I'm trying to write a kernel to get the character frequencies of a string.
First, here is the code I have for kernel right now:
__kernel void readParallel(__global char * indata, __global int * outdata)
{
    int startId = get_global_id(0) * 8;
    int maxId = startId + 7;
    for (int i = startId; i < maxId; i++)
    {
        ++outdata[indata[i]];
    }
}
The variable indata holds the string in global memory, and outdata is an array of 256 int values in global memory. Every work-item reads 8 symbols from the string and should increment the count for the appropriate ASCII code in the array. The code compiles and executes, but outdata contains fewer occurrences overall than the number of characters in indata. I think the problem is that the work-items overwrite each other's updates to global memory. It would be nice if you could give me some tips to solve this.
By the way, I am a rookie in OpenCL ;-) and, yes, I looked for solutions in other questions.
You are experiencing the effects of your accesses to global memory not being atomic (see a C++-oriented description of what atomic operations are, or another description by the Intel TBB folks). What happens, chronologically, is:
Some workgroup "thread" loads outData[123] into some register r1
... lots of work, reading and writing, happens, including on
outData[123]...
The same workgroup "thread" increments r1
... lots of work, reading and writing, happens, including on
outData[123]...
The same workgroup "thread" writes r1 to outData[123]
So, the value written to outData[123] "throws away" the updates during the time period between the read and the write (I'm ignoring the possibility of parallel writes corrupting each other rather than one of them winning out).
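This lost-update effect is not specific to OpenCL; here is a minimal C++ host-side sketch, purely to illustrate the mechanism:
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Four threads increment a plain int and a std::atomic<int> concurrently.
// The plain counter loses updates exactly as described above.
int main() {
    int plainCount = 0;
    std::atomic<int> atomicCount{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) {
                ++plainCount;    // non-atomic read-modify-write: updates get lost
                ++atomicCount;   // atomic read-modify-write: always exact
            }
        });
    for (auto& th : pool) th.join();
    std::cout << "plain: " << plainCount
              << ", atomic: " << atomicCount.load()
              << " (expected 400000)\n";
    return 0;
}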
What you need to do is either:
Use atomic operations (see the corrected kernel sketched below) - the fewest modifications to your code, but very inefficient, since it serializes your work to a great extent, or
Use work-item-specific, warp-specific and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
On an unrelated note, and as @huseyintugrulbuyukisik correctly points out, your code uses signed char values to index the array. To fix that, do one of the following:
Reinterpret those chars as unsigned chars for array indices (and reinterpret back when reading the array).
Upcast the char values to a larger integral type and add 128 to get an offset into the output array.
Define your kernel to only support ASCII characters (no higher than 127), in which case you can ignore this issue (although that will be a potential crasher if you get invalid input).
If you only care about the frequency of printable characters (but can also have non-printing characters in the input), you could perform a run-time check before counting a character.
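Putting both fixes together, here is a sketch of a corrected kernel (OpenCL C), using the built-in atomic_inc and an unsigned index. Note that your original loop condition (i < startId + 7) also visits only 7 of the 8 symbols per work-item, so the sketch assumes 8 was intended:
__kernel void readParallel(__global const char * indata, __global volatile int * outdata)
{
    int startId = get_global_id(0) * 8;
    for (int i = startId; i < startId + 8; i++)  // 8 symbols per work-item
    {
        uchar c = (uchar)indata[i];              // unsigned index, no negative values
        atomic_inc(&outdata[c]);                 // the read-modify-write is now atomic
    }
}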

MPI : Wondering about process sequence and for-loops

I am trying to build something with MPI, and since I am not very familiar with it, I started with some arrays and printing stuff. I noticed that a plain C statement (not an MPI one) runs simultaneously on every process, e.g. printing something like this:
printf("Process No.%d",rank);
Then I noticed that the numbers of the processes came out all scrambled, and because I wanted the processes in the right sequence, I tried using a for-loop like this:
for(rank=0; rank<processes; rank++) printf("Process No.%d",rank);
And that started a third world war in my computer: lots of strange errors in a strange format that I couldn't understand, which made me suspicious. How is it possible that an if statement testing a rank's value, like the master rank:
if(rank==0) printf("Process No.%d",rank);
works fine, while a for-loop over the ranks cannot be used the same way? Well, that is my first question.
My second question is about another for-loop I used, which got ignored.
printf("PROCESS --------------->**%d**\n",id);
for (i = 0; i < PARTS; ++i){
printf("Array No.%d\n", i+1);
for (j = 0; j < MAXWORDS; ++j)
printf("%d, ",0);
printf("\n\n");
}
I ran that for-loop and every process printed only the first line:
$ mpiexec -n 6 `pwd`/test
PROCESS --------------->**0**
PROCESS --------------->**1**
PROCESS --------------->**3**
PROCESS --------------->**2**
PROCESS --------------->**4**
PROCESS --------------->**5**
But not the zeros that should have followed (there was an array there at first, which I removed because I was trying to figure out why it didn't get printed).
So, what is it about MPI and for-loops that they don't get along?
--edit 1: grammar
--edit 2: Code paste
It is not the same as above, but it has the same problem in the last for-loop with fprintf.
It's in a paste zone, sorry for that; I couldn't deal with the code formatting system here.
--edit 3: fixed
Well, I finally figured it out. First, I have to say that the fprintf function is a mess when used with MPI: apparently there is a kind of overlap when every process writes to the same text file. I tested it with the printf function and it worked. The second thing was that I was calling the MPI_Scatter function from inside the root only:
if(rank==root) MPI_Scatter();
...which only scatters the data inside the root process and not to the others, because MPI_Scatter is a collective call that every rank has to make.
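For illustration, a minimal sketch of the correct pattern (illustrative names, not my actual code): only the root fills the send buffer, but every rank makes the collective call.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    int *sendbuf = NULL;
    if (rank == 0) {                       /* only the root prepares the data... */
        sendbuf = (int *)malloc(procs * sizeof(int));
        for (int i = 0; i < procs; ++i)
            sendbuf[i] = i * 10;
    }
    int chunk;
    /* ...but ALL ranks execute the collective call, root and non-root alike */
    MPI_Scatter(sendbuf, 1, MPI_INT, &chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Process No.%d received %d\n", rank, chunk);

    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}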
Now that I have fixed those two issues, the program works as it should, apart from a minor problem when I printf the my_list arrays. It looks as if every array has a random number of entries, but when I tested with a counter for every array, it turned out it's only the printed output that comes out like this. I tried using fflush(stdout); but it returned an error:
usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: Input/output error
collect2: ld returned 1 exit status
MPI in and of itself does not have a problem with for loops. However, just like with any other software, you should always remember that it will work the way you code it, not the way you intend it. It appears that you are having two distinct issues, both of which are only tangentially related to MPI.
The first issue is that the variable PARTS is defined in such a way that it depends on another variable, procs, which is not initialized beforehand. This means that the value of PARTS is undefined, and probably ends up causing a divide-by-zero as often as not. PARTS should actually be set after line 44, when procs is initialized.
The second issue is with the loop for(i = 0; i = LISTS; i++) labeled /*Here is the problem*/. First of all, the test condition of the loop is an assignment, not a comparison: it sets i to the value of LISTS on every iteration (and is always true as long as LISTS is non-zero), regardless of the initial value of 0 and the increment at the end of the loop. Perhaps it was intended to be i < LISTS? Secondly, LISTS is initialized in a way that depends on PARTS, which depends on procs, before that variable is initialized. As with PARTS, LISTS must be initialized after the statement MPI_Comm_size(MPI_COMM_WORLD, &procs); on line 44.
Please be more careful when you write your loops. Also, make sure that you initialize variables correctly. I highly recommend using print statements (for small programs) or the debugger to make sure your variables are being set to the expected values.

How to use AudioConverterFillComplexBuffer and its callback?

I need a step by step walkthrough on how to use AudioConverterFillComplexBuffer and its callback. No, don't tell me to read the Apple docs. I do everything they say and the conversion always fails. No, don't tell me to go look for examples of AudioConverterFillComplexBuffer and its callback in use - I've duplicated about a dozen such examples both line for line and modified, and the conversion always fails. No, there isn't any problem with the input data. No, it isn't an endian issue. No, the problem isn't my version of OS X.
The problem is that I don't understand how AudioConverterFillComplexBuffer works, so I don't know what I'm doing wrong. And nothing out there is helping me understand, because it seems like nobody on Earth really understands how AudioConverterFillComplexBuffer works, either. From the people who actually use it (I spy cargo cult programming in their code) to even the authors of Learning Core Audio and/or Apple itself (http://stackoverflow.com/questions/13604612/core-audio-how-can-one-packet-one-byte-when-clearly-one-packet-4-bytes).
This isn't just a problem for me, it's a problem for anybody who wants to program high-performance audio on the Mac platform. Threadbare documentation that's apparently wrong and examples that don't work are no fun.
Once again, to be clear: I NEED A STEP BY STEP WALKTHROUGH ON HOW TO USE AudioConverterFillComplexBuffer plus its callback, and so does the entire Mac developer community.
This is a very old question but I think it is still relevant. I've spent a few days fighting this and have finally achieved a successful conversion. I'm certainly no expert but I'll outline my understanding of how it works. Note I'm using Swift, which I'm also just learning.
Here are the main function arguments:
inAudioConverter: AudioConverterRef: This one is simple enough, just pass in a previously created AudioConverterRef.
inInputDataProc: AudioConverterComplexInputDataProc: The very complex callback. We'll come back to this.
inInputDataProcUserData: UnsafeMutableRawPointer?: This is a reference to whatever data you may need provided to the callback function. Important because even in Swift the callback can't capture context. E.g. you may need to access an AudioFileID or keep track of the number of packets read so far.
ioOutputDataPacketSize: UnsafeMutablePointer<UInt32>: This one is a little misleading. The name implies it's the packet size but reading the documentation we learn it's the total number of packets expected for the output format. You can calculate this as outPacketCount = frameCount / outStreamDescription.mFramesPerPacket.
outOutputData: UnsafeMutablePointer<AudioBufferList>: This is an audio buffer list which you need to have already initialized with enough space to hold the expected output data. The size can be calculated as byteSize = outPacketCount * outMaxPacketSize (see the sizing sketch after this list).
outPacketDescription: UnsafeMutablePointer<AudioStreamPacketDescription>?: This is optional. If you need packet descriptions, pass in a block of memory the size of outPacketCount * sizeof(AudioStreamPacketDescription).
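If it helps to see the sizing arithmetic in one place, here is a sketch using the C API (converter, inputProc, userData, frameCount and outMaxPacketSize are assumed to be set up already; the names are mine, not from any sample):
#include <AudioToolbox/AudioToolbox.h>
#include <stdlib.h>

OSStatus convertOnce(AudioConverterRef converter,
                     AudioConverterComplexInputDataProc inputProc,
                     void *userData,
                     AudioStreamBasicDescription outDesc,
                     UInt32 frameCount,
                     UInt32 outMaxPacketSize) {
    /* total number of packets expected for the output format */
    UInt32 outPacketCount = frameCount / outDesc.mFramesPerPacket;
    /* byte size of the output buffer */
    UInt32 outByteSize = outPacketCount * outMaxPacketSize;

    AudioBufferList outList;
    outList.mNumberBuffers = 1;
    outList.mBuffers[0].mNumberChannels = outDesc.mChannelsPerFrame;
    outList.mBuffers[0].mDataByteSize = outByteSize;
    outList.mBuffers[0].mData = malloc(outByteSize);

    /* outPacketCount is in/out: packets requested on input,
       packets actually produced on output */
    OSStatus status = AudioConverterFillComplexBuffer(converter, inputProc, userData,
                                                      &outPacketCount, &outList, NULL);
    /* ... consume outPacketCount packets from outList.mBuffers[0] ... */
    free(outList.mBuffers[0].mData);
    return status;
}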
As the converter runs it will repeatedly call the callback function to request more data to convert. The main job of the callback is simply to read the requested number of packets from the source data. The converter will then convert the packets to the output format and fill the output buffer. Here are the arguments for the callback:
inAudioConverter: AudioConverterRef: The audio converter again. You probably won't need to use this.
ioNumberDataPackets: UnsafeMutablePointer<UInt32>: The number of packets to read. After reading, you must set this to the number of packets actually read (which may be less than the number requested if we reached the end).
ioData: UnsafeMutablePointer<AudioBufferList>: An AudioBufferList which is already configured except for the actual data. You need to initialise ioData.mBuffers.mData with enough capacity to hold the expected number of packets, i.e. ioNumberDataPackets * inMaxPacketSize. Set the value of ioData.mBuffers.mDataByteSize to match.
outDataPacketDescription: UnsafeMutablePointer<UnsafeMutablePointer<AudioStreamPacketDescription>?>?: Depending on the formats used, the converter may need to keep track of packet descriptions. You need to initialise this with enough capacity to hold the expected number of packet descriptions.
inUserData: UnsafeMutableRawPointer?: The user data that you provided to the converter.
So, to start you need to:
Have sufficient information about your input and output data, namely the number of frames and maximum packet sizes.
Initialise an AudioBufferList with sufficient capacity to hold the output data.
Call AudioConverterFillComplexBuffer.
And on each run of the callback you need to:
Initialise ioData with sufficient capacity to store ioNumberDataPackets of source data.
Initialise outDataPacketDescription with sufficient capacity to store ioNumberDataPackets of AudioStreamPacketDescriptions.
Fill the buffer with source packets.
Write the packet descriptions.
Set ioNumberDataPackets to the number of packets actually read.
Return noErr if successful.
Here's an example where I read the data from an AudioFileID:
var converter: AudioConverterRef?
// User data holds an AudioFileID, input max packet size, and a count of packets read
var uData = (fRef, maxPacketSize, UnsafeMutablePointer<Int64>.allocate(capacity: 1))
err = AudioConverterNew(&inStreamDesc, &outStreamDesc, &converter)
err = AudioConverterFillComplexBuffer(converter!, { _, ioNumberDataPackets, ioData, outDataPacketDescription, inUserData in
    // Recover the user data tuple passed to the converter below
    let uData = inUserData!.load(as: (AudioFileID, UInt32, UnsafeMutablePointer<Int64>).self)
    // Provide a buffer for the source data, sized to the input max packet size
    ioData.pointee.mBuffers.mDataByteSize = uData.1
    ioData.pointee.mBuffers.mData = UnsafeMutableRawPointer.allocate(byteCount: Int(uData.1), alignment: 1)
    outDataPacketDescription?.pointee = UnsafeMutablePointer<AudioStreamPacketDescription>.allocate(capacity: Int(ioNumberDataPackets.pointee))
    // AudioFileReadPacketData updates ioNumberDataPackets to the count actually read
    let err = AudioFileReadPacketData(uData.0, false, &ioData.pointee.mBuffers.mDataByteSize, outDataPacketDescription?.pointee, uData.2.pointee, ioNumberDataPackets, ioData.pointee.mBuffers.mData)
    // Advance the running packet counter
    uData.2.pointee += Int64(ioNumberDataPackets.pointee)
    return err
}, &uData, &numPackets, &bufferList, nil)
Again, I'm no expert, this is just what I've learned by trial and error.

Pixel modifying code runs quickly in main app, really slow in Delphi 6 DirectShow filter with other problems

I have a Delphi 6 application that sends bitmaps to a DirectShow DLL in real-time, 25 frames a second. The DirectShow DLL is my code too and is also written in Delphi 6 using the DSPACK DirectShow component suite. I have a simple block of code that goes through each pixel in the bitmap modifying the brightness and contrast of the image, if a certain flag is set, otherwise the bitmap is pushed out the DirectShow DLL unmodified (push source video filter). The code used to be in the main application and then I just moved it into the DirectShow DLL. When it was in the main application it ran fine. I could see the changes in the bitmap as expected. However, now that the code resides in the DirectShow DLL it has the following problems:
When the code block below is active the DirectShow DLL is really slow, even though I have a quad core i5, and I can see a big spike in CPU consumption. In contrast, the very same code running in the main application ran fine on an old single core P4. It did hit the CPU noticeably on that old machine but the video was smooth and there were no problems. The images are only 352 x 288 pixels in size.
I don't see the expected changes to the visible bitmap. I can trace the code in the DirectShow DLL and see the numerical values of each pixel properly altered by the code, but the viewable image in the Graph Edit ActiveMovie window looks completely unchanged.
If I deactivate the code, which I can do in real-time, the ActiveMovie window shows video that is as smooth as glass, perfectly rendered with the CPU barely touched. If I reactivate the code the video is now really choppy, probably showing only 1 to 2 frames a second with a long delay before the first frame is shown, and the CPU spikes. Not completely, but a lot more than I would expect.
I tried compiling the DirectShow DLL with everything on including range checking, overflow checking, etc. and there were no warnings or errors during run-time. I then tried compiling for fastest speed and it still had the exact same problems listed above. Something is really wrong and I can't figure out what. Note, I do indeed lock the canvas before modifying the bitmap and unlock it after I'm done. If it weren't for the "everything on" compilation run I noted above I'd say it felt like an FPU Exception was being raised and silently swallowed with every pixel computation, but as I said, no errors or Exceptions are occurring.
UPDATE: I am putting this here so that the solution, which is embedded in one of Roman R's comments, is plainly visible. The problem was that I was not setting the PixelFormat property to pf24Bit before accessing the ScanLine property. As Roman suggested, not doing this must make the TBitmap code create a temporary copy of the bitmap. As soon as I added the line of code below the problems went away, both that of the changes not being visible and the soft page faults. It's an insidious problem because the only object that is affected is the pointer you use to access the ScanLine property, since (assumption) it contains a pointer to a temporary copy of the bitmap. That must be why the subsequent TextOut() call still worked, since it operated on the original copy of the bitmap.
clip.PixelFormat := pf24bit; // The missing code line that fixes the problem.
Here's the code block I've been referring to:
function IntToByte(i: Integer): Byte;
begin
  if i > 255 then
    Result := 255
  else if i < 0 then
    Result := 0
  else
    Result := i;
end;
// ---------------------------------------------------------------
procedure brightnessTurboBoost(var clip: TBitmap; rangeExpansionPowerOf2: integer; shiftValue: Byte);
var
  p0: PByte;
  x, y: Integer;
begin
  if (rangeExpansionPowerOf2 = 0) and (shiftValue = 0) then
    exit; // These parameter settings will not change the pixel values.
  for y := 0 to clip.Height - 1 do
  begin
    p0 := clip.scanline[y];
    // Can't just do the whole buffer as a big block of bytes since the
    // individual scan lines may be padded for CPU alignment.
    for x := 0 to (clip.Width - 1) * 3 do
    begin
      if rangeExpansionPowerOf2 >= 1 then
        p0^ := IntToByte((p0^ shl rangeExpansionPowerOf2) + shiftValue)
      else
        p0^ := IntToByte(p0^ + shiftValue);
      Inc(p0);
    end;
  end;
end;
There are a few things to say about this code snippet.
First of all, you are using the Scanline property of the TBitmap class. I have not been dealing with Delphi for many years, so I might be wrong about this, but I am under the impression that Scanline is not actually a thin accessor, is it? It might be internally hiding things which can dramatically affect performance, such as "if he wants to access the bits of the image, then we have to first convert it to a DIB before returning pointers". So a thing looking so simple might turn out to be a killer.
"if rangeExpansionPowerOf2 >= 1 then" in the inner loop body? You don't really want to evaluate this comparison for every pixel. Either make two separate functions or duplicate the whole loop in two versions, for zero and for non-zero rangeExpansionPowerOf2, so that the if is done only once (see the sketch at the end of this answer).
"for ... to (clip.Width - 1) * 3 do": I am not really sure that Delphi optimizes the evaluation of the upper boundary so that it happens only once. You might be doing that multiplication for every pixel, while you could do it only once for the whole image.
For top performance, IntToByte would definitely be implemented in MMX to avoid the ifs and process multiple bytes at once.
Still, as you say the images are only 352x288, I would suspect that #1 is ruining the performance.
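To illustrate points 2 and 3 in code, here is a hypothetical C++ sketch of the restructured per-row loop (not your Delphi code; note also that your upper bound (clip.Width - 1) * 3 appears to stop two bytes short of the last pixel, so the sketch uses the full row width):
#include <cstdint>

// Clamp an int to the 0..255 byte range (the equivalent of IntToByte above).
static inline uint8_t clampToByte(int v) {
    return v > 255 ? 255 : (v < 0 ? 0 : static_cast<uint8_t>(v));
}

// One scan line: the bound is computed once and the branch is hoisted
// out of the per-pixel loop, leaving two tight branch-free loops.
void boostRow(uint8_t* row, int widthPixels, int rangeExpansionPowerOf2, int shiftValue) {
    const int widthBytes = widthPixels * 3;   // evaluated once, not per pixel
    if (rangeExpansionPowerOf2 >= 1) {
        for (int x = 0; x < widthBytes; ++x)
            row[x] = clampToByte((row[x] << rangeExpansionPowerOf2) + shiftValue);
    } else {
        for (int x = 0; x < widthBytes; ++x)
            row[x] = clampToByte(row[x] + shiftValue);
    }
}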

D Dynamic Arrays - RAII

I admit I have no deep understanding of D at this point, my knowledge relies purely on what documentation I have read and the few examples I have tried.
In C++ you could rely on the RAII idiom to call the destructor of objects on exiting their local scope.
Can you in D?
I understand D is a garbage collected language, and that it also supports RAII.
Why does the following code not cleanup the memory as it leaves a scope then?
import std.stdio;
void main() {
    {
        const int len = 1000 * 1000 * 256; // ~1GiB
        int[] arr;
        arr.length = len;
        arr[] = 99;
    }
    while (true) {}
}
The infinite loop is there to keep the program open, so as to make residual memory allocations easily visible.
For comparison, I ran an equivalent program in C++. There, the memory was cleaned up immediately after the allocation left scope (the refresh rate made it appear as if less memory was allocated), whereas D kept the memory even though it had left scope.
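That C++ program was essentially the following (a reconstruction, since the original listing is not included here):
#include <vector>

int main() {
    {
        const int len = 1000 * 1000 * 256;   // the same ~1GiB allocation
        std::vector<int> arr(len, 99);
    }                                        // freed right here, deterministically (RAII)
    while (true) {}                          // keep the process alive to watch memory
}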
Therefore, when does the GC clean up?
scope declarations are going away in D2, so I'm not terribly certain about the semantics, but what I'd imagine is happening is that scope T[] a; only allocates the array struct on the stack (which, needless to say, already happens regardless of scope). As they are going away, don't use scope declarations (using scope(exit) and friends is different -- keep using them).
Dynamic arrays always use the GC to allocate their memory -- there's no getting around that. If you want something more deterministic, using std.container.Array would be the simplest way, as I think you could pretty much drop it in where your scope vector3b array is:
Array!vector3b array;
Just don't bother setting the length to zero -- the memory will be freed once it goes out of scope (Array uses malloc/free from libc under the hood).
No, you cannot assume that the garbage collector will collect your object at any point in time.
There is, however, a delete keyword (as well as a scope keyword) that can delete an object deterministically.
scope is used like:
{
    scope auto obj = new int[5];
    // ...
} // obj cleaned up here
and delete is used like in C++ (there's no [] notation for delete).
There are some gotcha's, though:
It doesn't always work properly (I hear it doesn't work well with arrays)
The developers of D (e.g. Andrei) intend to remove them in later versions, because it can obviously mess things up if used incorrectly. (I personally hate this, given that it's so easy to screw things up anyway, but they're sticking with removing it, and I don't think people can convince them otherwise, although I'd love it if that were the case.)
In its place, there is already a clear method that you can use, like arr.clear(); however, I'm not quite sure what exactly it does yet myself, but you could look at the source code in object.d in the D runtime if you're interested.
As to your amazement: I'm glad you're amazed, but it shouldn't be really surprising considering that they're both native code. :-)
