In-place RGB->BGR color conversion is slower in OpenCV

Is it the case that the in-place RGB->BGR color conversion routine in OpenCV saves some memory, but takes longer? If yes, can anyone explain why?
My application calls the cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR) routine in OpenCV (version 4.2.0). In an effort to make the application faster, I tried the in-place version of this routine (by invoking it with the same Mat object for source and destination). I expected the speed to slightly improve, since the in-place version does not allocate new memory.
To test my expectation, I ran my application in a loop over 10,000 250x250 RGB images. To my surprise, my application became slower when the in-place version was used. In fact, I saw that the larger the image (500x500 vs 250x250), the greater the difference between the in-place and regular version.
Is this expected? If so, is it because the in-place version does a swap operation (more statements) and the regular version is only a copy operation?
Would anyone be willing to try to reproduce this behavior? It is easy to do by timing the following snippet in two ways: 1) as written, and 2) modified for the in-place version by following the brief instructions in its comments.
// Read image
Mat srcMat = imread(filename);
// Comment out this line for the in-place version
Mat dstMat;
for (int i = 0; i < 10000; i++)
{
    // Use srcMat instead of dstMat in the in-place version
    cv::cvtColor(srcMat, dstMat, cv::COLOR_RGB2BGR);
}
Thanks.

You can dig into the sources to find the reason.
There are a few possible code paths (using OpenCL or not, using IPP or not).
On my machine, the execution of cv::cvtColor reaches the function CvtColorIPPLoopCopy in color.hpp:
template <typename Cvt>
bool CvtColorIPPLoopCopy(const uchar * src_data, size_t src_step, int src_type, uchar * dst_data, size_t dst_step, int width, int height, const Cvt& cvt)
{
    Mat temp;
    Mat src(Size(width, height), src_type, const_cast<uchar*>(src_data), src_step);
    Mat source = src;
    if( src_data == dst_data )
    {
        src.copyTo(temp);
        source = temp;
    }
    bool ok;
    parallel_for_(Range(0, source.rows),
                  CvtColorIPPLoop_Invoker<Cvt>(source.data, source.step, dst_data, dst_step,
                                               source.cols, cvt, &ok),
                  source.total()/(double)(1<<16) );
    return ok;
}
The code checks if src_data == dst_data, and if equal it copies the source image into temporary image:
if( src_data == dst_data )
{
    src.copyTo(temp);
    source = temp;
}
The extra data copy may be the reason the in-place version takes longer.
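To illustrate the cost, here is a minimal sketch of the same copy-then-convert pattern without OpenCV. The names swap_rb and swap_rb_out_of_place are hypothetical, and the loop is only a stand-in for the optimized IPP kernel; it shows where the extra full-image copy occurs, not how OpenCV implements the conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an optimized kernel that requires src and dst
// to be different buffers (this is NOT OpenCV's implementation).
void swap_rb_out_of_place(const std::uint8_t* src, std::uint8_t* dst,
                          std::size_t pixels)
{
    assert(src != dst);
    for (std::size_t i = 0; i < pixels; ++i) {
        dst[3 * i + 0] = src[3 * i + 2]; // R <-> B
        dst[3 * i + 1] = src[3 * i + 1]; // G unchanged
        dst[3 * i + 2] = src[3 * i + 0];
    }
}

// Wrapper mirroring the src_data == dst_data check: an in-place call pays
// for one extra full-image copy before the conversion runs.
void swap_rb(const std::uint8_t* src, std::uint8_t* dst, std::size_t pixels)
{
    if (src == dst) {
        std::vector<std::uint8_t> temp(src, src + 3 * pixels); // extra copy
        swap_rb_out_of_place(temp.data(), dst, pixels);
    } else {
        swap_rb_out_of_place(src, dst, pixels);
    }
}
```

Both calls produce the same pixels; the in-place call simply touches the image memory one more time, which would be consistent with the gap growing with image size.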
Note:
I can't say this is the reason for sure, because there are other possible code paths.
There are many highly performance optimized functions that do not support "in-place" processing.
When OpenCV needs to execute a function that doesn't support "in-place" processing, the solution may be to copy the source image to a temporary location.
The same practice may be used for other execution code paths.
As I commented,
In-place processing prevents some compilation (and execution) optimizations due to loop carried dependencies.
In some cases there are also parallelization issues regarding "in-place" processing.
That's the reason many optimized "primitive" functions do not support "in-place" processing.

Related

Using Thrust Functions with raw pointers: Controlling the allocation of memory

I have a question regarding the thrust library when using CUDA.
I am using a thrust function, i.e. exclusive_scan, and I want to use raw pointers. I am using raw (device) pointers because I want to have full control of when the memory is allocated and deallocated.
After the function call, I will hand over the pointer to another data structure and then free the memory either in the destructor of this data structure, or in the next function call, when I recompute my (device) pointers. I came across a similar problem, for example, where the recommendation was to wrap the data in a device_vector. But then I run into the problem that the memory is freed once my device_vector goes out of scope, which I do not want. Having the device pointer globally is also not an option, since I am hacking code, i.e. it is used as a buffer and I would have to rewrite a lot if I wanted to do something like that.
Does anyone have a good workaround regarding this? The only chance I do see right now is to rewrite the thrust-function on my own, only using raw device-pointers.
EDIT: I misread, I can wrap it in a device_ptr instead of a device_vector.
Asking further though, how could I solve this if there wasn't the option of using a device_ptr?

There is no problem using plain pointers in thrust methods.
For data on the device do:
....
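struct DoSomething {
    // thrust::generate invokes the generator with no arguments
    __device__ int operator()() const { return 1; }
};
int* IntData;
cudaMalloc(&IntData, sizeof(int) * count);
// Wrap the raw device pointer; this does not take ownership of the memory
auto dev_data = thrust::device_pointer_cast(IntData);
thrust::generate(dev_data, dev_data + count, DoSomething());
thrust::sort(dev_data, dev_data + count);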
....
cudaFree(IntData);
For data on the host use plain malloc/free and raw_pointer_cast instead of device_pointer_cast.
See: thrust: Memory management

Purpose of _Compiler_barrier() on 32bit read

I have been stepping through the function calls that are involved when I assign to an atomic_long type on VS2017 with a 64-bit project. I specifically wanted to see what happens when I copy an atomic_long into a non-atomic variable, and whether there is any locking around it.
atomic_long ll = 10;
long t2 = ll;
Ultimately it ends up with this call (I've removed some code that was ifdefed out)
inline _Uint4_t _Load_seq_cst_4(volatile _Uint4_t *_Tgt)
{   /* load from *_Tgt atomically with
       sequentially consistent memory order */
    _Uint4_t _Value;
    _Value = *_Tgt;
    _Compiler_barrier();
    return (_Value);
}
Now, I've read from MSDN that a plain read of a 32bit value will be atomic:
Simple reads and writes to properly-aligned 32-bit variables are
atomic operations.
...which explains why there is no Interlocked function for just reading; only those for changing/comparing. What I'd like to know is what the _Compiler_barrier() bit is doing. This is #defined as
__MACHINE(void _ReadWriteBarrier(void))
...and I've found on MSDN again that this
Limits the compiler optimizations that can reorder memory accesses
across the point of the call.
But I don't get this, as there are no other memory accesses apart from the return call; surely the compiler wouldn't move the assignment below that would it?
Can someone please clarify the purpose of this barrier?

_Load_seq_cst_4 is an inline function. The compiler barrier is there to block reordering with later code in the calling function this inlines into.
For example, consider reading a SeqLock. (Over-simplified from this actual implementation).
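#include <atomic>
std::atomic<unsigned> sequence;
std::atomic_long value;
// Returns true and stores the value in result if no writer interfered;
// returns false if the caller should retry.
bool seqlock_try_read(long &result) {
    // this would normally be the body of a retry-loop
    unsigned seq1 = sequence;
    long tmpval = value;
    unsigned seq2 = sequence;
    if (seq1 == seq2 && (seq1 & 1) == 0) {
        result = tmpval;
        return true;
    }
    // writer was modifying it, we should retry the loop
    return false;
}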
If we didn't block compile-time reordering, the compiler could move the load of value ahead of both reads of sequence, and could then merge those two reads into a single access, perhaps like this:
long tmpval = value;
unsigned seq1 = sequence;
unsigned seq2 = sequence;
This would defeat the locking mechanism (where the writer increments sequence once before modifying the data, then again when it's done). Readers are entirely lockless, but it's not a "lock-free" algo because if the writer gets stuck mid-update, the readers can't read anything.
The barrier within each load function blocks reordering with other things after inlining.
(The C++11 memory model is very weak, but the x86 memory model is strong, only allowing StoreLoad reordering. Blocking compile-time reordering with later loads/stores is sufficient to give you an acquire / sequential-consistency load at runtime. x86: Are memory barriers needed here?)
BTW, a better example might be something where some non-atomic variables are read/written after seeing a certain value in an atomic flag. MSVC probably already avoids reordering or merging of atomic accesses, and in the seqlock the data being protected also has to be atomic.
See also: Why don't compilers merge redundant std::atomic writes?
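A minimal sketch of that flag-then-data pattern, assuming standard C++11 atomics rather than MSVC internals (payload, ready, producer, consumer and run_once are names made up for illustration). The load of ready must not be reordered with the later plain read of payload, which is the kind of reordering the barrier inside the inlined load function blocks at compile time.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // atomic flag that publishes payload

void producer()
{
    payload = 42;        // ordinary store
    ready.store(true);   // seq_cst store: payload is visible before the flag
}

int consumer()
{
    while (!ready.load()) { }  // seq_cst load, like _Load_seq_cst_4
    return payload;            // must not be hoisted above the load
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_once()
{
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

If the compiler were free to read payload before the loop finished, the consumer could return 0 even though it observed ready == true.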

Does ios_base::sync_with_stdio(false) affect <fstream>?

It is well known that ios_base::sync_with_stdio(false) will help the performance of cin and cout in <iostream> by disabling synchronization between C and C++ I/O. However, I am curious as to whether it makes any difference at all in <fstream>.
I ran some tests with GNU C++11 and the following code (with and without the ios_base::sync_with_stdio(false) snippet):
#include <fstream>
#include <iostream>
#include <chrono>
using namespace std;
ofstream out("out.txt");
int main() {
    auto start = chrono::high_resolution_clock::now();
    long long val = 2;
    long long x = 1 << 22;
    ios_base::sync_with_stdio(false);
    while (x--) {
        val += x % 666;
        out << val << "\n";
    }
    auto end = chrono::high_resolution_clock::now();
    chrono::duration<double> diff = end - start;
    cout << diff.count() << " seconds\n";
    return 0;
}
The results are as follows:
With sync_with_stdio(false): 0.677863 seconds (average 3 trials)
Without sync_with_stdio(false): 0.653789 seconds (average 3 trials)
Is this to be expected? Is there a reason for a nearly identical, if not slower speed, with sync_with_stdio(false)?
Thank you for your help.

The idea of sync_with_stdio() is to allow mixing input and output to standard stream objects (stdin, stdout, and stderr in C and std::cin, std::cout, std::cerr, and std::clog as well as their wide character stream counterparts in C++) without any need to worry about characters being buffered in any of the buffers of the involved objects. Effectively, with std::ios_base::sync_with_stdio(true) the C++ IOStreams can't use their own buffers. In practice that normally means that buffering on std::streambuf level is entirely disabled. Without a buffer IOStreams are rather expensive, though, as they process individual characters, potentially involving multiple virtual function calls. Essentially, the speed-up you get from std::ios_base::sync_with_stdio(false) comes from allowing both the C and C++ libraries to use their own buffers.
An alternative approach could be to share the buffer between the C and C++ library facilities, e.g., by building the C library facilities on top of the more powerful C++ library facilities (before people complain that this would be a terrible idea, making C I/O slower: that is actually not true at all with a proper implementation of the standard C++ library IOStreams). I'm not aware of any non-experimental implementation which does use that. With this setup std::ios_base::sync_with_stdio(value) wouldn't have any effect at all.
Typical implementations of IOStreams use different stream buffers for the standard stream objects from those used for file streams. Part of the reason is probably that the standard stream objects are normally not opened using a name but some other entity identifying them, e.g., a file descriptor on UNIX systems, and it would require a "back door" interface to allow using a std::filebuf for the standard stream objects. However, at least early implementations of Dinkumware's standard C++ library which shipped (ships?), e.g., with MSVC++, used std::filebuf for the standard stream objects. This std::filebuf implementation was just a wrapper around FILE*, i.e., literally implementing what the C++ standard says rather than semantically implementing it. That was already a terrible idea to start with, but it was made worse by inhibiting std::streambuf level buffering for all file streams with std::ios_base::sync_with_stdio(true), as that setting also affected file streams. I do not know whether this [performance] problem has been fixed since. Old issues of the C/C++ Users Journal and/or P.J. Plauger's "The [draft] Standard C++ Library" should contain a discussion of this implementation.
tl;dr: According to the standard std::ios_base::sync_with_stdio(false) only changes the constraints for the standard stream objects to make their use faster. Whether it has other effects depends on the IOStream implementation and there was at least one (Dinkumware) where it made a difference.

Open CL Kernel - every workitem overwrites global memory?

I'm trying to write a kernel to get the character frequencies of a string.
First, here is the code I have for the kernel right now:
__kernel void readParallel(__global char * indata, __global int * outdata)
{
    int startId = get_global_id(0) * 8;
    int maxId = startId + 7;
    for (int i = startId; i < maxId; i++)
    {
        ++outdata[indata[i]];
    }
}
The variable indata holds the string in global memory, and outdata is an array of 256 int values in global memory. Every work-item reads 8 symbols from the string and should increment the entry for the corresponding ASCII code in the array. The code compiles and executes, but outdata contains fewer occurrences overall than the number of characters in indata. I think the problem is that work-items overwrite each other's updates in global memory. It would be nice if you could give me some tips to solve this.
By the way, I am a rookie in OpenCL ;-) and, yes, I looked for solutions in other questions.

You are experiencing the effects of your uses of global memory not being atomic (C++-oriented description of what those are or another description by the Intel TBB folks). What happens, chronologically, is:
1. Some workgroup "thread" loads outData[123] into some register r1.
2. ... lots of work, reading and writing, happens, including on outData[123] ...
3. The same workgroup "thread" increments r1.
4. ... lots of work, reading and writing, happens, including on outData[123] ...
5. The same workgroup "thread" writes r1 to outData[123].
So, the value written to outData[123] "throws away" the updates during the time period between the read and the write (I'm ignoring the possibility of parallel writes corrupting each other rather than one of them winning out).
What you need to do is either:
Use atomic operations - the least amount of modifications to your code, but very inefficient, since it serializes your work to a great extent, or
Use work-item-specific, warp-specific and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
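The second option can be sketched on the host with C++ threads standing in for work-items. This is an analogue under that assumption, not OpenCL kernel code, and count_chars and Histogram are hypothetical names. Each worker increments only its own private histogram, so no increment can be lost, and the partial results are summed once at the end.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

using Histogram = std::array<int, 256>;

// Counts byte frequencies using one private histogram per worker thread.
Histogram count_chars(const std::string& text, std::size_t workers)
{
    assert(workers > 0);
    std::vector<Histogram> partial(workers, Histogram{}); // all zero
    std::vector<std::thread> pool;
    const std::size_t chunk = (text.size() + workers - 1) / workers;

    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&text, &partial, chunk, w] {
            const std::size_t begin = std::min(text.size(), w * chunk);
            const std::size_t end = std::min(text.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                // Index as unsigned char so bytes above 127 are not negative.
                ++partial[w][static_cast<unsigned char>(text[i])];
            }
        });
    }
    for (auto& t : pool) t.join();

    // Combine the partial results after all workers have finished.
    Histogram total{};
    for (const auto& h : partial)
        for (std::size_t c = 0; c < total.size(); ++c)
            total[c] += h[c];
    return total;
}
```

In a kernel the same idea would typically use private or local memory for the partial counts, with synchronization needed only for the final combine step.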
On an unrelated note, and as @huseyintugrulbuyukisik correctly points out, your code uses signed char values to index the array. To fix that, do one of the following:
Reinterpret those chars as unsigned chars for array indices (and reinterpret back when reading the array).
Upcast the char values to a larger integral type and add 128 to get an offset into the output array.
Define your kernel to only support ASCII characters (no higher than 127), in which case you can ignore this issue (although that will be a potential crasher if you get invalid input).
If you only care about the frequency of printable characters (but can also have non-printing characters in the input), you could perform a run-time check before counting a character.

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    std::copy(depth, depth + depthBufferSize(), depth_buffer.begin()); // the error
    new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!

The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
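A minimal sketch of the fix, assuming the callback receives 16-bit samples and the size is known in bytes (elements_from_bytes and copy_depth are hypothetical helpers, not part of libfreenect). Pointer arithmetic on a uint16_t* already advances in elements, so the byte count has to be divided by sizeof(uint16_t) before it is used as an offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a size in bytes to a number of uint16_t elements.
std::size_t elements_from_bytes(std::size_t bytes)
{
    return bytes / sizeof(std::uint16_t);
}

// Copy `bytes` bytes worth of depth samples into `out`.
void copy_depth(const std::uint16_t* depth, std::size_t bytes,
                std::vector<std::uint16_t>& out)
{
    const std::size_t count = elements_from_bytes(bytes);
    assert(out.size() >= count); // destination must already be large enough
    std::copy_n(depth, count, out.begin());
    // Equivalent: std::copy(depth, depth + count, out.begin());
}
```

With the numbers from the question, 614400 bytes corresponds to 640*480 = 307200 elements, which is exactly the size of depth_buffer.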
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.
