What makes OpenCV so large on Windows? Anything I can do about it?

The OpenCV x64 distribution (through emgucv) for Windows has almost half a gigabyte of DLLs, including a single 224 MB opencv_gpu.dll. It seems unlikely that any human could have produced that amount of code, so what gives? Large embedded resources? Code-generation bloat (which doesn't seem likely, given that it's a native C/C++ project)?
I want to use it for face recognition, but it's a problem to have such a large binary dependency in git, and it's a hassle to manage outside of source control.
[Update]
There are no embedded resources (at least not the kind Windows DLLs usually have, though since this is a cross-platform product, I'm not sure that's significant). Maybe lots of initialized C table structures for performing matrix operations?

The size of opencv_gpu is the result of numerous template instantiations compiled for several CUDA architecture versions.
For example, for convolution:
- 7 data types (from CV_8U to CV_64F)
- ~30 hardcoded sizes of convolution kernel
- 8 CUDA architectures (bin: 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 + ptx: 2.0 3.0)
This produces about 1700 variants of convolution.
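To make the combinatorics concrete, here is a simplified sketch (not actual OpenCV source; the names are made up) of how explicit template instantiation multiplies the number of compiled functions:

    // Each instantiated (type, kernel size) pair becomes a separate compiled
    // function in the binary, and the GPU module repeats the whole set once
    // per targeted CUDA architecture.
    #include <cstddef>

    template <typename T, int KernelSize>
    void convolveRow(const T* src, T* dst, std::size_t width, const float* kernel)
    {
        for (std::size_t x = KernelSize / 2; x + KernelSize / 2 < width; ++x) {
            float acc = 0.f;
            for (int k = 0; k < KernelSize; ++k)
                acc += static_cast<float>(src[x + k - KernelSize / 2]) * kernel[k];
            dst[x] = static_cast<T>(acc);
        }
    }

    // 2 types x 3 kernel sizes = 6 copies of the loop above in the binary.
    // Scale that to ~7 types x ~30 sizes x 8 architectures and you get
    // roughly the ~1700 variants mentioned above.
    template void convolveRow<unsigned char, 3>(const unsigned char*, unsigned char*, std::size_t, const float*);
    template void convolveRow<unsigned char, 5>(const unsigned char*, unsigned char*, std::size_t, const float*);
    template void convolveRow<unsigned char, 7>(const unsigned char*, unsigned char*, std::size_t, const float*);
    template void convolveRow<float, 3>(const float*, float*, std::size_t, const float*);
    template void convolveRow<float, 5>(const float*, float*, std::size_t, const float*);
    template void convolveRow<float, 7>(const float*, float*, std::size_t, const float*);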
This is how opencv_gpu can grow to almost 1 GB for the latest OpenCV release.
If you are not going to use any CUDA acceleration, you can safely drop opencv_gpu.dll.

Related

Lightweight 2D library for embedded Linux

I'm developing an application on a relatively restricted embedded Linux platform, meaning it has 256 MB of flash; no problem with RAM, however. The application uses an SPI TFT screen, exposed through a framebuffer driver. The only thing required from the UI is to support text presentation with various fonts and sizes, including text animations (fade, slide, etc.). On the prototype, which ran on an RPi 3, I used libcairo and it went well. Now, given the tight space constraints on the real platform, it doesn't seem feasible to use libcairo anymore, since from what I've seen it requires more than 100 MB of space with all of its dependencies. Note, however, that I come from the bare-metal world and have never dealt with complex UI, so I might be completely wrong about libcairo and its size. So please suggest what 2D library I could pick for my case (C++ is preferred, but C is also OK), and in case there is a way to use libcairo with a footprint of a few megabytes, please point me in the right direction.
Regards
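For reference, drawing straight to the framebuffer device the question describes looks roughly like the sketch below (assuming a standard /dev/fb0 node and a 32-bit pixel format; a real text renderer would blit pre-rasterised glyphs into this buffer instead of a solid fill):

    #include <fcntl.h>
    #include <linux/fb.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>

    int main()
    {
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) return 1;

        // Ask the driver for the resolution and the row stride.
        fb_var_screeninfo vinfo{};
        fb_fix_screeninfo finfo{};
        ioctl(fd, FBIOGET_VSCREENINFO, &vinfo);
        ioctl(fd, FBIOGET_FSCREENINFO, &finfo);

        std::size_t size = static_cast<std::size_t>(finfo.line_length) * vinfo.yres;
        void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { close(fd); return 1; }
        auto* fb = static_cast<std::uint8_t*>(mem);

        // Fill the screen with one colour; a text renderer would blit
        // pre-rasterised glyphs (e.g. from FreeType) into this buffer instead.
        for (std::uint32_t y = 0; y < vinfo.yres; ++y) {
            auto* row = reinterpret_cast<std::uint32_t*>(fb + y * finfo.line_length);
            for (std::uint32_t x = 0; x < vinfo.xres; ++x)
                row[x] = 0x00FF8800; // XRGB orange, assuming 32 bpp
        }

        munmap(mem, size);
        close(fd);
        return 0;
    }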

Numerical differences between an older Mac Mini and a newer MacBook

I have a project that I compile on both my Mac Mini (Core 2 Duo) and a 2014 MacBook with a quad-core i7. Both are running the latest version of Yosemite. The application is single-threaded and I am compiling the tool and libraries using the exact same version of CMake and the Clang (Xcode) compiler. I am getting test failures due to slight numeric differences.
I am wondering if the inconsistency is coming from the Clang compiler automatically doing processor-specific optimizations (which I did not select in CMake)? Could the difference be between the processors? Do the frameworks use processor-specific optimizations? I am using the BLAS/LAPACK routines from the Accelerate framework. They are called from the SuperLU sparse matrix factorization package.
In general you should not expect results from BLAS or LAPACK to be bitwise reproducible across machines. There are a number of factors that implementors tune to get the best performance, all of which result in small differences in rounding:
- your two machines have different numbers of processors, which will result in work being divided differently for threading purposes (even if your application is single-threaded, BLAS may use multiple threads internally).
- your two machines handle hyperthreading quite differently, which may also cause BLAS to use different numbers of threads.
- the cache and TLB hierarchy is different between your two machines, which means that different block sizes are optimal for data reuse.
- the SIMD vector size on the newer machine is twice as large as that on the older machine, which again will affect how arithmetic is grouped.
- finally, the newer machine supports FMA (and using FMA is necessary to get the best performance on it); this also contributes to small differences in rounding.
Any one of these factors would be enough to result in small differences; taken together it should be expected that the results will not be bitwise identical. And that's OK, so long as both results satisfy the error bounds of the computation.
Making the results identical would require severely limiting the performance on the newer machine, which would result in your shiny expensive hardware going to waste.
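As a tiny illustration of the FMA point above, here is a sketch (plain standard C++, not tied to BLAS) of how a fused multiply-add and a separate multiply plus add can round differently:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double a = 1.0 + std::ldexp(1.0, -29); // 1 + 2^-29, exactly representable
        double b = 1.0 - std::ldexp(1.0, -29); // 1 - 2^-29, exactly representable
        double c = -1.0;

        // a*b is exactly 1 - 2^-58, which does not fit in a double and rounds to 1.0.
        double separate = a * b + c;          // rounds the product, then the sum -> 0.0
        double fused    = std::fma(a, b, c);  // a*b + c with a single rounding -> -2^-58

        std::printf("separate: %.17g\nfused:    %.17g\n", separate, fused);
    }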

x32 ABI: is this a tool, and how do I use it?

I need to increase the performance of my application, which is 32-bit, so I thought of moving to 64-bit. But then I came across the x32 ABI.
Below are some links for information. I just want to know whether this is a tool, or what it is, and how to use it. I am confused by the links:
https://sites.google.com/site/x32abi/
http://en.wikipedia.org/wiki/X32_ABI
http://www.linuxplumbersconf.org/2011/ocw/sessions/531
x32 is not a tool. It's an ABI, which is a kind of agreement between compilers, libraries and the OS. The idea is to be able to speed up some applications on x64 CPUs by using less space for pointers (smaller cache footprint, better locality for pointer-heavy data structures, and improved efficiency for atomic load/store/RMW of a pair of pointer-sized integers, since the pair is only 8 bytes total, not 16).
There's a lot of cooperation required for this to happen. You need to build special versions of the Linux kernel, the C library and the toolchain to be able to compile and run programs with the x32 ABI.
See more details here:
http://sourceware.org/glibc/wiki/x32
https://en.wikipedia.org/wiki/X32_ABI
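As a quick sanity check of what the ABI actually changes, a trivial program like the sketch below (assuming a toolchain and libc built with x32 support, e.g. g++ -mx32 on a kernel with CONFIG_X86_X32 enabled) reports 4-byte pointers and longs under x32 and 8-byte ones under the normal x86-64 ABI:

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        // x86-64 (-m64): pointers and long are 8 bytes.
        // x32    (-mx32): the CPU runs in 64-bit mode, but pointers and long are 4 bytes.
        std::printf("sizeof(void*)  = %zu\n", sizeof(void*));
        std::printf("sizeof(long)   = %zu\n", sizeof(long));
        std::printf("sizeof(size_t) = %zu\n", sizeof(std::size_t));
    }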

usage of OES_get_program_binary in GLSL

I want to see some examples of the usage of OES_get_program_binary; in other words, examples of scenarios in which program binaries are really useful. Thanks.
The utility of OES_get_program_binary is outlined pretty clearly in the extension specification itself.
On OpenGL ES devices, a common method for using shaders is to precompile them for each specific device. However, there are a lot of GPUs out there. Even if we assume that each GPU within a specific generation can run the same precompiled shaders (which is almost certainly not true in many cases), that still means you need separate precompiled shaders for Tegra 2, PowerVR Series 5 GPUs, PowerVR's Series 5X, and Qualcomm's current GPU. And that doesn't take into account next-gen mobile GPUs, like PowerVR Series 6 and Tegra 3, and whatever Qualcomm comes out with next. Or any number of other GPUs I haven't mentioned.
The only alternative is to ship text shaders and compile them as needed. As you might imagine, running a compiler on low-power ARM chips is rather expensive.
OES_get_program_binary provides a reasonable alternative. It lets you take a compiled, linked program object and save a compiled binary image to local storage. This means that, when you go to load that program again, you don't have to load it from text shaders (unless the version has changed); you can load it from the binary directly. This should make applications start up faster on subsequent executions.
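A rough sketch of that save/reload cycle might look like the following (assuming an ES 2.0 context whose extension string includes GL_OES_get_program_binary; error handling and the actual file I/O are left out):

    #include <EGL/egl.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>
    #include <vector>

    // After the first (slow) compile+link, grab the binary so it can be cached
    // on local storage. Entry points are resolved at run time with a current context.
    std::vector<unsigned char> saveProgramBinary(GLuint program, GLenum& formatOut)
    {
        static const PFNGLGETPROGRAMBINARYOESPROC pGetProgramBinaryOES =
            reinterpret_cast<PFNGLGETPROGRAMBINARYOESPROC>(
                eglGetProcAddress("glGetProgramBinaryOES"));

        GLint length = 0;
        glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH_OES, &length);

        std::vector<unsigned char> blob(length > 0 ? length : 0);
        GLsizei written = 0;
        pGetProgramBinaryOES(program, length, &written, &formatOut, blob.data());
        blob.resize(written);
        return blob;
    }

    // On later runs, feed the cached blob back in and skip the compiler entirely.
    bool loadProgramBinary(GLuint program, GLenum format, const std::vector<unsigned char>& blob)
    {
        static const PFNGLPROGRAMBINARYOESPROC pProgramBinaryOES =
            reinterpret_cast<PFNGLPROGRAMBINARYOESPROC>(
                eglGetProcAddress("glProgramBinaryOES"));

        pProgramBinaryOES(program, format, blob.data(), static_cast<GLint>(blob.size()));

        GLint linked = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &linked);
        return linked == GL_TRUE; // if this fails, fall back to the text shaders
    }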

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU-specific instructions within a DLL.
Slightly longer version: when downloading (32-bit) DLLs from, say, Microsoft, it seems that one size fits all processors. Does this mean that they are strictly built for the lowest common denominator (i.e. the minimum platform supported by the OS)? Or is there some technique that is used to export a single interface from the DLL but utilise CPU-specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
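For example, a small sketch of reading that key (assuming the usual per-processor subkey "0" exists; run-time feature detection, as the next answer describes, is the more robust way to decide which code path to use):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        char name[256] = {};
        DWORD size = sizeof(name);
        // Query the human-readable CPU name from the registry.
        if (RegGetValueA(HKEY_LOCAL_MACHINE,
                         "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
                         "ProcessorNameString", RRF_RT_REG_SZ,
                         nullptr, name, &size) == ERROR_SUCCESS)
        {
            std::printf("CPU: %s\n", name);
        }
        return 0;
    }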
The DLL is expected to work on every computer Win32 runs on, so in general you are stuck with the i386 instruction set. There is no official method of exposing functionality/code for specific instruction sets; you have to do it by hand and transparently.
The technique used is basically as follows:
- determine CPU features like MMX and SSE at run time
- if they are present, use them; if not, have fallback code ready
Because you cannot let your compiler optimise for anything other than i386, you will have to write the code that uses the specific instruction sets in inline assembler. I don't know if there are higher-level language toolkits for this. Determining the CPU features is straightforward, but may also need to be done in assembler.
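A hedged sketch of that detect-then-dispatch pattern, using the MSVC __cpuid intrinsic rather than hand-written assembler (the addArrays_* names are placeholders, not a real API):

    #include <intrin.h>

    static void addArrays_scalar(const float* a, const float* b, float* out, int n)
    {
        for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }

    static void addArrays_sse2(const float* a, const float* b, float* out, int n)
    {
        // In a real DLL this version would use SSE2 intrinsics (_mm_add_ps etc.);
        // kept scalar here so the sketch compiles everywhere.
        for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }

    // The function pointer acts as the "jump table" entry; picked once at load time.
    static void (*addArrays)(const float*, const float*, float*, int) = nullptr;

    static bool hasSSE2()
    {
        int info[4] = {};
        __cpuid(info, 1);                   // leaf 1: feature flags
        return (info[3] & (1 << 26)) != 0;  // EDX bit 26 = SSE2
    }

    // Call this from DllMain (DLL_PROCESS_ATTACH) or a one-time init function.
    void initDispatch()
    {
        addArrays = hasSSE2() ? addArrays_sse2 : addArrays_scalar;
    }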
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about a fallback; there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it too. (OK, you get two cakes: your DLL will be bigger.) Even MSVC 2005 can do it for very specific cases (e.g. memcpy() can use SSE4).
There are many ways to switch between different versions. A DLL is loaded because the loading process needs functions from it, and function names are converted into addresses. One solution is to let this lookup depend not just on the function name but also on processor features. Another method uses the fact that the name-to-address lookup goes through a table of pointers as an interim step; you can swap out the entire table. Or you could even have a branch inside critical functions, so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted at the generic x86 architecture for the simple reason that they have to work across the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed since), Microsoft used to optimize its DLLs for size rather than speed, because the reduction in the overall size of the DLL gave a bigger performance boost than any other optimization the compiler could generate. Speed-ups from micro-optimization are decidedly small compared to speed-ups from not having the CPU wait for memory; true improvements in speed come from reducing I/O or from improving the underlying algorithm.
Only a few critical loops that run at the heart of the program can benefit from micro-optimizations, simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall into this category. You can rest assured that such critical loops have already been optimized in assembler to some degree by Microsoft's software engineers, leaving little for the compiler to find. (I know that may be expecting too much, but I hope they do this.)
As you can see, there would be mostly drawbacks to bloating the DLL with additional versions of code tuned for different architectures, when most of that code is rarely used and is never part of the critical code that consumes most of your CPU cycles.
