Faster alternatives for calcOpticalFlowSF - performance

Are there any faster alternatives to calcOpticalFlowSF? It's just so slow, and I want to run it on a sequence of frames coming from a video. How can I do that?

There are several methods for optical-flow-based motion estimation, but you have to consider several things:
Are you restricted to a CPU implementation? A GPU implementation can decrease the run-time drastically.
Do you need dense motion fields or just a set of sparse motion vectors? Sparse OF methods are more scalable and thus need less run-time.
Accuracy: the very high accuracy of dense methods is mostly critical only at motion boundaries. In many applications you can approximate a dense motion field with a grid of sparse motion vectors, and thus can use sparse methods such as the pyramidal Lucas-Kanade (OpenCV).
Current libraries / methods are:
Dense Methods:
OpenCV 2.4.4 provides BroxOpticalFlow on the GPU, which is also fast
the FlowLib of the GPU4Vision group provides a highly accurate GPU implementation
a GPU implementation of the TV-L1 method is also available
Sparse Methods:
OpenCV provides the pyramidal Lucas-Kanade on the GPU since 2.4.2; earlier versions also contain a very fast CPU implementation
the RLOFLib provides a more accurate implementation for GPU, CPU, and MATLAB
the Gain-Adaptive Lucas-Kanade / KLT is also available for the GPU
You could also take a look at the current optical flow benchmarks, where the researchers sometimes provide links to their code. Common optical flow benchmarks are Middlebury and KITTI.
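If sparse motion vectors are enough for your application, a minimal sketch of the pyramidal Lucas-Kanade route in OpenCV might look like the following. The video path and parameter values are placeholders, not a definitive implementation:

```cpp
// Minimal sketch: sparse pyramidal Lucas-Kanade as a faster stand-in for
// calcOpticalFlowSF on a video stream. "video.avi" and the parameter
// values below are placeholders.
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::VideoCapture cap("video.avi");
    cv::Mat frame, gray, prevGray;
    if (!cap.read(frame)) return 1;
    cv::cvtColor(frame, prevGray, cv::COLOR_BGR2GRAY);

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        // Seed a sparse point set; a regular grid would also work if you
        // want to approximate a dense field.
        std::vector<cv::Point2f> prevPts, nextPts;
        cv::goodFeaturesToTrack(prevGray, prevPts, 500, 0.01, 8);

        std::vector<unsigned char> status;
        std::vector<float> err;
        if (!prevPts.empty())
            cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts,
                                     status, err);

        // ... use the (prevPts[i], nextPts[i]) pairs where status[i] != 0 ...
        gray.copyTo(prevGray);
    }
    return 0;
}
```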

Related

Which GLSL Multi Colour Linear/Radial Gradients Strategy to use?

I'm developing with OpenGL ES 2 & GLSL and I'm stuck on how to approach multi-coloured / fractioned gradients (linear and radial).
I don't know which approach is the best practice:
Get a texture of the gradient colours & sample it in the fragment shader (essentially working with a regular texture).
Procedurally generate a texture of the gradient first & sample it in the fragment shader as above (no need for PNGs etc. of the gradient), caching this texture to save regeneration.
Use interpolation in the fragment shader to calculate the fragment value from the fragment position; this looks like it would need multiple ifs and a loop, things you don't want executed per fragment.
Some other strategy I haven't conceived of.
I know this question is a bit on the subjective side, but having looked around online for information I've not found anything concrete about how to proceed...
Well, I can tell you how to proceed, but you may not like the answer. ;) The main two approaches are sampling a texture, or doing shader calculations. To decide which one is more efficient in your case, you need to implement both and start benchmarking. There are way too many factors influencing the performance of each to give a generic answer.
One of the major factors is of course how complex your calculations are. But modern GPUs have very high raw performance for pure calculations. Not quite as much for the mobile GPUs you're most likely using since you're asking about ES, but even the latest mobile GPUs have become quite powerful. Branches aren't free, but not necessarily as harmful as you might expect.
On the other hand, texture sampling looks like a single operation in the shader, but based on that alone you should not assume that it's automatically faster than executing a bunch of computations. Texture sampling performance can be limited by many factors, including throughput of the texture sampling hardware units, memory bandwidth, cache hit rates, etc. Particularly if your textures need to be fairly large to give you the necessary precision, memory bandwidth can hurt you, and accessing memory on a mobile device consumes significant power. Also, just the additional memory usage is undesirable since you mostly deal with very constrained amounts of memory.
Of course the performance characteristics can vary greatly between different GPUs. So if you want to make reliable conclusions, you need to benchmark on a variety of devices.
For the approach where you implement the computations in the shader, make sure that it is as optimal as it can be. Avoid branches where reasonably possible, or at least benchmark various options to see how much the branches hurt performance. If there are parts of the computation that are the same for each fragment, pre-compute the values and pass them into the shader. Replace expensive operations by cheaper ones where possible. For example, instead of dividing by a uniform value, pass in the inverse as a uniform, and use a multiplication instead. Use vector operations where possible.
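As a small illustration of that last point, here is a sketch of precomputing a reciprocal on the CPU so the fragment stage multiplies instead of divides. The program handle and uniform name are hypothetical, assumed only for the example:

```cpp
// Sketch only: pass the reciprocal of a gradient length as a uniform so
// each fragment does a multiply rather than a divide. 'program' and the
// uniform name 'u_invGradientLength' are made-up examples.
#include <GLES2/gl2.h>

void setGradientUniforms(GLuint program, float gradientLength) {
    // The fragment shader would declare: uniform float u_invGradientLength;
    // and compute: float t = distAlongGradient * u_invGradientLength;
    GLint loc = glGetUniformLocation(program, "u_invGradientLength");
    glUniform1f(loc, 1.0f / gradientLength);
}
```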

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In past years, quite a few projects succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates; it is, however, difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small-kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. What makes it interesting for us is that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1, so we will be able to validate and compare our results against the original and also potentially contribute to the development of the tool itself. I will update this post with the results in the future if I can.
The simplest possible example:
Accumulation of a sum. Unrolling using multiple accumulators and vectorization allows a speedup of (ADD latency) * (SIMD vector width) on typical pipelined architectures (if the data is in cache; because there is no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. A cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
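A hedged sketch of the multiple-accumulator idea (scalar code; a compiler's auto-vectorizer or explicit SIMD intrinsics would widen it further):

```cpp
// Sketch of the multiple-accumulator idea: four independent partial sums
// break the loop-carried dependency on a single accumulator, letting the
// pipelined FP adder (and the auto-vectorizer) overlap the additions.
#include <cstddef>

float sum_unrolled(const float* x, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   // remainder elements
        s += x[i];
    return s;
}
```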
A few classics from image/signal processing:
convolution with small kernels (especially small 2D convolutions like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small-kernel convolutions are quite different from either. (A naive baseline sketch follows this list.)
erode and dilate.
what image people call "gamma correction"; this is really evaluation of a power function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1], and that sub-ulp accuracy is rarely needed, to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
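Here is the naive baseline sketch promised above for the small-kernel convolution; an optimized version would keep loaded rows in registers, block the loops, and vectorize across x:

```cpp
// Naive 3x3 convolution sketch over a single-channel float image
// (row-major, width x height). This is the straightforward baseline that
// optimized versions (register blocking, SIMD) are measured against.
#include <cstddef>

void convolve3x3(const float* src, float* dst,
                 std::size_t width, std::size_t height,
                 const float kernel[9]) {
    for (std::size_t y = 1; y + 1 < height; ++y) {       // skip borders
        for (std::size_t x = 1; x + 1 < width; ++x) {
            float acc = 0.0f;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    acc += kernel[(ky + 1) * 3 + (kx + 1)] *
                           src[(y + ky) * width + (x + kx)];
            dst[y * width + x] = acc;
        }
    }
}
```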
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension: say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension much higher and dimensionality-reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this; a sketch follows the variations below.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
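A sketch of that brute-force kernel for the nearest-neighbour variations. Squared distances avoid the sqrt, and the inner loop over the 16 dimensions is the natural SIMD target:

```cpp
// Brute-force nearest-neighbour kernel sketch for points in D dimensions
// (here D = 16): the O(n^2) baseline that per-thread optimizations
// (SIMD, cache blocking) would target.
#include <cfloat>
#include <cstddef>

constexpr std::size_t D = 16;

// pts holds n points as n*D contiguous floats; returns the index of the
// point nearest to pts[query].
std::size_t nearest(const float* pts, std::size_t n, std::size_t query) {
    float best = FLT_MAX;
    std::size_t bestIdx = query;
    for (std::size_t j = 0; j < n; ++j) {
        if (j == query) continue;
        float d2 = 0.0f;
        for (std::size_t k = 0; k < D; ++k) {
            float diff = pts[query * D + k] - pts[j * D + k];
            d2 += diff * diff;            // squared Euclidean distance
        }
        if (d2 < best) { best = d2; bestIdx = j; }
    }
    return bestIdx;
}
```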
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup: do a quick test to see whether a sphere is farther away than the best seen so far, before doing all the 3-D geometry math (see the sketch below).
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
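A hedged sketch of the early-out test from the first point above; the geometry follows the standard ray-sphere derivation, and all names are illustrative rather than taken from c-ray:

```cpp
// Sketch of a ray-sphere intersection test with an early bound check:
// skip the full quadratic solve when the sphere cannot possibly beat the
// best hit distance found so far.
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true and writes tHit if the ray (orig, unit-length dir) hits the
// sphere closer than bestT.
bool raySphere(Vec3 orig, Vec3 dir, Vec3 center, float radius,
               float bestT, float& tHit) {
    Vec3 oc = sub(center, orig);
    float tCenter = dot(oc, dir);               // distance to closest approach
    if (tCenter - radius > bestT) return false; // cannot beat current best
    float d2 = dot(oc, oc) - tCenter * tCenter; // squared miss distance
    float r2 = radius * radius;
    if (d2 > r2) return false;                  // ray misses the sphere
    float dt = std::sqrt(r2 - d2);
    float t = tCenter - dt;                     // near intersection
    if (t < 0.0f) t = tCenter + dt;             // origin inside: take far hit
    if (t < 0.0f || t > bestT) return false;
    tHit = t;
    return true;
}
```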
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

space partitioning algorithm speed

I am developing a 3D game engine as a project. I would like to use space partitioning algorithms for each triangle/polygon in my scene to efficiently detect collisions. I just want to know (before I start programming the specifics) how fast a typical space partitioning algorithm is in modern computer games. I have dynamic objects, so I am thinking that I might have to repartition my scene every frame. Is that possible while still achieving a reasonable frame rate? It would be very much appreciated if the answer could include data (e.g. the FPS, number of polygons, etc.). If that is too much trouble, just tell me whether it is plausible to repartition every frame.
Any help would be appreciated.
I would first suggest reading the chapter on Broad-Phase Collision Detection with CUDA. It compares different broad-phase collision algorithms. Your performance depends on a few key variables. The first variable is whether or not your algorithm is implemented using GPU acceleration. The previously cited article contains some benchmarks of framerate vs. number of objects. At the very least it should give you a good idea of what is achievable.
If you don’t plan on GPU acceleration, sweep and prune is one of the easiest to implement. I don’t know the exact figures, but it performs well when there are few objects and your objects don’t vary too much between frames. In this situation, you have O(n) performance. This is a good one to start with as a baseline, because it is trivial to implement.
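To make the idea concrete, here is a minimal one-axis sweep-and-prune sketch. It re-sorts from scratch every call; a real implementation would keep the sorted endpoint list between frames and exploit the coherence mentioned above:

```cpp
// Minimal sweep-and-prune sketch along one axis: sort boxes by min-x,
// then report pairs whose x-intervals overlap as broad-phase candidates.
// The other axes (and a narrow phase) would filter these further.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct AABB { float minX, maxX; /* minY, maxY, ... */ };

std::vector<std::pair<std::size_t, std::size_t>>
broadPhaseX(const std::vector<AABB>& boxes) {
    std::vector<std::size_t> order(boxes.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return boxes[a].minX < boxes[b].minX;
    });

    std::vector<std::pair<std::size_t, std::size_t>> candidates;
    for (std::size_t i = 0; i < order.size(); ++i) {
        for (std::size_t j = i + 1; j < order.size(); ++j) {
            // Once a box starts past our right edge, no later box overlaps.
            if (boxes[order[j]].minX > boxes[order[i]].maxX) break;
            candidates.emplace_back(order[i], order[j]);
        }
    }
    return candidates;
}
```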
If you are really getting serious, I recommend the book Real-Time Collision Detection. I used this one when doing research on collision detection methods. It was a great resource.

Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.
I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it with the Jacobi method using chess-tournament ordering (see this paper (PDF) for more information).
In this algorithm, 32 threads are defined in each block, each block handles one small matrix, and the 32 threads work together to annihilate 32 off-diagonal elements until convergence. However, I am not very satisfied with its performance.
I am wondering whether there is a better algorithm for my problem, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess Householder's method may be a better choice, but I am not sure whether it can be implemented efficiently in CUDA. There is not a lot of useful information online, since most other programmers are more interested in using CUDA/OpenCL to decompose one large matrix rather than many small ones.
At least for the eigenvalues, a sample can be found in the CUDA SDK:
http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html
Images seem broken, but the download of the samples still works. I would suggest downloading the full SDK and having a look at that example. Also, this paper could be helpful:
http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf
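For reference, here is a serial sketch of the cyclic Jacobi sweep the question mentions; the CUDA variant would assign one matrix per block and rotate independent (p, q) pairs in parallel according to the tournament ordering. This is a sketch of the classical method, not the SDK sample's code:

```cpp
// Serial reference sketch of the cyclic Jacobi eigenvalue method for one
// small symmetric matrix. Eigenvalues end up on the diagonal of A;
// accumulating the rotations would additionally give the eigenvectors.
#include <cmath>
#include <vector>

// A is n x n symmetric, stored row-major.
void jacobiEigenvalues(std::vector<double>& A, int n, int sweeps = 10) {
    for (int s = 0; s < sweeps; ++s) {
        for (int p = 0; p < n - 1; ++p) {
            for (int q = p + 1; q < n; ++q) {
                double apq = A[p * n + q];
                if (std::fabs(apq) < 1e-12) continue;
                // Rotation angle that zeroes A[p][q]:
                // tan(2*theta) = 2*a_pq / (a_pp - a_qq)
                double theta = 0.5 * std::atan2(2.0 * apq,
                                                A[p * n + p] - A[q * n + q]);
                double c = std::cos(theta), t = std::sin(theta);
                // Apply A <- J^T A J on columns p, q ...
                for (int k = 0; k < n; ++k) {
                    double akp = A[k * n + p], akq = A[k * n + q];
                    A[k * n + p] = c * akp + t * akq;
                    A[k * n + q] = -t * akp + c * akq;
                }
                // ... and on rows p, q.
                for (int k = 0; k < n; ++k) {
                    double apk = A[p * n + k], aqk = A[q * n + k];
                    A[p * n + k] = c * apk + t * aqk;
                    A[q * n + k] = -t * apk + c * aqk;
                }
            }
        }
    }
}
```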

VowpalWabbit: Differences and scalability

I am trying to ascertain how VowpalWabbit's "state" is maintained as the size of our input set grows. In a typical machine learning environment, if I have 1000 input vectors, I would expect to send all of those at once, wait for a model building phase to complete, and then use the model to create new predictions.
In VW, it appears that the "online" nature of the algorithm shifts this paradigm to be more performant and capable of adjusting in real-time.
How is this real-time model modification implemented?
Does VW take increasing resources with respect to total input data size over time? That is, as I add more data to my VW model (when it is small), do the real-time adjustment calculations begin to take longer once the cumulative number of feature-vector inputs increases into the thousands, tens of thousands, or millions?
Just to add to carlosdc's good answer.
Some of the features that set vowpal wabbit apart, and allow it to scale to tera-feature (10^12) data sizes, are:
The online weight vector:
vowpal wabbit maintains an in-memory weight vector, which is essentially the vector of weights for the model it is building. This is what you call "the state" in your question.
Unbounded data size:
The size of the weight-vector is proportional to the number of features (independent input variables), not the number of examples (instances). This is what makes vowpal wabbit, unlike many other (non online) learners, scale in space. Since it doesn't need to load all the data into memory like a typical batch-learner does, it can still learn from data-sets that are too big to fit in memory.
Cluster mode:
vowpal wabbit supports running on multiple hosts in a cluster, imposing a binary tree graph structure on the nodes and using the all-reduce reduction from leaves to root.
Hash trick:
vowpal wabbit employs what's called the hashing trick. All feature names get hashed into an integer using murmurhash-32. This has several advantages: it is very simple and time-efficient, avoiding hash-table management and collision handling, while allowing features to occasionally collide. It turns out (in practice) that a small number of feature collisions in a training set with thousands of distinct features acts like an implicit regularization term, and this, counter-intuitively, often improves model accuracy rather than decreasing it. It is also agnostic to the sparseness (or density) of the feature space. Finally, it allows the input feature names to be arbitrary strings, unlike most conventional learners which require feature names/IDs to be both a) numeric and b) unique. (A toy sketch of this idea appears at the end of this answer.)
Parallelism:
vowpal wabbit exploits multi-core CPUs by running the parsing and learning in two separate threads, adding further to its speed. This is what makes vw able to learn as fast as it reads data. It turns out that most supported algorithms in vw, counter-intuitively, are bottlenecked by I/O speed rather than by learning speed.
Checkpointing and incremental learning:
vowpal wabbit allows you to save your model to disk while you learn, and then to load the model and continue learning where you left off with the --save_resume option.
Test-like error estimate:
The average loss calculated by vowpal wabbit "as it goes" is always on unseen (out of sample) data (*). This eliminates the need to bother with pre-planned hold-outs or do cross validation. The error rate you see during training is 'test-like'.
Beyond linear models:
vowpal wabbit supports several algorithms, including matrix factorization (roughly sparse matrix SVD), Latent Dirichlet Allocation (LDA), and more. It also supports on-the-fly generation of term interactions (bi-linear, quadratic, cubic, and feed-forward sigmoid neural-net with user-specified number of units), multi-class classification (in addition to basic regression and binary classification), and more.
There are tutorials and many examples in the official vw wiki on github.
(*) One exception is if you use multiple passes with the --passes N option.
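Here is the toy sketch of the hashing trick promised above. FNV-1a stands in for the murmurhash-32 that vw actually uses, and the power-of-two table size mirrors vw's -b (bit precision) option; everything else is illustrative:

```cpp
// Toy sketch of the hashing trick: map arbitrary feature-name strings into
// a fixed-size weight vector. FNV-1a is a stand-in for murmurhash-32.
#include <cstdint>
#include <string>
#include <vector>

std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

struct HashedModel {
    std::vector<float> weights;                 // size 2^bits, like vw's -b
    explicit HashedModel(unsigned bits) : weights(1u << bits, 0.0f) {}

    // Arbitrary string feature names; collisions are tolerated by design.
    float& weightFor(const std::string& featureName) {
        return weights[fnv1a(featureName) & (weights.size() - 1)];
    }
};
```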
VW is a (very) sophisticated implementation of stochastic gradient descent. You can read more about stochastic gradient descent here
It turns out that a good implementation of stochastic gradient descent is basically I/O bound: it goes as fast as you can feed it the data, so VW has some sophisticated data structures to "compile" the data.
Therefore the answer to question (1) is: by doing stochastic gradient descent, and the answer to question (2) is: definitely not.
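To make that concrete, here is a toy sketch of one online SGD step for squared loss. This is the core idea, not vw's actual update rule (which layers on importance-weighted, normalized, and adaptive refinements):

```cpp
// Minimal sketch of an online SGD update for squared loss: each example
// updates the weight vector immediately, so memory scales with the number
// of features, not the number of examples.
#include <cstddef>
#include <utility>
#include <vector>

// x is one sparse example of (featureIndex, value) pairs with label y.
// Returns the prediction made before the update.
float predictAndUpdate(std::vector<float>& w,
                       const std::vector<std::pair<std::size_t, float>>& x,
                       float y, float learningRate) {
    float yhat = 0.0f;
    for (const auto& f : x) yhat += w[f.first] * f.second;
    float grad = yhat - y;                 // d(0.5*(yhat-y)^2)/dyhat
    for (const auto& f : x)
        w[f.first] -= learningRate * grad * f.second;
    return yhat;
}
```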
