Multi-channel Lattice Recursive Least Squares - algorithm

I'm trying to implement multi-channel lattice RLS, i.e. the recursive least squares algorithm which performs noise cancellation with multiple inputs but a single 'desired output'.
I have the basic RLS algorithm working with multiple components, but it's too inefficient and memory intensive for my purpose.
Wikipedia has an excellent example of lattice RLS, which works great.
https://en.wikipedia.org/wiki/Recursive_least_squares_filter
However, the sources it cites do not go into much detail on how to extend this to the multi-channel case, and re-doing the full derivation is a bit beyond me.
Does anyone know a good source which describes or implements this algorithm in the multi-channel case? Many thanks.

Use separate parallel adaptive filters, one for each noise reference, and combine their outputs to subtract from your noisy signal. LMS usually works best, but RLS is fine. Problems arise if any of the noise references are heavily correlated with the desired signal.
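As a rough illustration of the parallel-filter structure described above, here is a minimal sketch using plain LMS (not lattice RLS). The function name `multi_reference_lms`, the tap count, and the step size `mu` are assumptions chosen for the example, not values from any particular source.

```python
def multi_reference_lms(primary, references, taps=4, mu=0.01):
    """Cancel noise from `primary` with one FIR LMS filter per noise reference.

    primary    -- list of samples: desired signal plus noise
    references -- list of noise-reference channels, each as long as `primary`
    Returns (cleaned_signal, final_weights).
    """
    n_refs = len(references)
    weights = [[0.0] * taps for _ in range(n_refs)]
    history = [[0.0] * taps for _ in range(n_refs)]
    cleaned = []
    for n, d in enumerate(primary):
        # Shift the newest reference sample into each filter's delay line.
        for c in range(n_refs):
            history[c] = [references[c][n]] + history[c][:-1]
        # Combined noise estimate: sum of all parallel filter outputs.
        estimate = sum(
            w * x
            for c in range(n_refs)
            for w, x in zip(weights[c], history[c])
        )
        # The error is also the cleaned output sample.
        error = d - estimate
        cleaned.append(error)
        # Every filter adapts from the same shared error signal.
        for c in range(n_refs):
            weights[c] = [
                w + mu * error * x for w, x in zip(weights[c], history[c])
            ]
    return cleaned, weights
```

Because all filters share one error signal, this is equivalent to a single LMS filter over the concatenated reference vectors; the step size must be small relative to the total reference power for the update to stay stable.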


Recurrences in NEAT/HyperNEAT algorithm and intermediate results

I am currently implementing a HyperNEAT-like algorithm in C, but there are two crucial aspects of the algorithm that I am not able to implement properly. I have been digging through the original source code for the NEAT and HyperNEAT algorithms with no success. Both issues relate to NEAT/CPPN recurrences caused by inner feedback loops.
First issue
What is the proper computation sequence in NEAT/CPPNs with feedback loops? The following figure shows an example of recurrence in the topology:
Feedback loops in topology
On the first computation, the feedback links do not hold any result from earlier computations. Should I perform the first computation with empty links?
Second issue
Imagine I want to produce an image by passing pixel coordinates to NEAT as inputs. As far as I know, the NEAT model should receive one input sample per pixel. Should I keep the intermediate results of the topology from previous pixels?
If the network is feedforward this issue has no effect, but if it has feedback loops the results change. (The same issue applies to the CPPN in HyperNEAT when indirectly encoding the substrates.)
I am aware that these questions are also related to graph theory, but I want to know how they are handled in NEAT algorithms.
Thanks!
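To make the two issues concrete, here is a minimal sketch of one possible convention. It is an assumption for illustration, not taken from the NEAT or HyperNEAT sources: all activations start at zero, every link reads the previous step's values, and a `reset` flag decides whether state carries over between input samples. The names `step` and `evaluate`, the `tanh` activation, and the fixed number of steps are all hypothetical choices.

```python
import math


def step(connections, state, inputs, num_nodes):
    """One synchronous update: every link reads the previous step's activations.

    connections -- list of (source, target, weight) tuples
    state       -- list of node activations from the previous step
    inputs      -- dict mapping input-node index to its clamped value
    """
    prev = list(state)
    for node, value in inputs.items():
        prev[node] = value
    sums = [0.0] * num_nodes
    for src, dst, weight in connections:
        sums[dst] += weight * prev[src]
    new_state = [math.tanh(s) for s in sums]
    # Input nodes keep their clamped values.
    for node, value in inputs.items():
        new_state[node] = value
    return new_state


def evaluate(connections, samples, num_nodes, output_node, steps=3, reset=True):
    """Feed samples one by one; optionally forget state between samples."""
    state = [0.0] * num_nodes  # "empty" feedback links start at zero
    outputs = []
    for inputs in samples:
        if reset:
            state = [0.0] * num_nodes
        for _ in range(steps):
            state = step(connections, state, inputs, num_nodes)
        outputs.append(state[output_node])
    return outputs
```

With a feedforward topology and enough steps to propagate through every layer, `reset=True` and `reset=False` give the same outputs; with a feedback link they generally do not, which is exactly the ambiguity raised in the second issue.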

How to Scale SPICE Matrix so LU-decomposition doesn't Fail

I am implementing a SPICE solver. I have the following problem: say I put two diodes and a current source in series (standard diodes). I use MNA and Boost LU decomposition. The problem is that the nodal matrix very quickly becomes near-singular. I think I have to scale the values, but I don't know how, and I couldn't find anything on the Internet. Any ideas how to do this scaling?
From a numerical perspective, there is a scaling technique for this kind of near-singular matrix. Basically, you divide each row of A by the sum (or maximum) of the absolute values in that row. For more details, look at KLU, a linear solver for circuit simulation.
From a SPICE simulation perspective, the so-called Gmin stepping technique is used to iteratively approach the real answer. You can find this in the documentation of the SPICE-like project QUCS (Quite Universal Circuit Simulator).
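The row scaling described above can be sketched as follows. This is a minimal illustration using the maximum absolute value per row; the function name `scale_rows` is hypothetical, and the right-hand side is scaled along with the matrix so the solution of the system is unchanged.

```python
def scale_rows(a, b):
    """Divide each row of A, and the matching entry of b, by the row's largest |entry|.

    a -- matrix as a list of rows
    b -- right-hand-side vector
    Returns (scaled_a, scaled_b); the solution x of A x = b is unchanged.
    """
    scaled_a, scaled_b = [], []
    for row, rhs in zip(a, b):
        factor = max(abs(v) for v in row)
        if factor == 0.0:
            raise ValueError("matrix has an all-zero row")
        scaled_a.append([v / factor for v in row])
        scaled_b.append(rhs / factor)
    return scaled_a, scaled_b
```

After scaling, every row has a largest entry of magnitude 1, which makes pivot comparisons between rows more meaningful during LU decomposition.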
Scaling does not help when the matrix has both very large and very small entries.
It is necessary to use some or all of the many tricks that were developed for circuit solver applications. A good start is clipping the range of the exponential and log function arguments to reasonable values -- in most circuits a diode forward voltage is never more than 1V and the diode reverse current not less than 1pA.
Actually, look at all library functions and wrap them in code that makes their arguments and results suitable for circuit-solving purposes. Simple clipping is sometimes good enough, but it is way better to make sure the functions stay (twice) differentiable and continuous.
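As a sketch of such a wrapper, the exponential can be continued past a limit by its second-order Taylor polynomial, which keeps the value, slope, and second derivative continuous at the switch point. The limit `x_max`, the saturation current `i_s`, and the thermal voltage `v_t` below are assumed example values, and the function names are hypothetical.

```python
import math


def limited_exp(x, x_max=40.0):
    """exp(x) up to x_max, then a quadratic continuation.

    The continuation matches exp and its first two derivatives at x_max,
    so the function stays twice differentiable and never overflows for
    realistic arguments.
    """
    if x <= x_max:
        return math.exp(x)
    d = x - x_max
    return math.exp(x_max) * (1.0 + d + 0.5 * d * d)


def diode_current(v, i_s=1e-12, v_t=0.025852):
    """Shockley diode equation built on the limited exponential (assumed parameters)."""
    return i_s * (limited_exp(v / v_t) - 1.0)
```

A plain `math.exp(v / v_t)` raises `OverflowError` once a Newton iteration overshoots to a voltage of a few tens of volts; the wrapped version returns a large but finite current with a usable derivative, so the iteration can recover.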

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
Accumulation of a sum. Unrolling using multiple accumulators and vectorization allows a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
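The structure of the multiple-accumulator sum can be sketched as below. Python will not show the speedup itself, since the interpreter overhead dominates; the sketch only illustrates the dependency pattern that a C compiler or SIMD intrinsics would exploit. The choice of four accumulators and the function name are assumptions for the example.

```python
def sum_four_accumulators(values):
    """Sum floats using four independent accumulators.

    Each accumulator forms its own dependency chain, so a pipelined CPU can
    overlap the additions instead of waiting for each one to finish.
    """
    acc0 = acc1 = acc2 = acc3 = 0.0
    n = len(values) - len(values) % 4
    for i in range(0, n, 4):
        acc0 += values[i]
        acc1 += values[i + 1]
        acc2 += values[i + 2]
        acc3 += values[i + 3]
    # Combine the partial sums pairwise, then handle the leftover elements.
    total = (acc0 + acc1) + (acc2 + acc3)
    for v in values[n:]:
        total += v
    return total
```

Splitting the sum into shorter partial sums is also why the rounding error tends to shrink: each accumulator adds fewer terms before the final combination.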
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1] and sub-ulp accuracy is rarely needed to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
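For reference, the brute-force closest-pair kernel mentioned above is only a few lines; the optimization work lies in blocking it for cache and vectorizing the inner distance loop. This is a minimal sketch with a hypothetical function name, returning the squared distance to avoid an unnecessary square root.

```python
def closest_pair(points):
    """Brute-force closest pair.

    points -- list of equal-length coordinate tuples
    Returns (squared_distance, i, j) for the closest pair with i < j.
    """
    best = (float("inf"), -1, -1)
    for i in range(len(points)):
        p = points[i]
        for j in range(i + 1, len(points)):
            q = points[j]
            # Inner loop over dimensions: the part worth vectorizing.
            d2 = 0.0
            for a, b in zip(p, q):
                diff = a - b
                d2 += diff * diff
            if d2 < best[0]:
                best = (d2, i, j)
    return best
```

The red/blue variation uses the same kernel with the two loops running over different point sets.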
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.
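The quick rejection test suggested above can be sketched like this. It is not c-ray's actual code; the `ray_sphere` here is a fresh, minimal implementation that assumes a unit-length ray direction, and `nearest_hit` is a hypothetical helper. The rejection relies on the fact that any hit on a sphere is at least (distance to center minus radius) away from the ray origin.

```python
import math


def ray_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest sphere hit, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    t = -b - root
    if t < 0.0:
        t = -b + root  # origin is inside the sphere
    return t if t >= 0.0 else None


def nearest_hit(origin, direction, spheres):
    """Linear search with a cheap distance-based rejection test.

    spheres -- list of (center, radius) pairs
    Returns (distance, index) of the nearest hit, or (inf, None).
    """
    best_t, best_index = math.inf, None
    for index, (center, radius) in enumerate(spheres):
        dist2 = sum((c - o) ** 2 for o, c in zip(origin, center))
        reach = best_t + radius
        # No point on this sphere can be closer than the best hit so far.
        if dist2 >= reach * reach:
            continue
        t = ray_sphere(origin, direction, center, radius)
        if t is not None and t < best_t:
            best_t, best_index = t, index
    return best_t, best_index
```

The rejection test costs a handful of multiply-adds and no square root, so it pays off most when spheres are visited roughly front to back.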

How do people prove the correctness of Computer Vision methods?

I'd like to pose a few abstract questions about computer vision research. I haven't quite been able to answer these questions by searching the web and reading papers.
How does someone know whether a computer vision algorithm is correct?
How do we define "correct" in the context of computer vision?
Do formal proofs play a role in understanding the correctness of computer vision algorithms?
A bit of background: I'm about to start my PhD in Computer Science. I enjoy designing fast parallel algorithms and proving the correctness of these algorithms. I've also used OpenCV from some class projects, though I don't have much formal training in computer vision.
I've been approached by a potential thesis advisor who works on designing faster and more scalable algorithms for computer vision (e.g. fast image segmentation). I'm trying to understand the common practices in solving computer vision problems.
You just don't prove them.
Instead of a formal proof, which is often impossible, you can test your algorithm on a set of test cases and compare the output with previously known algorithms or correct answers (for example, when you recognize text, you can generate a set of images where you know what the text says).
In practice, computer vision is more like an empirical science: You gather data, think of simple hypotheses that could explain some aspect of your data, then test those hypotheses. You usually don't have a clear definition of "correct" for high-level CV tasks like face recognition, so you can't prove correctness.
Low-level algorithms are a different matter, though: You usually have a clear, mathematical definition of "correct" here. For example, if you invented an algorithm that can calculate a median filter or a morphological operation more efficiently than known algorithms, or that can be parallelized better, you would of course have to prove its correctness, just like any other algorithm.
It's also common to have certain requirements for a computer vision algorithm that can be formalized: For example, you might want your algorithm to be invariant to rotation and translation - these are properties that can be proven formally. It's also sometimes possible to create mathematical models of signal and noise, and design a filter that has the best possible signal-to-noise ratio (IIRC the Wiener filter or the Canny edge detector were designed that way).
Many image processing/computer vision algorithms have some kind of "repeat until convergence" loop (e.g. snakes or Navier-Stokes inpainting and other PDE-based methods). You would at least try to prove that the algorithm converges for any input.
This is my personal opinion, so take it for what it's worth.
You can't prove the correctness of most of the Computer Vision methods right now. I consider most of the current methods some kind of "recipe" where ingredients are thrown down until the "result" is good enough. Can you prove that a brownie cake is correct?
It is a bit similar in a way to how machine learning evolved. At first, people did neural networks, but it was just a big "soup" that happened to work more or less. It worked sometimes, didn't on other cases, and no one really knew why. Then statistical learning (through Vapnik among others) kicked in, with some real mathematical backup. You could prove that you had the unique hyperplane that minimized a particular loss function, PCA gives you the closest matrix of fixed rank to a given matrix (considering the Frobenius norm I believe), etc...
Now, there are still a few things that are "correct" in computer vision, but they are pretty limited. What comes to my mind is wavelets: they are the sparsest representation in an orthogonal basis of functions (i.e. the most compressed way to represent an approximation of an image with minimal error).
Computer vision algorithms are not like theorems which you can prove. They usually try to interpret the image data in terms that are more understandable to us humans, as in face recognition, motion detection, video surveillance, etc. The exact correctness is not calculable, unlike the case of image compression algorithms, where you can easily judge the result by the size of the images.
The most common way to present results for computer vision methods (especially classification problems) is with graphs of precision vs. recall or accuracy vs. false positives. These are measured on standard databases available on various sites. Usually, the more you tune the parameters to catch every correct detection, the more false positives you generate. The typical practice is to choose the point on the graph according to your requirement of 'how many false positives are tolerable for the application'.
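As a small illustration of how such a curve is produced, the sketch below sweeps a detection threshold over scored samples and reports precision and recall at each setting. The function name and the conventions for empty denominators are assumptions made for the example.

```python
def precision_recall_curve(scores, labels, thresholds):
    """Compute (threshold, precision, recall) for each threshold.

    scores -- detector confidence per sample
    labels -- ground truth per sample (truthy = positive)
    A sample counts as detected when its score is >= the threshold.
    """
    curve = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y)
        # Conventions for empty denominators are a choice, not a standard.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((th, precision, recall))
    return curve
```

Lowering the threshold moves along the curve toward higher recall and, typically, lower precision; the operating point is then picked from the application's tolerance for false positives.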

What are canonical examples of parallel computation?

I am writing a paper to test a new application that will demonstrate the benefits of parallelized computation (compared to the traditional serialized version of this application). I want to use the canonical examples for parallel computation in my paper.
My first example is the parallel computation of pi. I would ideally like an example where each iteration is very time consuming (because of the additional overhead associated with parallelizing); my first thought is a Bayesian simulation with MCMC and Gibbs sampling.
What other problems are typically discussed in this context? What are good examples of large embarrassingly parallel problems?
just a few more -
Multiplying matrices
Inverting matrices
FFT
String matching
Rendering 3d scenes (via scan line conversion or ray tracing)
One example I've used in the past of an embarrassingly parallel problem is visualizing the mandelbrot set. Each pixel can be computed independently.
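A minimal sketch of why this is embarrassingly parallel: each row (indeed each pixel) depends only on its own coordinates, so rows can be handed to any parallel map. The function names, the viewport, and the `parallel_map` parameter are assumptions for illustration; the default here is the built-in serial `map`.

```python
def escape_time(c, max_iter=50):
    """Iterations before |z| exceeds 2 under z -> z*z + c, capped at max_iter."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def render_row(y, width=60, height=24, max_iter=50):
    """Compute one image row; it needs no data from any other row."""
    im = -1.0 + 2.0 * y / (height - 1)
    return [
        escape_time(complex(-2.0 + 3.0 * x / (width - 1), im), max_iter)
        for x in range(width)
    ]


def render(width=60, height=24, max_iter=50, parallel_map=map):
    """Render all rows; pass a parallel map to distribute the rows."""
    return list(
        parallel_map(lambda y: render_row(y, width, height, max_iter), range(height))
    )
```

Note that rows near the set take far longer than rows far from it, so a good parallel version also has to balance the load, for example by handing out rows dynamically.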
Conway's Life is interesting as well, in that each value of the "next" board can be computed independently, but will depend on the relevant bits of the "current" board being done already.
I would suggest that canonical examples of parallel computation and embarrassingly parallel problems are, if not completely, then nearly, disjoint sets. To put it another way, people working in parallel computation aren't terribly excited about embarrassingly parallel problems; we call them that because we'd be embarrassed to be working on them.
I'd be looking, if I were you, at these (a not entirely original list):
linear algebra on large dense matrices, both direct and iterative approaches;
linear algebra on huge sparse matrices
branch and bound approaches to linear programming (and related) problems;
sequence matching for bioinformatics (outside my field, I may have mis-expressed this);
continuous optimisation.
I expect there are many more.
EDIT: You may be interested in this list of problems which have been selected for benchmarking the next generation of European (academic) supercomputers. It will give you some idea of where that niche is heading.
Molecular dynamics simulations allow you to change the size of the problem until your computer resources are exhausted (e.g. 256 particles vs. 256,000,000 particles). It's truly a "canonical" example if you run the MD simulations under NVT conditions ;-)
My favorite example is monte carlo simulation.
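A minimal sketch of a Monte Carlo estimate of pi, split into independent chunks so that each chunk could run on a separate worker. The function names, chunk sizes, and the use of the chunk index as the seed are assumptions for illustration; the chunks are mapped serially here.

```python
import random


def count_hits(samples, seed):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator: chunks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(chunks=8, samples_per_chunk=20000):
    """Combine independent chunks; each call to count_hits could run in parallel."""
    hits = sum(count_hits(samples_per_chunk, seed) for seed in range(chunks))
    return 4.0 * hits / (chunks * samples_per_chunk)
```

The only communication is the final sum of hit counts, which is what makes this kind of simulation scale so easily.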
Word counting seems to be the canonical example for MapReduce.
http://en.wikipedia.org/wiki/MapReduce#Example
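The word-count pattern can be sketched in a few lines: a map phase counts words in each chunk independently, and a reduce phase merges the partial counts. This is a single-process illustration of the idea with hypothetical function names, not the API of any MapReduce framework.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Count the words in one chunk of text; chunks are independent."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Merge two partial counts."""
    return left + right


def word_count(chunks):
    """Apply the map phase per chunk, then fold the partial results together."""
    return reduce(reduce_phase, map(map_phase, chunks), Counter())
```

For example, `word_count(["a b a", "b c"])` merges the per-chunk counts into totals for `a`, `b`, and `c`.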
Finding collisions in hash functions using Paul C. van Oorschot and Michael J. Wiener's method comes up often in various cryptographic settings.
I used the Mandelbrot set demo to explain to my mom what parallel programming is about : http://www.ateji.com/px/demo.html
All the examples you mention are mostly heavy data-parallel codes. You'll probably also want to mention task-oriented codes, such as servers responding to many requests in parallel, and data-flow or stream programming examples (MapReduce is a good representative of this class).
