Locally weighted logistic regression - algorithm

I have been trying to implement a locally-weighted logistic regression algorithm in Ruby. As far as I know, no library currently exists for this algorithm, and there is very little information available, so it's been difficult.
My main resource has been the dissertation of Dr. Kan Deng, in which he described the algorithm in what I feel is pretty light detail. My work so far on the library is here.
I've run into trouble when trying to calculate B (beta). From what I understand, B is a (1+d x 1) vector that represents the local weighting for a particular point. After that, pi (the probability of a positive output) for that point is the sigmoid function based on the B for that point. To get B, use the Newton-Raphson algorithm recursively a certain number of times, probably no more than ten.
Equation 4-4 on page 66, the Newton-Raphson algorithm itself, doesn't make sense to me. Based on my understanding of what X and W are, (x.transpose * w * x).inverse * x.transpose * w should be a (1+d x N) matrix, which doesn't match up with B, which is (1+d x 1). The only way that would work, then, is if e were a (N x 1) vector.
At the top of page 67, under the picture, though, Dr. Deng just says that e is a ratio, which doesn't make sense to me. Is e Euler's Constant, and it just so happens that that ratio is always 2.718:1, or is it something else? Either way, the explanation doesn't seem to suggest, to me, that it's a vector, which leaves me confused.
The use of pi' is also confusing to me. Equation 4-5, the derivative of the sigmoid function w.r.t. B, gives a constant multiplied by a vector, or a vector. From my understanding, though, pi' is just supposed to be a number, to be multiplied by w and form the diagonal of the weight algorithm W.
So, my two main questions here are, what is e on page 67 and is that the 1xN matrix I need, and how does pi' in equation 4-5 end up a number?
I realize that this is a difficult question to answer, so if there is a good answer then I will come back in a few days and give it a fifty point bounty. I would send an e-mail to Dr. Deng, but I haven't been able to find out what happened to him after 1997.
If anyone has any experience with this algorithm or knows of any other resources, any help would be much appreciated!

As far as I can see, this is just a version of Logistic regression in which the terms in the log-likelihood function have a multiplicative weight depending on their distance from the point you are trying to classify. I would start by getting familiar with an explanation of logistic regression, such as http://czep.net/stat/mlelr.pdf. The "e" you mention seems to be totally unconnected with Euler's constant - I think he is using e for error.
If you can call Java from Ruby, you may be able to make use of the logistic classifier in Weka described at http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/Logistic.html - this says "Although original Logistic Regression does not deal with instance weights, we modify the algorithm a little bit to handle the instance weights." If nothing else, you could download it and look at its source code. If you do this, note that it is a fairly sophisticated approach - for instance, they check beforehand to see if all the points actually lie pretty much in some subspace of the input space, and project down a few dimensions if they do.

Related

Should one calculate QR decomposition before Least Squares to speed up the process?

I am reading the book "Introduction to linear algebra" by Gilbert Strang. The section is called "Orthonormal Bases and Gram-Schmidt". The author several times emphasised the fact that with orthonormal basis it's very easy and fast to calculate Least Squares solution, since Qᵀ*Q = I, where Q is a design matrix with orthonormal basis. So your equation becomes x̂ = Qᵀb.
And I got the impression that it's a good idea to every time calculate QR decomposition before applying Least Squares. But later I figured out time complexity for QR decomposition and it turned out to be that calculating QR decomposition and after that applying Least Squares is more expensive than regular x̂ = inv(AᵀA)Aᵀb.
Is that right that there is no point in using QR decomposition to speed up Least Squares? Or maybe I got something wrong?
So the only purpose of QR decomposition regarding Least Squares is numerical stability?
There are many ways to do least squares; typically these vary in applicability, accuracy and speed.
Perhaps the Rolls-Royce method is to use SVD. This can be used to solve under-determined (fewer obs than states) and singular systems (where A'*A is not invertible) and is very accurate. It is also the slowest.
QR can only be used to solve non-singular systems (that is we must have A'*A invertible, ie A must be of full rank), and though perhaps not as accurate as SVD is also a good deal faster.
The normal equations ie
compute P = A'*A
solve P*x = A'*b
is the fastest (perhaps by a large margin if P can be computed efficiently, for example if A is sparse) but is also the least accurate. This too can only be used to solve non singular systems.
Inaccuracy should not be taken lightly nor dismissed as some academic fanciness. If you happen to know that the problems ypu will be solving are nicely behaved, then it might well be fine to use an inaccurate method. But otherwise the inaccurate routine might well fail (ie say there is no solution when there is, or worse come up with a totally bogus answer).
I'm a but confused that you seem to be suggesting forming and solving the normal equations after performing the QR decomposition. The usual way to use QR in least squares is, if A is nObs x nStates:
decompose A as A = Q*(R )
(0 )
transform b into b~ = Q'*b
(here R is upper triangular)
solve R * x = b# for x,
(here b# is the first nStates entries of b~)

Multichannel blind deconvolution in the simplest formulation: how to solve?

Recently I began to study deconvolution algorithms and met the following acquisition model:
where f is the original (latent) image, g is the input (observed) image, h is the point spread function (degradation kernel), n is a random additive noise and * is the convolution operator.
If we know g and h, then we can recover f using Richardson-Lucy algorithm:
where , (W,H) is the size of rectangular support of h and multiplication and division are pointwise. Simple enough to code in C++, so I did just so. It turned out that approximates to f while i is less then some m and then it starts rapidly decay. So the algorithm just needed to be stopped at this m - the most satisfactory iteration.
If the point spread function g is also unknown then the problem is said to be blind, and the modification of Richardson-Lucy algorithm can be applied:
For initial guess for f we can take g, as before, and for initial guess for h we can take trivial PSF, or any simple form that would look similar to observed image degradation. This algorithm also works quit fine on the simulated data.
Now I consider the multiframe blind deconvolution problem with the following acquisition model:
Is there a way to develop Richardson-Lucy algorithm for solving the problem in this formulation? If no, is there any other iterative procedure for recovering f, that wouldn't be much more complicated than the previous ones?
According to your acquisition model, latent image (f) remains same while the observed images are different due to different psf and noise models. One way to look at it, is a motion-blur problem where a sharp and noise-free image(f) is corrupted by the motion blur kernel. As this is an ill-posed problem, in most of the literature it's solved iteratively by estimating the blur kernel and the latent image. The way you solve this depends entirely on your objective function.
For example in some papers IRLS is used to estimate the blur kernel. You can find a lot of literature on this.
If you want to use Richardson Lucy Blind deconvolution, then use it on just one frame.
One strategy can be in each iteration while recovering f, assign different weights for contribution from each g(observed images). You can incorporate different weights in the objective function or calculate them according to the estimated blur kernel.
Is there a way to develop Richardson-Lucy algorithm for solving the problem in this formulation?
I'm not a specialist in this area, but I don't think that such way to construct an algorithm exists, at least not straightforwardly. Here is my argument for this. The first problem you described (when the psf is known) is already ill-posed due to the random nature of the noise and loss of information about convolution near image edges. The second problem on your list — single-channel blind deconvolution — is the extention of the previous one. In this case in addition it's underdetermined, so the ill-posedness expands, and so it's natural that the method to solve this problem is developed from the method for solving the first problem. Now when we consider the multichannel blind deconvolution formulation, we add a bunch of additional information to our previous model and so the problem goes from underdetermined to overdetermined. This is the whole other kind of ill-posedness and hence different approaches to solution are required.
is there any other iterative procedure for recovering f, that wouldn't be much more complicated than the previous ones?
I can recommend the algorithm introduced by Šroubek and Milanfar in [1]. I'm not sure whether it's much more complicated on your opinion or not so much, but it's by far one of the most recent and robust. The formulation of the problem is precisely the same as you wrote. The algorithm takes as input K>1 number of images, the upper bound of the psf size L, and four tuning parameters: alpha, beta, gamma, delta. To specify gamma, for example, you will need to estimate the variance of the noise on your input images and take the largest variance var, then gamma = 1/var. The algorithm solves the following optimization problem using alternating minimization:
where F is the data fidelity term and Q and R are regularizers of the image and blurs, respectively.
For detailed analysis of the algorithm see [1], for a collection of different deconvolution formulation and their solutions see [2]. Hope it helps.
Referenses:
Filip Šroubek, Peyman Milanfar. —- Robust Multichannel Blind Deconvolution via Fast Alternating Minimization.
-— IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012
Patrizio Campisi, Karen Egiazarian. —- Blind Image Deconvolution: Theory and Applications

How to find the best (most informative) plotting limits?

I am developing a 2D plotting program for functions of 1 variable. It is designed to be very simple so that the user should not have to select the initial plot limits (or "range").
Are there known algorithms that can find the most interesting plot limits, knowing only the function f(x) ?
Notes:
The definition of interesting plot limits is not well defined here. This is part of the question: what is the most interesting part of the plot?
I already have an algorithm to determine the range of x values where the function f has finite values.
I am using Javascript, but any language is ok.
I don't want to use existing libraries.
The function f is restricted to expressions that the user can write with the basic math operators + - * / ^ and functions exp log abs sqrt sin cos tan acos asin atan ceil floor.
Using the Google graphs, you can get some examples of automatic limits. Typing graph sin(x) works pretty well, but graph exp(x) and graph log(x) don't really give the best results. Also, graph sin(x*100)*exp(-x^2) does not choose the limits I would qualify as the most informative. But it would be good enough for me.
UPDATE:
I found that PlotRange in Mathematica does that automatically very well (see here). Is the source code available, or a reference explaining the algorithm? I could not find it anywhere.
UPDATE:
I started using an adaptative refinement algorithm to find informative plot ranges, inspired from this site. It is not working perfectly yet, but the current progress is implemented in my project here. You can try plotting a few functions and see how it works. When I have a fully working version I can post an answer.
I don't have a complete answer, but I might have some useful ideas to start with.
For me, interesting parts of a graph include:
All the roots of the function, except where there is an infinite number of roots (where we might be interested in no more than 8 of each).
All the roots of the first and second derivatives of the function, again except where there is an infinite number of roots.
The behaviour of the function around x = 0.
Locations of asymptotes, though I wouldn't want the graph to plot all the way to infinity.
To see the features of the graph, I'd like it to take up a "reasonable" amount of the rectangular graphing window. I think that might be achieved by aiming to have the integral of the absolute value of the function over the plotted range equal be in the range of, say, 20-80% of the graphing window.
Thus, a sketch of an heuristic for setting plot limits might be something like:
Find the range that includes all the roots of the function, its first and second derivatives, or (for functions with infinite numbers of roots) the (say) 8 roots closest to x=0.
If the range does not already include x=0, expand the range to include x=0.
Expand the x range by, say, 10% in each direction to ensure that all "interesting" points are well within the window.
Set the y range so that the integral of the absolute value of the function is (say) 30% of the area of the graphing window.

What do you think of this interest point detection algorithm?

I've been trying to come up with an interest point detection algorithm and this is what I came up with:
You go through the X and the Y axises 3n pixels at a time creating 3n x 3n squares.
For the the n x n square in the middle of the 3n x 3n square (let's call it square Z), the R, G, and B values are averaged and rounded to preset values to limit the number of colors, and that is the color that square will be treated as.
The same is done for the 8 surrounding n x n squares.
After that, the color of square Z is compared to the surrounding squares, if it matches x out of the 8 surrounding squares where x <= 3 or x => 5 then that is an interest point (a corner is detected).
And so on till all the image is covered.
The bigger n is, the faster the image will be scanned and the the less accurate the detection is, and vice versa.
This, supposedly, detects "literal corners", that is corners you can actually SEE on the image.
What do you think of this algorithm? Is it efficient? Can it be used on a live video stream (say from the camera) on a hand-held device?
I'm sorry to say that I don't think this is likely to be very good. Your algorithm looks a bit like a simplistic version of Moravec's algorithm, which is itself one of the simplest corner detection algorithms. The hardcoded limits you test against effectively make your edge test a stepped function, unlike an approach such as summed square differences. This will almost certainly give you discontinuities in your detection function (corners that don't match when they should have), for some values.
You also have the same problem as Moravec, namely that if the edge lies at an angle to the direction of neighbours being considered, then it won't be detected.
Developing algorithms is fun, and if this isn't a business-critical project, then by all means, carry on tinkering and experimenting (and don't be put off by my comments!). But the fact is, for almost any practical problem, a better algorithm for the task you want to solve almost certainly already exists. The real challenge is identifying how you can best model your problem in such a way that you can solve it using an existing, well-understood approach, designed by experts.
In particular, robust identification and analysis of edge-cases and worst-case runtimes is a tricky business; unless you are a professional algorist, you are likely to find the going difficult. But I certainly encourage you to discover this for yourself by trying. nlucaroni mentions some excellent questions to use as starting points for your analysis.
Why not try it and see if it works the way you expect? It sounds like it should. How does the performance compare with other methods? What is the complexity of the algorithm? Is it efficient compared to others? Where can it be improved? What kind of false-positives and false negatives are expected? Are they within reason based on the data I plan to use this on? What threshold should be used to compare surrounding squares? ....
this is stuff you should be doing, not us.
I would suggest you look at the SIFT algorithm. Its the defacto standard for points of interest in an image. Unfortunately, its also patented, because its so good.
If you are interested in a real time version of SIFT you can get it to run on a GPU, but its highly experimental at this point. Note if you are developing a commercial application you'd have to first purchase a license for using SIFT or get approval from David Lowe.

Efficient evaluation of hypergeometric functions

Does anyone have experience with algorithms for evaluating hypergeometric functions? I would be interested in general references, but I'll describe my particular problem in case someone has dealt with it.
My specific problem is evaluating a function of the form 3F2(a, b, 1; c, d; 1) where a, b, c, and d are all positive reals and c+d > a+b+1. There are many special cases that have a closed-form formula, but as far as I know there are no such formulas in general. The power series centered at zero converges at 1, but very slowly; the ratio of consecutive coefficients goes to 1 in the limit. Maybe something like Aitken acceleration would help?
I tested Aitken acceleration and it does not seem to help for this problem (nor does Richardson extrapolation). This probably means Pade approximation doesn't work either. I might have done something wrong though, so by all means try it for yourself.
I can think of two approaches.
One is to evaluate the series at some point such as z = 0.5 where convergence is rapid to get an initial value and then step forward to z = 1 by plugging the hypergeometric differential equation into an ODE solver. I don't know how well this works in practice; it might not, due to z = 1 being a singularity (if I recall correctly).
The second is to use the definition of 3F2 in terms of the Meijer G-function. The contour integral defining the Meijer G-function can be evaluated numerically by applying Gaussian or doubly-exponential quadrature to segments of the contour. This is not terribly efficient, but it should work, and it should scale to relatively high precision.
Is it correct that you want to sum a series where you know the ratio of successive terms and it is a rational function?
I think Gosper's algorithm and the rest of the tools for proving hypergeometric identities (and finding them) do exactly this, right? (See Wilf and Zielberger's A=B book online.)

Resources