Writing a vector sum in MATLAB - performance

Suppose I have a function phi(x1,x2)=k1*x1+k2*x2 which I have evaluated over a square grid with boundaries at -100 and 100 along both the x1 and x2 axes, with some step size, say h=0.1. Now I want to calculate this sum over the grid, and this is where I'm struggling:
What I was trying:
clear all
close all
D=1; h=0.1;
D1 = -100;
D2 = 100;
X = D1 : h : D2;
Y = D1 : h : D2;
[x1, x2] = meshgrid(X, Y);
phi = k1.*x1 + k2.*x2;
sys=@(m1,m2,X,Y) (k1*h*m1+k2*h*m2).*exp((-([X Y]-h*[m1 m2]).^2)./(h^2*D))
Matlab says error in ndgrid, any idea how I should code this?
MATLAB shows:
Error using repmat
Requested 10001x1001x2001x2001 (298649.5GB) array exceeds maximum array size preference. Creation of arrays greater
than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference
panel for more information.
Error in ndgrid (line 72)
varargout{i} = repmat(x,s);
Error in new_try1 (line 16)

Judging by your comments and your code, it appears as though you don't fully understand what the equation is asking you to compute.
To obtain the value M(x1,x2) at some given (x1,x2), you have to compute that sum over Z^2. Of course, using a numerical toolbox such as MATLAB, you could only ever hope to compute over some finite range of Z^2. In this case, since (x1,x2) covers the range [-100,100] x [-100,100], and h=0.1, it follows that mh covers the range [-1000, 1000] x [-1000, 1000]. Example: m = (-1000, -1000) gives you mh = (-100, -100), which is the bottom-left corner of your domain. So really, phi(mh) is just phi(x1,x2) evaluated on all of your discretised points.
As an aside, since you need to compute |x-hm|^2, you can treat x = x1 + i x2 as a complex number to make use of MATLAB's abs function. If you were strictly working with vectors, you would have to use norm, which is OK too, but a bit more verbose. Thus, for some given x=(x10, x20), you would compute x-hm over the entire discretised plane as (x10 - x1) + i (x20 - x2).
Finally, you can compute 1 term of M at a time:
D=1; h=0.1;
D1 = -100;
D2 = 100;
X = (D1 : h : D2); % X is in rows (dim 2)
Y = (D1 : h : D2)'; % Y is in columns (dim 1)
phi = k1*X + k2*Y;
M = zeros(length(Y), length(X));
for j = 1:length(X)
    for i = 1:length(Y)
        % treat (x - hm) as a complex number
        x_hm = (X(j)-X) + 1i*(Y(i)-Y); % this computes x-hm for all m
        M(i,j) = 1/(pi*D) * sum(sum(phi .* exp(-abs(x_hm).^2/(h^2*D)), 1), 2);
    end
end
By the way, this computation takes quite a long time. You can consider either increasing h, reducing D1 and D2, or changing all three of them.
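If you want to try a single point before committing to the full double loop, the same term can be sketched in plain Python. This is only an illustration under assumptions that are not in the code above: the function name `m_at`, the sample values of k1 and k2, and the `radius` cut-off are all made up here. The cut-off relies on the Gaussian factor being negligible more than a few grid steps away from x.

```python
import math

def m_at(x1, x2, k1, k2, h=0.1, D=1.0, radius=8):
    """Truncated sum (1/(pi*D)) * sum_m phi(h*m) * exp(-|x - h*m|^2 / (h^2*D)).

    Only lattice points within `radius` steps of x are visited (an assumption
    of this sketch, not part of the original loop).
    """
    x = complex(x1, x2)                      # treat x as x1 + i*x2
    c1, c2 = round(x1 / h), round(x2 / h)    # nearest lattice indices
    total = 0.0
    for m1 in range(c1 - radius, c1 + radius + 1):
        for m2 in range(c2 - radius, c2 + radius + 1):
            hm = complex(h * m1, h * m2)
            phi = k1 * hm.real + k2 * hm.imag
            total += phi * math.exp(-abs(x - hm) ** 2 / (h * h * D))
    return total / (math.pi * D)

# Hypothetical coefficients, chosen only for the demonstration.
print(m_at(0.37, -1.2, k1=2.0, k2=3.0))
```

With D = 1 and a linear phi, the result should land close to phi(x1, x2) itself, which gives a cheap sanity check on the indexing.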


Optimizing algorithm calculating (sin(x)-x)*x^{-3} (in matlab)

My task is to write optimal program that calculates matrix Y, given matrix X, where:
y = (sin(x)-x)*x^{-3}
Here's the code I have written so far:
n = size(X, 1);
m = size(X, 2);
Y = zeros(n, m);
d = n*m;
for i = 1:d
    x = X(i);
    if abs(x)<0.1
        Y(i) = -1/6+x.^2/120-x.^4/5040+x.^6/362880;
    else
        Y(i) = (sin(x)-x).*(x.^(-3));
    end
end
So, generally the formula was inaccurate around 0, so I have approximated it using Taylor theorem.
Unfortunately this program has accuracy of 91% and efficiency of only 24% (so it's 4 times slower than the optimal solution).
The tests use around 13 million samples, of which around 6 million have an absolute value of less than 0.1. The range of samples is (-8π , 8π).
The target accuracy (100%) is 4*epsilon, where epsilon equals 2^(-52) (that is, the numbers calculated by the program should differ from the numbers calculated "perfectly" by no more than 4*epsilon).
100*epsilon means accuracy of 86%.
Do you have any ideas on how to make it faster and more accurate? I'm looking both for mathematical tricks on how to further transform given formula, and general MATLAB tips that can accelerate programs?
Using Horner method, I have managed to bring up efficiency up to 81% (accuracy still 91%) with this program:
function Y = main(X)
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = horner(X(i));
function y = horner (x)
pow = x.*x;
y = -1/6+pow.*(1/120+pow.*(-1/5040+pow./362880));
Do you have any further ideas on how to improve it?
Program seems to work fine for a great range of input:
x = linspace(-8*pi,8*pi,13e6); % 13 million samples in the desired range
y = (sin(x)-x)./x.^3;
Due to round-off errors, you may have problems calculating it for very small values of x:
x = 0
y = (sin(x)-x)./x.^3
y =
   NaN
You already have the Taylor series expansion of the function around 0. As the Taylor expansion does not include a division by x, you can expect a better behaviour of the Taylor function around this region:
x = -1e-6:1e-9:1e-6;
y = (sin(x)-x)./x.^3;
y_taylor = -1/6 + x.^2/120 - x.^4/5040 + x.^6/362880;
plot(x,y,x,y_taylor); legend('y','taylor expansion','location','best')
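The same comparison can be sketched without plotting. The snippet below is plain Python rather than MATLAB, and the function names `direct` and `taylor` are made up for the illustration. It evaluates the truncated series in the nested (Horner) form from the question.

```python
import math

def direct(x):
    # The formula as given; it is undefined at x == 0.
    return (math.sin(x) - x) / x**3

def taylor(x):
    # Truncated Taylor series around 0, evaluated in nested (Horner) form.
    p = x * x
    return -1/6 + p * (1/120 + p * (-1/5040 + p / 362880))

for x in (1e-1, 1e-3, 1e-6, 1e-8):
    print(f"{x:8.0e}  direct={direct(x): .17f}  taylor={taylor(x): .17f}")
```

For moderately small x the two columns should agree closely. As x shrinks, the subtraction sin(x)-x in the direct form loses more and more significant digits, while the series stays near -1/6.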
You can replace your loop with vectorized code. This is usually more efficient than a loop, because the loop has a conditional in it, which is bad for branch prediction:
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = -1/6+X(i).^2/120-X(i).^4/5040+X(i).^6/362880;
Rewriting the primary equation to avoid the cubic power (X.^(-3)) yields a 3x speedup for that computation:
Y = (sin(X)./X - 1) ./ (X.*X);
Speed comparison:
The following script compares timing for this method compared to OP's loop code. I use data that has 7 million values uniformly distributed in (-8π, 8π), and another 6 million values uniformly distributed in (-0.1,0.1).
OP's loop code takes 2.4412 s, and the vectorized solution takes 0.7224 s. Using OP's Horner method and the rewritten sin expression it takes 0.1437 s.
X = [linspace(-8*pi,8*pi,7e6), linspace(-0.1,0.1,6e6)];
function Y = method1(X)
n = size(X, 1);
m = size(X, 2);
Y = zeros(n, m);
d = n*m;
for i = 1:d
    x = X(i);
    if abs(x)<0.1
        Y(i) = -1/6+x.^2/120-x.^4/5040+x.^6/362880;
    else
        Y(i) = (sin(x)-x).*(x.^(-3));
    end
end
function Y = method2(X)
Y = (sin(X)-X).*(X.^(-3));
i = abs(X) < 0.1;
Y(i) = -1/6+X(i).^2/120-X(i).^4/5040+X(i).^6/362880;
function Y = method3(X)
Y = (sin(X)./X - 1) ./ (X.*X);
i = abs(X) < 0.1;
Y(i) = horner(X(i));
function y = horner (x)
pow = x.*x;
y = -1/6+pow.*(1/120+pow.*(-1/5040+pow./362880));

How do I implement cross-correlation to prove two images of the same scene are similar? [duplicate]

How can I select a random point on one image, then find its corresponding point on another image using cross-correlation?
So basically I have image1, I want to select a point on it (automatically) then find its corresponding/similar point on image2.
Here are some example images:
Full image:
Result of cross correlation:
Well, xcorr2 can essentially be seen as analyzing all possible shifts in both positive and negative direction and giving a measure for how well they fit with each shift. Therefore for images of size N x N the result must have size (2*N-1) x (2*N-1), where the correlation at index [N, N] would be maximal if the two images were equal or not shifted. If they were shifted by 10 pixels, the maximum correlation would be at [N-10, N] and so on. Therefore you will need to subtract N to get the absolute shift.
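To make the index bookkeeping concrete, here is a small pure-Python sketch. It is not MATLAB's xcorr2: the helper names `cross_correlate` and `recover_shift`, the tiny test images, and the sign convention for the shift are all assumptions of this illustration. It tries every shift, stores the score in a (2*N-1) x (2*N-1) table, and subtracts the centre index to recover the shift.

```python
def cross_correlate(a, b):
    """Raw (un-normalised) cross-correlation of two N x N lists of lists.

    Entry [di + N - 1][dj + N - 1] holds the score obtained when a is
    shifted by (di, dj) pixels before being compared with b.
    """
    n = len(a)
    out = [[0] * (2 * n - 1) for _ in range(2 * n - 1)]
    for di in range(-(n - 1), n):
        for dj in range(-(n - 1), n):
            score = 0
            for r in range(n):
                for c in range(n):
                    rr, cc = r + di, c + dj
                    if 0 <= rr < n and 0 <= cc < n:
                        score += a[r][c] * b[rr][cc]
            out[di + n - 1][dj + n - 1] = score
    return out

def recover_shift(a, b):
    """Locate the correlation peak and convert its index into a shift."""
    n = len(a)
    corr = cross_correlate(a, b)
    _, i, j = max((corr[i][j], i, j)
                  for i in range(2 * n - 1) for j in range(2 * n - 1))
    return i - (n - 1), j - (n - 1)   # subtract the centre index

N = 6
a = [[0] * N for _ in range(N)]
b = [[0] * N for _ in range(N)]
pattern = [[1, 2], [3, 4]]
for r in range(2):
    for c in range(2):
        a[1 + r][1 + c] = pattern[r][c]
        b[3 + r][2 + c] = pattern[r][c]   # same pattern, moved by (2, 1)

print(recover_shift(a, b))   # (2, 1)
```

The centre index here is N-1 because Python counts from zero. In MATLAB's one-based indexing the same centre is [N, N]. Real images usually need the normalised variant (normxcorr2), since raw correlation is biased towards bright regions.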
With your actual code it would probably be easier to help. But let's look at an example:
(A) We read an image and select two different sub-images with offsets da and db
Orig = imread('rice.png');
N = 200; range = 1:N;
da = [0 20];
db = [30 30];
A=Orig(da(1) + range, da(2) + range);
B=Orig(db(1) + range, db(2) + range);
(B) Calculate cross-correlation and find maximum
X = normxcorr2(A, B);
m = max(X(:));
[i,j] = find(X == m);
(C) Patch them together using recovered shift
R = zeros(2*N, 2*N);
R(N + range, N + range) = B;
R(i + range, j + range) = A;
(D) Illustrate things
subplot(2,2,1), imagesc(A)
subplot(2,2,2), imagesc(B)
subplot(2,2,3), imagesc(X)
rectangle('Position', [j-1 i-1 2 2]), line([N j], [N i])
subplot(2,2,4), imagesc(R);
(E) Compare intentional shift with recovered shift
delta_orig = da - db
%--> [-30 -10]
delta_recovered = [i - N, j - N]
%--> [-30 -10]
As you see in (E) we get exactly the shift we intentionally introduced in (A).
Or adjusted to your case:
S_full = size(full);
S_temp = size(template);
X=normxcorr2(template, full);
[i,j] = find(X == max(X(:))); % location of the correlation peak
figure, colormap gray
subplot(2,2,1), title('full'), imagesc(full)
subplot(2,2,2), title('template'), imagesc(template),
subplot(2,2,3), imagesc(X), rectangle('Position', [j-20 i-20 40 40])
R = zeros(S_temp);
shift_a = [0 0];
shift_b = [i j] - S_temp;
R((1:S_full(1))+shift_a(1), (1:S_full(2))+shift_a(2)) = full;
R((1:S_temp(1))+shift_b(1), (1:S_temp(2))+shift_b(2)) = template;
subplot(2,2,4), imagesc(R);
However, for this method to work properly the patch (template) and the full image should be scaled to the same resolution.
A more detailed example can also be found here.

Fastest way to sort vectors by angle without actually computing that angle

Many algorithms (e.g. Graham scan) require points or vectors to be sorted by their angle (perhaps as seen from some other point, i.e. using difference vectors). This order is inherently cyclic, and where this cycle is broken to compute linear values often doesn't matter that much. But the real angle value doesn't matter much either, as long as cyclic order is maintained. So doing an atan2 call for every point might be wasteful. What faster methods are there to compute a value which is strictly monotonic in the angle, the way atan2 is? Such functions apparently have been called “pseudoangle” by some.
I started to play around with this and realised that the spec is kind of incomplete. atan2 has a discontinuity, because as dx and dy are varied, there's a point where atan2 will jump between -pi and +pi. The graph below shows the two formulas suggested by @MvG, and in fact they both have the discontinuity in a different place compared to atan2. (NB: I added 3 to the first formula and 4 to the alternative so that the lines don't overlap on the graph). If I added atan2 to that graph then it would be the straight line y=x. So it seems to me that there could be various answers, depending on where one wants to put the discontinuity. If one really wants to replicate atan2, the answer (in this genre) would be
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [-2 .. 2] which is monotonic
# in the angle this vector makes against the x axis.
# and with the same discontinuity as atan2
def pseudoangle(dx, dy):
    p = dx/(abs(dx)+abs(dy)) # -1 .. 1 increasing with x
    if dy < 0: return p - 1  # -2 .. 0 increasing with x
    else:      return 1 - p  #  0 .. 2 decreasing with x
This means that if the language that you're using has a sign function, you could avoid branching by returning sign(dy)*(1-p), which has the effect of putting an answer of 0 at the discontinuity between returning -2 and +2. And the same trick would work with @MvG's original methodology: one could return sign(dx)*(p-1).
Update: In a comment below, @MvG suggests a one-line C implementation of this, namely
pseudoangle = copysign(1. - dx/(fabs(dx)+fabs(dy)),dy)
@MvG says it works well, and it looks good to me :-).
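For what it's worth, the claim that the copysign one-liner matches the branching version is easy to check. Here is a small Python sketch: the names `pseudoangle_branch` and `pseudoangle_copysign` and the sample vectors are made up for this check, and `math.copysign` stands in for the C function.

```python
import math

def pseudoangle_branch(dx, dy):
    p = dx / (abs(dx) + abs(dy))   # -1 .. 1, increasing with x
    if dy < 0:
        return p - 1               # -2 .. 0
    return 1 - p                   #  0 .. 2

def pseudoangle_copysign(dx, dy):
    # Same magnitude as the branch version, with the sign taken from dy.
    return math.copysign(1.0 - dx / (abs(dx) + abs(dy)), dy)

samples = [(1.0, 0.0), (1.0, 2.0), (0.0, 3.0), (-2.0, 1.0),
           (-1.0, 0.0), (-1.0, -4.0), (0.0, -1.0), (3.0, -0.5)]
agree = all(pseudoangle_branch(dx, dy) == pseudoangle_copysign(dx, dy)
            for dx, dy in samples)
print(agree)
```

The two versions can differ when dy is a negative zero, because copysign looks at the sign bit while `dy < 0` does not. Neither version guards against dx == dy == 0.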
I know one possible such function, which I will describe here.
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [-1 .. 3] (or [0 .. 4] with the comment enabled)
# which is monotonic in the angle this vector makes against the x axis.
def pseudoangle(dx, dy):
    ax = abs(dx)
    ay = abs(dy)
    p = dy/(ax+ay)
    if dx < 0: p = 2 - p
    # elif dy < 0: p = 4 + p
    return p
So why does this work? One thing to note is that scaling all input lengths will not affect the output. So the length of the vector (dx, dy) is irrelevant, only its direction matters. Concentrating on the first quadrant, we may for the moment assume dx == 1. Then dy/(1+dy) grows monotonically from zero for dy == 0 to one for infinite dy (i.e. for dx == 0). Now the other quadrants have to be handled as well. If dy is negative, then so is the initial p. So for positive dx we already have a range -1 <= p <= 1 monotonic in the angle. For dx < 0 we change the sign and add two. That gives a range 1 <= p <= 3 for dx < 0, and a range of -1 <= p <= 3 on the whole. If negative numbers are for some reason undesirable, the elif comment line can be included, which will shift the 4th quadrant from -1…0 to 3…4.
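The monotonicity argument can also be checked numerically. The sketch below walks once around the circle, starting just past the point where this particular function jumps (the negative y axis), and verifies that the values only ever increase. The sample count and the starting angle are choices made for this check, not part of the function.

```python
import math

def pseudoangle(dx, dy):
    p = dy / (abs(dx) + abs(dy))
    if dx < 0:
        p = 2 - p
    return p

n = 360
# Angles from just above -pi/2 to just below 3*pi/2, in increasing order.
angles = [-math.pi / 2 + 2 * math.pi * (k + 0.5) / n for k in range(n)]
values = [pseudoangle(math.cos(t), math.sin(t)) for t in angles]

is_monotonic = all(a < b for a, b in zip(values, values[1:]))
print(is_monotonic, min(values), max(values))
```

The printed minimum and maximum should stay inside the stated range [-1 .. 3].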
I don't know if the above function has an established name, or who might have published it first. I got it quite a while ago and have copied it from one project to the next. I have however found occurrences of this on the web, so I'd consider this snippet public enough for re-use.
There is a way to obtain the range [0 … 4] (for real angles [0 … 2π]) without introducing a further case distinction:
# Input: dx, dy: coordinates of a (difference) vector.
# Output: a number from the range [0 .. 4] which is monotonic
# in the angle this vector makes against the x axis.
def pseudoangle(dx, dy):
    p = dx/(abs(dx)+abs(dy)) # -1 .. 1 increasing with x
    if dy < 0: return 3 + p  #  2 .. 4 increasing with x
    else:      return 1 - p  #  0 .. 2 decreasing with x
I kinda like trigonometry, so I know the best way of mapping an angle to some values we usually have is a tangent. Of course, if we want a finite number in order to not have the hassle of comparing {sign(x),y/x}, it gets a bit more confusing.
But there is a function that maps [1,+inf[ to [1,0[ known as inverse, that will allow us to have a finite range to which we will map angles. The inverse of the tangent is the well known cotangent, thus x/y (yes, it's as simple as that).
A little illustration, showing the values of tangent and cotangent on a unit circle :
You see the values are the same when |x| = |y|, and you can also see that if we color the parts that output a value in [-1,1] on both circles, we manage to color a full circle. To make this mapping of values continuous and monotonic, we can do the following:
use the opposite of the cotangent to have the same monotony as tangent
add 2 to -cotan, to have the values coincide where tan=1
add 4 to one half of the circle (say, below the x=-y diagonal) to have values fit on the one of the discontinuities.
That gives the following piecewise function, which is a monotonic function of the angle and continuous apart from a single discontinuity (located at the minimum):
double pseudoangle(double dx, double dy)
{
    // 1 for above, 0 for below the diagonal/anti-diagonal
    int diag = dx > dy;
    int adiag = dx > -dy;
    double r = !adiag ? 4 : 0;
    if (dy == 0)
        return r;
    if (diag ^ adiag)
        r += 2 - dx / dy;
    else
        r += dy / dx;
    return r;
}
Note that this is very close to Fowler angles, with the same properties. Formally, (pseudoangle(dx,dy) + 1) % 8 == Fowler(dx,dy)
To talk performance, it's much less branchy than Fowler's code (and generally less complicated imo). Compiled with -O3 on gcc 6.1.1, the above function generates an assembly code with 4 branches, where two of them come from dy == 0 (one checking if the both operands are "unordered", thus if dy was NaN, and the other checking if they are equal).
I would argue this version is more precise than others, since it only uses mantissa-preserving operations until shifting the result to the right interval. This should be especially visible when |x| << |y| or |x| >> |y|, since the operation |x| + |y| then loses quite some precision.
As you can see on the graph the angle-pseudoangle relation is also nicely close to linear.
Looking where branches come from, we can make the following remarks:
My code doesn't rely on abs nor copysign, which makes it look more self-contained. However playing with sign bits on floating point values is actually rather trivial, since it's just flipping a separate bit (no branch!), so this is more of a disadvantage.
Furthermore other solutions proposed here do not check whether abs(dx) + abs(dy) == 0 before dividing by it, but this version would fail as soon as only one component (dy) is 0 -- so that throws in a branch (or 2 in my case).
If we choose to get roughly the same result (up to rounding errors) but without branches, we could abuse copysign and write:
double pseudoangle(double dx, double dy)
{
    double s = dx + dy;
    double d = dx - dy;
    double r = 2 * (1.0 - copysign(1.0, s));
    double xor_sign = copysign(1.0, d) * copysign(1.0, s);
    r += (1.0 - xor_sign);
    r += (s - xor_sign * d) / (d + xor_sign * s);
    return r;
}
Bigger errors may happen than with the previous implementation, due to cancellation in either d or s if dx and dy are close in absolute value. There is no check for division by zero to be comparable with the other implementations presented, and because this only happens when both dx and dy are 0.
If you can feed the original vectors instead of angles into a comparison function when sorting, you can make it work with:
Just a single branch.
Only floating point comparisons and multiplications.
Avoiding addition and subtraction makes it numerically much more robust. A double can actually always exactly represent the product of two floats, but not necessarily their sum. This means for single precision input you can guarantee a perfect flawless result with little effort.
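The exactness claim can be illustrated with a short Python sketch. Python floats are doubles, and single precision is emulated here by a round trip through `struct`. The helper name `to_float32` and the sample values are assumptions made for the demonstration.

```python
import random
import struct
from fractions import Fraction

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

random.seed(0)
pairs = [(to_float32(random.uniform(-1e6, 1e6)),
          to_float32(random.uniform(-1e6, 1e6))) for _ in range(1000)]

# Fraction(x) converts a float exactly, so these comparisons are exact.
products_exact = all(Fraction(a * b) == Fraction(a) * Fraction(b)
                     for a, b in pairs)

# A sum of two singles with very different magnitudes does not fit in a double.
big, small = to_float32(1e20), to_float32(1.0)
sum_exact = Fraction(big + small) == Fraction(big) + Fraction(small)

print(products_exact, sum_exact)   # True False
```

A single has a 24-bit significand, so a product needs at most 48 bits, which fits into the 53 bits of a double. A sum can need far more bits when the exponents are far apart.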
This is basically Cimbali's solution repeated for both vectors, with branches eliminated and divisions multiplied away. It returns an integer, with sign matching the comparison result (positive, negative or zero):
signed int compare(double x1, double y1, double x2, double y2) {
    unsigned int d1 = x1 > y1;
    unsigned int d2 = x2 > y2;
    unsigned int a1 = x1 > -y1;
    unsigned int a2 = x2 > -y2;
    // Quotients of both angles.
    unsigned int qa = d1 * 2 + a1;
    unsigned int qb = d2 * 2 + a2;
    if(qa != qb) return((0x6c >> qa * 2 & 6) - (0x6c >> qb * 2 & 6));
    d1 ^= a1;
    double p = x1 * y2;
    double q = x2 * y1;
    // Numerator of each remainder, multiplied by denominator of the other.
    double na = q * (1 - d1) - p * d1;
    double nb = p * (1 - d1) - q * d1;
    // Return signum(na - nb)
    return((na > nb) - (na < nb));
}
The simplest thing I came up with is making normalized copies of the points and splitting the circle around them in half along the x or y axis. Then use the opposite axis as a linear value between the beginning and end of the top or bottom buffer (one buffer will need to be in reverse linear order when putting it in.) Then you can read the first then second buffer linearly and it will be clockwise, or second and first in reverse for counter clockwise.
That might not be a good explanation so I put some code up on GitHub that uses this method to sort points with an epsilon value to size the arrays.
This might not be good for your use case because it's built for performance in graphics effects rendering, but it's fast and simple (O(N) complexity). If you're working with really small changes in points or very large (hundreds of thousands) data sets then this won't work because the memory usage might outweigh the performance benefits.
Nice. Here is a variant that returns -Pi .. Pi like many arctan2 functions.
Edit note: changed my pseudocode to proper Python. The argument order is changed for compatibility with atan2() from Python's math module. Edit 2: added more code to catch the case dx=0.
def cmp(a, b):
    return (a > b) - (a < b) # Python 3 removed the builtin cmp

def pseudoangle( dy , dx ):
    """ returns approximation to math.atan2(dy,dx)*2/pi"""
    if dx == 0 :
        s = cmp(dy,0)
    else:
        s = cmp(dx*dy,0) # cmp == "sign" in many other languages.
    if s == 0 : return 0 # doesnt hurt performance much.but can omit if 0,0 never happens
    p = dy/(dx+s*dy)
    if dx < 0: return p-2*s
    return p
In this form the max error is only ~0.07 radian for all angles.
(of course leave out the Pi/2 if you don't care about the magnitude.)
Now for the bad news: on my system, using Python's math.atan2 is about 25% faster.
Obviously, replacing it with simple interpreted code doesn't beat a compiled intrinsic.
If angles are not needed by themselves, but only for sorting, then @jjrv's approach is the best one. Here is a comparison in Julia
using StableRNGs
using BenchmarkTools
# Definitions
struct V{T}
    x::T
    y::T
end
function pseudoangle(v)
    copysign(1. - v.x/(abs(v.x)+abs(v.y)), v.y)
end
function isangleless(v1, v2)
    a1 = abs(v1.x) + abs(v1.y)
    a2 = abs(v2.x) + abs(v2.y)
    a2*copysign(a1 - v1.x, v1.y) < a1*copysign(a2 - v2.x, v2.y)
end
# Data
rng = StableRNG(2021)
vectors = map(x -> V(x...), zip(rand(rng, 1000), rand(rng, 1000)))
# Comparison
res1 = sort(vectors, by = x -> pseudoangle(x));
res2 = sort(vectors, lt = (x, y) -> isangleless(x, y));
@assert res1 == res2
@btime sort($vectors, by = x -> pseudoangle(x));
# 110.437 μs (3 allocations: 23.70 KiB)
@btime sort($vectors, lt = (x, y) -> isangleless(x, y));
# 65.703 μs (3 allocations: 23.70 KiB)
So, by avoiding division, time is almost halved without losing result quality. Of course, for more precise calculations, isangleless should be equipped with bigfloat from time to time, but the same can be said of pseudoangle.
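For readers who do not use Julia, the division-free comparison can be sketched in Python as well. The names `is_angle_less` and `compare` and the sample vectors are made up for this illustration, and `functools.cmp_to_key` is used to plug the comparison into `sorted`. Timings will of course not carry over from the Julia benchmark.

```python
import math
from functools import cmp_to_key

def pseudoangle(x, y):
    return math.copysign(1.0 - x / (abs(x) + abs(y)), y)

def is_angle_less(v1, v2):
    (x1, y1), (x2, y2) = v1, v2
    a1 = abs(x1) + abs(y1)
    a2 = abs(x2) + abs(y2)
    # Both pseudoangles multiplied through by a1 * a2 > 0: no division needed.
    return a2 * math.copysign(a1 - x1, y1) < a1 * math.copysign(a2 - x2, y2)

def compare(v1, v2):
    # Negative if v1 sorts first, positive if v2 sorts first, zero on a tie.
    return is_angle_less(v2, v1) - is_angle_less(v1, v2)

vectors = [(3.0, 1.0), (-2.0, 5.0), (1.0, -4.0),
           (-1.0, -1.0), (0.5, 2.0), (-3.0, 0.5)]
by_key = sorted(vectors, key=lambda v: pseudoangle(*v))
by_cmp = sorted(vectors, key=cmp_to_key(compare))
print(by_key == by_cmp)
```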
Just use a cross-product function. The direction you rotate one segment relative to the other will give either a positive or negative number. No trig functions and no division. Fast and simple. Just Google it.
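A minimal Python sketch of that idea follows. One caveat that is an addition of this sketch rather than part of the suggestion above: the sign of the cross product only orders two vectors that lie within the same half-plane, so a half-plane test comes first. The helper names `half` and `compare_by_angle` are made up here.

```python
from functools import cmp_to_key

def half(v):
    x, y = v
    # 0 for angles in [0, pi), 1 for angles in [pi, 2*pi).
    return 0 if (y > 0 or (y == 0 and x > 0)) else 1

def compare_by_angle(a, b):
    ha, hb = half(a), half(b)
    if ha != hb:
        return ha - hb
    cross = a[0] * b[1] - a[1] * b[0]
    # cross > 0 means b is counter-clockwise from a, so a sorts first.
    return (cross < 0) - (cross > 0)

vectors = [(0, -1), (1, 1), (-1, 0), (1, -1), (-1, 1), (1, 0), (-1, -1), (0, 1)]
ordered = sorted(vectors, key=cmp_to_key(compare_by_angle))
print(ordered)
```

This sorts counter-clockwise starting from the positive x axis. With integer coordinates the comparison is exact, and no division or trigonometric function is involved.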

Improving performance of interpolation (Barycentric formula)

I have been given an assignment in which I am supposed to write an algorithm which performs polynomial interpolation by the barycentric formula. The formula states that:
p(x) = (SIGMA_(j=0 to n) w(j)*f(j)/(x - x(j)))/(SIGMA_(j=0 to n) w(j)/(x - x(j)))
I have written an algorithm which works just fine, and I get the polynomial output I desire. However, this requires the use of some quite long loops, and for a large number of grid points, lots of nasty loop operations will have to be done. Thus, I would appreciate it greatly if anyone has any hints as to how I may improve this, so that I can avoid all these loops.
In the algorithm, x and f stand for the given points we are supposed to interpolate. w stands for the barycentric weights, which have been calculated before running the algorithm. And grid is the linspace over which the interpolation should take place:
function p = barycentric_formula(x,f,w,grid)
%Assert x-vectors and f-vectors have same length.
if length(x) ~= length(f)
    sprintf('Not equal amounts of x- and y-values. Function is terminated.')
    return;
end
n = length(x);
m = length(grid);
p = zeros(1,m);
% Loops for finding polynomial values at grid points. All values are
% calculated by the barycentric formula.
for i = 1:m
    var = 0;
    sum1 = 0;
    sum2 = 0;
    for j = 1:n
        if grid(i) == x(j)
            p(i) = f(j);
            var = 1;
        else
            sum1 = sum1 + (w(j)*f(j))/(grid(i) - x(j));
            sum2 = sum2 + (w(j)/(grid(i) - x(j)));
        end
    end
    if var == 0
        p(i) = sum1/sum2;
    end
end
end
This is a classical case for MATLAB 'vectorization'. I would say: just remove the loops. It is almost that simple. First, have a look at this code:
function p = bf2(x, f, w, grid)
m = length(grid);
p = zeros(1,m);
for i = 1:m
    var = grid(i)==x;
    if any(var)
        p(i) = f(var);
    else
        sum1 = sum((w.*f)./(grid(i) - x));
        sum2 = sum(w./(grid(i) - x));
        p(i) = sum1/sum2;
    end
end
I have removed the inner loop over j. All I did here was in fact removing the (j) indexing and changing the arithmetic operators from / to ./ and from * to .* - the same, but with a dot in front to signify that the operation is performed on element by element basis. This is called array operators in contrast to ordinary matrix operators. Also note that treating the special case where the grid points fall onto x is very similar to what you had in the original implementation, only using a vector var such that x(var)==grid(i).
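For comparison, the same one-loop structure can be sketched in plain Python, where the inner loop collapses into a `sum` over a generator expression. The weights helper is an assumption of this sketch: the question says the weights are computed beforehand, and the standard definition w(j) = 1/prod_{k~=j}(x(j)-x(k)) is used here so that the example is self-contained.

```python
def barycentric_weights(x):
    """w[j] = 1 / prod_{k != j} (x[j] - x[k]) (assumed definition, see above)."""
    w = []
    for j, xj in enumerate(x):
        prod = 1.0
        for k, xk in enumerate(x):
            if k != j:
                prod *= xj - xk
        w.append(1.0 / prod)
    return w

def barycentric_formula(x, f, w, grid):
    p = []
    for g in grid:
        if g in x:                       # grid point coincides with a node
            p.append(f[x.index(g)])
            continue
        num = sum(wj * fj / (g - xj) for xj, fj, wj in zip(x, f, w))
        den = sum(wj / (g - xj) for xj, wj in zip(x, w))
        p.append(num / den)
    return p

# Interpolating f(x) = x^2 through three nodes should reproduce x^2.
x = [0.0, 1.0, 2.0]
f = [xi ** 2 for xi in x]
w = barycentric_weights(x)
p = barycentric_formula(x, f, w, [0.5, 1.0, 1.5])
print(p)
```

Since three nodes determine a quadratic uniquely, the printed values should match 0.25, 1.0 and 2.25 up to rounding.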
Now, you can also remove the outermost loop. This is a bit more tricky and there are two major approaches how you can do that in MATLAB. I will do it the simpler way, which can be less efficient, but more clear to read - using repmat:
function p = bf3(x, f, w, grid)
% Find grid points that coincide with x.
% The below compares all grid values with all x values
% and returns a matrix of 0/1. 1 is in the (row,col)
% for which grid(row)==x(col)
var = bsxfun(@eq, grid', x);
% find the logical indexes of those x entries
varx = sum(var, 1)~=0;
% and of those grid entries
varp = sum(var, 2)~=0;
% Outer-most loop removal - use repmat to
% replicate the vectors into matrices.
% Thus, instead of having a loop over j
% you have matrices of values that would be
% referenced in the loop
ww = repmat(w, numel(grid), 1);
ff = repmat(f, numel(grid), 1);
xx = repmat(x, numel(grid), 1);
gg = repmat(grid', 1, numel(x));
% perform the calculations element-wise on the matrices
sum1 = sum((ww.*ff)./(gg - xx),2);
sum2 = sum(ww./(gg - xx),2);
p = sum1./sum2;
% fix the case where grid==x and return
p(varp) = f(varx);
The fully vectorized version can be implemented with bsxfun rather than repmat. This can potentially be a bit faster, since the matrices are not explicitly formed. However, the speed difference may not be large for small system sizes.
Also, the first solution with one loop is also not too bad performance-wise. I suggest you test those and see, what is better. Maybe it is not worth it to fully vectorize? The first code looks a bit more readable..
