Is there an established method to weigh a weighted mean? - algorithm

Problem. The weighted mean/average can be used to give differing weight in a mean computation to elements of differing importance. I need to figure out an extension that would in turn 'scale' or 'weigh' the resulting weighted mean with regard to zero, depending on the actual (non-normalized) values of the weights:
* if the weights are low, the scaled weighted mean should be close to 0;
* if at least some weights are close to the maximum weight, then the scaled weighted mean should be more or less equivalent to the simple weighted mean.
Rationale and details. I need such an extension in order to produce a more sensible mean value in a case where:
* the weights are proximity/similarity scores (in the interval (0,1)) of the elements (let's call them neighbors for simplicity) of a target element, in some space, and
* the values on the neighbors (being averaged) reflect a change in some quality of theirs (which is assumed to have an effect on the target, if they are close enough);
* elements that are further away should have less weight, so using a weighted mean seems reasonable - but in some cases all the neighbors are far away, and then they presumably should have little to no effect on the target (so their mean should reflect this and be closer to zero).
Reproducible example. This requirement is not met when using a simple weighted mean:
# Using R for example code (answer doesn't have to use R)
weighted.mean = function(x, w){
  return( sum(x*w)/sum(w) ) # standard way to calculate weighted mean
}
# Example data:
weights1 = c(0.9, 0.1, 0.01) # proximity of neighbors to target
weights2 = c(0.1, 0.1, 0.01) # proximity of neighbors to some other target
values = c(1,2,10) # values on these neighbors
mean(values)
# 4.333333 # not useful, doesn't take into account distance of elements at all
weighted.mean(values, weights1)
# 1.188119 # useful result, reflects distance/weight!
weighted.mean(values, weights2)
# 1.904762 # not useful result - none of them should have any effect, being all distant; the mean should be close to 0 (no effect) instead
What I've tried so far (1) Removing the normalizing sum(weights) business and just taking the mean of values*weights:
weighted.mean2 = function(x, w){
  return( mean(x*w) )
}
weighted.mean2(values, weights1)
# 0.4 # lower value, but should be viewed relatively in comparison now
weighted.mean2(values, weights2)
# 0.1333333 # makes more sense, low proximity leads to low(er) mean value
What I've tried so far (2) Calling weighted mean on 0 and the weighted mean, with the new weights for this vector of length two being 1 (max proximity/identity) and the proximity of the closest neighbor as a scale; the reasoning being that if the target has no close neighbors, then the effect in question should be about 0:
weighted.mean3 = function(x, w){
  tmp = weighted.mean(x, w)
  maxw = max(w)
  return( weighted.mean( c(0, tmp), c(1, maxw) ) )
}
weighted.mean3(values, weights1)
# 0.5627931
weighted.mean3(values, weights2)
# 0.1731602 # also makes sense, low proximity leads to low(er) mean value
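Incidentally, both attempts reduce to a plain rescaling of the standard weighted mean: attempt (1) multiplies it by the average weight, and attempt (2) multiplies it by max(w)/(1+max(w)). A quick sanity check (in Python, since the answer doesn't have to use R; the helper name wmean is made up):
# Both attempts are rescalings of the standard weighted mean (same data as above).
def wmean(x, w):
    return sum(v * u for v, u in zip(x, w)) / sum(w)

values   = [1, 2, 10]
weights1 = [0.9, 0.1, 0.01]
wm = wmean(values, weights1)

# Attempt (1): mean(x*w) equals the weighted mean scaled by the average weight.
attempt1 = sum(v * u for v, u in zip(values, weights1)) / len(values)
assert abs(attempt1 - wm * sum(weights1) / len(weights1)) < 1e-12

# Attempt (2): pulling toward 0 with weight 1 equals scaling by max(w)/(1+max(w)).
attempt2 = wmean([0, wm], [1, max(weights1)])
assert abs(attempt2 - wm * max(weights1) / (1 + max(weights1))) < 1e-12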
Both approaches seem to yield a smaller value for a target with distant neighbors, and a comparatively higher value for a target with closer neighbors. However, this feels rather hacky to me, and I'm not sure if there might be cases where either approach might fail - surely there must be a more principled/established algorithm out there to do something like this (perhaps it's not called 'mean' or 'average', though; also, if one of my attempts is equivalent to such a method, then the answer could just confirm that). Long story short:
Is there an established/published method to weigh/scale a weighted mean in the way I've described above?
Note on the previous version of this question: it was initially flagged as too broad, so I rewrote it and applied to reopen, but it was auto-closed as abandoned; so I wrote this new question, which now has a clear yes-or-no answer (rationale and/or references beyond a simple yes/no are of course appreciated).

Related

What is a greedy algorithm for this problem that is minimally optimal + proof?

The details are a bit cringe, fair warning lol:
I want to set up meters on the floor of my building to catch someone; assume my floor is a number line from 0 to length L. The specific type of meter I am designing has a radius of detection of 4.7 meters in the -x and +x directions (a detection diameter of 9.4 meters). I want to set them up in such a way that if the person I am trying to find steps foot anywhere on the floor, I will know. However, I can't just set up a meter anywhere (it may annoy other residents); therefore, there are only n valid locations where I can set up a meter. Additionally, these meters are expensive and time consuming to make, so I would like to use as few as possible.
For simplicity, you can assume the meter has 0 width and that each valid location is just a point on the aforementioned number line. What is a greedy algorithm that places as few meters as possible while being able to detect the entire hallway of length L like I want it to, or, if detecting the entire hallway is not possible, will output false for the set of n locations I have (and, if it isn't able to detect the whole hallway, still uses as few meters as possible while attempting to do so)?
Edit: some clarification on being able to detect the entire hallway or not
Given:
L (hallway length)
a list of N valid positions (p_0 ... p_N-1) at which to place a meter of radius 4.7
You can determine in O(N) either a valid and minimal ("good") covering of the whole hallway or a proof that no such covering exists given the constraints as follows (pseudo-code):
// total    = total length
// start    = current starting position, initially 0
// possible = list of possible meter positions
// placed   = list of (optimal) meter placements, initially empty
boolean solve(float total, float start, List<Float> possible, List<Float> placed):
    if (total - start <= 0):
        return true;  // problem solved with no additional meters - woo!
    else:
        Float next = extractFurthestWithinRange(start, possible, 4.7);
        if (next == null):
            return false;  // no way to cover end of hall: report failure
        else:
            placed.add(next);  // placement decided
            return solve(total, next + 4.7, possible, placed);
Where extractFurthestWithinRange(float start, List<Float> candidates, float range) returns null if there are no candidates within range of start, or returns the last position p in candidates such that p <= start + range -- and also removes p, and every candidate c with c <= p, from the list.
The key here is that, by always choosing to place a meter in the next position that a) leaves no gaps and b) is furthest from the previously-placed position, we are simultaneously creating a valid covering (= no gaps) and an optimal covering (= no valid covering could have used fewer meters, because our strides are already as wide as possible). At each iteration, we either completely solve the problem, or take a greedy bite to reduce it to a (guaranteed) smaller problem.
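For concreteness, an iterative Python sketch of the same greedy (assuming the candidate positions are already sorted in ascending order; the function name and the hard-coded radius are just illustrative):
def cover(length, positions, radius=4.7):
    # Greedy cover: repeatedly place the furthest candidate whose detection
    # range still reaches the already-covered prefix [0, covered].
    # Returns a list of meter positions covering [0, length],
    # or None if no full cover exists.  `positions` must be sorted ascending.
    placed = []
    covered = 0.0
    i, n = 0, len(positions)
    while covered < length:
        best = None
        while i < n and positions[i] - radius <= covered:
            best = positions[i]      # furthest candidate that leaves no gap
            i += 1
        if best is None:
            return None              # gap at `covered` that nothing can close
        placed.append(best)
        covered = best + radius
    return placed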
Note that there can be other optimal coverings with different meter positions, but they will use the exact same number of meters as those returned from this pseudo-code. For example, if you adapt the code to start from the end of the hallway instead of from the start, the covering would still be good, but the gaps could be rearranged. Indeed, if you need the lexicographically minimal optimal covering, you should use the adapted algorithm that places meters starting from the end:
// remaining = remaining length to cover (starts at the hallway length)
// possible  = positions to place meters at, starting with the closest to the end of the hallway
// placed    = positions where meters have been placed
boolean solve(float remaining, List<Float> possible, Queue<Float> placed):
    if (remaining <= 0):
        return true;  // problem solved with no additional meters - woo!
    else:
        // extracts points p up to and including the last p such that p >= remaining - range
        Float next = extractFurthestWithinRange2(remaining, possible, 4.7);
        if (next == null):
            return false;  // no way to cover start of hall: report failure
        else:
            placed.add(next);  // placement decided
            return solve(next - 4.7, possible, placed);
To prove that your solution is optimal if it is found, you merely have to prove that it finds the lexicographically last optimal solution.
And you do that by induction on the size of the lexicographically last optimal solution. The case of a zero-length floor and no monitors is trivial. Otherwise you demonstrate that you found the first element of the lexicographically last solution, and covering the rest of the line with the remaining elements is your induction step.
Technical note: for this to work you have to be allowed to place monitoring stations outside of the line.

Most efficient way to determine an intersection

Suppose I have two vectors of "starts" and "stops", sorted in ascending order.
Vector 1 = [start1 stop1;
start2 stop2;
start3 stop3];
Vector 2 = [start4 stop4;
start5 stop5;
start6 stop6];
What is the most efficient way of determining the intersection/overlap of these two vectors?
I've had to do this on a couple of occasions. It's a simple task, but the logic can get quite messy.
One thing you must decide upon, up-front, is whether intervals are closed or open. That is, do the intervals [1,3] and [3,5] have an intersection at [3,3], or no intersection? I strongly advise "no intersection" (closed intervals tend to be much more painful to reason about than open or half-open intervals), but your use case may require otherwise.
I think the cleanest way to do this is to maintain a "current partial interval" from each list. By "partial" I mean that each interval may be "eaten away" from the bottom as intersections with intervals from the other list are recognized and output. This simplifies the logic by only forcing you to consider two intervals at a time, rather than processing all the V2 intervals which are relevant to some interval in V1.
To simplify the code further, you can allow intervals to be temporarily invalid, and start with both current intervals invalid. This makes the code a bit more branchy, but it means you only have to handle updating them in one place and with one rule.
So the pseudocode goes like this (I'm destructively reading from V1 and V2, and writing to VI):
v1a,v1b = 0,0                              # Empty and hence invalid
v2a,v2b = 0,0                              # intervals to start with.
while True:
    if v1a >= v1b:                         # Handle an invalid V1 interval:
        if V1.empty():                     # if there are no more V1s,
            return                         # no more intersections are possible.
        else:
            v1a,v1b = V1.pop()             # Grab the next full interval from V1.
    if v2a >= v2b:                         # Same for the V2 side.
        if V2.empty():
            return
        else:
            v2a,v2b = V2.pop()
    lower_bound = max(v1a, v2a)            # Determine the overlap, if any, between
    upper_bound = min(v1b, v2b)            # the current two intervals.
    if lower_bound < upper_bound:
        VI.push(lower_bound, upper_bound)  # Output the overlapping interval.
    v1a = max(v1a, upper_bound)            # Snip away the region which has now been
    v2a = max(v2a, upper_bound)            # handled. This may make one or both invalid.
The last two lines are the tricky bit. If there was an intersection, then upper_bound is its upper end: There are no remaining intersecting ranges below it, so they can be removed from either or both current intervals. If, however, the two current intervals did not overlap, then it has the effect of setting the lower interval's a to its own b, making it invalid and causing it to be replaced on the next iteration.
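Translated into runnable Python (a sketch of exactly the pseudocode above; v1 and v2 are sorted lists of (start, stop) pairs, and touching endpoints are treated as non-intersecting):
def intersect(v1, v2):
    out = []
    i = j = 0
    a_lo = a_hi = b_lo = b_hi = 0.0        # start with two empty (invalid) intervals
    while True:
        if a_lo >= a_hi:                   # refill the V1 side
            if i == len(v1):
                return out
            a_lo, a_hi = v1[i]
            i += 1
        if b_lo >= b_hi:                   # refill the V2 side
            if j == len(v2):
                return out
            b_lo, b_hi = v2[j]
            j += 1
        lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
        if lo < hi:
            out.append((lo, hi))           # record the overlap
        a_lo = max(a_lo, hi)               # snip away the handled region;
        b_lo = max(b_lo, hi)               # this may invalidate one or both sides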
I believe you can take advantage of the fact that the lists are sorted and then do the following (pseudocode):
* Grab the first spans
* Determine if the spans overlap
* If the spans overlap, then max(starts) to min(ends) is an overlap
* Increment the span with the smallest end
* Merge results - you may need to pass over your results and merge overlapping elements
Here's some Python that implements this (note: a and b are lists that contain (start, end) tuples):
overlaps = []
try:
    a_span = a.pop(0)
    b_span = b.pop(0)
    while True:
        if b_span[1] > a_span[0] and b_span[0] < a_span[1]:
            overlaps.append((a_span[0] if a_span[0] > b_span[0] else b_span[0],
                             a_span[1] if a_span[1] < b_span[1] else b_span[1]))
        if a_span[1] < b_span[1]:
            a_span = a.pop(0)   # advance the list whose current span ends first
        else:
            b_span = b.pop(0)
except IndexError:
    pass                        # one list is exhausted: no more overlaps possible

Genetic/Evolutionary algorithm - Painter

My task:
Create a program to copy a picture (given as input) using only primitives (like triangles or something). The program should use an evolutionary algorithm to create the output picture.
My question:
I need to invent an algorithm to create populations and check them (how much - in % - they match the input picture).
I have an idea; you can find it below.
So what I want from you: advice (if you find my idea not so bad) or inspiration (maybe you have a better idea?)
My idea:
Let's say that I'll use only triangles to build the output picture.
My first population is P pictures (each generated from T randomly generated triangles, called Elements).
I check every picture in the population with my fitness function, choose E of them as the elite, and remove the rest of the population:
To compare 2 pictures, we check every pixel in picture A and compare its R,G,B with
the same pixel (the same coordinates) in picture B.
I use this:
SingleDif = sqrt[ (Ar - Br)^2 + (Ag - Bg)^2 + (Ab - Bb)^2]
then I sum all the differences (over all pixels) - let's call it SumDif
and use:
PictureDif = (DifMax - SumDif)/DifMax
where
DifMax = pictureHeight * pictureWidth * 255*3
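In NumPy terms, this fitness could look like the following sketch (assuming 8-bit RGB pictures stored as H x W x 3 arrays; the function name is just illustrative):
import numpy as np

def picture_dif(a, b):
    # SingleDif per pixel: Euclidean distance between the two RGB triples.
    single_dif = np.sqrt(np.sum((a.astype(float) - b.astype(float)) ** 2, axis=-1))
    sum_dif = single_dif.sum()                    # SumDif: total over all pixels
    dif_max = a.shape[0] * a.shape[1] * 255 * 3   # DifMax, as defined above
    return (dif_max - sum_dif) / dif_max          # PictureDif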
The best are used to create the next population in this way:
picture MakeChild(picture Mother, picture Father)
{
    picture child;
    for( int i = 0; i < T; ++i )
    {
        double j = random01(); // random number from 0 to 1, drawn fresh each iteration
        if( j < 0.5 ) child.element(i) = Mother.element(i);
        else          child.element(i) = Father.element(i);
        if( j < someSmallProbability ) mutate( child.element(i) ); // note: the same j gates inheritance and mutation
    }
    return child;
}
So it's quite simple. Only the mutation needs a comment: there is always some small probability that element X in the child will be different from X in its parent. To do this we make random changes to the element in the child (change its colour by a random amount, or add a random number to its (x,y) coordinates - or its nodes).
So this is my idea... I didn't test it, didn't code it.
Please check my idea - what do you think about it?
I would make the number of patches of each child dynamic and get the mutation operation to insert/delete patches with some (low) probability. Of course this could result in a lot of redundancy and bloat in the child's genome. In these situations, it is usually a good idea to use the length of an individual's genome as a parameter of the fitness function, so that individuals get rewarded (with a higher fitness value) for using fewer patches. So, for example, if the PictureDif of individuals A and B is the same but A has fewer patches than B, then A has a higher fitness.
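A minimal sketch of that parsimony pressure (the penalty weight alpha is an illustrative assumption, not a recommended value):
def parsimony_fitness(picture_dif, n_patches, alpha=1e-4):
    # Equal PictureDif but fewer patches => higher fitness.
    return picture_dif - alpha * n_patches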
Another issue is the reproductive operator that you proposed (namely, the crossover operation). In order for the evolutionary process to work efficiently, you need to achieve a reasonable exploration and exploitation balance. One way of doing this is by having a set of reproductive operators that exhibit a good fitness correlation [1] which means the fitness of a child must be close to the fitness of its parent(s).
In the case of single parent reproduction you only need to find the right mutation parameters. However, when it comes to multi-parent reproduction (crossover) one of the frequently used techniques is to produce 2 children (instead of 1) from the same 2 parents. For the first child, each gene comes from the mother with the probability of 0.2 and from the father with the probability of 0.8, and for the second child the other way around. Of course after the crossover, you can do the mutation.
Oh and one more thing, for the mutation operators, when you say
... make random changes to the element in the child (change its colour by a random amount, or add a random number to its (x,y) coordinates - or its nodes)
it's a good idea to use a Gaussian distribution to change the colour, coordinate etc.
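Putting the two-child crossover and the Gaussian mutation together, a minimal Python sketch (genomes as flat lists of numeric gene values; the rates and sigma are illustrative assumptions):
import random

def make_children(mother, father, p_mother=0.2, mut_rate=0.05, sigma=10.0):
    child1, child2 = [], []
    for g_m, g_f in zip(mother, father):
        if random.random() < p_mother:     # child 1 inherits from the mother,
            child1.append(g_m)             # child 2 gets the complementary gene
            child2.append(g_f)
        else:
            child1.append(g_f)
            child2.append(g_m)
    for child in (child1, child2):
        for i in range(len(child)):
            if random.random() < mut_rate:
                child[i] += random.gauss(0.0, sigma)   # Gaussian perturbation
    return child1, child2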
[1] Evolutionary Computation: A unified approach by Kenneth A. De Jong, page 69

matlab: optimum amount of points for linear fit

I want to make a linear fit to a few data points, as shown in the image. Since I know the intercept (in this case, say, 0.05), I want to fit only the points which are in the linear region with this particular intercept. In this case that would be, let's say, points 5:22 (but not 22:30).
I'm looking for a simple algorithm to determine this optimal number of points, based on... hmm, that's the question... R^2? Any ideas how to do it?
I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to wrap that into a clear and simple function. For fits with a fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m). Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to find a general solution to.
One approach would be to compute the slopes/intercepts between all consecutive pairs of points, and then do cluster analysis on the intercepts:
slopes = diff(y)./diff(x);
intercepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intercepts, 3);
x([idx; 3] == 2) % the points with the intercepts closest to the linear one
This requires the Statistics Toolbox (for kmeans). This is the best of all the methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slopes between two points in the start and end ranges lie close to the slope of the line, those points will be detected as belonging to the line. This (and other factors) will require a bit more post-processing of the solution found this way.
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time increasing the range of points from some point in the middle towards both of the endpoints, and see if the sum of the squared errors remains small. I gave this up very quickly, because defining what "small" means is very subjective and must be done in some heuristic way.
I tried a more systematic and robust version of the above:
function test
    %% example data
    slope = 2;
    intercept = 1.5;
    x = linspace(0.1, 5, 100).';
    y = slope*x + intercept;
    y(1:12) = log(x(1:12)) + y(12)-log(x(12));
    y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
    y = y + 0.2*randn(size(y));
    %% simple algorithm
    [X,fn] = fminsearch(@(ii)P(ii, x,y,intercept), [0.5 0.5])
    [~,inds] = P(X, x,y,intercept)
end
function [C, inds] = P(ii, x,y,intercept)
    % ii represents the fraction of the range from the center to each end,
    % so both entries of ii lie between 0 and 1.
    N = numel(x);
    n = round(N/2);
    ii = round(ii*n);
    inds = min(max(1, n+(-ii(1):ii(2))), N);
    % Solve the linear system with fixed intercept
    A = x(inds);
    b = y(inds) - intercept;
    % and return the sum of squared errors, divided by
    % the number of points included in the set. This
    % last step is required to prevent fminsearch from
    % reducing the set to 1 point (= minimum possible
    % squared error).
    C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouldn't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relatively small number of overlapping segments, and for each segment calculate the linear fit, or rather the first-order coefficient (remember you know the intercept, which will be the same for all segments).
Then, for each coefficient, calculate the MSE between this hypothetical line and the entire dataset, choosing the coefficient which yields the smallest MSE.
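In Python/NumPy terms, that recipe might look like the following sketch (the segment count and 50% overlap are illustrative choices, not part of the original suggestion):
import numpy as np

def best_slope(x, y, intercept, n_segments=6):
    # Fit a fixed-intercept slope on overlapping segments, then keep the slope
    # whose line has the lowest MSE against the full dataset.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    seg_len = max(2, 2 * n // n_segments)          # overlapping windows...
    step = max(1, seg_len // 2)                    # ...with roughly 50% overlap
    best_mse, best = np.inf, None
    for start in range(0, n - seg_len + 1, step):
        xs = x[start:start + seg_len]
        ys = y[start:start + seg_len] - intercept
        slope = xs.dot(ys) / xs.dot(xs)            # least squares with fixed intercept
        mse = np.mean((slope * x + intercept - y) ** 2)
        if mse < best_mse:
            best_mse, best = mse, slope
    return best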

Simple Weighted Random Walk with Hysteresis

I've already written a solution for this, but it doesn't feel "right", so I'd like some input from others.
The rules are:
Movement is on a 2D grid (Directions arbitrarily labelled N, NE, E, SE, S, SW, W, NW)
Probabilities of moving in a given direction are relative to the direction of travel (i.e. 40% represents straight ahead), and weighted:
[14%][40%][14%]
[ 8%][ 4%][ 8%]
[ 4%][ 4%][ 4%]
This means with overwhelming probability, travel will continue along its current trajectory. The middle value represents stopping. As an example, if the last move was NW, then the absolute probabilities would read:
[40%][14%][ 8%]
[14%][ 4%][ 4%]
[ 8%][ 4%][ 4%]
The probabilities are approximate - one thing I toyed with was making stopped a static 5% chance outside of the main calculation, which would have altered the probability of any other operation ever so slightly.
My current algorithm is as follows (in simplified pseudocode):
int[] probabilities = [4,40,14,8,4,4,4,8,14]
if move.previous == null:
    move.previous = STOPPED
if move.previous != STOPPED:
    // Cycle probabilities[1:8] until indexof(move.previous) holds the 40% entry
r = Random % 99
if r < probabilities.sum[0:0]:
    move.current = STOPPED
elif r < probabilities.sum[0:1]:
    move.current = N
elif r < probabilities.sum[0:2]:
    move.current = NE
...
Reasons why I really dislike this method:
* It forces me to assign specific roles to array indices: [0] = stopped, [1] = North...
* It forces me to operate on a subset of the array when cycling (i.e. STOPPED always remains in place)
* It's very iterative, and therefore, slow. It has to check every value in turn until it gets to the right one. Cycling the array requires up to 4 operations.
* A 9-case if-block (most languages do not allow dynamic switches).
* Stopped has to be special cased in everything.
Things I have considered:
* Circular linked list: Simplifies the cycling (make the pivot always equal north) but requires maintaining a set of pointers, and still involves assigning roles to specific indices.
* Vectors: Really not sure how I'd go about weighting this, plus I'd need to worry about magnitude.
* Matrices: Rotating matrices does not work like that :)
* Use a well-known random walk algorithm: Overkill? Though recommendations are considered.
* Trees: Just thought of this, so no real thought given to it...
So. Does anyone have any bright ideas?
You have 8 directions, and when you pick a direction you have to "rotate this matrix".
But this is just modular (wrap-around) addition.
Since you only pick from 100 integers for the probability, you can just put all 100 integers in a list, where the value at each position is the index of a relative direction.
You then rotate that direction (by modular addition) so that it points to the absolute move you have to make.
And then you have one array holding the (x, y) difference you have to apply to your position.
Something like that:
// 40 copies of 0, 14 copies of 1, 8 copies of 2, ...
int[100] probab = {0,0,0,0,0,0, ..., 1,1,1, ..., 2,2,2, ...};
and then
//                   N      NE     E      SE           STOP
int[9] next_move = {{0,1}, {1,1}, {1,0}, {1,-1}, ..., {0,0}}; // in a circle
So you pick
move = probab[randint(100)];
if (move != 8)  // if it's 8, you got STOP
{
    move = (previous_move + move) % 8;
}
move_x = next_move[move][0];
move_y = next_move[move][1];
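Here is a runnable Python sketch of this lookup-table idea. The exact split of the 100 entries across the relative offsets (40/14/14/8/8/4/4/4, plus 4 for STOP) is my reading of the question's table, and treating a stopped walker's offset as absolute mirrors the question's special case:
import random

# Absolute (dx, dy) offsets for the 8 compass directions, starting at N, clockwise.
DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

# 100-entry lookup table of *relative* offsets: 0 = straight ahead, 1/7 = 45-degree
# turns, 2/6 = 90-degree turns, 3/4/5 = sharper turns or reverse, 8 = STOP.
WEIGHTS = {0: 40, 1: 14, 7: 14, 2: 8, 6: 8, 3: 4, 4: 4, 5: 4, 8: 4}
TABLE = [offset for offset, count in WEIGHTS.items() for _ in range(count)]

def next_move(previous_dir):
    # previous_dir is an index 0-7 into DIRS (the last heading), or None if stopped.
    rel = TABLE[random.randrange(100)]
    if rel == 8:
        return None, (0, 0)                  # STOP
    if previous_dir is None:
        absolute = rel                       # stopped: treat the offset as absolute
    else:
        absolute = (previous_dir + rel) % 8  # rotate the offset onto the heading
    return absolute, DIRS[absolute]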
Use a more direct representation of direction in your algorithms, something like a (dx, dy) pair, for example.
This allows you to move by just having x += dx; y += dy;
(You can still use the "direction ENUM" + a lookup table if you wish...)
Your next problem is finding a good representation of the "probability table". Since r only ranges from 1 to 99 it might be feasible to just do a dumb array and use prob_table[r] directly.
Then, compute a 3x3 matrix of these probability tables using the method of your choice. It doesn't matter if it is slow because you only do it once.
To get the next direction simply
prob_table = dir_table[curr_dx][curr_dy];
(curr_dx, curr_dy) = get_next_dir(prob_table, random_number());
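As a sketch of that precomputation in Python (assuming a 100-entry relative-offset table like the one in the previous sketch, with 8 meaning STOP, and a dict keyed by (dx, dy) standing in for the 3x3 matrix):
def build_dir_tables(base_table, dirs):
    # For every possible current heading, build a table that maps a 0-99 roll
    # directly to the next (dx, dy) step.
    tables = {}
    for heading, key in enumerate(dirs):
        tables[key] = [(0, 0) if rel == 8 else dirs[(heading + rel) % 8]
                       for rel in base_table]
    # While stopped ((0, 0) heading), treat the relative offset as absolute.
    tables[(0, 0)] = [(0, 0) if rel == 8 else dirs[rel] for rel in base_table]
    return tables

# Usage, with TABLE and DIRS from the sketch above:
#   tables = build_dir_tables(TABLE, DIRS)
#   curr_dx, curr_dy = tables[(curr_dx, curr_dy)][random_number()]  # roll in 0..99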

Resources