Calculate new probability based on past probability per value

I want to calculate a percentage probability based on a list of past occurrences.
The data looks similar to this simplified table, for instance when the first value has been 8 in the past there has been a 72% chance of the event occurring.
1 76%
2 64%
4 80%
6 85%
7 83%
8 72%
11 70%
The full table ranges from 0 to 1030 and has 377 rows but changes daily. I want to pass the function a value such as 3 and be returned a percentage probability of the event occurring. I don't need exact code, but would appreciate being pointed in the right direction.
Thanks

Based on your answers in the comments of the question, I would suggest an interpolation; linear interpolation is the simplest answer. It doesn't look like a probabilistic model would be appropriate based on the series in the spreadsheet (there doesn't appear to be a clear relationship between column 1 and column 3).
To give an example of how this would work: imagine you want the probability for some point p, which is unobserved in the data. The biggest value you observe which is less than p is p_low (with corresponding probability f(p_low)), and the smallest value greater than p is p_high (with probability f(p_high)). Your estimate for p is:
interval = p_high - p_low
f_p_hat = ((p_high-p)/interval*f_p_low) + ((p-p_low)/interval*f_p_high)
This makes your estimate for p a weighted average of the values at p_low and p_high, with each endpoint weighted by how close p is to it. E.g. if p is equidistant between p_low and p_high, f_p_hat (your estimate for f(p)) is just the mean of f(p_low) and f(p_high).
Now, linear interpolation may not work if you have reason to suspect that the estimates at the endpoints are inaccurate (possibly due to small sample sizes). If so, it would be possible to do a (possibly weighted) least squares fit to a neighbourhood of points around p, and use that as a prediction. If this is the case I can go into a bit more detail.
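To make that concrete, here is a minimal Python sketch of the lookup-plus-interpolation idea (the table values come from the question; the function name and the clamping at the ends of the table are illustrative choices, not part of the original answer):

from bisect import bisect_left

# (value, probability) pairs from the historical table, kept sorted by value.
table = [(1, 0.76), (2, 0.64), (4, 0.80), (6, 0.85),
         (7, 0.83), (8, 0.72), (11, 0.70)]

def estimate_probability(p):
    """Linearly interpolate a probability for an unobserved value p."""
    values = [v for v, _ in table]
    i = bisect_left(values, p)
    if i < len(values) and values[i] == p:   # exact match in the table
        return table[i][1]
    if i == 0:                               # below the table: clamp to the first entry
        return table[0][1]
    if i == len(values):                     # above the table: clamp to the last entry
        return table[-1][1]
    p_low, f_p_low = table[i - 1]
    p_high, f_p_high = table[i]
    interval = p_high - p_low
    return ((p_high - p) / interval) * f_p_low + ((p - p_low) / interval) * f_p_high

print(estimate_probability(3))   # halfway between 2 (64%) and 4 (80%) -> 0.72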

Related

A sudoku problem: Efficiently find or approximate probability distribution over chosen numbers at each index of an array with no repeats

I'm looking for an efficient algorithm to generate or iteratively approximate a solution to the problem described below.
You are given an array of length N and a finite set of numbers Si for each index i of the array. Now, if we are to place a number from Si at each index i to fill the entire array, while ensuring that each number is unique across the entire array: given all the possible arrays, what is the probability distribution over each number at each index?
Here I give an example:
Assuming we have the following array of length 3 with each column representing Si at the index of the column
S1  S2  S3
 4   4   4
 -   2   2
 1   1   1
We will have the following possible arrays:
421
412
124
142
And the following probability distribution (rows are the values 1, 2, 4; columns are indices 1 to 3):
value 1:  0.5   0.25  0.25
value 2:   -    0.5   0.5
value 4:  0.5   0.25  0.25
Brute forcing this problem is obviously doable but I have a gut feeling that there must be some more efficient algorithms for this.
The reason why I think so is due to the fact that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information than the set of all possibilities. Therefore, I believe that we do not need to generate all possibilities just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem or even fixed-point iteration/density evolution to approximate the end probability distribution? Some other potentially more efficient approaches to this problem are also appreciated.
(p.s. The reason why I am interested in this problem is because I wanted to generate a probability distribution over candidate numbers for the empty cells in a sudoku and other sudoku-like games without a unique answer, by only applying all the standard rules)
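For reference, the brute-force baseline I am hoping to beat is short to write; a minimal Python sketch, assuming the candidate sets are given as a list of sets:

from itertools import product
from collections import Counter

def brute_force_distribution(candidate_sets):
    """Enumerate every valid array and tally how often each number lands at each index."""
    n = len(candidate_sets)
    counts = [Counter() for _ in range(n)]
    total = 0
    for arr in product(*candidate_sets):
        if len(set(arr)) == n:               # all entries distinct
            total += 1
            for i, v in enumerate(arr):
                counts[i][v] += 1
    if total == 0:
        return None                          # no valid assignment exists
    return [{v: c / total for v, c in counts[i].items()} for i in range(n)]

# The example above: S1 = {1, 4}, S2 = {1, 2, 4}, S3 = {1, 2, 4}
print(brute_force_distribution([{1, 4}, {1, 2, 4}, {1, 2, 4}]))

For the 3-element example this reproduces the distribution shown, but the product over all Si grows exponentially, which is exactly the cost I would like to avoid.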
Sudoku is a combinatorial problem. It is easy to show that the probability of any independent cell is uniform (because you can relabel a configuration to put any number at a given position). The joint probabilities are more complicated.
If the game is partially filled you have constraints that will affect this distribution.
You must devise an algorithm to calculate the number of solutions from a given initial configuration. Then you compute what fraction of the total solutions have a specific value at the position of interest.
counts = {}
for i in range(1, 10):
    board[cell] = i
    counts[i] = countSolutions(board)
total = sum(counts.values())
prob = {i: counts[i] / total for i in range(1, 10)}
The same approach works for joint probabilities but in some cases the number of possibilities may be too high.

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp provides illumination to every square on their x axis, every square on their y axis, and every square that lies in their diagonal (think of a Queen in chess). Given an array of query coordinates, determine whether that point is illuminated or not. The catch is when checking a query all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both the nw-se and ne-sw directions. The diagonals through this point are both numbered zero. The nw-se diagonals increase per cell in, say, the northeast direction and decrease (going negative) to the southwest. Similarly, the ne-sw diagonals are numbered increasing in, say, the northwest direction and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has only 3 or fewer entries, it's a constant-time operation to see whether they're all adjacent.
This solution requires the input points to be represented in 4 lists. Since they need to be represented in one list, you can argue that this algorithm requires only a constant factor of space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C. So a ~32 GB RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well less than a minute.
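As an illustration of the four-map idea (a sketch with my own names, assuming lamps and queries are (x, y) tuples; not the exact code the answer has in mind):

from collections import defaultdict

def build_maps(lamps):
    """Index lamps by row, column and the two diagonal numbers."""
    x_map, y_map, nwse_map, swne_map = (defaultdict(list) for _ in range(4))
    for x, y in lamps:
        x_map[x].append((x, y))
        y_map[y].append((x, y))
        nwse_map[x - y].append((x, y))   # all cells on a nw-se diagonal share x - y
        swne_map[x + y].append((x, y))   # all cells on a sw-ne diagonal share x + y
    return x_map, y_map, nwse_map, swne_map

def is_lit(query, maps):
    qx, qy = query
    x_map, y_map, nwse_map, swne_map = maps
    lines = (x_map.get(qx, []), y_map.get(qy, []),
             nwse_map.get(qx - qy, []), swne_map.get(qx + qy, []))
    for lamps_on_line in lines:
        # At most 3 lamps on one line can be adjacent to (or on) the query,
        # so a longer list always leaves at least one lamp switched on.
        if len(lamps_on_line) > 3:
            return True
        if any(abs(lx - qx) > 1 or abs(ly - qy) > 1 for lx, ly in lamps_on_line):
            return True
    return False

Building the maps is a single pass over the lamps; each query then touches at most four short lists.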
There is a very easy answer which works:
Create a grid of NxN.
For each lamp, increment the count of every cell that lamp is supposed to illuminate.
For each query, check whether the cell at that query has a value > 0.
For each lamp adjacent to (or on) the query, find all the cells it illuminates and reduce their counts by 1.
This worked fine but hit the size limit when trying a 10000 x 10000 grid.

Biggest diameter of a set with a distance function

I have a set of elements with a distance function between elements satisfying the triangle inequality.
I want to find the pair of elements separated by the greatest distance.
Is there any known solution better than trying all pairs?
If you measure the distance from point a to points b, c and d, and you find that |ab| + |ac| < |ad|, then you know that |bc| is shorter than |ad|, and there's no need to measure |bc|. So not all pairs need to be checked to find the longest distance.
A possible algorithm would be:
Start by measuring the distance from point a to all other points, find the point n which is furthest away from a, and then give all pairs b,x for which |ab|+|ax| < |an| the distance |ab|+|ax| (because that is their maximum possible distance).
Do the same for point b, measuring only those distances which haven't yet been set. Check if you've found a new maximum, and then again give all pairs c,x for which |bc|+|bx| < MAX the distance |bc|+|bx|.
Go on doing this for points c, d, ...
In the best case scenario, you could find the longest distance in a set of N points after just N-1 measurements (if |ax| is twice as long as any other distance from a). In the worst case, you would need to measure every single pair (if the shortest distance is more than half of the longest distance, or if you are unlucky in the order in which you run through the points).
If you want to reduce the number of distance measurements to the absolute minimum, and for every unknown distance x,y you check every previously stored value |ax|+|ay|, |bx|+|by|, |cx|+|cy| ... to see whether it's smaller than the current maximum and can thus be used as a value for |xy|, the number of measurements is reduced substantially.
Running this algorithm on 1000 random points in a square 2D space, which would normally require 499,500 measurements, returns the maximum distance with between 2,000 and 10,000 measurements (or between 0.4% and 2% of the total, with an average around 1%).
This doesn't necessarily mean that the algorithm is much faster in practice than measuring every distance; that depends on how expensive the measuring is compared to the combination of loops, additions and comparisons required to avoid the measurements.
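Here is a rough Python sketch of that bookkeeping (the structure and names are my own, not a reference implementation); it stores an upper bound per unmeasured pair and only calls the distance function when the bound could still beat the current maximum:

import math, random

def longest_distance(points, dist):
    """Return the farthest pair, skipping measurements whose triangle-inequality
    upper bound cannot beat the current maximum."""
    n = len(points)
    best, best_pair = 0.0, None
    upper = {}                       # (i, j), i < j  ->  upper bound on dist(points[i], points[j])
    for a in range(n):
        measured = {}                # distances actually measured from point a this round
        for x in range(a + 1, n):
            if upper.get((a, x), math.inf) <= best:
                continue             # this pair cannot be the new maximum
            d = dist(points[a], points[x])
            measured[x] = d
            if best_pair is None or d > best:
                best, best_pair = d, (a, x)
        # Triangle inequality: dist(x, y) <= dist(a, x) + dist(a, y)
        xs = sorted(measured)
        for i, x in enumerate(xs):
            for y in xs[i + 1:]:
                bound = measured[x] + measured[y]
                if bound < upper.get((x, y), math.inf):
                    upper[(x, y)] = bound
    return best_pair, best

pts = [(random.random(), random.random()) for _ in range(200)]
print(longest_distance(pts, math.dist))

Note that maintaining the bounds is itself up to O(n^3) work even when few distances are measured, which is the trade-off described above.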
As @mcdowella pointed out, this method becomes less efficient as the number of dimensions of the space increases. The number of points also has a big impact. The table below shows the number of measurements that have to be carried out in relation to the total number of pairs. These are averages from a test with randomly distributed points in a "square" space (i.e. the coordinates in all dimensions are within the same range). As you can see, this method makes most sense for geometrical problems with many points in a 2D or 3D space. However, if your data is highly biased in some way, the results may be different.
dim   10 points (45 pairs)     100 points (4950 pairs)    1000 points (499500 pairs)
      measurem.  % of total    measurem.   % of total     measurem.    % of total
 1     16.6674     37.04         221.17       4.47          4877.97       0.98
 2     22.4645     49.92         346.77       7.01          5346.78       1.07
 3     27.5892     61.31         525.73      10.62          7437.16       1.49
 4     31.9398     70.98         731.83      14.78         12780.02       2.56
 5     35.3313     78.51         989.27      19.99         19457.84       3.90
 6     38.1420     84.76        1260.89      25.47         26360.16       5.28
 7     40.2296     89.40        1565.80      31.63         33221.32       6.65
 8     41.6864     92.64        1859.08      37.56         44073.42       8.82
 9     42.7149     94.92        2168.03      43.80         56374.36      11.29
10     43.4463     96.55        2490.69      50.32         73053.06      14.63
20     44.9789     99.95        4617.41      93.28        289978.20      58.05
30     44.9996     99.999       4936.68      99.73        460056.04      92.10
40         -          -         4949.79      99.99        496893.10      99.48
50         -          -         4949.99      99.9999      499285.80      99.96
60         -          -             -           -         499499.60      99.9999
As expected, the results of the tests become predictable at higher dimensions, with only a few percent between the outliers, while in 2D some test cases required 30 times more measurements than others.

How can I sort a 10 x 10 grid of 100 car images in two dimensions, by price and speed?

Here's the scenario.
I have one hundred car objects. Each car has a property for speed, and a property for price. I want to arrange images of the cars in a grid so that the fastest and most expensive car is at the top right, and the slowest and cheapest car is at the bottom left, and all other cars are in an appropriate spot in the grid.
What kind of sorting algorithm do I need to use for this, and do you have any tips?
EDIT: the results don't need to be exact - in reality I'm dealing with a much bigger grid, so it would be sufficient if the cars were clustered roughly in the right place.
Just an idea inspired by Mr Cantor:
calculate max(speed) and max(price)
normalize all speed and price data into range 0..1
for each car, calculate the "distance" to the possible maximum
based on a²+b²=c², distance could be something like
sqrt( (speed(car[i])/maxspeed)^2 + (price(car[i])/maxprice)^2 )
apply weighting as (visually) necessary
sort cars by distance
place "best" car in "best" square (upper right in your case)
walk the grid in zigzag and fill with next car in sorted list
Result (mirrored, top left is best):
1 - 2   6 - 7
  /   /   /
3   5   8
|  /
4
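A small Python sketch of this idea (assuming each car is a dict with "speed" and "price" keys; the score follows the formula above, and the grid is filled along anti-diagonals as in the diagram, best car at the top left):

import math

def score(car, max_speed, max_price):
    """The distance measure given above: sqrt((speed/maxspeed)^2 + (price/maxprice)^2)."""
    return math.sqrt((car["speed"] / max_speed) ** 2 + (car["price"] / max_price) ** 2)

def zigzag_cells(size):
    """Visit grid cells along anti-diagonals, zigzagging, starting at the top-left corner."""
    for d in range(2 * size - 1):
        diagonal = [(r, d - r) for r in range(size) if 0 <= d - r < size]
        yield from (diagonal if d % 2 else reversed(diagonal))

def place_cars(cars, size=10):
    max_speed = max(c["speed"] for c in cars)
    max_price = max(c["price"] for c in cars)
    ranked = sorted(cars, key=lambda c: score(c, max_speed, max_price), reverse=True)
    grid = [[None] * size for _ in range(size)]          # leftover cells stay None
    for car, (r, c) in zip(ranked, zigzag_cells(size)):  # "best" car lands at (0, 0)
        grid[r][c] = car
    return grid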
Treat this as two problems:
1: Produce a sorted list
2: Place members of the sorted list into the grid
The sorting is just a matter of you defining your rules more precisely. "Fastest and most expensive first" doesn't work. Which comes first my £100,000 Rolls Royce, top speed 120, or my souped-up Mini, cost £50,000, top speed 180?
Having got your list how will you fill it? First and last is easy, but where does number two go? Along the top or down? Then where next, along rows, along the columns, zig-zag? You've got to decide. After that coding should be easy.
I guess what you want is to have cars that have "similar" characteristics to be clustered nearby, and additionally that the cost in general increases rightwards, and speed in general increases upwards.
I would try the following approach. Suppose you have N cars and you want to put them in an X * Y grid. Assume N == X * Y.
Put all the N cars in the grid at random locations.
Define a metric that calculates the total misordering in the grid; for example, count the number of car pairs C1=(x,y) and C2=(x',y') such that C1.speed > C2.speed but y < y' plus car pairs C1=(x,y) and C2=(x',y') such that C1.price > C2.price but x < x'.
Run the following algorithm:
Calculate current misordering metric M
Enumerate through all pairs of cars in the grid and calculate the misordering metric M' you obtain if you swapped the cars
Swap the pair of cars that reduces the metric most, if any such pair was found
If you swapped two cars, repeat from step 1
Finish
This is a standard "local search" approach to an optimization problem. What you have here is basically a simple combinatorial optimization problem. Another approach to try might be a self-organizing map (SOM) with a preseeded gradient of speed and cost in the matrix.
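A minimal Python sketch of that local search (greedy best-swap; the misordering metric follows the description above, with row 0 as the top of the grid, and the dict keys "speed" and "price" are assumptions):

import itertools

def misordering(grid):
    """Count pairs that violate 'speed increases upwards' or 'price increases rightwards'."""
    cells = [(r, c, car) for r, row in enumerate(grid) for c, car in enumerate(row)]
    bad = 0
    for (r1, c1, a), (r2, c2, b) in itertools.combinations(cells, 2):
        if a["speed"] > b["speed"] and r1 > r2:   # faster car sits lower (row 0 is the top)
            bad += 1
        if a["price"] > b["price"] and c1 < c2:   # pricier car sits further left
            bad += 1
    return bad

def local_search(grid):
    """Repeatedly apply the single best pairwise swap until no swap improves the metric."""
    positions = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    current = misordering(grid)
    improved = True
    while improved:
        improved = False
        best_delta, best_pair = 0, None
        for (r1, c1), (r2, c2) in itertools.combinations(positions, 2):
            grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]   # try the swap
            delta = misordering(grid) - current
            grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]   # undo it
            if delta < best_delta:
                best_delta, best_pair = delta, ((r1, c1), (r2, c2))
        if best_pair:
            (r1, c1), (r2, c2) = best_pair
            grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
            current += best_delta
            improved = True
    return grid

Local search like this can stop in a local optimum, which is usually acceptable given that the question only asks for a rough clustering.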
Basically you have to take one of speed or price as the primary, group together the cars that share the same value of that primary, sort each group by the other value in ascending/descending order, and take the primary values themselves in ascending/descending order as needed.
Example:
c1(20,1000) c2(30,5000) c3(20, 500) c4(10, 3000) c5(35, 1000)
Let's assume Car(speed, price) is the notation in the list above and that the primary is speed.
1 Get the car with minimum speed
2 Then get all the cars with the same speed value
3 Arrange these values in ascending order of car price
4 Get the next car with the next minimum speed value and repeat the above process
c4(10, 3000)
c3(20, 500)
c1(20, 1000)
c2(30, 5000)
c5(35, 1000)
If you post what language you are using, then it would be helpful, as some language constructs make this easier to implement. For example, LINQ makes your life very easy in this situation.
cars.OrderBy(x => x.Speed).ThenBy(p => p.Price);
Edit:
Now that you have the list, as for placing these cars into the grid: unless you know in advance that there will be exactly this many cars with these values, you can't do much except go with a fixed grid size as you are doing now.
One option would be to go with a non-uniform grid, if you prefer, with each row holding cars of a specific speed, but this is only applicable when you know there will be a considerable number of cars sharing the same speed value.
So each row of the grid will show cars of the same speed.
Thanks
Is the 10x10 constraint necessary? If it is, you must have ten speeds and ten prices, or else the diagram won't make very much sense. For instance, what happens if the fastest car isn't the most expensive?
I would rather recommend you make the grid size equal to
(number of distinct speeds) x (number of distinct prices),
then it would be a (rather) simple case of ordering by two axes.
If the data originates in a database, then you should order them as you fetch them from the database. This should only mean adding ORDER BY speed, price near the end of your query, but before the LIMIT part (where 'speed' and 'price' are the names of the appropriate fields).
As others have said, "fastest and most expensive" is a difficult thing to do, you ought to just pick one to sort by first. However, it would be possible to make an approximation using this algorithm:
Find the highest price and fastest speed.
Normalize all prices and speeds to e.g. a fraction out of 1. You do this by dividing the price by the highest price you found in step 1.
Multiply the normalized price and speed together to create one "price & speed" number.
Sort by this number.
This ensures that if car A is faster and more expensive than car B, it gets put ahead of it in the list. Cars where one value is higher but the other is lower get roughly sorted. I'd recommend storing these values in the database and sorting as you select.
Putting them in a 10x10 grid is easy. Start outputting items, and when you get to a multiple of 10, start a new row.
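In code, that recipe might look roughly like this (a sketch; the dict keys and the "highest product first" ordering are my assumptions):

def arrange_in_grid(cars, columns=10):
    """Sort cars by the product of normalized speed and price, then fill the grid row by row."""
    max_speed = max(c["speed"] for c in cars)
    max_price = max(c["price"] for c in cars)
    ranked = sorted(cars,
                    key=lambda c: (c["speed"] / max_speed) * (c["price"] / max_price),
                    reverse=True)                       # best combined score first
    return [ranked[i:i + columns] for i in range(0, len(ranked), columns)]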
Another option is to apply a score 0 .. 200% to each car, and sort by that score.
Example:
score_i = speed_percent(min_speed, max_speed, speed_i) + price_percent(min_price, max_price, price_i)
Hmmm... a kind of bubble sort could be a simple algorithm here.
Make a random 10x10 array.
Find two neighbours (horizontal or vertical) that are in "wrong order", and exchange them.
Repeat (2) until no such neighbours can be found.
Two neighbour elements are in "wrong order" when:
a) they're horizontal neighbours and left one is slower than right one,
b) they're vertical neighbours and top one is cheaper than bottom one.
But I'm not actually sure if this algorithm stops for every input. I'm almost sure it is very slow :-). It should be easy to implement, and after some finite number of iterations the partial result might be good enough for your purposes. You can also start by generating the array using one of the other methods mentioned here. It will also maintain your condition on the array shape.
Edit: It is too late here to prove anything, but I made some experiments in python. It looks like a random array of 100x100 can be sorted this way in a few seconds, and I always managed to get a full 2D ordering (that is: at the end there were no wrongly-ordered neighbours). Assuming that the OP can precalculate this array, he can put any reasonable number of cars into the array and get sensible results. Experimental code: http://pastebin.com/f2bae9a79 (you need matplotlib, and I recommend ipython too). iterchange is the sorting method there.
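The swap rule itself is easy to sketch in Python (this is not the pastebin code, just the "wrong order" conditions from above; cars are assumed to be dicts with "speed" and "price" keys):

def swap_pass(grid):
    """One pass over the grid: fix any horizontally or vertically wrongly-ordered neighbours.
    Returns the number of swaps made, so the caller can loop until it reaches zero."""
    rows, cols = len(grid), len(grid[0])
    swaps = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and grid[r][c]["speed"] < grid[r][c + 1]["speed"]:
                grid[r][c], grid[r][c + 1] = grid[r][c + 1], grid[r][c]   # left one was slower
                swaps += 1
            if r + 1 < rows and grid[r][c]["price"] < grid[r + 1][c]["price"]:
                grid[r][c], grid[r + 1][c] = grid[r + 1][c], grid[r][c]   # top one was cheaper
                swaps += 1
    return swaps

Repeating "while swap_pass(grid): pass" gives the iteration described above.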

Algorithm for modeling expanding gases on a 2D grid

I have a simple program; at its heart is a two-dimensional array of floats, supposedly representing gas concentrations. I have been trying to come up with a simple algorithm that will model the gas expanding outwards, like a cloud, eventually ending up with the same concentration of the gas everywhere across the grid.
For example a given state progression could be:
(using ints for simplicity)
starting state
00000
00000
00900
00000
00000
state after 1 pass of algorithm
00000
01110
01110
01110
00000
one more pass should give a 5x5 grid all containing the value 0.36 (9/25).
I've tried it out on paper but no matter how I try, I can't get my head around an algorithm to do this.
So my question is, how should I set about trying to code this algorithm? I've tried a few things, applying a convolution, trying to take each grid cell in turn and distributing it to its neighbours, but they all end up having undesirable effects, such as eventually ending up with less gas than I originally started with, or all of the gas movement being in one direction instead of expanding outwards from the centre. I really can't get my head around it at all and would appreciate any help.
It's either a diffusion problem if you ignore convection or a fluid dynamics/mass transfer problem if you don't. You would start with equations for conservation of mass and momentum for an Eulerian (fixed control volume) viewpoint if you were solving from scratch.
It's a transient problem, so you need to perform an integration to advance the state from time t(n) to t(n+1). You show a grid, but nothing about how you're solving in time. What integration scheme have you tried? Explicit? Implicit? Crank-Nicolson? If you don't know, you're not approaching the problem correctly.
One book that I really liked on this subject was S.V. Patankar's "Numerical Heat Transfer and Fluid Flow". It's a little dated now, but I liked the treatment. It's still good after 29 years, but there might be better texts since I was reading on the subject. I think it's approachable for somebody looking into it for the first time.
In the example you give, your second stage has a core of 1's. Usually diffusion requires a concentration gradient, so most diffusion related techniques won't change the 1 in the middle on the next iteration (nor would they have got to that state after the first one, but it's a bit easier to see once you've got a block of equal values). But as the commenters on your post say, that's not likely to be the cause of a net movement. Reducing the gas may be edge effects, but can also be a question of rounding errors - set the cpu to round half even, and total the gas and apply a correction now and again.
It looks like you're trying to implement a finite difference solver for the heat equation with Neumann boundary conditions (insulation at the edges). There's a lot of literature on this kind of thing. The Wikipedia page on finite difference method describes a simple but stable method, but for Dirichlet boundary conditions (constant density at edges). Modifying the handling of the boundary conditions shouldn't be too difficult.
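For example, one explicit finite-difference step with Neumann (no-flux) boundaries could be sketched like this in plain Python (alpha stands for D*dt/dx^2 and should stay at or below 0.25 for stability; the clamped-index handling of the boundaries is my own choice):

def diffuse_step(grid, alpha=0.2):
    """One explicit finite-difference step of the 2D diffusion equation.
    Neumann (insulated) boundaries are handled by clamping neighbour indices."""
    rows, cols = len(grid), len(grid[0])
    new = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            up    = grid[max(r - 1, 0)][c]
            down  = grid[min(r + 1, rows - 1)][c]
            left  = grid[r][max(c - 1, 0)]
            right = grid[r][min(c + 1, cols - 1)]
            new[r][c] = grid[r][c] + alpha * (up + down + left + right - 4 * grid[r][c])
    return new

# usage: advance one time step at a time
# grid = diffuse_step(grid)

Because every exchange between neighbouring cells is antisymmetric and nothing crosses the boundary, the total amount of gas is conserved, which addresses the "losing gas" problem in the question.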
It looks like what you want is something like a smoothing algorithm, often used in programs like Photoshop, or old school demo effects, like this simple Flame Effect.
Whatever algorithm you use, it will probably help you to double buffer your array.
A typical smoothing effect will be something like:
while True:                                    # loop forever (or until the grid settles)
    for x in range(1, width - 1):              # width/height are the grid dimensions; interior cells only, treat the edges separately
        for y in range(1, height - 1):
            b2[x][y] = (b1[x][y] + (b1[x+1][y] + b1[x-1][y] + b1[x][y+1] + b1[x][y-1]) / 8) / 2
    b1, b2 = b2, b1                            # swap the double buffers
See Tom Forsyth's Game Programming Gems article. Looks like it fulfils your requirements, but if not then it should at least give you some ideas.
Here's a solution in 1D for simplicity:
The initial setup is with a concentration of 9 at the origin (shown in parentheses below), and 0 at all other positive and negative coordinates.
initial state:
0 0 0 0 (9) 0 0 0 0
The algorithm to find next iteration values is to start at the origin and average current concentrations with adjacent neighbors. The origin value is a boundary case and the average is done considering the origin value, and its two neighbors simultaneously, i.e. average among 3 values. All other values are effectively averaged among 2 values.
after iteration 1:
0 0 0 3 (3) 3 0 0 0
after iteration 2:
0 0 1.5 1.5 (3) 1.5 1.5 0 0
after iteration 3:
0 .75 .75 2 (2) 2 .75 .75 0
after iteration 4:
.375 .375 1.375 1.375 (2) 1.375 1.375 .375 .375
You do these iterations in a loop, outputting the state every n iterations. You may introduce a time constant to control how many iterations represent one second of wall-clock time. This is also a function of what length units the integer coordinates represent. For a given H/W system, you can tune this value empirically. You may also introduce a steady-state tolerance value to control when the program decides that "all neighbor values are within this tolerance" or "no value changed between iterations by more than this tolerance", and so the algorithm has reached a steady-state solution.
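The driver loop described in this paragraph might look roughly like this (a sketch; `step` stands in for whichever averaging rule you settle on, `state` is a flat list of cell values as in the 1D example, and the tolerance and output interval are illustrative):

def run_to_steady_state(state, step, tolerance=1e-6, output_every=10, max_iters=100000):
    """Iterate state = step(state) until no cell changes by more than tolerance,
    printing the state every output_every iterations."""
    for it in range(1, max_iters + 1):
        new_state = step(state)
        max_change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if it % output_every == 0:
            print(f"iteration {it}: {state}")
        if max_change <= tolerance:
            print(f"steady state reached after {it} iterations")
            break
    return state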
The concentration for each iteration given a starting concentration can be obtained by the equation:
concentration = startingConcentration/(2*iter + 1)**2
iter is the time iteration. So for your example.
startingConcentration = 9
iter = 0
concentration = 9/(2*0 + 1)**2 = 9
iter = 1
concentration = 9/(2*1 + 1)**2 = 1
iter = 2
concentration = 9/(2*2 + 1)**2 = 9/25 = .36
you can set the value of the array after each "time step"
