In some crunching number program, I have a function which can be just 1 or 0 in three dimensions. I do not know in advance the function, but I need to know the total "surface" of the function which is equal to zero. In a similar problem I could draw a rectangle over the 2D representation of the map of United Kingdom. The function is equal to 0 at sea, and 1 at the earth. I need to know the total water surface. I wonder what is the best parallel algorithm or method for doing this.
I thought first about the following approach; a) divide 2D map area into a rectangular grid. For each point that belongs to the center of each cell, check whether it is earth of water. This can be done in parallel. At the end of the procedure I will have a matrix with ones and zeroes. I will get the area with some precision. Now I want to increase this precision, so b) choose the cells that are in the border regions between zeroes and ones (what is the best criterion for doing this?) and in those cells, divide them again into successive cells and repeat the process until one gets the desired accuracy. I guess that in this process, the critical parameters are the grid size for each new stage, and how to store and check the cells that belong to the border area. Finally the most optimal method, from the computational point of view, is the one that performs the minimal number of checks in order to get the value of the total surface with the desired accuracy.
First of all, it looks like you are talking about 3D function, e.g. for two coordinates x and y you have f(x, y) = 0 if (x, y) belongs to the sea, and f(x, y) = 1 otherwise.
Having said that, you can use the following simple approach.
Split your rectangle into N subrectangles, where N is the number of
your processors (or processor cores, or nodes in a cluster, etc.)
For each subrectangle use Monte Carlo method to calculate the
surface of the water.
Add the N values to calculate the total surface of the water.
Of course, you can use any other method to calculate the surface, Mothe Carlo was just an example. But the idea is the same: subdivide your problem to N subproblems, solve them in parallel, then combine the results.
Update: For the Monte Carlo method the error estimate decreases as 1/sqrt(N) where N is the number of samples. For instance, to reduce the error by a factor of 2 requires a 4-fold increase in the number of sample points.
I believe that your attitude is reasonable.
Choose the cells that area in the border regions between zeroes and ones (what is the best criterion for doing this?)
Each cell has 8 sorrunding cells (3x3), or 24 sorrunding cells (5x5). If at least one of the 9 or 25 cells contains land, and at least one of these cells contains water - increase the accuracy for the whole block of cells (3x3 or 5x5) and query again.
When the accuracy is good enough - instead of splitting, just add the land area to the sum.
Efficiency
Use a producers-consumer queue. Create n threads, where n equals to the number of cores on your machine. All threads should do the same job:
Dequeue a geo-cell from the queue
If the area of the cell is still large - divide it into 3x3 or 5x5 cells, for each of the split cells check for land/sea. If there is a mix - enqueue all these cells. If it only land: just add the area. only sea: do nothing.
For start, just divide the whole area into reasonable sized cell and equeue all of them.
You can also optimize by not adding all the 9 or 25 cells when there is a mix, but examine the pattern (only top/bottom/left/right cells).
Edit:
There is a tradeoff between accuracy and performance: If the initial cell size is too large, you may miss small lakes or small islands. therefore the optimization criteria should be: start with the largest cells possible that will assure enough accuracy.
Related
Please forgive me if I do not explain my question clearly in title.
Here may I show you two pictures as my example:
my question is described as follows: I have 2 or more different objects(In the pictures, two objects: circle and cross), each one is placed repeatedly with a fixed row/column distance (In the pictures, the circle has a distance of 4 and cross has a distance of 2) into a grid.
In the first picture, each of the two objects are repeated correctly without any interruptions(here interruption means one object may occupy another one's position), but the arrangement in the first picture is ununiform distributed; on the contrary, in the second picture, the two objects may have interruptions (the circle object occupies cross objects' position) but the picture is uniformly distributed.
My target is to get the placement as uniform as possible (the objects are still placed with fixed distances but may allow some occupations). Is there a potential algorithm for this question? Or are there any similar questions?
I have some immature thinkings on this problem that: 1. occupation may relate to least common multiple; 2. how to define "uniformly distributed" mathematically? Maybe there's no genetic solution but is there a solution for some special cases? (for example, 3 objects with distance of multiple of 2, or multiple of 3?)
Uniformity can be measured as sum of squared inverse distances(or distances to equilibrium distances). Because it has squared relation, any single piece that approaches others will have big fitness penalty in system so that the system will not tolerate too close piece and prefer a better distribution.
If you do not use squared (or higher orders) distance but simple distance, then system starts tolerating even overlapped pieces.
If you want to manually compute uniformity, then compute the standard deviation of distances. You'd say its perfect with 1 distance and 0 deviation but small enough deviation also acceptable.
I tested this only on a problem to fit 106 circles in a square thats 10x the size of circle.
I've been working for some time in an XNA roguelike game and I can't get my head around the following problem: developing an algorithm to divide a matrix of non-binary values into the fewest rectangles grouping these values.
Example: given the following matrix
01234567
0 ---##*##
1 ---##*##
2 --------
The algorithm should return:
3x3 rectangle of '-'s starting at (0,0)
2x2 rectangle of '#'s starting at (3, 0)
1x2 rectangle of '*'s starting at (5, 0)
2x2 rectangle of '#'s starting at (6, 0)
5x1 rectangle of '-'s starting at (3, 2)
Why am I doing this: I've gotten a pretty big dungeon type with a size of approximately 500x500. If I were to individually call the "Draw" method for each tile's Sprite, my FPS would be far too low. It is possible to optimize this process by grouping similar-textured tiles and applying texture repetition to them, which would dramatically decrease the amount of GPU draw calls for that. For example, if my map were the previous matrix, instead of calling draw 16 times, I'd call it only 5 times.
I've looked at some algorithms which can give you the biggest rectangle of a type inside a given binary matrix, but that doesn't fit my problem.
Thanks in advance!
You can use breadth first searches to separate each area of different tile type.
Picking a partitioning within the individual shapes is an NP-hard problem (see https://en.wikipedia.org/wiki/Graph_partition), so you can't find an efficient solution that guarantees the minimum number of rectangles. However if you don't mind an extra rectangle or two for each shape and your shapes are relatively small, you can come up with algorithms that split the shape into a number of rectangles close to the minimum.
An off the top of my head guess for something that could potentially work would be to pick a tile with the maximum connecting tiles and start growing a rectangle from it using a recursive algorithm to maximize the size. Remove the resulting rectangle from the shape, then repeat until there are no more tiles not included in a rectangle. Again, this won't produce perfect results, there are graphs on which this will return with more than the minimum amount of rectangles, but it's an easy to implement ballpark solution. With a little more effort I'm sure you will be able to find better heuristics to use and get better results too.
One possible building block is a routine to check, given two points, whether the rectangle formed by using those points as opposite corners is all of the same type. I think that a fast (but unreliable) means of testing this can be based on mapping each type to a large random number, and then working out the sum of the numbers within a rectangle modulo a large prime. Take one of the numbers within the rectangle. If the sum of the numbers within the rectangle is the size of the rectangle times the one number sampled, assume that the all of the numbers in the rectangle are the same.
In one dimension we can work out all of the cumulative sums a, a+b, a+b+c, a+b+c+d,... in time O(N) and then, for any two points, work out the sum for the interval between them by subtracting cumulative sums: b+c+d = a+b+c+d - a. In two dimensions, we can use cumulative sums to work out, for each point, the sum of all of the numbers from positions which have x and y co-ordinates no greater than the (x, y) coordinate of that position. For any rectangle we can work out the sum of the numbers within that rectangle by working out A-B-C+D where A,B,C,D are two-dimensional cumulative sums.
So with pre-processing O(N) we can work out a table which allows us to compute the sum of the numbers within a rectangle specified by its opposite corners in time O(1). Unless we are very unlucky, checking this sum against the size of the rectangle times a number extracted from within the rectangle will tell us whether the rectangle is all of the same type.
Based on this, repeatedly start with a random point not covered. Take a point just to its left and move that point left as long as the interval between the two points is of the same type. Then move that point up as long as the rectangle formed by the two points is of the same type. Now move the first point to the right and down as long as the rectangle formed by the two points is of the same type. Now you think you have a large rectangle covering the original point. Check it. In the unlikely event that it is not all of the same type, add that rectangle to a list of "fooled me" rectangles you can check against in future and try again. If it is all of the same type, count that as one extracted rectangle and mark all of the points in it as covered. Continue until all points are covered.
This is a greedy algorithm that makes no attempt at producing the optimal solution, but it should be reasonably fast - the most expensive part is checking that the rectangle really is all of the same type, and - assuming you pass that test - each cell checked is also a cell covered so the total cost for the whole process should be O(N) where N is the number of cells of input data.
Suppose we have n points in a bounded region of the plane. The problem is to divide it in 4 regions (with a horizontal and a vertical line) such that the sum of a metric in each region is minimized.
The metric can be for example, the sum of the distances between the points in each region ; or any other measure about the spreadness of the points. See the figure below.
I don't know if any clustering algorithm might help me tackle this problem, or if for instance it can be formulated as a simple optimization problem. Where the decision variables are the "axes".
I believe this can be formulated as a MIP (Mixed Integer Programming) problem.
Lets introduce 4 quadrants A,B,C,D. A is right,upper, B is right,lower, etc. Then define a binary variable
delta(i,k) = 1 if point i is in quadrant k
0 otherwise
and continuous variables
Lx, Ly : coordinates of the lines
Obviously we have:
sum(k, delta(i,k)) = 1
xlo <= Lx <= xup
ylo <= Ly <= yup
where xlo,xup are the minimum and maximum x-coordinate. Next we need to implement implications like:
delta(i,'A') = 1 ==> x(i)>=Lx and y(i)>=Ly
delta(i,'B') = 1 ==> x(i)>=Lx and y(i)<=Ly
delta(i,'C') = 1 ==> x(i)<=Lx and y(i)<=Ly
delta(i,'D') = 1 ==> x(i)<=Lx and y(i)>=Ly
These can be handled by so-called indicator constraints or written as linear inequalities, e.g.
x(i) <= Lx + (delta(i,'A')+delta(i,'B'))*(xup-xlo)
Similar for the others. Finally the objective is
min sum((i,j,k), delta(i,k)*delta(j,k)*d(i,j))
where d(i,j) is the distance between points i and j. This objective can be linearized as well.
After applying a few tricks, I could prove global optimality for 100 random points in about 40 seconds using Cplex. This approach is not really suited for large datasets (the computation time quickly increases when the number of points becomes large).
I suspect this cannot be shoe-horned into a convex problem. Also I am not sure this objective is really what you want. It will try to make all clusters about the same size (adding a point to a large cluster introduces lots of distances to be added to the objective; adding a point to a small cluster is cheap). May be an average distance for each cluster is a better measure (but that makes the linearization more difficult).
Note - probably incorrect. I will try and add another answer
The one dimensional version of minimising sums of squares of differences is convex. If you start with the line at the far left and move it to the right, each point crossed by the line stops accumulating differences with the points to its right and starts accumulating differences to the points to its left. As you follow this the differences to the left increase and the differences to the right decrease, so you get a monotonic decrease, possibly a single point that can be on either side of the line, and then a monotonic increase.
I believe that the one dimensional problem of clustering points on a line is convex, but I no longer believe that the problem of drawing a single vertical line in the best position is convex. I worry about sets of points that vary in y co-ordinate so that the left hand points are mostly high up, the right hand points are mostly low down, and the intermediate points alternate between high up and low down. If this is not convex, the part of the answer that tries to extend to two dimensions fails.
So for the one dimensional version of the problem you can pick any point and work out in time O(n) whether that point should be to the left or right of the best dividing line. So by binary chop you can find the best line in time O(n log n).
I don't know whether the two dimensional version is convex or not but you can try all possible positions for the horizontal line and, for each position, solve for the position of the vertical line using a similar approach as for the one dimensional problem (now you have the sum of two convex functions to worry about, but this is still convex, so that's OK). Therefore you solve at most O(n) one-dimensional problems, giving cost O(n^2 log n).
If the points aren't very strangely distributed, I would expect that you could save a lot of time by using the solution of the one dimensional problem at the previous iteration as a first estimate of the position of solution for the next iteration. Given a starting point x, you find out if this is to the left or right of the solution. If it is to the left of the solution, go 1, 2, 4, 8... steps away to find a point to the right of the solution and then run binary chop. Hopefully this two-stage chop is faster than starting a binary chop of the whole array from scratch.
Here's another attempt. Lay out a grid so that, except in the case of ties, each point is the only point in its column and the only point in its row. Assuming no ties in any direction, this grid has N rows, N columns, and N^2 cells. If there are ties the grid is smaller, which makes life easier.
Separating the cells with a horizontal and vertical line is pretty much picking out a cell of the grid and saying that cell is the cell just above and just to the right of where the lines cross, so there are roughly O(N^2) possible such divisions, and we can calculate the metric for each such division. I claim that when the metric is the sum of the squares of distances between points in a cluster the cost of this is pretty much a constant factor in an O(N^2) problem, so the whole cost of checking every possibility is O(N^2).
The metric within a rectangle formed by the dividing lines is SUM_i,j[ (X_i - X_j)^2 + (Y_i-Y_j)^2]. We can calculate the X contributions and the Y contributions separately. If you do some algebra (which is easier if you first subtract a constant so that everything sums to zero) you will find that the metric contribution from a co-ordinate is linear in the variance of that co-ordinate. So we want to calculate the variances of the X and Y co-ordinates within the rectangles formed by each division. https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance gives us an identity which tells us that we can work out the variance given SUM_i Xi and SUM_i Xi^2 for each rectangle (and the corresponding information for the y co-ordinate). This calculation can be inaccurate due to floating point rounding error, but I am going to ignore that here.
Given a value associated with each cell of a grid, we want to make it easy to work out the sum of those values within rectangles. We can create partial sums along each row, transforming 0 1 2 3 4 5 into 0 1 3 6 10 15, so that each cell in a row contains the sum of all the cells to its left and itself. If we take these values and do partial sums up each column, we have just worked out, for each cell, the sum of the rectangle whose top right corner lies in that cell and which extends to the bottom and left sides of the grid. These calculated values at the far right column give us the sum for all the cells on the same level as that cell and below it. If we subtract off the rectangles we know how to calculate we can find the value of a rectangle which lies at the right hand side of the grid and the bottom of the grid. Similar subtractions allow us to work out first the value of the rectangles to the left and right of any vertical line we choose, and then to complete our set of four rectangles formed by two lines crossing by any cell in the grid. The expensive part of this is working out the partial sums, but we only have to do that once, and it costs only O(N^2). The subtractions and lookups used to work out any particular metric have only a constant cost. We have to do one for each of O(N^2) cells, but that is still only O(N^2).
(So we can find the best clustering in O(N^2) time by working out the metrics associated with all possible clusterings in O(N^2) time and choosing the best).
Related: Is there a simple algorithm for calculating the maximum inscribed circle into a convex polygon?
I'm writing a graphics program whose goals are artistic rather than mathematical. It composes a picture step by step, using geometric primitives such as line segments or arcs of small angle. As it goes, it looks for open areas to fill in with more detail; as the available open areas get smaller, the detail gets finer, so it's loosely fractal.
At a given step, in order to decide what to do next, we want to find out: where is the largest circular area that's still free of existing geometric primitives?
Some constraints of the problem
It does not need to be exact. A close-enough answer is fine.
Imprecision should err on the conservative side: an almost-maximal circle is acceptable, but a circle that's not quite empty isn't acceptable.
CPU efficiency is a priority, because it will be called often.
The program will run in a browser, so memory efficiency is a priority too.
I'll have to set a limit on level of detail, constrained presumably by memory space.
We can keep track of the primitives already drawn in any way desired, e.g. a spatial index. Exactness of these is not required; e.g. storing bounding boxes instead of arcs would be OK. However the more precision we have, the better, because it will allow the program to draw to a higher level of detail. But, given that the number of primitives can increase exponentially with the level of detail, we'd like storage of past detail not to increase linearly with the number of primitives.
To summarize the order of priorities
Memory efficiency
CPU efficiency
Precision
P.S.
I framed this question in terms of circles, but if it's easier to find the largest clear golden rectangle (or golden ellipse), that would work too.
P.P.S.
This image gives some idea of what I'm trying to achieve. Here is the start of a tendril-drawing program, in which decisions about where to sprout a tendril, and how big, are made without regard to remaining open space. But now we want to know, where is there room to draw a tendril next, and how big? And where after that?
One very efficient way would be to recursively divide your area into rectangular sub-areas, splitting them when necessary to divide occupied areas from unoccupied areas. Then you would simply need to keep track of the largest unoccupied area at each time. See https://en.wikipedia.org/wiki/Quadtree - but you needn't split into squares.
Given any rectangle, you can draw a line inside it, so that at least one of the rectangles to either side of the line is a golden rectangle. Therefore you can recursively erect partitions within a rectangle so that all but one of the rectangles formed by the partitions are golden rectangles, and the add rectangle left over is vanishingly small. You could do this to create a quadtree-like structure, where almost all of the rectangles left over were golden rectangles.
This seems like the kind of situation where a randomized algorithm might be helpful. Choose points at random, reject and choose more if they're inappropriate for some reason, then find the min distance from your choices to each of the figures already included. The random point with the max of the mins would be your choice.
The number of sample points might have to increase as the complexity of the figure increases.
The random algorithm could be improved by checking points nearby good choices. Keep checking neighbors until no more improvement is possible.
Here's a simple way that uses a fixed amount of memory and time per update, regardless of how many drawing primitives you use. How much memory (and time per update) is needed can be controlled according to how high a "resolution" you need:
Divide the space up into a grid of points. We will maintain a 2D array, d[], which records the minimum distance from the grid point (x, y) to any already-drawn primitive in the entry d[x, y]. Initially, set every element in this array to infinity (or some huge number).
Whenever you draw some primitive, iterate over all grid points (x, y) calculating the minimum distance (or some conservative approximation to it) from (x, y) to the just-drawn primitive. E.g., if the primitive just drawn was a circle of radius r centered at (p, q), then this distance would be sqrt((x-p)^2 + (y-q)^2) - r. Then update d[x, y] with this new distance value if it is smaller than its current value.
The grid point at which the largest circle can be drawn without touching any already-drawn primitive is the grid point that is the farthest away from any primitive drawn so far. To find it, simply scan through d[] to find its maximum value, and note the corresponding indices (x, y). d[x, y] will be the maximum radius you could safely use for this circle.
Repeat steps 2 and 3 as necessary.
A couple of points:
For primitives that have area, you can assign 0 or a negative value to all d[x, y] corresponding to grid points inside the primitive.
For any convex primitive, you can often avoid updating most of the d[] array by scanning rows (or columns) "outward" from the just-drawn primitive's border: the distance from the current grid point to the primitive will never decrease, so if this distance becomes larger than the previous maximum value in d[] then we know that we can stop scanning this row, because no further distance value that we would compute on it could possibly be less than an existing distance on it.
I am reading a bit about the means shift clustering algorithm (http://en.wikipedia.org/wiki/Mean_shift) and this is what i got so far. For each point in your data set : select all points within a certain distance of it (including the original point), calculate the mean for all these points, repeat until these means stabilize.
What I'm confused about is how does one go from here in deciding what the final clusters are , and on what conditions do these means merge. Also, does the distance used to select the points fluctuate through the iterations or does it remain constant?
Thanks in advance
The mean shift cluster finding is a simple iterative process which is actually guaranteed to converge. The iteration starts from a starting point x, and the iteration steps are (note that x may have several components, as the algorithm will work in higher dimensions, as well):
calculate the weighted mean position x' of all points around x - maybe the simplest form is to calculate the average of positions of all points within d distance from x, but the gaussian function is also commonly used and mathematically beneficial.
set x <- x'
repeat until the difference between x and x' is very small
This can be used in cluster analysis by starting with different values of x. The final values will end up at different cluster centers. The number of clusters cannot be known (other than it is <= number of points).
The upper level algorithm is:
go through a selection of starting values
for each value, calculate the convergence value as shown above
if the value is not already in the list of convergence values, add it to the list (allow some reasonable tolerance for numerical imprecision)
And then you have the list of clusters. The only difficult thing is finding a reasonable selection of starting values. It is easy with one or two dimensions, but with higher dimensionalities exhaustive searches are not quite possible.
All starting points, which end up into the same mode (point of convergence) belong to the same cluster.
It may be of interest that if you are doing this on a 2D image, it should be sufficient to calculate the gradient (i.e. the first iteration) for each pixel. This is a fast operation with common convolution techniques, and then it is relatively easy to group the pixels into clusters.