Abstract problem. Imagine the world is a cube, made of multiple cubical cells along all the dimensions of the cube.
Now, imagine you are able to rent certain volumes for certain periods of time, for example: you rent a 3x3x3 volume with coordinates [1, 1, 1] to [3, 3, 3] for year 2012. Then you rent a 2x2x2 volume with coordinates [4, 1, 1] to [5, 2, 2] for year 2012.
Now, imagine you are able to let out volumes that you have rented, during the periods for which you have rented them. For example, having rented the volumes as defined above, you let out a 5x2x1 cell volume with coordinates [1, 1, 1] to [5, 2, 1] for Q1'2012. Then you let out cell [5, 2, 2] for the whole year 2012.
You can rent the same volumes in multiple "rental contracts", and let them out, also in multiple "contracts".
The question is - what data structures and algorithms can be used to answer questions like:
When can I let out a certain cell?
What cells can I let out in a certain period?
Can I let out cells of certain coordinates, not including all the dimensions (e.g.: someone wants to rent any cells that have coordinate X between 2 and 4 for year 2012)?
A brute force approach (try every combination to check) is out of the question. The data set I need this to work on is 5-dimensional (with more dimensions potentially coming soon), and each dimension is 100-200 cells long on average.
If you treat time as just another dimension then what you describe looks like the sort of queries you might expect to want to pose about any collection of objects in n-dimensional space.
That suggests to me something like http://en.wikipedia.org/wiki/K-d_tree or possibly some n-dimensional version of http://en.wikipedia.org/wiki/Octree. The catch, of course, is that these data structures run out of steam as the number of dimensions increases.
You rule out the brute force approach of checking every cell. Are you also forced to rule out the brute force approach of checking each query against each known object in the n-dimensional space? Since everything seems to be an axis-aligned n-dimensional rectangle, checking for the intersection of a query with an object may not be hard - and this may be what you will get anyway if you try to duck the problem by throwing a database query package or some very high-level language at it: a full table scan.
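To get a feel for the cost of that, here is a minimal sketch (in Python, with a made-up Box type rather than your actual data model) of the per-object test; checking one query against one axis-aligned n-dimensional rectangle is only O(n):

class Box:
    # lo and hi are tuples holding the inclusive per-dimension bounds.
    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def intersects(self, other):
        # Axis-aligned boxes intersect iff their ranges overlap in every dimension.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

def overlapping(query, boxes):
    # Linear in the number of rented/let-out volumes, not in the number of cells.
    return [b for b in boxes if query.intersects(b)]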
As mcdowella points out, octrees and k-d trees lose efficiency as the number of dimensions increases beyond about 4 or 5. Since you haven't said what the dimensions are, I'm going to assume they are properties of the objects you are talking about. Just put them in an RDBMS and use indexes on these fields. A good implementation can have good performance doing a query against multiply-indexed items.
If your dimensions have binary values (or small enums) then something else will likely be better.
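To illustrate the RDBMS route, here is a sketch using SQLite with invented table and column names (three of the spatial dimensions plus the rental period, one row per rented volume):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE rented (
    x_lo INTEGER, x_hi INTEGER,
    y_lo INTEGER, y_hi INTEGER,
    z_lo INTEGER, z_hi INTEGER,
    t_lo TEXT,    t_hi TEXT)""")
conn.execute("CREATE INDEX rented_x ON rented (x_lo, x_hi)")
conn.execute("CREATE INDEX rented_t ON rented (t_lo, t_hi)")

conn.execute("INSERT INTO rented VALUES (1, 3, 1, 3, 1, 3, '2012-01-01', '2012-12-31')")

# Volumes whose x-range overlaps [2, 4] and that cover all of Q1 2012:
rows = conn.execute("""SELECT * FROM rented
                       WHERE x_lo <= 4 AND x_hi >= 2
                         AND t_lo <= '2012-01-01' AND t_hi >= '2012-03-31'""").fetchall()
print(rows)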
Given an array such as
[1, 4, 6, 1, 10, 3, 24, 1]
and a mutation rate of, let's say, 0.2, would I:
always mutate 20% of my array entries,
mutate 0-20% of the entries, or
iterate over the array and mutate each entry 20% of the time?
I am unclear from the literature how this is handled - or if there is even an agreed-upon standard.
Note - I am a coder dabbling in GAs, so bear with my lack of depth in GA knowledge.
Thanks
I was unsure about that too when I started to learn about genetic algorithms. I decided it's best to give each gene an x% chance of being mutated (completely changed). In your case I would iterate over the array and, whenever Math.random() is smaller than 0.2, set the current number to a new random one.
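A minimal sketch of that per-gene approach in Python (the replacement range 1-24 is just an arbitrary choice to roughly match your example array):

import random

def mutate(genes, rate=0.2, low=1, high=24):
    # Each gene independently has a `rate` chance of being replaced by a new
    # random value, so on average rate * len(genes) genes change per individual,
    # but the exact number varies.
    return [random.randint(low, high) if random.random() < rate else g
            for g in genes]

print(mutate([1, 4, 6, 1, 10, 3, 24, 1]))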
If you find that you don't get enough diversity you can also add one or two completely random individuals (I like to call them 'foreigners' since they don't have any common ancestors).
I have a matrix of real numbers and I would like to find a partition of this matrix such that both the number of parts and the variance of the numbers in each part are minimized. Intuitively, I want as few parts as possible, but I also want all the numbers within any given part to be close together.
More formally, I suppose for the latter I would find, for each part, the variance of the numbers in that part, and then take the average of those variances over all the parts. This would be one component of the "score" for a given solution; the other component would be, for instance, the total number of elements in the matrix minus the number of parts in the partition, so that fewer parts lead to a higher score. The final score for the solution would be a weighted average of the two components, and the best solution is the one with the highest score.
Obviously a lot of this is heuristic: I need to decide how to balance the number of parts versus the variances. But I'm stuck for even a general approach to the problem.
For instance, given the following simple matrix:
10, 11, 12, 20, 21
8, 13, 9, 22, 23
25, 23, 24, 26, 27
It would be a reasonable solution to partition into the following submatrices:
10, 11, 12 | 20, 21
 8, 13,  9 | 22, 23
-----------+--------
25, 23, 24 | 26, 27
Partitioning is only allowed by slicing vertically and horizontally.
Note that I don't need the optimal solution, I just need an approach to get a "good" solution. Also, these matrices are several hundred by several hundred, so brute forcing it is probably not a reasonable solution, unless someone can propose a good way to pare down the search space.
I think you'd be better off by starting with a simpler problem. Let's call this
Problem A: given a fixed number of vertical and/or horizontal partitions, where should they go to minimize the sum of variances (or perhaps some other measure of variation, such as the sum of ranges within each block).
I'd suggest using a dynamic programming formulation for problem A.
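To make the recurrence concrete, here is a sketch of the 1-D analogue in Python (partition one row into a fixed number of contiguous blocks so that the summed within-block variance is minimal); the real problem A adds the choice of horizontal cuts, but the structure of the dynamic program is the same:

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def best_partition(row, blocks):
    # dp[k][i] = minimal summed variance when the first i values form k blocks,
    # cut[k][i] = where the last of those k blocks starts in that optimum.
    n, INF = len(row), float("inf")
    dp = [[INF] * (n + 1) for _ in range(blocks + 1)]
    cut = [[0] * (n + 1) for _ in range(blocks + 1)]
    dp[0][0] = 0.0
    for k in range(1, blocks + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = dp[k - 1][j] + variance(row[j:i])
                if cost < dp[k][i]:
                    dp[k][i], cut[k][i] = cost, j
    bounds, i = [], n          # walk the cut table back to recover block boundaries
    for k in range(blocks, 0, -1):
        bounds.append(i)
        i = cut[k][i]
    return dp[blocks][n], sorted(bounds)

print(best_partition([10, 11, 12, 20, 21], 2))   # cuts the row after its third value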
Once you have that under control, then you can deal with
Problem B: find the best trade-off between variation and the number of vertical and horizontal partitions.
Obviously, you can reduce the variance to 0 by putting each element into its own block. In general, problem B requires you to solve problem A for each choice of vertical and horizontal partition counts that is considered.
To use a dynamic programming approach for problem B, you would have to formulate an objective function that encapsulates the trade-off you seek. I'm not sure how feasible this is, so I'd suggest looking for different approaches.
As it stands, problem B is a 2D problem. You might find some success looking at 2D clustering algorithms. An alternative might be possible if it can be reformulated as a 1D problem: trading off variation with the number of blocks (instead of the vertical and horizontal partition counts). Then you could use something like the Jenks natural breaks classification method to decide where to draw the line(s).
Anyway, this answer clearly doesn't give you a working algorithm. But I hope that it does at least provide an approach (which is all you asked for :)).
First of all, I'm a ruby newbie, so I ask that you have patience with me. :)
Second of all, before you read the request and think I'm trying to get easy answers, trust me, I've spent the last 7 days searching for them online, but haven't found any that answered my very specific question. And third, sorry about the long description, but the help I need is to be pointed in the right direction.
I had an idea for a small class project about genetic drift. In population genetics, a probability matrix is used to give the probability that the frequency of an allele will change from i to j in generation t1 to generation t2.
So, say I start with one copy of allele B in t1 and want to know the probability of it going to three copies in t2. The value (as given by the binomial distribution, for which I have already written a small piece of code that works nicely) then would go in the cell corresponding to column 1, row 3 (perhaps this can clarify things better: https://docs.google.com/viewer?url=http%3A%2F%2Fsamples.jbpub.com%2F9780763757373%2F57373_CH04_FINAL.pdf).
What I don't know how to do and would like to get information on is:
how do I make a square matrix where the number of rows/columns is determined by the user (say someone wants to get the probability matrix for a population of 4, which has 8 allele copies, but someone else wants to get the probability matrix for a population of 100, which has 200 allele copies?)
how do I apply the binomial distribution equation to the values of each one of the different column/row combinations (i.e. in the cell corresponding to column 1, row 3, the value would be determined by the binomial equation with variables 1 and 3; and in the cell corresponding to column 4, row 7, the value would be determined by the binomial equation with variables 4 and 7). The number of different combinations of variables (like 1 and 1, 1 and 2, 1 and 3, etc) is determined by the number of columns/rows set by the user.
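(For reference, assuming the standard Wright-Fisher model with 2N allele copies, the value described here for column i, row j is the binomial probability

P(i \to j) = \binom{2N}{j}\left(\frac{i}{2N}\right)^{j}\left(1 - \frac{i}{2N}\right)^{2N - j}

so every cell is that formula evaluated at its own (i, j) pair.)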
I'm not asking anyone to give me the code or do my work for me, what I'm asking is for you, seasoned programmers, to point me in the direction of the correct answers, since I've so miserably failed in finding this direction. Should I be looking into arrays instead of matrices? Should I be looking into specific iterators? Which? Does anyone have more specific material I could look into, or could give me tips based on experience with creating matrices? I really want to learn ruby, and learn how to do this, not just get it done.
For generating a matrix, you might wish to take a look at the Matrix class: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/matrix/rdoc/Matrix.html which has a method Matrix.build(row_size, column_size) that looks like a good fit for your problem. It even takes a block that you can use to generate values:
require 'matrix'

# binomial_function is the binomial-probability code you have already written
Matrix.build(5, 5) do |row, col|
  binomial_function(row, col)
end
Obviously you will need to write the binomial function too - seems like you may have done that already?
How to make the number of rows/columns a user choice depends on how you want end users to run your code. You should probably make that clear in another question; there are some differences in approach between a web site and a command-line script.
I have m points which I wish to distribute uniformly in n-dimensional space. By "uniformly" I mean that all shortest-distance pairs have similar values.
In other words, I would like the points to fill the space as evenly as possible.
Please, does anyone know how to achieve this? Does this problem have a name?
Edit:
For example, when I have 4 points and a 2D plane, the coordinates should be [0, 1], [1, 0], [0, -1], [-1, 0] - just a square. For 3D it's a cube. But I'm not sure what to do when the point count is not 2^n.
Another way of thinking about it is to consider the points to be charged particles which repel each other. But it's very slow to run such a simulation...
I believe you might be interested in low discrepancy sequences. These are used as a deterministic analog to the uniform distribution described in n.m.'s comment. They're often used in so-called "quasi-Monte Carlo" algorithms, where instead of sampling randomly one uses some kind of grid of points distributed more or less evenly over the domain.
Such sequences of points do not necessarily satisfy the condition you gave that "all shortest-distance-pairs have similar values," but I interpreted this more as an attempt at description rather than a hard requirement of the problem. If it's really important then this likely does not solve your problem.
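For illustration, here is a minimal sketch of one such sequence, the Halton sequence, in plain Python; its first m points are spread fairly evenly over the unit hypercube, and you can scale/translate them to your actual domain:

def halton(index, base):
    # Radical inverse of `index` in the given base: reverse its digits
    # behind the decimal point (van der Corput construction).
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def halton_points(m, n):
    # One prime base per dimension.
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    assert n <= len(primes), "add more primes for higher dimensions"
    return [[halton(i, primes[d]) for d in range(n)] for i in range(1, m + 1)]

for p in halton_points(8, 2):
    print(p)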
I think you probably want to look into Sphere Packing.
Here's another idea (it's not perfect, but I don't think anything here is, and you may need to choose based on the details of your particular case): use binary space partitioning (more info here).
The general idea is that you take your n-dimensional space and split it into two using an (n-1)-dimensional surface. Then you split those two new spaces, and so on. If you choose your surfaces carefully (so that they divide the space into approximately equal volumes and avoid funny shapes, for some definition of funny) then you can see that this will be an approximation to what you're asking.
The main advantage of this approach is that it's typically very fast (it's used in video games and spatial simulations). It's not going to be as fast (or as uniform) as low discrepancy sequences (which sound really cool), but I imagine it would work inside arbitrary convex hulls.
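A rough sketch of that idea in Python (not a real BSP library; it just keeps halving the box with the largest point quota along its widest axis, then drops one point at the centre of each leaf box):

def spread_points(m, lo, hi):
    # lo, hi: per-dimension bounds of an axis-aligned box; returns m points inside it.
    boxes = [(m, list(lo), list(hi))]            # (quota, box_lo, box_hi)
    while any(quota > 1 for quota, _, _ in boxes):
        quota, blo, bhi = max(boxes, key=lambda b: b[0])
        boxes.remove((quota, blo, bhi))
        axis = max(range(len(blo)), key=lambda d: bhi[d] - blo[d])
        mid = (blo[axis] + bhi[axis]) / 2.0
        left_hi, right_lo = list(bhi), list(blo)
        left_hi[axis], right_lo[axis] = mid, mid
        boxes.append((quota // 2, blo, left_hi))
        boxes.append((quota - quota // 2, right_lo, bhi))
    return [[(l + h) / 2.0 for l, h in zip(blo, bhi)] for _, blo, bhi in boxes]

print(spread_points(4, [-1.0, -1.0], [1.0, 1.0]))   # the four cell centres of a 2x2 split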
I'm trying to separate a grey-level image based on pixel value: say pixels from 0 to 60 in one bin, 60-120 in another, 120-180 ... and so on up to 255. The ranges are roughly equispaced in this case.
However, by using K-means clustering, would it be possible to get more realistic measures of what my pixel-value ranges should be? I'm trying to group similar pixels together and not waste bins where there is only a low concentration of pixels.
EDITS (to include obtained results):
k-means with number of clusters = 5
Of course K-Means can be used for color quantization. It's very handy for that.
Let's see an example in Mathematica:
We start with a greyscale (150x150) image:
Let's see how many grey levels there are when the image is represented in 8 bits:
ac = ImageData[ImageTake[i, All, All], "Byte"];
First@Dimensions@Tally@Flatten@ac
-> 234
OK. Let's reduce those 234 levels. Our first try will be to let the algorithm alone determine how many clusters there are, using the default configuration:
ic = ClusteringComponents[Image@ac];
First@Dimensions@Tally@Flatten@ic
-> 3
It selects 3 clusters, and the corresponding image is:
Now, whether that is OK or you need more clusters is up to you.
Let's suppose you decide that a more fine-grained color separation is needed. Let's ask for 6 clusters instead of 3:
ic2 = ClusteringComponents[Image@ac, 6];
Image@ic2 // ImageAdjust
Result:
and here are the pixel ranges used in each bin:
Table[{Min@#, Max@#} &@(Take[orig, {#[[1]]}, {#[[2]]}] & /@
Position[clus, n]), {n, 1, 6}]
-> {{0, 11}, {12, 30}, {31, 52}, {53, 85}, {86, 134}, {135, 241}}
and the number of pixels in each bin:
Table[Count[Flatten@clus, i], {i, 6}]
-> {8906, 4400, 4261, 2850, 1363, 720}
So, the answer is YES, and it is straightforward.
Edit
Perhaps this will help you understand what you are doing wrong in your new example.
If I clusterize your color image, and use the cluster number to represent brightness, I get:
That's because the clusters are not being numbered in an ascending brightness order.
But if I calculate the mean brightness value for each cluster, and use it to represent the cluster value, I get:
In my previous example, that was not needed, but that was just luck :D (i.e. clusters were found in ascending brightness order)
K-means could be applied to your problem. If it were me, though, I would first try a simpler approach borrowed from decision trees (although whether it is really "simpler" depends upon your precise clustering algorithm!).
Assume one bin exists and begin stuffing the pixel intensities into it. When the bin is "full enough", compute the mean and standard deviation of the bin (or node). If the standard deviation is greater than some threshold, split the node in half. Continue this process until all intensities have been processed, and you will have a more efficient histogram.
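A sketch of that splitting idea in Python (how a value is routed to a bin, the "full enough" size, and the spread threshold are all arbitrary choices here):

import bisect
from statistics import mean, stdev

def adaptive_bins(intensities, full=1000, threshold=15.0):
    # Start with a single bin; route each intensity to its bin via the cut
    # points, and when a bin is full enough and too spread out, split it at its mean.
    cuts = []          # sorted interior cut points between bins
    members = [[]]     # intensities collected by each bin
    for v in intensities:
        i = bisect.bisect_right(cuts, v)
        members[i].append(v)
        b = members[i]
        if len(b) >= full and stdev(b) > threshold:
            m = mean(b)
            cuts.insert(i, m)
            members[i:i + 1] = [[x for x in b if x <= m], [x for x in b if x > m]]
    return cuts        # the learned bin boundaries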
This method can be improved with additional details of course:
You might consider using kurtosis as a splitting criteria.
Skewness might be used to determine where the split occurs
You might cross all the way into decision-tree land and borrow the Gini index to guide splitting (some split techniques rely on more "exotic" statistics, like the t-test).
Lastly, you might perform a final consolidation pass to collapse any sparsely populated nodes.
Of course, if you've applied all of the above "improvements", then you've basically implemented one variation of a k-means clustering algorithm ;-)
Note: I disagree with the comment above - the problem you describe does not appear closely related to histogram equalization.