Validating fractal dimension computation in Mathematica

I've written an implementation of the standard box-counting algorithm for determining the fractal dimension of an image or a set in Mathematica, and I'm trying to validate it. I've generated a Sierpinski triangle matrix using the CellularAutomaton function, and computed its fractal dimension to be 1.58496 with a statistical error of about 10^-15. This matches the expected value of log(3)/log(2) = 1.58496 incredibly well.
The problem arises when I try to test my algorithm against a randomly-generated matrix. The fractal dimension in this case should be exactly 2, but I get about 1.994, with a statistical error of about 0.004. Hence, my box-counting algorithm seems to work perfectly fine for the Sierpinski triangle, but not quite so well for the random distribution. Any ideas why not?
Code below:
sierpinski512 = CellularAutomaton[90, {{1}, 0}, 512];
ArrayPlot[sierpinski512]
d512 = FractalDimension[sierpinski512, {512, 256, 128, 64, 32, 16, 8, 4, 2}]
rtable = Table[Round[RandomReal[]], {i, 1, 512}, {j, 1, 1024}];
ArrayPlot[rtable]
drand = FractalDimension[rtable, {512, 256, 128, 64, 32, 16, 8, 4, 2}]
I can post the FractalDimension code if anybody really needs it, but I think the problem (if there is one) lies not with the FractalDimension algorithm but with the rtable I'm generating above.

I have studied this problem a little, empirically and in consultation with a well-known physicist, and we believe that the fractal dimension of a random point process tends to 2 as the number of points grows large (in the limit, I think). I can't give an exact definition of "large", but it can't be less than a few thousand points. So I think you should expect to get D < 2 unless the number of points is quite large; theoretically, large enough to tile the plane.
I would be grateful for your FractalDimension code!
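For anyone who wants to reproduce the comparison without the OP's routine, here is a minimal box-counting sketch in Python with NumPy. This is an assumed implementation of the standard algorithm, not the OP's actual FractalDimension code:

import numpy as np

def box_counting_dimension(matrix, box_sizes):
    # For each box size s, count N(s), the number of s-by-s boxes that
    # contain at least one nonzero cell; the fitted slope of log N(s)
    # against log(1/s) estimates the fractal dimension.
    counts = []
    for s in box_sizes:
        rows = (matrix.shape[0] // s) * s  # trim so the array tiles evenly
        cols = (matrix.shape[1] // s) * s
        trimmed = matrix[:rows, :cols]
        # Collapse each s-by-s block to True if it holds any nonzero cell.
        occupied = trimmed.reshape(rows // s, s, cols // s, s).any(axis=(1, 3))
        counts.append(occupied.sum())
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(box_sizes)), np.log(counts), 1)
    return slope

Run on a finite random 0/1 matrix, an estimator like this lands slightly below 2, consistent with the ~1.994 the OP measures.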

Related

Finding thermocline using depth and temperature

temp = {23, 23, 23, 22, 20, 20, 19, 12, 11, 10, 10};
depth = {0, 1, 2, 3, 8, 9, 10, 12, 18, 23, 29};
I have two arrays as shown, and I need to find the thermocline using the following statement:
It is easy to see that the slope of the curve (i.e. dT/dh) is a maximum at the (very!) obvious thermocline.
Furthermore, because the curvature of the curve changes at the thermocline, it is a point of inflection, so by definition d2T/dh2 = 0.
Another way to look at it: to maximise the slope, i.e. the 1st derivative, the 2nd derivative must be equal to zero.
Please help!
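Since no answer is attached here, one minimal numerical reading of the statement, sketched in Python: take the thermocline to be the sampling interval where |dT/dh| is largest (on discrete data, the zero crossing of the second difference picks out the same interval).

import numpy as np

temp = np.array([23, 23, 23, 22, 20, 20, 19, 12, 11, 10, 10], dtype=float)
depth = np.array([0, 1, 2, 3, 8, 9, 10, 12, 18, 23, 29], dtype=float)

slopes = np.diff(temp) / np.diff(depth)  # finite-difference dT/dh per interval
i = np.argmax(np.abs(slopes))            # steepest interval ~ the thermocline
thermocline_depth = (depth[i] + depth[i + 1]) / 2
print(thermocline_depth, slopes[i])      # 11.0 and -3.5 for this data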

Efficient algorithm to fit a line along the upper boundary of data only

I'm currently trying to fit a straight line through scattered data in MATLAB. This is easy enough using the polyfit function, from which I can obtain my y = mx + c equation. However, I now need to fit a line along the upper boundary of my data, i.e., through the top few data points. I know this description is vague, so let's assume my scattered data is shaped like a cone with its apex on the y-axis, spreading outwards and upwards in the +x and +y directions. I need to fit a best-fit line along the 'upper edge of the cone', if you will.
I've developed an algorithm, but it's extremely slow. It first fits a line of best fit through ALL the data, deletes every data point below that line, and iterates until only 5% of the initial data points are left; the final best-fit line then lies close to the top edge of the cone. For 250 data points this takes about 5 s, and since I'm dealing with more than a million data points, the algorithm is simply too inefficient.
I guess my question is: is there an algorithm to more efficiently achieve what I need? Or is there a way to sharpen up my code to eliminate unnecessary complexity?
Here is my code in MATLAB (as an example):
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; % To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; % To be used as y-axis points
while prod(size(a)) > (0.05*prod(size(a))) % Iterative line fitting until fewer than 5% of the data points are left
    lobf = polyfit(a,b,1); % Line of best fit for the current data points
    alen = length(a);
    for aindex = alen:-1:1 % Delete all points below the line of best fit
        ValLoBF = lobf(1)*a(aindex) + lobf(2);
        if ValLoBF > b(aindex) % If the LoBF is above the current point...
            a(aindex) = []; % ...delete its x coordinate...
            b(aindex) = []; % ...and its corresponding y coordinate
        end
    end
end
Well, first of all, your example code seems to run indefinitely ;) (the while condition compares the current number of points against 5% of that same number, so it stays true as long as any points remain).
Some optimizations for your code:
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; % To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; % To be used as y-axis points
n_init_a = length(a);
while length(a) > 0.05*n_init_a % Iterate until fewer than 5% of the initial data points are left
    lobf = polyfit(a,b,1); % Line of best fit for the current data points
    % Delete data points below the line using logical indexing.
    % First compute the fitted values via element-wise vector arithmetic.
    temp = lobf(1)*a + lobf(2); % Fitted value at every x
    % Use logical indexing to discard all points below the line.
    a(b<temp) = []; % First prune a...
    b(b<temp) = []; % ...then b; pruning b first would corrupt the mask for a
end
Also, you should try profiling your code: type profile viewer in the command window and check what takes most of the time in calculating your results. I suspect it is polyfit, but that probably can't be sped up much.
What you are looking for is not line fitting. You are trying to find the convex hull of the points.
You should check out the function convhull. Once you find the hull, you can remove all of the points that aren't close to it, and fit each part independently to cope with the noise in the data.
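To illustrate the hull idea (a Python sketch under the cone-shaped-data assumption, not a drop-in replacement for the MATLAB code): build only the upper chain with Andrew's monotone-chain scan, then least-squares fit a line through those few points. The whole thing is O(n log n):

import numpy as np

def upper_hull(x, y):
    # Andrew's monotone chain, upper chain only: scan the points from
    # left to right and pop while the last turn is counter-clockwise
    # (cross product >= 0), since that leaves a point below the chain.
    pts = sorted(zip(x, y))
    hull = []
    for px, py in pts:
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            if (bx - ax) * (py - ay) - (by - ay) * (px - ax) >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull

a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]
b = [45, 53, 12, 76, 25, 67, 75, 98]
hx, hy = zip(*upper_hull(a, b))
m, c = np.polyfit(hx, hy, 1)  # line through the upper-chain points only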
Alternatively, you could render the points onto some pixel grid, then apply a morphological operation like imclose, and finish with a Hough transform. Also check out this answer.

How to compute the variances in Expectation Maximization with n dimensions?

I have been reviewing Expectation Maximization (EM) in research papers such as this one:
http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf
I have some doubts that I have not been able to resolve. For example, what happens if we have many dimensions for each datapoint?
For example I have the following dataset with 6 datapoints and 4 dimensions:
D1  D2  D3  D4
 5  19  72   5
 6  18  14   1
 7  22  29   4
 3  22  51   1
 2  21  89   2
 1  12  28   1
Does that mean that, for the expectation step, I need to compute 4 standard deviations (one per dimension)?
Do I also have to compute the variance for each cluster, assuming k=3 (I don't know whether it is necessary based on the formula from the paper...), or just the variances for each dimension (4 attributes)?
Usually, you use a covariance matrix, which also includes the variances.
But it really depends on your chosen model. The simplest model does not use variances at all.
A more complex model has a single variance value, the average variance over all dimensions.
Next, you can have a separate variance for each dimension independently; and last but not least a full covariance matrix. That is probably the most flexible GMM in popular use.
Depending on your implementation, there can be many more.
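As a concrete illustration of the "separate variance for each dimension" (diagonal) case, here is a NumPy sketch of the M-step updates. This is the standard responsibility-weighted GMM update, not necessarily the exact formula from the linked paper:

import numpy as np

def m_step_diagonal(X, resp):
    # X: (n, d) data; resp: (n, k) responsibilities from the E-step.
    # Returns per-cluster means (k, d) and per-cluster, per-dimension
    # variances (k, d) -- i.e. k * d variances in total.
    nk = resp.sum(axis=0)                              # soft count per cluster
    means = (resp.T @ X) / nk[:, None]                 # weighted means
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2   # (n, k, d) squared deviations
    variances = (resp[:, :, None] * diff2).sum(axis=0) / nk[:, None]
    return means, variances

X = np.array([[5, 19, 72, 5], [6, 18, 14, 1], [7, 22, 29, 4],
              [3, 22, 51, 1], [2, 21, 89, 2], [1, 12, 28, 1]], dtype=float)
resp = np.full((6, 3), 1/3)                  # dummy uniform responsibilities, k = 3
means, variances = m_step_diagonal(X, resp)  # variances.shape == (3, 4)

So with k = 3 clusters and 4 dimensions you maintain 3 x 4 = 12 variances in the diagonal model, and three full 4x4 covariance matrices in the full-covariance ("VVV") model listed below.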
From R's mclust documentation:
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = elliposidal multivariate normal

Find frequency for non-binned, weighted data

Here is a tricky problem (or at least I think so). I need to create a histogram, but instead of having the data and its frequency, I have repeated (i.e. not binned) data and a weight for each data point.
One example:
Angle | Weight
 90   |  3/10
 93   |  2/10
180   |  2/10
180   |  1/10
 95   |  2/10
I want to create a histogram with bin size 10. The y-values should be the sum of the weighted frequencies for the angles within each bin. How can I do it? Preferably Mathematica or pseudocode...
In Mathematica 9, you can do it using the WeightedData function like this:
Histogram[WeightedData[{90, 93, 180, 180, 95}, {3/10, 2/10, 2/10, 1/10, 2/10}], {10}]
You should then get a graphic of the weighted histogram.
Since the expected output is not forthcoming I shall adopt Verbeia's interpretation. You might use something like this:
dat = {{90, 3/10}, {93, 1/5}, {180, 1/5}, {180, 1/10}, {95, 1/5}};
bars =
  Reap[
    Sow[#2, Floor[#, 10]] & @@@ dat,
    _,
    {#, Tr@#2} &
  ][[2]]

Graphics[
  Rectangle[{#, 0}, {# + 10, #2}] & @@@ bars,
  AspectRatio -> 1/GoldenRatio,
  Axes -> True,
  AxesOrigin -> {Min@bars[[All, 1]], 0}
]
I did something similar for a different kind of question recently (weighting by balance sheet size).
Assuming your data is in an N x 2 matrix (a list of pairs), I would do something like:
{numbers,weights} = {data[[All,1]], data[[All,2]]*10};
weightednumbers = Flatten@ MapThread[
    Table[#1, {#2}] &, {numbers, Ceiling[weights]}];
And then use Histogram to draw the histogram on this transformed data.
There might be other ways but this works.
An important point is to make sure the weights are integers, so that Table has a correct iterator. This might require defining weights as data[[All,2]]/Min[data[[All,2]]].
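For comparison outside Mathematica: numpy.histogram takes a weights argument, so the same binning needs no replication trick. A small Python sketch with the question's data:

import numpy as np

angles = [90, 93, 180, 180, 95]
weights = [3/10, 2/10, 2/10, 1/10, 2/10]

edges = np.arange(90, 200, 10)  # bin edges 90, 100, ..., 190 (bin size 10)
freq, edges = np.histogram(angles, bins=edges, weights=weights)
# freq[0] == 0.7   (angles 90, 93, 95 fall in [90, 100))
# freq[-1] == 0.3  (both 180 readings fall in the last bin)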

An algorithm to solve a simple(?) array problem

For this problem speed is pretty crucial. I've drawn a nice image to explain the problem better. The algorithm needs to determine: if the edges of a rectangle were extended within the confines of the canvas, would they intersect another rectangle?
We know:
The size of the canvas
The size of each rectangle
The position of each rectangle
The faster the solution, the better! I'm pretty stuck on this one and don't really know where to start.
(Image: http://www.freeimagehosting.net/uploads/8a457f2925.gif)
Cheers
Just create the set of intervals for each of the X and the Y axis. Then for each new rectangle, see if there are intersecting intervals in the X or the Y axis. See here for one way of implementing the interval sets.
In your first example, the interval set on the horizontal axis would be { [0-8], [0-8], [9-10] }, and on the vertical: { [0-3], [4-6], [0-4] }
This is only a sketch; I have abstracted away many details here (e.g. usually one would ask an interval set/tree which intervals overlap a given one, instead of whether any intersect it, but nothing that isn't doable).
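A minimal sketch of the overlap query in Python; this uses a sorted list and a linear check over candidates rather than the O(log n) interval tree the link describes:

import bisect

def overlaps(intervals, lo, hi):
    # intervals is a list of (start, end) pairs sorted by start.
    # Binary-search for the first interval starting after hi; every
    # earlier interval is a candidate, and [lo, hi] overlaps one of
    # them exactly when some candidate's end reaches back to lo.
    i = bisect.bisect_right(intervals, (hi, float("inf")))
    return any(end >= lo for _, end in intervals[:i])

xs = sorted([(0, 8), (0, 8), (9, 10)])  # horizontal intervals from the example
print(overlaps(xs, 4, 6))      # True: [4, 6] lies inside [0, 8]
print(overlaps(xs, 8.2, 8.8))  # False: falls in the gap between 8 and 9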
Edit
Please watch this related MIT lecture (it's a bit long, but absolutely worth it).
Even if you find simpler solutions (than implementing an augmented red-black tree), it's good to know the ideas behind these things.
Lines that are not parallel to each other are going to intersect at some point. Calculate the slopes of each line and then determine what lines they won't intersect with.
Start with that, and then let's see how to optimize it. I'm not sure how your data is represented and I can't see your image.
Using slopes is a simple equality check which probably means you can take advantage of sorting the data. In fact, you can probably just create a set of distinct slopes. You'll have to figure out how to represent the data such that the two slopes of the same rectangle are not counted as intersecting.
EDIT: Wait... how can two rectangles whose edges go to infinity not intersect? A rectangle is basically two pairs of perpendicular lines; shouldn't the extended lines of one rectangle then always intersect those of another?
Since you didn't mention which language you chose to solve the problem, I will use some kind of pseudocode.
The idea is that, if everything is OK, a sorted collection of rectangle edges along one axis should be a sequence of non-overlapping intervals.

1. Number all your rectangles, assigning them individual ids.
2. Create an empty binary tree collection (btc). This collection should have a method to insert an integer node with info, btc::insert(key, value).
3. For all rectangles, do:

foreach rect in rects do
    btc.insert(rect.top, rect.id)
    btc.insert(rect.bottom, rect.id)

4. Now iterate through the btc (this will give you a sorted order):

btc_item = btc.first()
do
    id = btc_item.id
    btc_item = btc.next()
    if (id != btc_item.id)
        then report_invalid_placement(id, btc_item.id)
    btc_item = btc.next()
while btc_item is valid

5. Repeat steps 2-4 for the rect.left and rect.right coordinates.
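A short Python rendering of this pseudocode, using a plain sort in place of the binary tree (the rectangle extents are taken from the question's example squares):

def report_overlaps_on_axis(rects, lo_key, hi_key):
    # Collect both edges of every rectangle along one axis, sort them,
    # and walk the result two at a time: in a valid layout each adjacent
    # pair of edges belongs to the same rectangle id.
    edges = []
    for rid, r in enumerate(rects):
        edges.append((r[lo_key], rid))
        edges.append((r[hi_key], rid))
    edges.sort()
    for i in range(0, len(edges), 2):
        if edges[i][1] != edges[i + 1][1]:
            print("invalid placement:", edges[i][1], edges[i + 1][1])

rects = [
    {"left": 0, "right": 8, "top": 0, "bottom": 3},   # Square 1
    {"left": 0, "right": 8, "top": 4, "bottom": 6},   # Square 3
    {"left": 9, "right": 10, "top": 0, "bottom": 4},  # Square 4
]
report_overlaps_on_axis(rects, "top", "bottom")  # y axis
report_overlaps_on_axis(rects, "left", "right")  # x axis

Note that, as in the pseudocode, intervals that overlap along one axis are reported even if the rectangles are separated along the other axis.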
I like this question. Here is my attempt at it:
If possible: create a polygon from each rectangle. Treat each edge as a line of maximum length that must be clipped. Use a clipping algorithm to check whether or not a line intersects another. For example this one: Line Clipping.
But keep in mind: if you find an intersection that lies exactly at a vertex, it's a valid one.
Here's an idea. Instead of creating each rectangle with (x, y, width, height), instantiate them with (x1, y1, x2, y2), or at least have it interpret these values given the width and height.
That way, you can check which rectangles have a similar x or y value and make sure the corresponding rectangle has the same secondary value.
Example:
The rectangles you have given have the following values:
Square 1: [0, 0, 8, 3]
Square 3: [0, 4, 8, 6]
Square 4: [9, 0, 10, 4]
First, we compare Square 1 to Square 3 (no collision):
Compare the x values
[0, 8] to [0, 8] These are exactly the same, so there's no crossover.
Compare the y values
[0, 3] to [4, 6] None of these numbers are similar, so they're not a factor
Next, we compare Square 3 to Square 4 (collision):
Compare the x values
[0, 8] to [9, 10] None of these numbers are similar, so they're not a factor
Compare the y values
[4, 6] to [0, 4] The rectangles have the number 4 in common, but 0 != 6, therefore, there is a collision
By now we know that a collision will occur, so the method would end, but let's evaluate Square 1 and Square 4 for some extra clarity.
Compare the x values
[0, 8] to [9, 10] None of these numbers are similar, so they're not a factor
Compare the y values
[0, 3] to [0, 4] The rectangles have the number 0 in common, but 3 != 4, therefore, there is a collision
Let me know if you need any extra details :)
Heh, taking the overlapping-intervals answer to the extreme, you simply determine all distinct intervals along the x and y axes. For each cutting line, do an upper-bound search along the axis it will cut, based on the interval's starting value. If you don't find an interval, or the interval does not intersect the line, then it's a valid line.
The slightly tricky part is to realize that valid cutting lines will not intersect a rectangle's bounds along an axis, so you can combine overlapping intervals into a single interval. You end up with a simple sorted array (which you can fill in O(n) time) and an O(log n) search for each cutting line.
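A sketch of that arrangement in Python (merge once, then binary-search each candidate cutting line):

import bisect

def merge_intervals(intervals):
    # Combine overlapping [lo, hi] extents into disjoint sorted intervals.
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged

def cut_is_valid(merged, starts, c):
    # A cutting line at coordinate c is valid if it misses the interior
    # of every merged interval; only the predecessor interval can cover c.
    i = bisect.bisect_right(starts, c)
    return i == 0 or merged[i - 1][1] <= c

xs = merge_intervals([(0, 8), (0, 8), (9, 10)])  # -> [[0, 8], [9, 10]]
starts = [lo for lo, _ in xs]
print(cut_is_valid(xs, starts, 8.5))  # True: a cut at x = 8.5 hits nothing
print(cut_is_valid(xs, starts, 5))    # False: it would slice through [0, 8]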
