Threshold value Z:
– The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}.
– Any threshold value lying between vi and vi+1 will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.
– It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. C4.5 instead chooses the smaller value vi of every interval {vi, vi+1} as the threshold, rather than the midpoint itself.
I just want to know if I got this right.
Let's say I have:
{65, 70, 75, 78, 80, 85, 90, 95, 96}.
I must evaluate m-1 candidate splits to find the optimal one, so the candidate thresholds are
{65, 70, 75, 78, 80, 85, 90, 95}.
For each split (e.g. <65 and >=65, <70 and >=70, and so on) I must calculate the gain ratio, and choose the split that gives me the highest gain. Am I right?
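For concreteness, here is a minimal sketch (in Python, not C4.5 itself) of what I mean: enumerate the m-1 lower interval endpoints as candidate thresholds and score each binary split by gain ratio. The class labels are made up just so there is something to score.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    # Split as Y <= t versus Y > t, with t the lower value v_i of each interval (C4.5 style)
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))  # entropy of the partition sizes
    return gain / split_info if split_info > 0 else 0.0

values = [65, 70, 75, 78, 80, 85, 90, 95, 96]
labels = ["n", "n", "y", "y", "y", "y", "y", "n", "n"]  # hypothetical class labels
candidates = sorted(set(values))[:-1]                   # the m-1 candidate thresholds
best = max(candidates, key=lambda t: gain_ratio(values, labels, t))
print("best threshold:", best)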
temp = {23,23,23,22,20,20,19,12,11,10,10 };
depth= {0,1,2,3, 8, 9, 10, 12, 18, 23, 29 };
I have two arrays as shown. I need to find the thermocline using the following statement:
It is easily seen that the slope of the curve (i.e. dT/dh) is a maximum at the (very!) obvious thermocline.
Furthermore, because the curvature of the curve changes at the thermocline, a point of inflection, by definition d2T/dh2 = 0 there.
Another way to look at it is that where the slope, i.e. the 1st derivative, is maximised, the 2nd derivative must be equal to zero.
Please help!
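As an illustration of the max-slope idea, a minimal sketch could look like this (NumPy's gradient for the finite differences over the uneven depth spacing is my assumption; the arrays are the ones above):

import numpy as np

temp  = np.array([23, 23, 23, 22, 20, 20, 19, 12, 11, 10, 10], dtype=float)
depth = np.array([ 0,  1,  2,  3,  8,  9, 10, 12, 18, 23, 29], dtype=float)

dT_dh = np.gradient(temp, depth)        # first derivative over uneven depth spacing
i = np.argmax(np.abs(dT_dh))            # steepest point of the profile
print("thermocline near depth", depth[i], "slope", dT_dh[i])

d2T_dh2 = np.gradient(dT_dh, depth)     # second derivative; it changes sign near the inflection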
I'm currently trying to fit a straight line through a spread of scattered data in MATLAB. This is easy enough using the polyfit function, from which I can easily obtain my y = mx + c equation. However, I now need to fit a line along the upper boundary of my data, i.e., the top few data points. I know this description is vague, so let's assume that my scattered data will be in the shape of a cone, with its apex on the y-axis, spreading outwards and upwards in the +x and +y direction. I need to fit a best-fit line on the 'upper edge of the cone', if you will.
I've developed an algorithm, but it's extremely slow. It involves first fitting a line of best fit through ALL the data, deleting all data points below this line of best fit, and iterating until only 5% of the initial data points are left. The final best-fit line will then lie close to the top edge of the cone. For 250 data points this takes about 5 s, and since I am dealing with more than a million data points, this algorithm is simply too inefficient.
I guess my question is: is there an algorithm to more efficiently achieve what I need? Or is there a way to sharpen up my code to eliminate unnecessary complexity?
Here is my code in MATLAB:
(As an example)
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; %To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; %To be used as y-axis points
while prod(size(a)) > (0.05*prod(size(a))) %Iterative line fitting occurs until there are less than 5% of the data points left
lobf = polyfit(a,b,1); %Line of Best Fit for current data points
alen = length(a);
for aindex = alen:-1:1 %For loop to delete all points below line of best fit
ValLoBF = lobf(1)*a(aindex) + lobf(2)
if ValLoBF > b(aindex) %if LoBF is above current point...
a(aindex) = []; %delete x coordinate...
b(aindex) = []; %and delete its corresponding y coordinate
end
end
end
Well, first of all, your example code seems to run indefinitely ;) The while condition compares the current number of points against 5% of that same current number, so it never tracks the 5% of the initial data you intended.
Some optimizations for your code:
a = [4, 5, 1, 8, 1.6, 3, 8, 9.2]; %To be used as x-axis points
b = [45, 53, 12, 76, 25, 67, 75, 98]; %To be used as y-axis points
n_init_a = length(a);
while length(a) > 0.05*n_init_a %Iterative line fitting occurs until there are less than 5% of the data points left
lobf = polyfit(a,b,1); % Line of Best Fit for current data points
% Delete data points below line using logical indexing
% First create values of the polyfit points using element-wise vector multiplication
temp = lobf(1)*a + lobf(2); % Containing all polyfit values
% Using logical indexing to discard all points below
a(b<temp)=[]; % First destroy a
b(b<temp)=[]; % Then b, very important!
end
Also you should try profiling your code by typing in the command window
profile viewer
and check what takes the most time when calculating your results. I suspect it is polyfit, but that probably can't be sped up much.
What you are looking for is not line fitting. You are trying to find the convex hull of the points.
You should check out the function convhull. Once you find the hull, you can remove all of the points that aren't close to it, and fit each part independently to avoid the fact that the data is noisy.
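For illustration, a rough sketch of that idea in Python (my assumption; the answer itself refers to MATLAB's convhull): compute the hull with scipy.spatial.ConvexHull, keep only its upper chain, and fit a line through those vertices. The cone-shaped toy data is made up.

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = x * rng.uniform(0, 5, 1000)                 # cone-shaped toy data opening in +x and +y

pts = np.column_stack([x, y])
v = ConvexHull(pts).vertices                    # hull vertex indices, counter-clockwise

order = np.roll(v, -np.argmax(pts[v, 0]))       # start the cycle at the rightmost vertex
upper = order[:np.argmin(pts[order, 0]) + 1]    # counter-clockwise from rightmost to leftmost = upper hull

m, c = np.polyfit(pts[upper, 0], pts[upper, 1], 1)
print(f"upper-boundary fit: y = {m:.3f}*x + {c:.3f}")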
Alternatively, you could render the points onto some pixel grid, then do some kind of morphological operation, like imclose, and finish with a Hough transform. Check out also this answer.
I have been reviewing Expectation Maximization (EM) in research papers such as this one:
http://pdf.aminer.org/000/221/588/fuzzy_k_means_clustering_with_crisp_regions.pdf
I have some doubts that I have not been able to figure out. For example, what would happen if we have many dimensions for each datapoint?
For example I have the following dataset with 6 datapoints and 4 dimensions:
D1, D2, D3, D4
5, 19, 72, 5
6, 18, 14, 1
7, 22, 29, 4
3, 22, 51, 1
2, 21, 89, 2
1, 12, 28, 1
Does it mean that, for computing the expectation step, I need to compute 4 standard deviations (one for each dimension)?
Do I also have to compute the variance for each cluster, assuming k=3 (I do not know whether it is necessary, based on the formula from the paper...), or just the variances for each dimension (4 attributes)?
Usually, you use a covariance matrix, which also includes the variances.
But it really depends on your chosen model. The simplest model does not use variances at all.
A more complex model has a single variance value, the average variance over all dimensions.
Next, you can have a separate variance for each dimension independently; and last but not least a full covariance matrix. That is probably the most flexible GMM in popular use.
Depending on your implementation, there can be many more.
From R's mclust documentation:
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = elliposidal multivariate normal
The problem seems very simple but I am not able to find an elegant solution for it.
I have an arc defined by
startAngle ( -360 <= startAngle <= 360 ),
sweepAngle ( -360 <= sweepAngle <= 360 )
and a radius (not important here).
I want to divide this arc into pairs (startAngle1, sweepAngle1), (startAngle2, sweepAngle2), ... such that each sub-arc lies entirely within a single quadrant.
E.g. if startAngle = 45 and sweepAngle = 90, then there should be two pairs: (45, 45) and (90, 45).
A brute-force way is to check all 4^2 possibilities (each of startAngle and the corresponding endAngle, calculated from sweepAngle, can lie in any of the 4 quadrants).
But I think an elegant simpler solution should be there. I just can't seem to find it.
Thanks.
EDIT:
One algorithm I just thought of is:
1. Starting from startAngle, I sweep towards the end angle and keep checking whether I cross a quadrant boundary (theta mod 360 = 0, 90, 180 or 270).
2. Update the list of arcs accordingly.
Anything better?
I would start with 90 - startAngle % 90: the modulo operator gives you how far startAngle is into its current quadrant, and 90 minus that value is how far the arc can still sweep within that quadrant. That is your first sweep angle. Each further quadrant then contributes a sweep of 90. You keep going until the accumulated sweep would exceed the input sweepAngle; then you know you are in the last quadrant. In pseudo-code, where out prints a new pair of angles:
currentPosition = startAngle
currentSweep = 90 - startAngle % 90
totalAngle = 0
while (totalAngle + currentSweep < sweepAngle)
    out (currentPosition, currentSweep)
    currentPosition += currentSweep
    totalAngle += currentSweep
    currentSweep = 90
out (currentPosition, sweepAngle - totalAngle)
Probably you have to look into the corner cases more closely, e.g. what happens when startAngle is exactly a multiple of 90, or when the angles are negative. But basically this should be the algorithm, with a reasonable running time (and elegance, imo).
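A runnable version of the pseudo-code above, as a Python sketch (it assumes non-negative angles, as in the example):

def split_by_quadrant(start_angle, sweep_angle):
    pairs = []
    position = start_angle
    sweep = 90 - start_angle % 90      # remaining part of the starting quadrant
    total = 0
    while total + sweep < sweep_angle:
        pairs.append((position, sweep))
        position += sweep
        total += sweep
        sweep = 90                     # every further quadrant is a full 90 degrees
    pairs.append((position, sweep_angle - total))
    return pairs

print(split_by_quadrant(45, 90))       # [(45, 45), (90, 45)]
print(split_by_quadrant(45, 200))      # [(45, 45), (90, 90), (180, 65)]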
Here is a tricky problem (or at least so I think). I need to create a histogram, but instead of having the data and its frequency, I have repeated data (i.e. not binned) and a weight for each data point.
One example:
Angle | Weight
90 .... 3/10
93 .... 2/10
180 .... 2/10
180 .... 1/10
95 .... 2/10
I want to create a histogram with bin size 10. The y-value of each bar should be the sum of the weights of the angles that fall within that bin. How can I do it? Preferably Mathematica or pseudocode...
In Mathematica 9, you can do it using the WeightedData function like this:
Histogram[WeightedData[{90, 93, 180, 180, 95}, {3/10, 2/10, 2/10, 1/10, 2/10}], {10}]
You should then get a graphic of the weighted histogram, with each bar height equal to the summed weights in that bin.
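For comparison, the same weighted binning can be sketched outside Mathematica with NumPy's histogram, which takes a weights argument (my addition, not part of this answer):

import numpy as np

angles  = np.array([90, 93, 180, 180, 95])
weights = np.array([3/10, 2/10, 2/10, 1/10, 2/10])

counts, edges = np.histogram(angles, bins=np.arange(90, 200, 10), weights=weights)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c:.2f}")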
Since the expected output is not forthcoming I shall adopt Verbeia's interpretation. You might use something like this:
dat = {{90, 3/10}, {93, 1/5}, {180, 1/5}, {180, 1/10}, {95, 1/5}};
bars =
Reap[
Sow[#2, Floor[#, 10]] & @@@ dat,
_,
{#, Tr@#2} &
][[2]]
Graphics[
Rectangle[{#, 0}, {# + 10, #2}] & @@@ bars,
AspectRatio -> 1/GoldenRatio,
Axes -> True,
AxesOrigin -> {Min@bars[[All, 1]], 0}
]
I did something similar for a different kind of question recently (weighting by balance sheet size).
Assuming your data is in an N * 2 matrix list, I would do something like:
{numbers,weights} = {data[[All,1]], data[[All,2]]*10};
weightednumbers = Flatten@ MapThread[
Table[#1, {#2}] &, {numbers, Ceiling[weights]}];
And then use Histogram to draw the histogram on this transformed data.
There might be other ways but this works.
An important point is to make sure the weights are integers, so that Table gets a valid iterator. This might require defining the weights as data[[All,2]]/Min[data[[All,2]]].