Negative value of Information Gain

Negative value of Information Gain - algorithm

I'm implementing C4.5 and in my calculations im getting (for some examples) negative values for information gain. I read Why am I getting a negative information gain, but my issiue seeams to be diffrent. I putt my calculation to excel and i get the same results as below:
My calculations
What am i doing wrong?
I tried calculate it again, and also i get negative value as is on image below:
Newest calculations with data set
80 is split value, so i get 11 <=80 and 3objects > 80

Are you multiplying your result for entropy by -1?
$$ H(X) = -\sum_{i=1}^n {\mathrm{P}(x_i) \log_b \mathrm{P}(x_i)} $$
Ugh... having trouble with mathjax, go here for definition

Related

MSE giving negative results in High-Level Synthesis

I am trying to calculate the Mean Squared Error in Vitis HLS. I am using hls::pow(...,2) and divide by n, but all I receive is a negative value for example -0.004. This does not make sense to me. Could anyone point the problem out or have a proper explanation for this??
Besides calculating the mean squared error using hls::pow does not give the same results as (a - b) * (a - b) and for information I am using ap_fixed<> types and not normal float or double precision
Thanks in advance!

It sounds like an overflow and/or underflow issue, meaning that the values reach the sign bit and are interpreted as negative while just be very large.
Have you tried tuning the representation precision or the different saturation/rounding options for the fixed point class? This tuning will depend on the data you're processing.
For example, if you handle data that you know will range between -128.5 and 1023.4, you might need very few fractional bits, say 3 or 4, leaving the rest for the integer part (which might roughly be log2((1023+128)^2)).
Alternatively, if n is very large, you can try a moving average and calculate the mean in small "chunks" of length m < n.
p.s. Getting the absolute value of a - b and store it into an ap_ufixed before the multiplication can already give you one extra bit, but adds an instruction/operation/logic to the algorithm (which might not be a problem if the design is pipelined, but require space if the size of ap_ufixed is very large).

Generate random number in interval in PostScript

I am struggling to find a way to generate a random number within a given interval in PostScript.
Basically PostScript has three functions to help you generate (pseudo-)random numbers. Those are rand, srand and rrand.
The later two are for passing a seed to the number generator to be able to reproduce specific results. At least that´s what I understood they are for. Anyway they don´t seem suitable for my case.
So rand seems to be the only function I can use to generate a random number, but...
rand returns a random integer in the range 0 to 231 − 1 (From the PostScript Language Reference, page 637 (651 in the PDF))
This is far beyond the the interval I´m looking for. I am more interested in values up to small thousands, maybe 10.000 or something like that and small float values, up to 100, all with the lower limit of 0.
I thought I could just narrow my numbers down by simple divisions and extracting the root but that tends to give me unusable small values in quite a lot cases. I am wondering if there are robust ways to either shrink a large number down to what I need or, I´d prefer that, only generate numbers in the desired interval.
Besides: while-loops are not possible in PostScript, otherwise I´d have written a function to generate numbers until they fit in my interval.
Any hints on what to look for breaking numbers down into my interval?

mod is often good enough and it's fast. But you may get a more uniform distribution by using floating-point ops.
rand 16#7fffffff div 100 mul cvi
This is because mod discards the upper bits of the input. And the PRNG is usually trying to randomize over all the bits. By scaling down then up, they all contribute something in the way of rounding effects.

Just use the modulo operator to get it down to the size you want:
GS>rand 100 mod stack
7

Random number generation requires too many iterations

I am running simulations in Anylogic and I'm trying to calibrate the following distribution:
Jump = normal(coef1, coef2, -1, 1);
However, I keep getting the following message as soon as I start the calibration (experimentation):
Random number generation requires too many iterations (> 10000)
I tried to replace -1 and 1 by other values and keep getting the same thing.
I also tried to change the bounds of coef1 and coef2 and put things like [0,1], but I still get the same error.
I don't get it.
Any ideas?

The four parameter normal method is not deprecated and is not a "calibration where coef1 and coef2 are the coefficicents to be solved for". Where did you get that understanding from? Or are you saying that you're using your AnyLogic Experiment (possibly a multi-run or optimisation experiment) to 'calibrate' that distribution, in which case you need to explain what you mean by 'calibrate' here---what is your desired outcome?
If you look in the API reference (AnyLogic classes and functions --> API Reference --> com.xj.anylogic.engine --> Utilities), you'll see that it's a method to use a truncated normal distribution.
public double normal(double min,
double max,
double shift,
double stretch)
The first 2 parameters are the min and max (where it will sample repeatedly and ignore values outside the [min,max] range); the second two are effectively the mean and standard deviation. So you will get the error you mentioned if min or max means it will sample too many times to get a value in range.
API reference details below:
Generates a sample of truncated Normal distribution. Distribution
normal(1, 0) is stretched by stretch coefficient, then shifted to the
right by shift, after that it is truncated to fit in [min, max]
interval. Truncation is performed by discarding every sample outside
this interval and taking subsequent try. For more details see
normal(double, double)
Parameters:
min - the minimum value that this function will return. The distribution is truncated to return values above this. If the sample
(stretched and shifted) is below this value it will be discarded and
another sample will be drawn. Use -infinity for "No limit".
max - the maximum value that this function will return. The distribution is truncated to return values below this. If the sample
(stretched and shifted) is bigger than this value it will be discarded
and another sample will be drawn. Use +infinity for "No limit".
shift - the shift parameter that indicates how much the (stretched) distribution will shifted to the right = mean value
stretch - the stretch parameter that indicates how much the distribution will be stretched = standard deviation Returns:
the generated sample

According to AnyLogic's documentation, there is no version of normal which takes 4 arguments. Also note that if you specify a mean and standard deviation, the order is unusual (to probabilists/statisticians) by putting the standard deviation before the mean.

Correlation of an image

I tried to using this post How to find Correlation of an image
to find Correlation for my image but i have questions.when i use this: cov(x,y) / ( sqrt(D(x)* D(y) )) my result is [1.0025 -0.0358 ;-0.0358 0.9975](for 5000 pixel). -0.0358 is Correlation for my image?what is 0.9975?i run my code tow times. the second result is [0.9830 0.0243;0.0243 1.0173].-0.0358 or 0.0243 which one is Correlation ? I know because of using randperm In each run different number is create but Which one is a best?negative number or positive number?

Functions like corrcoef return values like [1.0025 -0.0358; -0.0358 0.9975] where the coefficient is -0.0358, and the other values relate to the certainty of the coefficient.
Here's what you should expect from the covariance matrix cov:
http://www.mathworks.com/help/matlab/ref/cov.html
Here's what you should expect from the correlation coefficients corrcoef:
http://www.mathworks.com/help/matlab/ref/corrcoef.html
What I suspect is going on is there's some variable whose value is affected by running. A useful debugging practice (when going through your code with a fine-toothed comb) is having your code output to the screen some of the values as they are being generated. In Matlab, this is as simple as removing a few ; at the end of a few lines.
Hope this helps!

MATLAB script to generate reports of rounding errors in algorithms

I am interested in use or created an script to get error rounding reports in algorithms.
I hope the script or something similar is already done...
I think this would be usefull for digital electronic system design because sometimes it´s neccesary to study how would be the accuracy error depending of the number of decimal places that are considered in the design.
This script would work with 3 elements, the algorithm code, the input, and the output.
This script would show the error line by line of the algorithm code.
It would modify the algorith code with some command like roundn and compare the error of the output.
I would define the error as
Errorrounding = Output(without rounding) - Output round
For instance I have the next algorithm
calculation1 = input*constan1 + constan2 %line 1 of the algorithm
output = exp(calculation1) %line 2 of the algorithm
Where 'input' is the input of n elements vector and 'output' is the output and 'constan1' and 'constan2' are constants.
n is the number of elements of the input vector
So, I would put my algorithm in the script and it generated in a automatic way the next algorithm:
input_round = roundn(input,-1*mdec)
calculation1 = input*constant1+constant2*ones(1,n)
calculation1_round = roundn(calculation1,-1*mdec)
output=exp(calculation1_round)
output_round= roundn(output,-1*mdec)
where mdec is the number of decimal places to consider.
Finally the script give the next message
The rounding error at line 1 is #Errorrounding_calculation1
Where '#Errorrounding' would be the result of the next operation Errorrounding_calculation1 = calculation1 - calculation1_round
The rounding error at line 2 is #Errorrounding_output
Where 'Errorrounding_output' would be the result of the next operation Errorrounding_output = output - output_round
Does anyone know if there is something similar already done, or Matlab provides a solution to deal with some issues related?
Thank you.

First point: I suggest reading What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg. It should illuminate a lot of issues regarding floating-point computations that will help you understand more of the intricacies of the problem you are considering.
Second point: I think the problem you are considering is a lot more complicated than you realize. You are interested in the error introduced into a calculation due to the reduced precision from rounding. What you don't realize is that these errors will propagate through your computations. Consider your example:
output = input*C1 + C2
If each of the three operands is a double-precision floating-point number, they will each have some round-off error in their precision. A bound on this round-off error can be found using the function EPS, which tells you the distance from one double-precision number to the next largest one. For example, a bound on the relative error of the representation of input will be 0.5*eps(input), or halfway between it and the next largest double-precision number. We can therefore estimate some errors bounds on the three operands as follows:
err_input = 0.5.*eps(input); %# Maximum round-off error for input
err_C1 = 0.5.*eps(C1); %# Maximum round-off error for C1
err_C2 = 0.5.*eps(C2); %# Maximum round-off error for C2
Note that these errors could be positive or negative, since the true number may have been rounded up or down to represent it as a double-precision value. Now, notice what happens when we estimate the true value of the operands before they were rounded-off by adding these errors to them, then perform the calculation for output:
output = (input+err_input)*(C1+err_C1) + C2+err_C2
%# ...and after reordering terms
output = input*C1 + C2 + err_input*C1 + err_C1*input + err_input*err_C1 + err_C2
%# ^-----------^ ^-----------------------------------------------------^
%# | |
%# rounded computation difference
You can see from this that the precision round-off of the three operands before performing the calculation could change the output we get by as much as difference. In addition, there will be another source of round-off error when the value output is rounded off to represent it as a double-precision value.
So, you can see how it's quite a bit more complicated than you thought to adequately estimate the errors introduced by precision round-off.

This is more of an extended comment than an answer:
I'm voting to close this on the grounds that it isn't a well-formed question. It sort of expresses a hope or wish that there exists some type of program which would be interesting or useful to you. I suggest that you revise the question to, well, to be a question.
You propose to write a Matlab program to analyse the numerical errors in other Matlab programs. I would not use Matlab for this. I'd probably use Mathematica, which offers more sophisticated structural operations on strings (such as program source text), symbolic computation, and arbitrary precision arithmetic. One of the limitations of Matlab for what you propose is that Matlab, like all other computer implementations of real arithmetic, suffers rounding errors. There are other languages which you might choose too.
What you propose is quite difficult, and would probably require a longer answer than most SOers, including this one, would be happy to contemplate writing. Happily for you, other people have written books on the subject, I suggest you start with this one by NJ Higham. You might also want to investigate matters such as interval arithmetic.
Good luck.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio