Implementation of Fayyad and Irani's entropy-based discretization

Is there Java code implementing Fayyad and Irani's entropy-based discretization? I have tried reading the file and then calculating the entropy and information gain. How do I get the boundary points?
I have to implement Fayyad and Irani's discretization algorithm, which is based on entropy and information gain ([Fayyad and Irani, 1993]).

Yes; here's an implementation:
https://svn.kenai.com/svn/grex~subversion/grex/src/weka/filters/supervised/attribute/Discretize.java
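If you only need the boundary points, here is a minimal, self-contained Java sketch (an illustration, not the code linked above): sort the examples by attribute value and take the midpoints between adjacent examples whose class labels differ. Fayyad and Irani showed that the information-gain-optimal cut point always lies on such a class boundary; the full MDLP algorithm then recursively selects the gain-maximizing cut and stops according to the minimum description length criterion.

import java.util.*;

public class BoundaryPoints {
    // Shannon entropy (base 2) of a list of class labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // values[i] is the attribute value of example i, labels[i] its class.
    static List<Double> boundaries(double[] values, String[] labels) {
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));
        List<Double> cuts = new ArrayList<>();
        for (int k = 1; k < idx.length; k++) {
            int a = idx[k - 1], b = idx[k];
            // A candidate cut is the midpoint between adjacent, distinct
            // values whose class labels differ.
            if (!labels[a].equals(labels[b]) && values[a] != values[b]) {
                cuts.add((values[a] + values[b]) / 2.0);
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0, 4.0, 5.0};
        String[] y = {"no", "no", "yes", "yes", "no"};
        System.out.println(boundaries(v, y));          // [2.5, 4.5]
        System.out.println(entropy(Arrays.asList(y))); // ~0.971
    }
}

For each candidate cut you would then compute the information gain as the parent entropy minus the size-weighted entropies of the two induced subsets, and keep the cut that maximizes it.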

Related

Which pseudo random number generator algorithm does java.util.Random in Java 8 use?

I'm trying to find out exactly how java.util.Random in Java 8 generates its random numbers, more specifically the algorithm behind it. All I keep seeing is how to generate random numbers in Java 8, not the machinery behind it.
If you could point me to any documentation regarding the PRNG that java.util.Random uses, that would be perfect.
Also, in case it's been done already, is there a way of replicating the output of java.util.Random in Python?
A quick test using a seed of 5 and an int range of 0 to 100 gives different results from Python's random module.
Quoting Java docs:
An instance of this class is used to generate a stream of pseudorandom numbers. The class uses a 48-bit seed, which is modified using a linear congruential formula. (See Donald Knuth, The Art of Computer Programming, Volume 2, Section 3.2.1.)
So a linear congruential generator with a 48-bit seed is used.
I do not have access to the mentioned book, but I would guess it gives more detailed information.
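For illustration, here is a minimal Java sketch of that LCG, using the constants documented in the java.util.Random source and Javadoc (multiplier 0x5DEECE66D, increment 0xB, modulus 2^48); porting exactly this arithmetic to Python should reproduce Java's output:

import java.util.Random;

public class MiniRandom {
    private long seed;

    MiniRandom(long seed) {
        // The constructor scrambles the user seed with the multiplier.
        this.seed = (seed ^ 0x5DEECE66DL) & ((1L << 48) - 1);
    }

    int next(int bits) {
        // One LCG step on the 48-bit state; return the high-order bits.
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        return (int) (seed >>> (48 - bits));
    }

    int nextInt(int bound) {
        int r = next(31);
        int m = bound - 1;
        if ((bound & m) == 0) // bound is a power of two
            return (int) ((bound * (long) r) >> 31);
        // Rejection loop, as in the JDK source, to avoid modulo bias.
        for (int u = r; u - (r = u % bound) + m < 0; u = next(31)) { }
        return r;
    }

    public static void main(String[] args) {
        MiniRandom mine = new MiniRandom(5);
        Random jdk = new Random(5);
        for (int i = 0; i < 5; i++)
            System.out.println(mine.nextInt(100) + " == " + jdk.nextInt(100));
    }
}

Python's random module uses the Mersenne Twister, a completely different generator, which is why the same seed gives different numbers.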

Seeding an F# random generator to the same state as MATLAB's

In trying to port some MATLAB code to F#, I'm trying to make sure the translations are accurate, and as of now there are cases where I'm not completely sure whether there are mistakes. Since a lot of the code is statistical in nature, it would be convenient to be able to seed the F# generators to the same state as MATLAB's; it would also help with pinpointing the exact equations that are wrong. I wanted to ask before I started dumping MATLAB-generated random numbers to CSV files and solving this issue manually.
This is not a definitive answer; implementing your own random number generator in both MATLAB and F# would probably yield the most reliable results. You are also bound to bump into issues of thread safety in .NET and the shapes of matrices in MATLAB. For example:
In MATLAB:
rng(200,'twister')
rand(1,5)
ans =
0.9476 0.2265 0.5944 0.4283 0.7641
In F#:
open MathNet.Numerics.Random
let random1b = MersenneTwister(200)
random1b.NextDoubles(5)
val it : float [] = [|0.9476322592; 0.4941436297; 0.2265474238;
0.1485590497; 0.5944201448|]
The 1st, 3rd, and 5th F# numbers match MATLAB's first three. That pattern suggests MATLAB consumes two 32-bit Mersenne Twister outputs per double (for full 53-bit precision) while MathNet's NextDoubles here consumes one, so it's possible you can replicate the stream by playing with the generators' precision settings and/or the F# and MATLAB array dimensions.
The MathNet Random Docs.

Can we convert every algorithm to fixed point?

I have developed an algorithm in MATLAB using floating-point variables. In my algorithm I perform eigenvalue decomposition, rotation and transformation of matrices, matrix inversion, and division, addition and multiplication of matrices several times (so it is a kind of signal processing). I tried to convert it to fixed point but could not, because my variables and matrices change their values every time, so it is very difficult for me to handle overflow; I cannot write a routine to handle it. Can anyone tell me how to handle this problem, or whether it is simply not possible to convert the algorithm to fixed point?
I need a concrete reason to justify that I cannot convert my algorithm to fixed point (as it is my master's thesis!).
P.S.: This algorithm was developed for the controller of an analog-to-digital converter; it uses the statistics of the signal to derive an effective decision threshold. I have just written out the mathematical operations.
The answer is YES and NO; it depends on the dynamic range of the processed data.
If you are processing numbers/signals within a specified range, then YES; but if the numbers/signals have a very high dynamic range, then NO.
You should use different fixed-point formats for the different stages of the signal processing. For example, an ADC gives you values in an exactly defined range, so you use a fixed format that neither loses precision nor wastes many unused bits. After that you apply some filter or whatever, and the range changes, so you need to bound the possible number ranges per stage and use the best-suited fixed-point format you have at your disposal.
This means you need several fixed-point formats, and also the operations between them. You can keep a fixed number of bits and just change the position of the binary point, as in the sketch below.
To be more specific one would need the block diagram of your processing pipeline, with the number ranges included, and a list of the operations used. Matrix operations and integrals/sums are tricky, because they can change the dynamic range considerably.
The real question is always whether such an implementation is actually faster than floating point, because sometimes the transitions between different fixed-point stages can be slower than a direct floating-point implementation.
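To make the per-stage format idea concrete, here is a minimal Java sketch of Q-format arithmetic; the stage names, ranges and bit widths are illustrative assumptions, not taken from the question:

public class FixedPointDemo {
    // Convert a double to fixed point with f fractional bits (Qf in 32 bits).
    static int toFixed(double x, int f) { return (int) Math.round(x * (1L << f)); }
    static double toDouble(int q, int f) { return q / (double) (1L << f); }

    // Multiply a Qa value by a Qb value, producing a Qc value. The 64-bit
    // intermediate guards against overflow; the shift realigns the point.
    static int mul(int x, int a, int y, int b, int c) {
        long p = (long) x * (long) y;    // exact Q(a+b) product in 64 bits
        return (int) (p >> (a + b - c)); // rescale to Qc
    }

    public static void main(String[] args) {
        // ADC stage: samples in [-1, 1), so Q15 fits without clipping.
        int sample = toFixed(0.3, 15);
        // Filter stage: gains up to 8 need headroom, so switch to Q12.
        int gain = toFixed(2.5, 12);
        int out = mul(sample, 15, gain, 12, 12); // result kept in Q12
        System.out.println(toDouble(out, 12));   // ~0.75
    }
}

The shift in mul is exactly the "transition between stages" mentioned above: it costs an extra operation per value, which is why a chain of many format changes can end up slower than plain floating point.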

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I have tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks for doing so?
As jimifiki wrote, try a profiling session first.
To improve on the algorithmic side only, you could try the following:
Lookup table:
When evaluating the Metropolis criterion you need the exponential exp(-K * dE / T), where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin flip.
Calculating exponentials is expensive.
So build a table beforehand in which to look up the value for each possible dE. Although there are many nearest-neighbour spin configurations, exploiting the problem's symmetry leaves only five possible values for dE: 8, 4, 0, -4 and -8 (in units of the coupling). Instead of calling the exp function, use the precomputed table, as in the sketch below.
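A minimal Java sketch of the idea (illustrative parameter values; the original is in C99, but the table logic is identical):

import java.util.Arrays;
import java.util.Random;

public class IsingLookup {
    public static void main(String[] args) {
        int L = 64;                // linear grid size (assumed)
        double J = 1.0, T = 2.269; // coupling and temperature, near criticality
        int[] s = new int[L * L];  // spins +/-1 in a linear array
        Arrays.fill(s, 1);
        Random rng = new Random(42);

        // Precompute exp(-dE / T) once for the two positive dE values.
        double[] accept = { Math.exp(-4.0 * J / T), Math.exp(-8.0 * J / T) };

        for (int step = 0; step < L * L; step++) {
            int i = rng.nextInt(L), j = rng.nextInt(L);
            int n = s[((i + 1) % L) * L + j] + s[((i + L - 1) % L) * L + j]
                  + s[i * L + (j + 1) % L] + s[i * L + (j + L - 1) % L];
            int dE = 2 * s[i * L + j] * n; // one of -8, -4, 0, 4, 8 (units of J)
            // Metropolis: accept downhill moves always, uphill via the table.
            if (dE <= 0 || rng.nextDouble() < accept[dE / 4 - 1]) {
                s[i * L + j] = -s[i * L + j];
            }
        }
        System.out.println("one sweep done");
    }
}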
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness you have to use a so-called checkerboard scheme: view the two-dimensional grid as a checkerboard and update all the white cells in parallel, then all the black ones. This is necessary because the nearest-neighbour interaction introduces dependencies between adjacent values; see the sketch below.
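A minimal Java sketch of the checkerboard sweep (illustrative, and in Java rather than the poster's C99): sites of equal colour share no neighbours, so each colour can be updated in parallel without races on the spin array:

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class IsingCheckerboard {
    static final int L = 64;
    static final double T = 2.269;
    static final int[] s = new int[L * L];

    static void update(int i, int j) {
        int n = s[((i + 1) % L) * L + j] + s[((i + L - 1) % L) * L + j]
              + s[i * L + (j + 1) % L] + s[i * L + (j + L - 1) % L];
        int dE = 2 * s[i * L + j] * n;
        if (dE <= 0 || ThreadLocalRandom.current().nextDouble() < Math.exp(-dE / T)) {
            s[i * L + j] = -s[i * L + j];
        }
    }

    public static void main(String[] args) {
        Arrays.fill(s, 1);
        for (int colour = 0; colour < 2; colour++) {
            final int c = colour;
            // (i + j) parity selects the colour; only same-colour sites run
            // concurrently, and they never touch each other's neighbours.
            IntStream.range(0, L * L).parallel()
                     .filter(k -> (k / L + k % L) % 2 == c)
                     .forEach(k -> update(k / L, k % L));
        }
        System.out.println("one checkerboard sweep done");
    }
}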
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, since you're already working in C99.
Some tips:
- Don't forget to align your C99 structs properly.
- Use linear arrays, not nested ones; contiguous, properly aligned memory is normally faster to access.
- Let the compiler do loop unrolling and the like (this needs specific gcc options; it is not enabled by default at -O2).
Some more information:
If you are looking for an efficient method to calculate the critical point of the system, the method of choice is finite-size scaling: simulate at different system sizes and temperatures, then compute a quantity that is system-size independent at the critical point, i.e. the intersection point of the corresponding curves (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the linear grid size (i.e. with the number of sites), isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays arrays of ints? You could pack many spins into the same int, as in the sketch below, though it's a lot of work.
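A minimal sketch of that bit packing (multispin coding), illustrative only: one spin per bit of a long, so 64 spins fit in a single machine word and a flip is a one-instruction XOR:

public class MultispinDemo {
    public static void main(String[] args) {
        long word = 0L;  // 64 spins, all "down" (bit = 0)
        word ^= 1L << 5; // flip spin 5 up (XOR toggles a single bit)
        int spin5 = ((word >>> 5) & 1L) == 1L ? +1 : -1;
        System.out.println(spin5); // +1
        word ^= 1L << 5; // flip it back down
    }
}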
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.

I'm looking for a good pseudo-random number generator that takes two inputs instead of one

I'm looking for a deterministic pseudo-random generator that takes two inputs and always returns the same output for them. I'm looking for properties like uniform distribution, being as unpredictable as possible, and a very long period. Ideally the function doesn't rely on previous values: that matters because I'm generating terrain data for an extremely large procedurally generated world and can't afford to store previous values.
Any help is appreciated.
I think what you're looking for is Perlin noise: it's a way of generating "random" values in 2D (typically) that look like terrain, clouds, etc.
Note that this doesn't have much to do with cryptography; a "real" random number source is probably not what you want for synthetic terrain anyway (it looks too noisy/spiky).
There's a good article on Perlin noise here.
The implementation of Perlin noise does use a source of random numbers, but typically you can use whatever is present on your system (starting with a known seed if you want to reproduce the terrain later).
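For illustration, here is a minimal Java sketch of value noise, a simpler cousin of Perlin noise built on the same idea (the hash constants are arbitrary large primes, not from the linked article): hash each integer lattice point to a repeatable pseudo-random value, then smoothly interpolate between the four surrounding corners:

public class ValueNoise2D {
    // Stateless integer hash of a lattice point to [0, 1); no stored state.
    static double lattice(int x, int y) {
        long h = x * 374761393L + y * 668265263L;
        h = (h ^ (h >>> 13)) * 1274126177L;
        return ((h ^ (h >>> 16)) & 0xFFFFFFL) / (double) (1 << 24);
    }

    static double smooth(double t) { return t * t * (3 - 2 * t); } // smoothstep

    static double noise(double x, double y) {
        int x0 = (int) Math.floor(x), y0 = (int) Math.floor(y);
        double fx = smooth(x - x0), fy = smooth(y - y0);
        double top = lattice(x0, y0) * (1 - fx) + lattice(x0 + 1, y0) * fx;
        double bot = lattice(x0, y0 + 1) * (1 - fx) + lattice(x0 + 1, y0 + 1) * fx;
        return top * (1 - fy) + bot * fy; // bilinear blend of the corners
    }

    public static void main(String[] args) {
        System.out.println(noise(3.7, 1.2)); // deterministic for fixed inputs
    }
}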
Is the problem deciding on a PRNG algorithm to use, or finding an algorithm that accepts two inputs?
If it's the former, why not use a built-in random class, such as the Random class in .NET, since it strives for uniform distribution and long cycles? Also, given the same seed it will generate the same sequence of numbers.
If it's the latter, you can map the two inputs to a single value and use that as the seed for your random algorithm. You can define a simple hash function that takes a string and calculates an integer from it:
s[0] + s[1]^1 + s[2]^2 + ... + s[n]^n = seed
Combining the two inputs (by concatenating them, provided the inputs are binary integers) into one seed will also do for a PRNG such as the Mersenne Twister; a sketch of this seed-mixing approach follows below.
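For example, here is a minimal Java sketch of the seed-mixing approach (the constants are the well-known SplitMix64 finalizer constants; the class and method names are made up for illustration):

import java.util.Random;

public class Coord2Random {
    // Mix the two inputs into one well-scrambled 64-bit seed.
    static long mix(long x, long y) {
        long z = x * 0x9E3779B97F4A7C15L ^ y;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Stateless: the same (x, y) always yields the same value, so nothing
    // needs to be stored for a large procedural world.
    static double valueAt(long x, long y) {
        return new Random(mix(x, y)).nextDouble();
    }

    public static void main(String[] args) {
        System.out.println(valueAt(10, 20)); // identical on every run
        System.out.println(valueAt(10, 21)); // neighbouring cell, new value
    }
}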
