Memory issue in Party package, cforest - party

I am currently building a random forest.
I have made my dataset very small: just a binary variable and three numeric variables (double precision). There are around 400,000 samples.
Model_cforest <- cforest(result ~ score1 + score2 + score3
, data=trainData, controls = cforest_unbiased(ntree=100))
I really don't think it would consume too much memory, as the package randomForest can handle this easily if I disable approximation.
However, the cforest function in the party package ate up all of my memory (16GB, at least 12GB free) and still wanted more. I had no choice but to terminate the program. I believe there must be something wrong in my settings, but I cannot figure it out.
Would you please help me out? Thanks

Related

Impurity or Randomness in Haskell's profiling

I'm implementing RayTracingInOneWeekend and I optimized it from 33m to 23s for a 384x216 scene with the parameters given in the article. However, when I profile it, the entries (the 5th column from the left in the screenshot below) change on almost every run. How is that possible? In my program everything stays the same, including even the random number generators, as the generators are created as follows (you can see it on github):
g = mkStdGen (i * width + j)
If width and height stay the same, then all g (one for each pixel) should stay the same as well. However, as you can see, the two screenshots have different values in the entries column.
What could be the reason behind this impurity? Or is the profiler simply unable to gather all the information, so that the numbers are not exact (meaning that the real frequency of function calls differs from the numbers shown above)? The docs, however, do not say anything like that.
My program builds with cabal v2-build -O2 --enable-profiling --enable-executable-profiling and I don't pass -prof -fprof-auto to ghc-options (I guess cabal takes care of that). I've also used -threaded and the parallel library.
I'm on GHC 8.6.5 and Cabal 3.2.
It looks like the profiler in multi-core mode does not produce consistent counts; I'm not sure if that counts as a bug. I ran the program a couple of times without passing -N to the RTS, and now I see the same entries count every time:
I'm not sure whether that proves that my program does not have any impurity. I'm still looking for a better and more plausible explanation (if there is one at all).

How can this function in Haskell be optimised

As part of an advent of code challenge, I've written the following functions in Haskell:
simulateUntilRepeat_int a b i = if (a /= b) then (simulateUntilRepeat_int a (updateCycle b) (i+1)) else i
simulateUntilRepeat a = simulateUntilRepeat_int a (updateCycle a) 1
The purpose of this is to take a list of moons and simulate their movement until they resume their original position, returning the number of cycles it took for them to get there. (the function updateCycle does one iteration of the simulation). However, when I attempt to run this it uses all available memory and then gets killed by the operating system. The question does admit that this may take a very large number of cycles.
Googling around about this problem I find the usual fix is to make some of the parameters strict, but I think I've experimented with all possible permutations of strictness on the parameters to no avail. By the looks of this function, I'd have anticipated the compiler would be able to use the tail recursion optimisation and turn it into a loop, but this seems to not be happening somehow.
A friend of mine who is knowledgeable in Haskell suggested changing the form of the function to the following:
f a b0 = length (takeWhile (/= a) (iterate updateCycle b0))
But doing this didn't fix it either, leaving me out of ideas.
The comments are undoubtedly correct that your approach is not the intended solution method.
However, the functions you've posted would not, in and of themselves, cause a memory leak, fail to tail recurse, or lead to poor performance. Given your code above plus the definitions:
updateCycle 4686774942 = 0
updateCycle n = n+1
main = do
print $ simulateUntilRepeat (0 :: Int)
and compiling with -O2, the program runs in constant memory on my laptop in about 30 seconds. Adding explicit type signatures to use Int in place of Integer for the iteration count:
simulateUntilRepeat_int :: Int -> Int -> Int -> Int
simulateUntilRepeat :: Int -> Int
it runs in about 2.4 seconds.
So, to understand why your program is gobbling all available memory or why your strictness annotations failed to make a difference, it would probably be necessary to see the whole working program (or preferably a minimal example that illustrates the performance problem). If the program is short, and the question is "why is the performance of this program totally unreasonable?" instead of "how can I optimize my program to run as fast as possible?", it might still be a good SO question. Otherwise, the Code Review site might be better -- you can post a larger program there and ask for general performance advice, and that's considered on-topic for that site.

How many lines of machine code are generated by one statement in programming language X?

Reading an article about Lost Programming Skills, the author brings up this chat:
Me: How much horsepower do you need?
SE: I don't know.
Me: Let's see, how many lines of code in your main loop?
SE: 10,000.
Me: what language?
SE: Fortran
Me: ok, that's about 10 lines of machine code per line of Fortran, so
100,000 instructions per loop; how many times does the loop execute per
second?
SE: every 1/20th of a second.
Me: OK, so that's 20 x 100,000 = 2mops (which was faster than anything we had
at the time), maybe we'd better rethink this.
Which makes me wonder, what is the number for modern languages, say Ruby? How does one find out?
I don't think there is an exact number of the form "for language X the compiled binary has Y lines per source code line". But if you still want to find out, you could take a large amount of compiled code and the corresponding source code and compute the average per source code line.
You can open the binary with any binary editor or disassembler, for example OllyDbg, to see how many instructions were generated.
In terms of determining how long a piece of code will take to execute, that doesn't even really work for Fortran any more! If you write this in Fortran 90:
SUBROUTINE foo(x, y)
IMPLICIT NONE
REAL, DIMENSION(:), INTENT(IN) :: x
REAL, DIMENSION(:), INTENT(OUT) :: y
y = EXP(x)
END SUBROUTINE foo
the line that says y = EXP(x) can take arbitrarily long to execute, depending on the size of the arrays x and y. The same goes for any language with vector assignment.
In the chat they were trying to estimate CPU performance.
If you know the CPU performance and the execution time of the loop, you can get the number of CPU instructions per loop and then per line.
The calculation in your chat is not precise.
You can do a similarly imprecise calculation even for Ruby.
Be aware that it is wrong to say that one Fortran line always equals 10 CPU instructions, but as an average for that particular loop it was true.
Estimate the time taken by your loop in Ruby.
Multiply your CPU performance (in operations per second) by the loop time. You will get operations per loop.
Divide operations per loop by the number of lines in the loop. That is the value for your loop.
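The estimate can be run in both directions, as in this rough Python sketch. The function names are made up for illustration, and the figure of 10 machine instructions per source line is only the rule of thumb quoted in the chat, not a measured value.

```python
def required_ops_per_second(lines_in_loop, instrs_per_line, loops_per_second):
    """Forward estimate: instruction rate needed to keep up with the loop."""
    return lines_in_loop * instrs_per_line * loops_per_second


def average_instructions_per_line(cpu_ops_per_second, loop_seconds, lines_in_loop):
    """Reverse estimate: average machine instructions per source line."""
    ops_per_loop = cpu_ops_per_second * loop_seconds
    return ops_per_loop / lines_in_loop


# Numbers from the chat: 10,000 lines, about 10 instructions per line,
# and 20 loop iterations per second.
needed = required_ops_per_second(10_000, 10, 20)
print(needed)  # 2000000, i.e. 2 million operations per second
```

The reverse function only gives an average for one particular loop on one particular machine; it says nothing about any individual line.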
For X="C#" you might want to take a look at Faster Managed Code: Know What Things Cost from Microsoft. It explains that modern languages in particular are heavily optimized before the code actually touches the hardware.

Haskell - simple way to cache a function call

I have functions like:
millionsOfCombinations = [[a, b, c, d] |
a <- filter (...some filter...) someListOfAs,
b <- (...some other filter...) someListOfBs,
c <- someListOfCs, d <- someListOfDs]
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful]
Evaluating millionsOfCombinations takes 40 s on a very fast workstation. Evaluating aLotOfCombinationsOfCombinations!!0 took 2 days :-(
How can I speed up this code? So far I've had 2 ideas - use a profiler. Tried running myapp +RTS -sstderr after compiling with GHC, but get a blank screen and don't want to wait days for it to finish.
2nd thought was to somehow cache millionsOfCombinations. Do I understand correctly that for each value in aLotOfCombinationsOfCombinations, millionsOfCombinations gets evaluated multiple times? If that is so, how can I cache the result? Obviously I've just started learning Haskell. I know there is a way to do call caching with a monad, but I still don't understand those things.
Use the -fforce-recomp, -O2 and -fllvm flags
If you aren't already, be sure to use the above flags. I wouldn't normally mention it, but I've seen some questions recently that didn't know powerful optimization isn't a default.
Profile Your Code
The -sstderr flag isn't exactly profiling. When people say profiling they're usually talking about either heap profiling or time profiling via -prof and -auto-all flags.
Avoid Costly Primitives
If you need the entire list in memory (i.e. it isn't going to be optimized away) then consider unboxed vectors. If Int will do instead of Integer, consider that (but Integer is a reasonable default when you don't know!). Use worker/wrapper transforms at the right times. If you're leaning heavily on Data.Map, try using Data.HashMap from the unordered-containers library. This list can go on and on, but since you don't already have an intuition on where your computation time is going, the profiling should come first!
I don't think there is a way. Note that the time to generate the list grows with each list involved, so you get around 1000000^3 combinations to check, which indeed takes a lot of time. Caching the list is possible but is unlikely to change anything, since new elements can be generated almost instantly. Probably the only way is to change the algorithm.
If millionsOfCombinations is a constant (and not a function with arguments), it is cached automatically. Else, make it a constant by using a where clause:
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
comb1 <- millionsOfCombinations,
comb2 <- millionsOfCombinations,
comb3 <- someList,
...around 10 function calls to find if
[comb1, comb2, comb3] is actually useful] where
millionsOfCombinations = makeCombination xyz

Generating random number in a given range in Fortran 77

I am a beginner trying to do some engineering experiments using Fortran 77. I am using the Force 2.0 compiler and editor. I have the following queries:
How can I generate a random number within a specified range? For example, if I need to generate a single random number between 3.0 and 10.0, how can I do that?
How can I read data from a text file and use it in calculations in my program? For example, I have temperature, pressure and humidity values (hourly values for a day, so 24 values in total in each text file).
Do I also need to define in the program how many values are there in the text file?
Knuth has released into the public domain sources in both C and FORTRAN for the pseudo-random number generator described in section 3.6 of The Art of Computer Programming.
2nd question:
If your file, for example, looks like:
hour temperature pressure humidity
00 15 101325 60
01 15 101325 60
... 24 of them, for each hour one
this simple program will read it:
      implicit none
      integer i, hour, temp, hum
      real p
      character*80 junkline
      open(unit=1, file='name_of_file.dat', status='old')
      rewind(1)
C     skip the header line
      read(1,'(A)') junkline
      do 10 i=1,24
         read(1,*) hour, temp, p, hum
C        do something here ...
   10 continue
      close(1)
      end
(this is fixed-form source: statements start in column 7, statement labels go in columns 1 to 5, and a C in column 1 marks a comment)
My advice: read up on data types (INTEGER, REAL, CHARACTER), arrays (DIMENSION), input/output (READ, WRITE, OPEN, CLOSE, REWIND), and loops (DO), and you'll be doing useful stuff in no time.
I never did anything with random numbers, so I cannot help you there, but I think there are some intrinsic functions in Fortran for that. I'll check it out and report tomorrow. As for the 3rd question, I'm not sure what you meant (you don't know how many lines of data you'll have in a file? or?)
You'll want to check your compiler manual for the specific random number generator function, but chances are it generates random numbers between 0 and 1. This is easy to handle - you just scale the interval to be the proper width, then shift it to match the proper starting point: i.e. to map r in [0, 1] to s in [a, b], use s = r*(b-a) + a, where r is the value you got from your random number generator and s is a random value in the range you want.
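The scale-and-shift step is the same in any language. Here is a minimal sketch in Python rather than Fortran, where scale_to_range is a made-up helper name and random.random() stands in for whatever generator your compiler provides:

```python
import random


def scale_to_range(r, a, b):
    """Map r from the unit interval onto [a, b] using s = r*(b-a) + a."""
    return r * (b - a) + a


r = random.random()               # uniform value in [0.0, 1.0)
s = scale_to_range(r, 3.0, 10.0)  # random value between 3.0 and 10.0
print(s)
```

The endpoints show that the mapping is right: r = 0 gives a and r = 1 gives b.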
Idigas's answer covers your second question well - read in data using formatted input, then use them as you would any other variable.
For your third question, you will need to define how many lines there are in the text file only if you want to do something with all of them - if you're looking at reading the line, processing it, then moving on, you can get by without knowing the number of lines ahead of time. However, if you are looking to store all the values in the file (e.g. having arrays of temperature, humidity, and pressure so you can compute vapor pressure statistics), you'll need to set up storage somehow. Typically in FORTRAN 77, this is done by pre-allocating an array of a size larger than you think you'll need, but this can quickly become problematic. Is there any chance of switching to Fortran 90? The updated version has much better facilities for dealing with standardized dynamic memory allocation, not to mention many other advantages. I would strongly recommend using F90 if at all possible - you will make your life much easier.
Another option, depending on the type of processing you're doing, would be to investigate algorithms that use only single passes through data, so you won't need to store everything to compute things like means and standard deviations, for example.
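One well-known single-pass approach is Welford's method, which updates the mean and the sum of squared deviations as each value arrives. The sketch below is in Python for brevity and is only an illustration; the function name and the sample temperatures are made up.

```python
import math


def running_stats(values):
    """Single-pass mean and sample standard deviation (Welford's method)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        # Standard deviation is undefined for fewer than two values.
        return count, mean, 0.0
    return count, mean, math.sqrt(m2 / (count - 1))


# Hypothetical hourly temperatures; in practice each value would be
# read from the file one line at a time.
count, mean, std = running_stats([15.0, 15.5, 16.0, 17.0])
print(count, mean, std)
```

Only three scalars are kept between values, so the same loop translates directly to FORTRAN 77 without any arrays.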
This subroutine generates a random number in Fortran 77 between 0 and ifin, where i is the seed (some large number such as 746397923):
      subroutine rnd001(xi,i,ifin)
      integer*4 i,ifin
      real*8 xi
      i=i*54891
      xi=i*2.328306e-10+0.5D00
      xi=xi*ifin
      return
      end
You can modify it to cover a specific range.

Resources