Impurity or Randomness in Haskell's profiling

I'm implementing RayTracingInOneWeekend, and I've optimized it from 33m down to 23s for a 384x216 image with the scene and parameters as given in the book. However, when I profile it, the entries count (the 5th column from the left in the profiler output) changes on almost every run. How is that possible? Everything in my program stays the same, including even the random number generators, since each generator is created as (you can see it on github):
g = mkStdGen (i * width + j)
If width and height stay the same, then every g (one per pixel) should stay the same as well. However, two profiling runs show different values in the entries column.
What could be the reason behind this impurity? Or is the profiler simply unable to gather all the information, so the numbers are not exact (i.e., the real call counts differ from the numbers shown)? The docs, however, don't say anything like that.
My program builds with cabal v2-build -O2 --enable-profiling --enable-executable-profiling, and I don't pass -prof -fprof-auto to ghc-options (I assume cabal takes care of that). I've also used -threaded and the parallel library.
I'm on GHC 8.6.5 and Cabal 3.2.
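For reference, a minimal sketch of the per-pixel generator setup described above (pixelGen and its signature are illustrative, not taken from the repo):

import System.Random (StdGen, mkStdGen)

-- One deterministic generator per pixel: for a fixed width,
-- the same (i, j) always yields the same generator.
pixelGen :: Int -> Int -> Int -> StdGen
pixelGen width i j = mkStdGen (i * width + j)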

It looks like the profiler does not produce consistent counts in multi-core mode; I'm not sure whether that counts as a bug. I ran the program a couple of times without passing -N to the RTS, and now I see the same entries count on every run.
I'm not sure that proves my program is free of impurity. I'm still looking for a better, more plausible explanation (if there is one at all).
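To illustrate the workaround (the executable name here is hypothetical; -p writes the .prof report, and -N requires the -threaded build used above):

$ ./raytracer +RTS -p -RTS        # single capability: entry counts stable across runs
$ ./raytracer +RTS -N4 -p -RTS    # four capabilities: entry counts may vary per run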

Related

React Native "dropped so far" perf monitor count constantly increases

I originally noticed this issue in my completed app, but have installed the default react-native app to test, and I'm seeing the "dropped so far" number in the perf monitor constantly creep up even though nothing is happening.
Is this number supposed to increase constantly?
Yes, though it's not always constant.
(I'm assuming you mean constant as in the same value; if you just mean that it never stops, you can ignore the extra explanation of how it works below.)
To understand the logic of "dropped so far", you can look at the React Native codebase. You will find the code for the perf monitor in the FpsView.java file. In it, you can see which variable (droppedUIFrames) is used for the "dropped so far" value (line 67). If you follow this all the way back, you get to the FPSMonitorRunnable class, which uses the mTotalFramesDropped variable to keep track of frames dropped so far (line 79). This class is just a loop updating the variables being reported. The line you'll be interested in is this one, on line 90:
mTotalFramesDropped += mFrameCallback.getExpectedNumFrames() - mFrameCallback.getNumFrames();
From this, you can see that yes, this value is a counter that simply increases but never gets reset while the perf monitor is running. You can also see that it isn't constant (fixed value); in your case, it probably happens to appear constant because you are on the "hello world" screen where nothing interesting is happening.
So based on Michael Cheng's answer I delved into the RN code a bit more, and dug out EXPECTED_FRAME_TIME, which is set to 16.9 (milliseconds, à la the classic 60 fps magic number).
The reason the dropped-frames counter was constantly (i.e. continually) increasing is that RN expects to run at 60 fps and treats any framerate below that as dropped frames.
However, having tested this particular tablet with various framerate-testing apps, the tablet's native framerate appears to be 51.9 fps. I don't know why that is; it seems a particularly arbitrary number, but in all my testing the framerate never went above 52 and mostly hovered at 51.
So to answer my question: "dropped so far" means how many frames have come in below the 60 fps target, and as for "should it increase continually?", yes, if the device is only capable of drawing fewer than 60 fps anyway.
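To put rough numbers on that (my arithmetic, not from the RN source): 16.9 ms per frame works out to 1000 / 16.9 ≈ 59.2 expected frames per second, so a device rendering at 51.9 fps accumulates roughly 7 dropped frames every second, or over 400 per minute, even when nothing is wrong.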

Even when using the same randomseed in Lua, get different results?

I have a large, rather complicated procedural content generation Lua project. One thing I want to be able to do, for debugging purposes, is use a fixed random seed so that I can re-run the system and get the same results.
To that end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is: what other ways are there to influence the output of Lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16: monkey patching math.random & math.randomseed has oftentimes (but not always) output the same sequence of random numbers, yet still not the same results. So I guess the real question is now: what behavior in Lua is indeterminate and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (This code does NOT use coroutines, so I don't think it's a threading / race-condition issue.)
math.randomseed uses the C srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with the srandom or random function on your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
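A minimal sketch of that monkeypatching approach (the log file name and formatting are illustrative):

local origSeed, origRandom = math.randomseed, math.random
local log = assert(io.open("random-trace.txt", "w"))

math.randomseed = function(seed)
  -- record every (re)seeding together with where it was called from
  log:write(("randomseed(%s) at %s\n"):format(tostring(seed), debug.traceback("", 2)))
  return origSeed(seed)
end

math.random = function(...)
  local value = origRandom(...)
  -- record every draw so two runs can be diffed line by line
  log:write(("random -> %s\n"):format(tostring(value)))
  log:flush()
  return value
end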
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output the result of a call to math.random; if this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc.).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (i.e., first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone was enough, you would have already found it, thus a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays & test if their elements match up & return / print the index of the first mismatch if they do not. Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
I cannot stress enough the insanity of using an unverified (and thus, potentially bugged) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.
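A sketch of such a comparison function, assuming flat arrays of comparable values (the names follow the setup above):

-- returns nil if the arrays match, otherwise the index of the first mismatch
local function firstMismatch(a, b)
  for i = 1, math.max(#a, #b) do
    if a[i] ~= b[i] then
      return i
    end
  end
  return nil
end

-- sanity checks before trusting the tester itself
assert(firstMismatch(original, debug01) == nil)
assert(firstMismatch(original, debug02) ~= nil)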

Is there a compiler (for GCC and/or Clang) setting that limits compiled function length?

For the program I'm working on, I'd like to limit the length of each compiled function, so as to provide a hard upper bound on the distance¹ required to reach a function boundary². Is there an option in GCC or Clang (or really any compiler framework/toolchain) that will enable function splitting to do this? Or are there limitations that I'm not aware of preventing this?
¹ Distance here is defined as any discrete unit smaller than a function, i.e., number of instructions, number of basic blocks, number of grey hairs on Jon Skeet's head³, etc.
² I'm defining a function boundary as "a location where a new stack frame is pushed onto the CPU's stack". To my understanding, this happens almost exclusively when a new function is called (except occasionally for leaf functions that don't themselves call other functions).
³ This is just a joke. We all know that Jon Skeet's hair doesn't turn grey; it just garbage collects and a new hair is instantiated, good as new.
I'm not aware of any compiler switch, but you don't need one. The size of symbols in the text segment is easily obtained with nm:
$ nm -AP a.out|awk '$3=="T" {print $2 " " $5}'
main 000000000000005b
Note that this requires an unstripped executable. Many nm implementations provide additional options, such as printing decimal numbers instead of hex, which makes comparing the sizes a little easier. Turning this into a script that reports functions larger than X is left as an exercise :-)
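One possible version of that exercise, assuming GNU nm (for the -t d decimal-radix option) and an arbitrary 512-byte threshold:

$ nm -A -P -t d a.out | awk -v max=512 '$3 == "T" && $5 + 0 > max {print $2, $5}'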

How to get variable/function definitions set in Parallel (e.g. with ParallelMap)?

I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap. It references another, similar function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes), during which all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica's spukhafte Fernwirkung ("spooky action at a distance")?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes; perhaps it is so slow that it just appears the processors are doing nothing, when in fact they are waiting to send the next bit of some definition or other. In that case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value in all copies of the variable across kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because the kernels queue up waiting for the variable IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of; what your code actually calls is SetSharedFunction. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
  ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, that call is not correct as written anyway, since it omits the maxIndex variable, but ParallelMap automatically takes care of the omission in this particular case.)
EDIT, NB: the automatic distribution applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.

Haskell - simple way to cache a function call

I have functions like:
millionsOfCombinations = [[a, b, c, d] |
    a <- filter (...some filter...) someListOfAs,
    b <- (...some other filter...) someListOfBs,
    c <- someListOfCs, d <- someListOfDs]

aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
    comb1 <- millionsOfCombinations,
    comb2 <- millionsOfCombinations,
    comb3 <- someList,
    ...around 10 function calls to find if
    [comb1, comb2, comb3] is actually useful]
Evaluating millionsOfCombinations takes 40 seconds on a very fast workstation. Evaluating aLotOfCombinationsOfCombinations!!0 took 2 days :-(
How can I speed up this code? So far I've had two ideas. First, use a profiler. I tried running myapp +RTS -sstderr after compiling with GHC, but I got a blank screen and don't want to wait days for it to finish.
My second thought was to somehow cache millionsOfCombinations. Do I understand correctly that millionsOfCombinations gets evaluated multiple times, once for each value in aLotOfCombinationsOfCombinations? If so, how can I cache the result? Obviously I've just started learning Haskell. I know there is a way to do call caching with a monad, but I still don't understand those things.
Use the -fforce-recomp, -O2 and -fllvm flags
If you aren't already, be sure to use the above flags. I wouldn't normally mention it, but I've seen some questions recently from people who didn't realize that aggressive optimization isn't on by default.
Profile Your Code
The -sstderr flag isn't exactly profiling. When people say profiling, they usually mean either heap profiling or time profiling via the -prof and -auto-all flags.
Avoid Costly Primitives
If you need the entire list in memory (i.e. it isn't going to be optimized away) then consider unboxed vectors. If Int will do instead of Integer, consider that (but Integer is a reasonable default when you don't know!). Use worker/wrapping transforms at the right times. If you're leaning heavily on Data.Map, try using Data.HashMap from the unordered-containers library. This list can go on and on, but since you don't already have an intuition on where your computation time is going the profiling should come first!
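As a small illustration of the unboxed-vector suggestion (the data here is made up):

import qualified Data.Vector.Unboxed as U

-- An unboxed vector stores plain Ints contiguously, with no
-- per-element pointers or thunks, unlike [Int].
sums :: U.Vector Int
sums = U.fromList [a + b | a <- [1 .. 1000], b <- [1 .. 1000]]

main :: IO ()
main = print (U.sum sums)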
I think there is no way. Notice that the time to generate the list grows with each list involved, so you get around 1,000,000³ combinations to check, which indeed takes a lot of time. Caching the list is possible, but is unlikely to change anything, since new elements can be generated almost instantly. The only way out is probably to change the algorithm.
If millionsOfCombinations is a constant (and not a function with arguments), it is cached automatically. Otherwise, make it a constant by using a where clause:
aLotOfCombinationsOfCombinations = [[comb1, comb2, comb3] |
    comb1 <- millionsOfCombinations,
    comb2 <- millionsOfCombinations,
    comb3 <- someList,
    ...around 10 function calls to find if
    [comb1, comb2, comb3] is actually useful]
  where
    millionsOfCombinations = makeCombination xyz
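The same sharing works at the top level, too; a binding with no arguments (a CAF) is evaluated at most once per program run, so a sketch like this (makeCombination and xyz are placeholders from the answer above) would also reuse the list across all consumers:

-- Computed on first demand, then shared by every consumer.
millionsOfCombinations :: [[Int]]
millionsOfCombinations = makeCombination xyz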
