Fortran implied do write speedup - performance

tl;dr: I found that an "implied do" write was slower than an explicit one under certain circumstances, and want to understand why/if I can improve this.
Details:
I've got a code that does something to the effect of:
DO i=1,n
  ! calculations...
  ! m, x, and y all change each pass through the loop
  IF(m.GT.1)THEN
    DO j=1,m
      WRITE(10,*)x(j),y(j) ! where 10 is an output file
    ENDDO
  ENDIF
ENDDO
The output file ends up being fairly large, and the writing appears to be a big performance factor, so I wanted to optimize it. Before anyone asks, no, moving away from ASCII isn't an option due to various downstream requirements. Accordingly, I rewrote the IF statement (and contents) as:
IF(m.GT.1)THEN
  !build format statement for write
  WRITE(mm1,*)m-1
  mm1=ADJUSTL(mm1)
  !implied do write statement
  WRITE(10,'('//TRIM(mm1)//'(i9,1x,f7.5/),i9,1x,f7.5)')(x(j),y(j),j=1,m)
ELSEIF(m.EQ.1)THEN
  WRITE(10,'(i9,1x,f7.5)')x(1),y(1)
ENDIF
This builds the format statement according to the number of values to be written out, then does a single WRITE statement to output everything. I've found that the code actually runs slower with this formulation. For reference, I've seen significant speedup on the same system (hardware and software) when going to an implied do WRITE statement when the amount of data to be written was fixed. Under the assumption that the WRITE statement itself is faster, the overhead from the couple of lines building the format string would have to account for the added time, but that seems hard to believe. For reference, m can vary a fair amount, but probably averages at least 1000. Is string concatenation (//) a very slow operator, or is there something else I'm missing? Thanks in advance.

I don't have specific timing information to add, but your data transfer with an implied do loop is needlessly complicated.
In the first fragment, with the explicit looping, you are writing each pair of numbers to a distinct record, and you wish to repeat this output with the implied do loop. To do this, you use the slash edit descriptor to terminate each record once a pair has been written.
The needless complexity comes from two areas:
you have distinct cases for one/more than one pair;
for the more-than-one case you construct a format including a "dynamic" repeat count.
As Vladimir F comments you could just use a very large repeat count: it isn't erroneous for an edit descriptor to be processed when there are no more items to be written. The output terminates (successfully) when reaching such a non-matching descriptor. You could, then, just write
WRITE(10,'(*(i9,1x,f7.5/))') (x(j),y(j),j=1,m) ! * replacing a large count
rather than the if construct and the format creation.
Now, this doesn't quite match your first output. As I mentioned above, output termination comes about when a data edit descriptor is reached when there is no corresponding item to output. This means that / will be processed before that happens: you have a final empty record.
The colon edit descriptor is useful here:
WRITE(10,'(*(i9,1x,f7.5,:,/))') (x(j),y(j),j=1,m)
On reaching a : processing stops immediately if there is no remaining output item to process.
But my preferred approach is the far simpler
WRITE(10,'(i9,1x,f7.5)') (x(j),y(j),j=1,m) ! No repeat count
You had the more detailed format to include record termination. However, we have what is known as format reversion: if a format end is reached and more remains to be output then the record is terminated and processing goes back to the start of the format.
Whether these things make your output faster remains to be seen, but they certainly make the code itself much cleaner and clearer.
As a final note, it used to be trendy to avoid additional X editing. If your numbers fit inside a field of width 7 then 1x,f7.5 could be replaced by f8.5 and have the same look: the representation is right-justified in the field. It was claimed that this reduction had performance benefits, with less switching between descriptors.

Related

Even when using the same randomseed in Lua, I get different results?

I have a large, rather complicated procedural content generation Lua project. One thing I want to be able to do, for debugging purposes, is use a fixed random seed so that I can re-run the system & get the same results.
To that end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is, what other ways are there to influence the output of Lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16: monkey patching math.random & math.randomseed has, oftentimes (but not always), output the same sequence of random numbers. But still not the same results – so I guess the real question is now: what behavior in Lua is indeterminate, and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (This code does NOT use coroutines, so I don't think it's a threading / race condition issue.)
randomseed uses the C srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with srandom or random function in your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
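For what it's worth, here is that logging monkeypatch sketched against Python's stdlib random module rather than Lua's math.random/math.randomseed (the wrapper names and call_log are just illustrative; the wrapping idea transfers directly):

import random

# Keep references to the originals so the wrappers can delegate to them.
_orig_seed = random.seed
_orig_random = random.random

call_log = []  # one entry per seed()/random() call, for diffing two runs

def logged_seed(*args, **kwargs):
    call_log.append(("seed", args))
    return _orig_seed(*args, **kwargs)

def logged_random():
    value = _orig_random()
    call_log.append(("random", value))
    return value

# Monkeypatch: every caller of random.seed/random.random is now logged.
random.seed = logged_seed
random.random = logged_random

Run the program twice with the same seed, dump call_log to a file after each run, and diff the two dumps; the first divergence points at the extra, missing, or re-seeding call.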
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output a call to math.random - if this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (i.e. first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone was enough, you would have already found it, thus a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays, testing whether their elements match up & returning / printing the index of the first mismatch if they do not (see the sketch at the end of this answer). Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
I cannot stress enough the insanity of using an unverified (and thus, potentially bugged) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.
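For concreteness, a minimal sketch of that verified comparison helper, in Python rather than Lua (first_mismatch and the debug names are illustrative):

import copy

def first_mismatch(a, b):
    """Return the index of the first element where two lists differ,
    or None if they match (lengths included)."""
    if len(a) != len(b):
        return min(len(a), len(b))  # one list is a strict prefix of the other
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

original = [3, 1, 4, 1, 5, 9]
debug01 = copy.deepcopy(original)
debug02 = copy.deepcopy(original)
debug02[1], debug02[2] = debug02[2], debug02[1]  # deliberately swap two elements

# Verify the verifier before trusting it:
assert first_mismatch(original, debug01) is None  # copies match
assert first_mismatch(original, debug02) == 1     # mismatch exactly at the swap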

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so am trying to optimize this. I've read in multiple places that writing entire arrays is faster than individual elements, i.e. WRITE(10) arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But I'm unclear where my case falls in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
  ! [calculations to determine m values stored in array arr]
  WRITE(10) m
  DO j=1,m
    WRITE(10) arr(j)
  ENDDO
ENDDO
But m may change each time through the DO i=1,n loop, so writing the whole array arr isn't an option. Collapsing the inner DO loop for writing would end up as WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up for writing? What about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable optimizations are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that PGI's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one at a time.
Compiler flags: -O3 (high optimization), -fastsse (various performance enhancements, optimized for SSE hardware), -Mipa=fast,inline (aggressive inter-procedural analysis/optimization in the compiler)
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, showing a reduction in runtime of about 30% for the WRITE code; the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file, to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if each record holds a single double-precision number. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is the subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O in stream access, not because of the rather insignificant saving in space (see above) but because keeping the leading record marker up to date while writing requires a seek, which is additional effort.
If your data size is very large, above ~2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible with Intel), but it should just work. I don't know what Portland does in this case.
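To make the record-marker overhead concrete, here is a rough Python sketch that walks a sequential unformatted file as gfortran typically lays it out: a 4-byte length marker before and after each record. Both the marker width and the byte order are compiler- and platform-dependent, so treat this as illustrative only.

import struct

def record_sizes(path):
    """Yield (payload_bytes, marker_bytes) for each record of a
    sequential unformatted file with 4-byte record markers."""
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if len(head) < 4:
                return                        # end of file
            (n,) = struct.unpack("<i", head)  # leading marker holds the length
            f.seek(n, 1)                      # skip over the payload
            f.read(4)                         # trailing marker repeats the length
            yield n, 8                        # 8 marker bytes per record

# One double per record means 8 payload bytes carry 8 marker bytes (100%
# overhead); WRITE(10) arr(1:m) makes a single record, so the same 8 bytes
# are amortized over all 8*m payload bytes.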
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

How to get variable/function definitions set in Parallel (e.g. with ParallelMap)?

I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap. The function references another, similar function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes) where all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes; perhaps it is so slow that it just appears that the processors are doing nothing when in fact they are just waiting to send the next bit of some definition or other. In which case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value in all copies of the variable across kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because kernels line up in a queue waiting for the variable IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, it is not correct as written anyway, omitting the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB!: The automatic distribution applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.

Why does the Matlab Profiler say there is a bottleneck on the 'end' statement of a 'for' loop?

So, I've recently started using Matlab's built-in profiler on a regular basis, and I've noticed that while it's usually great at showing which lines are taking up the most time, sometimes it'll tell me a large chunk of time is being used on the end statement of a for loop.
Now, seeing as such a line is just used for denoting the end of the loop, I can't imagine how it could use anything other than a trivial amount of processing.
I've seen a specific version of this question asked on MATLAB Central, but a consensus didn't seem to be reached.
EDIT: Here's a minimal example of this problem:
for i = 1:1000
    x = 1;
    x = [x 1];
    % clear x;
end
Even if you uncomment the clear, the end line still accounts for a lot of the computation (about 20%), and the clear actually increases the absolute amount of computation attributed to the end line.
When I've seen this in my code, it's been the deallocation of large temporaries created in the loop. Each new variable created in the loop is deallocated at the end.

Detecting infinite loop in brainfuck program

I have written a simple brainfuck interpreter in MATLAB script language. It is fed random bf programs to execute (as part of a genetic algorithm project). The problem I face is that the programs turn out to have an infinite loop in a sizeable number of cases, and hence the GA gets stuck at that point.
So, I need a mechanism to detect infinite loops and avoid executing that code in bf.
One obvious (trivial) case is when I have
[]
I can detect this and refuse to run that program.
For the non-trivial cases, I figured the basic idea is to determine how one iteration of the loop changes the current cell. If the change is negative, we're eventually going to reach 0, so it's a finite loop. Otherwise, if the change is non-negative, it's an infinite loop.
Implementing this is easy for the case of a single loop, but with nested loops it becomes very complicated. For example (in what follows, (1) refers to the contents of cell 1, etc.):
++++      Put 4 in 1st cell (1)
>+++      Put 3 in (2)
<[        While( (1) is non zero)
  --        Decrease (1) by 2
  >[        While( (2) is non zero)
    -         Decrement (2)
    <+        Increment (1)
  >]
            (2) would be 0 at this point
  +++       Increase (2) by 3, making (2) = 3
<]        (1) was decreased by 2 and then increased by 3, so the net effect is an increment
and hence the code runs on and on. A naive check of the number of +'s and -'s done on cell 1, however, would say there are more -'s, and so would not detect the infinite loop.
Can anyone think of a good algorithm to detect infinite loops, given arbitrary nesting of arbitrary number of loops in bf?
EDIT: I do know that the halting problem is unsolvable in general, but I was not sure whether special-case exceptions exist. Like, maybe Matlab might function as a super-Turing machine able to determine the halting of the bf program. I might be horribly wrong, but if so, I would like to know exactly how and why.
SECOND EDIT: I have written what I purport to be infinite loop detector. It probably misses some edge cases (or less probably, somehow escapes Mr. Turing's clutches), but seems to work for me as of now.
In pseudocode form, here it goes:
subroutine bfexec(bfprogram)
begin
    Looping through the bfprogram,
        If(current character is '[')
            Find the corresponding ']'
            Store the code between the two brackets in, say, 'subprog'
            Save the value of the current cell in oldval
            Call bfexec recursively with subprog
            Save the value of the current cell in newval
            If(newval >= oldval)
                Raise an 'infinite loop' error and exit
            EndIf
        EndIf
        /* Do other characters' processing */
    EndLoop
end
Alan Turing would like to have a word with you.
http://en.wikipedia.org/wiki/Halting_problem
When I used linear genetic programming, I just used an upper bound for the number of instructions a single program was allowed to execute in its lifetime. I think that this is sensible in two ways: I cannot really solve the halting problem, and programs that take too long to compute are not worth giving more time anyway.
Let's say you did write a program that could detect whether any given program would run into an infinite loop. Let's say for the sake of simplicity that this checker was written in brainfuck to analyze brainfuck programs (though this is not a precondition of the following proof, because any Turing-complete language can emulate brainfuck and brainfuck can emulate any of them).
Now let's say you extend the checker program to make a new program. This new program exits immediately when its input loops indefinitely, and loops forever when its input exits at some point.
If you input this new program into itself, what will the results be?
If this program loops forever when run, then by its own definition it should exit immediately when run with itself as input. And vice versa. The checker program cannot possibly exist, because its very existence implies a contradiction.
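The construction can be sketched in Python, with halts standing in for the assumed checker (it is hypothetical by definition; the contradiction below is exactly what rules it out):

def halts(source, input_data):
    """Hypothetical: True iff running `source` on `input_data` halts.
    Assumed to exist only for the sake of contradiction."""
    raise NotImplementedError

def contrary(source):
    """Halts iff its input, run on itself, does NOT halt."""
    if halts(source, source):
        while True:   # the input halts on itself, so loop forever
            pass
    # the input loops on itself, so return immediately

# Feeding contrary its own source admits no consistent answer: if it
# halts, halts() said it loops; if it loops, halts() said it halts.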
As has been mentioned before, you are essentially restating the famous halting problem:
http://en.wikipedia.org/wiki/Halting_problem
Ed. I want to make clear that the above disproof is not my own, but is essentially the famous disproof Alan Turing gave back in 1936.
State in bf is essentially a single array of chars (plus the data pointer and the current program position).
If I were you, I'd take a hash of the bf interpreter state on every "]" (or once every rand(1, 100) "]"s*) and assert that the set of hashes is unique.
The second (or more) time I see a certain hash, I save the whole state aside.
The third (or more) time I see a certain hash, I compare the whole state to the saved one(s) and if there's a match, I quit.
On every input command (',') I reset my saved states and list of hashes.
An optimization is to only hash the part of state that was touched.
I haven't solved the halting problem - I'm detecting infinite loops while running the program.
*The rand is to make the check independent of loop period
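A minimal sketch of this approach in Python rather than MATLAB (names are illustrative; it stores full states instead of hashes to sidestep collisions, and omits I/O, where you would reset the saved states as described above):

def run_bf(code, tape_len=1000):
    """Toy interpreter that aborts when the exact same machine state
    recurs at the same ']': for a deterministic program with no input,
    that proves an infinite loop. Assumes balanced brackets."""
    stack, match = [], {}
    for i, c in enumerate(code):            # precompute matching brackets
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc = [0] * tape_len, 0, 0
    seen = set()                            # full states observed at ']'
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '[' and tape[ptr] == 0:
            pc = match[pc]                  # skip the loop body
        elif c == ']' and tape[ptr] != 0:
            state = (tuple(tape), ptr, pc)
            if state in seen:
                raise RuntimeError("infinite loop detected")
            seen.add(state)
            pc = match[pc]                  # jump back to the '['
        pc += 1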
An infinite loop cannot be detected in general, but you can detect whether the program is taking too much time.
Implement a timeout by incrementing a counter every time you run a command (e.g. <, >, +, -). When the counter reaches some large number, which you set by observation, you can say that the program is taking a very long time to execute. For your purpose, "very long" is a good-enough approximation of "infinite".
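A sketch of that guard in Python (the step callback and the names are illustrative; in a real interpreter the counter would simply live in the dispatch loop):

def run_with_budget(step, max_steps=10_000_000):
    """Call step() -- any callable that executes one command and returns
    False once the program halts -- up to max_steps times. The budget is
    set by observation: generous for legitimate programs, small enough
    that runaways are cut off quickly."""
    for _ in range(max_steps):
        if not step():
            return True    # halted within the budget
    return False           # budget exhausted: treat as non-halting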
As already mentioned, this is the Halting Problem.
But in your case there might be a solution: the Halting Problem concerns Turing machines, which have unlimited memory.
If you know you have an upper limit on memory (e.g. you know you don't use more than 10 memory cells), you can execute your program and stop it after a bounded number of steps. The idea is that the computation space bounds the computation time (as you can't write more than one cell per step): once you have executed as many steps as there are distinct memory configurations, some configuration must have repeated, so you can break. E.g. if you have 3 cells, each with 256 possible values, you can have at most 256^3 different states, and so you can stop after executing that many steps. But be careful: there are implicit cells, like the instruction pointer and the registers.
You can cut this even shorter if you save every state configuration: as soon as you detect one you have already seen, you have an infinite loop. This approach is definitely much better in run time, but needs much more space (here it might be suitable to hash the configurations).
This is not the halting problem; however, it is still not reasonable to try to detect halting even in a machine as limited as a 1000-cell BF machine.
Consider this program:
+[->[>]+<[-<]+]
This program will not repeat a state until it has filled up all of memory, which for just 1000 cells will take about 10^300 years.
If I remember correctly, the halting problem proof only applies to some extreme case that involves self-reference. However, it's still trivial to show a practical example of why you can't make an infinite loop detector.
Consider Fermat's Last Theorem. It's easy to create a program that iterates through every number (or in this case 3 numbers), and detects if it's a counterexample to the theorem. If so it halts, otherwise it continues.
So if you have an infinite loop detector, it should be able to prove this theorem, and many many others (perhaps all others, if they can be reduced to searching for counterexamples.)
In general, any program that involves iterating through numbers and only stopping under some condition, would require a general theorem prover to prove if that condition can ever be met. And that's the simplest case of looping there is.
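For concreteness, such a counterexample searcher is only a few lines of Python (fermat_search is an illustrative name); deciding whether this loop halts is precisely deciding Fermat's Last Theorem:

from itertools import count

def fermat_search():
    """Halt iff some a**n + b**n == c**n with n > 2 exists."""
    for bound in count(2):                      # enumerate in growing boxes
        for n in range(3, bound + 1):
            for a in range(1, bound + 1):
                for b in range(a, bound + 1):
                    for c in range(b, bound + 1):
                        if a**n + b**n == c**n:
                            return a, b, c, n   # counterexample found: halt

# fermat_search()  # never returns -- but proving that proves the theorem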
Off the top of my head (and I could be wrong), I would think it would be a little bit difficult to detect whether or not a program has an infinite loop without actually executing the program itself.
As the conditional execution of portions of the program depends on the execution state of the program, it will be difficult to know the particular state of the program without actually executing the program.
If you don't require that a program with an infinite loop run to completion, you could keep an "instructions executed" counter and only execute a finite number of instructions. This way, if a program does get stuck in an infinite loop, the interpreter can terminate it.
