How many lines of machine code are generated by one statement in programming language X? - compilation

Reading an article about Lost Programming Skills, the author brings up this chat:
Me: How much horsepower do you need?
SE: I don't know.
Me: Let's see, how many lines of code in your main loop?
SE: 10,000.
Me: what language?
SE: Fortran
Me: ok, that's about 10 lines of machine code per line of Fortran, so
100,000 instructions per loop; how many times does the loop execute per
second?
SE: every 1/20th of a second.
Me: OK, so that's 20 x 100,000 = 2mops (which was faster than anything we had
at the time), maybe we'd better rethink this.
Which makes me wonder, what is the number for modern languages, say Ruby? How does one find out?

i dont think there would be an exact no. saying "for languange x the compiled binary has y lines per source code line". But if you still want to find out may be you can take a large no. of compiled code and corresponding source code and find out the average per source code line.
You can open the binary with any binary editor to see how many lines it generates. for eg. Ollydbg

In terms of determining how long a piece of code will take to execute, that doesn't even really work for Fortran any more! If you write this in Fortran 90:
SUBROUTINE foo(x, y)
IMPLICIT NONE
REAL, DIMENSION(:), INTENT(IN) :: x
REAL, DIMENSION(:), INTENT(OUT) :: y
y = EXP(x)
END SUBROUTINE foo
the line that says y = EXP(x) can take arbitrarily long to execute, depending on the size of the arrays x and y. The same goes for any language with vector assignment.

In the chat they where trying to estimate CPU performance.
If you know CPU performance and time of execution of the loop you can get number of CPU commands per loop and then per line.
Calculation in your chat is not precises.
You can do similar unprecise calculations even for ruby.
Be aware that it wrong to say that one fortran line is 10 CPU commands BUT is average for certain loop it was true.
Estimate time taken by your loop in ruby.
Multiply your CPU performance (in operations per second) on loop time. You will get operations per second.
Divide operations per second on number of lines in loop. That is your value for your loop.

For X="C#" you might want to take a look at Faster Managed Code: Know What Things Cost from Microsoft. It says, that (particular) modern languages are heavily optimized before actually touching the hardware.

Related

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions, manually. I wrote this simple piece of code that I know is not that optimized probably:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
s1(i) = 0;
for j = t
s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
end
end
plot(t,s1);
I have a core i7-7700HQ coupled with 32 GB of RAM. Matlab is stored on my HDD and my Windows is on my SSD. The problem is that this simple code is taking I think at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only taking 18% of my CPU and 3 GB of RAM for this task. Which is I think probably enough, I don't know. But I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and it takes how much it needs. I don't know if I can increase the CPU usage of it or not.
Is there any solution to how make things a little bit faster? I have like 6 or 7 of these for loops in my homework and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, it highlights the piece of code that is currently running. It is the for loop, the outer one is highlighted)
Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are some Nx1 vectors each and the operator '.*' is called 'dot multiplication'. This is essentially telling matlab to perform multiplication on x1*z1, x[2]*z[2] .... x[n]*z[n] and assign all the values to the corresponding value in the vector y. Additionally, many of the functions in matlab are able to accept vectors as inputs and perform their operations on each element and return an equal size vector with the output at each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can simplify your inner loop to this:
s1(i) = s1(i) + ( rectangularPulse(-1,1,t) * triP(t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle of what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.

polyfit on GPUArray is extremely slow [duplicate]

function w=oja(X, varargin)
% get the dimensionality
[m n] = size(X);
% random initial weights
w = randn(m,1);
options = struct( ...
'rate', .00005, ...
'niter', 5000, ...
'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples
for iter = 1:options.niter
y = w'*X;
for ii = 1:n
% y is a scalar, not a vector
w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end
end
if (any(~isfinite(w)))
warning('Lost convergence; lower learning rate?');
end
end
size(X)= 400 153600
This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed
X=gpuArray(X)
But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.
GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)
Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).
And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).
Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.
Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).
So, as far I can say, you will need to deal with it.

Allocatable arrays performance

There is an mpi-version of a program which uses COMMON blocks to store arrays that are used everywhere through the code. Unfortunately, there is no way to declare arrays in COMMON block size of which would be known only run-time. So, as a workaround I decided to move that arrays in modules which accept ALLOCATABLE arrays inside. That is, all arrays in COMMON blocks were vanished, instead ALLOCATE was used. So, this was the only thing I changed in my program. Unfortunately, performance of the program was awful (when compared to COMMON blocks realization). As to mpi-settings, there is a single mpi-process on each computational node and each mpi-process has a single thread.
I found similar question asked here but don't think (don't understand :) ) how it could be applied to my case (where each process has a single thread). I appreciate any help.
Here is a simple example which illustrates what I was talking about (below is a pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you may see, ZEROSET() has no parallel or MPI stuff. FILE_1.INC, FILE_2, ... , FILE_N.INC are files where ARRAY_1, ARRAY_2 ... ARRAY_N are defined in COMMON blocks. Something like that
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
Where NX, NY, NZ are well defined parameters described with help of PARAMETER directive.
When I use modules, I just destroyed all COMMON blocks, so FILE_I.INC looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
And then just changed "INCLUDE 'FILE_I.INC'" statement above to "USE FILE_I". Actually, when parallel program is executed, one particular process does not need a whole (NX, NY, NZ) domain, so I calculate parameters and then allocate ARRAY_I (only ONCE!).
Subroutine ZEROSET() is executed 0.18 seconds with COMMON blocks and 0.36 with modules (when array's dimensions are calculated runtime). So, the performance worsened by two times.
I hope that everything is clear now. I appreciate you help very much.
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
call Z(A,N)
end subroutine
subroutine Z(A,N)
Integer N
real A(N,N)
do stuff here
end
Then this code:
subroutine X
use Y ! Has allocatable array A(N,N) in it
do stuff here
end subroutine
The compiler will know that the array is NxN and the do loops are over N and be able to take advantage of that fact (most codes work that way on arrays). Also, after any subroutine calls in "do stuff here", the compiler will have to assume that array "A" might have changed sizes or moved locations in memory and recheck. That kills optimization.
This should get you most of your performance back.
Common blocks are located in a specific place in memory also, and that allows optimizations also.
Actually I guess, your problem here is, in combination with stack vs. heap memory, indeed compiler optimization based. Depending on the compiler you're using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check the extent and location of it within the subroutine. Thus, in the fixed sized arrays there won't be nearly no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the 0 setting at all, and instead for example separate the first iteration loop and use it for the initialization, this way you do not have to introduce additional memory accesses, just for initialization with 0. However it would duplicate some code...
I could think of just these reasons when it comes to fortran performance using arrays:
arrays on the stack VS heap, but I doubt this could have a huge performance impact.
passing arrays to a subroutine, because the best way to do that depends on the array, see this page on using arrays efficiently

Print 1 followed by googolplex number of zeros

Assuming we are not concerned about running time of the program (which is practically infinite for human mortals) and using limited amount of memory (2^64 bytes), we want to print out in base 10, the exact value of 10^(googolplex), one digit at a time on screen (mostly zeros).
Describe an algorithm (which can be coded on current day computers), or write a program to do this.
Since we cannot practically check the output, so we will rely on collective opinion on the correctness of the program.
NOTE : I do not know the solution, or whether a solution exists or not. The problem is my own invention. To those readers who are quick to mark this offtopic... kindly reconsider. This is difficult and bit theoretical but definitely CS.
This is impossible. There are more states (10^(10^100)) in the program than there are electrons in the universe (~10^80). Therefore, in our universe, there can be no such realization of a machine capable of executing the task.
First of all, we note that 10^(10^100) is equivalent to ((((10^10)^10)^...)^10), 100 times.
Or 10↑↑↑↑↑↑↑↑↑↑10.
This gives rise to the following solution:
print 1
for i in A(10, 100)
print 0
in bash:
printf 1
while true; do
printf 0
done
... close enough.
Here's an algorithm that solves this:
print 1
for 1 to 10^(10^100)
print 0
One can trivially prove correctness using Hoare logic:
There are no pre-conditions
The post condition is that a one followed by 10^(10^100) zeros are printed
The cycle's invariant is that the number of zeros printed so far is equal to i
EDIT: A machine to solve the problem needs the ability to distinguish between one googolplex of distinct states: each state is the result of printing one more zero than the previous. The amount of memory needed to do this is the same needed to store the number one googolplex. If there isn't that much memory available, this problem cannot be solved.
This does not mean it isn't a computable problem: it can be solved by a Turing machine because a Turing machine has a limitless amount of memory.
There definitely is a solution to this problem in theory, assuming of course you have a machine that is capable of producing that sort of output. I'm pretty sure that a googolplex is larger than the number of atoms in the universe, at least according to what the physicists tell us, so I don't think that any physically realizable model of computation could print it out. However, mathematically speaking, you could define a Turing machine capable of printing out the value by just giving it a googolplex-ish number of states and having each write a zero and then move to the next lower state.
Consider the following:
The console window to which you are printing the output will have a maximum buffer size.
When this buffer size is exceeded, anything printed earlier is discarded, and the user will not be able to scroll back to see it.
The maximum buffer size will be minuscule compared to a googolplex.
Therefore, if you want to mimic the user experience of your program running to completion, find the maximum buffer size of the console you will print to and print that many zeroes.
Hurray laziness!

Generating random number in a given range in Fortran 77

I am a beginner trying to do some engineering experiments using fortran 77. I am using Force 2.0 compiler and editor. I have the following queries:
How can I generate a random number between a specified range, e.g. if I need to generate a single random number between 3.0 and 10.0, how can I do that?
How can I use the data from a text file to be called in calculations in my program. e.g I have temperature, pressure and humidity values (hourly values for a day, so total 24 values in each text file).
Do I also need to define in the program how many values are there in the text file?
Knuth has released into the public domain sources in both C and FORTRAN for the pseudo-random number generator described in section 3.6 of The Art of Computer Programming.
2nd question:
If your file, for example, looks like:
hour temperature pressure humidity
00 15 101325 60
01 15 101325 60
... 24 of them, for each hour one
this simple program will read it:
implicit none
integer hour, temp, hum
real p
character(80) junkline
open(unit=1, file='name_of_file.dat', status='old')
rewind(1)
read(1,*)junkline
do 10 i=1,24
read(1,*)hour,temp,p,hum
C do something here ...
10 end
close(1)
end
(the indent is a little screwed up, but I don't know how to set it right in this weird environment)
My advice: read up on data types (INTEGER, REAL, CHARACTER), arrays (DIMENSION), input/output (READ, WRITE, OPEN, CLOSE, REWIND), and loops (DO, FOR), and you'll be doing useful stuff in no time.
I never did anything with random numbers, so I cannot help you there, but I think there are some intrinsic functions in fortran for that. I'll check it out, and report tomorrow. As for the 3rd question, I'm not sure what you ment (you don't know how many lines of data you'll be having in a file ? or ?)
You'll want to check your compiler manual for the specific random number generator function, but chances are it generates random numbers between 0 and 1. This is easy to handle - you just scale the interval to be the proper width, then shift it to match the proper starting point: i.e. to map r in [0, 1] to s in [a, b], use s = r*(b-a) + a, where r is the value you got from your random number generator and s is a random value in the range you want.
Idigas's answer covers your second question well - read in data using formatted input, then use them as you would any other variable.
For your third question, you will need to define how many lines there are in the text file only if you want to do something with all of them - if you're looking at reading the line, processing it, then moving on, you can get by without knowing the number of lines ahead of time. However, if you are looking to store all the values in the file (e.g. having arrays of temperature, humidity, and pressure so you can compute vapor pressure statistics), you'll need to set up storage somehow. Typically in FORTRAN 77, this is done by pre-allocating an array of a size larger than you think you'll need, but this can quickly become problematic. Is there any chance of switching to Fortran 90? The updated version has much better facilities for dealing with standardized dynamic memory allocation, not to mention many other advantages. I would strongly recommend using F90 if at all possible - you will make your life much easier.
Another option, depending on the type of processing you're doing, would be to investigate algorithms that use only single passes through data, so you won't need to store everything to compute things like means and standard deviations, for example.
This subroutine generate a random number in fortran 77 between 0 and ifin
where i is the seed; some great number such as 746397923
subroutine rnd001(xi,i,ifin)
integer*4 i,ifin
real*8 xi
i=i*54891
xi=i*2.328306e-10+0.5D00
xi=xi*ifin
return
end
You may modifies in order to take a certain range.

Resources