Numerical equivalent of TRUE is -1? - visual-studio

I am using Intel Fortran in Visual Studio 2012 to compile a Fortran code.
When I use logical operators, a standalone logical expression results in T or F as expected. However, if I need the numerical value of T or F (1 or 0), I get -1 when the logical result is T.
For example:
integer*4 a
a = 1
logicval = (node(5,L).gt.0)
numval = 1*(node(5,L).gt.0)
write(*,*) logicval, numval
would output
T, -1
Is there a way in which I can redefine numerical values assigned to T & F?

As others have stated, the Intel Fortran default for this non-standard usage is that integer values with the low bit set (odd) are true, even values (low bit clear) are false. The constant .TRUE. has the bit pattern of -1. It gets more complicated when you do conversions and up until version 17 the compiler hasn't been completely consistent in this. Note that if you use -standard-semantics, -fpscomp logicals is implied.
Please read https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/275071#comment-1548435 and also the version 17 release notes (https://software.intel.com/en-us/articles/intel-fortran-compiler-170-release-notes) for information on the changes in that version for how numeric-logical conversions are handled.
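To see why an all-ones bit pattern reads as -1, and how the "low bit set means true" rule treats it, here is a minimal Python sketch. Python is used only for illustration; `to_int32` and `intel_default_is_true` are hypothetical helper names that model the behaviour described above, not part of any Intel API.

```python
def to_int32(bits):
    """Interpret the low 32 bits of `bits` as a signed two's-complement integer."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


def intel_default_is_true(n):
    """Assumed model of the default rule quoted above: low bit set (odd) is true."""
    return (n & 1) == 1


# A 32-bit word with every bit set, read back as a signed integer.
print(to_int32(0xFFFFFFFF))

# Odd values count as true, even values as false, under the assumed model.
print(intel_default_is_true(-1), intel_default_is_true(1), intel_default_is_true(2))
```

The sketch only models the default rule; with -fpscomp logicals the test would instead be "nonzero means true".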

Yes, that is expected. Intel Fortran's TRUE is indeed -1, because all bits are set to 1.
In addition, your use of integers as logical (Boolean) variables is completely non-standard. You should make your code a few lines longer and always do proper conversion. Your integer_var = 1*(node(5,L).gt.0) is not allowed in Fortran and it will be refused by many compilers. If you drop the 1* gfortran will issue a warning, but your form results in an error.
You can simply convert your logicals and integers in a standard conforming way
if (l) then
    x = 1
else
    x = 0
end if
You can convert Fortran logical arrays to integer arrays with 1 and 0 easily using the MERGE() intrinsic or using WHERE.
An easy fix for Intel Fortran is probably -fpscomp logicals, which switches the treatment of the LOGICAL type to consider anything nonzero as true and makes the .true. constant equivalent to integer 1.
Still be careful because it does not make your program portable, it just works around one portability issue in one particular compiler.
You can use the C interoperable logical kind to match the definition to the Intel C representation:
use iso_c_binding
logical(c_bool) :: variable
These will have values of +1 and 0 as C99 dictates for _Bool. If you need them to be interoperable with int you must do some simple conversion. C_int is not compatible with c_bool.
Depending on the version of Intel Compiler you may need to use -standard-semantics (or -fpscomp logicals) for correct logical(c_bool). I consider this to be very unfortunate.

Related

How to write asm code "bsrl" in golang

I need to write some asm code in golang. I read the question "Is it possible to include inline assembly in Google Go code?", but I do not see how to write it.
Could anyone help me? thanks.
asm ("bsrl %1, %0;"
:"=r"(bits) /* output */
:"r"(value) ); /* input */
All the answers on the question you found say it's not possible to use inline-asm in Go, with any syntax. GNU C inline-asm syntax isn't going to help.
But fortunately, you don't need inline asm for bsr (which finds the bit-index of the highest set bit). Go 1.9 has intrinsic / built-in functions for bitwise operations that are close enough that they should compile efficiently.
Use bits.LeadingZeros32 from the math/bits package to get lzcnt(x), which is 31-bsr(x) for non-zero x. This may cost extra instructions, especially on CPUs which only support bsr, not lzcnt (e.g. Intel pre-Haswell).
Or use Len32(x) - 1
Len32(x) returns the number of bits required to represent x. It returns 0 for x=0 and 1 for x=1, so it's bsr(x) + 1, with defined behaviour for 0 (thus potentially costing extra instructions). Hopefully Len32(x) - 1 can compile directly to a bsr.
Of course, if what you really wanted was lzcnt, then use LeadingZeros32 in the first place.
Note that bsr leaves the destination register unmodified for input = 0. Intel's docs only say the destination is left with an undefined value, so compilers probably don't take advantage of this guarantee, which AMD documents and Intel does provide in hardware.
At least in theory, though, Len32(x) - 1 could compile to a single bsr instruction if the compiler can prove that x is non-zero.
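The relationships between bsr, lzcnt and Len32 used above can be checked with a small sketch. It is written in Python purely as a model: `len32`, `leading_zeros32` and `bsr32` are hypothetical stand-ins for Go's bits.Len32, bits.LeadingZeros32 and the bsr instruction, built on Python's int.bit_length().

```python
def len32(x):
    """Model of bits.Len32: number of bits needed to represent x (0 for x == 0)."""
    return (x & 0xFFFFFFFF).bit_length()


def leading_zeros32(x):
    """Model of bits.LeadingZeros32 (lzcnt): 32 for x == 0."""
    return 32 - len32(x)


def bsr32(x):
    """Model of bsr: index of the highest set bit; only meaningful for x != 0."""
    if x & 0xFFFFFFFF == 0:
        raise ValueError("bsr result is undefined for 0")
    return len32(x) - 1


# For every non-zero input, lzcnt(x) == 31 - bsr(x) and Len32(x) == bsr(x) + 1.
for value in (1, 2, 3, 0x80, 0x12345678, 0x80000000, 0xFFFFFFFF):
    assert leading_zeros32(value) == 31 - bsr32(value)
    assert len32(value) == bsr32(value) + 1
```

In real Go code the expression would simply be bits.Len32(x) - 1; the model only confirms the arithmetic, not what the compiler emits.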

Boolean expression optimization in compiler and high end processor pipeline

I want to calculate a boolean expression. For ease of understanding let's assume the expression is,
O=( A & B & C) | ( D & E & F)---(eqn. 1),
Here A, B, C, D, E and F are random bits. Now, as my target platform is a high-end Intel i7 Haswell processor that supports 64-bit data types, I can make this much more efficient using bit-slicing.
So now O, A, B, C, D, E and F are 64-bit data types,
O_64=( A_64 & B_64 & C_64) | ( D_64 & E_64 & F_64)---(eqn. 2), the & and | are bitwise operators similar to C language.
Now, I need the expression to take constant time to execute. That means, the calculation of Eqn. 2 should take the exact number of steps in the processor irrespective of the values in A_64, B_64, C_64, D_64, E_64, and F_64. The values are filled up using a random generator in the runtime.
Now my question is,
1. Considering I am using GCC or GCC-7 with -O3, how far can the compiler optimize the expression? For example, if A_64 becomes all zeroes (which can happen with probability 2^{-64}), then we don't need to calculate the first part of eqn. 2, and O_64 becomes equal to D_64 & E_64 & F_64. Is it possible for a C compiler to optimize in such a way? We have to remember that the values are filled in at runtime and the boolean expressions have around 120 variables.
2. Is it possible for a processor to do such an optimization (List 1) during runtime? As my boolean expression is very long, the execution will be heavily pipelined. Is it possible for a processor to pull an operation out of the pipeline if such a situation arises?
Please, let me know if any part of the question is not understandable.
I appreciate your help.
Is it possible for a c compiler to optimize such a way?
It's allowed to do it, but it probably won't. There is nothing to gain in general. If part of the expression was statically known to be zero, that would be used. But inserting branches inside bitwise calculations is almost always counterproductive, and I've never seen a compiler judge a sequence of ANDs to be "long enough to be worth inserting an early-out" (you can certainly do so manually, of course). I can't give you a hard guarantee, though; if you want to be sure, you should always check the assembly.
What it probably will do (for longer expressions at least) is reassociate the expression for more instruction-level parallelism. So code like that probably won't be just two long (but parallel with each other) chains of dependent ANDs, but be split up into more chains. That still wouldn't make the time depend on the values.
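As a rough illustration of both points, the Python sketch below evaluates the bit-sliced eqn. 2 on 64-bit words, checks that each bit lane matches eqn. 1, and shows that a reassociated (tree-shaped) AND reduction gives the same result as a single dependent chain. The helper names `and_chain`, `and_tree` and `eqn2` are hypothetical, and Python says nothing about timing; it only confirms that the operations performed do not depend on the operand values.

```python
import random
from functools import reduce

MASK64 = (1 << 64) - 1


def and_chain(words):
    """Left-to-right chain: every AND depends on the previous result."""
    return reduce(lambda acc, w: acc & w, words, MASK64)


def and_tree(words):
    """Pairwise (tree) reduction of a non-empty list: same value, shorter chains."""
    words = list(words)
    while len(words) > 1:
        pairs = [words[i] & words[i + 1] for i in range(0, len(words) - 1, 2)]
        if len(words) % 2:
            pairs.append(words[-1])  # odd element carries over to the next level
        words = pairs
    return words[0]


def eqn2(a, b, c, d, e, f):
    """Bit-sliced form of eqn. 1: 64 independent evaluations at once."""
    return (a & b & c) | (d & e & f)


rng = random.Random(0)
operands = [rng.getrandbits(64) for _ in range(6)]
o_64 = eqn2(*operands)

# Bit i of the sliced result equals eqn. 1 applied to bit i of each operand.
for i in range(64):
    a, b, c, d, e, f = [(w >> i) & 1 for w in operands]
    assert (o_64 >> i) & 1 == (a & b & c) | (d & e & f)

# Reassociating a long AND term changes the dependency structure, not the value.
long_term = [rng.getrandbits(64) for _ in range(60)]
assert and_chain(long_term) == and_tree(long_term)
```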
Is it possible for a processor to do such an optimization during runtime?
Extremely hypothetically yes. No processor architecture that I am aware of does that. It would be a slightly tricky mechanism, and as a general rule it would almost never help.
Hypothetically it could work like this: when the operands for an AND instruction are looked up and one (or both) of them is found to be renamed to the hard-wired zero-register, the renamer can immediately rename the destination to zero as well (rather than allocating a new register for the result), effectively giving that AND instruction 0-latency. The flags output would also be known so the µop would not even have to be executed. It would roughly be a cross between copy-elimination and a zeroing idiom.
That mechanism wouldn't even trigger unless one of the inputs is set to zero with a zeroing idiom; if an input is accidentally zero, that wouldn't be detected. It would also not completely remove the influence of the redundant AND instructions: they still have to go through (most of) the front-end of the processor, even if it is just to find out that they didn't need to be executed after all.

Overflow in a random number generator and 4-byte vs. 8-byte integers

The famous linear congruential random number generator, also known as the minimal standard, uses the formula
x(i+1)=16807*x(i) mod (2^31-1)
I want to implement this using Fortran.
However, as pointed out by "Numerical Recipes", directly implementing the formula with the default integer type (32 bit) will cause 16807*x(i) to overflow.
So the book recommends Schrage’s algorithm, which is based on an approximate factorization of m. This method can still be implemented with the default integer type.
However, I am wondering: Fortran actually has an Integer(8) type whose range is -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, which is much bigger than 16807*x(i) could be.
But the book even says the following:
It is not possible to implement equations (7.1.2) and (7.1.3) directly
in a high-level language, since the product of a and m − 1 exceeds the
maximum value for a 32-bit integer.
So why can't we just use Integer(8) type to implement the formula directly?
Whether or not you can have 8-byte integers depends on your compiler and your system. What's worse is that the actual value to pass to kind to get a specific precision is not standardized. While most Fortran compilers I know use the number of bytes (so 8 would be 64 bit), this is not guaranteed.
You can use the selected_int_kind function to get an integer kind that has a certain range. This code compiles on my 64 bit computer and works fine:
program ran
    implicit none
    integer, parameter :: i8 = selected_int_kind(R=18)
    integer(kind=i8) :: x
    integer :: i
    x = 100
    do i = 1, 100
        x = my_rand(x)
        write(*, *) x
    end do
contains
    function my_rand(x)
        implicit none
        integer(kind=i8), intent(in) :: x
        integer(kind=i8) :: my_rand
        my_rand = mod(16807_i8 * x, 2_i8**31 - 1)
    end function my_rand
end program ran
Update and explanation of @VladimirF's comment below
Modern Fortran delivers an intrinsic module called iso_fortran_env that supplies constants that reference the standard variable types. In your case, one would use this:
program ran
use, intrinsic :: iso_fortran_env, only: int64
implicit none
integer(kind=int64) :: x
and then as above. This code is easier to read than the old selected_int_kind. (Why did R have to be 18 again?)
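For comparison, here is a minimal sketch of both approaches discussed in this thread: the direct formula with a wide product, and Schrage's algorithm, which keeps every intermediate value within the 32-bit signed range. It is written in Python for brevity (Python integers never overflow, so it only models the arithmetic); `next_direct` and `next_schrage` are hypothetical names.

```python
M = 2**31 - 1          # modulus of the minimal standard generator
A = 16807              # multiplier
Q, R = divmod(M, A)    # approximate factorization m = a*q + r (q = 127773, r = 2836)


def next_direct(x):
    """Direct formula; the product a*x needs more than 32 bits."""
    return (A * x) % M


def next_schrage(x):
    """Schrage's algorithm: a*(x mod q) - r*(x div q), corrected by +m if negative."""
    hi, lo = divmod(x, Q)
    t = A * lo - R * hi     # both products stay below 2**31 for 0 < x < m
    return t if t > 0 else t + M


# Both variants should generate the same sequence from the same seed.
x = y = 1
for _ in range(1000):
    x, y = next_direct(x), next_schrage(y)
    assert x == y
```

With a 64-bit integer kind available the direct form is simpler; Schrage's method matters only when nothing wider than the default integer can be used.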
Yes. The simplest thing is to append _8 to the integer constants to make them 8 bytes. I know it is "old style" Fortran but it is portable and unambiguous.
By the way, when you write:
16807*x mod (2^31-1)
it is tempting to treat this as taking the result of 16807*x and applying an and with a 32-bit mask where all the bits are set to one except the sign bit. Be careful: that mask reduces modulo 2^31, not modulo 2^31-1, so the two are not equivalent.
The following avoids the expensive mod function, but it does not reproduce the sequence of the minimal standard generator:
iand(16807_8*x, Z'7FFFFFFF')
Update after comment:
or
iand(16807_8*x, 2147483647_8)
if your super modern compiler does not have backwards compatibility.
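A quick check of how masking with 2^31-1 relates to taking the remainder, sketched in Python. The sample value of x is arbitrary; masking keeps the low 31 bits, which is a remainder modulo 2^31 rather than modulo 2^31-1.

```python
M = 2**31 - 1              # modulus used by the minimal standard generator
x = 282475249              # an arbitrary state value below M
product = 16807 * x

masked = product & 0x7FFFFFFF   # what iand(..., Z'7FFFFFFF') computes
mod_m = product % M             # what the generator formula asks for

print(masked == product % 2**31)   # the mask is a remainder modulo 2**31
print(masked == mod_m)             # and differs from the remainder modulo 2**31 - 1
```

So the mask is only a valid shortcut for generators whose modulus is a power of two.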

Multiply two numbers whose range is 10^18

There is a variable first_variable which is always a mod of some number, mod_value.
In every step first_variable is multiplied with some number second_variable.
And the range of all three variables is from 1 to 10^18.
For that I build a formula,
first_variable = ((first_variable%mod_value)*(second_variable%mod_value))%mod_value
But this gives a wrong answer,
For example, if first_variable and second_variable are both (10^18)-1 and mod_value = 10^18.
Please suggest a method so that first_variable always gives the right answer.
It seems you are using a runtime where arithmetic is implemented with 64-bit integers. You can check this with multipliers like 2^32: if their product is 0, my guess is correct. In that case, you should switch to an arbitrary-precision arithmetic implementation, or at least one that is much wider than the current one. For example, Python and Erlang integers are arbitrary precision.
I've seen in the comments that you use C++. If so, look at the GMP library and similar ones. Or, if 128 bits is enough, modern GCC supports it through the __int128 type.
This is basically an overflow, so you should either use a different value for mod_value (up to 10^9) or limit the range of the first and second values.
Your product is O(10^36), which is roughly 2^120 and cannot fit in any standard primitive data type in languages like Java or C++. Use a big-integer type (BigInteger in Java, a library such as GMP in C++) or Python's built-in int to get around it.
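If a big-integer type is not an option, one common alternative is to multiply by repeated doubling, so that no intermediate value ever exceeds 2*mod_value. For mod_value up to 10^18 that stays below 2^63, so the same loop would fit in a signed 64-bit integer in C++ or Java. The sketch below is in Python (where overflow cannot happen, so it only demonstrates the algorithm); `mulmod` is a hypothetical helper name.

```python
def mulmod(a, b, m):
    """Compute (a * b) % m using only additions of values below m.

    Every intermediate sum is below 2*m, so for m <= 10**18 it would
    fit in a signed 64-bit integer in a fixed-width language.
    """
    a %= m
    b %= m
    result = 0
    while b:
        if b & 1:
            result = (result + a) % m   # result + a < 2*m
        a = (a + a) % m                 # doubling also stays below 2*m
        b >>= 1
    return result


first_variable = 10**18 - 1
second_variable = 10**18 - 1
mod_value = 10**18
print(mulmod(first_variable, second_variable, mod_value))
```

The loop runs once per bit of the second operand, so about 60 iterations for values near 10^18.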

Are gfortran whole array expressions enabled?

I'm new to Fortran and to gfortran. I learned that whole array expressions are calculated in parallel, but I see that the calculations take place on just one core of my computer.
I use the following code:
program prueba_matrices
    implicit none
    integer, parameter :: num = 5000
    double precision, dimension(1:num,1:num) :: A, B, C
    double precision, dimension (num*num) :: temp
    integer :: i
    temp = (/ (i/2.0, i=1,num*num) /)
    A = reshape(temp, (/ num, num/) )
    B = reshape(temp, (/ num, num/) )
    C = matmul(A , B)
end program prueba_matrices
I compile like this:
gfortran prueba_matrices.f03 -o prueba_gfortran
And, watching the graphs produced in real time by gnome-system-monitor, I can see that there is only one core working. If I substitute the line with the calculation
C = matmul(A , B)
for
C = A * B
It yields the same behaviour.
What am I doing wrong?
GFortran/GCC does have some automatic parallelization features, see http://gcc.gnu.org/wiki/AutoParInGCC . They are frequently not that good, so they are not enabled at any of the -ON optimization levels; you have to select them specifically with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note, however, that in your example above a loop like "A*B" is likely constrained by memory bandwidth (for sufficiently large arrays), and thus adding cores might not help that much. Furthermore, the MATMUL intrinsic leads to an implementation in the gfortran runtime library, which is not compiled with the autopar options (unless you have specifically built it that way).
What could help your example code above more is to actually enable any optimization at all. With -O3 Gfortran automatically enables vectorization, which can be seen as a way to parallelize loops as well, although not over several cpu cores.
If you want your call to matmul from gfortran to be multithreaded, the easiest approach is to simply link to an external BLAS package that has been compiled with multithreading support. Candidates include OpenBLAS (née Goto BLAS), ATLAS, or commercial packages like Intel's MKL, AMD's ACML, or Apple's Accelerate framework.
So for instance, for this simple example:
program timematmult
    real, allocatable, dimension(:,:) :: A, B, C
    integer, parameter :: N = 2048
    allocate( A(N,N) )
    allocate( B(N,N) )
    allocate( C(N,N) )
    call random_seed
    call random_number(A)
    call random_number(B)
    C = matmul(A,B)
    print *, C(1,1)
    deallocate(C)
    deallocate(B)
    deallocate(A)
end program timematmult
With the base matmul:
$ gfortran -o matmult matmult.f90
$ time ./matmult
514.38751
real 0m6.518s
user 0m6.374s
sys 0m0.021s
and with the multithreaded gotoblas library:
$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
514.38696
real 0m0.564s
user 0m2.202s
sys 0m0.964s
Note in particular here that the real time is less than the user time, indicating multiple cores are being used.
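The observation about real versus user time can be turned into a rough number: CPU time (user + sys) divided by wall-clock time approximates how many cores were busy on average. A small Python sketch using the timings quoted above; `cpu_to_wall_ratio` is a hypothetical helper name, and the ratio is only an estimate.

```python
def cpu_to_wall_ratio(real, user, sys):
    """Average number of busy cores: total CPU time over wall-clock time."""
    return (user + sys) / real


# Timings (in seconds) copied from the two runs shown above.
base = cpu_to_wall_ratio(real=6.518, user=6.374, sys=0.021)
blas = cpu_to_wall_ratio(real=0.564, user=2.202, sys=0.964)

print(f"built-in matmul: {base:.2f} cores busy on average")
print(f"external BLAS:   {blas:.2f} cores busy on average")
```

A ratio close to 1 means a single core did the work; a ratio well above 1 means several cores were active at once.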
I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
EDIT: Fortran didn't have, as part of the language, actual parallel execution until the coarrays of Fortran 2008. Some compilers might provide parallelization otherwise and some language features make it easier for compilers to implement parallel execution (optionally). The sentence that I cited from the web article better states reality than the portion you cite. Whole-array expressions were not intended to require parallel execution; they are a syntactical convenience to the programmer, making the language higher level, so that array operations can be expressed in single statements, without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel executions shows which statement is correct. It does not contradict the Fortran language.

Resources