Broken GFortran double precision in numerical multiplications and divisions? [duplicate]

This question already has an answer here:
Precision not respected
(1 answer)
Closed 1 year ago.
I ran a gfortran program that does repeated multiplication and division, and it gives results different from equivalent C and C++ code, even though all of them use double precision. I suspect the double precision I'm using in Fortran is incorrect or broken; I've checked the numbers with a calculator, and Fortran seems to introduce spurious digits near the end of the decimals. Below is the gfortran code:
PROGRAM problem
!-----------------------------------------------------------------------
   integer, parameter :: dp = selected_real_kind(15,307)
   real(dp) :: answer, w

   open(1, file = 'problem.txt')
   w = 0.99
   answer = w
   do i = 10, 0, -1
      answer = answer*w
      write (*,"(E18.9)",advance="yes") answer
      ! print results to text file
      write (1,"(E18.9)",advance="yes") answer
   end do
   write(*,*) "Done."
   close(1)
END PROGRAM problem
gfortran results are,
0.980100019E+00
0.970299028E+00
0.960596047E+00
0.950990096E+00
0.941480204E+00
0.932065411E+00
0.922744766E+00
0.913517327E+00
0.904382162E+00
0.895338349E+00
0.886384974E+00
my calculator shows,
0.9801
0.970299
0.96059601
0.950990049
Am I missing something in the variable type declaration, or is this behaviour intrinsic to gfortran?

Although you've declared w to be double precision, you've initialised it to 0.99, which is a single-precision constant. To initialise w with a double-precision value, you need w = 0.99_dp.
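For illustration, here is a minimal sketch of the fix (only the kind suffix on the literal changes; everything else is as in the posted program):
program fixed
   integer, parameter :: dp = selected_real_kind(15,307)
   real(dp) :: answer, w
   w = 0.99_dp                  ! double-precision literal instead of the single-precision 0.99
   answer = w*w
   write (*,"(E18.9)") answer   ! prints 0.980100000E+00 rather than 0.980100019E+00
end program fixed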

How to make a random number in Fortran [duplicate]

This question already has answers here:
Overflow in a random number generator and 4-byte vs. 8-byte integers
(2 answers)
Closed 2 years ago.
I want to make a random number without using functions.
PROGRAM RTEST
PARAMETER (N=100)
IMPLICIT REAL(A-H,O-Z), INTEGER(I-N)
DIMENSION A(N)
OPEN(99,FILE='RANDOM.DAT',FORM='FORMATTED')
IBMO=3149
SUM=0.0
DO 5 I=1,N
IBMO=IBMO*65549
IF(IBMO) 101, 102, 102
101 IBMO=IBMO+2147483647+1
102 RANDOM=0.46566128E-9*IBMO
A(I)=RANDOM
WRITE(99,10)I,A(I)
WRITE(6,10) I,A(I)
10 FORMAT(5X,I4,3X,F12.7)
SUM=SUM+A(I)
AVE1=SUM/FLOAT(N)
20 FORMAT(12X,F12.7)
30 FORMAT(5X,I4,3X,F12.7)
WRITE(6,20) AVE1
WRITE(99,30)I,AVE1
5 CONTINUE
CLOSE(99)
AVE=SUM/FLOAT(N)
WRITE(6,*)AVE
PAUSE
STOP
END
but I always get an integer overflow or an invalid floating point error. So I thought it was a type error and tried changing real to real*8 and integer to integer*8, but every attempt failed. What is the problem?
Your issue is that even constants need to have a specific kind.
Here's what you need to do:
PROGRAM RTEST
USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: int64, real64
IMPLICIT NONE
INTEGER, PARAMETER :: N = 100
REAL(KIND=real64) :: A(N)
INTEGER(KIND=int64) :: IBMO
...
IBMO = IBMO * 65549_int64
IBMO = IBMO + 2147483648_int64
and so on. The appended _int64 (after you have imported it from the iso_fortran_env module) tells the compiler to treat this number as a 64-bit integer. Instead of the line use, intrinsic :: iso_fortran_env you can also use the lines
INTEGER, PARAMETER :: int64 = selected_int_kind(19)
INTEGER, PARAMETER :: real64 = selected_real_kind(r=300)
(after the IMPLICIT NONE, of course.)
That said, is there a reason you're using such antiquated Fortran syntax?
Any Fortran program that doesn't include the line implicit none is suspicious. Then you're using the syntax
do 5 i = 1, n
...
5 continue
What's wrong with
do i = 1, n
...
end do
And then
if (ibmo) 101, 102, 102
That's syntax I don't even recognise.
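For reference, if (ibmo) 101, 102, 102 is the obsolescent arithmetic IF: it branches to the first label when the expression is negative, to the second when it is zero, and to the third when it is positive. A minimal sketch of those lines written with a modern if construct (purely a control-flow translation, not a fix for the overflow itself):
      IBMO = IBMO*65549
      IF (IBMO < 0) THEN
         IBMO = IBMO + 2147483647 + 1   ! wrap back into the non-negative range
      END IF
      RANDOM = 0.46566128E-9*IBMO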

Large integers change their value after applying an operation [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 6 years ago.
Below is a small segment of code from my script.
The logic is to convert any data value according to the scale value:
x = "11400000206633458812"
scale = 2
x = x.to_f/10**scale.to_i
x = x == x.to_i ? x.to_i : x
puts x
Observe that the value comes out to be "114000002066334592",
which is different from the initial value of x.
My test points are failing at this point.
x can be any value. If I use x = 156, the output is correct, but when the length of the integer exceeds 16 digits the problem arises.
The expected output in the above case is 114000002066334588.
Can anybody help me understand why the value is changing and how to fix it?
The expected output in the above case is 114000002066334588
Not at all.
One cannot expect floating-point operations to be exact; that is inherent in how binary floating point represents values, not a bug in Ruby. Practically every language offers a way to work with decimals exactly. In Ruby it's BigDecimal:
require 'bigdecimal'
BigDecimal("11400000206633458812") / 100
#⇒ 0.11400000206633458812E18
(BigDecimal("11400000206633458812") / 100).to_i
#⇒ 114000002066334588
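As a rough sketch (assuming the truncated integer from the question is the desired result), the original scaling logic could be rewritten with BigDecimal like this:
require 'bigdecimal'

x = "11400000206633458812"
scale = 2
x = BigDecimal(x) / 10**scale.to_i   # exact decimal division instead of to_f
x = x.frac.zero? ? x.to_i : x        # same "drop the fraction if whole" logic as before
puts x        # prints the exact decimal value
puts x.to_i   #=> 114000002066334588, the value expected in the question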

Natural Logarithm of Bessel Function, Overflow

I am trying to calculate the logarithm of a modified Bessel function of the second kind in MATLAB, i.e. something like this:
log(besselk(nu, Z))
where e.g.
nu = 750;
Z = 1;
I have a problem because the value of log(besselk(nu, Z)) comes out as infinity, because besselk(nu, Z) overflows to infinity in double precision. However, log(besselk(nu, Z)) itself should be a modest, perfectly representable number.
I am trying to write something like
f = double(sym('ln(besselk(double(nu), double(Z)))'));
However, I get the following error:
Error using mupadmex Error in MuPAD command: DOUBLE cannot convert the input expression into a double array. If the input expression contains a symbolic variable, use the VPA function instead.
Error in sym/double (line 514) Xstr = mupadmex('symobj::double', S.s, 0);
How can I avoid this error?
You're doing a few things incorrectly. It makes no sense to use double for your two arguments to besselk and then convert the output to symbolic. You should also avoid the old string-based input to sym. Instead, you want to evaluate besselk symbolically (which will return about 1.02×10^2055, much greater than realmax), take the log of the result symbolically, and then convert back to double precision.
The following is sufficient – when one or more of the input arguments is symbolic, the symbolic version of besselk will be used:
f = double(log(besselk(sym(750), sym(1))))
or in the old string form:
f = double(sym('log(besselk(750, 1))'))
If you want to keep your parameters symbolic and evaluate at a later time:
syms nu Z;
f = log(besselk(nu, Z))
double(subs(f, {nu, Z}, {750, 1}))
Make sure that you haven't flipped the nu and Z values in your math as large orders (nu) aren't very common.
As njuffa pointed out, the DLMF gives asymptotic expansions of K_nu(z) for large nu. From 10.41.2 we find, for real positive arguments z:
besselk(nu,z) ~ sqrt(pi/(2*nu)) * (e*z/(2*nu))^(-nu)
which gives, after some simplification,
log(besselk(nu,z)) ~ 1/2*log(pi) + (nu - 1/2)*log(2*nu) - nu*(1 + log(z))
So it is O(nu log(nu)). No surprise the direct calculation fails for nu > 750.
I don't know how accurate this approximation is. Perhaps you can compare it for the values where besselk is smaller than numerical infinity, to see if it fits your purpose?
EDIT: I just tried nu=750 and z=1: the above approximation gives 4.7318e+03, while from horchler's result we get log(1.02*10^2055) = 2055*log(10) + log(1.02) = 4.7318e+03. So it is correct to at least 5 significant digits for nu = 750 and z = 1! If this is good enough for you, it will be much faster than symbolic math.
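A minimal MATLAB sketch of that comparison (the variable names here are just for illustration):
nu = 750;
z  = 1;
% asymptotic approximation from DLMF 10.41.2, valid for large nu and real z > 0
logK_asym = 0.5*log(pi) + (nu - 0.5)*log(2*nu) - nu*(1 + log(z));
% exact value via the Symbolic Math Toolbox, for comparison
logK_sym  = double(log(besselk(sym(nu), sym(z))));
fprintf('asymptotic: %.6f   symbolic: %.6f\n', logK_asym, logK_sym);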
Have you tried the integral representation (written here in Mathematica-style notation)?
Log[Integrate[Cosh[Nu t]/E^(Z Cosh[t]), {t, 0, Infinity}]]

Implement equation in VHDL

I am trying to implement in VHDL an equation that involves multiplication by constants and addition. The equation is as below:
y<=-((x*x*x)*0.1666)+(2.5*(x*x))- (21.666*x) + 36.6653; ----error
I got the error
HDLCompiler:1731 - found '0' definitions of operator "*",
can not determine exact overloaded matching definition for "*".
The entity is:
entity eq1 is
Port ( x : in signed(15 downto 0);
y : out signed (15 downto 0) );
end eq1;
I tried using the RESIZE function and converting x to integer, but it gives the same error. Do I have to use another data type? x holds pure integer values like 2, 4, 6, etc.
Since x and y are of datatype signed, you can multiply them. However, there is no multiplication of signed with real. Even if there was, the result would be real (not signed or integer).
So first, you need to figure out what you want (the semantics). Then you should add type casts and conversion functions.
y <= x*x; -- OK
y <= 0.5 * x; -- not OK
y <= to_signed(integer(0.5 * real(to_integer(x))),y'length); -- OK
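As a rough, simulation-oriented sketch (not something to expect to synthesize as-is, and assuming ieee.numeric_std is in scope), the same conversion pattern applied to the full polynomial might look like:
-- note: the intermediate real result must fit both the integer range and y's 16 bits
y <= to_signed(integer(
         -(real(to_integer(x))**3 * 0.1666)
         + 2.5    * real(to_integer(x))**2
         - 21.666 * real(to_integer(x))
         + 36.6653),
     y'length);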
This is another case where simulating before synthesis might be handy. ghdl, for instance, tells you which "*" operator the first error is reported for:
ghdl -a implement.vhdl
implement.vhdl:12:21: no function declarations for operator "*"
y <= -((x*x*x) * 0.1666) + (2.5 * (x*x)) - (21.666 * x) + 36.6653;
---------------^ character position 21, line 12
The sub-expressions that multiply x by x have both operands of type signed, so those are fine.
(And for later, we also note that the complex expression on the right hand side of the signal assignment operation will eventually be interpreted as a signed value with a narrow subtype constraint when assigned to y).
VHDL determines the type of the literal 0.1666: it's an abstract literal, that is, a decimal literal or floating-point literal (IEEE Std 1076-2008 5.2.5 Floating-point types, 5.2.5.1 General, paragraph 5):
Floating-point literals are the literals of an anonymous predefined type that is called universal_real in this standard. Other floating-point types have no literals. However, for each floating-point type there exists an implicit conversion that converts a value of type universal_real into the corresponding value (if any) of the floating-point type (see 9.3.6).
There's only one predefined floating-point type in VHDL, see 5.2.5.2, and the floating-point literal of type universal_real is implicitly converted to type REAL.
9.3.6 Type conversions paragraph 14 tells us:
In certain cases, an implicit type conversion will be performed. An implicit conversion of an operand of type universal_integer to another integer type, or of an operand of type universal_real to another floating-point type, can only be applied if the operand is either a numeric literal or an attribute, or if the operand is an expression consisting of the division of a value of a physical type by a value of the same type; such an operand is called a convertible universal operand. An implicit conversion of a convertible universal operand is applied if and only if the innermost complete context determines a unique (numeric) target type for the implicit conversion, and there is no legal interpretation of this context without this conversion.
Because you haven't included a package containing another floating-point type, that leaves us searching for a "*" multiplying operator with one operand of type signed and one of type REAL with a return type of signed (or another "*" operator with the operand types swapped), and VHDL found 0 of those.
There is no
function "*" (l: signed; r: REAL) return REAL;
or
function "*" (l: signed; r: REAL) return signed;
found in package numeric_std.
Phillipe suggests one way to overcome this by converting signed x to integer.
Historically, synthesis doesn't encompass type REAL. Prior to the 2008 version of the VHDL standard you were likely to have arbitrary precision, while 5.2.5 paragraph 7 now tells us:
An implementation shall choose a representation for all floating-point types except for universal_real that conforms either to IEEE Std 754-1985 or to IEEE Std 854-1987; in either case, a minimum representation size of 64 bits is required for this chosen representation.
And that doesn't help us unless the synthesis tool supports floating-point types of REAL and is -2008 compliant.
VHDL has the float_generic_pkg package, introduced in the 2008 version, which performs synthesis-eligible floating-point operations and is compatible with the use of signed types by converting to and from its float type.
Before we suggest something as drastic as performing all these calculations as 64-bit floating-point numbers and synthesizing all that, let's again note that the result is a 16-bit signed, which is an array type of std_ulogic elements and represents a 16-bit integer.
You can model the multiplications on the right-hand side as distinct expressions evaluated in both floating-point and signed representation to determine when the error is significant.
Because you are using a 16-bit signed value for y, significant would mean a difference greater than 1 in magnitude. Flipped signs or unexpected zeros between the two methods will likely tell you there's a precision issue.
I wrote a little C program to look at the differences and right off the bat it tells us 16 bits isn't enough to hold the math:
int16_t x, y, Y;
int16_t a,b,c,d;
double A,B,C,D;
a = x*x*x * 0.1666;
A = x*x*x * 0.1666;
b = 2.5 * x*x;
B = 2.5 * x*x;
c = 21.666 * x;
C = 21.666 * x;
d = 36;
D = 36.6653;
y = -( a + b - c + d);
Y = (int16_t) -(A + B - C + D);
And the output for the leftmost value of x:
x = -32767, a = 11515, b = 0, c = 10967, y = -584, Y = 0
x = -32767, A = -178901765.158200, B = 2684190722.500000, C = -709929.822000
x = -32767 , y = -584 , Y= 0, double = -2505998923.829100
The first line of output is for 16-bit multiplies, and you can see all three expressions with multiplies are incorrect.
The second line says double has enough precision, yet Y (-(A + B - C + D)) doesn't fit in a 16-bit number. And you can't cure that by making the result size larger unless the input size remains the same. Chaining operations then becomes a matter of picking the best product and keeping track of the scale, meaning you might as well use floating point.
You could of course do clamping if it were appropriate. The double value on the third line of output is the non-truncated value; it's more negative than x'LOW.
You could also do clamping in the 16-bit math domain, though all this tells you this math has no meaning in the hardware domain unless it's done in floating point.
So if you were trying to solve a real math problem in hardware it would require floating point, likely accomplished using package float_generic_pkg, and wouldn't fit meaningfully in a 16 bit result.
As stated in "found '0' definitions of operator "+" in VHDL", the VHDL compiler is unable to find a matching operator for your operation, e.g. multiplying x*x. You probably want to use numeric_std in order to make operators for signed (and unsigned) available.
But note that VHDL is not a programming language but a hardware description language. That is, if your long-term goal is to move the code to an FPGA or CPLD, these operations might not work any longer, because they are not synthesizable.
I'm stating this because you will run into more problems when you try to multiply by e.g. 0.1666, since VHDL usually has no knowledge of floating-point numbers out of the box.
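One common synthesizable workaround (sketched here as an assumption, not the poster's design) is to pre-scale each real constant to an integer, multiply in signed, and shift the product back down; for the 0.1666 term, for example:
-- inside an architecture where x : signed(15 downto 0)
constant C1666 : signed(17 downto 0) := to_signed(10920, 18);  -- ~ 0.1666 * 2**16
signal term1   : signed(15 downto 0);
...
term1 <= resize(shift_right(resize(x, 18) * C1666, 16), term1'length);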

BLAS subroutines dgemm, dgemv and ddot don't work with scalars?

I have a Fortran subroutine which uses the BLAS subroutines dgemm, dgemv and ddot, which calculate matrix * matrix, matrix * vector and vector * vector products. I have m * m matrices and m * 1 vectors, and in some cases m = 1. It seems that those subroutines don't work well in those cases. They don't give errors, but there seems to be some numerical instability in the results. So I have to write something like:
if(m>1) then
vtuni(i,t) = yt(i,t) - ct(i,t) - ddot(m, zt(i,1:m,(t-1)*tvar(3)+1), 1, arec, 1)
else
vtuni(i,t) = yt(i,t) - ct(i,t) - zt(i,1,(t-1)*tvar(3)+1)*arec(1)
So my actual question is: am I right that those BLAS subroutines don't work properly when m = 1, or is there just something wrong in my code? Can the compiler affect this? I'm using gfortran.
BLAS routines are supposed to behave correctly with objects of size 1. I don't think it can depend on the compiler, but it could possibly depend on the implementation of BLAS you're relying on (though I'd consider that a bug in the implementation). The reference (read: not target-optimised) implementation of BLAS, which can be found on Netlib, handles that case fine.
I've done some testing on both arrays of size 1 and size-1 slices of a larger array (as in your own code), and they both work fine:
$ cat a.f90
implicit none
double precision :: u(1), v(1)
double precision, external :: ddot
u(:) = 2
v(:) = 3
print *, ddot(1, u, 1, v, 1)
end
$ gfortran a.f90 -lblas && ./a.out
6.0000000000000000
$ cat b.f90
implicit none
double precision, allocatable :: u(:,:,:), v(:)
double precision, external :: ddot
integer :: i, j
allocate(u(3,1,3),v(1))
u(:,:,:) = 2
v(:) = 3
i = 2
j = 2
print *, ddot(1, u(i,1:1,j), 1, v, 1)
end
$ gfortran b.f90 -lblas && ./a.out
6.0000000000000000
Things I'd consider to debug this problem further:
Check that your ddot declaration is correct (see the sketch below)
Substitute the reference BLAS for your optimised one, to check if it changes anything (you can just compile and link in the ddot.f file from the Netlib reference BLAS mentioned earlier in my answer)
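On the first point, here is a minimal sketch of an explicit interface for ddot (assuming the reference BLAS signature); declaring it this way lets the compiler catch argument mismatches that a bare external declaration would not:
interface
   function ddot(n, dx, incx, dy, incy)
      integer, intent(in)          :: n, incx, incy
      double precision, intent(in) :: dx(*), dy(*)
      double precision             :: ddot
   end function ddot
end interface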
