Finding duplicate records in Fortran - sorting

I have a data set of approximately a million records; each record has six floating-point numbers. I want to find the sets of records that share identical values in all six fields, and ideally I want to do it in Fortran, since the rest of the processing is done in Fortran. What would be the recommended approach? In the end I want a mapping from the original index to a new index into a condensed version of the data set without duplicates. Each record has other attributes, and I am interested in aggregating those over the groups defined by the six values.
I tried finding those sets by exporting the data as CSV and importing it into MS Access; a query that finds the sets took 10 seconds or so to run. I also wrote code that does what http://rosettacode.org/wiki/Remove_duplicate_elements#Fortran does (a "linear search"?), but with a million records it had not completed after 10 minutes or so, and I abandoned this approach.
The approach I am considering now is to adapt a ranking/sorting routine from SLATEC or ORDERPACK, which I assume would do better than my crude code. But I am wondering whether something like this already exists and can be downloaded, or whether there is a better approach.
EDIT:
I said "finding duplicates", but what I actually need is a mapping from the original data records to the reduced set. I want a mapping array like imap(1:n), where imap(1), imap(4) and imap(5) have the same value if the six floating-point values in original records 1, 4 and 5 are the same. I hope this is not too much of a deviation from what I said originally.

This is what I ended up doing. I took the mrgrnk code from ORDERPACK and adapted it for my purpose. The subroutine findmap below appears to do what I wanted.
module fndmap
use m_mrgrnk, only:mrgrnk
implicit none
contains
subroutine findmap(stkprm, stkmap )
! given 2-d real array stkprm, find a mapping described below:
!
! (identical records are assigned with same index)
! stkmap(i) == stkmap(j) iff stkprm(:,i) == stkprm(:,j)
! (order conserved)
! if i < j and stkmap(i) /= stkmap(j), then stkmap(i) < stkmap(j)
! (new index are contiguous)
! set(stkmap) == {1,2,..,maxval(stkmap)}
!
real,dimension(:,:),intent(in) :: stkprm
integer,dimension(:), intent(out) :: stkmap
integer, dimension(size(stkprm,2)) :: irngt
integer, dimension(size(stkprm,2)) :: iwork
integer :: nrec, i, j
nrec = size(stkprm,2)
! find rank of each record, duplicate records kept
call ar_mrgrnk(stkprm, irngt)
! construct iwork array: for each record, the smallest original index
! among all records identical to it
i = 1
do while(i<=nrec)
do j=i+1,nrec
if (any(stkprm(:,irngt(i))/=stkprm(:,irngt(j)))) exit
enddo
iwork(irngt(i:j-1)) = minval(irngt(i:j-1))
i = j
enddo
! now construct the map, where stkmap(i) shows index of new array
! with duplicated record eliminated, original order kept
j = 0
do i=1,nrec
if (i==iwork(i)) then
j = j+1
stkmap(i) = j
else
stkmap(i) = stkmap(iwork(i))
endif
enddo
end subroutine
recursive subroutine ar_mrgrnk(xdont, irngt)
! behaves like mrgrnk of ORDERPACK, except that array is 2-d
! records (columns) are ranked by the first field, then the second, and so on
real, dimension(:,:), intent(in) :: xdont
integer, dimension(:), intent(out), target :: irngt
integer, dimension(size(xdont,2)) :: iwork
integer :: nfld,nrec
integer :: i, j
integer, dimension(:), pointer :: ipt
! start disassociated so the associated() test at the end is well defined
nullify(ipt)
nfld=size(xdont,1)
nrec=size(xdont,2)
! rank by the first field
call mrgrnk(xdont(1,:), irngt)
! if there's only one field, it's done
if (nfld==1) return
! examine the rank to see if multiple record has identical
! values for the first field
i = 1
do while(i<=nrec)
do j=i+1,nrec
if (xdont(1,irngt(i))/=xdont(1,irngt(j))) exit
enddo
! if one-to-one, do nothing
if (j-1>i) then
! if many-to-one,
! gather those many, and rank them
call ar_mrgrnk(xdont(2:,irngt(i:j-1)),iwork)
! rearrange my rank based on those fields to the right
ipt => irngt(i:j-1)
ipt = ipt(iwork(1:j-i))
endif
i = j
enddo
if(associated(ipt)) nullify(ipt)
end subroutine
end module
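For readers who do not have ORDERPACK at hand, the same sort-then-group idea can be sketched in a self-contained way. This is only an illustration under stated assumptions: the names dedup_sketch, lex_less, merge_sort and build_map are made up for this example, and the hand-written stable merge sort is a stand-in for mrgrnk, not ORDERPACK's routine. As in findmap, records are compared with exact floating-point equality.

```fortran
module dedup_sketch
  ! Hypothetical helper module for illustration; not part of ORDERPACK.
  implicit none
contains
  ! .true. if record a sorts strictly before record b, field by field
  pure logical function lex_less(a, b)
    real, intent(in) :: a(:), b(:)
    integer :: k
    lex_less = .false.
    do k = 1, size(a)
      if (a(k) /= b(k)) then
        lex_less = a(k) < b(k)
        return
      end if
    end do
  end function lex_less

  ! Stable merge sort of the index array idx by the records x(:,idx)
  recursive subroutine merge_sort(x, idx)
    real, intent(in) :: x(:,:)
    integer, intent(inout) :: idx(:)
    integer :: tmp(size(idx))
    integer :: n, mid, i, j, k
    n = size(idx)
    if (n < 2) return
    mid = n/2
    call merge_sort(x, idx(1:mid))
    call merge_sort(x, idx(mid+1:n))
    tmp = idx
    i = 1
    j = mid + 1
    do k = 1, n
      if (i > mid) then
        idx(k) = tmp(j); j = j + 1
      else if (j > n) then
        idx(k) = tmp(i); i = i + 1
      else if (lex_less(x(:,tmp(j)), x(:,tmp(i)))) then
        idx(k) = tmp(j); j = j + 1
      else
        idx(k) = tmp(i); i = i + 1   ! ties keep the left element: stable
      end if
    end do
  end subroutine merge_sort

  ! map(i) == map(j) iff records i and j are identical; new indices are
  ! contiguous and follow the order of first appearance
  subroutine build_map(x, map)
    real, intent(in) :: x(:,:)
    integer, intent(out) :: map(:)
    integer :: idx(size(x,2)), first(size(x,2))
    integer :: n, i, j, nuniq
    n = size(x,2)
    idx = [(i, i = 1, n)]
    call merge_sort(x, idx)
    i = 1
    do while (i <= n)
      j = i + 1
      do while (j <= n)
        if (any(x(:,idx(j)) /= x(:,idx(i)))) exit
        j = j + 1
      end do
      ! the sort is stable, so idx(i) is the smallest index in the group
      first(idx(i:j-1)) = idx(i)
      i = j
    end do
    nuniq = 0
    do i = 1, n
      if (first(i) == i) then
        nuniq = nuniq + 1
        map(i) = nuniq
      else
        map(i) = map(first(i))
      end if
    end do
  end subroutine build_map
end module dedup_sketch

program demo_dedup
  use dedup_sketch
  implicit none
  real :: x(2,5)
  integer :: map(5)
  ! five records with two fields each; records 1 and 3 match, as do 2 and 5
  x = reshape([1.,2., 3.,4., 1.,2., 0.,0., 3.,4.], [2,5])
  call build_map(x, map)
  print '(5i3)', map   ! prints  1  2  1  3  2
end program demo_dedup
```

The sort costs O(n log n) comparisons and the two passes afterwards are linear, which is why this approach scales to a million records where the pairwise search does not.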

Related

Init array with unique random integers using OpenMP

I need to generate a random array of indices, i.e., unique integers beginning from 1.
So far I have this sequential code in Fortran:
subroutine rperm3(N, p)
integer, intent(in) :: N
integer, dimension(:), intent(out) :: p
integer :: j, k, l
real :: u
call random_seed()
p = 0
do j=1,N
call random_number(u)
k = floor(j*u) + 1
p(j) = p(k)
p(k) = j
call sleep(2)
end do
end subroutine rperm3
In every iteration a random index k is generated; the value at that index is copied to the current position j, and position k is then overwritten with j itself.
But is there any way to parallelize this with OpenMP? I see that simply using !$omp parallel do is not possible, because each iteration uses earlier values of the array, which may not have been assigned yet when they are needed.
And if there is no way to do it, is there another parallel algorithm that does the same thing, i.e. generates a random array of unique integers from some range?

Matrix equation not properly updating in time

As a simple example to illustrate my point, I am trying to solve the following equation f(t+1) = f(t) + f(t)*Tr (f^2) starting at t=0 where Tr is the trace of a matrix (sum of diagonal elements). Below I provide a basic code. My code compiles with no errors but is not updating the solution as I want. My expected result is also below which I calculated by hand (it's very easy to check by hand via matrix multiplication).
In my sample code below I have two variables that store solution, g is for f(t=0) which I implement, and then I store f(t+1) as f.
complex,dimension(3,3) :: f,g
integer :: k,l,m,p,q
Assume g=f(t=0) is defined as below
do l=1,3 !matrix index loops
do k=1,3 !matrix index loops
if (k == l) then
g(k,l) = cmplx(0.2,0)
else if ( k /= l) then
g(k,l) = cmplx(0,0)
end if
end do
end do
I have checked this result is indeed what I want it to be, so I know f at t=0 is defined properly.
Now I try to use this matrix at t=0 and find the matrix for all time, governed by the equation f(t+1) = f(t)+f(t)*Tr(f^2), but this is where I am not correctly implementing the code I want.
do m=1,3 !loop for 3 time iterations
do p=1,3 !loops for dummy indices for matrix trace
do q=1,3
g(1,1) = g(1,1) + g(1,1)*g(p,q)*g(p,q) !compute trace here
f(1,1) = g(1,1)
!f(2,2) = g(2,2) + g(2,2)*g(p,q)*g(p,q)
!f(3,3) = g(3,3) + g(3,3)*g(p,q)*g(p,q)
!assume all other matrix elements are zero except diagonal
end do
end do
end do
Printing this result is done by
print*, "calculated f where m=", m
do k=1,3
print*, (f(k,l), l=1,3)
end do
This is when I realize my code is not being implemented correctly.
When I print f(k,l) I expect for t=1 a result of 0.224*identity matrix and now I get this. However for t=2 the output is not right. So my code is being updated correctly for the first time iteration, but not after that.
I am looking for a solution as to how to properly implement the equation I want to obtain the result I am expecting.
I'll answer a couple things you seem to be having trouble with. First, the trace. The trace of a 3x3 matrix is A(1,1)+A(2,2)+A(3,3). The first and second indexes are the same, so we use one loop variable. To compute the trace of an NxN matrix A:
trace = 0.
do i=1,N
trace = trace + A(i,i)
enddo
I think you're trying to loop over p and q to compute your trace which is incorrect. In that sum, you'll add in terms like A(2,3) which is wrong.
Second, to compute the update, I recommend you compute the updated f into fNew, and then your code would look something like:
do m=1,3 ! time
! -- Compute f^2 (with loops not shown)
f2 = ...
! -- Compute trace of f2 (with loop not shown)
trace = ...
! -- Compute new f
do j=1,3
do i=1,3
fNew(i,j) = f(i,j) + trace*f(i,j)
enddo
enddo
! -- Now update f, perhaps recording fNew-f for some residual
! -- The LHS and RHS are both arrays of dimension (3,3),
! -- so fortran will automatically perform an array operation
f = fNew
enddo
This method has two advantages. First, your code actually looks like the math you're trying to do, and is easy to follow. This is very important for realistic problems, which are not so simple. Second, if fNew(i,j) depended on f(i+1,j), for example, you would not be overwriting values with the next time level while the current time level values still need to be used.
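As a minimal sketch of the outline above, the elided steps can be filled in with the intrinsic matmul and a single-index trace loop. The names f2, fNew and tr follow the outline or are chosen for this example; the initial condition is the 0.2*identity matrix from the question.

```fortran
program trace_iteration
  implicit none
  complex, dimension(3,3) :: f, fNew, f2
  complex :: tr
  integer :: i, m

  ! f(t=0) = 0.2 * identity
  f = cmplx(0.0, 0.0)
  do i = 1, 3
    f(i,i) = cmplx(0.2, 0.0)
  end do

  do m = 1, 3                     ! time steps
    f2 = matmul(f, f)             ! f^2 as a matrix product
    tr = cmplx(0.0, 0.0)
    do i = 1, 3                   ! trace: one index, diagonal only
      tr = tr + f2(i,i)
    end do
    fNew = f + tr*f               ! f(t+1) = f(t) + f(t)*Tr(f^2)
    f = fNew
    print *, "calculated f where m=", m
    do i = 1, 3
      print *, f(i,:)
    end do
  end do
end program trace_iteration
```

For the first step this gives 0.224 on the diagonal, matching the hand calculation in the question; the second step works out to 0.224 + 0.224*3*0.224**2, about 0.2577.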

How to output 2 or more arrays in a fortran's function?

I am writing a program which computes the LU decomposition of a matrix, with partial pivoting, and I would like the function to output several (2 or 3) matrices without running the program several times to output each one individually, which is a waste of time since one run already computes everything I want. Is there a way of doing this? For example, here is my function using Doolittle's algorithm, for square matrices which don't need pivoting. I want my output to be both matrices l and u at once, but I don't know how to do that.
function lu_d(aa) result(l)
real, dimension (:,:) :: aa !input matrix
real, dimension (size(aa,1), size(aa,2)) :: a !keeping input variable intact
real, dimension (size(a,1), size(a,2)) :: l , u !lower and upper matrices
integer :: i,j,k !index
real :: s !auxiliar variable
a=aa
do j=1 , size(a,2)
u(1,j)=a(1,j)
end do
l(1,1)=1
do j=2, size(a,2)
l(1,j)=0
end do
do i=2, size(a,1)
l(i,1)=a(i,1)/u(1,1)
u(i,1)=0
do j=2, i-1
s=0
u(i,j)=0
do k=1, j-1
s=s+l(i,k)*u(k,j)
end do
l(i,j)=(a(i,j)-s)/u(j,j)
end do
l(i,i)=1
do j=i, size(a,2)
s=0
if (j>i) l(i,j)=0 !do not overwrite the unit diagonal l(i,i) set above
do k=1, i-1
s=s+l(i,k)*u(k,j)
end do
u(i,j)=a(i,j)-s
end do
end do
end function
You could switch from using a function to using a subroutine. This way you can return values for multiple arrays through the argument list. You can also use the INTENT attribute when declaring the dummy arguments in the subroutine, e.g.:
REAL,INTENT(IN)::a declares a and does not allow its values to be altered inside the subroutine/function
REAL,INTENT(OUT)::b declares b and disregards any values it has coming into the subroutine/function
REAL,INTENT(INOUT)::c declares c as both input and output; leaving the intent out behaves similarly in most cases, but it is clearer to state it.
I will assume you need the output to be l and u, in which case the structure would look something like the one below. Note that l and u should either be declared in the main program with their size defined with respect to aa (first example) OR declared allocatable in the main program, passed to the subroutine unallocated, and allocated within the subroutine (second example). Because the dummy arguments are assumed-shape or allocatable, the subroutine needs an explicit interface in both cases; the simplest way to get one is to put it in a module.
First example:
SUBROUTINE lu_d(aa,l,u)
implicit none
real,intent(in):: aa(:,:)
real,intent(out):: l(:,:), u(:,:)
integer:: i,j,k
real:: s
<operations>
RETURN
END SUBROUTINE lu_d
Second example:
SUBROUTINE lu_d(aa,l,u)
implicit none
real,intent(in):: aa(:,:)
real,allocatable,intent(out):: l(:,:), u(:,:)
integer:: i,j,k,size_a1,size_a2
real:: s
size_a1=size(aa,1)
size_a2=size(aa,2)
allocate( l(size_a1,size_a2), u(size_a1,size_a2))
<operations>
RETURN
END SUBROUTINE lu_d
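To make the subroutine approach concrete, here is a small runnable sketch along the lines of the first example. It is an illustration only: the module name lu_mod and the 2x2 test matrix are assumptions made for this example, the loop body is a compact Doolittle factorization without pivoting (not the asker's exact loops), and square input is assumed.

```fortran
module lu_mod
  ! Hypothetical module name; it exists so the caller gets an explicit interface.
  implicit none
contains
  subroutine lu_d(aa, l, u)
    real, intent(in)  :: aa(:,:)          ! square input matrix, left untouched
    real, intent(out) :: l(:,:), u(:,:)   ! both factors returned in one call
    integer :: i, j, k, n
    real :: s
    n = size(aa, 1)
    l = 0.0
    u = 0.0
    do i = 1, n
      l(i,i) = 1.0                        ! Doolittle: unit diagonal in l
      do j = i, n                         ! row i of u
        s = 0.0
        do k = 1, i-1
          s = s + l(i,k)*u(k,j)
        end do
        u(i,j) = aa(i,j) - s
      end do
      do j = i+1, n                       ! column i of l
        s = 0.0
        do k = 1, i-1
          s = s + l(j,k)*u(k,i)
        end do
        l(j,i) = (aa(j,i) - s)/u(i,i)     ! no pivoting: assumes u(i,i) /= 0
      end do
    end do
  end subroutine lu_d
end module lu_mod

program demo_lu
  use lu_mod
  implicit none
  real :: a(2,2), l(2,2), u(2,2)
  integer :: i
  a = reshape([4.0, 6.0, 3.0, 3.0], [2,2])   ! columns (4,6) and (3,3)
  call lu_d(a, l, u)
  print *, "l:"
  do i = 1, 2
    print *, l(i,:)
  end do
  print *, "u:"
  do i = 1, 2
    print *, u(i,:)
  end do
  ! residual of the factorization; should be zero up to rounding
  print *, "max |l*u - a| =", maxval(abs(matmul(l, u) - a))
end program demo_lu
```

The caller declares l and u with the same shape as a, so a single call fills both factors.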

Build a block tri-diagonal matrix

I am trying to build a block tridiagonal matrix in Fortran. Now I have this piece of code that would deal with just the matrices that are placed in the main diagonal of the A_matrix, one new matrix for every step in i.
do i = gs+1, total_mesh_points
start_line = (3*i)-2
start_colu = (3*i)-2
final_line = (3*i)
final_colu = (3*i)
! copy the whole 3x3 block in one array assignment
A_matrix(start_line:final_line,start_colu:final_colu) = impflux
end do
Here my A_matrix(i,j) is a big matrix that will receive another three-by-three matrix (impflux) on its main diagonal. Note that for each step in i I will have a new impflux matrix that needs to be positioned on the main diagonal of A_matrix.
I can't think of a simpler solution for this problem. How do people usually build block tridiagonal matrices in Fortran?
Here's one way to build a block tridiagonal matrix. I'm not sure that there is, outside some well-known libraries, a usual way. This is a program, I'll leave it up to you to turn it into a function.
PROGRAM test
USE iso_fortran_env
IMPLICIT NONE
INTEGER :: k ! submatrix size
INTEGER :: n ! number of submatrices along main diagonal
INTEGER :: ix ! loop index
! the submatrices, a (lower diagonal) b (main diagonal) c (upper diagonal)
REAL(real64), DIMENSION(:,:,:), ALLOCATABLE :: amx, bmx, cmx
! the block tridiagonal matrix
REAL(real64), DIMENSION(:,:), ALLOCATABLE :: mat_a
k = 3 ! set these values as you wish
n = 4
ALLOCATE(amx(n-1,k,k), bmx(n,k,k), cmx(n-1,k,k))
ALLOCATE(mat_a(n*k,n*k))
mat_a = 0.0
! populate these as you wish
amx = 1.0
bmx = 2.0
cmx = 3.0
! first the lower diagonal
DO ix = 1,k*(n-1),k
mat_a(ix+k:ix+2*k-1,ix:ix+k-1) = amx(CEILING(REAL(ix)/REAL(k)),:,:)
END DO
! now the main diagonal
DO ix = 1,k*n,k
mat_a(ix:ix+k-1,ix:ix+k-1) = bmx(CEILING(REAL(ix)/REAL(k)),:,:)
END DO
! finally the upper diagonal
DO ix = 1,k*(n-1),k
mat_a(ix:ix+k-1,ix+k:ix+2*k-1) = cmx(CEILING(REAL(ix)/REAL(k)),:,:)
END DO
END PROGRAM test
Be warned, there's no error checking here at all and I've only made a few tests.
One obvious alternative would be to loop over the rows of mat_a only once, inserting amx, bmx, cmx at the same iteration, but this would require special handling for the first and last iterations and probably look a lot more complicated. As for performance, if it matters to you run some tests.
Note also that this produces a dense matrix. If your matrix gets very large then an approach which stores only the diagonal elements might be more useful. That takes us towards derived types and operations on them, and that's a whole other question.
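As for turning the program into a function, one possible sketch is below. It uses the single-loop alternative mentioned above, with the last block row handled by an if test. The module and function names (block_tridiag, build_block_tridiag) are made up for this example; the array layout amx(n-1,k,k), bmx(n,k,k), cmx(n-1,k,k) is the one used in the program, and like the program it does no error checking.

```fortran
module block_tridiag
  ! Hypothetical module wrapping the program above as a function.
  use iso_fortran_env, only: real64
  implicit none
contains
  function build_block_tridiag(amx, bmx, cmx) result(mat)
    ! amx: lower-diagonal blocks, bmx: main-diagonal blocks,
    ! cmx: upper-diagonal blocks; first index selects the block
    real(real64), intent(in) :: amx(:,:,:), bmx(:,:,:), cmx(:,:,:)
    real(real64), allocatable :: mat(:,:)
    integer :: n, k, ib, r0
    n = size(bmx, 1)          ! number of diagonal blocks
    k = size(bmx, 2)          ! block size
    allocate(mat(n*k, n*k))
    mat = 0.0_real64
    do ib = 1, n
      r0 = (ib-1)*k           ! offset of block row/column ib
      mat(r0+1:r0+k, r0+1:r0+k) = bmx(ib,:,:)
      if (ib < n) then
        mat(r0+k+1:r0+2*k, r0+1:r0+k) = amx(ib,:,:)   ! below the diagonal
        mat(r0+1:r0+k, r0+k+1:r0+2*k) = cmx(ib,:,:)   ! above the diagonal
      end if
    end do
  end function build_block_tridiag
end module block_tridiag

program demo_blocks
  use block_tridiag
  implicit none
  integer, parameter :: n = 3, k = 2
  real(real64) :: amx(n-1,k,k), bmx(n,k,k), cmx(n-1,k,k)
  real(real64), allocatable :: mat(:,:)
  integer :: i
  amx = 1.0_real64
  bmx = 2.0_real64
  cmx = 3.0_real64
  mat = build_block_tridiag(amx, bmx, cmx)
  do i = 1, n*k
    print '(6f5.1)', mat(i,:)   ! one row of the 6x6 matrix per line
  end do
end program demo_blocks
```

Taking the block count and size from the shape of bmx keeps the argument list short, at the cost of trusting the caller to pass amx and cmx with matching shapes.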

Reading and Printing a random number

I have the following program and below the program an input data file, which contains 10 lines of different data. I want to read this data randomly not sequentially, for example, it will maybe read line 3 then maybe line 5, not like number 1 2 3 4... Then these numbers I want to print randomly.
program rand
implicit none
integer::i, ok
real(kind=8) , allocatable , dimension(:):: s
integer, parameter:: nstep = 1, natom = 10
integer:: seed, rand
open(unit=2,file="fort.2",status="old",action="read")
allocate(s(natom),stat=ok)
if(ok/=0)then
print*,"problem allocating position array"
end if
do i=1,natom
read(2,*)s(i)
print*,i=(rand(seed))
end do
end program rand
Input file:
1.004624
1.008447
1.028897
1.001287
0.9994195
1.036111
0.9829285
1.029622
1.005867
0.9372157
As suggested by @IanBush in a comment, and also by @Sazzad in his answer, a reasonable approach is to read the whole file into an array, as your program is already doing. However, simply shuffling does not seem to me to give random printing; it is just a new order. That is why I am proposing this solution.
Random means that the same number can be printed many times while others are not printed at all, if the number of prints is limited. As far as I can see, your problem is how to select randomly. Since you have shown some effort, here is a modified version of your program
program rand
implicit none
integer::i, ok, idx
real(kind=8) , allocatable , dimension(:):: s
integer, parameter:: nstep = 1, natom = 10
integer:: seed!, rand
real(kind = 8) :: randNum
!
!
open(unit=2,file="fort.2",status="old",action="read")
!
!
allocate(s(natom),stat=ok)
if(ok/=0)then
print*,"problem allocating position array"
end if
!
do i=1,natom
read(2,*)s(i)
!print*,i=(rand(seed))
end do
!
CALL random_seed() ! Initialize a pseudo-random number sequence
! to the default state. For serious program, do not use the default
! use for example the program on the website of gnu fortran
! https://gcc.gnu.org/onlinedocs/gfortran/RANDOM_005fSEED.html
!
do i=1,natom !you can and should change natom here to something else
CALL random_number(randNum)
idx = int(randNum*natom) + 1
print*,'element at ',idx,': ', s(idx)
end do
end program rand
The difference is that the printing is commented out in your original loop and there is a new loop that prints randomly. You will see that some numbers are printed more than once. To give each number a chance to be printed, you should set a large number of iterations in the printing loop.
In this answer I used the default seed for the random number generator, which is not a good idea. The GNU Fortran documentation for RANDOM_SEED (linked in the code comment above) shows a good way of initializing the seed. Doing so is a good programming habit if reproducibility is not a concern.
The general algorithm looks like this:
Read all or N lines from the file into lines[N]
Create an array index[N] = {1, 2, ... N}
Shuffle the index array with a simple shuffle algorithm
For each i up to N, output lines[index[i]]
You have to convert it to your language yourself
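The steps above can be sketched in Fortran roughly as follows. This is a minimal illustration under stated assumptions: the ten values from the input file are hard-coded so the example runs on its own (a real program would read them from the file as before), the shuffle is the standard Fisher-Yates algorithm, and the program name shuffled_print is made up for this example.

```fortran
program shuffled_print
  implicit none
  integer, parameter :: n = 10
  real(kind=8) :: s(n)
  integer :: idx(n), i, j, tmp
  real :: u

  ! values copied from the input file in the question
  s = [1.004624d0, 1.008447d0, 1.028897d0, 1.001287d0, 0.9994195d0, &
       1.036111d0, 0.9829285d0, 1.029622d0, 1.005867d0, 0.9372157d0]

  idx = [(i, i = 1, n)]          ! index = {1, 2, ..., n}

  call random_seed()
  ! Fisher-Yates shuffle: swap position i with a random position j <= i
  do i = n, 2, -1
    call random_number(u)        ! 0 <= u < 1
    j = min(i, int(u*i) + 1)     ! 1 <= j <= i
    tmp = idx(i)
    idx(i) = idx(j)
    idx(j) = tmp
  end do

  ! every element is printed exactly once, in shuffled order
  do i = 1, n
    print *, 'element at ', idx(i), ': ', s(idx(i))
  end do
end program shuffled_print
```

Unlike drawing an independent random index for each print, this visits every line exactly once, so which of the two behaviours is wanted decides between the two answers.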
