Large Matrix handling in Fortran between multiple routines


I have a couple of matrices which are generated in a subroutine, and are used and altered in different parts of the program. Since the matrices are 6-dimensional and get quite large (100^6 is nothing unusual), generating and passing through the routines is not an option.
I open the files for reading and writing with form='unformatted', access='direct'.
What I am doing at the moment is storing them like this:
do i=1,noct
  do j=1,noct
    read_in_some_vector()
    do k=1,ngem**2
      do l=1,nvir**2
        mat(l) = mat(l) * some_vector(k) * some_mat(l,k)
      end do
    end do
    ij=j+(i-1)*noct
    write(unit=iunV,rec=ij) (mat(l),l=1,nvir**2)
  end do
end do
To use the matrix, I read it recordwise from the files:
iunC=open_mat_file(mat)
do i = 1,noct
  do j = 1,noct
    ij=j+(i-1)*noct
    read(unit=iunC,rec=ij) (mat(l),l=1,nvir**2)
    ij = min(i,j) + intsum( max(i,j)-1)
    read_some_vector(vec,rec=ij)
    do_sth = do_sth + ddot(nvir**2,mat,1,vec,1)
  end do
end do
At the moment, noct is a small number (compared to the others). But it will grow considerably, so the size of the matrices will explode. The matrices are (and have to be) double precision, so 8 TB for one matrix (100^6 elements at 8 bytes each) is in the realm of possibilities.
The matrices are not sparse, and they are all strictly antisymmetric.
What I can think of is either generate the needed matrix-parts directly in the routines, or clutter the harddrives with huge files.
Both would use up a lot of time (calculating or reading/writing).
Can anybody think of a third way? Or a way to optimize this?
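The two record numberings used in the loops above can be sketched in Python to make the bookkeeping explicit. This is only an illustration: the post does not define intsum, so the sketch assumes intsum(n) = 1 + 2 + ... + n, and the function names full_record and packed_record are hypothetical.

```python
def intsum(n):
    """Assumed meaning of intsum (not defined in the post): 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def full_record(i, j, noct):
    """Record number for the matrix file: one record per ordered pair (i, j)."""
    return j + (i - 1) * noct


def packed_record(i, j):
    """Record number for the vector file: one record per unordered pair {i, j}."""
    return min(i, j) + intsum(max(i, j) - 1)


noct = 4
pairs = [(i, j) for i in range(1, noct + 1) for j in range(1, noct + 1)]
full = sorted(full_record(i, j, noct) for i, j in pairs)
packed = sorted({packed_record(i, j) for i, j in pairs})
print(len(full), len(packed))
```

Under that assumption the full numbering needs noct**2 records, while the packed numbering needs noct*(noct+1)/2, because (i, j) and (j, i) share a record.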

Related

A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)
To do this I'm writing a sequence of 32-bit integers to the file as a background process while other code is running, like the following:
from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
    f.seek(0)
    counter = 1
    while counter < SIZE+1:
        f.write(pack('>i', counter))
        counter+=1
Then after I do some other stuff it's very easy to see if we missed a write since there will be a gap instead of the sequential increasing sequence. This works well enough. My problem is some data corruption cases might only be caught with random I/O (not sequential like this) based on how we track changes to files
So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store the index sequence in memory since 1GB ~= 250 million 4-byte integers. I considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack
def rand_index_generator():
    # pseudocode: RAND_INDEX stands for the sampler I am looking for
    generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)
    counter = 0
    while counter < MAX:
        counter+=1
        yield generator.next_index()
with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator():
        f.seek(4*index)
        f.write(pack('>i', counter))
        counter+=1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?

Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a,b):
    ctr=0
    while True:
        yield (ctr%b)
        ctr+=a
Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want the new index you use index_gen.next() and this is guaranteed to iterate over numbers between [0,2^28-1] in a semi-randomish manner depending on your choice of 'a'
There's really no point in picking an a value larger than your index size, since only a mod b affects the sequence. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed, which is what I care about for simulating this write workload.
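As a rough check of the coprime-stride idea, here is a minimal sketch in Python 3 syntax. It differs from the generator above in two ways that are my own choices: it stops after b indices instead of running forever, and it rejects a stride that is not coprime to b. The name stride_indices is hypothetical.

```python
from math import gcd


def stride_indices(a, b):
    """Yield every index in [0, b) exactly once, stepping by a modulo b."""
    if gcd(a, b) != 1:
        raise ValueError("a and b must be coprime to cover every index")
    idx = 0
    for _ in range(b):
        yield idx
        idx = (idx + a) % b  # keep idx reduced so it never grows beyond b


# Small demonstration with b = 16 and an odd stride.
visit_order = list(stride_indices(5, 16))
print(visit_order)
```

Because the sequence depends only on a and b, running the generator again with the same arguments reproduces the same order for the verification pass. On Python 2.7, fractions.gcd would have to stand in for math.gcd.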

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions, manually. I wrote this simple piece of code that I know is not that optimized probably:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
    s1(i) = 0;
    for j = t
        s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
    end
end
plot(t,s1);
I have a core i7-7700HQ coupled with 32 GB of RAM. Matlab is stored on my HDD and my Windows is on my SSD. The problem is that this simple code is taking I think at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only taking 18% of my CPU and 3 GB of RAM for this task. Which is I think probably enough, I don't know. But I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and it takes how much it needs. I don't know if I can increase the CPU usage of it or not.
Is there any way to make things a little bit faster? I have like 6 or 7 of these for loops in my homework and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, it highlights the piece of code that is currently running. It is the for loop, the outer one is highlighted)

Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are Nx1 vectors and the operator '.*' is element-wise multiplication. This tells matlab to compute x(1)*z(1), x(2)*z(2), ..., x(N)*z(N) and assign each product to the corresponding element of the vector y. Additionally, many of the functions in matlab are able to accept vectors as inputs, perform their operation on each element, and return an equal-size vector with the output at each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can simplify your inner loop to this:
s1(i) = s1(i) + sum( rectangularPulse(-1,1,t) .* triangularPulse(-1,1,t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle of what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.

Optimizing MATLAB work on N dim array(512,512,400)

I am working on images that are 512x512 pixels; I have written a code that analyzes my images and gives me the values that I need in matrices that have dimensions (512,512,400) in 10 minutes more or less, using pre-allocation.
My problem is when I want to work with these matrices: it takes me hours to see results, and I want to implement a script that does what I want in much less time. Can you help me?
% meanm is a matrix (512,512,400) that contains the mean of every inputmatrix
% sigmam is a matrix (512,512,400) that contains the std of every inputmatrix
% Basically what I want is that for every inputmatrix (512x512), that is stored inside
% an array of dimensions (512,512,400),
% if a value is higher than meanm + sigmam it has to be replaced with
% the corresponding value of the meanm matrix.
p=400;
for h=1:p
    if (inputmatrix(:,:,h) > meanm(:,:,h) + sigmam(:,:,h))
        inputmatrix(:,:,h) = meanm(:,:,h);
    end
end
I know that MatLab performs better on matrices calculation but I have no idea how to translate this for loop on my 400 images in something easier for it.

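Try using the condition from your if statement to make a logical matrix (an if on an array expression only passes when every element is true, so the loop as written does not test element by element):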
logical_mask = (meanm + sigmam) < inputmatrix;
inputmatrix(logical_mask) = meanm(logical_mask);
This should improve your performance by using two features of Matlab
Vectorization uses matrix operations instead of loops. To quote the linked site "Vectorized code often runs much faster than the corresponding code containing loops."
Logical Indexing allows you to access all elements in your array that meet a condition simultaneously.

construct a structured matrix efficiently in fortran

Having left Fortran for several years, now I have to pick it up and start to work with it again.
I'd like to construct a matrix with entry(i,j) in the form f(x_i,y_j), where f is a function of two variables, e.g., f(x,y)=cos(x-y). In Matlab or Python(Numpy), there are efficient ways to handle this kind of specific issue. I wonder whether there is such optimization in Fortran.
BTW, is it also true in Fortran that a vectorized operation is faster than a do/for loop (as is the case in Matlab and Numpy) ?

If by vectorized you mean what it means in Matlab and Python, i.e. the short form that operates on a whole array, then no: these forms are often slower, because they may be harder to optimize than simple loops. What is faster is when the compiler actually uses the vector instructions of the CPU, but that is something else, and it is easier for the compiler to use them for simple loops.
Fortran has elemental functions, do concurrent, forall and where constructs, implied loops and array constructors. There is no point repeating them here, they have been described many times on this site or in tutorials.
Your example is most simply done using a loop
do j = 1, ny
  do i = 1, nx
    entry(i,j) = f(x(i), y(j))
  end do
end do
One of the short forms you probably had in mind with Python-like vectorization is the whole-array operation, e.g.,
A = cos(B)
C = A * B
D = f(A*B)
and similar. The function, which is called on each element of the array, must be elemental. These operations are not necessarily efficient. For example, the last call may require a temporary array to be created, which would be avoided when using a loop.
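For readers coming from Python, the double loop above has the following plain-Python counterpart. It is only a reference for what entry(i,j) = f(x(i), y(j)) computes, using the question's example f(x,y) = cos(x-y). It says nothing about Fortran performance, and build_matrix is a hypothetical helper name.

```python
import math


def f(x, y):
    """Example function from the question: f(x, y) = cos(x - y)."""
    return math.cos(x - y)


def build_matrix(x, y):
    """Return a nested list with entry[i][j] == f(x[i], y[j])."""
    return [[f(xi, yj) for yj in y] for xi in x]


x = [0.0, 0.5, 1.0]
y = [0.0, 1.0]
entry = build_matrix(x, y)
print(len(entry), len(entry[0]))
```

Note that Python indices start at 0 here, whereas the Fortran loop runs from 1 to nx and 1 to ny.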

openMP chunk and cache size

I have a simple Fortran code which performs matrix multiplication, and it is parallelized with OpenMP like this:
!$OMP PARALLEL DO PRIVATE(...) SHARED(...) SCHEDULE(STATIC,N/128)
This makes the chunk size relatively large and the number of chunks a multiple of the number of processors (4, 8, 16, etc.).
However, when the matrix size gets really big, it seems more logical to set the chunk size smaller than the cache size (at least, it is worth trying). Is there a simple way to write portable code which takes the processor cache size into account? Or is this not supported by OpenMP?

It really depends on your algorithm and your problem. I suggest you look for so-called tiled algorithms and loop over tiles that you set up yourself to have the right size. I use something like this for finite difference stencil computations:
!$omp do
do bk = 1,nz,tilenz
  do bj = 1,ny,tileny
    do bi = 1,nx,tilenx
      do k = bk,min(bk+tilenz-1,nz)
        do j = bj,min(bj+tileny-1,ny)
          do i = bi,min(bi+tilenx-1,nx)
            ! do something with array element A(i,j,k) and its neighbours
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
where tilenx, tileny and tilenz are the x,y and z dimensions of the tile.
There are more advanced ways of organizing the computation in the literature.
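To see that the tile bounds in the loop nest above cover every element exactly once, including a partial tile at the edge, the iteration order can be reproduced in a small Python sketch. The helper names tile_ranges and tiled_visit are hypothetical, and the sketch only records indices; it does no stencil work and no parallelism.

```python
def tile_ranges(n, tile):
    """Yield 1-based inclusive (start, stop) pairs covering 1..n in steps of tile."""
    for start in range(1, n + 1, tile):
        yield start, min(start + tile - 1, n)


def tiled_visit(nx, ny, nz, tilenx, tileny, tilenz):
    """Return the (i, j, k) triples in the order the tiled loop nest visits them."""
    visited = []
    for k0, k1 in tile_ranges(nz, tilenz):
        for j0, j1 in tile_ranges(ny, tileny):
            for i0, i1 in tile_ranges(nx, tilenx):
                for k in range(k0, k1 + 1):
                    for j in range(j0, j1 + 1):
                        for i in range(i0, i1 + 1):
                            visited.append((i, j, k))
    return visited


# Sizes chosen so that the tiles do not divide the grid evenly.
tiled_order = tiled_visit(5, 4, 3, 2, 3, 2)
print(len(tiled_order), len(set(tiled_order)))
```

The min(...) in the upper bound is what trims the last tile in each direction when the tile size does not divide the array extent.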
