The most efficient way to read an unformatted file - performance

I am processing 100,000 data files with Fortran. The data were generated on an HPC system using MPI I/O. So far I have only come up with the following way to read the raw data, which is not efficient. Is it possible to read a whole slice ut_yz(:,J,K) in one go instead of reading element by element? Thanks.
The old code is as follows, and its performance is poor.
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
     ACCESS='DIRECT', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
  DO J=1,ny
    DO I=1,nxt
      READ(10,REC=COUNT) ut_yz(I,J,K)
      COUNT = COUNT + 1
    ENDDO
  ENDDO
ENDDO
CLOSE(10)
The desired version is
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
  DO J=1,ny
    READ(10,REC=COUNT) TEMP(:)
    COUNT = COUNT + 153
    ut_yz(:,J,K)=TEMP(:)
  ENDDO
ENDDO
CLOSE(10)
However, it always fails. Can anyone comment on this? Thanks.

A direct-access read reads a single record, if I am not mistaken. Thus, in your new version you need to increase the record length accordingly (and keep ACCESS='DIRECT' in the open statement, which your desired code dropped; REC= is only valid for direct access):
inquire(iolength=rl) ut_yz(:,1,1)
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='DIRECT', &
     recl=rl, status='OLD', action='READ')
count = 1
do k=1,nz
  do j=1,ny
    read(10, rec=count) ut_yz(:,j,k)
    count = count + 1
  end do
end do
close(10)
Of course, in this example you could also read the complete array at once, which should be the fastest option:
inquire(iolength=rl) ut_yz
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='DIRECT', &
     recl=rl, status='OLD', action='READ')
read(10, rec=1) ut_yz
close(10)
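If the file is a raw byte stream without Fortran record markers (typical of data written with MPI I/O), stream access (Fortran 2003) is another option worth trying; a minimal sketch, assuming ut_yz is already allocated with the file's dimensions:
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', &
     access='STREAM', status='OLD', action='READ')
read(10) ut_yz   ! one request for the whole array, no record bookkeeping
close(10)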

Related

reading in parallel from" generator in Keras

I have a big dataset divided into files.
I would like to read and process my data one file at a time, and for this I have this Keras generator:
def myGenerator():
    while 1:
        rnd = random.randint(1,200)
        strRnd = str(rnd)
        lenRnd = len(strRnd)
        rndPadded = strRnd.rjust(5, '0')
        nSearchesInBatch = 100
        f = "path/part-" + rndPadded + "*"  # read one block of data
        data = sqlContext.read.load(f).toPandas()
        imax = int(data.shape[0]/nSearchesInBatch)  # number of batches that will be created sequentially from the generator
        for i in range(imax):
            data_batch = data[i*nSearchesInBatch:(i+1)*nSearchesInBatch]
            features = data_batch['features']
            output = data_batch['output']
            yield features, output
The problem is that the reading takes the biggest share of the time (each file is around 200 MB), and in the meanwhile the GPU sits waiting. Is it possible to pre-read the next batch while the GPU is training on the previous one?
At the moment one file is read and split into steps (the inner loop); the CPUs are idle while the GPU trains, and as soon as the epoch finishes the GPU goes idle while the CPU starts reading (which takes 20-30 seconds).
Any solution to parallelize this?
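One possible approach (a sketch, not from the original post): run the generator in a background thread that fills a bounded queue, so the next file is read while the GPU trains on batches already in the queue. myGenerator is the generator above; the queue size of 4 is just an assumption to tune:
import threading
import queue

def prefetch(generator, max_prefetch=4):
    # Run `generator` in a background thread, keeping up to
    # `max_prefetch` items ready while the consumer (the GPU) works.
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)      # blocks while the queue is full
        q.put(sentinel)      # signals the end of a finite generator

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# usage: model.fit_generator(prefetch(myGenerator()), ...)
Depending on the Keras version, fit_generator's workers and max_queue_size arguments may achieve the same effect without custom code.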

Working with very big data faster in Matlab?

I have to deal with very big data (point clouds, generally more than 30,000,000 points) using Matlab. I can read the ASCII data using the textscan function. After reading, I need to detect invalid data (points with 0,0,0 coordinates) and then do some mathematical operations on each point or each line of the data. My approach: first I read the data with textscan and assign it to a matrix; secondly, I use for loops to detect invalid points and to do the mathematical operations on each point or line. A sample of my code is shown below. According to Matlab's profiler, textscan takes 37% and the line
transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
takes 35% of the total computation time.
I tried it with another point cloud (around 5,500,000 points) and the profiler reported the same results. Is there a way to avoid the for loops, or is there another way to speed up this computation?
fileID = fopen('C:\Users\Mustafa\Desktop\ptx_all_data\dede5.ptx');
original_data = textscan(fileID,'%f %f %f %f %f %f %f', 'delimiter',' ');
fclose(fileID);
column = original_data{1}(1);
row = original_data{1}(2);
t_matrix = [original_data{1}(7)  original_data{2}(7)  original_data{3}(7)  original_data{4}(7)
            original_data{1}(8)  original_data{2}(8)  original_data{3}(8)  original_data{4}(8)
            original_data{1}(9)  original_data{2}(9)  original_data{3}(9)  original_data{4}(9)
            original_data{1}(10) original_data{2}(10) original_data{3}(10) original_data{4}(10)];
coordinate_list(:,1) = original_data{1}(11:length(original_data{1}));
coordinate_list(:,2) = original_data{2}(11:length(original_data{2}));
coordinate_list(:,3) = original_data{3}(11:length(original_data{3}));
coordinate_list(:,4) = 0;
coordinate_list(:,5) = original_data{4}(11:length(original_data{4}));
transformed_list = zeros(length(coordinate_list),5);
for i = 1:length(coordinate_list)
    if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
        transformed_list(i,:) = NaN;
    else
        %transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
        transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
        transformed_list(i,5) = coordinate_list(i,5);
    end
    %i
end
Thanks in advance
for loops with conditional statements like those will take ages to run. But what Matlab lacks in loop speed it makes up for with vectorization and indexing.
Let's try some logical indexing to solve the first step:
invalid = coordinate_list(:,1) == 0 & ...
          coordinate_list(:,2) == 0 & ...
          coordinate_list(:,3) == 0;   % & is the logical AND, not .*
coordinate_list(invalid,:) = nan;      % blank out whole rows, as the loop did
And then vectorize the second statement:
transformed_list(:,(1:4)) = coordinate_list(:,(1:4))*t_matrix;
As EBH mentioned above, this might be a bit heavy on your RAM. If it's more than your computer can handle, ask yourself whether the coordinates really have to be doubles; maybe single precision will do. If that still isn't enough, try slicing the vector and performing the operation in parts, as sketched below.
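A minimal sketch of that slicing idea (nChunk is an assumed tuning knob, sized to fit your RAM):
nChunk = 1e6;  % rows per slice
for s = 1:nChunk:size(coordinate_list,1)
    e = min(s+nChunk-1, size(coordinate_list,1));
    transformed_list(s:e,1:4) = coordinate_list(s:e,1:4)*t_matrix;
end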
A small example to give you an idea, because I had a 2-million-point cloud around here, in R2015a:
transformed_list = zeros(length(coordinate_list),5);
tic
for i = 1:length(coordinate_list)
    if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
        transformed_list(i,:) = NaN;
    else
        %transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
        transformed_list((i:i),(1:3)) = coordinate_list((i:i),(1:3))*t_matrix;
        transformed_list(i,5) = 1;
    end
    %i
end
toc
Returns Elapsed time is 10.928142 seconds.
transformed_list = coordinate_list;
tic
invalid = coordinate_list(:,1) == 0 & ...
          coordinate_list(:,2) == 0 & ...
          coordinate_list(:,3) == 0;
coordinate_list(invalid,:) = nan;
transformed_list(:,(1:3)) = coordinate_list(:,(1:3))*t_matrix;
toc
Returns Elapsed time is 0.101696 seconds.
Rather than reading the whole file, you'd be better off using a loop with
fscanf(fileID, '%f', 7)
and processing the input as you read it; a sketch follows.
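A minimal sketch of that chunked approach (it assumes, as in the question's format string, seven floats per record; the processing step is a placeholder):
fileID = fopen('C:\Users\Mustafa\Desktop\ptx_all_data\dede5.ptx');
while ~feof(fileID)
    point = fscanf(fileID, '%f', 7);  % read one record of 7 floats
    if numel(point) < 7
        break;                        % partial record at end of file
    end
    % ... transform/filter `point` here instead of storing everything ...
end
fclose(fileID);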

Fast and furious random reading of huge files

I know the question isn't new, but I haven't found anything useful. In my case I have a 20 GB file and I need to read random lines from it. I have a simple file index which contains line numbers and the corresponding seek offsets. I also disabled buffering when reading, so that only the needed line is read.
And this is my code:
def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
    index = load_file_index(file_path)
    if (batch_size > len(index)) or (batch_size == 0):
        batch_size = len(index)
    lines_indices = np.random.randint(0, len(index), batch_size)  # upper bound exclusive
    with io.open(file_path, 'rb', buffering=0) as f:
        for line_index in lines_indices:
            f.seek(index[line_index])
            line = f.readline(2048)
            yield __get_features_from_line(line, delimiter, dtype)
The problem is that it's extremely slow: reading 5,000 lines takes 89 seconds on my Mac (and here I am pointing at the SSD drive). Here is the code I used for testing:
features_gen = tedlium_random_speech_gen(5000)  # just a wrapper for the function given above
i = 0
for feature, cls in features_gen:
    if i % 1000 == 0:
        print("Got %d features" % i)
    i += 1
print("Total %d features" % i)
I've read something about memory-mapping files, but I don't really understand how it works: how does the mapping work in essence, and will it speed up the process or not?
So the main question is: what are the possible ways to speed up the process? The only way I see now is to read random blocks of lines rather than individual lines.
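On the memory-mapping question: mmap presents the file as a byte buffer backed by the OS page cache, so a seek becomes pointer arithmetic and repeatedly hit pages stay cached; for truly random access over 20 GB it mainly saves per-read syscall overhead. A minimal sketch reusing the offset index from above (load_file_index and __get_features_from_line are the helpers from the original code):
import mmap
import numpy as np

def create_random_file_gen_mmap(file_path, batch_size=0, dtype=np.float32,
                                delimiter=','):
    index = load_file_index(file_path)  # same line-offset index as above
    if batch_size > len(index) or batch_size == 0:
        batch_size = len(index)
    lines_indices = np.random.randint(0, len(index), batch_size)
    with open(file_path, 'rb') as f:
        # Map the whole file; pages are loaded lazily and cached by the OS.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line_index in lines_indices:
                mm.seek(index[line_index])
                line = mm.readline()
                yield __get_features_from_line(line, delimiter, dtype)
        finally:
            mm.close()
Whether this actually helps depends on how much of the working set fits in RAM; if it doesn't, reading sorted batches of offsets (blocks of lines) is the more reliable win.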

Getting Fortran runtime error: end of file

I have recently learned how to work with basic files in Fortran, and I assumed it was as simple as:
open(unit=10,file="data.dat")
read(10,*) some_variable, somevar2
close(10)
So I can't understand why this function I wrote is not working.
It compiles fine, but when I run it, it prints:
Fortran runtime error: end of file
Code:
Function Load_Names()
    character(len=30) :: Staff_Name(65)
    integer :: i = 1
    open(unit=10, file="Staff_Names.txt")
    do while(i < 65)
        read(10,*) Staff_Name(i)
        print*, Staff_Name(i)
        i = i + 1
    end do
    close(10)
end Function Load_Names
I am using Fortran 2008 with gfortran.
A common reason for the error you report is that the program doesn't find the file it is trying to open. Sometimes your assumptions about the directory in which the program looks for files at run-time will be wrong.
Try:
using the err= option in the open statement to write code to deal gracefully with a missing file; without this the program crashes, as you have observed;
or
using the inquire statement to figure out whether the file exists where your program is looking for it (both approaches are sketched below).
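A minimal sketch of both checks (using iostat=, the more flexible relative of err=; the messages are just illustrative):
logical :: exists
integer :: ios
inquire(file="Staff_Names.txt", exist=exists)
if (.not. exists) stop "Staff_Names.txt not found in the working directory"
open(unit=10, file="Staff_Names.txt", status='old', action='read', iostat=ios)
if (ios /= 0) stop "could not open Staff_Names.txt"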
You can check when a file has ended. It is done with the IOSTAT option of the read statement.
Try:
Function Load_Names()
    character(len=30) :: Staff_Name(65)
    integer :: i = 1
    integer :: iostat
    open(unit=10, file="Staff_Names.txt")
    do while(i < 65)
        read(10,*, IOSTAT=iostat) Staff_Name(i)
        if( iostat < 0 )then
            write(6,'(A)') 'Warning: File contains less than 65 entries'
            exit
        else if( iostat > 0 )then
            write(6,'(A)') 'Error: error reading file'
            stop
        end if
        print*, Staff_Name(i)
        i = i + 1
    end do
    close(10)
end Function Load_Names
Using the Fortran 2003 standard, one can do the following to check whether the end of file has been reached:
use :: iso_fortran_env
character(len=1024) :: line
integer :: u1, stat
open (newunit=u1, action='read', file='input.dat', status='old')
ef: do
    read(u1,'(A)',iostat=stat) line
    if (stat == iostat_end) exit ef ! end of file
    ...
end do ef
close(u1)
Thanks for all your help. I did fix the code:
Function Load_Names(Staff_Name) ! Loads staff names
    character(len=30) :: Staff_Name(65)
    integer :: i = 1
    open(unit=10, file="Staff_Names.txt", status='old', action='read') ! opens file for reading
    do while(i < 66) ! sets Staff_Name() equal to the file, one string at a time
        read(10,*,end=100) Staff_Name(i)
        i = i + 1
    end do
100 close(10) ! closes file
    return ! returns value
end Function Load_Names
I needed to change read(10,*) to read(10,*,END=100) so it knew what to do when it came to the end of the file, as it was in a loop I assume.
In that case your problem was that your file was a row vector, and it was likely giving you this error immediately after reading the first element, as @M.S.B. was suggesting.
If you have a file with an NxM matrix and you read it in this way (F77):
DO i=1,N
  DO j=1,M
    READ(UNIT,*) Matrix(i,j)
  ENDDO
ENDDO
it will load the first column of your file into the first row of your matrix, and it will give you an error as soon as it reaches the end of the file's first column, because the loop forces it to read further lines and there are no more lines (if N<M, when j=N+1 for example). To read the different columns you should use an implied do loop, which is why your solution worked:
DO i=1,N
  READ(UNIT,*) (Matrix(i,j), j=1,M)
ENDDO
I am using GNU Fortran 5.4.0 on Ubuntu 16.04. Check that the file is really the one you are looking for; files with the same name are sometimes confusing, and one of them may be blank. Also check whether the file path points at the working directory.

Reading columns from a data file in Fortran

I wrote the following block to read from an external data file:
open(unit=338,file='bounnodes.dat',form='formatted')
DO I=1,NQBOUN
  DO J=1,NUMBOUNNODES(I)
    read(338,2001) NODEBOUN(i,j)
    write(6,*) 'BOUNDARY NODES', NODEBOUN(i,j)
  ENDDO
ENDDO
2001 FORMAT(32I5)
As far as I understood, this should read a 2 x 32 array from bounnodes.dat.
However, I get an end-of-file error during the read, and it prints only the first column.
I tried to read a 32 x 2 array using the same code, and it reads 32 elements of the first column but outputs 0s for the next column.
Can you please explain what is happening? Is my formatting wrong?
Every read statement in Fortran advances to the next record. This means a new line in normal text files. Try this:
DO I=1,NQBOUN
  DO J=1,NUMBOUNNODES(I)
    read(338,2001,advance='no') NODEBOUN(i,j)
    write(*,*) 'BOUNDARY NODES', NODEBOUN(i,j)
  ENDDO
  read(338,*)
ENDDO
where NQBOUN is the number of rows and NUMBOUNNODES(I) is the number of columns in a row. (I always have trouble remembering which is 32x2 and which is 2x32.)
You can make it even shorter using the implied do:
DO I=1,NQBOUN
  read(338,2001) ( NODEBOUN(i,j), j=1,NUMBOUNNODES(I) )
  write(*,*) ( 'BOUNDARY NODES', NODEBOUN(i,j), j=1,NUMBOUNNODES(I) )
ENDDO
or even
DO I=1,NQBOUN
  read(338,2001) NODEBOUN(i,:)
  write(*,*) 'BOUNDARY NODES', NODEBOUN(i,1:NUMBOUNNODES(I))
ENDDO
All of these use Fortran 90 features.
