Reading data from a large file - parallel-processing

I have a text file which is as follows.
0.031 0.031 0.031 1.4998 0.9976 0.5668 0.9659
0.062 0.031 0.031 0.9620 0.7479 0.3674 0.4806
and so on......
This is a 32^3 grid, which means there are 32768 lines. Each line has 7 columns, and I need to read each column into a separate 1D array.
The Fortran code looks like this:
open(unit=1,file='32data.txt') ! that's the filename
do i = 1,32768   ! one line per grid point (32^3 = 32768)
   read(1,*) x(i),y(i),z(i),norm(i),xv(i),yv(i),zv(i)
end do
close(1)
I want to know how to parallelize this in MPI when a bigger file is given (say 512^3). I need to read in part of the data at a time and work on it (to reduce the workload per process, and also because the master process won't have enough local memory to hold it all).
How do I start by sending pieces of data at a time?

Usually, the input/output part of parallel programs is not parallel.
Option 1: Split the file ahead of time, as norio suggests. This preprocessing is not parallel. It is advantageous if your nodes have their own file systems; if your cluster has a shared file system, all the nodes are going to fight for file access at start-up.
Option 2: As the master reads the file, it distributes the data and then forgets it, so it doesn't run out of memory (see the sketch after this list).
Option 3: Each node scans the entire file, ignoring the lines not assigned to it.
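The original answer stops there; as a rough, non-authoritative illustration of Option 2, here is an mpi4py sketch (the question's code is Fortran, but the pattern carries over directly). The file name, batch size, and the assumption of at least two MPI ranks are illustrative, not from the original post.

# Sketch of Option 2: rank 0 streams the file, hands out fixed-size batches
# of parsed lines round-robin, and keeps nothing in memory afterwards.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()          # assumes at least 2 ranks (1 master + workers)

LINES_PER_BATCH = 100000        # tuning knob, not from the original question

if rank == 0:
    dest = 1
    batch = []
    with open("512data.txt") as f:              # hypothetical 512^3 input file
        for line in f:
            # each row holds the 7 columns: x, y, z, norm, xv, yv, zv
            batch.append([float(v) for v in line.split()])
            if len(batch) == LINES_PER_BATCH:
                comm.send(batch, dest=dest)     # hand off, then forget
                batch = []
                dest = dest % (size - 1) + 1    # round-robin over workers
    if batch:
        comm.send(batch, dest=dest)
    for w in range(1, size):
        comm.send(None, dest=w)                 # sentinel: no more data
else:
    while True:
        batch = comm.recv(source=0)
        if batch is None:
            break
        # ... work on this batch of rows here ...

A real code would probably have idle workers request the next batch instead of pushing batches round-robin, but the send/receive pattern is the same.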

Related

A better alternative to chunking a file line by line

The closest question I could find that resembles what I am asking is this one:
Linux shell command to read/print file chunk by chunk
My system conditions
A cluster with a shared filesystem served over NFS
Disk capacity = 20 TB
File Description
Standard FASTQ files used in large-scale genomics analysis
A file containing n lines, i.e. n/4 records (each FASTQ record is 4 lines)
Typical file size is 100-200 GB
I keep them bzip2-compressed at the maximum compression level (-9)
When analyzing these files, I use SGE for my jobs, so I analyze them in chunks of 1M or 10M records.
So to divide the file I use
<(bzcat [options] filename) > Some_Numbered_Chunk
to split these files into smaller chunks for efficient processing over SGE.
Problems
When dividing these files up, the chunking step itself represents a significant amount of computation time:
i. because there are a lot of records to sift through, and
ii. because NFS I/O is not as fast as the bzcat pipe I am using for chunking, so NFS limits the speed at which a file can be chunked.
Many times I have to analyze 10-20 of these files together, and unpacked they aggregate to nearly 1-2 TB of data. So on a shared system this is a very big limiting step and causes space crunches, as others have to wait for me to go back and delete these files. (No, I cannot delete these files as soon as the process has finished, because I need to manually make sure that all processes completed successfully.)
So how can I optimize this, using other methods, to lower the computation time and make the chunks take up less hard disk space?
Several options spring to mind:
Increase the bandwidth of your storage (add more physical links).
Store your data in smaller files (one way to do this is sketched after this list).
Increase your storage capacity so you can reduce your compression ratio.
Do your analysis off your shared storage (get the file over NFS, write to a local disk).
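None of the answers came with code; as a hedged illustration of the "smaller files" route (not from the thread), here is a Python sketch that streams a bzip2-compressed FASTQ file and rewrites it as smaller bzip2-compressed chunks of a fixed number of records in a single pass, so no uncompressed intermediate ever lands on the NFS share. The chunk size, file names, and compression level are assumptions.

# Illustrative sketch: split a .bz2 FASTQ file into smaller .bz2 chunks in one
# streaming pass, without ever writing uncompressed data to disk.
import bz2

RECORDS_PER_CHUNK = 1000000     # e.g. 1M records per SGE task (assumption)
LINES_PER_RECORD = 4            # a FASTQ record is 4 lines

def split_fastq_bz2(in_path, out_prefix, compresslevel=9):
    lines_per_chunk = RECORDS_PER_CHUNK * LINES_PER_RECORD
    with bz2.open(in_path, "rt") as fin:
        chunk_no, written, fout = 0, 0, None
        for line in fin:
            if fout is None:
                out_path = "%s.%04d.fastq.bz2" % (out_prefix, chunk_no)
                fout = bz2.open(out_path, "wt", compresslevel=compresslevel)
            fout.write(line)
            written += 1
            if written == lines_per_chunk:
                fout.close()
                fout, written, chunk_no = None, 0, chunk_no + 1
        if fout is not None:
            fout.close()

# Example: split_fastq_bz2("sample.fastq.bz2", "sample_chunk")

The chunks stay compressed, so they take far less space than unpacked copies; the trade-off is that each SGE task has to decompress its own chunk before (or while) processing it.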

Does Hadoop create InputSplits in parallel?

I have a large text file, around 13 GB in size, that I want to process using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits, which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to the datanodes, or does it read chunks of, say, 50 MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split the file into 50 MB chunks to speed up the processing? There are many questions about the appropriate split size for mapper tasks, but not about the exact split process itself.
Thanks
InputSplits are created on the client side, and an InputSplit is just a logical representation of the file, in the sense that it only contains the file path and the start and end offsets (calculated in the LineRecordReader initialize function). Calculating this logical representation does not take much time, so there is no need to split the file into chunks yourself; the real execution happens on the mapper side, where the work is done in parallel. The client then places the InputSplit information into HDFS, and the JobTracker takes it from there, allocating a TaskTracker for each split. One mapper's execution does not depend on another's: each mapper knows exactly where it has to start processing its split, so the mapper executions run in parallel.
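To make the "logical representation" point concrete, here is a minimal illustrative sketch in plain Python (not Hadoop's actual FileInputFormat code): a split is just a path plus start/length offsets, which is cheap to compute even for the 13 GB file in the question.

# Illustrative only: a logical "input split" is nothing more than
# (path, start offset, length); no file data is read to compute it.
import os

def logical_splits(path, split_size):
    size = os.path.getsize(path)
    splits = []
    start = 0
    while start < size:
        length = min(split_size, size - start)
        splits.append((path, start, length))
        start += length
    return splits

# A 13 GB file with 128 MB splits yields only ~104 small tuples:
# len(logical_splits("big.txt", 128 * 1024 * 1024))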
I suppose you want to process the file using MapReduce, not Hadoop as such. Hadoop is a platform that provides tools to process and store large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The block size is defined in hdfs-site.xml as dfs.block.size. For example, if dfs.block.size is set to 128 MB, your input file will be split into 128 MB blocks. This is how HDFS stores the data internally; to the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You need not split the file into chunks yourself; just store the file in HDFS and it will do the rest for you.
First, let us understand what is meant by an input split.
When your text file is divided into blocks of 128 MB (the default size) by HDFS, suppose the 10th line of the file is cut in two: the first half of the line is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes the input split here) is not complete, so it carries the second half of the 10th line into the first input split. Which implies:
1) 1st input split = 1st block + 2nd half of the 10th line from the 2nd block
2) 2nd input split = 2nd block - 2nd half of the 10th line from the 2nd block.
This is an inbuilt process of Hadoop: you cannot change or set the size of an input split. The block size of Hadoop v2 is 128 MB by default. You can increase it during installation, but you cannot decrease it.
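To illustrate the convention this answer describes, here is a minimal plain-Python reader (not Hadoop's LineRecordReader, just a sketch of the same idea): every split except the first discards its first, possibly partial, line and is allowed to read one line past its end, so a boundary-straddling line like the 10th line above is processed exactly once.

# Illustrative only: how a line that straddles a split boundary gets assigned.
def read_split(path, start, length):
    """Yield the complete lines that belong to the split [start, start+length)."""
    end = start + length
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            # Not the first split: the (possibly partial) first line belongs
            # to the previous split, so skip ahead to the next line boundary.
            f.readline()
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break            # end of file
            yield line           # may run past 'end' to finish the last line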

How and where can I edit the InputSplit size in CDH 4.7? By default it is 64 MB, but I want to set it to 1 MB

How and where can I edit the input split size in CDH 4.7? By default it is 64 MB, but I want to set it to 1 MB, because my MR job is running slowly and I want to increase its speed. I guess I need to edit the core-site property io.file.buffer.size, but CDH 4.7 does not allow me to edit it as it is read-only.
The parameter "mapred.max.split.size" which can be set per job individually is what you looking for.
You don't change "dfs.block.size" because Hadoop Works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead. Compare a 1gb file broken into sixteen 64mb blocks, and 10.000 or so 100kb files. The 10.000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
You can also set it directly on the command line using -D mapred.max.split.size=.. so you don't have to change any configuration file permanently.

Slow MapReduce performance when using Custom Input Format

I am having an issue with MapReduce. I have to read multiple CSV files.
1 CSV file outputs 1 single row.
I cannot split the CSV files in my custom input format, because the rows in the CSV files are not in the same format. For example:
row 1 contains A, B, C
row 2 contains D, E, F
my output value should be like A, B, D, F
I have 1100 CSV files so 1100 splits are created and hence 1100 Mappers are created. The mappers are very simple and they shouldn't take much time to process.
But the 1100 input files take a lot of time to process.
Can anyone please guide me on what I should look at, or tell me whether I am doing anything wrong in this approach?
Hadoop performs better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block.)
The technical reasons for this are well explained in this Cloudera blog post
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
You can refer to this link for methods to solve this issue.

Is there any way to control InputSplits in MapReduce?

I have lots of small (150-300 KB) text files, about 9000 per hour, and I need to process them through MapReduce. I created a simple MR job that processes all the files and creates a single output file. When I ran this job on one hour of data, it took 45 minutes. When I started digging into the reason for the poor performance, I found that the job creates as many input splits as there are files, which I am guessing is one reason for the poor performance.
Is there any way to control the input splits so that, say, 1000 files are handled by one input split/map?
Hadoop is designed for a small number of huge files, not the other way around. There are some ways to get around this, such as preprocessing the data or using CombineFileInputFormat (see the sketch below).
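The answer above is from the thread; the sketch below is not, and it only illustrates the preprocessing route (merging many small files into a few large ones before loading them into HDFS). CombineFileInputFormat is the other route and is configured on the Java side of the job instead. Paths, file pattern, and target size are assumptions.

# Illustrative preprocessing sketch: merge many small text files into a few
# large ones so each mapper gets a substantial amount of input.
import glob

TARGET_BYTES = 128 * 1024 * 1024        # aim for roughly one HDFS block each

def merge_small_files(input_glob, out_prefix):
    out_no, written, fout = 0, 0, None
    for path in sorted(glob.glob(input_glob)):
        if fout is None or written >= TARGET_BYTES:
            if fout is not None:
                fout.close()
                out_no += 1
            fout = open("%s.%04d.txt" % (out_prefix, out_no), "wb")
            written = 0
        with open(path, "rb") as fin:
            data = fin.read()           # each input file is only 150-300 KB
        fout.write(data)
        written += len(data)
    if fout is not None:
        fout.close()

# e.g. merge_small_files("incoming/*.txt", "merged/hour01"),
# then load the merged files into HDFS and run the MR job on those.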
