Merging MapReduce output - hadoop

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140102-r-00000.txt
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory /output-complete/, where the 20140102-r-00000.txt file from /output-update replaces the original file from the /output directory and the "-r-0000x" suffix is removed from every file name. The two original directories should then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140102.txt
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands, or do I need to write a Java program that traverses both directories and applies the logic?

You can use Pig:
get_data = load '/output*/20140102*.txt' using Loader();
store get_data into '/output-complete/20140102.txt';
Or an HDFS shell command:
hadoop fs -cat '/output*/20140102*.txt' > output-complete/20140102.txt
If single quotes do not work, try double quotes.

You can use the HDFS shell command -getmerge to merge HDFS files into a single local file.
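The rename-and-override logic from the question can also be sketched with plain shell tools. The following is a minimal sketch, not a definitive implementation: the function names strip_part_suffix and plan_merge are hypothetical, paths are assumed to contain no whitespace, and the script only prints the hdfs dfs -mv commands so they can be reviewed before anything is moved.

```shell
#!/bin/sh
# Strip the reducer suffix "-r-NNNNN" from a file name, keeping the extension,
# e.g. 20140102-r-00000.txt becomes 20140102.txt.
strip_part_suffix() {
    printf '%s\n' "$1" | sed 's/-r-[0-9][0-9]*\(\.[^.]*\)$/\1/'
}

# Read HDFS paths on stdin and print one "hdfs dfs -mv" command per target name.
# When two paths map to the same target, the one listed later wins, so list the
# original directory first and the update directory second.
plan_merge() {
    dest=$1
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        name=$(strip_part_suffix "${path##*/}")
        printf '%s\t%s\n' "$name" "$path"
    done | awk -F '\t' -v dest="$dest" '
        !($1 in src) { order[++n] = $1 }   # remember first-seen order of targets
        { src[$1] = $2 }                   # last path for a target wins
        END {
            for (i = 1; i <= n; i++)
                printf "hdfs dfs -mv %s %s/%s\n", src[order[i]], dest, order[i]
        }'
}
```

Assuming a Hadoop version whose -ls supports the -C (paths only) option, the plan could be produced with { hdfs dfs -ls -C /output; hdfs dfs -ls -C /output-update; } | plan_merge /output-complete and piped to sh once it looks right. Superseded originals stay behind in /output and would still need to be removed separately.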

Related

Shell Script - Iterate through each line in text file and rename HDFS file

I have a text file in HDFS which would have records like below. The number of lines in file may vary every time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the following HDFS directory and file structure.
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using a shell script, I can count the number of files in the HDFS directory and the number of lines in my HDFS text file. I only proceed with the process if the number of files in the directory matches the number of records in the text file.
Now I am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
The second file should be renamed to hdfs://myfolder/file_name_2.csv and the third file to hdfs://myfolder/file_name_3.csv.
I am having difficulty looping through both the text file and the files in the HDFS directory at the same time.
Is there an optimal way to achieve this using a shell script?
You cannot do this directly in HDFS; you need to stream the file contents, then issue individual move commands.
For example:
#!/bin/sh
COUNTER=0   # no spaces around '=' in shell assignments
for file in $(hdfs dfs -cat file.txt)
do
    NAME=$(sed $file ...) # replace text, as needed. TODO: extract the extension
    # braces are required: "$NAME_" would be read as a variable called NAME_
    hdfs dfs -mv "$file" "${NAME}_${COUNTER}.csv" # 'csv' for example - make sure the extension isn't duplicated!!
    COUNTER=$((COUNTER + 1))
done
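Pairing the Nth file with the Nth name can be sketched with paste, which joins two lists line by line. This is only a sketch under assumptions: plan_renames is a hypothetical helper, both lists have already been saved as local files with one entry per line, paths and names contain no whitespace, and every file has an extension. It prints the move commands instead of running them.

```shell
#!/bin/sh
# Pair the Nth current path with the Nth new base name and print the renames.
# $1: local file listing the current HDFS paths, one per line, in the desired order
# $2: local file listing the new base names, one per line
plan_renames() {
    paste "$1" "$2" | while IFS="$(printf '\t')" read -r old new; do
        dir=${old%/*}     # directory part of the current path
        ext=${old##*.}    # extension of the current file, reused for the new name
        printf 'hdfs dfs -mv %s %s/%s.%s\n' "$old" "$dir" "$new" "$ext"
    done
}
```

The two input lists might be produced by saving a sorted directory listing and the output of hdfs dfs -cat on the names file. After checking the printed commands, the output can be piped to sh.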

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like this: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file in a certain order.
The answer is no; that would be an unnecessary complication for a rare use case.
What you should do instead:
Option 1: add some code at the end of your job driver
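int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK; conf, outputDir and mergedFile are placeholders
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(mergedFile)) {
        // ls job output directory, collect part-r-XXXXX files
        for (FileStatus part : fs.globStatus(new Path(outputDir, "part-r-*"))) {
            try (FSDataInputStream in = fs.open(part.getPath())) { // HDFS reader
                IOUtils.copyBytes(in, out, conf, false); // merge into a single file
            }
        }
    }
}
All of the required methods are present in the Hadoop FileSystem API (the classes above come from org.apache.hadoop.fs, plus org.apache.hadoop.io.IOUtils).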
Option 2: add a job to merge the files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline after your main job.
This works faster for big inputs.
If you want the merged output file on the local file system, you can use the hadoop getmerge command to combine the output files of multiple reduce tasks into a single local file. The command is:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

how to insert header file as first line into data file in HDFS without using getmerge(performance issue while copying to local)?

I am trying to insert Header.txt as the first line of Data.txt without using getmerge. Getmerge copies the files to the local file system and merges them into a third file, but I want the whole operation to stay within HDFS.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want the output to be in the Data.txt file itself, like below:
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Can this be done entirely within HDFS?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
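FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
// concat(target, sources) appends the sources to the existing target file
fs.concat(header, new Path[] { data });
// Header.txt now holds the header followed by the data; give it the original name
fs.rename(header, data);
After this, Header.txt ceases to exist and Data.txt starts with the header line. Note that concat is only implemented by some file systems (HDFS among them) and may impose restrictions on the source files, so check the documentation for your Hadoop version.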
Thanks for your reply.
I found another way:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
The drawback is that the cat command reads the complete data, which hurts performance.

Multiple source files for s3distcp

Is there a way to copy a list of files from S3 to HDFS with s3distcp, instead of a complete folder? This is for cases where srcPattern cannot work.
I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to pass multiple source file paths to s3distcp.
The workaround I am currently using is to list all the file names in srcPattern:
hadoop jar s3distcp.jar \
--src s3n://bucket/src_folder/ \
--dest hdfs:///test/output/ \
--srcPattern '.*somefile.*|.*anotherone.*'
Can this work when the number of files is very large, for example around 10,000?
hadoop distcp should solve your problem.
distcp can copy data from S3 to HDFS. It also supports wildcards, and multiple source paths can be given in one command.
http://hadoop.apache.org/docs/r1.2.1/distcp.html
See the usage section at that URL.
Example:
Suppose you have the following files in an S3 bucket (test-bucket) inside the test1 folder:
abc.txt
abd.txt
defg.txt
And inside the test2 folder you have:
hijk.txt
hjikl.txt
xyz.txt
And your HDFS path is hdfs://localhost.localdomain:9000/user/test/
Then the distcp command for a particular pattern is as follows:
hadoop distcp s3n://test-bucket/test1/ab*.txt \
s3n://test-bucket/test2/hi*.txt hdfs://localhost.localdomain:9000/user/test/
Yes, you can. Create a manifest file listing all the files you need and use the --copyFromManifest option.
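For the srcPattern workaround from the question, the alternation does not have to be typed by hand. The sketch below builds it from a list of name fragments; build_src_pattern is a hypothetical helper, only dots are escaped (other regex metacharacters in names are assumed absent), and it does not establish whether s3distcp or the shell's argument-length limit copes with a pattern built from thousands of names.

```shell
#!/bin/sh
# Build a --srcPattern alternation such as '.*somefile.*|.*anotherone.*'
# from name fragments read on stdin, one per line.
build_src_pattern() {
    # 1) escape literal dots, 2) wrap each fragment in .* ... .*,
    # 3) join all lines with '|'
    sed -e 's/\./\\./g' -e 's/.*/.*&.*/' | paste -sd '|' -
}
```

The result can be stored in a variable and passed as the value of --srcPattern, for example pattern=$(build_src_pattern < names.txt), where names.txt is an assumed local file holding the fragments.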

Hadoop read files with following name patterns

This may sound very basic, but I have a folder in HDFS with 3 kinds of files.
For example:
access-02171990
s3.Log
catalina.out
I want my MapReduce job to read only the files whose names begin with access-. How do I do that, either programmatically or by specifying it in the input directory path?
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access*"));
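To preview which names a simple glob such as access* would select, local shell pattern matching can serve as an approximation. This is only a sketch: filter_by_glob is a hypothetical helper, and shell patterns agree with Hadoop globs for basic wildcards like * and ?, but not necessarily for every construct Hadoop supports.

```shell
#!/bin/sh
# Print only the names read on stdin that match the glob pattern given as $1.
filter_by_glob() {
    pattern=$1
    while IFS= read -r name; do
        case $name in
            # unquoted on purpose, so the variable is treated as a pattern
            $pattern) printf '%s\n' "$name" ;;
        esac
    done
}
```

Feeding it the three file names from the question with the pattern 'access*' leaves only access-02171990.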

Resources