How do I show the exact job detail in hadoop? - hadoop

Suppose I run hadoop jar ... -mapper path_to_a -reducer path_to_b and its job id is job_id_xxx.
Conversely, how can I recover something like hadoop jar ... -mapper path_to_a -reducer path_to_b from job_id_xxx?

If the job is still running, you can look up its parameters with ps, just as you would for any other shell command.
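For example, a minimal sketch, assuming the job was launched from the machine you are logged into and is still running (the grep pattern is just an illustration):
ps aux | grep hadoop-streaming | grep -v grep
The full command line, including the -mapper and -reducer arguments, shows up in the arguments of the java process that launched the job.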

Related

hadoop cp vs streaming with /bin/cat as mapper and reducer

I am new to Hadoop and have a very basic question on hadoop copy (cp) vs. hadoop streaming when /bin/cat is used as both mapper and reducer.
hadoop jar /path/to/hadoop-streaming.jar -input <input> -output <output> \
-mapper /bin/cat -reducer /bin/cat
I believe the above command would copy the files (how is it different from hadoop cp?); please correct me if my understanding is wrong.
They do roughly the same thing, but in different ways:
hadoop cp just invokes the Java HDFS API and copies the data to the specified location, which is much faster than the streaming solution.
hadoop streaming, on the other hand (see the example command below), kicks off a MapReduce job. Like any other MapReduce job it has to go through the map -> sort & shuffle -> reduce phases, which can take a long time depending on the size of your input dataset. Because of the default sort & shuffle phase, your input data also ends up sorted in the output directory.
hadoop jar /path/to/hadoop-streaming.jar \
-input /input/path \
-output /output/path \
-mapper /bin/cat \
-reducer /bin/cat
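For comparison, the plain copy referred to above amounts to a single command with no MapReduce job behind it (paths are illustrative):
hadoop fs -cp /input/path /another/path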

BWA tool with hadoop streaming

Burrows-Wheeler Aligner (BWA) is a bioinformatics tool (algorithm) for mapping short nucleotide sequences to a reference genome. I have tried to run BWA using Hadoop Streaming but am getting an error.
Command:
hadoop/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar -input /user/hduser/bwainput/chr21.fa -output /user/hduser/bwa_output17 -mapper "/home/hduser/Desktop/bwa-0.7.5a/bwa index /user/hduser/bwainput/chr21.fa" -file /home/hduser/Desktop/bwa_input/chr21.fa
Error Message:
INFO streaming.StreamJob: Tracking URL: /ubuntu:50030/jobdetails.jsp?jobid=job_201401230236_0007
ERROR streaming.StreamJob: Job not successful.
Error: # of failed Map Tasks exceeded allowed limit. FailedCount:1
INFO streaming.StreamJob: killJob...
Please suggest how to resolve this issue. Thanks for your help.
You can run the bwa mem tool with Hadoop streaming using the following command:
hduser@ubuntu:~/apps/hadoop$ bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar \
-input /user/hduser/fastq/ERR091571.fastq \
-output /user/hduser/bwa_output33 \
-mapper 'bwa mem -p s_suis.fa -' \
-reducer 'cat' \
-file bwa -file s_suis.fa -file s_suis.fa.amb -file s_suis.fa.ann -file s_suis.fa.bwt -file s_suis.fa.pac -file s_suis.fa.sa \
-numReduceTasks 1
Refer to this link for more details.

Does hadoop auto-copy input files not on HDFS?

Using hadoop streaming:
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file mapper.rb -mapper mapper.rb -file reducer.rb -reducer reducer.rb -input textfile.txt -output output
Assume the directory I am in is "/home/user/sei/Documents" and that:
1) textfile.txt is in the same folder as the directory I am currently in
2) I did not use -copyFromLocal to put textfile.txt into HDFS
Does hadoop automatically copy the input files (in this case textfile.txt) to some location on HDFS (e.g. "/user/sei/textfile.txt") upon execution to use for processing? Does this apply to all hadoop commands (e.g. hadoop jar jarfile myfilename)?
No, it does not copy the files into HDFS; you have to do that yourself. If you are running a single-node or pseudo-distributed cluster on one machine, you should be OK with a local file path. But if you are running a distributed cluster, the mappers and reducers will not be able to find that file.
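A minimal sketch of the usual workflow (the HDFS destination /user/sei/textfile.txt is just an illustration):
hadoop fs -put textfile.txt /user/sei/textfile.txt
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-file mapper.rb -mapper mapper.rb \
-file reducer.rb -reducer reducer.rb \
-input /user/sei/textfile.txt -output output
The -file options still point at local scripts (streaming ships them to the cluster for you); only the -input and -output paths are resolved against HDFS.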

Starting jobs with direct calls to Hadoop from within SSH

I've been able to kick off job flows using the elastic-mapreduce ruby library just fine. Now I have an instance which is still 'alive' after its jobs have finished. I've logged in to it using SSH and would like to start another job, but each of my attempts has failed because hadoop can't find the input file. I've tried storing the input file locally and on S3.
How can I create new hadoop jobs directly from within my SSH session?
The errors from my attempts:
(first attempt using local file storage, which I'd created by uploading files using SFTP)
hadoop jar hadoop-0.20-streaming.jar \
-input /home/hadoop/mystic/search_sets/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:33:57 ERROR streaming.StreamJob: Error Launching job :Input path does not exist: hdfs://ip-xx-xxx-xxx-xxx.us-west-1.compute.internal:9000/home/hadoop/mystic/search_sets/test_sample.txt
(second attempt using s3):
hadoop jar hadoop-0.20-streaming.jar \
-input s3n://xxxbucket1/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:26:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: s3n://xxxbucket1/test_sample.txt
The first will not work. Hadoop will look for that location in HDFS, not local storage. It might work if you use the file:// prefix, like this:
-input file:///home/hadoop/mystic/search_sets/test_sample.txt
I've never tried this with streaming input, though, and it probably isn't the best idea even if it does work.
The second (S3) should work. We do this all the time. Make sure the file actually exists with:
hadoop dfs -ls s3n://xxxbucket1/test_sample.txt
Alternatively, you could put the file in HDFS and use it normally. For jobs in EMR, though, I usually find S3 to be the most convenient.
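If you do go the HDFS route, a sketch (the HDFS destination path is illustrative):
hadoop dfs -put /home/hadoop/mystic/search_sets/test_sample.txt /user/hadoop/test_sample.txt
and then pass -input /user/hadoop/test_sample.txt to the streaming job.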

Merging multiple files into one within Hadoop

I get multiple small files in my input directory which I want to merge into a single file, without using the local file system or writing mapreds. Is there a way to do it using hadoop fs commands or Pig?
Thanks!
To keep everything on the grid, use Hadoop streaming with a single reducer and cat as the mapper and reducer (basically a no-op); add compression using the MR flags below if needed.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
If you want compression, add:
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
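For example (paths are illustrative); note that -getmerge writes the merged result to the local filesystem, not to HDFS:
hadoop fs -getmerge /input/dir /tmp/merged.txt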
okay...I figured out a way using hadoop fs commands -
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it...any pitfalls one can think of?
Thanks!
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
You can download this jar from: Get hadoop streaming jar
If you are writing Spark jobs and want a single merged output file (avoiding multiple part files and the associated performance bottlenecks), use this piece of code before saving your RDD:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all the part files into one and save it back to the HDFS location.
Addressing this from an Apache Pig perspective: to merge two files with an identical schema via Pig, the UNION command can be used:
A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
C = UNION A, B;
store C into 'tmp/fileoutput' Using PigStorage('\t');
All of these solutions are equivalent to doing:
hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file
which just means that the local machine's I/O is on the critical path of the data transfer.