BWA tool with Hadoop Streaming

Burrows-Wheeler Aligner (BWA) is a bioinformatics tool (algorithm) that maps short nucleotide sequences to a reference genome. I have tried to run BWA using Hadoop Streaming but am getting an error.
Command:
hadoop/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar -input /user/hduser/bwainput/chr21.fa -output /user/hduser/bwa_output17 -mapper "/home/hduser/Desktop/bwa-0.7.5a/bwa index /user/hduser/bwainput/chr21.fa" -file /home/hduser/Desktop/bwa_input/chr21.fa
Error Message:
INFO streaming.StreamJob: Tracking URL: http://ubuntu:50030/jobdetails.jsp?jobid=job_201401230236_0007
ERROR streaming.StreamJob: Job not successful.
Error: # of failed Map Tasks exceeded allowed limit. FailedCount:1
INFO streaming.StreamJob: killJob...
Please suggest how to resolve this issue. Thanks for your help.

You can run the bwa mem tool with Hadoop Streaming using the following command:
hduser@ubuntu:~/apps/hadoop$ bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar \
-input /user/hduser/fastq/ERR091571.fastq \
-output /user/hduser/bwa_output33 \
-mapper 'bwa mem -p s_suis.fa -' \
-reducer 'cat' \
-file bwa -file s_suis.fa -file s_suis.fa.amb -file s_suis.fa.ann -file s_suis.fa.bwt -file s_suis.fa.pac -file s_suis.fa.sa \
-numReduceTasks 1
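The reference index files shipped with -file (s_suis.fa.amb, .ann, .bwt, .pac, .sa) are what bwa index produces; a minimal sketch of preparing them on the submitting machine beforehand (assuming bwa is on your PATH and s_suis.fa is in the working directory):
# Build the BWA index once, locally; the resulting files are exactly
# the ones the streaming command above ships to every mapper via -file.
bwa index s_suis.fa
# Should list s_suis.fa.amb, .ann, .bwt, .pac and .sa
ls s_suis.fa.*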

Related

Permission error when trying to run a mapreduce Hadoop job

I'm getting an error when trying to run a Hadoop job. The command that I'm trying to run is the following from /root/folderX:
[root@hadoop folderX]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.10.0-1.jar
-input /user/cxxx/txxx/uxxx.txt
-output /user/cxxx/txxx/count
-file map.py
-file reduce.py
-mapper map.py
-combiner reduce.py
-reducer reduce.py
I see in a part of the error the following message:
Error streaming.StreamJob: Error Launching job : Permission denied: user=root,
access=WRITE, inode="user":hdfs:drwxr-xr-x
Running the command hadoop fs -ls /user I get the following
drwxr-xr-x - root hdfs 0 2016-11-01 10:10 /user/cxxx
Any ideas on what I'm doing wrong?
Thanks
Try the command below:
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.10.0-1.jar
-input /user/cxxx/txxx/uxxx.txt
-output /user/cxxx/txxx/count
-file map.py
-file reduce.py
-mapper map.py
-combiner reduce.py
-reducer reduce.py
I managed to resolve the problem with the following statement:
sudo -u hdfs hadoop fs -chmod -R 777 /user/cxxx
I'm not sure how wise this is to do, though.
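A less drastic alternative (a sketch, assuming the job really should be submitted as root and that hdfs is the HDFS superuser) is to give root its own home directory instead of opening up /user/cxxx to everyone:
# Create /user/root and hand it to root, so the job's staging files
# can be written there without loosening permissions elsewhere.
sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root /user/root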

hadoop cp vs streaming with /bin/cat as mapper and reducer

I am new to Hadoop and have a very basic question about hadoop copy (cp) vs Hadoop Streaming when /bin/cat is used as the mapper and reducer.
hadoop -input <input> -output <output> \
-mapper /bin/cat -reducer /bin/cat
I believe the above command would copy the files (how is it different from hadoop cp?); correct me if my understanding is wrong.
They do roughly the same thing, but in a different fashion:
hadoop cp just invokes the Java HDFS API and performs a copy to the specified location, which is much faster than the streaming solution.
Hadoop streaming, on the other hand (see the example command below), kicks off a MapReduce job. Like any other MapReduce job it has to go through the map -> sort & shuffle -> reduce phases, which can take a long time to complete depending on the size of your input dataset. Because of the default sort & shuffle phase, your input data also ends up sorted in the output directory.
hadoop jar /path/to/hadoop-streaming.jar \
-input /input/path \
-output /output/path \
-mapper /bin/cat \
-reducer /bin/cat
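For comparison, the plain HDFS copy is a single client-side command (using the same placeholder paths as above):
# Copies the data directly through the HDFS client; no MapReduce job is started.
hadoop fs -cp /input/path /output/path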

Does hadoop auto-copy input files not on HDFS?

Using hadoop streaming:
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file mapper.rb -mapper mapper.rb -file reducer.rb -reducer reducer.rb -input textfile.txt -output output
Assuming the directory I am in is "/home/user/sei/Documents" and the textfile.txt
1) is in the same folder as the directory I am currently in
2) I did not use -copyFromLocal to put textfile.txt into HDFS
Does hadoop automatically copy the input files (in this case textfile.txt) to some location on HDFS (i.e. "/user/sei/textfile.txt") upon execution, to use for processing? Does this apply to all hadoop commands (i.e. hadoop jar jarfile myfilename)?
No, it does not copy the files into HDFS; you have to do that yourself. If you are running a single-node or pseudo-distributed cluster on one machine, you should be OK with a local file path. But if you are running a distributed cluster, the mappers and reducers will not be able to find that file.
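A sketch of the manual step, reusing the paths from the question (the HDFS destination /user/sei/textfile.txt is the one assumed above):
# Upload the local input into HDFS first ...
hadoop fs -put textfile.txt /user/sei/textfile.txt
# ... then point the streaming job at the HDFS path.
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-file mapper.rb -mapper mapper.rb \
-file reducer.rb -reducer reducer.rb \
-input /user/sei/textfile.txt -output output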

How do I show the exact job detail in hadoop?

Suppose I run hadoop jar ... -mapper path_to_a -reducer path_to_b and its job id is job_id_xxx.
Conversely, how can I recover something like hadoop jar ... -mapper path_to_a -reducer path_to_b from job_id_xxx?
If the job is still running, you can look up the parameters with ps, as you would for any other shell command.
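For example, on the machine that submitted the job (a sketch; the grep pattern is only an assumption about what appears in the command line):
# Show the full 'hadoop jar ... -mapper ... -reducer ...' command line
# of any streaming job still running on this box.
ps -ef | grep '[h]adoop-streaming'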

Starting jobs with direct calls to Hadoop from within SSH

I've been able to kick off job flows using the elastic-mapreduce Ruby library just fine. Now I have an instance which is still 'alive' after its jobs have finished. I've logged in to it using SSH and would like to start another job, but each of my various attempts has failed because hadoop can't find the input file. I've tried storing the input file locally and on S3.
How can I create new hadoop jobs directly from within my SSH session?
The errors from my attempts:
(first attempt using local file storage, which I'd created by uploading files using SFTP)
hadoop jar hadoop-0.20-streaming.jar \
-input /home/hadoop/mystic/search_sets/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:33:57 ERROR streaming.StreamJob: Error Launching job :Input path does not exist: hdfs://ip-xx-xxx-xxx-xxx.us-west-1.compute.internal:9000/home/hadoop/mystic/search_sets/test_sample.txt
(second attempt using s3):
hadoop jar hadoop-0.20-streaming.jar \
-input s3n://xxxbucket1/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:26:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: s3n://xxxbucket1/test_sample.txt
The first will not work. Hadoop will look for that location in HDFS, not local storage. It might work if you use the file:// prefix, like this:
-input file:///home/hadoop/mystic/search_sets/test_sample.txt
I've never tried this with streaming input, though, and it probably isn't the best idea even if it does work.
The second (S3) should work. We do this all the time. Make sure the file actually exists with:
hadoop dfs -ls s3n://xxxbucket1/test_sample.txt
Alternatively, you could put the file in HDFS and use it normally. For jobs on EMR, though, I usually find S3 to be the most convenient.
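A sketch of that HDFS route, reusing the paths from the question (the /user/hadoop/mystic directory name is just an assumption):
# Copy the input from the master node's local disk into HDFS ...
hadoop fs -mkdir /user/hadoop/mystic
hadoop fs -put /home/hadoop/mystic/search_sets/test_sample.txt /user/hadoop/mystic/
# ... then point the streaming job at the HDFS path instead.
hadoop jar hadoop-0.20-streaming.jar \
-input /user/hadoop/mystic/test_sample.txt \
-output /user/hadoop/mystic/test_sample_output \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py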
