Permission error when trying to run a MapReduce Hadoop job

I'm getting an error when trying to run a Hadoop job. The command that I'm trying to run is the following from /root/folderX:
[root@hadoop folderX]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.10.0-1.jar \
-input /user/cxxx/txxx/uxxx.txt \
-output /user/cxxx/txxx/count \
-file map.py \
-file reduce.py \
-mapper map.py \
-combiner reduce.py \
-reducer reduce.py
Part of the error output contains the following message:
ERROR streaming.StreamJob: Error Launching job : Permission denied: user=root, access=WRITE, inode="user":hdfs:drwxr-xr-x
Running the command hadoop fs -ls /user I get the following:
drwxr-xr-x - root hdfs 0 2016-11-01 10:10 /user/cxxx
Any ideas on what I'm doing wrong?
Thanks

Try the command below:
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.10.0-1.jar \
-input /user/cxxx/txxx/uxxx.txt \
-output /user/cxxx/txxx/count \
-file map.py \
-file reduce.py \
-mapper map.py \
-combiner reduce.py \
-reducer reduce.py

I managed to resolve the problem with the following statement:
sudo -u hdfs hadoop fs -chmod -R 777 /user/cxxx
I'm not sure how wise this is to do, though.
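A less drastic sketch, assuming jobs keep being submitted as root (the user named in the error): the complaint is about write access to /user itself, which is typically what happens when the submitting user has no HDFS home directory for the job's staging files. Creating one and handing it over avoids opening /user/cxxx to everyone; the /user/root path and hdfs group below are assumptions based on the error and the listing above, not a confirmed fix.
# Create an HDFS home directory for the submitting user (illustrative paths).
sudo -u hdfs hadoop fs -mkdir /user/root
# Give that user ownership so the job can write its staging data there.
sudo -u hdfs hadoop fs -chown root:hdfs /user/root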

Related

Hadoop directory file to user folder

I have created a folder in the root directory and I'm trying to copy it to HDFS, but I'm getting an error message. These are the steps I have followed:
[root@dh] ls
XXdirectoryXX
[root@dh] sudo -u hdfs hadoop fs -mkdir /user/uname
[root@hd] uname
[root@hd] sudo -u hdfs hadoop fs -chown uname /user/uname
[root@hd] su - uname
[uname@hd] hadoop fs -copyFromLocal XXdirectoryXX/ /user/uname
copyFromLocal: 'XXdirectoryXX/': No such file or directory
Is there a problem with the command or with what I've done, or should I use another command to copy the files over?
I'm using CentOS 6.8 on the machine.
Any ideas?
Thanks
Thanks to the comments I've managed to resolve the issue. Here is the code if it helps someone:
[root@dh] sudo -u hdfs hadoop fs -chown -R root /user/uname
[root@dh] hadoop fs -copyFromLocal XXdirectoryXX/ /user/uname
Regards
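If the "No such file or directory" message comes back, one more thing worth checking (a sketch; /root/XXdirectoryXX is assumed from the listing at the top of the question) is that the local source path is actually visible to whichever user runs the copy, for example by using an absolute path:
# Confirm the local source exists and is readable by the current user.
ls -ld /root/XXdirectoryXX
# Use an absolute local source path so the copy does not depend on the working directory.
hadoop fs -copyFromLocal /root/XXdirectoryXX /user/uname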

Hadoop missing input which is present in HDFS

Evening All,
I'm trying to run a training sample on Hadoop mapreduce, but am receiving an error that the input path does not exist.
16/09/26 05:56:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: hdfs://bigtop1.vagrant:8020/training
However, looking inside the hdfs directory, it's clear that the "training" folder is present.
[vagrant@bigtop1 code]$ hadoop fs -ls
Found 3 items
drwx------ - vagrant hadoop 0 2016-09-26 05:47 .staging
drwxr-xr-x - vagrant hadoop 0 2016-09-26 04:28 hw2
drwxr-xr-x - vagrant hadoop 0 2016-09-26 04:14 training
Using HDFS commands:
[vagrant@bigtop1 code]$ hdfs dfs -ls training
Found 2 items
-rw-r--r-- 3 vagrant hadoop 0 2016-09-26 04:14 training/_SUCCESS
-rw-r--r-- 3 vagrant hadoop 3311720 2016-09-26 04:14 training/part-r-00000
Does anyone know of a possible reason that Hadoop would be missing data that is clearly present?
Invocation below; I had to hide one input (-f):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D mapreduce.job.reduces=5 -files lr -mapper "python lr/mapper.py -n 5 -r 0.4" -reducer "python lr/reducer.py -e 0.1 -c 0.0 -f ####" -input /training/ -output /models
Please change the input parameter to something like this.
From
-input /training/
To
-input training/
When you run $ hadoop fs -ls it shows you the data in the current user's home directory.
Are you sure the path to your data isn't /user/vagrant/?
If the training directory isn't present when you run $ hadoop fs -ls / then you have the path wrong.
Please change the input parameter to something like this:
-input hdfs://<machinename>/user/vagrant/training/
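To see which form of the path the job actually needs, a quick sketch (assuming the vagrant user from the listings above) is to compare the home-relative and absolute locations:
# Relative paths resolve under the current user's HDFS home, /user/vagrant here.
hadoop fs -ls training
# The absolute /training used in the invocation points at the filesystem root instead.
hadoop fs -ls /training
# Spelling out the full path removes the ambiguity.
hadoop fs -ls /user/vagrant/training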

BWA tool with hadoop streaming

Burrows-Wheeler Aligner (BWA) is a bioinformatics tool (algorithm) for mapping short nucleotide sequences to a reference genome. I have tried to run BWA using Hadoop Streaming but am getting an error.
Command:
hadoop/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar -input /user/hduser/bwainput/chr21.fa -output /user/hduser/bwa_output17 -mapper "/home/hduser/Desktop/bwa-0.7.5a/bwa index /user/hduser/bwainput/chr21.fa" -file /home/hduser/Desktop/bwa_input/chr21.fa
Error Message:
INFO streaming.StreamJob: Tracking URL: http://ubuntu:50030/jobdetails.jsp?jobid=job_201401230236_0007
ERROR streaming.StreamJob: Job not successful.
Error: # of failed Map Tasks exceeded allowed limit. FailedCount:1
INFO streaming.StreamJob: killJob...
Please suggest how to resolve this issue. Thanks for your help.
You can run the bwa mem tool with Hadoop Streaming with the help of the following command:
hduser@ubuntu:~/apps/hadoop$ bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar \
-input /user/hduser/fastq/ERR091571.fastq \
-output /user/hduser/bwa_output33 \
-mapper 'bwa mem -p s_suis.fa -' \
-reducer 'cat' \
-file bwa -file s_suis.fa -file s_suis.fa.amb -file s_suis.fa.ann -file s_suis.fa.bwt -file s_suis.fa.pac -file s_suis.fa.sa \
-numReduceTasks 1
Refer to this link for more details.
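If the job keeps failing, a minimal smoke test (a sketch; the bwa_smoke_output path is made up for illustration) can confirm that the streaming jar itself works before debugging the bwa command line:
# Identity mapper and reducer: if this also fails, the problem is in the
# cluster or streaming setup rather than in the bwa invocation.
hadoop/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.1.jar \
-input /user/hduser/bwainput/chr21.fa \
-output /user/hduser/bwa_smoke_output \
-mapper 'cat' \
-reducer 'cat'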

How do I show the exact job detail in hadoop?

Suppose I run hadoop jar ... -mapper path_to_a -reducer path_to_b and its job id is job_id_xxx.
Conversely, how can I get something like hadoop jar ... -mapper path_to_a -reducer path_to_b back from job_id_xxx?
In case the job is still running, you could look up the parameters using ps, as for any shell command.
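For example, a minimal sketch (assuming the job is still running and was launched from a shell on the same machine) that pulls the full command line out of the process table:
# List running "hadoop jar" processes with their full argument lists,
# including -mapper and -reducer; the grep pattern is just an illustration.
ps -ef | grep '[h]adoop jar'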

Starting jobs with direct calls to Hadoop from within SSH

I've been able to kick off job flows using the elastic-mapreduce ruby library just fine. Now I have an instance which is still 'alive' after its jobs have finished. I've logged in to it using SSH and would like to start another job, but each of my various attempts has failed because hadoop can't find the input file. I've tried storing the input file locally and on S3.
How can I create new hadoop jobs directly from within my SSH session?
The errors from my attempts:
(first attempt using local file storage, which I'd created by uploading files using SFTP)
hadoop jar hadoop-0.20-streaming.jar \
-input /home/hadoop/mystic/search_sets/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:33:57 ERROR streaming.StreamJob: Error Launching job :Input path does not exist: hdfs://ip-xx-xxx-xxx-xxx.us-west-1.compute.internal:9000/home/hadoop/mystic/search_sets/test_sample.txt
(second attempt using s3):
hadoop jar hadoop-0.20-streaming.jar \
-input s3n://xxxbucket1/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:26:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: s3n://xxxbucket1/test_sample.txt
The first will not work. Hadoop will look for that location in HDFS, not local storage. It might work if you use the file:// prefix, like this:
-input file:///home/hadoop/mystic/search_sets/test_sample.txt
I've never tried this with streaming input, though, and it probably isn't the best idea even if it does work.
The second (S3) should work. We do this all the time. Make sure the file actually exists with:
hadoop dfs -ls s3n://xxxbucket1/test_sample.txt
Alternately, you could put the file in HDFS and use it normally. For jobs in EMR, though, I usually find S3 to be the most convenient.
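A minimal sketch of that HDFS alternative, reusing the local paths from the question (the HDFS destination under /user/hadoop is an assumption, chosen to match the default EMR user):
# Copy the local input into HDFS first; on this Hadoop version mkdir should create parent directories as needed.
hadoop fs -mkdir /user/hadoop/mystic/search_sets
hadoop fs -put /home/hadoop/mystic/search_sets/test_sample.txt /user/hadoop/mystic/search_sets/
# Then point -input at the HDFS copy instead of the local path.
hadoop jar hadoop-0.20-streaming.jar \
-input /user/hadoop/mystic/search_sets/test_sample.txt \
-output /user/hadoop/mystic/search_sets/test_sample_output \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py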
