Error when running a Python MapReduce job using Hadoop streaming in the Google Cloud Dataproc environment - hadoop

I want to run a Python MapReduce job in Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python scripts, input file, and job output location are all in Google Cloud Storage.
I tried to run this command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
But I got this error output:
File:
/home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py
does not exist, or is not readable.
Try -help for more information
Streaming Command Failed!
Is the Cloud Storage connector not working with Hadoop streaming? Is there any other way to run a Python MapReduce job using Hadoop streaming with the Python script and input file located in Google Cloud Storage?
Thank You

The -file option of hadoop-streaming only works for local files. Note, however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option, which allows us to specify a remote (hdfs / gs) file to stage. Note also that generic options must precede application-specific flags.
Your invocation would become:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper mapper_prod_cat.py \
-reducer reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
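For reference, here is a minimal sketch of what a streaming mapper/reducer pair like mapper_prod_cat.py and reducer_prod_cat.py could look like. The column layout of purchases.txt assumed below (tab-separated records with the product category in field 4 and the sale amount in field 5) is only an illustration, so adjust the field indices to your data:

#!/usr/bin/env python
# mapper_prod_cat.py -- hypothetical sketch: emit (category, amount) pairs.
# Assumed layout of purchases.txt: tab-separated records, product category
# in field 4 and sale amount in field 5; adjust the indices to your data.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 5:
        continue  # skip malformed records
    category, amount = fields[3], fields[4]
    # Hadoop streaming expects key<TAB>value lines on stdout
    print("{0}\t{1}".format(category, amount))

#!/usr/bin/env python
# reducer_prod_cat.py -- hypothetical sketch: sum the amounts per category.
# The shuffle sorts the mapper output by key, so lines with the same key
# arrive grouped together on stdin.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, value = line.split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("{0}\t{1}".format(current_key, total))
        current_key, total = key, 0.0
    total += float(value)

if current_key is not None:
    print("{0}\t{1}".format(current_key, total))

Both scripts need a shebang line and executable permissions (or be invoked as -mapper "python mapper_prod_cat.py"), so that the streaming framework can run them from the task's working directory after -files has staged them there.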

Related

Issue with Hadoop and Google Cloud Storage Connector

I've deployed a Hadoop cluster via the Deployments interface in the Google console (Hadoop 2.x).
My task was to filter data stored in one Google Cloud Storage (GS) bucket and put the results into another, so this is a map-only job with a simple Python script. Note that the cluster and the output bucket are in the same zone (EU).
Leveraging the Google Cloud Storage connector, I run the following streaming job:
hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-file file_1 \
-file mymapper.py \
-input gs://inputbucket/somedir/somedir2/*-us-* \
-output gs://outputbucket/somedir3/somedir2 \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper mymapper.py
What happens is that all the mappers process the data and store the results in a temporary directory in GS, which looks like this:
gs://outputbucket/somedir3/somedir2/_temporary/1/mapper-0000/part-0000.gz
After all the mappers finish, the job progress hangs at 100% map, 0% reduce. Looking at the output bucket with gsutil, I see that the result files are being copied to the destination directory:
gs://outputbucket/somedir3/somedir2
This process takes a very long time and kills the whole benefit of using Hadoop.
My questions are:
1) Is this a known issue, or have I just done something wrong? I couldn't find any relevant info.
2) Am I correct in saying that HDFS would normally move those files to the destination dir, but GS can't perform a move, and thus the files are copied?
3) What can I do to avoid this pattern?
You're almost certainly running into the "Slow FileOutputCommitter" issue, which affects Hadoop 2.0 through 2.6 inclusive and is fixed in 2.7.
If you're looking for a nice managed Hadoop option on Google Cloud Platform, you should consider Google Cloud Dataproc (see its documentation), where we maintain our distro to ensure we pick up patches relevant to Google Cloud Platform quickly. Dataproc indeed configures mapreduce.fileoutputcommitter.algorithm.version so that the final commitJob is fast.
For something more "do-it-yourself", you can use our command-line bdutil tool, which also has the latest update to use the fast FileOutputCommitter.

Setting S3 output file grantees for spark output files

I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files (rdd.saveAsTextFile('<file_dir_name>')). In Hive, I would add a line at the beginning with set fs.s3.canned.acl=BucketOwnerFullControl and that would set the correct permissions. For Spark, I tried running:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
But the permissions do not get set properly on the output files. What is the proper way to pass 'fs.s3.canned.acl=BucketOwnerFullControl', or any of the S3 canned permissions, to the Spark job?
Thanks in advance
I found the solution. In the job, you have to access the JavaSparkContext and from there get the Hadoop configuration and set the parameter there. For example:
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl','BucketOwnerFullControl')
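In context, a minimal PySpark sketch of that approach might look like the following; the input path, output path, and app name are placeholders:

# spark.py -- sketch: set the S3 canned ACL on the Hadoop configuration before writing
from pyspark import SparkContext

sc = SparkContext(appName="canned-acl-example")

# Set the canned ACL before any output is written
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl', 'BucketOwnerFullControl')

rdd = sc.textFile('s3://source-bucket/input/')        # placeholder input path
rdd.saveAsTextFile('s3://destination-bucket/output')  # output objects should carry the ACL above
sc.stop()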
The proper way to pass Hadoop config keys in Spark is to use --conf and prefix the key with spark.hadoop. (so fs.s3.canned.acl becomes spark.hadoop.fs.s3.canned.acl). Your command would look like:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
Unfortunately, I cannot find any reference to this in the official Spark documentation.
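That said, if you would rather keep the setting in code than on the spark-submit command line, the same spark.hadoop. prefix can, as far as I know, also be applied programmatically on the SparkConf before the context is created; this is a sketch, untested on EMR:

# Sketch: set the spark.hadoop.-prefixed key on the SparkConf instead of via --conf
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("canned-acl-example")
        # Spark copies spark.hadoop.* entries into the Hadoop configuration used for I/O
        .set("spark.hadoop.fs.s3.canned.acl", "BucketOwnerFullControl"))

sc = SparkContext(conf=conf)
# Any subsequent saveAsTextFile() should write objects with BucketOwnerFullControl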

How to submit a Hadoop streaming job and check execution history with Hadoop 2.x

I am a newbie to Hadoop. In Hadoop 1.x, I can submit a Hadoop streaming job from the master node and check the result and execution time on the NameNode web UI.
The following is the sample code for Hadoop streaming in Hadoop 1.x:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
However, in Hadoop 2.x the JobTracker is removed. How can I get the same feature in Hadoop 2.x?
In Hadoop 2.0, you can view the jobs in multiple ways:
1) View the jobs from the ResourceManager UI: ResourceManagerHostname:8088/cluster
2) View the jobs from HUE: HUEServerHostname.com:8888/jobbrowser/
3) From the command line (once the job is completed):
usage: yarn logs -applicationId <application ID> [OPTIONS]
general options are:
 -appOwner <Application Owner>   AppOwner (assumed to be current user if not specified)
 -containerId <Container ID>     ContainerId (must be specified if node address is specified)
 -nodeAddress <Node Address>     NodeAddress in the format nodename:port (must be specified if container id is specified)
Example: yarn logs -applicationId application_1414530900704_0005

How to configure Pivotal Hadoop

We are working on a Greenplum cluster with HAWQ installed. I would like to run a hadoop-streaming job. However, it seems that Hadoop is not configured or started. How can I start mapred so that I can use hadoop-streaming?
Try the command below to get a word count:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input <inputDir> \
-output <outputDir> \
-mapper /bin/cat \
-reducer /bin/wc
If that gives you the correct word count, then it's working; otherwise, check the error that's spit out by running this command.
First, make sure that the cluster is started and working. To check this, go to the Pivotal Command Center (usually the link is like this: https://<admin_node>:5443/) and look at the cluster status, or ask your administrator to do so.
Next, make sure that you have the PHD client libraries installed on the machine from which you are trying to start your job. Run "rpm -qa | grep phd".
Next, if the cluster is running and libraries are installed, you can run the streaming job like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar -mapper /bin/cat -reducer /bin/wc -input /example.txt -output /testout
The /example.txt file should exist on HDFS.
I did this long back on Greenplum/Pivotal Hadoop.
--1. For installation:
icm_client deploy
e.g. icm_client deploy HIVE
--2. For status:
HDFS:
service hadoop-namenode status
service hadoop-datanode status
service hadoop-secondarynamenode status
MapRed:
service hadoop-jobtracker status
service hadoop-tasktracker status
Hive:
service hive-server status
service hive-metastore status
--3. For start/stop/restart:
service hive-server start
service hive-server stop
service hive-server restart
Note: You will find all these commands and details in the installation guide, which may be available online somewhere as a Hadoop installation guide.
Thanks,

Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. on running Lucene search on Hadoop

I use each of the records in a big text file to perform a search against a Lucene index, then massage the results as I want and write them to the output.
I'm trying to use Hadoop by putting the big input text file and the pre-created Lucene index onto Hadoop's file system. Then I changed my Java program that does the file processing (read file records, search Lucene, write output) to read records from the Hadoop filesystem and create the Lucene index in memory. The command I use to kick off the Hadoop job is like the one below:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
-libjars lucene-core-3.6.0.jar,hadoop-core-1.0.3.jar,concept.jar \
-mapper "java concept.HadoopConceptRunner" \
-input myBigInputFile \
-output myOutput \
-reducer NONE
Note that "concept.jar" contains concept.HadoopConceptRunner class and this is written by me.
My problem is that I can't get this Hadoop job to run correctly =.=". I got exception like below, and I'm unable to find anything else meaningful that can help me resolve this.
Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
and
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
How can I fix this error?
I think you should not be calling the java command but only give the fully qualified class name you want to be run as the mapper. If the mapper calls 'java concept.HadoopConceptRunner', I guess it would barf since the classpath is not defined and thus the class would not be found ;)
So in short try again like this:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
-libjars lucene-core-3.6.0.jar,hadoop-core-1.0.3.jar,concept.jar \
-mapper "concept.HadoopConceptRunner" \
-input myBigInputFile \
-output myOutput \
-reducer NONE
Also, I think the following is unlikely to work:
-reducer NONE
You could try this instead:
-jobconf mapred.reduce.tasks=0
