How to configure Pivotal Hadoop - hadoop

We are working on a Greenplum cluster with HAWQ installed. I would like to run a hadoop-streaming job. However, it seems that Hadoop is not configured or started. How can I start mapred to make sure that I can use hadoop-streaming?

Try the command below to get a word count:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input <inputDir> \
-output <outputDir> \
-mapper /bin/cat \
-reducer /bin/wc
If that gives you the correct word count, then it's working; otherwise, check the error that the command prints.
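If you want per-word counts rather than the single line/word/byte totals that /bin/wc prints, you can plug your own scripts into the same streaming command. A minimal sketch in Python (mapper.py and reducer.py are hypothetical file names, not part of the original answer):
#!/usr/bin/env python
# mapper.py - hypothetical streaming mapper: emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - hypothetical streaming reducer: sum the counts per word
# (streaming sorts reducer input by key, so equal words arrive adjacent)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
You would then pass them with -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py in place of /bin/cat and /bin/wc; the scripts need a shebang line and execute permission, or invoke them as -mapper "python mapper.py".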

First, make sure that the cluster is started and working. To check, go to the Pivotal Command Center (the URL is usually https://<admin_node>:5443/ ) and look at the cluster status, or ask your administrator to do so.
Next, make sure that you have the PHD client libraries installed on the machine from which you are trying to start your job. Run "rpm -qa | grep phd" to check.
Next, if the cluster is running and the libraries are installed, you can run the streaming job like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar -mapper /bin/cat -reducer /bin/wc -input /example.txt -output /testout
The /example.txt file should already exist on HDFS.

I did this a while back on Greenplum/Pivotal Hadoop:
--1. For installation
icm_client deploy
e.g., icm_client deploy HIVE
--2. For status
HDFS
service hadoop-namenode status
service hadoop-datanode status
service hadoop-secondarynamenode status
MapRed
service hadoop-jobtracker status
service hadoop-tasktracker status
Hive
service hive-server status
service hive-metastore status
--3. For start/stop/restart
service hive-server start
service hive-server stop
service hive-server restart
Note: You will find all of these commands and more details in the installation guide, which may also be available online.
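If you want to check all of the daemons listed above in one pass, a small script can loop over them. This is a sketch only; it assumes the service names above match your installation and that you have permission to run the service command:
#!/usr/bin/env python
# check_phd_services.py - hypothetical helper: report the status of the
# daemons listed above by calling "service <name> status" for each one.
import subprocess

SERVICES = [
    "hadoop-namenode", "hadoop-datanode", "hadoop-secondarynamenode",
    "hadoop-jobtracker", "hadoop-tasktracker",
    "hive-server", "hive-metastore",
]

for name in SERVICES:
    # "service <name> status" exits non-zero when the daemon is not running
    rc = subprocess.call(["service", name, "status"])
    print("%s: %s" % (name, "running" if rc == 0 else "not running"))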
Thanks,

Related

Error when running python map reduce job using Hadoop streaming in Google Cloud Dataproc environment

I want to run a Python MapReduce job in Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python scripts, input file, and job output are located in Google Cloud Storage.
I tried to run this command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -input gs://bucket-name/intro_to_mapreduce/purchases.txt -output gs://bucket-name/intro_to_mapreduce/output_prod_cat
But I got this error output:
File:
/home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py
does not exist, or is not readable.
Try -help for more information Streaming Command Failed!
Is the Cloud Storage connector not working in Hadoop streaming? Is there any other way to run a Python MapReduce job using Hadoop streaming with the Python script and input file located in Google Cloud Storage?
Thank you.
The -file option of hadoop-streaming only works for local files. Note, however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option. Using the generic -files option allows you to specify a remote (hdfs:// or gs://) file to stage; the staged files are copied into each task's working directory, which is why -mapper and -reducer can then refer to them by bare filename. Note also that generic options must precede application-specific flags.
Your invocation would become:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper mapper_prod_cat.py \
-reducer reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat

Spark-submit not working when application jar is in hdfs

I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar in my local filesystem, it works. However, when I copied my application jar to a directory in HDFS, I get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using Hadoop version 2.6.0 and Spark version 1.2.1.
The only way it worked for me was when I was using
--master yarn-cluster
To make a jar stored on HDFS accessible to the Spark job, you have to run the job in cluster mode:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, there is a Spark JIRA for client mode, which is not supported yet:
SPARK-10643: Support HDFS application download in client mode spark submit
There is a workaround: you could mount the HDFS directory that contains your application jar as a local directory.
I did the same (with Azure Blob Storage, but it should be similar for HDFS).
Example command for Azure (wasb):
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark-submit command, provide the path from the mount above:
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
This works for me; I am using Hadoop 3.3.1 and Spark 3.2.1, and I am able to read the file from HDFS.
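The wordcount.py referenced in the command above is not shown in the post. As a rough idea, a minimal PySpark word-count script along these lines would fit that invocation (a sketch only; the input path is a placeholder, since the command passes no arguments):
#!/usr/bin/env python
# wordcount.py - hypothetical sketch of the script submitted above.
# The default input path below is a placeholder; point it at a file on your HDFS.
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path = sys.argv[1] if len(sys.argv) > 1 else "hdfs://localhost:9000/user/input.txt"
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    # read the text file, split into words, and count occurrences per word
    lines = spark.read.text(input_path).rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s\t%d" % (word, count))
    spark.stop()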
Yes, it has to be a local file. I think that's simply the answer.

How to report JMX from Spark Streaming on EC2 to VisualVM?

I have been trying to get a Spark Streaming job, running on an EC2 instance, to report to VisualVM using JMX.
As of now I have the following config file:
spark/conf/metrics.properties:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
And I start the Spark Streaming job like this (I added the -D options afterwards in the hope of getting remote access to the EC2 instance's JMX):
terminal:
spark/bin/spark-submit --class my.class.StarterApp --master local --deploy-mode client \
project-1.0-SNAPSHOT.jar \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
There are two issues with the spark-submit command line:
local - you must not run a Spark Streaming application with the bare local master URL, because there will be no free threads to run your computations (jobs); you need two, i.e. one for the receiver and another for the driver. You should see the following WARN in the logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1
in local mode if you have receivers to get data, otherwise Spark jobs
will not get resources to process the received data.
-D options are not picked up by the JVM, as they're given after the Spark Streaming application jar and effectively become its command-line arguments. Put them before project-1.0-SNAPSHOT.jar and start over (you have to fix the above issue first!).
spark-submit --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8090 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false" /path/example/src/main/python/pi.py 10000
Note: the configuration format is --conf "params"; tested under Spark 2.+.

How to submit a Hadoop streaming job and check execution history with Hadoop 2.x

I am a newbie to Hadoop. In Hadoop 1.x, I can submit a Hadoop streaming job from the master node and check the result and execution time from the NameNode web UI.
The following is sample code for Hadoop streaming in Hadoop 1.x:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
However, in Hadoop 2.x the JobTracker has been removed. How can I get the same feature in Hadoop 2.x?
In Hadoop 2.0, you can view the jobs in multiple ways:
1) View the jobs from the ResourceManager UI: ResourceManagerHostname:8088/cluster
2) View the jobs from HUE: HUEServerHostname.com:8888/jobbrowser/
3) From the command line (once the job is completed):
usage: yarn logs -applicationId <application ID> [OPTIONS]
general options are:
-appOwner <Application Owner>   AppOwner (assumed to be current user if not specified)
-containerId <Container ID>     ContainerId (must be specified if node address is specified)
-nodeAddress <Node Address>     NodeAddress in the format nodename:port (must be specified if container id is specified)
Example: yarn logs -applicationId application_1414530900704_0005
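If you would rather script the check than use the UIs above, the ResourceManager exposes the same application list over its REST API on the same port 8088. A minimal sketch (the hostname is a placeholder):
#!/usr/bin/env python3
# list_finished_apps.py - hypothetical helper: list finished YARN applications
# via the ResourceManager REST API (same data as the :8088 web UI above).
import json
from urllib.request import urlopen

RM = "http://ResourceManagerHostname:8088"  # placeholder hostname

with urlopen(RM + "/ws/v1/cluster/apps?states=FINISHED") as resp:
    data = json.load(resp)

for app in (data.get("apps") or {}).get("app", []):
    # id, finalStatus, elapsedTime and name are standard fields of this API
    print("%s  %-10s  %8d ms  %s" % (app["id"], app["finalStatus"],
                                     app["elapsedTime"], app["name"]))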

Streaming job fails in HDP 2.0

I am trying to run a streaming job as below in an HDP 2.0.6 cluster.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar -input /USER/city.txt -output /USER/streamout -mapper /bin/cat -reducer /bin/cat -numReduceTasks 2
(Reference: http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/)
On executing the command, I am getting the error below:
Streaming Command Failed!
I couldn't find any other details in the console output that would help me track down the issue.
However, the same command runs without any errors in my local installation of hadoop-2.2.0.2.0.6.0-76 (the same Hadoop version as in the cluster, except that the cluster is HDP installed using Ambari, while locally the individual Hadoop components were installed manually from tarballs).
Has anyone come across such an issue, or has any idea about the root cause?
Any suggestions would really be appreciated.
Thanks in advance.
