I already have Hadoop 1.2 running on my Ubuntu VM, which runs on a Windows 7 machine. I recently installed Pig 0.12.0 on the same Ubuntu VM, using pig-0.12.0.tar.gz downloaded from the Apache website. I have all the variables such as JAVA_HOME, HADOOP_HOME, and PIG_HOME set correctly. When I try to start Pig in local mode, this is what I see:
chandeln@ubuntu:~$ pig -x local
pig: invalid option -- 'x'
usage: pig
chandeln@ubuntu:~$ echo $JAVA_HOME
/usr/lib/jvm/java7
chandeln@ubuntu:~$ echo $HADOOP_HOME
/usr/local/hadoop
chandeln@ubuntu:~$ echo $PIG_HOME
/usr/local/pig
chandeln@ubuntu:~$ which pig
/usr/games/pig
chandeln@ubuntu:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java7/bin:/usr/local/hadoop/bin:/usr/local/pig/bin
Since I am not a Unix expert, I am not sure whether this is the problem, but which pig returns /usr/games/pig instead of the Pig under /usr/local/pig. Is this the root cause of the problem?
Please guide.
I was able to fix the problem by changing the following line in my .bashrc. This gives /usr/local/pig/bin precedence over /usr/games in the PATH, so the correct pig is found first.
BEFORE: export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin
AFTER: export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH
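As a quick sanity check (assuming the same paths as above), you can reload the shell configuration and confirm which pig is now found first:
$ source ~/.bashrc
$ hash -r          # clear any cached command locations, just in case
$ which pig        # should now print /usr/local/pig/bin/pig instead of /usr/games/pig
$ pig -x local     # should start the Grunt shell in local mode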
Related
I am trying to run my Spark job in Airflow. When I execute the command spark-submit --class dataload.dataload_daily /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar in a terminal, it works fine without any issue.
However, when I do the same in Airflow, I keep getting the error
/tmp/airflowtmpKQMdzp/spark-submit-scalaWVer4Z: line 1: spark-submit:
command not found
t1 = BashOperator(task_id='spark-submit-scala',
                  bash_command='spark-submit --class dataload.dataload_daily \
                  /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
                  dag=dag,
                  retries=0,
                  start_date=datetime(2018, 4, 14))
I have my Spark path set in my bash_profile,
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin/:$PATH"
I sourced this file as well. I'm not sure how to debug this; can anyone help me with it?
You could start with bash_command = 'echo $PATH' to see if your path is being updated correctly.
This is because you mention editing the bash_profile, but as far as I know Airflow is run as another user. Since that other user has no changes in their bash_profile, the path to Spark might be missing.
As mentioned here (How do I set an environment variable for airflow to use?) you could try setting the path in .bashrc.
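For example, a minimal sketch (assuming the Spark location from the question, and that the user running the Airflow worker reads ~/.bashrc) would be to add the same exports for that user, or to sidestep PATH entirely by calling spark-submit with its full path in bash_command:
# add to the .bashrc of the user that runs Airflow
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
# or avoid PATH altogether and call the binary directly
/opt/spark-2.2.0-bin-hadoop2.7/bin/spark-submit --class dataload.dataload_daily \
    /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar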
When I launch pyspark, Spark loads properly; however, I end up in a standard Python shell environment.
Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.
>>>
I want to launch into the ipython interpreter.
IPython 5.1.0 -- An enhanced Interactive Python.
In [1]:
How do I do that? I tried modifying my .bash_profile in this way and using the alias:
# Spark variables
export SPARK_HOME="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7"
export PYTHONPATH="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/python/:"
# Spark 2
export PYSPARK_DRIVER_PYTHON=ipython
export PATH=$SPARK_HOME/bin:$PATH
# export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
alias sudo='sudo '
alias pyspark="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/bin/pyspark \
--conf spark.sql.warehouse.dir='file:///tmp/spark-warehouse' \
--packages com.databricks:spark-csv_2.11:1.5.0 \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
--packages org.apache.hadoop:hadoop-aws:2.7.3 \
--packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0"
I also tried navigating to Spark home, where pyspark is located, and launching it directly from there, but again I end up in the Python interpreter.
I found this post: How to load IPython shell with PySpark. The accepted answer looked promising, but I activate a Python 2 environment (source activate py2) before launching Spark, and changing my bash profile in that way attempts to start Spark with Python 3, which I'm not set up to do (it throws errors).
I'm using spark 2.1
Spark 2.1.1
For some reason, typing sudo ./bin/pyspark changes the file permissions of metastore_db/db.lck in a way that causes running ipython and pyspark to fail. From the root of the decompressed directory, try:
sudo chown -v $(id -un) metastore_db/db.lck
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
Another solution is to just re-download and decompress from spark.apache.org. Navigate to the root of the decompressed directory and then:
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
And it should work.
Since asking this question, I have found that a helpful solution is to write bash scripts that load Spark in a specific way. Doing this gives you an easy way to start Spark in different environments (for example, ipython and a Jupyter notebook).
To do this, open a blank script (using whatever text editor you prefer), for example one called ipython_spark.sh.
For this example I will provide the script I use to open Spark with the ipython interpreter:
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=ipython
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages com.databricks:spark-csv_2.11:1.5.0 \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
--packages org.apache.hadoop:hadoop-aws:2.7.3
Note that I have SPARK_HOME defined in my bash_profile, but you could just insert the whole path to wherever pyspark is located on your computer.
I like to put all scripts like this in one place, so I put this file in a folder called "scripts".
Now for this example you need to go to your bash_profile and enter the following lines:
export PATH=$PATH:/Users/<username>/scripts
alias ispark="bash /Users/<username>/scripts/ipython_spark.sh"
These paths will be specific to where you put ipython_spark.sh.
You might then need to update the permissions:
$ chmod 711 ipython_spark.sh
and source your bash_profile:
$ source ~/.bash_profile
I'm on a Mac, but this should all work on Linux as well, although you will most likely be updating .bashrc instead of bash_profile.
What I like about this method is that you can write multiple scripts with different configurations and open Spark accordingly. Depending on whether you are setting up a cluster, need to load different packages, or want to change the number of cores Spark has at its disposal, you can either update this script or make new ones. Note that PYSPARK_DRIVER_PYTHON= is the correct syntax for Spark > 1.2.
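As an illustration of the multiple-scripts idea, here is a sketch of a second script (call it jupyter_spark.sh, the name is just an example) that opens the same session in a Jupyter notebook instead. PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard PySpark variables; the remaining options are simply carried over from the script above:
#!/bin/bash
# launch pyspark with the Jupyter notebook as the driver front end
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages com.databricks:spark-csv_2.11:1.5.0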
I am using Spark 2.2
How can I get the value of the environment variable $HIVE_HOME in the cloudera-quickstart VM 5.7?
I tried to look at the existing environment variables with printenv, but it is not there.
HIVE_HOME is set when the hive shell is invoked. Here are three ways to find HIVE_HOME.
From the hive command line:
[cloudera@quickstart ~]$ hive -e '!env'|grep HIVE_HOME
HIVE_HOME=/usr/lib/hive
From the hive shell - this will print the same variables as above, but you can't use grep here, so you will have to find HIVE_HOME in the list of all variables:
hive> !env;
From the hive command file itself:
[cloudera@quickstart ~]$ cat /usr/bin/hive|grep HIVE_HOME
export HIVE_HOME=/usr/lib/hive
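If you also want $HIVE_HOME available in your own shell (so that printenv shows it), one option, just a sketch reusing the value found above, is to export it yourself:
$ echo 'export HIVE_HOME=/usr/lib/hive' >> ~/.bash_profile
$ source ~/.bash_profile
$ printenv | grep HIVE_HOME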
I'm trying to set up Hadoop 2.4.1 on my machine using Cygwin, and I'm stuck when I try to run
$ hdfs namenode -format
which gives me
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
I think it's due to an undefined environment variable, since I can run
$ hadoop version
without a problem. I've defined the following:
JAVA_HOME
HADOOP_HOME
HADOOP_INSTALL
as well as adding the Hadoop \bin and \sbin (and Cygwin's \bin) to the Path. Am I missing an environment variable that I need to define?
OK, it looks like the file hadoop\bin\hdfs also has to be changed, like the hadoop\bin\hadoop file described here.
The end of the file must be changed from:
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$#"
to
exec "$JAVA" -classpath "$(cygpath -pw "$CLASSPATH")" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$#"
I assume I'll have to make similar changes to the hadoop\bin\mapred and hadoop\bin\yarn when I get to using those files.
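For background, the reason the -classpath "$(cygpath -pw "$CLASSPATH")" part helps is that the Windows JVM does not understand Cygwin-style path lists; cygpath -pw rewrites a colon-separated POSIX path list as a semicolon-separated Windows one. A rough illustration (the paths here are only examples, not your actual classpath):
$ cygpath -pw "/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common"
C:\cygwin64\usr\local\hadoop\etc\hadoop;C:\cygwin64\usr\local\hadoop\share\hadoop\common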
This might be a really stupid question but I'm not able to install pig properly on my machine.
Pig's version is 0.9.0.
I have even set my JAVA_HOME to its designated path.
I've set the PATH to
export PATH=/usr/local/pig-0.9.0/bin:$PATH
since my pig dir is in /usr/local/.
Whenever I type pig or pig -help, I get the following message:
su: /usr/local/pig-0.9.0/bin/pig: Permission denied
Please help. Thank you.
Try typing:
chmod +x /usr/local/pig-0.9.0/bin/pig
chmod -R 777 /usr/local/pig-0.9.0/
Use this; it will definitely run.
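Afterwards you can check that the execute bit is set and that the launcher runs (assuming the same paths as above):
$ ls -l /usr/local/pig-0.9.0/bin/pig    # should now show the x permission
$ pig -help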