Can't seem to build Hive for Spark - amazon-ec2

I have been trying to run this code in pyspark.
sqlContext = HiveContext(sc)
datumDF = sqlContext.createDataFrame(datumX, schema)
But I have been receiving this error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o44))
I log in to AWS and spin up clusters with this command:
/User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2 -k name -i /User/Desktop/pemfile.pem login clustername
However, all the docs I've found involve these commands, which exist in the folder /users/downloads/spark-1.5.2/. I've run them anyway, and tried logging into AWS using the ec2 command in that folder afterwards. Still, I just got the same error.
I ran export SPARK_HIVE=true before running these commands on my local machine, but I've seen messages saying it's deprecated and will be ignored anyway.
Build Hive with Maven:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Build Hive with sbt:
build/sbt -Pyarn -Phadoop-2.3 assembly
And another one I found:
./sbt/sbt -Phive assembly
I also took the hive-site.xml file and put it in both the /Users/Downloads/spark-1.5.2-bin-hadoop2.6/conf folder and /Users/Downloads/spark-1.5.2/conf.
Still no luck.
I can't seem to run the Hive commands no matter what I build it with or how I log in. Is there anything obvious I'm missing?

I had the same error when using a HiveContext on an EC2 cluster built with the ec2 scripts that come with the Spark package (v1.5.2 in my case). Through much trial and error, I found that building an EC2 cluster with the following options got the right version of Hadoop with Hive properly built, so that I can use a HiveContext in my PySpark jobs:
spark-ec2 -k <your key pair name> -i /path/to/identity-file.pem -r us-west-2 -s 2 --instance-type m3.medium --spark-version 1.5.2 --hadoop-major-version yarn launch <your cluster name>
The key parameters are that you set --spark-version to 1.5.2 and --hadoop-major-version to yarn. Even though you aren't going to use Yarn to submit jobs, it forces the Hadoop build to be 2.4. Of course, adjust the other parameters as appropriate for your desired cluster.
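Once the cluster is up, a quick sanity check is to log back in and re-run the HiveContext lines from the question in the PySpark shell. A minimal sketch, assuming the standard spark-ec2 layout on the master:
spark-ec2 -k <your key pair name> -i /path/to/identity-file.pem -r us-west-2 login <your cluster name>
# on the master node; sqlContext = HiveContext(sc) should now initialize
# without the "You must build Spark with Hive" error
spark/bin/pyspark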

Related

Is there a way to load the install-interpreter.sh file in EMR in order to load 3rd party interpreters?

I have an Apache Zeppelin notebook running and I'm trying to load the jdbc and/or postgres interpreter to my notebook in order to write to a postgres DB from Zeppelin.
The main resource on loading new interpreters here tells me to run the command below to install other interpreters:
./bin/install-interpreter.sh --all
However, when I run this command in EMR terminal, I find that the EMR cluster does not come with an install-interpreter.sh executable file.
What is the recommended path?
1. Should I find the install-interpreter.sh file and load that to the EMR cluster under ./bin/?
2. Is there an EMR configuration on start time that would enable the install-interpreter.sh file?
Currently all tutorials and documentation assume that you can run the install-interpreter.sh file.
The solution is not to run the command below from the root/home directory (note the relative ./ path):
./bin/install-interpreter.sh --all
Instead, on EMR, run the command above from the Zeppelin installation directory, which on an EMR cluster is /usr/lib/zeppelin.
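A minimal sketch of the fix, assuming the default EMR Zeppelin layout and that the jdbc interpreter covers your Postgres needs (both are assumptions worth verifying on your EMR release):
cd /usr/lib/zeppelin
sudo ./bin/install-interpreter.sh --name jdbc
# restart Zeppelin so the new interpreter is picked up; the exact restart
# command varies by EMR release (e.g. sudo stop zeppelin && sudo start zeppelin
# on upstart-based images)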

CDH 5.3.2 - Need to restart impala daemon from shell/script

I am using a CDH 5.3.2 cluster and have a requirement to be able to start/stop Impala daemons from a script. The command mentioned in the Cloudera docs
sudo service impala-server start
works fine on my CDH 5.10 local VM, but on the CDH 5.3.2 cluster I get the error "impala-server: unrecognized service". On checking /etc/init.d I see that no such service is listed either (while it's listed in the 5.10 version).
Then I tried to restart the service directly from the Impala bin directory:
cd /usr/bin
./impalad stop
However, I am now running into the error below:
E0918 11:55:27.815739 12046 JniFrontend.java:622] FileSystem is file:///
W0918 11:55:27.817589 12046 JniFrontend.java:534] Cannot detect CDH version. Skipping Hadoop configuration checks
E0918 11:55:27.817620 12046 impala-server.cc:210] Unsupported file system. Impala only supports DistributedFileSystem but the configured filesystem is: LocalFileSystem.fs.defaultFS(file:///) might be set incorrectly
E0918 11:55:27.817631 12046 impala-server.cc:212] Aborting Impala Server startup due to improper configuration
I checked core-site.xml in Cloudera Manager and fs.defaultFS is set correctly, so I'm not sure where it's picking the value up from. Any pointers on how to go further on this?
The init.d service packages that start Impala from the command line are meant for CDH users who do NOT want to use Cloudera Manager. The right way to start and stop Impala on a Cloudera Manager cluster is to use the CM API:
https://cloudera.github.io/cm_api/apidocs/v17/index.html
start cluster service API
stop cluster service API
commands API
The tutorial shows how to use the CM APIs, but for your situation you probably need to do:
$ curl -X POST -u USER:PASSWORD \
'CM_URL/api/v1/clusters/CLUSTERNAME/services/IMPALA_SERVICE_NAME/commands/stop'
replacing USER, PASSWORD, CM_URL, CLUSTERNAME, and IMPALA_SERVICE_NAME with the appropriate values. The curl command will return a command ID.
Then poll this API with the command ID to see that the start/stop operation completed.
$ curl -u USER:PASSWORD 'CM_URL/api/v1/commands/COMMAND_ID'
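For example, with placeholder values (the host, port, cluster, and service names below are assumptions; Cloudera Manager listens on port 7180 by default):
$ curl -X POST -u admin:admin \
'http://cm-host.example.com:7180/api/v1/clusters/Cluster1/services/impala/commands/stop'
# the response JSON contains a command "id"; poll it until "active" is false:
$ curl -u admin:admin 'http://cm-host.example.com:7180/api/v1/commands/1234'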
However, if you still want to use the init.d service packages then you'll need to install the impala-server package.
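If you do go the init.d route instead, the install would look roughly like this on a RHEL/CentOS node (assuming the standard CDH package repos are configured):
$ sudo yum install impala-server
$ sudo service impala-server start
Note that daemons started this way are invisible to Cloudera Manager, so mixing the two management styles on one cluster is generally discouraged.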

Changing JDK on cluster deployed with ./spark-ec2

I have deployed an Amazon EC2 cluster with Spark like so:
~/spark-ec2 -k spark -i ~/.ssh/spark.pem -s 2 --region=eu-west-1 --spark-version=1.3.1 launch spark-cluster
I copy a file I need first to the master, and then from the master to HDFS, using:
ephemeral-hdfs/bin/hadoop fs -put ~/ANTICOR_2_10000.txt ~/user/root/ANTICOR_2_10000.txt
I have a jar I want to run which was compiled with JDK 8 (I am using a lot of Java 8 features), so I copy it over with scp and run it with:
spark/bin/spark-submit --master spark://public_dns_with_port --class package.name.to.Main job.jar -f hdfs://public_dns:~/ANTICOR_2_10000.txt
The problem is that spark-ec2 loads the cluster with JDK 7, so I am getting the Unsupported major.minor version 52.0 error.
My question is, which are all the places where I need to change JDK7 to JDK8?
The steps I am doing thus far on master are:
Install JDK8 with yum
Use sudo alternatives --config java and change the preferred java to java-8
export JAVA_HOME=/usr/lib/jvm/openjdk-8
Do I have to do that for all the nodes? Also, do I need to change the java path that Hadoop uses in ephemeral-hdfs/conf/hadoop-env.sh, or are there any other spots I missed?
Unfortunately, Amazon doesn't offer out-of-the-box Java 8 installations yet: see available versions.
Have you seen this post on how to install it on running instances?
Here is what I have been doing for all Java installations that differ from the versions provided by the default installation:
Configure the JAVA_HOME environment variable on each machine/node:
export JAVA_HOME=/home/ec2-user/softwares/jdk1.7.0_25
Modify the default PATH and place the "java/bin" directory before the rest of the PATH on all nodes/machines:
export PATH=/home/ec2-user/softwares/jdk1.7.0_25/bin/:$M2:$SCALA_HOME/bin/:$HIVE_HOME/bin/:$PATH
And the above needs to be done with the same OS user that is used to execute/own the Spark master/worker processes.
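On the "all the nodes?" question: yes, the executors run on the workers, so every node needs the same JDK and JAVA_HOME. A hedged sketch for a spark-ec2 cluster, assuming the worker list lives in /root/spark/conf/slaves and the copy-dir helper is at /root/spark-ec2/copy-dir (both paths are assumptions about the spark-ec2 layout):
# install the JDK on every worker
for host in $(cat /root/spark/conf/slaves); do
  ssh -o StrictHostKeyChecking=no $host 'sudo yum install -y java-1.8.0-openjdk-devel'
done
# after pointing JAVA_HOME at the new JDK in spark/conf/spark-env.sh and
# ephemeral-hdfs/conf/hadoop-env.sh on the master, sync the conf dirs out:
/root/spark-ec2/copy-dir /root/spark/conf
/root/spark-ec2/copy-dir /root/ephemeral-hdfs/conf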

Spark Submit Issue

I am trying to run a fat jar on a Spark cluster using Spark submit.
I made the cluster using the "spark-ec2" executable in the Spark bundle on AWS.
The command I am using to run the jar file is
bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar
In the beginning it was giving me the error that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set.
I didn't know what to set them to, so I used the following command
export HADOOP_CONF_DIR=/mapreduce/conf
Now the error has changed to
Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output
The home directory structure is as follows
ephemeral-hdfs hadoop-native mapreduce persistent-hdfs scala spark spark-ec2 src1 tachyon
I even set the YARN_CONF_DIR variable to the same value as HADOOP_CONF_DIR, but the error message didn't change. I am unable to find any documentation that highlights this issue; most of it just mentions these two variables without further details.
You need to compile Spark against YARN to use it.
Follow the steps explained here: https://spark.apache.org/docs/latest/building-spark.html
Maven:
build/mvn -Pyarn -Phadoop-2.x -Dhadoop.version=2.x.x -DskipTests clean package
SBT:
build/sbt -Pyarn -Phadoop-2.x assembly
You can also download a pre-compiled version here: http://spark.apache.org/downloads.html (choose a package "pre-built for Hadoop").
Download a prebuilt Spark that supports Hadoop 2.x versions from https://spark.apache.org/downloads.html
The --master argument should be --master spark://hostname:7077, where hostname is the name of your Spark master server. You can also specify this value as spark.master in the spark-defaults.conf file and leave out the --master argument when using spark-submit from the command line. Including the --master argument will override the value set (if present) in the spark-defaults.conf file.
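For example, a sketch (the hostname is a placeholder for your actual master's DNS name):
# in conf/spark-defaults.conf
spark.master spark://hostname:7077
# then spark-submit no longer needs --master:
bin/spark-submit --class edu.gatech.cse8803.main.Main ../src1/big-data-hw2-assembly-1.0.jar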
Reference: http://spark.apache.org/docs/1.3.0/configuration.html

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a Hadoop cluster with CDH 4.2.0 installed and ran into this error. It's been fixed in later versions of Hadoop, but I don't have access to update the cluster. Is there a way to tell Hadoop to use this jar when running my job, via a command line argument like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket? Or must Hadoop be completely recompiled with this new jar? I'm new to Hadoop, so I don't know all the possible command line options.
I'm not sure how that would work, as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says this issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce jobs.
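If switching to MR1 isn't an option, a common workaround is to force the patched jar to the front of the classpath instead of passing it with -D. A sketch, assuming your job's main class uses ToolRunner/GenericOptionsParser (and note that mapreduce.job.user.classpath.first may not exist in every CDH 4.2 build):
# client side: make the hadoop launcher prefer your jar
export HADOOP_CLASSPATH=/path/to/hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar:$HADOOP_CLASSPATH
export HADOOP_USER_CLASSPATH_FIRST=true
# task side: ship the jar with the job and ask MR2 to load user classes first
hadoop jar MyJob.jar -libjars /path/to/hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar \
  -Dmapreduce.job.user.classpath.first=true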
