Issue configuring Hive on Spark - Hadoop

I have downloaded spark-2.0.0-bin-hadoop2.7. Could anyone advise how to configure Hive on this and use it from the Scala console? Right now I am able to run RDDs on files using Scala (spark-shell console).

Follow the official Hive on Spark documentation:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
You can set the Spark engine in Hive by using the following command:
set hive.execution.engine=spark;
or by adding it to hive-site.xml (refer to kanishka's post).
Then prior to Hive 2.2.0, copy the spark-assembly jar to HIVE_HOME/lib.
Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn't have an assembly jar.
To run in YARN mode (either yarn-client or yarn-cluster), copy the following jars to HIVE_HOME/lib, as in the copy sketch after the list:
scala-library
spark-core
spark-network-common
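A minimal copy sketch, assuming the standard Spark 2.x layout where these jars sit under $SPARK_HOME/jars (exact versions will differ on your install):
# copy the three required jars from the Spark distribution into Hive's lib directory
cp $SPARK_HOME/jars/scala-library-*.jar $HIVE_HOME/lib/
cp $SPARK_HOME/jars/spark-core_*.jar $HIVE_HOME/lib/
cp $SPARK_HOME/jars/spark-network-common_*.jar $HIVE_HOME/lib/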
Set SPARK_HOME:
export SPARK_HOME=/path-to-spark
Start Spark Master and Workers:
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT
Configure Spark:
set spark.master=<Spark Master URL>;
set spark.executor.memory=512m;
set spark.yarn.executor.memoryOverhead=<10-20% of spark.executor.memory, in MB>;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
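As a quick sanity check (a sketch; src is just a placeholder for any existing Hive table), run a query and confirm the output reports Spark stages instead of MapReduce jobs:
hive -e "set hive.execution.engine=spark; select count(*) from src;"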

Put your hive-site.xml in the Spark conf directory.
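With hive-site.xml in $SPARK_HOME/conf, the pre-built spark-2.0.0-bin-hadoop2.7 shell (which includes Hive support) can then query Hive through its built-in session. A minimal sketch, run inside spark-shell; my_table is only a placeholder name:
// the `spark` SparkSession is created for you by spark-shell
spark.sql("SHOW TABLES").show()
spark.sql("SELECT count(*) FROM my_table").show()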

Hive supports multiple execution engines, such as Tez and Spark.
You can set the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
I am choosing Spark as the execution engine
</description>
</property>
Copy the spark-assembly jar to HIVE_HOME/lib.
Set SPARK_HOME.
Set the properties below:
set spark.master=<Spark Master URL>;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
The above steps should suffice, I think.

Related

Hive metastore database details missing in hive-site.xml

We are using CDH 5.4.6. I am able to find the Hive Metastore details in the Cloudera Manager UI,
but I am trying to find the same details in a configuration file.
I can only find the hive.metastore.uris parameter in /etc/hive/conf/hive-site.xml. The hive-site.xml file is supposed to have javax.jdo.option.ConnectionURL / ConnectionDriverName / ConnectionUserName / ConnectionPassword. Where can I find those details?
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxxxx.com:9083</value>
</property>
JDO details are only applicable to the Hive Metastore, so for security reasons they are not included in the client configuration version of hive-site.xml. The settings that you see in the Cloudera Manager UI are stored in Cloudera Manager's database. CM retrieves those values and adds them dynamically to a special server-side hive-site.xml which it generates before the HMS process is started. That file can be found in the configuration directory /var/run/cloudera-scm-agent/process/nnn-hive-HIVEMETASTORE/ on the node running the HMS role (with proper permissions; nnn here is an incremental process counter).
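A sketch of how you might locate that generated file on the HMS host (needs root; the exact process directory name changes across restarts):
sudo ls -d /var/run/cloudera-scm-agent/process/*-hive-HIVEMETASTORE
sudo grep -A1 ConnectionURL /var/run/cloudera-scm-agent/process/*-hive-HIVEMETASTORE/hive-site.xml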
By the way, CDH 5.4.6 has been EOL'ed for ages. Why aren't you upgrading?

Spark with custom Hadoop FileSystem

I already have a cluster with Yarn, configured to use a custom Hadoop FileSystem in core-site.xml:
<property>
<name>fs.custom.impl</name>
<value>package.of.custom.class.CustomFileSystem</value>
</property>
I want to run a Spark Job on this Yarn cluster, which reads an input RDD from this CustomFilesystem:
final JavaPairRDD<String, String> files =
sparkContext.wholeTextFiles("custom://path/to/directory");
Is there some way I can do this without re-configuring Spark? i.e. Can I point Spark to the existing core-site.xml, and what would be the best way to do that?
Set HADOOP_CONF_DIR to the directory that contains core-site.xml. (This is documented in Running Spark on YARN.)
You will still need to make sure package.of.custom.class.CustomFileSystem is on the classpath.
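A minimal sketch of a submission under those assumptions (custom-fs.jar, the main class, and the application jar are placeholders for your own artifacts):
# point Spark at the existing Hadoop client configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
# ship the jar containing package.of.custom.class.CustomFileSystem with the job
spark-submit --master yarn \
  --jars /path/to/custom-fs.jar \
  --class your.app.MainClass your-app.jar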

How to set HADOOP_CLASSPATH via oozie while running HBase job

I'm using CDH5. I'm hitting an HBase bug while running a MapReduce job through Oozie in a fully distributed environment. The job connects to HBase and adds records programmatically. Please refer to the links below to understand the bug I'm hitting. Note that I cannot modify the MapReduce job code. The job runs fine from the command line after setting the HADOOP_CLASSPATH environment variable, but there seems to be no way to set/override this environment variable from Oozie, so the job fails when run from Oozie. Has anybody experienced this and found a workaround?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdpch_relnotes-hdp-2.0.9.0-knownissues-hbase.html
https://issues.apache.org/jira/browse/HBASE-11118
You can set HADOOP_CLASSPATH on the system that runs the Oozie server, so sending it in every request is not required.
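For example (a sketch; the HBase lib path is an assumption for a CDH parcel install, and the env file location may differ on your setup):
# in the Oozie server's environment, e.g. /etc/oozie/conf/oozie-env.sh, then restart Oozie
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hbase/lib/*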
Otherwise, you can set it in XML. In oozie-site.xml, set:
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/user/oozie/etc/hadoop</value>
</property>
where /home/user/oozie/etc/hadoop is the absolute path to the directory containing the Hadoop configuration files.

How to Configure MR1 on CDH5.1 vm

I have installed the CDH 5.1 VM on my machine. CDH 5.1 is set to MR2 (YARN) by default. I want to change the configuration from MR2 to MR1. Please let me know the changes that need to be made.
Just follow the steps to set the MR1 configuration as given in the CDH 5.1.2 documentation:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_mr_cluster_deploy.html#topic_11_3
then use the hadoop command, not the yarn command, to run the jar.
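For example (a sketch with placeholder jar, class, and HDFS paths):
hadoop jar my-job.jar com.example.MyJob /user/cloudera/input /user/cloudera/output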

Best place for json Serde JAR in CDH Hadoop for use with Hive/Hue/MapReduce

I'm using Hive/Hue/MapReduce with a JSON SerDe. To get this working I have copied json_serde.jar to several lib directories on every cluster node:
/opt/cloudera/parcels/CDH/lib/hive/lib
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib
/opt/cloudera/parcels/CDH/lib/hadoop/lib
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib
...
On every CDH update of the cluster I have to do this again.
Is there a more elegant way, where the distribution of the SerDe across the cluster would be automatic and survive updates?
If you are using HiveServer2 (the default in Cloudera 5.0+), the following configuration will work across your entire cluster without having to copy the jar to each node.
Add this to your hive-site.xml config file, or, if you're using Cloudera Manager, to the "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" config box:
<property>
<name>hive.aux.jars.path</name>
<value>/user/hive/aux_jars/hive-serdes-1.0-snapshot.jar</value>
</property>
Then create the directory in your HDFS filesystem (/user/hive/aux_jars) and place the jar file in it. If you are running Hue you can do this part via the web UI: just click on File Browser at the top right.
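A sketch of that step from the command line (the jar name matches the hive.aux.jars.path value above):
hdfs dfs -mkdir -p /user/hive/aux_jars
hdfs dfs -put hive-serdes-1.0-snapshot.jar /user/hive/aux_jars/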
It depends on the version of Hue and whether you are using Beeswax or HiveServer2:
Beeswax: there is a workaround with HIVE_AUX_JARS_PATH: https://issues.cloudera.org/browse/HUE-1127
HiveServer2: it supports a hive.aux.jars.path property in hive-site.xml. HiveServer2 does not support a .hiverc, and Hue is looking at providing an equivalent at some point: https://issues.cloudera.org/browse/HUE-1066
