I would like to execute Hive queries on Spark. Currently we are using MapReduce as the execution engine. Please let me know whether Spark is supported as an execution engine for Hive queries on a MapR cluster.
Earlier I executed Hive queries on the Spark engine with Cloudera, but I am not sure about MapR.
You can use Shark; check the link below.
https://www.mapr.com/products/product-overview/shark
You can run Hive queries from Spark SQL, which is much faster.
Please follow the Spark 2.0 documentation below:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
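For instance, a minimal Scala sketch of querying a Hive table through Spark SQL might look like the following; the table name default.sales is a placeholder, and it assumes hive-site.xml is on Spark's classpath so the Hive metastore is visible:

    import org.apache.spark.sql.SparkSession

    // Build a SparkSession with Hive support so spark.sql() can see the
    // tables registered in the Hive metastore (needs hive-site.xml on the
    // classpath, e.g. in $SPARK_HOME/conf).
    val spark = SparkSession.builder()
      .appName("HiveQueriesViaSparkSQL")
      .enableHiveSupport()
      .getOrCreate()

    // Any HiveQL-style query can be issued through spark.sql();
    // "default.sales" is a made-up example table.
    val totals = spark.sql(
      "SELECT product, SUM(amount) AS total FROM default.sales GROUP BY product")
    totals.show()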
In my project we are using Hadoop 2, Spark, and Scala. Scala is the programming language, and Spark is used here for analysis. We are using both Hive and HBase. I can access all the details in HDFS, such as the files, using Hive.
But my confusions are:
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
If we only use Hive, then what would be the problem?
Can anyone please let me know?
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
HBase is a NoSQL database which stores data as key-value pairs. Hive has integration with HBase (see HBase-Hive Integration).
Advantage: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the benefit of both worlds.
If we only use Hive, then what would be the problem?
If we use only Hive, there is no problem. But in a project there are many scenarios we have to consider:
Performance
Storage
Stability of the technology used
Compatibility (the Hive warehouse is easily accessible to most of the tools in Hadoop)
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
I can't say whether it's overhead or not. But HBase responds to requests in real time because it is a database, whereas Hive runs its jobs on the MapReduce/Spark/Tez engines.
What is the functionality of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs; it only runs batch processes on Hadoop. For more, check how Hive queries run on the MapReduce engine.
HBase:
It's a key-value store database which runs on top of HDFS (or S3 on AWS). It serves requests in real time.
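To make the real-time point concrete, here is a rough Scala sketch using the standard HBase client API; the table users, column family cf and qualifier name are made-up examples:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    // Connect using the cluster settings in hbase-site.xml.
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("users")) // placeholder table

    // A single-row lookup by key returns in milliseconds; no MapReduce/Spark/Tez
    // job is launched, which is the main contrast with a Hive query.
    val result = table.get(new Get(Bytes.toBytes("user#1001"))) // placeholder row key
    val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
    println(s"name = $name")

    table.close()
    connection.close()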
If we only use Hive, then what would be the problem?
As discussed, if the query needs to be processed in real time, then HBase is the choice over Hive.
Currently we are working with Hive, which by default uses MapReduce as the processing framework in our MapR cluster. Now we want to change from MapReduce to Spark for better performance. As per my understanding, we need to set hive.execution.engine=spark.
Now my question is: is Hive on Spark currently supported by MapR? If yes, what configuration changes do we need to make?
Your help is very much appreciated. Thanks.
No, MapR (5.2) doesn't support that. From their docs,
MapR does not support Hive on Spark. Therefore, you cannot use Spark as an execution engine for Hive. However, you can run Hive and Spark on the same cluster. You can also use Spark SQL and Drill to query Hive tables.
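If the goal is simply to process Hive tables with Spark on that cluster, a hedged Scala sketch of the Spark SQL route the docs mention could look like this; the table names default.daily_sales and default.sales_summary are placeholders, and it assumes Spark is configured to reach the Hive metastore:

    import org.apache.spark.sql.SparkSession

    // Spark runs as its own application here; Hive is only used as the
    // metastore/catalog, which is the combination MapR does support.
    val spark = SparkSession.builder()
      .appName("SparkSQLOverHiveTables")
      .enableHiveSupport()
      .getOrCreate()

    // Read an existing Hive table (placeholder name) ...
    val summary = spark.sql(
      "SELECT store, SUM(amount) AS total FROM default.daily_sales GROUP BY store")

    // ... and write the result back as a new Hive table, so Hive and Drill
    // users can still query it.
    summary.write.mode("overwrite").saveAsTable("default.sales_summary")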
Cheers.
I know and understand that your question is about using Spark as the data processing engine for Hive; as you can see in the other answers, this is not officially supported by MapR today.
However, if your goal is to make Hive faster and not use MapReduce, you can switch to Tez; for this, install MEP 3.0.
See: http://maprdocs.mapr.com/home/Hive/HiveandTez.html
The Hive installation guide says that Hive can be applied to an RDBMS. My question is: it sounds like Hive can exist without Hadoop, right? Is it an independent HQL engine that could work with any data source?
You can run Hive in local mode to use it without a Hadoop cluster, for debugging purposes. See the URL below:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provides a JDBC driver so that you can query Hive like any JDBC source; however, if you are planning to run Hive queries on a production system, you need Hadoop infrastructure to be available. Hive queries are eventually converted into MapReduce jobs, and HDFS is used as the data storage for Hive tables.
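As an illustration, a minimal Scala sketch of that JDBC route against HiveServer2 (the host name, credentials and table are placeholders; the hive-jdbc driver jar must be on the classpath):

    import java.sql.DriverManager

    // Register the HiveServer2 JDBC driver and open a connection;
    // "hive-server-host" is a placeholder, 10000 is the default HiveServer2 port.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val connection = DriverManager.getConnection(
      "jdbc:hive2://hive-server-host:10000/default", "user", "")

    val statement = connection.createStatement()
    // The query is turned by Hive into MapReduce (or Tez/Spark) jobs on the cluster.
    val resultSet = statement.executeQuery("SELECT COUNT(*) FROM sales") // placeholder table
    while (resultSet.next()) {
      println(resultSet.getLong(1))
    }

    resultSet.close()
    statement.close()
    connection.close()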
I'm new to Hadoop, but I have been trying to create a single-node cluster for a college project. Just to set the context for my question, my project's goal is to perform MapReduce jobs on the same data while using different Hadoop-based software, namely Hive and Pig.
So I wanted to know: once I have Hadoop running with Hive installed, how do their commands differ? Since Hive is set up, does the node belong to it exclusively?
Yes, regular MapReduce jobs can still be run against the Hadoop cluster after Hive has been installed. Hive itself just converts the HiveQL queries into MapReduce jobs that it submits.
I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is basically used for processing structured and semi-structured data in Hadoop. Using Hive, we can also analyze large datasets stored in HDFS or in the Amazon S3 filesystem. To query data, Hive provides a query language known as HiveQL, which is similar to SQL. Using Hive, one can easily run ad-hoc queries for data analysis. With Hive we don't need to write complex MapReduce jobs; we just submit SQL queries, and Hive converts them into MapReduce jobs.
Finally, the Hive SQL gets converted into MapReduce jobs, and just as we don't submit MapReduce jobs from every node in a Hadoop cluster, we don't need Hive to be installed on every node of the Hadoop cluster.