Hive on Spark in MapR Distribution - hadoop

Currently we are working with Hive, which by default uses MapReduce as the processing framework in our MapR cluster. Now we want to switch from MapReduce to Spark for better performance. As per my understanding, we need to set hive.execution.engine=spark.
Now my question is: is Hive on Spark currently supported by MapR? If yes, what configuration changes do we need to make?
Your help is very much appreciated. Thanks.

No, MapR (5.2) doesn't support that. From their docs:
MapR does not support Hive on Spark. Therefore, you cannot use Spark as an execution engine for Hive. However, you can run Hive and Spark on the same cluster. You can also use Spark SQL and Drill to query Hive tables.
Cheers.

I know and understand that your question is about using Spark as the data processing engine for Hive; and as you can see in the various answers, it is not officially supported by MapR today.
However, if your goal is to make Hive faster without using MapReduce, you can switch to Tez; for this, install MEP 3.0.
See: http://maprdocs.mapr.com/home/Hive/HiveandTez.html
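Once Tez is installed, the engine switch itself is just a Hive setting. Here is a minimal sketch of doing it per session over the Hive JDBC driver from Scala; the HiveServer2 host, credentials, and the table name sample_table are hypothetical placeholders, not MapR-specific values.

import java.sql.DriverManager

object HiveOnTezExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Hypothetical HiveServer2 endpoint; replace with your cluster's host and port.
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2.example.com:10000/default", "hiveuser", "")
    val stmt = conn.createStatement()
    try {
      // Switch the execution engine for this session only; the cluster-wide
      // default lives in hive-site.xml under hive.execution.engine.
      stmt.execute("SET hive.execution.engine=tez")
      // Hypothetical table, just to show a query running on the chosen engine.
      val rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table")
      while (rs.next()) println(s"row count: ${rs.getLong(1)}")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}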

Related

Purpose of using HBase in Hadoop instead of Hive [duplicate]

In my project we are using Hadoop 2, Spark, and Scala. Scala is the programming language, and Spark is used here for analysis. We are using both Hive and HBase. I can access all details, like files etc., of HDFS using Hive.
But my confusions are:
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What are the functionalities of Hive and HBase?
If we only used Hive, then what would be the problem?
Can anyone please let me know?
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What are the functionalities of Hive and HBase?
HBase is a NoSQL database which stores data as key-value pairs. Hive has integration with HBase: Hive HBase Integration
Advantage: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store cannot be used for similar purposes. Hive over HBase gives you the benefit of both worlds.
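To make the key-value point concrete, here is a minimal sketch using the HBase client API from Scala; the table users and column family cf are hypothetical and assumed to already exist on the cluster.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseKeyValueExample {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()  // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("users"))  // hypothetical table
    try {
      // Write one cell: row key -> (column family "cf", qualifier "name") -> value.
      val put = new Put(Bytes.toBytes("user-42"))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
      table.put(put)

      // Read it back by key: a low-latency point lookup, no MapReduce/Spark/Tez job involved.
      val result = table.get(new Get(Bytes.toBytes("user-42")))
      val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      println(s"name = $name")
    } finally {
      table.close()
      connection.close()
    }
  }
}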
If we only used Hive, then what would be the problem?
If we use only Hive, there is no problem. But in a project there are many scenarios we have to consider:
Performance
Storage
Stability of the technology used
Compatibility (the Hive warehouse is easily accessible to most of the tools in the Hadoop ecosystem)
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
I can't say whether it's an overhead or not. But HBase responds to requests in real time because it is a database, whereas Hive runs jobs on the MapReduce/Spark/Tez engines.
What are the functionalities of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs. It only runs batch processes on Hadoop. For more, check this: how Hive queries run on the MapReduce engine
HBase:
It's a key/value store database which runs on top of HDFS (or S3 on AWS). It serves requests in real time.
If we only used Hive, then what would be the problem?
As discussed, if the query needs to be processed in real time, then HBase is the choice over Hive.

Is Hive on Spark supported by a MapR cluster?

I would like to execute Hive queries on Spark. Currently we are using MapReduce as the execution engine. Please let me know: does Spark support executing Hive queries on a MapR cluster?
Earlier I executed Hive queries on the Spark engine with Cloudera, but I'm not sure about MapR.
You can use Shark; check the link below.
https://www.mapr.com/products/product-overview/shark
You can call Hive queries from Spark SQL, which is much faster.
Please follow the Spark 2.0 documentation below:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
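For instance, a minimal Spark 2.x sketch of running a Hive query through Spark SQL in Scala could look like the following; the table name sales is a hypothetical placeholder, and enableHiveSupport() assumes the Hive metastore is reachable (e.g. hive-site.xml on the classpath).

import org.apache.spark.sql.SparkSession

object HiveFromSparkSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveFromSparkSql")
      .enableHiveSupport()  // lets Spark SQL read tables registered in the Hive metastore
      .getOrCreate()

    // Hypothetical Hive table; the query is executed by Spark, not by Hive's own engine.
    val df = spark.sql("SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")
    df.show()

    spark.stop()
  }
}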

Pig + Cassandra + Hadoop

I have a Hadoop (2.7.2) setup over a Cassandra (3.7) cluster. I have no problem with using Hadoop MapReduce. Similarly, I have no problem creating tables and keyspaces in cqlsh. However, I have been trying to install Pig over Hadoop so as to access the tables in Cassandra. (The installation of Pig itself is fine.) That is where I'm having trouble.
I have come across numerous websites; most are either for outdated versions of Cassandra or just plain vague.
The one thing I gleaned from this website is that we can load/access the Cassandra tables in Pig using CqlStorage / CqlNativeStorage. However, in the latest version, it seems this support has been removed (since 2015).
Now my question is: are there any workarounds?
I would be running MapReduce jobs over Cassandra tables, and using Pig for querying, mostly.
Thanks in advance.
All Pig support was deprecated in Cassandra 2.2 and removed in 3.0: https://issues.apache.org/jira/browse/CASSANDRA-10542
So I think you are a bit out of luck here. You may be able to use the old classes with a modern C* version, but Pig is very niche right now. Spark SQL is definitely the current favorite child (I may be biased since I work on the Spark + Cassandra Connector) and allows for very flexible querying of C* data.
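As an illustration, reading a Cassandra table into a DataFrame with the Spark Cassandra Connector can look roughly like this in Scala; the contact point, keyspace/table names, and the connector version in the comment are assumptions, not values from your setup.

import org.apache.spark.sql.SparkSession

object CassandraWithSparkSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CassandraWithSparkSql")
      .config("spark.cassandra.connection.host", "127.0.0.1")  // Cassandra contact point (assumption)
      .getOrCreate()

    // Requires the connector on the classpath, e.g.
    // --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.x
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // hypothetical names
      .load()

    df.filter(df("user_id") === "42").show()  // predicates are pushed down to Cassandra where possible

    spark.stop()
  }
}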

Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

I am new to all these terms and have spent some time trying to understand them. But I have some confusions. Please correct me if I am wrong.
Nutch: It's for web crawling; using it we can crawl web pages. We can store these web pages somewhere in a DB.
Solr: Solr can be used for indexing web pages crawled by Apache Nutch. It helps in searching the indexed web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data in real time from HDFS. It provides a simple SQL-type interface for interacting.
Hadoop: It provides two functionalities: one is HDFS (Hadoop Distributed File System) and the other is the MapReduce functionality taken from Google's algorithms. It's basically used for offline data backup, etc.
Gora and ZooKeeper: I am not sure about these.
Confusions:
1). Is HBase a key-value pair DB or just an interface to Hadoop? Or should I ask: can HBase exist without Hadoop?
If yes, can you explain a bit more about its usage.
2). Is there any use in crawling data with Apache Nutch without indexing it into Solr?
3). For running Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?
4). Is Hadoop part of HBase?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop, you can't really have HBase without Hadoop.
Yes, you can run Nutch without Solr; there do not seem to be many use cases for this, however, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be a lot of real-world examples of people doing this.
Yes, Hadoop is part of HBase, in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
ZooKeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora provides an in-memory data model and persistence for big data, and is built on top of Hadoop.

Is it possible to use Impala in Hadoop 1 (without YARN)?

I saw in the Hadoop 1 limitations that the only paradigm we can use is MapReduce. If you want to use other paradigms (like Spark, for instance), you have to use Hadoop 2.0 and YARN.
But I have a question related to Impala: is it possible to use Impala without YARN or not?
Thanks.
Yes, Impala can be used independently of YARN.

Resources