We have multiple Hadoop clusters using Hive and Pig. What is the best way to consolidate them into one? In BI this was done by building an EDW or taking an MDM approach. How about Hadoop? Is anyone thinking about this?
Pig and Hive are client-side libraries, so there is nothing to migrate for them. The Pig and Hive clients only have to point to the appropriate Hadoop cluster.
Regarding the data, you could use DistCp to move the data across clusters.
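As a rough illustration of the DistCp step, here is a minimal Python sketch that only assembles a `hadoop distcp` command line. The NameNode hosts, port and paths are hypothetical placeholders, and the helper name `build_distcp_command` is made up for this example. `-update` is a standard DistCp option, but check how it treats the source directory before relying on it.

```python
import shlex


def build_distcp_command(src_uri, dst_uri, update=True):
    """Return the argument list for a `hadoop distcp` copy between clusters."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update skips files that already exist unchanged at the destination.
        cmd.append("-update")
    cmd += [src_uri, dst_uri]
    return cmd


# Hypothetical NameNode addresses and paths; replace with your own.
cmd = build_distcp_command(
    "hdfs://old-cluster-nn:8020/user/hive/warehouse",
    "hdfs://new-cluster-nn:8020/user/hive/warehouse",
)
print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, the list could be passed to `subprocess.run(cmd, check=True)` to actually start the copy.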
I was doing a case study on Spotify and found that it uses Cassandra as a database and also uses Hadoop. My question is: how is Hadoop different from a database? What type of files does a Hadoop DataNode store? Why does every corporation have both a database and Hadoop? I know Hadoop is not a database, but what is it used for if there is already a database cluster to store data?
Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as a distributed filesystem (HDFS), the MapReduce framework libraries, and the YARN resource manager.
Other tools such as Hive, Spark, Pig, Giraph, Sqoop, etc. can use Hadoop or its components. For example, Hive acts as a database (more precisely, a SQL data warehouse layer). It uses HDFS for storing its data and MapReduce framework primitives for building a query execution graph.
There are a whole lot of Hadoop ecosystem pictures on the internet, and I struggle to understand how the tools work together.
E.g. in the attached picture, why are Pig and Hive based on MapReduce, whereas other tools like Spark or Storm sit on YARN?
Would you be so kind and explain this?
Thanks!
BR
[Image: Hadoop ecosystem]
The picture shows Pig and Hive on top of MapReduce. This is because MapReduce is a distributed computing engine that is used by Pig and Hive. Pig and Hive queries get executed as MapReduce jobs. It is easier to work with Pig and Hive, since they give a higher-level abstraction to work with MapReduce.
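To make "queries get executed as MapReduce jobs" a bit more concrete, here is a toy, single-process Python sketch of the map, shuffle and reduce steps that a query like `SELECT country, COUNT(*) FROM users GROUP BY country` roughly corresponds to. The `users` table, the sample records and the function names are assumptions for illustration only; real Hive or Pig plans are more involved and run distributed across the cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Mapper: emit a (country, 1) pair for every input record."""
    for _user, country in records:
        yield country, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data standing in for a `users` table.
records = [("alice", "SE"), ("bob", "US"), ("carol", "SE"), ("dave", "DE")]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'SE': 2, 'US': 1, 'DE': 1}
```

With Hive or Pig you would write only the one-line query or script, and the tool generates the equivalent job for you.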
Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN. MapReduce is also considered an application that can run on YARN, as shown in the diagram. YARN handles the resource management piece so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
Finally, at the bottom of the picture is HDFS. This is the distributed storage layer that allows applications to store and access data. It provides features such as distributed storage, replication and fault tolerance.
If you are interested in deeper-dives, check out the Apache Projects page.
In my project, we are using Hadoop 2, Spark and Scala. Scala is the programming language and Spark is used for analysis. We are using both Hive and HBase. I can access everything in HDFS, such as files, using Hive.
But my confusions are:
When I am able to perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
If we only used Hive, what would be the problem?
Can anyone please let me know?
When I am able to perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
HBase is a NoSQL database that stores data as key-value pairs. Hive has an integration with HBase (HBase Hive Integration).
Advantage: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the best of both worlds.
If we only used Hive, what would be the problem?
If we use only Hive, there is no problem as such. But in a project there are many factors we have to consider:
Performance
Storage
Stability of used technology
Compatibility (the Hive warehouse is easily accessible to most tools in Hadoop)
When I am able to perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
I can't say whether it is overhead or not. But HBase, being a database, responds to requests in real time, whereas Hive runs jobs on the MapReduce/Spark/Tez engines.
What is the functionality of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs. It only runs batch processes on Hadoop. For more, check how Hive queries run on the MapReduce engine.
HBase:
It's a key/value store database that runs on top of HDFS (or S3 on AWS). It serves requests in real time.
If we only used Hive, what would be the problem?
As discussed, if a query needs to be processed in real time, then HBase is the choice over Hive.
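The difference in access pattern can be sketched with a toy, in-memory Python model. This is not the HBase or Hive API: the `user_id` row key, the `plays` column and both helper functions are hypothetical. It only contrasts a keyed lookup (HBase-style) with a full scan that aggregates every row (Hive-style batch processing).

```python
# Toy dataset: 1000 rows, each with a row key and one numeric column.
rows = [{"user_id": f"u{i}", "plays": i % 7} for i in range(1000)]

# HBase-style access: rows are addressed by row key, so one row can be
# fetched directly without touching the rest of the table.
table_by_key = {row["user_id"]: row for row in rows}


def get_by_row_key(key):
    """Point lookup by row key (the kind of request a live site makes)."""
    return table_by_key.get(key)


def total_plays_batch(all_rows):
    """Hive-style batch aggregate: reads every row to compute one answer."""
    return sum(row["plays"] for row in all_rows)


print(get_by_row_key("u10"))
print(total_plays_batch(rows))
```

The keyed lookup is what makes a store usable behind a live website, while the full scan is the kind of work that suits a batch job.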
I'm new to Hadoop, but I have been trying to create a single-node cluster for a college project. To give some context, my project goal is to run MapReduce jobs on the same data using different Hadoop-based software, namely Hive and Pig.
So I wanted to know: once I have Hadoop running with Hive installed, how can I tell Hive's commands apart from Hadoop's? Once Hive is set up, does the node belong to Hive?
Yes, regular MapReduce jobs can still be run against the Hadoop cluster after Hive has been installed. Hive itself just converts the HiveQL queries into MapReduce jobs that it submits.
I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is mainly used for processing structured and semi-structured data in Hadoop. With Hive we can also analyze large datasets stored in HDFS or in the Amazon S3 filesystem. To query data, Hive provides a query language known as HiveQL, which is similar to SQL. Using Hive one can easily run ad hoc queries for data analysis. We don't need to write complex MapReduce jobs; we just submit SQL queries, and Hive converts them into MapReduce jobs.
In the end, Hive SQL gets converted to MapReduce jobs. Just as we don't have to submit a MapReduce job from every node in a Hadoop cluster, we don't need Hive installed on every node of the cluster.