What is the difference between Apache Pig and Apache Hive? - hadoop

What is the exact difference between Pig and Hive? I found that both have same functional meaning because they are used for doing same work. The only thing is implimentation which is different for both. So when to use and which technology? Is there any specification for both which shows clearly the difference between both in terms of applicability and performance?

Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in storage systems mentioned above.

From a purely engineering point of view, I find PIG both easier to write and maintain than SQL-like languages. It is procedural, so you apply a bunch of relations to your data one-by-one, and if something fails you can easily debug at intermediate steps, and even have a command called “illustrate” which uses an algorithm to sample some data matching your relation. I’d say for jobs with complex logic, this is definitely much more convenient than Hive, but for simple stuff the gain is probably minimal.
Regarding interfacing, I find that PIG offers a lot of flexibility compared to Hive. You don’t have a notion of table in PIG so you manipulate files directly, and you can define loader to load it into pretty much any format very easily with loader UDFs, without having to go through the table loading stage before you can do your transformations. They have a nice feature in the recent versions of PIG where you can use dynamic invokers, i.e. use pretty much any Java method directly in your PIG script, without having to write a UDF.
For performance/optimization, from what I’ve seen you can directly control in PIG the type of join and grouping algorithm you want to use (I believe 3 or 4 different algorithms for each). I’ve personally never used it, but as you’re writing demanding algorithms it could probably be useful to be able to decide what to do instead of relying on the optimizer as it’s the case in Hive. So I wouldn’t say it necessarily performs better than Hive, but in cases where the optimizer makes the wrong decision, you have the option to choose what algorithm to use and have more control on what happens.
One of the cool things I did lately was splits: you can split your execution flow and apply different relations to each split. So you can have a non-linear dataset, split it based on a field, and apply a different processing to each part, and maybe join the results together in the end, all this in the same script. I don’t think you can do this in Hive, you’d have to write different queries for each case, but I may be wrong.
One thing to note also is that you can increment counters in PIG. Currently you can only do this in PIG UDFs though. I don’t think you can use counters in Hive.
And there are some nice projects that allow you to interface PIG with Hive as well (like HCatalog), so you can basically read data from a hive table, or write data to a hive table (or both) by simply changing your loader in the script. Supports dynamic partitions as well.

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. Our current implementation compiles Pig Latin programs into Map-Reduce jobs, and executes them using Hadoop cluster.
https://cwiki.apache.org/confluence/display/PIG/Index%3bjsessionid=F92DF7021837B3DD048BF9529A434FDA
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
https://cwiki.apache.org/Hive/

What is the exact difference between Pig and Hive? I found that both have same functional meaning because they are used for doing same work.
Have a look at Pig Vs Hive Comparison in a nut shell from dezyre article
Hive scores over PIG in Partitions, Server, Web interface & JDBC/ODBC support.
Some differences:
Hive is best for structured Data & PIG is best for semi structured data
Hive used for reporting & PIG for programming
Hive used as a declarative SQL & PIG used as procedural language
Hive supports partitions & PIG does not
Hive can start an optional thrift based server & PIG can't
Hive defines tables before hand (schema) + stores schema information in database and PIG don't have dedicated metadata of database
Hive does not support Avro but PIG does
Pig also supports additional COGROUP feature for performing outer joins but hive does not. But both Hive & PIG can join, order & sort dynamically
So when to use and which technology?
Above difference clarifies your query.
HIVE : Structured data, SQL like queries and used for reporting purpose
PIG: Semi-structured data, program a work-flow involving a sequence of activities for Map Reduce jobs.
Regarding performance of job, both HIVE and PIG are slow compared to traditional Map Reduce job. Reason : Finally Hive or PIG scripts have to be converted into a series of Map Reduce jobs.
Have a look at related SE question:
Pig vs Hive vs Native Map Reduce

The main difference is PIG is a data flow language and Hive is data warehouse.
As PIG can be used similar as a step by step procedural language.
But HIVE is used as a declarative language.
PIG can be used for getting online streaming unstructured data. But HIVE can only access structured data and it can also access data from RDBMS databases such as SQL, NOSQL by using JDBC and ODBC drivers.
PIG can convert data into Avro format but PIG can't.
PIG can't create partitions but HIVE can do it.
As HIVE is top of PIG that's why HIVE can only access the data once it is processed by PIG.
It depends when we have to use PIG and HIVE if you are working structured, relational data then we can use HIVE else we can use PIG.
By PIG we can communicate with ETL tools but it takes more time compared with hive. But it is easy in PIG rather HIVE because in HIVE we have to create table before processing the data.

Related

Why Hive when HDFS already provide data storage?

I have started learning Hadoop.I understood that HDFS provides distributed storage system and Mapreduce is for data processing.Now i ma reading Hadoop ecosystem.
From the definition of Hive, it is a data ware house built on hadoop for providing SQL like interface.
My question is when hadoop provides HDFS which is falut tolerant , distributed then why hive? Does hive replaces HDFS?.
Does hive provide only sql interface or storage also?
Hive does not replace HDFS. Hive provides sql type interface to data that is stored in HDFS. Its basically used for querying and analysis of data that is stored. Hive in a sense actually eliminates a lot of boiler plate code, that you would have to write if you were using mapreduce. for example just think of how you are going to create different types of joins(left, right, bucketed) or group by clause or any other sql clause in mapreduce and you will get your answer (you lines of code will easily scale to 100's ). Hive provides them out-of-the-box. You dont need to write those lengthy programs in mapreduce. Hive already does that for you.
One thing to note is, Hive itself uses Mapreduce behind the scenes. So any group by, count, join is converted to mapreduce jobs only. You can change this though to Tez/Spark.
for your second question, hive does not provide any storage, it just uses a database (derby as default, MySQL would be a good choice if you want to use a different db) as a metastore just to store the metadata related to the tables, partitions, views, buckets etc.. (metadata is like location of tables, type of data stored in tables, partitions info of the tables, created date, modified date etc..) you create with hive.
To answer your question in comment...
Hive can process structured (csv,txt etc) data & semi-structured(xml,json,parquet etc). It cannot process unstructured data like audio, video etc.
Note: Semi structured data can be handled in DDLs and also through spark to be put into Hive.
I encourage you to learn what is external and managed tables in hive too.
Happy learning.

Spark on Parquet vs Spark on Hive(Parquet format)

Our use case is a narrow table(15 fields) but large processing against the whole dataset(billions of rows). I am wondering what combination provides better performance:
env: CDH5.8 / spark 2.0
Spark on Hive tables(as format of parquet)
Spark on row files(parquet)
Without additional context of your specific product and usecase - I'd vote for SparkSql on Hive tables for two reasons:
sparksql is usually better than core spark since databricks wrote different optimizations in sparksql, which is higher abstaction and gives ability to optimize code(read about Project Tungsten). In some cases manually written spark core code will be better, but it demands from the programmer deep understanding of the internals. In addition sparksql sometimes is limited and doesn't permit you to control low-level mechanisms, but you can always fallback to work with core rdd.
hive and not files - I'm assuming hive with external metastore. Metastore saves definitions of partitions of your "tables"(in files it could be some directory). This is one of the most important parts for the good performance. I.e. when working with files spark will need to load this info(which could be time consuming - e.g. s3 list operation is very slow). So metastore permits spark to fetch this info in simple and fast way
There's only two options here. Spark on files, or Spark on Hive. SparkSQL works on both, and you should prefer to use the Dataset API, not RDD
If you can define the Dataset schema yourself, Spark reading the raw HDFS files will be faster because you're bypassing the extra hop to the Hive Metastore.
When I did a simple test myself years ago (with Spark 1.3), I noticed that extracting 100000 rows as a CSV file was orders of magnitude faster than a SparkSQL Hive query with the same LIMIT

Difference between hive,pig,map-reduce use cases

Difference between map-reduce ,hive ,pig
pig : its a data flow language, it can work on any data basically used to convert semi structure ,unstructured data to structure so that can be used in hive advance analytics using windowing function etc.
Hive : Work on structure data and provide sql type query language .
I know at back end both pig and hive uses map -reduces .
I know map-reduce can be good tool for programmer ,hive or pig for sql guy
I just want to know is there any specific use cases where we go for hive,pig and map-reduce
basically we decide that we have to use pig here hive here or we must use map -reduce .
Map-Reduce: Has better performance than pig or hive but requires more development time.
PIg: Less development time but poor performance when compared to map-reduce.
Hve: SQL type language with some good features like partitioning and bucketing to improve performance reads.Also, hive enforces schema on read.
Pig is used to format your unstructured/semi structure data format.Lets say you have a timestamp in your data which is not as per Hive timestamp format.You can convert same using pigUDF and format your data.This is just a example to explain.You can do many more things using Pig.
Hive is basically used for structured data .This maynot work well with unstructured data.This takes more time to execute as it converts into Mapreduce job.I suggest you to use impala which is much faster than hive.
Pig is a data flow language. This means that you can not use if statements or loops.
If you need to do a lot of repetition, it would be preferable to learn mapreduce.
You are able to get around this by embedding pig into a python script but this would take even longer since it would have to load all the jar files with every iteration of the loop.
Basically it boils down to how much time you spend prototyping vs. how much production work you have.
If you are a data scientist or an analyst, most of your work is new projects that require a lot of prototyping. This means that you care about getting results fast. Then you would prefer Pig or Hive.
If you are in a development team, you want to build robust code based on agreed upon methodology that does not need to be tested and then you would prefer mapreduce.
There are companies like Cloudera that provide a package of Pig, Hive, and other Hadoop tools so you wouldn't have to choose between the two.
Map Reduce is a inner component of hadoop, other Pig and hive are hadoop eco systems it means run on the top of hadoop. The purpose of both mapreduce, pig and hive purpose is process the vast amount of data in different manner.
Mapreduce: apache implemented it. highly recommendable to process entire data, it's time consume and required program skills like java (highly recommendable), pyghon, ruby and other programming languages. total data aggregate and sort by using mapper and reducer functions. Hadoop use it by default.
Hive: Facebook implemented it. most of the analysts especially bigdata analysts use this tool to analyze the data especially structure data. Backend this hive tool use mapreduce to be processed. Internally Hive use special language called HQL, It's subset of SQL language. Who is wellever in SQL, they can goes with Hive. It's highly recommended to the Datawarehouse oriented projects. Much difficult to process un structured especially schema-less data.
Pig:
Pig is a scripting language, implemented by Yahoo. The main difference between pig and Hive is pig can process any type of data, either structured or unstructured data. It means it's highly recommendable for streaming data like satellite generated data, live events, schema-less data etc. Pig first load the data later programmer write a program depends on data to make it structured. Who is expert in programming languages they will choose this Hadoop ecosystems.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hoods converts the SQL statements to bunch of MapReduce jobs with stage builds in between. Also, Hive is a high latency system i.e. based on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results from your MapReduce job output in your website, its highly recommended you export the results back to a RDBMS using sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to RDBMS. Then another option you could think of is using a NoSQL system like HBase.
welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
Hbase - Allows interactive access of HDFS. Sits on top of HDFS and apply structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll most beneficiate from it if your data are already organized (for instance, in columns).
Lets take an example : your map-reduce job produced a csv file named wordcount.csv and containing two rows : word and count. This csv file is on HDFS.
Let's now suppose you want to know the occurence of the word "gloubiboulga". You can simply achieve this via the following code :
CREATE TABLE data
(
word STRING,
count INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word=="gloubiboulga";
Please note that while this language looks highly like SQL, you'll still have to learn a few things about it.

Hbase in comparison with Hive

Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?
Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)
You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)
Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the page—for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Resources