Difference between hive,pig,map-reduce use cases - hadoop

Difference between map-reduce ,hive ,pig
pig : its a data flow language, it can work on any data basically used to convert semi structure ,unstructured data to structure so that can be used in hive advance analytics using windowing function etc.
Hive : Work on structure data and provide sql type query language .
I know at back end both pig and hive uses map -reduces .
I know map-reduce can be good tool for programmer ,hive or pig for sql guy
I just want to know is there any specific use cases where we go for hive,pig and map-reduce
basically we decide that we have to use pig here hive here or we must use map -reduce .

Map-Reduce: Has better performance than pig or hive but requires more development time.
PIg: Less development time but poor performance when compared to map-reduce.
Hve: SQL type language with some good features like partitioning and bucketing to improve performance reads.Also, hive enforces schema on read.

Pig is used to format your unstructured/semi structure data format.Lets say you have a timestamp in your data which is not as per Hive timestamp format.You can convert same using pigUDF and format your data.This is just a example to explain.You can do many more things using Pig.
Hive is basically used for structured data .This maynot work well with unstructured data.This takes more time to execute as it converts into Mapreduce job.I suggest you to use impala which is much faster than hive.

Pig is a data flow language. This means that you can not use if statements or loops.
If you need to do a lot of repetition, it would be preferable to learn mapreduce.
You are able to get around this by embedding pig into a python script but this would take even longer since it would have to load all the jar files with every iteration of the loop.
Basically it boils down to how much time you spend prototyping vs. how much production work you have.
If you are a data scientist or an analyst, most of your work is new projects that require a lot of prototyping. This means that you care about getting results fast. Then you would prefer Pig or Hive.
If you are in a development team, you want to build robust code based on agreed upon methodology that does not need to be tested and then you would prefer mapreduce.
There are companies like Cloudera that provide a package of Pig, Hive, and other Hadoop tools so you wouldn't have to choose between the two.

Map Reduce is a inner component of hadoop, other Pig and hive are hadoop eco systems it means run on the top of hadoop. The purpose of both mapreduce, pig and hive purpose is process the vast amount of data in different manner.
Mapreduce: apache implemented it. highly recommendable to process entire data, it's time consume and required program skills like java (highly recommendable), pyghon, ruby and other programming languages. total data aggregate and sort by using mapper and reducer functions. Hadoop use it by default.
Hive: Facebook implemented it. most of the analysts especially bigdata analysts use this tool to analyze the data especially structure data. Backend this hive tool use mapreduce to be processed. Internally Hive use special language called HQL, It's subset of SQL language. Who is wellever in SQL, they can goes with Hive. It's highly recommended to the Datawarehouse oriented projects. Much difficult to process un structured especially schema-less data.
Pig:
Pig is a scripting language, implemented by Yahoo. The main difference between pig and Hive is pig can process any type of data, either structured or unstructured data. It means it's highly recommendable for streaming data like satellite generated data, live events, schema-less data etc. Pig first load the data later programmer write a program depends on data to make it structured. Who is expert in programming languages they will choose this Hadoop ecosystems.

Related

Hadoop data visualization

I am a new hadoop developer and I have been able to install and run hadoop services in a single-node cluster. The problem comes during data visualization. What purpose does MapReduce jar file play when I need to use a data visualization tool like Tableau. I have a structured data source in which I need to add a layer of logic so that the data could make sense during visualization. Do I need to write MapReduce programs if I am going to visualize with other tools? Please shed some light on how I could go about on this issue.
This probably depends on what distribution of Hadoop you are using and which tools are present. It also depends on the actual data preparation task.
If you don't want to actually write map-reduce or spark code yourself you could try SQL-like queries using Hive (which translates to map-reduce) or the even faster Impala. Using SQL you can create tabular data (hive tables) which can easily be consumed. Tableau has connectors for both of them that automatically translate your tableau configurations/requests to Hive/Impala. I would recommend connecting with Impala because of its speed.
If you need to do work that requires more programming or where SQL just isn't enough you could try Pig. Pig is a high level scripting language that compiles to map-reduce code. You can try all of the above in their respective editor in Hue or from CLI.
If you feel like all of the above still don't fit your use case I would suggest writing map-reduce or spark code. Spark does not need to be written in Java only and has the advantage of being generally faster.
Most tools can integrate with hive tables meaning you don't need to rewrite code. If a tool does not provide this you can make CSV extracts from the hive tables or you can keep the tables stored as CSV/TSV. You can then import these files in your visualization tool.
The existing answer already touches on this but is a bit broad, so I decided to focus on the key part:
Typical steps for data visualisation
Do the complex calculations using any hadoop tool that you like
Offer the output in a (hive) table
Pull the data into the memory of the visualisation tool (e.g. Tableau), for instance using JDBC
If the data is too big to be pulled into memory, you could pull it into a normal SQL database instead and work on that directly from your visualisation tool. (If you work directly on hive, you will go crazy as the simplest queries take 30+ seconds.)
In case it is not possible/desirable to connect your visualisation tool for some reason, the workaround would be to dump output files, for instance as CSV, and then load these into the visualisation tool.
Check out some end to end solutions for data visualization.
For example like Metatron Discovery, it uses druid as their OLAP engine. So you just link your hadoop with Druid and then you can manage and visualize your hadoop data accordingly. This is an open source so that you also can see the code inside it.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hoods converts the SQL statements to bunch of MapReduce jobs with stage builds in between. Also, Hive is a high latency system i.e. based on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results from your MapReduce job output in your website, its highly recommended you export the results back to a RDBMS using sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to RDBMS. Then another option you could think of is using a NoSQL system like HBase.
welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
Hbase - Allows interactive access of HDFS. Sits on top of HDFS and apply structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll most beneficiate from it if your data are already organized (for instance, in columns).
Lets take an example : your map-reduce job produced a csv file named wordcount.csv and containing two rows : word and count. This csv file is on HDFS.
Let's now suppose you want to know the occurence of the word "gloubiboulga". You can simply achieve this via the following code :
CREATE TABLE data
(
word STRING,
count INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word=="gloubiboulga";
Please note that while this language looks highly like SQL, you'll still have to learn a few things about it.

Hadoop vs. NoSQL-Databases

As I am new to Big Data and the related technologies my question is, as the title implies:
When would you use Hadoop and when would you use some kind of NoSQL-Databases to store and analyse massive amounts of data?
I know that Hadoop is a Framework and that Hadoop and NoSQL differs.
But you can save lots of data with Hadoop on HDFS and also with NoSQL-DBs like MongoDB, Neo4j...
So maybe the use of Hadoop or of a NoSQL-Database depends if you just want to analyse data or if you just want to store data?
Or is it just that HDFS can save lets say RAW data and a NoSQL-DB is more structured (more structured than raw data and less structured than a RDBMS)?
Hadoop in an entire framework of which one of the components can be NOSQL.
Hadoop generally refers to cluster of systems working together to analyze data. You can take data from NOSQL and parallel process them using Hadoop.
HBase is a NOSQL that is part of Hadoop ecosystem. You can use other different NOSQL too.
Your question is missleading you are comparing Hadoop, which is a framework, to a database ...
Hadoop is containing a lot of features (including NoSQL database named HBase) in order to provide you a big data environment. If you're having a massive quantity of data you will probably use Hadoop (for the MapReduce functionalities or the datawarehouse capabilities) but it's not sure, depending on what you're processing and how you want to process it. If you're just storing a lot of data and don't need other feature (batch data processing or data transformations ...) a simple NoSQL database is enough.

Hadoop Ecosystem - What technological tool combination to use in my scenrio? (Details Inside)

This might be an interesting question to some:
Given: 2-3 Terabyte of data stored in SQL Server(RDBMS), consider it similar to Amazons data, i.e., users -> what things they saw/clicked to see -> what they bought
Task: Make a recommendation engine (like Amazon), which displays to user, customer who bought this also bought this -> if you liked this, then you might like this -> (Also) kind of data mining to predict future buying habits as well(Data Mining). So on and so forth, basically a reco engine.
Issue: Because of the sheer volume of data (5-6 yrs worth of user habit data), I see Hadoop as the ultimate solution. Now the question is, what technological tools combinations to use?, i.e.,
HDFS: Underlying FIle system
HBASE/HIVE/PIG: ?
Mahout: For running some algorithms, which I assume uses Map-Reduce (genetic, cluster, data mining etc.)
- What am I missing? What about loading RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, I get a list of results(reco's), or there exists a way to query it directly and report it to the front-end I build in .NET??
I think the answer to this question, just might be a good discussion for many people like me in the future who want to kick start their hadoop experimentation.
For loading data from RDBMS, I'd recommend looking into BCP (to export from SQL to flat file) then Hadoop command line for loading into HDFS. Sqoop is good for ongoing data but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via it's Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless you cannot load the data in HDFS (in arbitrary format) query in HBase, unless you use the HBase file format (HFile)
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned it, you can use Thrift to perform your queries (get, scan) since this will extract specific user info and not a massive data set it is more suitable than using MR.

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have same functional meaning because they are used for doing same work. The only thing is implimentation which is different for both. So when to use and which technology? Is there any specification for both which shows clearly the difference between both in terms of applicability and performance?
Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in storage systems mentioned above.
From a purely engineering point of view, I find PIG both easier to write and maintain than SQL-like languages. It is procedural, so you apply a bunch of relations to your data one-by-one, and if something fails you can easily debug at intermediate steps, and even have a command called “illustrate” which uses an algorithm to sample some data matching your relation. I’d say for jobs with complex logic, this is definitely much more convenient than Hive, but for simple stuff the gain is probably minimal.
Regarding interfacing, I find that PIG offers a lot of flexibility compared to Hive. You don’t have a notion of table in PIG so you manipulate files directly, and you can define loader to load it into pretty much any format very easily with loader UDFs, without having to go through the table loading stage before you can do your transformations. They have a nice feature in the recent versions of PIG where you can use dynamic invokers, i.e. use pretty much any Java method directly in your PIG script, without having to write a UDF.
For performance/optimization, from what I’ve seen you can directly control in PIG the type of join and grouping algorithm you want to use (I believe 3 or 4 different algorithms for each). I’ve personally never used it, but as you’re writing demanding algorithms it could probably be useful to be able to decide what to do instead of relying on the optimizer as it’s the case in Hive. So I wouldn’t say it necessarily performs better than Hive, but in cases where the optimizer makes the wrong decision, you have the option to choose what algorithm to use and have more control on what happens.
One of the cool things I did lately was splits: you can split your execution flow and apply different relations to each split. So you can have a non-linear dataset, split it based on a field, and apply a different processing to each part, and maybe join the results together in the end, all this in the same script. I don’t think you can do this in Hive, you’d have to write different queries for each case, but I may be wrong.
One thing to note also is that you can increment counters in PIG. Currently you can only do this in PIG UDFs though. I don’t think you can use counters in Hive.
And there are some nice projects that allow you to interface PIG with Hive as well (like HCatalog), so you can basically read data from a hive table, or write data to a hive table (or both) by simply changing your loader in the script. Supports dynamic partitions as well.
Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. Our current implementation compiles Pig Latin programs into Map-Reduce jobs, and executes them using Hadoop cluster.
https://cwiki.apache.org/confluence/display/PIG/Index%3bjsessionid=F92DF7021837B3DD048BF9529A434FDA
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
https://cwiki.apache.org/Hive/
What is the exact difference between Pig and Hive? I found that both have same functional meaning because they are used for doing same work.
Have a look at Pig Vs Hive Comparison in a nut shell from dezyre article
Hive scores over PIG in Partitions, Server, Web interface & JDBC/ODBC support.
Some differences:
Hive is best for structured Data & PIG is best for semi structured data
Hive used for reporting & PIG for programming
Hive used as a declarative SQL & PIG used as procedural language
Hive supports partitions & PIG does not
Hive can start an optional thrift based server & PIG can't
Hive defines tables before hand (schema) + stores schema information in database and PIG don't have dedicated metadata of database
Hive does not support Avro but PIG does
Pig also supports additional COGROUP feature for performing outer joins but hive does not. But both Hive & PIG can join, order & sort dynamically
So when to use and which technology?
Above difference clarifies your query.
HIVE : Structured data, SQL like queries and used for reporting purpose
PIG: Semi-structured data, program a work-flow involving a sequence of activities for Map Reduce jobs.
Regarding performance of job, both HIVE and PIG are slow compared to traditional Map Reduce job. Reason : Finally Hive or PIG scripts have to be converted into a series of Map Reduce jobs.
Have a look at related SE question:
Pig vs Hive vs Native Map Reduce
The main difference is PIG is a data flow language and Hive is data warehouse.
As PIG can be used similar as a step by step procedural language.
But HIVE is used as a declarative language.
PIG can be used for getting online streaming unstructured data. But HIVE can only access structured data and it can also access data from RDBMS databases such as SQL, NOSQL by using JDBC and ODBC drivers.
PIG can convert data into Avro format but PIG can't.
PIG can't create partitions but HIVE can do it.
As HIVE is top of PIG that's why HIVE can only access the data once it is processed by PIG.
It depends when we have to use PIG and HIVE if you are working structured, relational data then we can use HIVE else we can use PIG.
By PIG we can communicate with ETL tools but it takes more time compared with hive. But it is easy in PIG rather HIVE because in HIVE we have to create table before processing the data.

Resources