What can Hadoop MapReduce achieve? - hadoop

I am reading the Hadoop MapReduce tutorials and have come away with the following shallow understanding. Could anyone help confirm whether my understanding is correct?
MapReduce is a way to aggregate data
in a distributed environment
with unstructured data in very large files
using Java, Python, etc.
to produce results similar to what could be done in an RDBMS using SQL aggregate functions:
select k2, count(*), sum(v1), max(v1), min(v1), avg(v1)
from input_file
group by k2
The map() method essentially pivots horizontal data (v1, a line from
the input file) into vertical rows, each row holding a string key
and a numeric value.
The grouping happens in the shuffle and partition stage of the
data flow.
The reduce() method is responsible for computing/aggregating the data.
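The map → shuffle → reduce flow described above can be sketched in a single Python process (this is illustrative only, not actual Hadoop code; the two-column CSV input and the field names k2/v1 are assumptions matching the SQL example):

```python
from collections import defaultdict

def map_phase(lines):
    # map(): turn each input line into a (string key, numeric value) pair.
    for line in lines:
        k2, v1 = line.split(",")
        yield k2, float(v1)

def shuffle(pairs):
    # Shuffle/partition stage: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce(): compute the aggregates for each key group,
    # mirroring count/sum/max/min/avg ... group by k2.
    return {
        key: {"count": len(vs), "sum": sum(vs), "max": max(vs),
              "min": min(vs), "avg": sum(vs) / len(vs)}
        for key, vs in groups.items()
    }

input_file = ["a,1", "b,2", "a,3"]
result = reduce_phase(shuffle(map_phase(input_file)))
```

In real Hadoop the three stages run on different machines over partitions of the data, but the data flow is the same.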
MapReduce jobs can be combined/chained, just as SQL statements can be nested, to produce complex aggregation output.
Is that correct?
With Hive on top of Hadoop, the MR code is generated by the HiveQL process engine.
Therefore, from a coding perspective, MR coding in Java will gradually be replaced by high-level HiveQL.
Is that true?

Have a look at this post for a comparison between RDBMS and Hadoop:
1. Unlike an RDBMS, Hadoop can handle petabytes of data distributed over thousands of nodes on commodity hardware. The efficiency of the MapReduce algorithm comes from data locality during processing.
2. An RDBMS can handle structured data only, unlike Hadoop, which can handle structured, unstructured and semi-structured data.
Your understanding is correct regarding aggregation, grouping and partitioning.
However, your example only covers processing of structured data.
HiveQL gets converted into a series of MapReduce jobs. Performance-wise, HiveQL jobs will be slower than hand-written MapReduce jobs. HiveQL can't handle all types of data, as explained above, and hence it can't replace MapReduce jobs written in Java.
HiveQL will co-exist with MapReduce jobs in other languages. If performance is the key criterion for your MapReduce job, consider a Java MapReduce job as the alternative. If you need MapReduce jobs for semi-structured and unstructured data, consider alternatives to HiveQL.

Related

Hadoop Performance When retrieving Data Only

We know that the performance of Hadoop may be increased by adding more data nodes. My question is: if we want to retrieve the data only, without the need to process or analyze it, will adding more data nodes be useful? Or won't it increase performance at all, because we have retrieve operations only, without any computations or MapReduce jobs?
I will try to answer in parts:
If you only retrieve information from a Hadoop cluster or HDFS, then
it is similar to the cat command in Linux: you are only reading data,
not processing it.
If you want some calculations, like SUM, AVG or any other aggregate
functions, on top of your data, then the concept of REDUCE applies,
and hence MapReduce comes into the picture.
So Hadoop is useful, or worth it, when your data is huge and you
also do calculations. I think there is no performance benefit in
reading a small amount of data from HDFS versus a large amount
(just as if you stored your data in an RDBMS and only ran
select * statements on a daily basis), but when your data grows
exponentially and you want to do calculations, an RDBMS query
would take a long time to execute.
For MapReduce to work efficiently on huge data sets, you need a
good number of nodes and enough computing power, depending on your
use case.

MapReduce real life uses

I am wondering in which cases MapReduce is chosen over Hive or Pig.
I know that it is used when:
We need in-depth filtering of the input data.
We are working with unstructured data.
We are working with graphs. ....
But is there any case where we can't use Hive or Pig, or where MapReduce works much better? And is it used heavily in real projects?
Hive and Pig are generic solutions, and they have overhead while processing the data. In most scenarios this is negligible, but in some cases it can be considerable.
If many tables need to be joined, Hive and Pig try to apply a generic plan; if you write MapReduce after understanding the data, you can come up with a more optimal solution.
However, MapReduce should be treated as the kernel. If your solution can be reused elsewhere, it is better to develop it with MapReduce and integrate it with Hive/Pig/Sqoop.
Pig can also be used to process unstructured data, and it gives more flexibility than Hive while processing data.
Bare MapReduce is not written very often these days. Higher level abstractions such as the two you mentioned are more popular and adequate for query workloads.
Even in scenarios where HiveQL is too restrictive one might seek alternatives such as Cascading or Scalding for low-level batch jobs or the ever more popular Spark.
A primary motivation for using these high-level abstractions is that most applications require a sequence of map and reduce phases, and the MapReduce APIs leave you on your own to figure out how to serialize data between jobs.
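The chaining problem above can be illustrated with a toy single-process sketch (not Hadoop code): the output records of one map/reduce job become the input records of the next, and the tiny driver stands in for the plumbing that real MapReduce code must write by hand.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # A minimal driver: map every record, shuffle by key, reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

# Job 1: word count.
def tokenize(line):
    for word in line.split():
        yield word, 1

def count(word, ones):
    return word, sum(ones)

# Job 2: consumes job 1's output, inverting it to (count, [words]).
def invert(pair):
    word, n = pair
    yield n, word

def collect(n, words):
    return n, sorted(words)

lines = ["a b a", "b a"]
job1 = run_job(lines, tokenize, count)
job2 = run_job(job1, invert, collect)
```

In real Hadoop, job1's output would be written to HDFS in a serialized form and re-read by job2; frameworks like Cascading, Scalding and Spark hide exactly that step.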

Map Reduce with HIVE

I have 4 different data sets in the form of 4 CSV files, and the common field among them is ID. I have to implement a join across them. Which would be better for implementing this, MapReduce or Hive? And is it possible to combine both MapReduce and Hive?
Many thanks.
Hive translates Hive queries into a series of MapReduce jobs to emulate the query's behaviour. While Hive is very useful, it is not always efficient to represent your business logic as a Hive query.
If you are fine with a performance delay and have large data sets to join, you can go for Hive.
If your data sets are small, you can still use MapReduce joins or the distributed cache.
Have a look at Map Reduce Joins article.
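The distributed-cache approach mentioned above can be sketched as a map-side join in plain Python (illustrative only; the table contents and field layout are made up): the small table is loaded into memory on every mapper, so each big-table record can be joined during the map phase with no shuffle or reduce at all.

```python
# Small table, e.g. shipped to every mapper via the distributed cache.
small = {"1": "alice", "2": "bob"}

def map_join(line, lookup):
    # Map-side join: enrich each big-table record via a dict lookup.
    record_id, amount = line.split(",")
    if record_id in lookup:
        yield record_id, (lookup[record_id], int(amount))

big = ["1,100", "2,200", "3,300"]
joined = [pair for line in big for pair in map_join(line, small)]
# Record 3 has no match in the small table, so it is dropped (inner join).
```

This only works when one side of the join fits in memory; otherwise you need a reduce-side join, where records from both inputs are shuffled to reducers by the join key.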
Most of the time, MapReduce will give better performance and control than Hive for a given use case, but the code has to be written with a good understanding of that use case.
Yes, it is possible to combine MapReduce and Hive.

Hive query generation is taking long time to generate dataset

I am trying to run Hive queries on a huge amount of data (almost half a petabyte), and these queries run MapReduce internally. It takes a very long time to generate the data set (for the MapReduce jobs to complete). What optimization mechanisms for Hive and Hadoop can I use to make these queries faster? One more important question: is the amount of disk available for MapReduce, or in the /tmp directory, important for faster MapReduce?
There is not too much you can do, but I can give a few directions on what can usually be done with Hive:
Select SQL that causes less shuffling. For example, try to trigger map-side joins when possible. You can also structure some operations so that they lead to map-only queries.
Another way is to tune the number of reducers. Sometimes Hive assigns far fewer reducers than needed, so you can set the number manually to better utilize your cluster.
If you have a number of queries to run for your transformation, you can set a low replication factor for the temporary data in HDFS.
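The three directions above map to Hive session settings along these lines (a sketch only; these property names exist in classic Hive, but defaults and exact names vary by version, so check your release's documentation before relying on them):

```sql
-- Encourage map-side joins when one table is small enough to cache.
SET hive.auto.convert.join=true;

-- Manually raise the reducer count to better utilize the cluster.
SET mapred.reduce.tasks=64;

-- Lower the replication factor for intermediate/temporary data.
SET dfs.replication=1;
```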
More help can be provided if we have info on what you are doing.

How is sorting (ORDER BY) implemented in Hive?

We know that Hive doesn't do sampling before a sorting job starts. It just leverages the sorting mechanism of MapReduce and performs a merge sort on the reduce side, with only one reducer. Since the reducer collects all data output by the mappers in this scenario, say the machine running the reducer has only 100 GB of disk: what if the data is too big to fit on that disk?
The parallel sorting mechanism of Hive is still under development; see here.
A well-designed data warehouse or database application will avoid such global sorting. If it is needed, try using Pig or TeraSort (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html)
