We know that Hive doesn't do any sampling before a sorting job starts. It just leverages the sorting mechanism of MapReduce, performing a merge sort on the reduce side with only one reducer. Since that single reducer collects all the data output by the mappers, say the machine running the reducer has only 100 GB of disk, what happens if the data is too big to fit on that disk?
The parallel sorting mechanism of Hive is still under development, see here.
A well-designed data warehouse or database application will avoid such global sorting. If it is really needed, try using Pig or TeraSort (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html).
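If a total order is not strictly required, one workaround inside Hive itself is to replace ORDER BY (which funnels everything through a single reducer) with DISTRIBUTE BY plus SORT BY, which sorts within each reducer in parallel. A minimal sketch, using a hypothetical events table:

SET mapred.reduce.tasks=50;
-- Each reducer receives all rows for its partitions and sorts them;
-- every output file is sorted, but there is no single global order.
SELECT *
FROM events
DISTRIBUTE BY event_date
SORT BY event_time;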
We know that the performance of Hadoop may be increased by adding more data nodes. My question is: if we only want to retrieve the data, without the need to process or analyze it, will adding more data nodes be useful? Or won't it increase performance at all, because we only have retrieve operations without any computations or MapReduce jobs?
I will try to answer in parts:
If you only retrieve information from a Hadoop cluster or HDFS, it is similar to the cat command in Linux, meaning you are only reading data, not processing it.
If you want calculations like SUM, AVG or any other aggregate functions on top of your data, then the concept of reduce applies, and hence MapReduce comes into the picture.
So Hadoop is useful or worthwhile when your data is huge and you also do calculations on it. I think there is no performance benefit in reading a small amount of data in HDFS compared to reading a large amount of data in HDFS (think of storing your data in an RDBMS and only running select * statements on a daily basis), but when your data grows exponentially and you want to do calculations, your RDBMS queries will take a long time to execute.
For MapReduce to work efficiently on huge data sets, you need a good number of nodes and enough computing power, depending upon your use case.
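A minimal illustration of the difference in HiveQL, with a hypothetical logs table stored in HDFS:

-- Pure retrieval: essentially just reading the files back from HDFS
SELECT * FROM logs LIMIT 10;

-- Aggregation: needs a reduce phase, so a MapReduce job is launched
SELECT host, AVG(bytes)
FROM logs
GROUP BY host;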
I am reading Hadoop MapReduce tutorials and have come up with the following shallow understanding. Could anyone help confirm whether my understanding is correct?
MapReduce is a way to aggregate data in a distributed environment, with non-structured data in very large files, using Java, Python, etc., to produce results similar to what could be done in an RDBMS using SQL aggregate functions:
select count(*), sum(v2), max(v2), min(v2), avg(v2), k2
from input_file
group by k2
The map() method basically pivots the horizontal data v1, i.e. a line from the input file, into vertical rows, with each row having a string key and a numeric value.
The grouping will happen in the shuffle and partitioning stage of the data flow.
The reduce() method will be responsible for computing/aggregating the data.
MapReduce jobs can be combined/nested just as SQL statements can be nested to produce complex aggregation output.
Is that correct?
With Hive on top of Hadoop, the MR code will be generated by the HiveQL process engine.
Therefore, from a coding perspective, MR coding in Java will gradually be replaced with high-level HiveQL.
Is that true?
Have a look at this post for comparison between RDBMS & Hadoop
1. Unlike an RDBMS, Hadoop can handle petabytes of data, distributed over thousands of nodes of commodity hardware. The efficiency of the MapReduce algorithm lies in data locality during processing.
2. An RDBMS can handle structured data only, unlike Hadoop, which can handle structured, unstructured and semi-structured data.
Your understanding is correct regarding aggregation, grouping and partitioning.
You have provided an example only for processing structured data.
HiveQL is converted into a series of MapReduce jobs. Performance-wise, HiveQL jobs will be slower than raw MapReduce jobs. HiveQL can't handle all types of data, as explained above, and hence it can't replace MapReduce jobs written in Java.
HiveQL will co-exist with MapReduce jobs in other languages. If performance is the key criterion for your job, you have to consider a Java MapReduce job as the alternative. If you need MapReduce jobs for semi-structured and unstructured data, you have to consider alternatives to HiveQL MapReduce jobs.
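To see how Hive turns a HiveQL statement into MapReduce stages, you can run EXPLAIN on it. A minimal sketch, assuming a hypothetical sales(region, amount) table:

EXPLAIN
SELECT region, SUM(amount)
FROM sales
GROUP BY region;

The output shows the stage plan (a map operator tree feeding a group-by in the reduce operator tree), which is the work Hive submits as MapReduce jobs.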
I have 4 different data sets in the form of 4 CSV files, and the common field among them is ID. I have to implement this using a join. Which would be better for this, MapReduce or Hive, and is it possible to combine both MapReduce and Hive?
Many thanks.
Hive translates Hive queries into a series of MapReduce jobs to emulate the query's behaviour. While Hive is very useful, it is not always efficient to represent your business logic as a Hive query.
If you are fine with some delay in performance and have large data sets to join, you can go for Hive.
If your data sets are small, you can still use MapReduce joins or the distributed cache.
Have a look at the MapReduce joins article.
Most of the time, MapReduce will give better performance and control compared to Hive for any of these use cases, but the code has to be written with a good understanding of the use case.
Yes, it is possible to combine both MapReduce and Hive.
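If you go the Hive route, a minimal sketch might look like the following; the file paths, table names and columns are hypothetical, and the other three tables would be created the same way:

-- External table over one of the CSV files
CREATE EXTERNAL TABLE t1 (id INT, col_a STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/csv/t1';

-- Join on the common ID. The MAPJOIN hint keeps the small tables in memory,
-- which is roughly the Hive counterpart of a distributed-cache (map-side) join.
SELECT /*+ MAPJOIN(t2, t3, t4) */
       t1.id, t1.col_a, t2.col_b, t3.col_c, t4.col_d
FROM t1
JOIN t2 ON t1.id = t2.id
JOIN t3 ON t1.id = t3.id
JOIN t4 ON t1.id = t4.id;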
Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues when compared to Java MapReduce jobs.
Do we have any benchmark comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with runtime data would be a real help.
Thanks
Your premise that it "should have performance issues when compared to Java MapReduce jobs" is wrong...
Hive (and Pig and Crunch and other map/reduce abstractions) would be slower than a fully tuned, hand-written map/reduce job.
However, unless you're experienced with Hadoop and map/reduce, the chances are that the map/reduce you'd write would be slower on non-trivial queries than what Hive et al. will do.
I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell whether that was Hive's performance or my VM hanging due to low memory. One thing to keep in mind is that Hive will always try to determine the fastest way to run a MapReduce job. Now, when you write small MapReduce jobs, you'll probably be able to find the fastest way yourself. But with large, complex jobs (with joins, etc.), will you always be able to compete with Hive?
Also, writing a MapReduce job with multiple classes and methods seems to take ages in comparison with writing a HiveQL query.
On the other hand, I had the feeling that when I wrote the job myself it was easier to know what was going on.
If you have a small dataset on your machine and want to process it using Apache Hive, executing the job on that small dataset will be slow compared to processing the same dataset with Hadoop MapReduce; Hive's performance degrades slightly on small datasets. For large datasets, on the other hand, Apache Hive's performance would be better compared to plain MapReduce.
While processing datasets with MapReduce, the data is stored in HDFS. MapReduce has no database of its own, whereas Hive has a metastore. Through Hive's metastore, data can be shared with Impala, Beeline, and JDBC and ODBC drivers.
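Part of the small-dataset overhead is simply the cost of launching MapReduce jobs. A minimal sketch of Hive settings that reduce that overhead (the values and the small_table name are illustrative):

SET hive.exec.mode.local.auto=true;   -- run small jobs locally instead of submitting them to the cluster
SET hive.fetch.task.conversion=more;  -- serve simple SELECT/filter queries as a fetch task, with no MapReduce job
SELECT * FROM small_table WHERE id = 42;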
I am trying to run Hive queries on a huge amount of data (almost half a petabyte), and these queries run MapReduce internally. It takes a very long time to generate the data set (for the MapReduce jobs to complete). What optimization mechanisms for Hive and Hadoop can I use to make these queries faster? One more important question I have: is the amount of disk available for MapReduce, or in the /tmp directory, important for faster MapReduce?
There is not too much you can do, but I can give a few directions on what can usually be done with Hive (a short sketch of the relevant settings follows these points):
You should prefer SQL that causes less shuffling. For example, you can try to trigger map-side joins when possible. You can also phrase some operations in a way that leads to map-only queries.
Another way is to tune the number of reducers. Sometimes Hive defines far fewer reducers than needed, so you can set the number manually to better utilize your cluster.
If you have a number of queries to run for your transformation, you can define a low replication factor for this temporary data in HDFS.
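A minimal sketch of these three directions as Hive session settings; the table names are hypothetical and the exact values depend on your cluster:

-- 1. Encourage map-side joins so small tables are loaded into the mappers' memory
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

-- 2. Raise the number of reducers if Hive's own estimate underuses the cluster
SET mapred.reduce.tasks=200;          -- or tune hive.exec.reducers.bytes.per.reducer

-- 3. Lower the replication factor for intermediate data written by staging queries
SET dfs.replication=1;
CREATE TABLE tmp_stage AS
SELECT f.id, f.amount, d.label
FROM fact f JOIN dim d ON f.dim_id = d.id;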
More help can be provided if we have more info on what you are doing.