Apache Spark: map-side aggregation vs in-map combiner - hadoop

I'm currently migrating from the Hadoop MR paradigm to Apache Spark, and a few doubts come to mind regarding advanced efficiency patterns that go beyond the usual basic "map and reduce" workflow.
In this well-known book (Lin and Dyer 2010) the "in-mapper combiner" pattern is introduced, which can significantly improve efficiency in many applications.
For example, the canonical word count example in Hadoop, where we normally emit (word, 1) tuples to be combined further downstream, can be greatly improved if local aggregation is performed and (word, n) tuples are emitted instead. Although combiners can fulfil this behaviour, my experience is that using local variables in each mapper along with Hadoop's setup and cleanup methods can lead to higher computational savings (here is a nice tutorial).
Inside the Spark world I could not find anything similar, just the so-called map-side aggregation, which is equivalent to Hadoop's local combiner. Given the previous example, I wonder whether it can be translated into Spark using map functions.
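For what it's worth, here is a minimal sketch (Scala, Spark RDD API, runnable in the spark-shell where sc is the SparkContext) of how the word-count example could be translated; the input path is hypothetical. Variant (a) relies on reduceByKey, which already performs map-side aggregation, while variant (b) reproduces the in-mapper-combiner idea explicitly with mapPartitions:

// Minimal sketch; the HDFS path is made up.
val lines = sc.textFile("hdfs:///tmp/input.txt")

// (a) reduceByKey merges values locally on each partition before the
//     shuffle, much like a Hadoop combiner (map-side aggregation).
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// (b) In-mapper-combiner analogue: keep per-partition state in a local map
//     (the role played by setup()/cleanup() state in a Hadoop mapper) and
//     emit one (word, n) pair per distinct word per partition.
val countsLocal = lines
  .flatMap(_.split("\\s+"))
  .mapPartitions { words =>
    val acc = scala.collection.mutable.HashMap.empty[String, Int]
    words.foreach(w => acc(w) = acc.getOrElse(w, 0) + 1)
    acc.iterator
  }
  .reduceByKey(_ + _)   // merge the per-partition partial counts

In practice (a) already gives the combiner behaviour for free, so the explicit mapPartitions version is mostly interesting when you want control over the data structure used for the local aggregation.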

Related

Performance: Pig vs Hive

I have discovered some (significant) performance differences (in terms of real runtime as well as CPU time) between Pig and Hive and am looking for ways to get to the bottom of these differences. I have used both languages' explain features (i.e. Hive: the EXPLAIN keyword, Pig: pig -e 'explain -script explain.pig') to contrast and compare the generated syntax trees and the logical, physical and map-reduce plans. However, both seem to do the same things. The job tracker, however, shows a difference in the number of map and reduce tasks launched (I consequently ensured that both use the same number of map and reduce tasks, and the performance difference remains). My question therefore is: in what other ways can I analyze what is going on (possibly at a lower level / bytecode level)?
EDIT: I am running the TPC-H benchmarks published by the TPC (available at https://issues.apache.org/jira/browse/PIG-2397 and https://issues.apache.org/jira/browse/HIVE-600). However, even simpler scripts show quite a large performance difference. For example:
SELECT (dataset.age * dataset.gpa + 3) AS F1,
(dataset.age/dataset.gpa - 1.5) AS F2
FROM dataset
WHERE dataset.gpa > 0;
I still need to fully evaluate the TPC-H benchmarks (I will update later); however, the results for the simpler scripts are detailed in this document: https://www.dropbox.com/s/16u3kx852nu6waw/output.pdf
(jpg: http://i.imgur.com/1j1rCWS.jpg )
I have read some of the Pig and Hive source code, so I can share a few opinions.
Since I was focusing on the join implementation, I can provide some details of how Pig and Hive implement joins. Hive's join implementation is less efficient than Pig's. I have no idea why Hive needs to create so many objects in its join implementation (such operations are very slow and should be avoided), but I think that is why Hive performs joins more slowly than Pig. If you are interested, you can check the CommonJoinOperator code yourself. So my guess is that Pig is usually more efficient because of its higher-quality code.

Map-Reduce - only applicable to key-value NoSql data models?

(I only have conceptual knowledge of NoSQL, no working experience)
I am aware of the following types of NoSQL databases:
key-value, column family, document databases (Aggregates)
graph databases
Is the Map-Reduce paradigm applicable to all of them? My guess would be no, since Map-Reduce is often discussed in terms of keys and values, but since the distinction between different NoSQL stores isn't so clean-cut, I am wondering where Map-Reduce is and isn't applicable. And since I'm in the process of evaluating which DB to use for a few app ideas I have, I should consider whether it's possible to achieve large-scale processing regardless of which store I use.
Support for map reduce probably shouldn't be the thing on which to base your choice of a datastore.
Firstly, map reduce isn't the only way to do large-scale data processing. For example, MongoDB implemented map reduce support early (in v1), but later added its Aggregation Framework, which is much more general and subsumes many tasks that would otherwise use map reduce.
Map reduce is just one paradigm for processing large data sets. Use it only if your application needs to process a large number of data records with a mapper and then needs to combine results together with a reducer. That's all it really does. As to when the paradigm is applicable and when it is not, simply look at your use case. Do you need to manipulate all of your records consistently and then combine the results? Or is there another way to phrase your problem?
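As a toy illustration of that shape (a sketch in plain Scala on an in-memory collection, with made-up records), the whole paradigm boils down to:

// "map": transform every record independently; "reduce": combine the results.
val records = Seq("alice 3", "bob 5", "alice 2")

val total = records
  .map(_.split(" ")(1).toInt)  // extract a numeric value from each record
  .reduce(_ + _)               // fold all mapped values into one result

// total == 10

Distributed map-reduce frameworks do essentially this, except that the map step runs in parallel across machines and the reduce step merges partial results.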
Take a look at the Mongo aggregation framework for examples where aggregation offers a simpler alternative for problems that would be overkill to force into a map-reduce shape.
It should also help give you insight into your question of whether you can do large-scale data processing without map-reduce, to which the answer is yes. Clearly map-reduce is good for making search indexes, but many problems on large data sets benefit from other paradigms.
A web search on "alternatives to map reduce" will also be helpful.

practical usage of hadoop map reduce hive pig hbase

Hello,
I am learning Hadoop, and after reading the material found on the net (tutorials, map reduce concepts, Hive, Pig and so on) and developing some small applications with them, I would like to learn about the real-world usage of these technologies.
What everyday software that we use is based on the Hadoop stack?
If you use the internet, there is a good chance that you are indirectly impacted by Hadoop/MapReduce, from Google Search to Facebook to LinkedIn and so on. Here are some interesting links to find out how widespread Hadoop/MR usage is:
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
10 ways big data changes everything
One thing to note is that Hadoop/MR is not an efficient solution for every problem. Also consider other distributed programming models, such as those based on BSP.
Happy Hadooping !!!
Here are some MapReduce examples that will be helpful for beginners:
1. Word Count
2. SQL Aggregation using MapReduce
3. SQL Aggregation on multiple fields using MapReduce
URL - http://hadoopdeveloperguide.blogspot.in/

Efficient MapReduce when dealing with streams of queries to the same dataset

I have a massive, static dataset and a function f to apply to it.
The computation has the form reduce(map(f, dataset)), so I would use the MapReduce skeleton. However, I don't want to scatter the data on each request (and ideally I want to take advantage of indexing in order to speed up f). Is there a MapReduce implementation that addresses this general case?
I've taken a look at IterativeMapReduce and maybe it does the job, but it seems to address a slightly different case, and the code isn't available yet.
Hadoop's MapReduce (and all the other map-reduce skeletons inspired by Google's) doesn't scatter the data on every request: the data lives in a distributed file system, and the computation is shipped to the nodes that already hold it.

What default reducers are available in Elastic MapReduce?

I hope I'm asking this in the right way. I'm learning my way around Elastic MapReduce and I've seen numerous references to the "Aggregate" reducer that can be used with "Streaming" job flows.
In Amazon's "Introduction to Amazon Elastic MapReduce" PDF it states that "Amazon Elastic MapReduce has a default reducer called aggregate".
What I would like to know is: are there other default reducers available?
I understand that I can write my own reducer, but I don't want to end up writing something that already exists and "reinvent the wheel" because I'm sure my wheel won't be as good as the original.
The reducer they refer to is documented here:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
That's a reducer that is built into the streaming utility. It provides a simple way of doing common calculations by writing a mapper whose output keys are formatted in a special way.
For example, if your mapper outputs:
LongValueSum:id1\t12
LongValueSum:id1\t13
LongValueSum:id2\t1
UniqValueCount:id3\tval1
UniqValueCount:id3\tval2
The reducer will calculate the sum of each LongValueSum, and count the distinct values for UniqValueCount. The reducer output will therefore be:
id1\t25
id2\t1
id3\t2
The reducers and combiners in this package are very fast compared to running streaming combiners and reducers, so using the aggregate package is both convenient and fast.
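For concreteness, a streaming mapper that produces keys in this format might look roughly like the sketch below (written in Scala here, though any executable that reads stdin and writes tab-separated lines works with Hadoop Streaming; the "id value" input layout is an assumption for illustration):

// Sketch of a Hadoop Streaming mapper feeding the aggregate reducer.
// Assumes each input line looks like "id value" (whitespace separated).
object AggregateMapper {
  def main(args: Array[String]): Unit = {
    for (line <- scala.io.Source.stdin.getLines()) {
      val fields = line.trim.split("\\s+")
      if (fields.length >= 2) {
        val id    = fields(0)
        val value = fields(1)
        println(s"LongValueSum:$id\t$value")    // sum the numeric values per id
        println(s"UniqValueCount:$id\t$value")  // count the distinct values per id
      }
    }
  }
}

If I remember correctly, the streaming job then selects the built-in reducer with -reducer aggregate.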
I'm in a similar situation. I infer from Google results etc that the answer right now is "No, there are no other default reducers in Hadoop", which kind of sucks, because it would be obviously useful to have default reducers like, say, "average" or "median" so you don't have to write your own.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html shows a number of useful aggregator uses but I cannot find documentation for how to access other functionality than the very basic key/value sum described in the documentation and in Erik Forsberg's answer. Perhaps this functionality is only exposed in the Java API, which I don't want to use.
Incidentally, I'm afraid Erik Forsberg's answer is not a good answer to this particular question. Another question for which it could be a useful answer can be constructed, but it is not what the OP is asking.

Resources