I am currently reading Hadoop in Action. The book is very good; however, it uses Hadoop 1.2.1 to explain and showcase all the examples, and I am using Hadoop 2.2.0.
Does anybody know where I can find full documentation of the Hadoop API changes, and a simple mapping between 1.2.1 and 2.2.0?
For example,
DataJoinMapperBase, DataJoinReducerBase, and TaggedMapOutput
are not present in 2.2.0, and I am looking for their counterparts in 2.2.0 :)
Thanks
"Hadoop: The Definitive Guide, Third Edition" by Tom White
supports Hadoop v2.2.
The source code is available on GitHub: https://github.com/tomwhite/hadoop-book
As mentioned on GitHub:
This version of the code has been tested with:
* Hadoop 1.2.1/0.22.0/0.23.x/2.2.0
* Avro 1.5.4
* Pig 0.9.1
* Hive 0.8.0
* HBase 0.90.4/0.94.15
* ZooKeeper 3.4.2
* Sqoop 1.4.0-incubating
* MRUnit 0.8.0-incubating
Regarding your question
Hadoop 2.2 runs MapReduce v2 (on YARN) and favors the newer org.apache.hadoop.mapreduce API, while Hadoop 1.x examples often use the older org.apache.hadoop.mapred API. Check this book; it clearly explains the differences in MapReduce code between 1.x and 2.2.
Hope it helps!
Related
I was trying to find any major differences between Storm 1.1 and Storm 2.0.
Is there any difference in setting up a cluster for either version?
(I read on the official website about the new Java-based implementation, but has anyone seen any practical difference between these two versions?)
In addition to reading the changelog at https://www.apache.org/dist/storm/apache-storm-2.0.0/RELEASE_NOTES.html, you can look at https://issues.apache.org/jira/browse/STORM-2306?focusedCommentId=16291947&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16291947 for some performance numbers. You can also run your own benchmarks of course.
Has anyone worked with this configuration: Apache Hive on Apache Spark?
Which are the latest compatible versions for this configuration?
I want to implement this in my production systems. Kindly help with the compatibility matrix for Apache Hadoop, Apache Hive, Apache Spark and Apache Zeppelin.
You have to use Hive 2 (0.11+) and Spark 2.2.0, and in hive-site.xml you have to set Spark as the execution engine so that you can run your queries on top of Spark.
Hive 2 also offers other options such as Tez and LLAP. For more information, see the document Hive on Spark: Getting Started.
Follow the tutorial "Apache Hive installation", and then just copy hive-site.xml to $APACHE_HOME/conf.
Hive is moving to rely only on the Tez execution engine. Please build all new workloads on MapReduce or Tez.
I am using Hadoop with Elasticsearch as the data store (no HDFS).
Do you know whether elasticsearch-hadoop can work with this setup?
If not, do you know how I could run analytics for my project?
Yes, there is a connector for Elasticsearch and Hadoop that is built and released by Elasticsearch:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
They just released the GA version 2.0 - here's the blog post about it:
http://www.elasticsearch.org/blog/es-hadoop-2-0-g/
I have found some pages saying that Hadoop 2.0 has a built-in benchmark test tool for shuffle.
But I'm unable to find it!
Could somebody tell me where to look for it? I know that in Hadoop 0.20.* there is a test jar, but I can't find it in Hadoop 2.0.
I am referring to the Combine step mentioned on the Hadoop wiki. I have been unable to find a reference to it in the AWS documentation, and I'd like to utilize this step.
The documentation for the combiner is in the Apache documentation, not in the AWS documentation. Amazon Elastic MapReduce supports Hadoop versions 0.18.3 and 0.20.2 with custom patches. The Apache MapReduce Tutorial explains how the combiner function should be used. Call Job.setCombinerClass() to set the combiner class.
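To make the idea concrete, here is a minimal plain-Java sketch of what the combine step does conceptually: it pre-aggregates each map task's output locally, so fewer pairs have to be shuffled to the reducers. This is not Hadoop API code. The class CombinerSketch and the methods map and sumByKey are hypothetical names made up for the illustration, and the word-count input is assumed. In a real job you would not call anything like this yourself; you would pass your reducer-style class to Job.setCombinerClass() and let the framework decide when to run it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Hadoop.
public class CombinerSketch {

    // Hypothetical stand-in for a map task: emits one (word, 1) pair per token.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sums the values for each key. Because addition is associative and
    // commutative, the same logic can serve as both combiner and reducer.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] splits = {"a b a", "b a c"}; // assumed input, one string per map task
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        int rawPairs = 0;
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            rawPairs += mapped.size();
            // Combine step: aggregate locally before anything is "sent" to the reducer.
            shuffled.addAll(sumByKey(mapped).entrySet());
        }
        // Reduce step: aggregate the already-combined pairs from all map tasks.
        Map<String, Integer> totals = sumByKey(shuffled);
        System.out.println("pairs emitted by map tasks: " + rawPairs);
        System.out.println("pairs after combining: " + shuffled.size());
        System.out.println("totals: " + totals);
    }
}
```

One design point to keep in mind: a combiner is an optimization, so the job must produce the same result whether it runs zero, one, or several times. That is why summing works here, while something like a plain average would not without carrying the counts along.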