We are planning to learn and use Hadoop for some small prototyping, not in production for now.
Which distribution of Hadoop is good to start with?
We have already installed Hadoop 0.20.2 and are practicing with it, but the Apache distribution is not compatible with Sqoop and some other tools. To learn the security features we would have to go for Hadoop 1.0.0, which is currently in beta.
Could you please tell us which distribution and version of Hadoop is good for learning and prototyping?
Download the code from trunk (or the 0.23 branch) and do a build; it should have all the features. Alternatively, download the 0.23 release, which should have all the features to date except MRv1.
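Whichever build you end up with, a quick way to confirm which release is actually on your classpath is a tiny check like the following (a minimal sketch; the class name is just a placeholder):

    import org.apache.hadoop.util.VersionInfo;

    public class HadoopVersionCheck {
        public static void main(String[] args) {
            // Prints the version baked into the Hadoop jars on the classpath,
            // so you can verify you are really running the build you intended.
            System.out.println("Hadoop version: " + VersionInfo.getVersion());
            System.out.println("Built from revision " + VersionInfo.getRevision());
        }
    }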
I visited the website to download the latest version and found that 2.8.4 was released after 2.9.1. Why does that happen, and which one should I download?
Why are companies still running Java 6 and 7 even though they are end-of-life? Why does Java 8 still get updates when Java 9 and 10 are available?
My point is that, at one point, Hadoop 2.7.x was the stable branch. 2.8 and 2.9 introduce potentially breaking or otherwise major, possibly unstable changes. The earlier release lines still need support to address bugs and backport useful features; you're welcome to read the release notes to see what those may be.
It's worth mentioning that Hadoop vendors like Hortonworks and Cloudera currently ship some 2.7.x version with patches applied on top of what you'd get from the Apache site.
Meanwhile, if you want the latest and greatest and don't care about stability, you can use Hadoop 3.x; but if you also want things like Spark, Sqoop, HBase, or Hive, then I'd suggest staying on 2.7 for now, or at least reading over the documentation for each component to check its installation requirements.
I have a cluster with Hadoop 2.0.0-cdh4.4.0, and I need to run Spark on it with YARN as the resource manager. I got the following information from http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version
You can enable the yarn profile and optionally set the yarn.version property if it is different from hadoop.version. Spark only supports YARN versions 2.2.0 and later.
I don't want to upgrade the whole Hadoop package to get YARN 2.2.0, because my HDFS holds a massive amount of data, and upgrading it would mean too long a service outage and be too risky for me.
I think the best choice for me may be to use a version of YARN that is 2.2.0 or higher while keeping the rest of my Hadoop installation unchanged. If that is the way to go, what steps should I follow to get such a YARN package and deploy it on my cluster?
Or is there another approach to running Spark on Hadoop 2.0.0-cdh4.4.0 with YARN as the resource manager?
While you could theoretically upgrade just your YARN component, my experience suggests you run a large risk of library and other component incompatibilities if you do that. Hadoop consists of a lot of components, but they're generally not as decoupled as they should be; that is one of the main reasons CDH, HDP and other Hadoop distributions bundle only specific versions known to work together. If you have commercial support with a vendor but change the version of one component, they generally won't support you, because things tend to break when you do this.
In addition, CDH4 reached End of Maintenance last year and is no longer being developed by Cloudera, so if you find anything wrong you're going to find it hard to get fixes (generally you'll be told to upgrade to a newer version). I can also say from experience that if you want to use newer versions of Spark (e.g. 1.5 or 1.6), you also need a newer version of Hadoop (be it CDH, HDP or another one): Spark has evolved quickly and YARN support was bolted on later, so there are plenty of bugs and issues in earlier versions of both Hadoop and Spark.
Sorry, I know it's not the answer you're looking for, but upgrading Hadoop to a newer version is probably the only way forward if you actually want things to work and don't want to spend a lot of time debugging version incompatibilities.
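Once you are on a Hadoop release whose YARN Spark supports (2.2.0 or later), pointing an application at YARN is mostly a matter of the master setting. A minimal sketch using Spark's Java API (assuming HADOOP_CONF_DIR points at your cluster configuration; the class and app names are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class YarnSmokeTest {
        public static void main(String[] args) {
            // "yarn-client" makes the driver request executors from the cluster's
            // ResourceManager; this only works against YARN 2.2.0 or later.
            SparkConf conf = new SparkConf()
                    .setAppName("yarn-smoke-test")
                    .setMaster("yarn-client");
            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println("Default parallelism: " + sc.defaultParallelism());
            sc.stop();
        }
    }

In practice you would usually launch with spark-submit --master yarn-client rather than hard-coding the master, but the version requirement on the cluster side is the same.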
I recently installed Hadoop 2 with the YARN configuration. I am planning to install the Hadoop ecosystem stack, such as Pig, Hive, HBase, Oozie, ZooKeeper, etc. I would like to know whether I should install these tools from the same links I used for the Hadoop 1.0 configuration. If not, could anyone please send me the links to the Hadoop 2 versions of these tools? I heard that Pig and Hive are faster on Hadoop 2.0, so I would like to know if there are better versions.
Thanks,
Gautham
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-1/CDH4-Installation-Guide/cdh4ig_topic_16_2.html
This may be useful.
Also, I think the configuration isn't different from v1.
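For what it's worth, the job-side code barely changes either; submitting an existing MapReduce driver to a Hadoop 2 / YARN cluster mostly comes down to a couple of properties. A rough sketch (the host names are placeholders, and these settings normally live in mapred-site.xml / yarn-site.xml rather than in code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class Hadoop2DriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Shown inline only to highlight what changes on Hadoop 2; normally
            // these values come from the cluster's configuration files.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");         // placeholder host
            conf.set("mapreduce.framework.name", "yarn");             // submit to YARN instead of the JobTracker
            conf.set("yarn.resourcemanager.address", "rmhost:8032");  // placeholder host
            Job job = Job.getInstance(conf, "hadoop2-driver-sketch");
            // ... set mapper, reducer, input and output paths exactly as on Hadoop 1 ...
        }
    }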
I am looking to migrate my EMR implementation from an older version to the latest one, primarily because I am facing a lot of issues.
My current implementation uses Hadoop 0.20.2.
I wanted to understand how much effort, in terms of code changes, would be required to migrate from 0.20.2 to:
0.20.205
1.0.1
Are the APIs very different, requiring a lot of recoding? Any basic idea would be very helpful.
0.20.205 was just renamed to 1.0, so it is essentially the same release, and the APIs have hardly any differences. 1.0 is basically 0.20.2 plus the append and security features, which means it supports HBase integration and can be used in enterprise environments.
We ported our jobs running on EMR under 0.20.2 to run directly on 1.0. All our jobs, whether they used the new or the old API, ran correctly without a single issue and without us having to change anything, so I believe you should not face any problems.
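For reference, this is the sort of "new API" (org.apache.hadoop.mapreduce) mapper we run; a minimal word-count-style sketch that compiles and runs unchanged on 0.20.2, 0.20.205 and 1.0:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A typical new-API mapper: the same class works across these releases
    // because 1.0 is essentially 0.20.205 renamed, with the mapreduce.* API intact.
    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }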
I am new to HBase and Hadoop and need compatible versions of HBase and Hadoop to run my experiments.
The current stable version at http://apache.techartifact.com/mirror/hbase/ is hbase-0.94.1. Can anybody kindly tell me which version of Hadoop I should use so that there are no compatibility issues and no future data loss?
Please suggest from the Hadoop and HBase releases that are currently available online.
Below are the sites I am using to download these releases:
http://apache.techartifact.com/mirror/hadoop/common/ (hadoop)
http://apache.techartifact.com/mirror/hbase/ (hbase)
If you want to be sure about the compatibility of the Hadoop and HBase distribution you are using, you might consider using the Apache Bigtop project or the Cloudera CDH package.
Bigtop:
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.
Cloudera:
CDH consists of 100% open source Apache Hadoop plus nine other open source projects from the Hadoop ecosystem. CDH is thoroughly tested and certified to integrate with the widest range of operating systems and hardware, databases and data warehouses, and business intelligence and ETL systems.
Download both from the stable folder. I do not know whether any particular version is incompatible with another.
I use hadoop-0.20.203.0 with hbase-0.94.1 without any problems.
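One cheap way to convince yourself that a given hadoop/hbase pairing works before trusting it with data is a small client smoke test. A sketch against the 0.94-era client API (the ZooKeeper host is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class HBaseSmokeTest {
        public static void main(String[] args) throws Exception {
            // Loads hbase-site.xml from the classpath; the explicit setting below
            // is only an example in case the config file is not on the classpath.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk-host");  // placeholder host
            // Throws an exception if the client jars and the running cluster
            // do not get along (e.g. master not running, connection failure).
            HBaseAdmin.checkHBaseAvailable(conf);
            System.out.println("HBase master is reachable; versions appear compatible.");
        }
    }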