Does Sqoop 2 work with Hadoop 0.20.2?
Which version of Sqoop is best to download, 1.4.2 or 1.99.1?
Thanks!
Sqoop currently has two main branches. Sqoop 1 is the older, fully functional and mature project; it supports Hadoop 0.20, 1.x, 0.23 and 2.0.x. You can download the bits from here. Please make sure that you download the file ending with "_hadoop-0.20"; otherwise you will get weird exceptions.
The second branch is Sqoop2, which is a redesign of the project. A first cut is available as version 1.99.3. This branch supports only Hadoop 1.x and 2.x and can be downloaded from here. Again, make sure to download the version that matches your Hadoop distribution. The build for Hadoop 1.x may well work on 0.20.2, since those versions are not that different, but nobody has verified that.
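The compatibility rules in the answer above can be written down as a small lookup. This is only a sketch of what the answer states, not an official compatibility matrix; the function name `sqoop_branches_for` is made up for illustration.

```python
from typing import List


def sqoop_branches_for(hadoop_version: str) -> List[str]:
    """Return the Sqoop branches that the answer lists as supporting a Hadoop version.

    Hypothetical helper: it only encodes the version ranges quoted in the answer.
    """
    parts = hadoop_version.split(".")
    major, minor = int(parts[0]), int(parts[1])

    # Sqoop 1: Hadoop 0.20, 1.x, 0.23 and 2.0.x
    sqoop1 = major == 1 or (major, minor) in {(0, 20), (0, 23), (2, 0)}
    # Sqoop2: Hadoop 1.x and 2.x only
    sqoop2 = major in (1, 2)

    branches = []
    if sqoop1:
        branches.append("Sqoop 1")
    if sqoop2:
        branches.append("Sqoop2")
    return branches


print(sqoop_branches_for("0.20.2"))  # -> ['Sqoop 1']
```

For 0.20.2 only Sqoop 1 is returned, because the Sqoop2 build for Hadoop 1.x is unverified on 0.20.2.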
I have questions concerning CDH and how it is maintained:
When I go to the packaging info for a specific CDH version, I can check the package version of each component (for instance for CDH 5.5.5: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_55.html#cdh_555 ). However, I don't understand what the "package version" refers to exactly. For instance, for the component Apache Parquet, the "package version" is parquet-1.5.0+cdh5.5.5+181 . How can I find out exactly what source code is packaged? Does this correspond to a label on a specific repo? If I go to the "official" Apache Parquet repo, there is no "cdh5.5.5" branch; the closest thing I can find is a tag called "1.5.0" ( https://github.com/apache/parquet-mr/tree/parquet-1.5.0 ). How do the people at CDH know what parquet-1.5.0+cdh5.5.5+181 refers to exactly?
Still concerning Apache Parquet, how come even the most recent CDH versions are still using Apache Parquet 1.5.0? The date on that tag is 22 May 2014, i.e. more than 3 years ago. Why don't they upgrade to a newer version, like 1.6.0? The reason I'm asking is that there is a bug in 1.5.0 that was fixed more than 3 years ago in Parquet 1.6.0, yet the latest CDH version is still using 1.5.0. Is there a reason why they keep using a really old, buggy version?
Thanks!
You are correct in assuming parquet-1.5.0+cdh5.5.5+181 is closest to Parquet 1.5.0. However, the code will not be identical to upstream Parquet 1.5.0 because:
CDH enforces cross-component compatibility. Code and applications using parquet-1.5.0 must also work with all the other Hadoop services (HDFS, Hive, Oozie, YARN, Spark, Solr, HBase). Any incompatibilities have to be fixed, so CDH's Parquet code includes those bug fixes.
CDH enforces major version compatibility. This means an application written on CDH5.1 should still work on CDH5.5, CDH5.7 and all other CDH5.x versions. This also alters the codebase.
The best way to interpret this is to say that parquet-1.5.0+cdh5.5.5+181 will support all features provided in parquet 1.5.0 and will also work with the corresponding Hadoop services packaged with CDH5.5.
Version compatibility is also the reason why CDH Hadoop service versions run older versions of the related upstream projects. It's much harder to maintain backwards compatibility especially if APIs change between versions.
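To make the naming concrete, a package version such as parquet-1.5.0+cdh5.5.5+181 can be split into its visible parts. The sketch below is illustrative only: the helper and field names are invented here, and the meaning of the trailing number (often read as the count of patches applied on top of upstream) is an assumption, not something stated in this thread.

```python
import re
from typing import NamedTuple


class CdhPackageVersion(NamedTuple):
    component: str         # e.g. "parquet"
    upstream_version: str  # e.g. "1.5.0", the closest upstream release
    cdh_release: str       # e.g. "5.5.5"
    trailing_number: int   # e.g. 181; ASSUMPTION: commonly read as a patch count


# Pattern for strings shaped like <component>-<upstream>+cdh<release>+<number>
_PATTERN = re.compile(
    r"^(?P<component>[A-Za-z][A-Za-z0-9_-]*?)-"
    r"(?P<upstream>\d+(?:\.\d+)*)"
    r"\+cdh(?P<cdh>\d+(?:\.\d+)*)"
    r"\+(?P<trailing>\d+)$"
)


def parse_cdh_package_version(text: str) -> CdhPackageVersion:
    """Split a CDH package version string into its parts (hypothetical helper)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a CDH-style package version: {text!r}")
    return CdhPackageVersion(
        component=match.group("component"),
        upstream_version=match.group("upstream"),
        cdh_release=match.group("cdh"),
        trailing_number=int(match.group("trailing")),
    )


print(parse_cdh_package_version("parquet-1.5.0+cdh5.5.5+181"))
```

The upstream_version field only tells you which upstream release the package is based on; as explained above, the packaged code also carries CDH-specific fixes.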
On Apache's distribution page, Hadoop seems to exist in 0.x, 1.x, and 2.x. However, when discussing MapReduce/YARN, and deciding on a version of Hive and HBase, there only seems to be discussion of Hadoop 1 and 2. Why is this? Is 0.x just a beta release?
The 1.X and 2.X versions derive from the 0.X line, which is still being continued (as far as I know). The version numbering is quite confusing. A helpful chart can be found at https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know . Even if it's quite outdated, you can see the relevant branches and what derives from what.
Also check the question "Hadoop release version confusing" for more explanation.
Hello, I am new to Hadoop and pretty confused by the version names and by which one I should use among 1.x (great support and learning resources), 2.2 or 0.23.
I have read that Hadoop is moving to YARN completely from v0.23 ( link1 ). But at the same time it's all over the web that Hadoop v2.0 is moving to YARN ( link2 ), and I can see the YARN configuration files in Hadoop 2.2 itself.
But since 0.23 seems to be the latest version to me, does 2.2 also support YARN? (Refer to link 1; it says Hadoop will support YARN from v0.23.)
And as a beginner, which version should I go for, 1.x or 2.x, from the perspective of learning Hadoop?
Are other technologies that work with Hadoop, like Pig, Hive etc., available with the latest version of Hadoop?
Thanks.
UPDATE
Thank you all for replying.
I ended up using Hadoop 2.2. Most of the well-known tutorials and resources are outdated, but I found one good book to get started with v2.2:
"Hadoop: The Definitive Guide, Third Edition" by Tom White, which supports Hadoop v2.2.
The source code is given on GitHub: https://github.com/tomwhite/hadoop-book
As mentioned on GitHub:
This version of the code has been tested with:
* Hadoop 1.2.1/0.22.0/0.23.x/2.2.0
* Avro 1.5.4
* Pig 0.9.1
* Hive 0.8.0
* HBase 0.90.4/0.94.15
* ZooKeeper 3.4.2
* Sqoop 1.4.0-incubating
* MRUnit 0.8.0-incubating
Hope it helps!
There are a few active release series. The 1.x release series is a continuation of the 0.20 release series. A few weeks after 0.23 was released, the 0.20 branch formerly known as 0.20.205 was renumbered 1.0. There is next to no functional difference between 0.20.205 and 1.0. This is just a renumbering.
Release 0.23 includes several major new features, including a new MapReduce runtime called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. Similarly, the 2.x release series is a continuation of the 0.23 release series, so 2.2 also supports YARN.
According to the Hadoop 2.2 release notes:
1.2.X - current stable version, 1.2 release
2.2.X - current stable 2.x version
0.23.X - similar to 2.X.X but missing NN HA.
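The lineage described above (0.20 continuing as 1.x, 0.23 continuing as 2.x with YARN) can be summarised in a few lines of code. This is just a sketch of the mapping as stated in this answer; `describe_hadoop_release` is a hypothetical name, and release lines the answer does not mention are rejected rather than guessed.

```python
from typing import Tuple


def describe_hadoop_release(version: str) -> Tuple[str, bool]:
    """Map a Hadoop version string to (release line, ships YARN?).

    Hypothetical helper that only encodes what the answer states.
    """
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])

    # 0.20.205 was renumbered 1.0, so 0.20 and 1.x share one line (no YARN).
    if major == 1 or (major, minor) == (0, 20):
        return ("0.20 / 1.x", False)
    # 2.x continues the 0.23 series, which introduced MapReduce 2 on YARN.
    if major == 2 or (major, minor) == (0, 23):
        return ("0.23 / 2.x", True)
    raise ValueError(f"release line not covered by this answer: {version}")


for v in ("0.20.205", "1.2.1", "0.23.9", "2.2.0"):
    print(v, describe_hadoop_release(v))
```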
I would suggest starting with the Cloudera distribution since you are just starting to learn. CDH 4.5 includes the YARN feature you are looking for. You can also try the Hortonworks distribution. The advantage of going with these vendors is that you do not need to worry about which versions of components such as Hive and Pig work with your Hadoop installation.
I recommend starting with hadoop-2.2.0, which gives you a good grounding. The industry prefers YARN, and only 2.x is used in production.
What is the difference between Hadoop versions 0.x, 1.x and 2.x? Also, can someone tell me how CDH 3 and 4 differ?
Cloudera provides an extensive list of new features and changes in each release:
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Release-Notes/CDH4-Release-Notes.html
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Release-Notes/cdh4rn_topic_2.html
You can also look at the list of individual changes from the Vanilla Apache Hadoop releases:
http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.7.0.CHANGES.txt
There are so many Hadoop versions and different distributions that I am confused. I have a few questions.
Apache Hadoop 1.x is from 0.20.205?
Apache Hadoop 2.0 is from 0.22 or 0.23?
According to this blogpost from Cloudera:
There is next to no functional difference between 0.20.205 and 1.0.
This is just a renumbering.
Hadoop's YARN site states:
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now
have, what we call, MapReduce 2.0 (MRv2) or YARN
It's also worth having a look at this diagram. It shows the tree of different Hadoop versions as well as the third-party distributions built on top of them.
updated answer
http://elephantscale.com/hadoop2_handbook/Hadoop_Versions.html
(disclaimer : I am a co-author of this online book)
Hadoop release 1.0.0 derives from 0.20.x.
As a rule of thumb, remember:
1.x = 0.20.x
2.x > 0.20.x
This makes it easy to remember and choose the correct Apache distribution for a Hadoop cluster setup.