How are CDH packages defined? - hadoop

I have questions concerning cdh and how it is maintained:
When I go to the packaging info for a specific CDH version, I can check the package version of each component (for instance, for CDH 5.5.5: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_55.html#cdh_555 ). However, I don't understand what the "package version" refers to exactly. For instance, for the component Apache Parquet, the package version is parquet-1.5.0+cdh5.5.5+181. How can I find out exactly what source code is packaged? Does this correspond to a label on a specific repo? If I go to the official Apache Parquet repo, there is no "cdh5.5.5" branch; the closest thing I can find is a tag called "1.5.0" ( https://github.com/apache/parquet-mr/tree/parquet-1.5.0 ). How do the CDH maintainers know what parquet-1.5.0+cdh5.5.5+181 refers to exactly?
Still concerning Apache Parquet: how come even the most recent CDH versions are still using an Apache Parquet whose tag dates from 22 May 2014, i.e. more than three years ago? Why don't they upgrade to a newer version, like 1.6.0? The reason I'm asking is that there is a bug in 1.5.0 that was fixed more than three years ago in Parquet 1.6.0, yet the latest CDH version still ships 1.5.0. Is there a reason why they keep using such an old, buggy version?
Thanks!

You are correct in assuming parquet-1.5.0+cdh5.5.5+181 is closest to Parquet 1.5.0. However, the code will not be identical to Parquet 1.5.0 upstream because:
* CDH enforces cross-component compatibility. Code and applications using parquet-1.5.0 must also work with all the other Hadoop services (HDFS, Hive, Oozie, YARN, Spark, Solr, HBase). Incompatibilities would have to be fixed, so Parquet's code would include those bug fixes.
* CDH enforces major-version compatibility. This means an application written on CDH 5.1 should still work on CDH 5.5 and CDH 5.7, i.e. on all CDH 5.x versions. This also alters the codebase.
The best way to interpret this is to say that parquet-1.5.0+cdh5.5.5+181 will support all features provided in parquet 1.5.0 and will also work with the corresponding Hadoop services packaged with CDH5.5.
Version compatibility is also the reason why CDH Hadoop services run older versions of the related upstream projects. It's much harder to maintain backwards compatibility, especially if APIs change between versions.
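As an illustration, the package version string can be split mechanically into its parts. The layout below (upstream version, then CDH release, then what appears to be a Cloudera patch count) is an assumption based on the naming convention, not something stated in the official docs:

```shell
# Split a CDH package version string into its parts.
# Format assumed: <component>-<upstream-version>+cdh<cdh-version>+<patch-count>
pkg="parquet-1.5.0+cdh5.5.5+181"

upstream="${pkg%%+*}"   # parquet-1.5.0 -> the upstream tag the package is based on
rest="${pkg#*+}"        # cdh5.5.5+181
cdh="${rest%%+*}"       # cdh5.5.5      -> the CDH release it was built for
patches="${rest#*+}"    # 181           -> Cloudera patch count (assumed meaning)

echo "upstream=$upstream cdh=$cdh patches=$patches"
# prints: upstream=parquet-1.5.0 cdh=cdh5.5.5 patches=181
```

Read this way, parquet-1.5.0+cdh5.5.5+181 would mean "upstream tag parquet-1.5.0, plus the patches Cloudera applied for CDH 5.5.5."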

Related

Hadoop versions seem to fall under 0.x, 1.x, and 2.x, but when discussing YARN/MapReduce, every page refers to Hadoop 1 and Hadoop 2.0

On Apache's distribution page, Hadoop seems to exist as 0.x, 1.x, and 2.x. However, when discussing MapReduce/YARN, and when deciding on a version of Hive and HBase, there only seems to be discussion of Hadoop 1 and 2. Why is this? Is 0.x just a beta release?
The 1.x and 2.x versions derive from the 0.x line, which is still being continued (as far as I know). The version numbering is quite confusing. A helpful chart can be found at https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know . Even though it's quite outdated, you can see the relevant branches and what derives from what.
Also check Hadoop release version confusing for more explanation.

Which hadoop version should I choose among 1.x, 2.2 and 0.23

Hello, I am new to Hadoop and pretty confused by the version names. Which one should I use: 1.x (great support and learning resources), 2.2, or 0.23?
I have read that Hadoop is moving to YARN completely from v0.23 (link1). But at the same time it's all over the web that Hadoop v2.0 is moving to YARN (link2), and I can see the YARN configuration files in Hadoop 2.2 itself.
But since 0.23 seems to be the latest version to me, does 2.2 also support YARN? (Refer to link 1; it says Hadoop will support YARN from v0.23.)
And as a beginner, which version should I go for, 1.x or 2.x, from a learning perspective?
Are other technologies that work with Hadoop, like Pig, Hive etc., available with the latest version of Hadoop?
Thanks.
UPDATE
Thank you all for replying.
I ended up using Hadoop 2.2. Most of the well-known tutorials and resources are outdated, but I found one good book to get started with v2.2:
"Hadoop: The Definitive Guide, Third Edition" by Tom White supports Hadoop v2.2.
The source code is given on GitHub: https://github.com/tomwhite/hadoop-book
As mentioned on GitHub, this version of the code has been tested with:
* Hadoop 1.2.1/0.22.0/0.23.x/2.2.0
* Avro 1.5.4
* Pig 0.9.1
* Hive 0.8.0
* HBase 0.90.4/0.94.15
* ZooKeeper 3.4.2
* Sqoop 1.4.0-incubating
* MRUnit 0.8.0-incubating
Hope it helps!
There are a few active release series. The 1.x release series is a continuation of the 0.20 release series. A few weeks after 0.23 was released, the 0.20 branch formerly known as 0.20.205 was renumbered 1.0. There is next to no functional difference between 0.20.205 and 1.0; it is just a renumbering.
The 0.23 release includes several major new features, including a new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. Similarly, the 2.x release series is a continuation of the 0.23 release series, so 2.2 also supports YARN.
According to the Hadoop 2.2 release notes:
1.2.X - current stable version, 1.2 release
2.2.X - current stable 2.x version
0.23.X - similar to 2.X.X but missing NN HA.
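To make the version-to-YARN mapping above concrete, here is a tiny sketch; the rule it encodes (YARN arrived with 0.23 and is carried forward into all 2.x releases, while 1.x continues the pre-YARN 0.20 line) is taken from the explanation above, not from any official tool:

```shell
# Decide whether a given Hadoop release line ships YARN.
# Mapping assumed from the answer above: 0.23.x and 2.x do; 0.20.x/1.x do not.
has_yarn() {
  case "$1" in
    0.23*|2.*) echo yes ;;
    *)         echo no ;;
  esac
}

has_yarn 2.2.0   # -> yes: 2.x is a continuation of the 0.23 line
has_yarn 1.2.1   # -> no: 1.x continues the 0.20 line, which predates YARN
```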
I would suggest starting with the Cloudera distribution since you are just starting to learn. CDH 4.5 includes the YARN feature you are looking for. You can also try the Hortonworks distribution. The advantage of going with these vendors is that you do not need to worry about which versions of components such as Hive and Pig work with your Hadoop installation.
I recommend you start with hadoop-2.2.0, which will give you good knowledge. Industry prefers YARN, and in production only 2.x exists.

Where to put JDBC drivers when using CDH4+Cloudera Manager?

I am trying to get JDBC jars recognized by Sqoop2 (CDH 4.4.0), but no matter where I place them, they do not seem to be recognized.
I have followed advice here and here, and asked a similar question here.
Can someone please provide a definitive answer to this?
I would strongly recommend following the official installation guide for your Hadoop distribution and its associated version. It seems that you are using CDH 4.4.0 but looking at the CDH 4.2.1 installation instructions. Whereas in CDH 4.2.1 the JDBC driver jar files were expected in /usr/lib/sqoop2, since CDH 4.3.0 they are expected in /var/lib/sqoop2 instead (documentation).
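The version-dependent path can be summarized in a small sketch. The mapping comes from the answer above; the driver jar name and the service restart command are example assumptions, so substitute your actual driver and your distribution's service name:

```shell
# Where Sqoop2 looks for JDBC driver jars depends on the CDH release
# (per the Cloudera documentation cited above).
cdh_sqoop2_lib() {
  case "$1" in
    4.0*|4.1*|4.2*) echo /usr/lib/sqoop2 ;;   # CDH 4.2.x and earlier
    *)              echo /var/lib/sqoop2 ;;   # CDH 4.3.0 and later
  esac
}

# For CDH 4.4.0, the driver jar (name here is only an example) goes in:
echo "sudo cp mysql-connector-java-5.1.26-bin.jar $(cdh_sqoop2_lib 4.4.0)/"
# ...then restart the Sqoop2 service so the jar is picked up, e.g.:
# sudo service sqoop2-server restart
```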

Which version of Sqoop work with Hadoop 0.20.2?

Does Sqoop 2 work with Hadoop 0.20.2?
What version of sqoop is best to download?
1.4.2 or 1.99.1 ?
Thanks!)
Sqoop currently has two main branches. Sqoop 1 is the older, fully functional and mature project, supporting Hadoop 0.20, 1.x, 0.23 and 2.0.x. You can download the bits from here. Please make sure that you download the file ending with "_hadoop-0.20"; otherwise you will get weird exceptions.
The second branch is Sqoop 2, which is a redesign of the project; the first cut is available as version 1.99.3. This branch supports only Hadoop 1.x and 2.x and can be downloaded from here. Again, you need to make sure to download the version that matches your Hadoop distribution. There is a chance that the build for Hadoop 1.x will also work on 0.20.2, as those versions are not that different, but nobody has verified that.
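A quick way to apply the "match the tarball to your Hadoop line" advice is to check the filename suffix before unpacking. The filename patterns below are examples of the naming used on the download page, not an exhaustive list:

```shell
# Sanity-check that a downloaded Sqoop 1 tarball matches your Hadoop line.
# Suffix patterns here are examples taken from the download-page naming.
check_tarball() {
  case "$1" in
    *hadoop-0.20*) echo "built for Hadoop 0.20" ;;
    *hadoop-1.0*)  echo "built for Hadoop 1.x" ;;
    *)             echo "check the Hadoop suffix before using" ;;
  esac
}

check_tarball sqoop-1.4.2.bin__hadoop-0.20.tar.gz   # -> built for Hadoop 0.20
```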

Which hadoop version to use?

Both Hadoop in Action and The Definitive Guide have built their foundations on the mapred classes, and most of those classes have been deprecated in 0.20.2. The signatures of the new classes are different. Can anyone tell me about the various changes? E.g. the Partitioner class has been deprecated; how is the new Reducer going to provide its feature? What concept changes happened in 0.20.2?
What should I use? On the Hadoop wiki, I see:
Download
1.0.X - current stable version, 1.0 release
1.1.X - current beta version, 1.1 release
2.X.X - current alpha version
0.23.X - similar to 2.X.X but missing NN HA.
0.22.X - does not include security
0.20.203.X - legacy stable version
0.20.X - legacy version
Does that mean the mapred classes were deprecated and have been reintroduced? Which Hadoop version should I use: 0.20.2 or 1.0.x?
Please check this out, it explains the version control of Hadoop development: http://www.cloudera.com/blog/2012/04/apache-hadoop-versions-looking-ahead-3/
So you can get an idea of why the versioning is quite complex.
P.S.: I'm using v1.0.3 for my system :)
That is an April Fools Day post. :)
But anyone can agree the versions are misleading at best.
