Where to put JDBC drivers when using CDH4+Cloudera Manager? - jdbc

I am trying to get JDBC jars recognized by Sqoop2 (CDH 4.4.0), but no matter where I place them, they do not seem to be recognized.
I have followed the advice here and here, and asked a similar question here.
Can someone please provide a definitive answer to this?

I would strongly recommend following the official installation guide for your Hadoop distribution and its associated version. It seems that you are using CDH 4.4.0 but looking at the CDH 4.2.1 installation instructions. Whereas in CDH 4.2.1 the JDBC driver jar files were expected in /usr/lib/sqoop2, since CDH 4.3.0 they are expected in /var/lib/sqoop2 instead (documentation).
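For reference, here is a minimal sketch of dropping a driver into the CDH 4.3.0+ location. The MySQL driver file name is only an example placeholder, and the Sqoop2 service still needs a restart from Cloudera Manager afterwards:

    #!/usr/bin/env python
    # Sketch: copy a JDBC driver jar to where the CDH >= 4.3.0 Sqoop2 server expects it.
    # The driver jar path/name below is a placeholder; substitute your actual download.
    import os
    import shutil

    SQOOP2_LIB = "/var/lib/sqoop2"   # CDH >= 4.3.0 (CDH 4.2.x used /usr/lib/sqoop2)
    driver_jar = "/tmp/mysql-connector-java-5.1.31-bin.jar"

    dest = os.path.join(SQOOP2_LIB, os.path.basename(driver_jar))
    shutil.copy(driver_jar, dest)
    os.chmod(dest, 0o644)            # make it readable by the sqoop2 service user
    print("Copied %s -> %s; now restart Sqoop2 in Cloudera Manager" % (driver_jar, dest))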

Related

How do we install Apache BigTop with Ambari?

I am trying to find out how to deploy a Hadoop cluster with Ambari using Apache Bigtop.
According to the latest release, Bigtop 1.5:
https://blogs.apache.org/bigtop/
my understanding is that a Bigtop Mpack was added as a new feature, which enables users to deploy Bigtop components via Apache Ambari.
I am able to install the Bigtop components via the command line, but I cannot find any documentation on how to install these Bigtop Hadoop components via Ambari.
Can someone please point me to documentation that explains how to install the various Hadoop components (Bigtop packages) via Ambari?
Thanks,
I'm from the Bigtop community. I don't have a comprehensive answer, but the Bigtop user mailing list recently had a discussion with several technical details that can answer your question:
https://lists.apache.org/thread.html/r8c5d8dfdee9b7d72164504ff2f2ea641ce39aa02364d60917eaa9fa5%40%3Cuser.bigtop.apache.org%3E
OTOH, you are always welcome to join the mailing list and ask questions. Our community is active and happy to answer questions.
1. Build a repo of Bigtop.
2. To install that repo with Ambari, you have to register the stack/version. You will need to create a version file; I found an example of one here (a sketch of registering such a file through Ambari's REST API follows below).
3. Complete the installation like you would with a normal build.
This is highly theoretical (..haven't done this before..). I have worked with a Bigtop Mpack before that took care of some of this work, but it's not production ready yet and works with an old version of Ambari, not the newest (I was able to install/stop/start HDFS/Hive). The instructions above should work with any version of Ambari.
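If you go the version-file route, one way to register the file is through Ambari's REST API once it is hosted somewhere reachable. A minimal sketch, assuming an Ambari server on the default port with admin/admin credentials, a placeholder URL for the version definition file, and the version_definitions endpoint as described in Ambari's VDF documentation (all of these are assumptions to adapt to your setup):

    #!/usr/bin/env python3
    # Sketch: register a hosted version definition file (VDF) with the Ambari REST API.
    # Host names, credentials and the VDF URL are placeholders for your environment.
    import base64
    import json
    import urllib.request

    AMBARI = "http://ambari-host:8080"
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")
    VDF_URL = "http://repo-host/repo/bigtop-1.5.0.xml"   # the version file you created

    body = json.dumps({"VersionDefinition": {"version_url": VDF_URL}}).encode("utf-8")
    req = urllib.request.Request(AMBARI + "/api/v1/version_definitions",
                                 data=body, method="POST")
    req.add_header("Authorization", "Basic " + AUTH)
    req.add_header("X-Requested-By", "ambari")           # Ambari requires this header on POSTs

    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode("utf-8"))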
I have been able to test Matt Andruff's theory with a VM. Here is my process and where I stopped:
1. Built a repo of Apache Bigtop 1.5.0.
2. Built Bigtop using Gradle (gradlew).
3. Installed Apache Ambari 2.6.1 on my system.
4. Enabled the BigInsights build version XML file and modified the package version numbers to match my Bigtop build.
Note: You can also build your own version file if you want, as Matt mentioned.
5. Set up a webserver to host the package repo (see the sketch after this list).
6. Pointed the repo in the XML version file at my local webserver for packages.
From there you can complete the installation of your packages as you would normally.
I have only done this with a single VM thus far and will be trying to spin up a small cluster using AWS in the coming weeks.
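For the webserver step, anything that can serve the built repo directory over HTTP will do; here is a minimal sketch using Python's standard library (Python 3.7+), with the repo path as a placeholder for wherever your Bigtop build put the generated packages and repodata:

    #!/usr/bin/env python3
    # Sketch: serve the directory containing the Bigtop yum/apt repo over HTTP so the
    # version file's repo baseurl can point at it. REPO_DIR is a placeholder.
    import functools
    import http.server

    REPO_DIR = "/opt/bigtop/output"   # wherever your build's packages/repodata ended up
    PORT = 8000

    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=REPO_DIR)
    with http.server.ThreadingHTTPServer(("0.0.0.0", PORT), handler) as httpd:
        print("Serving %s at http://0.0.0.0:%d/" % (REPO_DIR, PORT))
        httpd.serve_forever()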

Building Ambari HDP stacks from source

I am trying to set up Ambari + HDP from source (since Cloudera closed off the Hortonworks package repos). Can anyone share experience or a howto on this? Documentation is very scarce in this regard.
#alfheim the documentation is here:
https://cwiki.apache.org/confluence/display/AMBARI/Installation+Guide+for+Ambari+2.7.5
And a post with all the details:
Ambari 2.7.5 installation failure on CentOS 7
Be sure to get the correct versions of npm, Maven, Node, etc. There are some manual changes you may need to make inside the source files. You can find quite a few posts solving those issues here under the ambari tag; go back to page 2 or 3 to find the most recent posts on building Ambari from source, or just search for any errors you hit during the build.
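As a quick pre-flight check before kicking off the build, something like this confirms the toolchain is installed and prints the versions found; the actual required versions should come from the Ambari build documentation for your release, not from this sketch:

    #!/usr/bin/env python3
    # Sketch: print the versions of the tools the Ambari source build depends on.
    # Compare the output against the versions listed in the Ambari build docs.
    import subprocess

    CHECKS = {
        "mvn": ["mvn", "-version"],
        "node": ["node", "--version"],
        "npm": ["npm", "--version"],
        "python": ["python", "--version"],
    }

    for name, cmd in CHECKS.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True)
            first_line = (out.stdout or out.stderr).splitlines()[0]
            print("%-7s %s" % (name, first_line))
        except FileNotFoundError:
            print("%-7s NOT FOUND -- install it before building" % name)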

How are CDH packages defined?

I have questions concerning CDH and how it is maintained:
When I go to the packaging info for a specific CDH version, I can check the package version of each component (for instance for CDH 5.5.5: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_55.html#cdh_555). However, I don't understand what the "package version" refers to exactly. For instance, for the component Apache Parquet, the "package version" is parquet-1.5.0+cdh5.5.5+181. How can I find out exactly what source code is packaged? Does this correspond to a label on a specific repo? If I go to the "official" Apache Parquet repo, there is no "cdh5.5.5" branch; the closest thing I can find is a tag called "1.5.0" (https://github.com/apache/parquet-mr/tree/parquet-1.5.0). How do the people from CDH know what parquet-1.5.0+cdh5.5.5+181 exactly refers to?
Still concerning Apache Parquet, how come even the most recent CDH versions are still using an Apache Parquet whose tag dates from 22 May 2014, i.e. more than 3 years ago? Why don't they upgrade to a newer version, like 1.6.0? The reason I'm asking is that there is a bug in 1.5.0 that was fixed more than 3 years ago in Parquet 1.6.0, yet the latest CDH version is still using 1.5.0. Is there a reason why they keep using a really old, buggy version?
thanks !
You are correct in assuming parquet-1.5.0+cdh5.5.5+181 is closest to Parquet 1.5.0. However, the code will not be identical to Parquet 1.5.0 upstream because:
1. CDH enforces cross-component compatibility. Code and applications using parquet-1.5.0 must also work with all the other Hadoop services (HDFS, Hive, Oozie, YARN, Spark, Solr, HBase). Incompatibilities would have to be fixed, so Parquet's code would include those bug fixes.
2. CDH enforces major version compatibility. This means an application written on CDH 5.1 should still work on CDH 5.5 and CDH 5.7, i.e. all CDH 5.x versions. This also would alter the codebase.
The best way to interpret this is to say that parquet-1.5.0+cdh5.5.5+181 will support all features provided in parquet 1.5.0 and will also work with the corresponding Hadoop services packaged with CDH5.5.
Version compatibility is also the reason why CDH Hadoop service versions run older versions of the related upstream projects. It's much harder to maintain backwards compatibility especially if APIs change between versions.
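As a rough way to read such a string, it splits into the upstream project and version, the CDH release it was built for, and a trailing Cloudera-specific build/patch number; treat that reading as an interpretation rather than a specification, since the exact meaning of the last number is Cloudera's:

    #!/usr/bin/env python3
    # Sketch: split a CDH package version string such as "parquet-1.5.0+cdh5.5.5+181"
    # into upstream project/version, CDH release, and the Cloudera build/patch number.
    import re

    def parse_cdh_version(s):
        m = re.match(r"(?P<project>[a-z-]+)-(?P<upstream>[\d.]+)"
                     r"\+cdh(?P<cdh>[\d.]+)\+(?P<build>\d+)$", s)
        if not m:
            raise ValueError("not a CDH package version: %r" % s)
        return m.groupdict()

    print(parse_cdh_version("parquet-1.5.0+cdh5.5.5+181"))
    # {'project': 'parquet', 'upstream': '1.5.0', 'cdh': '5.5.5', 'build': '181'}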

Hadoop versions seem to fall under 0.x, 1.x, and 2.x, but when discussing YARN/MapReduce, every page refers to Hadoop 1 and Hadoop 2.0

On Apache's distribution page, Hadoop seems to exist in 0.x, 1.x, and 2.x versions. However, when discussing MapReduce/YARN and deciding on a version of Hive and HBase, there only seems to be discussion of Hadoop 1 and 2. Why is this? Is 0.x just a beta release?
The 1.x and 2.x versions derive from the 0.x line, which is still being continued (as far as I know). The version numbering is quite confusing. A helpful chart can be found at https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know. Even though it's quite outdated, you can see the relevant branches and what derives from what.
Also check Hadoop release version confusing for more explanation.

Getting started with Hadoop and Eclipse

I'm following a couple of tutorials for setting up Hadoop with Eclipse.
This one is from Cloudera: http://v-lad.org/Tutorials/Hadoop/05%20-%20Setup%20SSHD.html
But this one seems to focus on checking out the latest code from Hadoop and tweaking it.
That seems rare though; usually the latest release of Hadoop will suffice for most users' needs?
Whereas this tutorial seems to focus on setting up and running Hadoop:
http://v-lad.org/Tutorials/Hadoop/05%20-%20Setup%20SSHD.html
I just want to run some basic MapReduce jobs to get started. I don't think I should have to use the latest code from Hadoop, as Cloudera suggests in the first link above, just to get started?
Here is a blog entry and screencast on developing/debugging applications in Eclipse. The procedure works with most versions of Hadoop.
You may try this tutorial on installing the Hadoop plugin for Eclipse: http://bigsonata.com/?p=168
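If the immediate goal is just to run a basic MapReduce job, one low-setup option is Hadoop Streaming with a small Python mapper/reducer. The sketch below is a word count; the streaming jar location in the comment varies by distribution and release, so treat that command as a placeholder:

    #!/usr/bin/env python3
    # Sketch: minimal word count for Hadoop Streaming. Run this same file as the mapper
    # ("map" argument) and the reducer ("reduce" argument), e.g. something like:
    #   hadoop jar /path/to/hadoop-streaming-*.jar \
    #       -files wordcount.py \
    #       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
    #       -input /user/me/input -output /user/me/output
    import sys

    def do_map():
        # Emit "word<TAB>1" for every word on stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)

    def do_reduce():
        # Streaming sorts by key, so equal words arrive consecutively; sum their counts.
        current, count = None, 0
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        do_map() if sys.argv[1:2] == ["map"] else do_reduce()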
