Conflicting jars while using Unirest on CDH - hadoop

I'm trying to use Unirest to send a POST request from a MapReduce job on a Cloudera Hadoop 5.2.1 cluster.
One of Unirest's dependencies is httpcore-4.3.3.jar. The CDH package puts httpcore-4.2.5.jar on the classpath. While trying to run my code, I got a "ClassNotFound" exception.
I added a line to my code to check where the conflicting class was being loaded from, and the answer was troubling: /opt/cloudera/parcels/CDH/jars/httpcore-4.2.5.jar.
I've looked everywhere online and tried everything I found. Needless to say, nothing seems to work.
I tried setting the HADOOP_CLASSPATH environment variable, I tried setting HADOOP_USER_CLASSPATH_FIRST, and I tried using the -libjars parameter with the hadoop jar command.
Anyone have any idea how to solve this?
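A minimal sketch of one commonly suggested approach, assuming Hadoop 2.x / CDH 5.x (the property names are the standard Hadoop ones; the job name and MyDriver class below are placeholders, not from the question): ask the MapReduce framework to prefer the job's own jars over the cluster's copies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Prefer jars shipped with the job (e.g. via -libjars) over the framework's copies
conf.setBoolean("mapreduce.job.user.classpath.first", true);
// Optionally isolate the job's classes from the framework classpath entirely
conf.setBoolean("mapreduce.job.classloader", true);
Job job = Job.getInstance(conf, "unirest-post-job"); // job name is illustrative
job.setJarByClass(MyDriver.class);                   // MyDriver is a placeholder driver class

With these set, the httpcore-4.3.3.jar shipped with the job should take precedence over the 4.2.5 copy in the CDH parcel.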

Related

Spark/Spring validation-api dependencies conflict

I am running a Spring/Spark application and I am facing this problem:
The following method did not exist:
javax.validation.BootstrapConfiguration.getClockProviderClassName()Ljava/lang/String;
The method's class, javax.validation.BootstrapConfiguration, is available from the following locations:
***validation-api-1.1.0.Final.jar!/javax/validation/BootstrapConfiguration.class
***/BOOT-INF/lib/validation-api-2.0.1.Final.jar!/javax/validation/BootstrapConfiguration.class
It was loaded from the following location:
file:/usr/hdp/2.6.3.0-235/spark2/jars/validation-api-1.1.0.Final.jar
How do I make Spark read my dependency first and only then look at the system lib?
I tried specifying it in Oozie, and I tried specifying it in spark-submit. Nothing has worked so far.
I had encountered a similar situation. I finally ended up doing the following: I copied the required jars to a directory and used the extraClassPath option.
spark-submit --conf spark.driver.extraClassPath="C:\sparkjars\validation-api-2.0.1.Final.jar;C:\sparkjars\gson-2.8.6.jar" myspringbootapp.jar
From the documentation: spark.driver.extraClassPath is "Extra classpath entries to prepend to the classpath of the driver."
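Another option that may be worth trying (a hedged suggestion, not something the answer above confirms) is Spark's experimental user-classpath-first settings, which give jars added with --jars or spark.jars precedence over the cluster's jars; note that spark.driver.userClassPathFirst applies in cluster mode only. Mirroring the command above:
spark-submit --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --jars C:\sparkjars\validation-api-2.0.1.Final.jar,C:\sparkjars\gson-2.8.6.jar myspringbootapp.jar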

Running Spark from a local IDE

I've been spending some time banging my head against trying to run a complex Spark application locally in order to test more quickly (without having to package it and deploy it to a cluster).
Some context:
This Spark application interfaces with the DataStax Enterprise version of Cassandra and its distributed file system, so it needs some explicit jars to be provided (they are not available in Maven)
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf
When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System.property values in my test case before starting the Spark session so that Spark would pick them up, so the configuration settings can be ignored.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way I can emulate the --jars behavior of spark-submit and somehow set up my Spark session with these jars? Note: I am using the following command in my code to start a Spark session:
SparkSession.builder().config(conf).getOrCreate()
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version that my code is compiling with using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I get the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and is using all defaults?
You can make use of .config(key, value) while building the SparkSession by passing "spark.jars" as the key and a comma-separated list of paths to the jars, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar, /path/jar2.jar").config(conf).getOrCreate()

Mapreduce with third party API

I am using some third-party APIs in my MapReduce code. I tried a few things, but they didn't work out for me. (I am using Cloudera 5.9.)
1) I tried the fat-jar approach (building a jar with all dependencies included), and my code worked, but this is not a good way to do it (my jar is very large).
So I thought of separating out the third-party jars and sharing them using the distributed cache.
2) I tried some options like the ones below.
i) I used the hadoop jar <jar_file_name> <main_class_name> <arguments> -libjars <list of jar files> command. --> Didn't work; got ClassNotFoundException.
i.a) I kept the third-party jars in a local folder and used that path in the -libjars option. --> Didn't work; got ClassNotFoundException.
i.b) I kept the third-party jars in HDFS and used that path in the -libjars option. --> Didn't work; got ClassNotFoundException.
ii) I updated the code to use DistributedCache.addFileToClassPath(). --> Didn't work; got ClassNotFoundException.
iii) I updated the code to use job.addCacheFile(). --> Didn't work; got ClassNotFoundException.
iv) I updated /etc/hadoop/conf/hadoop-env.sh, added a new line export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/mani/test/lib/*, and then restarted the cluster, but it still didn't work.
v) I tried running the command like hadoop jar <jar_file_name> <main_class_name> <arguments> -D mapred.child.env="LD_LIBRARY_PATH=path/to/my/lib/*". --> Didn't work; got the same exception.
vi) I also tried -Dyarn.application.classpath, but still the same issue.
I have gone through some forums, Cloudera blog posts, and some other websites. Everyone tells the same story, but I am not able to get it to work even after following their posts correctly. Am I missing something here?
Can someone please help me find a solution?
Thanks,
Manindar.
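One common cause of -libjars attempts failing, offered here only as a general note since the question does not confirm it, is that -libjars is a generic option consumed by GenericOptionsParser, so it is honored only when the driver goes through ToolRunner, and the generic options must come before the application-specific arguments. A minimal sketch of such a driver, with placeholder class and job names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool { // class name is a placeholder
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "third-party-api-job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper/reducer classes, input/output formats and paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser, which is what actually consumes -libjars
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}

It would then be launched with something like hadoop jar myjob.jar MyJobDriver -libjars /home/mani/test/lib/a.jar,/home/mani/test/lib/b.jar <arguments> (the jar names a.jar and b.jar are placeholders).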

Shouldn't Oozie/Sqoop jar location be configured during package installation?

I'm using HDP 2.4 in CentOS 6.7.
I have created the cluster with Ambari, so Oozie was installed and configured by Ambari.
I got two errors while running Oozie/Sqoop related to jar file location. The first concerned postgresql-jdbc.jar, since the Sqoop job is incrementally importing from Postgres. I added the postgresql-jdbc.jar file to HDFS and pointed to it in workflow.xml:
<file>/user/hdfs/sqoop/postgresql-jdbc.jar</file>
It solved the problem. But the second error seems to concern kite-data-mapreduce.jar. However, doing the same for this file:
<file>/user/hdfs/sqoop/kite-data-mapreduce.jar</file>
does not seem to solve the problem:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], main() threw exception, org/kitesdk/data/DatasetNotFoundException
java.lang.NoClassDefFoundError: org/kitesdk/data/DatasetNotFoundException
It seems strange that this is not automatically configured by Ambari and that we have to copy jar files into HDFS as we start getting errors.
Is this the correct methodology or did I miss some configuration step?
This is happening due to missing jars in the classpath. I would suggest setting the property oozie.use.system.libpath=true in the job.properties file; all the Sqoop-related jars will then be added to the classpath automatically from /user/oozie/share/lib/lib_<timestamp>/sqoop/*.jar. After that, add only the custom jars you need to the lib directory of the workflow application path.
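A minimal job.properties sketch (the host names and application path below are illustrative; only the oozie.use.system.libpath line comes from the answer above):

nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hdfs/sqoop-wf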

How to find jar dependencies when running Apache Pig script?

I am having some difficulties running a simple Pig script to import data into HBase using HBaseStorage.
The error I have encountered is given by:
Caused by: <file demo.pig, line 14, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.backend.hadoop.hbase.HBaseStorage' with arguments '[rdf:predicate rdf:object]'
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.initScan(HBaseStorage.java:427)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:368)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:239)
... 29 more
According to other questions and threads, the main response/answer to this issue is to register the appropriate jars required for the HBaseStorage references. What I am stumped by is how I am supposed to identify the required jar for a given Pig function.
I even tried to open the various jar files under the hbase and pig folders to ensure the appropriate classes are registered in the pig script.
For example, since the java.lang.NoSuchMethodError was caused by org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V,
I specifically registered the jar that contains org.apache.hadoop.hbase.client.Scan, to no avail.
Pig's documentation does not provide any obvious links and help that I can refer to.
I am using Hadoop 2.7.0, HBase 1.0.1.1, and Pig 0.15.0.
If you need any other clarification, feel free to ask me again. Would really appreciate it if someone could help me out with this issue.
Also, is it better to install Hadoop and the related software from scratch, or is it better to get one of the available Hadoop bundles?
There is something wrong with the released jar hbase-client-1.0.1.1.jar.
You can test it with this code; the error will show up:
Scan scan = new Scan();
scan.setCacheBlocks(true);
I've tried other setter methods, like setCaching, and they throw the same error, even though when I checked the source code those methods exist. Maybe you could compile hbase-client-1.0.1.1.jar manually; I'm still looking for a better solution...
============
Update to the above: I found that the root cause is an incompatibility between hbase-client-1.0.1.1.jar and older versions.
https://issues.apache.org/jira/browse/HBASE-10841
https://issues.apache.org/jira/browse/HBASE-10460
The return type of the setter methods changed, so jars compiled against the old version won't work with the current one.
As for your question, you can modify the pig script $PIG_HOME/bin/pig and set debug=true; it will then print what it is running.
Did you register the required jars?
The most important jars are hbase, zookeeper, and guava.
I solved a similar kind of issue by registering the zookeeper jar in my Pig script.
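For illustration, registering the jars at the top of the Pig script looks like this (the paths, versions, alias, and table name below are placeholders; adjust them to your installation):

-- register the HBase-related jars before using HBaseStorage
REGISTER /usr/local/hbase/lib/hbase-client-1.0.1.1.jar;
REGISTER /usr/local/hbase/lib/hbase-common-1.0.1.1.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.6.jar;
REGISTER /usr/local/hbase/lib/guava-12.0.1.jar;
-- then HBaseStorage can be resolved, e.g. when storing the data
STORE triples INTO 'hbase://rdf_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('rdf:predicate rdf:object');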
