Any insights on how we can use a thin jar to submit Spark applications?
The scenario is that the application throws an exception when a specific dependency is missing from the project's classpath, or when a dependency is specific to one distribution (Cloudera or Hortonworks) and the appropriate version of the jars is not used.
How can we avoid such scenarios?
The only thin jar you can make is one that doesn't compile the Spark core libraries into the JAR. For example, Spark SQL and Spark Streaming don't need to be included, but unless Spark was compiled with Hive support during installation, you'll still need to bundle that one.
You'll need to contact your Hadoop cluster administrators to know what version of Spark is available, how it was built, and what libraries are available in $SPARK_HOME out of the box.
In my experience, I've never run into a dependency that is specific to HDP or CDH; I've submitted a Spark 2.3 job to YARN just fine even though neither vendor officially supports that version. The only thing you need to match is the Spark version against your code, not necessarily the Hadoop/YARN/Hive versions. Kafka, Cassandra, and other connectors are all extra anyway, and they can't be in a thin jar.
Related
48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add HBase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce. I understand what the two approaches above mean, but I don't know how to set up the recommended one. Could anyone tell me how to configure it in the recommended way?
As the docs show, before running hadoop jar you can export HADOOP_CLASSPATH=$(hbase classpath), and you can pass hadoop jar ... -libjars [...]
The true recommended way, though, would be to bundle your HBase dependencies into an uber JAR as part of your MapReduce application.
The only caveat is that you need to ensure that your project uses the same (or a compatible) hbase-mapreduce client version as the server.
That way you don't need any extra configuration, except maybe specifying hbase-site.xml.
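For reference, the "let HBase add its dependency jars" wording in the quoted docs corresponds to what TableMapReduceUtil does when you set the job up in code. Here is a minimal sketch; the table name and job name are placeholders, and it assumes hbase-site.xml is on the job's classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class HBaseMapReduceSetup {

    public static Job createJob() throws Exception {
        // Reads hbase-site.xml from the classpath; nothing in hadoop-env.sh
        // or $HADOOP_HOME/lib has to change.
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-mr-example");
        job.setJarByClass(HBaseMapReduceSetup.class);

        // initTableMapperJob wires up the scan and mapper, and by default also
        // ships the HBase dependency jars with the job via the distributed cache.
        TableMapReduceUtil.initTableMapperJob(
                "my_table",                      // placeholder table name
                new Scan(),
                IdentityTableMapper.class,
                ImmutableBytesWritable.class,
                Result.class,
                job);

        // (If you configure the job entirely by hand instead, the explicit call
        // is TableMapReduceUtil.addDependencyJars(job).)
        return job;
    }
}

Because the jars travel with the job through the distributed cache, this works without touching the Hadoop installation itself.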
So, I would like to create an Apache Spark integration in my Spring application by following this guide provided by Spring (http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html). Now I have a few questions, as it seems that Spark 2.0.1 does not include the spark-assembly jar.
What are my options for proceeding, given that the integration seems to depend on that jar?
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Is there a way to get the jar with Spark 2.0.1?
Yes, you are right: Spark 2.0.1 does not ship an uber jar with itself like 1.6.x and below did (e.g. spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar).
Spark 2.0.0+ (see spark-release-2-0-0.html) no longer requires a fat assembly uber jar. However, if you compare the contents of spark-assembly-1.6.2-hadoop2.6.0 with the jars in spark-2.0.0-bin-hadoop2.7\jars\, you will see almost the same content: the same classes, packages, etc.
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Personally, I don't think so. There could be backward-compatibility problems, and it is odd to rely on something that was removed in the latest version.
You are right that SparkYarnTasklet needs the assembly jar, because there is some validation in afterPropertiesSet:
@Override
public void afterPropertiesSet() throws Exception {
    Assert.hasText(sparkAssemblyJar, "sparkAssemblyJar property was not set. " +
            "You must specify the path for the spark-assembly jar file. " +
            "It can either be a local file or stored in HDFS using an 'hdfs://' prefix.");
But sparkAssemblyJar is only used in sparkConf.set("spark.yarn.jar", sparkAssemblyJar);
So when you use SparkYarnTasklet, the program will probably fail on this validation (you can try to extend SparkYarnTasklet and override afterPropertiesSet without the validation).
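A rough sketch of that workaround is below. The package of SparkYarnTasklet, and whether afterPropertiesSet does anything beyond this assertion in your spring-hadoop version, are assumptions you would need to verify:

// The exact package of SparkYarnTasklet depends on the spring-hadoop version
// you use; adjust this import accordingly.
import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

public class NoAssemblyCheckSparkYarnTasklet extends SparkYarnTasklet {

    @Override
    public void afterPropertiesSet() throws Exception {
        // Deliberately skip the sparkAssemblyJar assertion, since Spark 2.0+
        // no longer ships a single spark-assembly jar. If afterPropertiesSet
        // does more than this check in your version, replicate the rest here.
    }
}

You would then declare this subclass as the tasklet bean instead of SparkYarnTasklet itself.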
And the documentation about spark.yarn.jars and spark.yarn.archive says:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
So take a look at the properties spark.yarn.jars and spark.yarn.archive.
Now compare spark.yarn.jar in 1.6.x with its counterpart in 2.0.0+.
spark.yarn.jar in 1.6.2:
The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to hdfs:///some/path.
spark.yarn.jars in 2.0.1:
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
But this seems to require listing all the jars one by one.
In 2.0.0+ there is also spark.yarn.archive, which can be used instead of spark.yarn.jars and avoids passing jars one by one: you create an archive with all the jars in its root directory.
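As a sketch of how this could look from code, mirroring the sparkConf.set("spark.yarn.jar", ...) line shown earlier (the HDFS paths are placeholders for jars, or an archive of them, that you have uploaded yourself):

import org.apache.spark.SparkConf;

public class Spark2YarnConf {

    public static SparkConf create() {
        SparkConf conf = new SparkConf();

        // Either list the jars you have uploaded to HDFS (globs are allowed)...
        conf.set("spark.yarn.jars", "hdfs:///apps/spark/2.0.1/jars/*.jar");

        // ...or point at a single archive whose root contains all the jars.
        // In practice you would set one of the two, not both.
        // conf.set("spark.yarn.archive", "hdfs:///apps/spark/2.0.1/spark-libs.zip");

        return conf;
    }
}

Both properties can also be set in spark-defaults.conf or via --conf on spark-submit rather than in code.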
I think spring-hadoop will catch up with the 2.0.0+ changes in a few weeks, but as a quick fix I would probably try to override SparkYarnTasklet and adapt it for 2.0.1 myself, specifically the execute and afterPropertiesSet methods.
Is there any pattern for deploying applications (jar files) to a Hadoop cluster? I am not talking about MapReduce jobs, but about deploying applications for Spark, Flume, etc.
Within the Hadoop ecosystem, deployment alone is not sufficient: you need to restart services, deploy configurations (e.g. via Ambari), and so forth.
I have not found any specific tools. Is my assumption correct that you use standard automation tools like Maven/Jenkins and handle the missing parts yourself?
Just wondering if I have overlooked something; I do not want to reinvent the wheel ;)
If you are managing the Hadoop ecosystem, you can use Ambari or Cloudera Manager, but you will need to stop and restart their services for configuration and library changes. If the ecosystem is managed outside of these, you have the option of managing the jars with external tools like Puppet and Salt. Currently, we use Salt because of its push/pull abilities.
If you are talking about applications, like jobs running on Spark, you will just provide the Hadoop URL in the file path. For example:
spark-submit --class my.dev.org.SparkDriver --properties-file mySparkProps.conf wordcount-shaded.jar hdfs://servername/input/file/sample.txt hdfs://servername/output/sparkresults
For applications that have dependencies on third-party jar files, you have the option of shading the job's jar file to prevent the libraries of different applications from interfering with each other. The downside is that the application jar file gets big. I use Maven, so I added the maven-shade-plugin and use the default scope (compile) for the dependencies.
We are using Cloudera CDH 4.5.0 for HBase, and Storm 0.9.3 uses hbase-client. Unfortunately, it seems Cloudera did not provide an hbase-client Maven artifact, and I cannot figure out how to satisfy the dependency on org.apache.hadoop.hbase.security.UserProvider. According to the Maven search site, it can be provided by either hbase-client or hbase-common. Can someone tell me if there is a comparable version of either of these that I can use with CDH 4.5.0?
Are you using CDH 4.x or CDH 5.x? The hbase-client/hbase-common jars exist only in CDH 5 (HBase 0.96+). The CDH 4 release has only one big HBase jar containing everything. Also, UserProvider doesn't seem to be present in 4.5.0, but it is present from 4.6.x on.
hbase-client depends on hbase-common, so in general you need both if you want to use the client.
(If you are looking only for the UserProvider class, it is in hbase-common.)
I need to put some files into HDFS from my client application. I am not planning to submit a job to Hadoop; I just need to drop something into HDFS.
A Maven dependency on hadoop-core brings in a lot of stuff like jersey-core, etc., which I don't need at all.
Is there any simple client library to work with HDFS without getting a full stack of hadoop dependencies? What is the minimal set of maven dependencies I can use?
Is webhdfs the only option?
They introduced hadoop-client, which is much better than hadoop-core as a client library.
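For example, with hadoop-client as the only Hadoop dependency, a minimal sketch for dropping a local file into HDFS could look like this (the NameNode URI and the paths are placeholders; with core-site.xml on the classpath you could call FileSystem.get(conf) instead):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDrop {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's NameNode (placeholder host/port).
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)) {
            // Copy a local file into HDFS; no MapReduce job is involved.
            fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                    new Path("/data/incoming/report.csv"));
        }
    }
}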