Send arbitrary local jars to YARN container classpath - hadoop

I'm using Apache Twill (v0.10) to build a YARN application. I've observed that jars which are not referenced by my application code are not picked up and sent to the containers' classpath. I checked the YarnTwillPreparer class to see how the dependencies are decided, but I'm still not clear on what I need to do to force some additional jars to be sent to each of the YARN containers.
I think there must be a simple and elegant way to achieve that. A precise code snippet would be most welcome, but any pointer would also help.
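Something along these lines is what I am hoping for. This is only a guess based on the TwillPreparer interface; I am not sure withDependencies/withResources/withClassPaths are the right calls, and twillRunner, MyTwillApplication, SomeUnreferencedClass and the jar path are placeholders from my own setup:

TwillController controller = twillRunner.prepare(new MyTwillApplication())
    // ship the jars of classes my code does not reference directly
    .withDependencies(SomeUnreferencedClass.class)
    // or: localize an arbitrary local jar and put it on the container classpath
    .withResources(java.net.URI.create("file:///opt/libs/extra-lib.jar"))
    .withClassPaths("extra-lib.jar")
    .start();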

Related

Spring boot and javascript node_modules

I'm currently building a Spring Boot application that also uses some JavaScript. I use yarn as a package manager to manage the different JS libraries.
Now I wonder how I should include these resources in my Spring Boot project. Simply including the whole node_modules folder as a resource seems like overkill to me, since it doesn't necessarily contain only the required sources (to me it is more like my local Maven repository path). How do I identify which JavaScript resources should end up in my jar, so that I can also reference them in my Thymeleaf HTML templates?
I already found the frontend-maven-plugin (https://github.com/eirslett/frontend-maven-plugin), which helps me install all my yarn dependencies during the build, but as far as I can see it doesn't take care of the build process itself.
Thanks for your help!
Perhaps you should consider using webpack or some other javascript bundler/task runner to bundle your javascript and required dependencies into a single file. Then you can simply include that bundled file in your jar. For example: http://justincalleja.com/2016/04/17/serving-a-webpack-bundle-in-spring-boot/

Apache Spark 2.0.1 and Spring Integration

So, I would like to create an Apache Spark integration in my Spring application by following the guide provided by Spring (http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html). Now I have a few questions, as it seems that Spark 2.0.1 does not include the spark-assembly jar.
What are my options for proceeding with this, given that the integration seems to depend on that jar?
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Is there a way to get the jar with Spark 2.0.1?
Yes, you are right: Spark 2.0.1 does not ship an uber jar the way 1.6.x and below did (e.g. spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar).
Spark 2.0.0+ (see spark-release-2-0-0.html) no longer requires a fat assembly uber jar. However, if you compare the contents of spark-assembly-1.6.2-hadoop2.6.0 with the jars in spark-2.0.0-bin-hadoop2.7\jars\, you will see almost the same content: the same classes, packages, etc.
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Personally, I don't think so. There could be problems with backward compatibility, and it is odd to rely on something that was removed in the latest version.
You are right that SparkYarnTasklet needs the assembly jar, because there is some validation in afterPropertiesSet():
@Override
public void afterPropertiesSet() throws Exception {
    Assert.hasText(sparkAssemblyJar, "sparkAssemblyJar property was not set. " +
            "You must specify the path for the spark-assembly jar file. " +
            "It can either be a local file or stored in HDFS using an 'hdfs://' prefix.");
    // ...
}
However, this sparkAssemblyJar is only used in sparkConf.set("spark.yarn.jar", sparkAssemblyJar);
So when you use SparkYarnTasklet, the program will most likely fail on that validation. (You can try to extend SparkYarnTasklet and override afterPropertiesSet without the validation, as sketched below.)
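A minimal sketch of that workaround; the import path for SparkYarnTasklet is my guess at the spring-hadoop package and may differ in your version:

import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

public class Spark2YarnTasklet extends SparkYarnTasklet {
    @Override
    public void afterPropertiesSet() throws Exception {
        // Deliberately skip super.afterPropertiesSet() so the sparkAssemblyJar
        // check above is not executed. Any other setup done by the parent class
        // would have to be replicated here if it is needed.
    }
}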
And here is the documentation about the Spark jars on YARN:
To make Spark runtime jars accessible from YARN side, you can specify
spark.yarn.archive or spark.yarn.jars. For details please refer to
Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is
specified, Spark will create a zip file with all jars under
$SPARK_HOME/jars and upload it to the distributed cache.
So take a look at the properties spark.yarn.jars and spark.yarn.archive, and compare what the jar property means in 1.6.x versus 2.0.0+.
spark.yarn.jar in 1.6.2:
The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to hdfs:///some/path.
spark.yarn.jars in 2.0.1:
List of libraries containing Spark code to distribute to YARN
containers. By default, Spark on YARN will use Spark jars installed
locally, but the Spark jars can also be in a world-readable location
on HDFS. This allows YARN to cache it on nodes so that it doesn't need
to be distributed each time an application runs. To point to jars on
HDFS, for example, set this configuration to hdfs:///some/path. Globs
are allowed.
but this seems to require listing the jars one by one.
In 2.0.0+, however, there is also spark.yarn.archive, which can be used instead of spark.yarn.jars and avoids passing jars one by one: you create an archive with all the jars in its root directory.
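To illustrate the two properties in code (the HDFS paths are just placeholders; which property you set depends on how you staged the Spark jars):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // point YARN at the individual Spark jars; globs are allowed
        .set("spark.yarn.jars", "hdfs:///spark/jars/*.jar");
// ... or, alternatively, ship one archive that contains all jars in its root:
// conf.set("spark.yarn.archive", "hdfs:///spark/spark-libs.zip");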
I think spring-hadoop will catch up with the 2.0.0+ changes in a few weeks, but as a quick fix I would probably try to override SparkYarnTasklet and adapt it to 2.0.1; from what I saw, the execute and afterPropertiesSet methods are exactly the ones affected.

Hadoop - submit a job with lots of dependencies (jar files)

I want to write some sort of "bootstrap" class, which will watch MQ for incoming messages and submit map/reduce jobs to Hadoop. These jobs use some external libraries heavily. For the moment I have an implementation of these jobs, packaged as a ZIP file with bin, lib and log folders (I'm using the maven-assembly-plugin to tie things together).
Now I want to provide small wrappers for Mapper and Reducer, which will use parts of the existing application.
As far as I have learned, when a job is submitted, Hadoop tries to find the JAR file that contains the mapper/reducer classes and copies that jar over the network to the data nodes that will process the data. But it's not clear how I tell Hadoop to copy all the dependencies.
I could use the maven-shade-plugin to create an uber jar with the job and its dependencies, and another jar for the bootstrap (which would be executed with the hadoop shell script).
Please advise.
One way could be to put the required jars in the distributed cache. Another alternative would be to install all the required jars on the Hadoop nodes and tell the TaskTrackers about their location. I would suggest you go through this post once; it talks about the same issue.
Use Maven to manage the dependencies and ensure the correct versions are used during builds and deployment. Popular IDEs have Maven support, so you don't have to worry about building classpaths for editing and building. Finally, you can instruct Maven to build a single jar (a "jar-with-dependencies") containing your app and all dependencies, making deployment very easy.
As for dependencies such as Hadoop that are guaranteed to be on the runtime classpath, you can declare them with the "provided" scope so they are not included in the uber jar.
Use the -libjars option of the hadoop launcher script to specify dependencies for jobs running on remote JVMs;
use the $HADOOP_CLASSPATH variable to set dependencies for the JobClient running on the local JVM.
A detailed discussion is here: http://grepalex.com/2013/02/25/hadoop-libjars/
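For the distributed cache option, here is a small sketch using the new mapreduce API (the HDFS path is a placeholder); it has the same effect as passing the jar via -libjars:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithExtraJars {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "job-with-external-libs");
        // Ship a dependency jar that was already copied to HDFS and put it on
        // the task classpath (the programmatic counterpart of -libjars).
        job.addFileToClassPath(new Path("/libs/my-dependency.jar"));
        // ... set the mapper/reducer, input/output, then submit the job.
    }
}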

hbase and osgi - can't find hbase-default.xml

As HBase is not available as an OSGi-fied bundle yet, I managed to create the bundle with the Maven Felix plugin (HBase 0.92 and the corresponding hadoop-core 1.0.0), and both bundles start up in OSGi :)
The hbase-default.xml is also added to the resulting bundle. When I open the resulting OSGi jar, the structure looks like this:
org/
META-INF
hbase-default.xml
This was achieved with <Include-Resource>#${pkgArtifactId}-${pkgVersion}.jar!/hbase-default.xml</Include-Resource>
The problem comes up when I actually want to connect to HBase: hbase-default.xml cannot be found, and thus I cannot create any configuration.
The HBase OSGi bundle is used from within another OSGi bundle that is supposed to obtain an HBase connection and query the database. That OSGi bundle is in turn used by an RCP application.
My question is: where do I have to put my hbase-default.xml so that it will be found when the bundle is started? Or why does it not realize that the file exists?
Thank you for any hints.
-- edit
I found a decompiler so I could view the source where the configuration is loaded (hadoop-core does not provide any sources via Maven). I now see that the thread's context classloader is used (and, if that is not available, the classloader of the Configuration class itself), so it seems that it simply can't find the resource, even though, according to the description, it should also check the parents (but who is the parent in an OSGi environment?).
I tested getting the resource from the OSGi bundle that is supposed to use HBase, where I added hbase-default.xml to the created jar file (see above), and there I do get a resource from the thread's context classloader. When I explored the code a bit more, I realized there is no way to set the classloader for HBaseConfiguration: although it would be possible to set the classloader on a plain Hadoop Configuration, which HBaseConfiguration inherits from, the creation procedure of HBaseConfiguration does not allow it, as it simply creates a new object inside its create() method.
I really hope you have some idea how to get this up and running :)
Thread.currentThread().setContextClassLoader(HBaseConfiguration.class.getClassLoader());
Make sure the HBaseConfiguration class is loaded in your OSGi bundle. HBase uses the thread context classloader to load resources (hbase-default.xml and hbase-site.xml), so setting the TCCL will allow you to load the defaults and override them later.
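A sketch of that workaround, restoring the original classloader afterwards (the connection code itself is elided):

ClassLoader previous = Thread.currentThread().getContextClassLoader();
try {
    // point the TCCL at a classloader that can see hbase-default.xml
    Thread.currentThread().setContextClassLoader(
            org.apache.hadoop.hbase.HBaseConfiguration.class.getClassLoader());
    org.apache.hadoop.conf.Configuration conf =
            org.apache.hadoop.hbase.HBaseConfiguration.create();
    // ... use conf to open the HBase connection
} finally {
    Thread.currentThread().setContextClassLoader(previous);
}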
If hbase-default.xml is in a .jar file that is on the CLASSPATH, that file can normally be found by a Java program.
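A quick diagnostic to check whether a given classloader can actually see the file:

java.net.URL res = org.apache.hadoop.hbase.HBaseConfiguration.class
        .getClassLoader().getResource("hbase-default.xml");
System.out.println(res != null ? "hbase-default.xml found at " + res
        : "hbase-default.xml is not visible from this classloader");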
I have read the HBase mailing list.
Check your pom.xml: in the 'process-resources' phase, the '###VERSION###' placeholder in hbase-default.xml is replaced with the actual version string. However, if that phase is configured with a 'target' element instead of a 'tasks' element, the replacement does not occur.
Have a look at your pom.xml and correct the element from 'target' to 'tasks' if that is the case.
I faced this issue and actually fixed it by putting hbase-site.xml in the bundle I was calling HBase from. I found the advice here:
Using this component in OSGi: This component is fully functional in an OSGi environment however, it requires some actions from the user. Hadoop uses the thread context class loader in order to load resources. Usually, the thread context classloader will be the bundle class loader of the bundle that contains the routes. So, the default configuration files need to be visible from the bundle class loader. A typical way to deal with it is to keep a copy of core-default.xml in your bundle root. That file can be found in the hadoop-common.jar.

How to add jars into the classpath and get effected without restarting the hadoop cluster?

I wrote some MapReduce jobs that reference a few external jars,
so I added those jars to the CLASSPATH of the "running" cluster in order to run the jobs.
When I tried to run them, I got ClassNotFoundExceptions.
I googled for ways to fix this and found that I needed to restart the cluster to apply the changed CLASSPATH, and that actually worked.
Oh, yuck!
Do I really need to restart the cluster every time I add new jars to the CLASSPATH?
That doesn't make sense to me.
Does anyone know how to apply the changes without restarting?
I think I need to add some detail to get better advice.
I wrote a custom HBase filter class and packed it in a jar.
I also wrote a MapReduce job that uses the custom filter class and packed it in another jar.
Because the filter class jar wasn't on the classpath of my "running" cluster, I added it there.
But I couldn't get the job to run until I restarted the cluster.
Of course, I know I could pack the filter class and the job together in a single jar,
but that's not what I want.
So I'm still curious: do I have to restart the cluster again whenever I need to add new external jars?
Check the Cloudera article on including third-party libraries required for a job. Options (1) and (2) don't require the cluster to be restarted.
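For the custom-filter case specifically, here is a sketch of the distributed cache route using HBase's TableMapReduceUtil; MyCustomFilter is a placeholder for your filter class:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class FilterJobLauncher {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-with-custom-filter");
        // Ship the jar that contains the custom filter with the job itself,
        // instead of editing the cluster-wide CLASSPATH and restarting.
        // MyCustomFilter stands in for your own filter class.
        TableMapReduceUtil.addDependencyJars(job.getConfiguration(), MyCustomFilter.class);
        // ... configure the table mapper/reducer and submit as usual.
    }
}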
You could also build a system that dynamically resolves class names to an interface type to process your data; a rough sketch follows below.
Just my 2 cents.
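As an illustration of that idea (the property name and the MyProcessor interface are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class ProcessorFactory {
    // Hypothetical interface that the pluggable classes implement.
    public interface MyProcessor {
        void process(String record);
    }

    public static MyProcessor fromConf(Configuration conf) {
        // Resolve the class named in the job configuration against the interface
        // and instantiate it; new implementations then only need to be on the
        // job's classpath (e.g. shipped via the distributed cache).
        Class<? extends MyProcessor> clazz =
                conf.getClass("my.processor.class", null, MyProcessor.class);
        return ReflectionUtils.newInstance(clazz, conf);
    }
}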
