Where can I find spark.hadoop.yarn.* properties? - hadoop

I was trying to run a Spark (1.6.0) application that used the com.databricks.spark.csv jar to load a CSV file in yarn-client mode from Eclipse. It was throwing a
CSVRelatio$annonfunc$func not found exception. That was resolved by setting the
spark.hadoop.yarn.application.classpath
property in SparkConf.
My question is that the spark.hadoop.yarn.application.classpath property is not
listed in any of the official Spark documents. So where can I find all such
properties? I know it is a silly question, but there are many beginners who
refer to the official
documentation (https://spark.apache.org/docs/1.6.0/configuration.html) and are
not at all aware of these properties.

They are not listed because they are not Spark properties. The spark.hadoop. prefix is used only so that Spark recognizes that these options should be parsed, stripped of the prefix, and put into org.apache.hadoop.conf.Configuration.
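A minimal sketch (Java, Spark 1.6-era API; the classpath value and the local master are placeholders for illustration) of how a spark.hadoop.-prefixed key ends up, with the prefix stripped, in the Hadoop Configuration that Spark builds:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopPrefixDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("hadoop-prefix-demo")
                .setMaster("local[2]") // in the real case this would be yarn-client
                // Forwarded to Hadoop under the plain key "yarn.application.classpath"
                .set("spark.hadoop.yarn.application.classpath",
                     "/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*");

        JavaSparkContext sc = new JavaSparkContext(conf);
        Configuration hadoopConf = sc.hadoopConfiguration();
        // Prints the value under the un-prefixed Hadoop key
        System.out.println(hadoopConf.get("yarn.application.classpath"));
        sc.stop();
    }
}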
Where should you look for documentation? Check the Hadoop documentation for the corresponding component. For example, for YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
You should also note that Spark has its own classpath-related properties, including:
spark.jars
spark.jars.packages
spark.driver.extraClassPath / spark.executor.extraClassPath
....

Related

How do I programmatically install Maven libraries to a cluster using init scripts?

I have been trying for a while now and I'm sure the solution is simple enough; I'm just struggling to find it. I'm pretty new, so go easy on me!
It's a requirement to do this using a pre-made init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 on a cluster on Azure Databricks. Following the documentation's example (which installs a PostgreSQL driver), they produce an init script using the following command:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
My question is: what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar part of this code? And what would I have to do to make this work for my situation?
Many thanks in advance.
/mnt/driver-daemon/jars/postgresql-42.2.2.jar here is the output path where the jar file will be put. But it makes no sense, as this jar won't be put onto the classpath and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where they will be picked up by Spark automatically.
However, this method of downloading jars works only for jars without dependencies, and for libraries like the EventHubs connector that is not the case: they won't work if their dependencies aren't downloaded as well. Instead, it's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs); with these methods, all dependencies are fetched as well.
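For illustration, a hedged sketch of calling the Libraries API from Java (java.net.http, Java 11+); the workspace URL, token, and cluster id are placeholders, and the endpoint and payload shape should be double-checked against the current Databricks REST API documentation:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InstallMavenLibrary {
    public static void main(String[] args) throws Exception {
        String workspace = "https://<your-workspace>.azuredatabricks.net"; // placeholder
        String token = System.getenv("DATABRICKS_TOKEN");                  // personal access token
        String clusterId = "<cluster-id>";                                 // placeholder

        // Install the Maven coordinate; the service resolves transitive dependencies.
        String body = "{\"cluster_id\":\"" + clusterId + "\",\"libraries\":[{\"maven\":"
                + "{\"coordinates\":\"com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18\"}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(workspace + "/api/2.0/libraries/install"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}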
P.S. Really, instead of using the EventHubs connector, it's better to use the Kafka protocol, which EventHubs supports as well. There are several reasons for that:
It's better from a performance standpoint
It's better from a stability standpoint
The Kafka connector is included in DBR, so you don't need to install anything extra
You can read how to use Spark + EventHubs + Kafka connector in the EventHubs documentation.
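A minimal sketch of the Kafka-protocol route in Java Structured Streaming; the namespace, Event Hub name, and connection string are placeholders, and the SASL settings should be checked against the Event Hubs for Kafka documentation:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventHubsViaKafka {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("eventhubs-kafka").getOrCreate();

        // Event Hubs' Kafka endpoint authenticates with SASL PLAIN, using the literal
        // user "$ConnectionString" and the connection string as the password.
        String connectionString = "Endpoint=sb://<NAMESPACE>.servicebus.windows.net/;..."; // placeholder
        String jaas = "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"$ConnectionString\" password=\"" + connectionString + "\";";

        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "<NAMESPACE>.servicebus.windows.net:9093")
                .option("subscribe", "<EVENT_HUB>")
                .option("kafka.security.protocol", "SASL_SSL")
                .option("kafka.sasl.mechanism", "PLAIN")
                .option("kafka.sasl.jaas.config", jaas)
                .load();

        stream.selectExpr("CAST(value AS STRING) AS body")
              .writeStream()
              .format("console")
              .start()
              .awaitTermination();
    }
}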

Apache Spark 2.0.1 and Spring Integration

So, I would like to create an Apache Spark integration in my Spring application by following this guide provided by Spring (http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html). Now I have a few questions, as it seems that Spark 2.0.1 does not include the spark-assembly jar.
What are my options for proceeding with this, as it seems that the integration is dependent on the jar?
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Is there a way to get the jar with Spark 2.0.1?
Yes, you are right: Spark 2.0.1 does not ship an uber jar like 1.6.x and below did (e.g. spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar).
Spark 2.0.0+ (see spark-release-2-0-0.html) no longer requires a fat assembly uber jar. However, when you compare the contents of spark-assembly-1.6.2-hadoop2.6.0 with the jars in spark-2.0.0-bin-hadoop2.7\jars\, you can see almost the same content: the same classes, packages, etc.
If i am able to find the old jar would i be able to use it with apache 2.0.1?
Personally, I don't think so. There might be problems with backward compatibility, and it is odd to rely on something that was removed in the latest version.
You are right that SparkYarnTasklet needs the assembly jar, because there is validation in afterPropertiesSet:
@Override
public void afterPropertiesSet() throws Exception {
    Assert.hasText(sparkAssemblyJar, "sparkAssemblyJar property was not set. " +
            "You must specify the path for the spark-assembly jar file. " +
            "It can either be a local file or stored in HDFS using an 'hdfs://' prefix.");
    // ...
}
But this sparkAssemblyJar is only used in sparkConf.set("spark.yarn.jar", sparkAssemblyJar);
so when you use SparkYarnTasklet, the program will probably fail on this validation. (You can try to extend SparkYarnTasklet and override afterPropertiesSet without the validation, as sketched below.)
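A hedged sketch of that workaround (the SparkYarnTasklet package name is assumed from Spring for Apache Hadoop's batch support and may differ in your version; skipping the check only makes sense if you configure spark.yarn.jars or spark.yarn.archive yourself):

// Package assumed; adjust the import to the SparkYarnTasklet in your spring-hadoop version.
import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

public class Spark2YarnTasklet extends SparkYarnTasklet {
    @Override
    public void afterPropertiesSet() throws Exception {
        // Intentionally skip super.afterPropertiesSet(), which asserts that
        // sparkAssemblyJar is set; Spark 2.x no longer ships an assembly jar.
    }
}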
And the documentation about spark.yarn.jars / spark.yarn.archive says:
To make Spark runtime jars accessible from YARN side, you can specify
spark.yarn.archive or spark.yarn.jars. For details please refer to
Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is
specified, Spark will create a zip file with all jars under
$SPARK_HOME/jars and upload it to the distributed cache.
So take a look at the properties spark.yarn.jars and spark.yarn.archive.
Now compare spark.yarn.jar in 1.6.x with its replacement spark.yarn.jars in 2.0.0+.
spark.yarn.jar in 1.6.2:
The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to hdfs:///some/path.
spark.yarn.jars in 2.0.1:
List of libraries containing Spark code to distribute to YARN
containers. By default, Spark on YARN will use Spark jars installed
locally, but the Spark jars can also be in a world-readable location
on HDFS. This allows YARN to cache it on nodes so that it doesn't need
to be distributed each time an application runs. To point to jars on
HDFS, for example, set this configuration to hdfs:///some/path. Globs
are allowed.
But this seems to require listing the jars one by one.
In 2.0.0+ there is also spark.yarn.archive, which can be used instead of spark.yarn.jars and provides a way to avoid passing jars one by one: create an archive with all the jars in its root directory.
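A minimal sketch of setting them (the HDFS paths are placeholders; in practice you would pick one of the two options, and the jars or archive must be uploaded to HDFS beforehand):

import org.apache.spark.SparkConf;

public class YarnJarsConfig {
    public static SparkConf yarnJarsConf() {
        return new SparkConf()
                .setAppName("yarn-jars-demo")
                // Option 1: point YARN at the individual Spark jars (globs allowed)
                .set("spark.yarn.jars", "hdfs:///apps/spark/2.0.1/jars/*");
    }

    public static SparkConf yarnArchiveConf() {
        return new SparkConf()
                .setAppName("yarn-archive-demo")
                // Option 2: a single pre-built archive with all the jars in its root
                .set("spark.yarn.archive", "hdfs:///apps/spark/2.0.1/spark-libs.zip");
    }
}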
I think spring-hadoop will reflect the 2.0.0+ changes in a few weeks, but as a quick fix I would probably try to override SparkYarnTasklet and adapt it for 2.0.1, specifically the execute and afterPropertiesSet methods.

where does hadoop use yarn.log.file in java src code

I see that when starting a NodeManager,
the -Dyarn.log.file=yarn-hadoop-nodemanager-hostname1.log parameter is passed on the NodeManager's command line as a JVM system property,
but I can't find where this yarn.log.file is used in the Java code, so that log messages can be written into that file.
I would appreciate some help. Thanks.
Hadoop uses log4j behind the scenes. Log4j supports different configurable appenders, so in the Java code you will not see any direct reference to that file; it is just one of many possible appenders (i.e. outputs for the log). If you dig through the various log4j configuration files in the Hadoop sources (look for *log4j.properties), you will eventually find where your file is referenced.
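To make that concrete, a small hedged sketch (log4j 1.x, as bundled with Hadoop): the code only sets the system properties that the NodeManager launch command would pass with -D; log4j expands ${yarn.log.dir}/${yarn.log.file} when it configures the file appender, which is why the Java sources never read yarn.log.file directly:

import java.util.Properties;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class YarnLogFileDemo {
    public static void main(String[] args) {
        // Stand-ins for -Dyarn.log.dir=... and -Dyarn.log.file=... on the daemon's command line
        System.setProperty("yarn.log.dir", "/tmp/yarn-logs");
        System.setProperty("yarn.log.file", "yarn-hadoop-nodemanager-hostname1.log");

        Properties p = new Properties();
        p.setProperty("log4j.rootLogger", "INFO, RFA");
        p.setProperty("log4j.appender.RFA", "org.apache.log4j.RollingFileAppender");
        // ${...} placeholders are resolved against system properties by log4j itself
        p.setProperty("log4j.appender.RFA.File", "${yarn.log.dir}/${yarn.log.file}");
        p.setProperty("log4j.appender.RFA.layout", "org.apache.log4j.PatternLayout");
        p.setProperty("log4j.appender.RFA.layout.ConversionPattern", "%d %p %c: %m%n");
        PropertyConfigurator.configure(p);

        Logger.getLogger(YarnLogFileDemo.class).info("written to the yarn.log.file path");
    }
}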

Where do I set the configuration mapreduce.job.jvm.numtasks?

I am reading in a book (Professional Hadoop Solutions) that JVM reuse can be enabled by specifying the job configuration mapreduce.job.jvm.numtasks. My question is: do we need to set this in the Driver class?
I tried looking for this configuration in the mapreduce.Job object, and I can't find it. Has this API been moved elsewhere in the version of Hadoop I am using, or am I not looking in the right place? I am using Hadoop version 1.0.3.
I also tried looking for the older property mapred.job.reuse.jvm.num.tasks, and I couldn't find that either.
Thanks!
Your source is referring to the newer Hadoop configuration API for Hadoop 2.x (YARN). With the shift to YARN, a lot of configuration names were revised. The changes are documented on the official site for the related Hadoop release (in this case version 2.4.0, the one adopted by Amazon's Elastic MapReduce).
It explicitly mentions that the old configuration name mapred.job.reuse.jvm.num.tasks has been replaced by the new name mapreduce.job.jvm.numtasks.
Furthermore, the documentation for the MapReduce default configuration says this about mapreduce.job.jvm.numtasks:
How many tasks to run per jvm. If set to -1, there is no limit.
The default configuration for Hadoop 1.2.1 (whose configuration API is compatible with 1.0.3) can be found on GrepCode, for example.
Regarding your question of where to set this property: it can be set either
for the whole cluster in ${HADOOP_CONF_DIR}/mapred-site.xml,
or in the configuration of your Job (or JobContext), as long as it is not declared final within your cluster:
job.getConfiguration().set("mapred.job.reuse.jvm.num.tasks","-1");
You could define it in mapred-site.xml:
<property>
<name>mapred.job.reuse.jvm.num.tasks</name>
<value>-1</value>
</property>
Use it when you have short tasks that run for a definite period of time.
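A minimal driver-side sketch for Hadoop 1.0.3 using the old property name (assumes the property is not marked final on the cluster; mapper, reducer and paths are elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -1 = no limit: the task JVM is reused for as many tasks as the job needs
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");

        Job job = new Job(conf, "jvm-reuse-demo");
        // ... set mapper, reducer, input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}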

hbase and osgi - can't find hbase-default.xml

As HBase is not available as an OSGi-ified bundle yet, I managed to create the bundle with the Maven Felix plugin (HBase 0.92 and the corresponding hadoop-core 1.0.0), and both bundles are starting up in OSGi :)
The hbase-default.xml is also added to the resulting bundle. When I open the resulting OSGi jar, the structure looks like this:
org/
META-INF
hbase-default.xml
This was achieved with <Include-Resource>#${pkgArtifactId}-${pkgVersion}.jar!/hbase-default.xml</Include-Resource>
The problem comes up when I actually want to connect to HBase: hbase-default.xml cannot be found, and thus I cannot create any configuration.
The HBase OSGi bundle is used from within another OSGi bundle that is supposed to obtain an HBase connection and query the database. That bundle in turn is used by an RCP application.
My question is: where do I have to put my hbase-default.xml so that it is found when the bundle is started? Or why does it not realize that the file exists?
Thank you for any hints.
-- edit
I found a decompiler so I could view the source where the configuration is loaded (hadoop-core, which does not provide any sources via Maven). I now see that the thread's contextClassLoader is used (and, if that is not available, the classloader of the Configuration class itself). So it seems it can't find the resource, but according to the description it should also check the parents (though who is the parent in an OSGi environment?).
I tested getting the resource from the OSGi bundle that should use HBase, where I added hbase-default.xml to the created jar file (see above), and there I do get the resource when I use the thread's contextClassLoader. Exploring the code a bit more, I realized that there is no way to set the classloader for HBaseConfiguration (although it would be possible for a plain Hadoop Configuration, which HBaseConfiguration inherits from); the creation procedure of HBaseConfiguration does not allow it, as it simply creates a new object within the create() method.
I really hope you have some idea how to get this up and running :)
Thread.currentThread().setContextClassLoader(HBaseConfiguration.class.getClassLoader());
Make sure the HBaseConfiguration class is loaded in your OSGi bundle. HBase uses the thread context classloader in order to load resources (hbase-default.xml and hbase-site.xml). Setting the TCCL will allow you to load the defaults and override them later.
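Putting that together, a hedged sketch of a small factory in the calling bundle that swaps the TCCL just for the duration of HBaseConfiguration.create() and then restores it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class HBaseConfigFactory {

    private HBaseConfigFactory() {
    }

    public static Configuration create() {
        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        try {
            // HBaseConfiguration's own classloader belongs to the bundle that packages
            // hbase-default.xml, so resources resolve while it is the TCCL.
            Thread.currentThread().setContextClassLoader(HBaseConfiguration.class.getClassLoader());
            return HBaseConfiguration.create();
        } finally {
            Thread.currentThread().setContextClassLoader(previous);
        }
    }
}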
If hbase-default.xml is in a .jar file that is on the CLASSPATH, that file can normally be found by a Java program.
I have read the HBase mailing list.
check your pom.xml:
In the 'process-resources' phase, hbase-default.xml's '###VERSION###' placeholder is replaced with the actual version string. However, if this phase's configuration is set to 'target' instead of 'tasks', the replacement will not occur.
You could have a look at your pom.xml and correct the tag if so.
I faced this issue and actually fixed it by putting hbase-site.xml in the bundle I was calling HBase from; I found the advice here:
Using this component in OSGi: This component is fully functional in an OSGi environment; however, it requires some actions from the user. Hadoop uses the thread context class loader in order to load resources. Usually, the thread context classloader will be the bundle class loader of the bundle that contains the routes. So, the default configuration files need to be visible from the bundle class loader. A typical way to deal with it is to keep a copy of core-default.xml in your bundle root. That file can be found in the hadoop-common.jar.
