Loading properties file in storm cluster mode - apache-storm

In my topology there is a small piece of code that loads configuration from a properties file on the classpath:
Properties p = new Properties();
InputStream is = getClass().getClassLoader().getResourceAsStream("dev.properties");
p.load(is);
It works fine when I run the jar in local mode Storm, but in cluster mode it fails with a NullPointerException.
The properties file is in src/main/resources (Maven structure) and is properly included in the jar file.
What could be the reason?
Besides, I run into a lot of trouble whenever a topology makes outbound calls, for example to Elasticsearch, in cluster mode, even though it works perfectly in local mode.
What should I keep in mind before running a topology in cluster mode?

Load your properties object while building the topology and then pass it to your bolts/spouts via their constructors where necessary.
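A minimal sketch of that approach (MySpout, MyBolt and the topology name are placeholders; the package names assume Storm 1.x+, older releases use backtype.storm instead of org.apache.storm):

import java.io.InputStream;
import java.util.Properties;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class TopologyMain {
    public static void main(String[] args) throws Exception {
        // Load the properties once, on the machine that submits the topology.
        Properties props = new Properties();
        try (InputStream is = TopologyMain.class.getClassLoader().getResourceAsStream("dev.properties")) {
            props.load(is);
        }

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout());
        // java.util.Properties is Serializable, so Storm can ship it to the
        // workers inside the serialized bolt instance.
        builder.setBolt("bolt", new MyBolt(props)).shuffleGrouping("spout");

        StormSubmitter.submitTopology("my-topology", new Config(), builder.createTopology());
    }
}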

Alternatively, you can configure a network file system (NFS) in your Storm cluster, place the properties file at an NFS location that is mounted on every worker node, and read it from that path instead of the classpath.
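A minimal sketch of reading the file from such a shared mount (the path is an assumption and must be identical on all worker nodes):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

public class SharedConfigLoader {
    public static Properties load() throws Exception {
        Properties props = new Properties();
        // Hypothetical NFS mount point; adjust to wherever the share is mounted.
        try (InputStream is = new FileInputStream("/mnt/nfs/config/dev.properties")) {
            props.load(is);
        }
        return props;
    }
}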

Related

Set Log Level of Storm Topology from Start

I have a bug that occurs in my Storm topology during initialization. I would like to set the log level to DEBUG from when the topology is started.
I realize there is a mechanism to dynamically set the log level for a running topology using either the Storm UI or CLI, but I am not able to dynamically change this setting before the bug occurs in my topology during initialization.
How can I statically set the log level to DEBUG so that I can see more detailed logs when my topology is initialized?
The following only applies to Storm 2.0.0 and later.
You can include a log4j2 config file in your topology jar. You then need to set the topology.logging.config property in your topology configuration.
I'll include the documentation here for convenience:
Log file the user can use to configure Log4j2. Can be a resource in the jar (specified with classpath:/path/to/resource) or a file. This configuration is applied in addition to the regular worker log4j2 configuration. The configs are merged according to the rules here: https://logging.apache.org/log4j/2.x/manual/configuration.html#CompositeConfiguration
See https://github.com/apache/storm/blob/885ca981fc52bda6552be854c7e4af9c7a451cd2/storm-client/src/jvm/org/apache/storm/Config.java#L735
The "regular worker log4j2 configuration" is the log4j2/worker.xml file in your Storm deployment, assuming default settings.

Test framework for Spark Application validations

I am looking for suggestions on a testing framework for one of our Spark applications.
We have a Spark application that reads input data from HDFS and writes the processed output back to HDFS. We are planning to automate the testing of this application.
I would appreciate any suggestions on how to automate the testing, or whether any framework is available for testing Spark applications/jobs.
-Sri
Spark code can be tested without any additional Spark-specific frameworks. Just set the master in the configuration to "local":
val config = new SparkConf().setMaster("local")
The local file system is used in place of HDFS by default, and this approach works with the usual test frameworks (ScalaTest, etc.).
Note: the SparkContext must be shared as a singleton across all tests.
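A minimal sketch of such a test, here using the Java API with JUnit (class name and assertion are illustrative):

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class WordCountJobTest {

    private static JavaSparkContext sc;

    @BeforeClass
    public static void setUp() {
        // "local[*]" runs Spark inside the test JVM; no cluster or HDFS is needed.
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("unit-test");
        sc = new JavaSparkContext(conf);
    }

    @AfterClass
    public static void tearDown() {
        sc.stop();
    }

    @Test
    public void countsElements() {
        long count = sc.parallelize(Arrays.asList("a", "b", "c")).count();
        assertEquals(3L, count);
    }
}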

Apache Spark 2.0.1 and Spring Integration

So, I would like to integrate Apache Spark into my Spring application by following this guide provided by Spring (http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html). Now I have a few questions, as it seems that Spark 2.0.1 does not include the spark-assembly jar.
What are my options for proceeding, as it seems that the integration depends on that jar?
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Is there a way to get the jar for Spark 2.0.1?
Yes, you are right - Spark 2.0.1 does not ship an uber jar like 1.6.x and below did (e.g. spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar).
Spark 2.0.0+ (see spark-release-2-0-0.html) no longer requires a fat assembly uber jar. However, if you compare the contents of spark-assembly-1.6.2-hadoop2.6.0.jar with the jars in spark-2.0.0-bin-hadoop2.7\jars\, you will see almost the same classes and packages.
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Personally, I don't think so. There could be backward-compatibility problems, and it is odd to rely on something that was removed in the latest version.
You are right that SparkYarnTasklet needs the assembly jar, because there is validation in afterPropertiesSet:
@Override
public void afterPropertiesSet() throws Exception {
    Assert.hasText(sparkAssemblyJar, "sparkAssemblyJar property was not set. " +
            "You must specify the path for the spark-assembly jar file. " +
            "It can either be a local file or stored in HDFS using an 'hdfs://' prefix.");
However, this sparkAssemblyJar is only used in sparkConf.set("spark.yarn.jar", sparkAssemblyJar);
So when you use SparkYarnTasklet, the program will fail on this validation (you can try to extend SparkYarnTasklet and override afterPropertiesSet without the validation, as sketched below).
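A minimal sketch of that workaround; the import is an assumption about the spring-hadoop package layout, the class must not be final, and whether the rest of the tasklet then works against Spark 2.0.1 is unverified:

// The import below is an assumption; check the SparkYarnTasklet package in your spring-hadoop version.
import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

public class LenientSparkYarnTasklet extends SparkYarnTasklet {

    @Override
    public void afterPropertiesSet() throws Exception {
        // Deliberately skip the parent's sparkAssemblyJar validation so the tasklet
        // can be configured without a spark-assembly jar. Note that this also skips
        // any other checks the parent performs here.
    }
}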
And here is the documentation about spark.yarn.jars and spark.yarn.archive:
To make Spark runtime jars accessible from YARN side, you can specify
spark.yarn.archive or spark.yarn.jars. For details please refer to
Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is
specified, Spark will create a zip file with all jars under
$SPARK_HOME/jars and upload it to the distributed cache.
So take a look at the properties spark.yarn.jars and spark.yarn.archive.
Compare spark.yarn.jar in 1.6.x with its successor spark.yarn.jars in 2.0.0+.
spark.yarn.jar in 1.6.2:
The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to hdfs:///some/path.
spark.yarn.jars in 2.0.1:
List of libraries containing Spark code to distribute to YARN
containers. By default, Spark on YARN will use Spark jars installed
locally, but the Spark jars can also be in a world-readable location
on HDFS. This allows YARN to cache it on nodes so that it doesn't need
to be distributed each time an application runs. To point to jars on
HDFS, for example, set this configuration to hdfs:///some/path. Globs
are allowed.
But this means listing the jars one by one.
In 2.0.0+ there is also spark.yarn.archive, which can be used instead of spark.yarn.jars and avoids passing jars one by one: create an archive with all the jars at its root and point the property at it.
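For illustration (the HDFS path and the way the archive is built are assumptions; adjust them to your cluster):

import org.apache.spark.SparkConf;

public class YarnArchiveConfig {
    public static SparkConf build() {
        // The archive is expected to contain the jars from $SPARK_HOME/jars at its root, e.g.:
        //   cd $SPARK_HOME/jars && zip -q ../spark-libs.zip *
        //   hdfs dfs -put $SPARK_HOME/spark-libs.zip /spark/
        return new SparkConf()
                .set("spark.yarn.archive", "hdfs:///spark/spark-libs.zip");
    }
}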
I think spring-hadoop will catch up with the 2.0.0+ changes in a few weeks, but as a quick fix I would try to override SparkYarnTasklet and adapt it to 2.0.1 - specifically the execute and afterPropertiesSet methods.

Why Sling Configuration has two different formats

In Sling, configuration can be deployed either via a sling:OsgiConfig node or via an nt:file node containing the configuration values.
When I make changes in the Felix console to a configuration deployed via a sling:OsgiConfig node, it gets converted to the nt:file format.
Why are there these two different formats for configurations in Sling? Is there any significant difference between the two?
I'd say this is mostly for historical reasons: in some cases it's more convenient to provide configurations as hierarchical resources (sling:OsgiConfig), while if the config is coming from a filesystem, for example, files are more convenient.
@Shashi: a sling:OsgiConfig node changing to nt:file when you make changes in the Felix console is expected behaviour. It will not cause any issue when you read the config values from a Java class; you just won't be able to edit the run-mode config via CRXDE once it has changed to nt:file, because the data is then stored as binary content.
However, there is a way to disable this behaviour: uncheck "Enable Write Back" at /system/console/configMgr/org.apache.sling.installer.provider.jcr.impl.JcrInstaller, as mentioned in this thread.
OSGi config best practices

How to add jars to the classpath and have the change take effect without restarting the Hadoop cluster?

I wrote some MapReduce jobs that reference a few external jars.
So I added those jars to the CLASSPATH of the "running" cluster in order to run the jobs.
When I tried to run them, I got ClassNotFoundExceptions.
I Googled for ways to fix it and found that I needed to restart the cluster to apply the changed CLASSPATH, which actually worked.
Oh, yuck!
Do I really have to restart the cluster every time I add new jars to the CLASSPATH?
I don't think that it makes sense.
Does anyone know how to apply the changes without restarting it?
I think I need to add some detail to get useful advice.
I wrote a custom HBase filter class and packed it in a jar.
And I wrote a MapReduce job that uses the custom filter class and packed it in another jar.
Because the filter class jar wasn't in the class path of my "running" cluster, I added it.
But I couldn't get the job to run until I restarted the cluster.
Of course, I know I could pack the filter class and the job together in a single jar.
But that's not what I'm after.
And I'm still curious: do I have to restart the cluster every time I need to add new external jars?
Check the Cloudera article on including third-party libraries required for a job. Options (1) and (2) there don't require the cluster to be restarted; a sketch of the general approach follows below.
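One commonly used way to ship extra jars with a job, without touching the cluster-wide CLASSPATH (whether this matches the article's numbering is an assumption), is the generic -libjars option, which requires the driver to run through ToolRunner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FilterJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already parsed generic options such as -libjars, so the
        // jars named there are put on the task classpath and in the distributed
        // cache without changing the cluster-wide CLASSPATH.
        Job job = Job.getInstance(getConf(), "hbase-filter-job");
        job.setJarByClass(FilterJobDriver.class);
        // ... configure mapper/reducer, input and output ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new FilterJobDriver(), args));
    }
}

It would then be launched with something like: hadoop jar my-job.jar FilterJobDriver -libjars /path/to/custom-filter.jar <input> <output>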
You could also build a system that dynamically resolves class names to an interface type in order to process your data.
Just my 2 cents.
