How do you provide a custom configuration to a Storm topology? For example, if I have a topology that I built that connects to a MySQL cluster, and I want to be able to change which servers I need to connect to without recompiling, how would I do that? My preference would be to use a config file, but my concern is that the file itself is not deployed to the cluster, so it won't be read (unless my understanding of how a cluster works is flawed). The only way I've seen so far to pass configuration options into a Storm topology at runtime is via command-line parameters, but that gets messy once you have a good number of parameters.
One thought I did have is to leverage a shell script to read the file into a variable and pass the contents of that variable in as a string to the topology, but I'd like something a little cleaner if possible.
Has anyone else encountered this? If so, how did you solve it?
EDIT:
It appears I need to provide more clarification. My scenario is that I have a topology that I want to be able to deploy in different environments without having to recompile it. Normally, I'd create a config file that contains things like database connection parameters and have that passed in. I'd like to know how to do something like that in Storm.
You can specify a configuration (typically via a yaml file) which you submit along with your topology. The way we manage this in our own project is to keep separate config files, one for development and one for production, in which we store our server, Redis, and DB IPs and ports, etc. When we run our command to build the jar and submit the topology to Storm, it includes the correct config file for the deployment environment. The bolts and spouts then simply read the configuration they require from the stormConf map, which is passed to them in the bolt's prepare() method.
From http://storm.apache.org/documentation/Configuration.html :
Every configuration has a default value defined in defaults.yaml in the Storm codebase. You can override these configurations by defining a storm.yaml in the classpath of Nimbus and the supervisors. Finally, you can define a topology-specific configuration that you submit along with your topology when using StormSubmitter. However, the topology-specific configuration can only override configs prefixed with "TOPOLOGY".
Storm 0.7.0 and onwards lets you override configuration on a per-bolt/per-spout basis.
You'll also see on http://nathanmarz.github.io/storm/doc/backtype/storm/StormSubmitter.html that submitJar and submitTopology are passed a map called conf.
Hope this gets you started.
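To make that concrete, here is a minimal sketch of the pattern described above. The "mysql.host" key and the class names are made up for illustration, and the imports assume a pre-Apache Storm release (backtype.storm); on newer versions the package is org.apache.storm:

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class ConfigDrivenTopology {

    // The bolt reads its settings from the stormConf map handed to prepare().
    public static class MySqlBolt extends BaseRichBolt {
        private transient String mysqlHost;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // "mysql.host" is whatever key you chose to put into the Config below.
            this.mysqlHost = (String) stormConf.get("mysql.host");
        }

        @Override
        public void execute(Tuple input) {
            // connect to mysqlHost and do the work
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... builder.setSpout(...) / builder.setBolt(...) as usual ...

        Config conf = new Config();
        // These values could equally be loaded from a per-environment yaml/properties file.
        conf.put("mysql.host", "db.dev.example.com:3306");

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}

The important point is that the values are put into the Config at submit time, so switching environments only means submitting with a different config file, not recompiling the bolts.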
I solved this problem by just providing the config in code:
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, SOME_OPTS);
I tried to provide a topology-specific storm.yaml but it doesn't work. Correct me if you have managed to make a topology-specific storm.yaml work.
Update:
For anyone who wants to know what SOME_OPTS is, this is from Satish Duggana on the Storm mailing list:
Config.TOPOLOGY_WORKER_CHILDOPTS: Options which can override
WORKER_CHILDOPTS for a topology. You can configure any java options
like memory, gc etc
In your case it can be
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx1g");
What might actually serve you best is to store the configuration in a mutable key value store (s3, redis, etc.) and then pull that in to configure a database connection that you then use (I assume here you are already planning to limit how often you talk to the database so that the overhead of getting this config is not a big deal). This design allows you to change the database connection on-the-fly, with no need to even redeploy the topology.
The idea is that when you build your topology, you create instances of your spouts and bolts (among other things) and these instances are serialized and distributed to the right places in the cluster. If you want to configure the behavior of a spout or bolt, you do so when creating the topology before submitting it and you do so by setting instance variables on the bolt or spout that, in turn, drive the configurable behavior you want.
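As a hedged sketch of that idea (the class and field names below are invented for illustration), the bolt is configured through its constructor when the topology is built; because the instance is serialized and shipped to the workers, the fields arrive there with it:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical bolt: the DB endpoint is an instance field set before submission.
public class MySqlWriterBolt extends BaseBasicBolt {
    private final String dbHost;
    private final int dbPort;

    public MySqlWriterBolt(String dbHost, int dbPort) {
        this.dbHost = dbHost;
        this.dbPort = dbPort;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // connect to dbHost:dbPort, or re-read the current endpoint from your
        // key-value store here if you want on-the-fly changes as described above
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}

// At build time, with values read from whatever per-environment source you prefer:
// builder.setBolt("mysql-writer", new MySqlWriterBolt("db.prod.example.com", 3306), 4);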
I also faced the same issue. I solved it by configuring NFS in my cluster and putting the configuration file in a shared location, so that it is available to all the cluster machines. It is very easy to configure NFS on a Linux system (link).
I faced the same problem as you did, and here is my (admittedly hacky) solution:
Use a simple Java class as the configuration file, say topo_config.java; it looks like:
package com.xxx;
public class topo_config {
public static String zk_host = "192.168.10.60:2181";
public static String kafka_topic = "my_log_topic";
public static int worker_num = 2;
public static int log_spout_num = 4;
// ...
}
This file is kept in my config folder. I then write a script, say compile.sh, which copies it into the right package and does the compilation:
cp config/topo_config.java src/main/java/com/xxx/
mvn package
The configuration is then accessed directly:
Config conf = new Config();
conf.setNumWorkers(topo_config.worker_num);
We have seen the same issue and solved it by adding the configuration below per topology:
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70 -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true");
We also verified it using the Nimbus UI, where it shows as below:
topology.worker.childopts -Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70 -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true
We have a configuration-driven ETL framework built on top of Apache Spark, and we are in the process of designing a common shell script that can be used for doing the spark-submit.
We looked at various blog posts and the Spark documentation to get some direction for our work, but all we found is the specification for spark-submit (much like a UNIX man page). What we are looking for is some kind of how-to article or best practices that can be followed while designing a common shell script for spark-submit.
Here is our plan so far.
To set the context for our shell script, assume that we have many applications in our project and each of the applications will have many jobs.
Environment Details:
Spark Version: 2.3.2,
Deployment Mode: Cluster,
Programming Language: Scala,
Scala Version: 2.11.8,
Cluster Manager: YARN
The core of our script is the --properties-file option that comes with the spark-submit command. All the configuration that we would otherwise set through "--conf" should come from the configuration file, and we can make use of the --properties-file option to pass that configuration file as an input. This way we make the shell script immune to configuration changes.
To enable the selection of configurations dynamically, we are planning to have many configuration files as listed below
Based on memory - we will have one file for each of the memory-related categories (x-small, small, medium, large, x-large). All the memory-related properties, like cores and memory for driver and executor based on category, go here.
Common - all the common properties (jars, convertMetastoreParquet, and any other common conf) go here.
Application-specific - application-specific properties or overrides go here.
We will pass the application name and memory sizing category as inputs to the shell script. Based on the memory category, we choose the corresponding conf file, whose contents are appended to the common conf file; then, if there are application-specific overrides (optional), the application-specific file is also appended to the existing configuration. Finally, the consolidated file is passed as an input to the --properties-file option, as illustrated below.
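For illustration only, a consolidated file produced by such a script might look like the sketch below; the keys are standard Spark properties, but the grouping and values are hypothetical:

# from the memory-category file (e.g. medium.conf)
spark.driver.memory                      4g
spark.driver.cores                       2
spark.executor.memory                    8g
spark.executor.cores                     4

# from the common file
spark.sql.hive.convertMetastoreParquet   false
spark.jars                               /path/to/common-libs.jar
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true

# from the optional application-specific overrides
spark.sql.shuffle.partitions             400

The consolidated file would then be handed to spark-submit via --properties-file, keeping the shell script itself free of hard-coded configuration.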
What I wanted to know is
Are there any blogs/videos that list best practices for designing such a shell script?
We are planning to use dynamic allocation, so we are not setting the instance parameter through the configuration files. Is there any downside to using dynamic allocation in a production cluster, other than the additional time taken to provision/decommission resources at run time?
Thanks
Is there a way to replicate two or more Couchbase buckets to Elasticsearch using a single configuration file?
I am actually using this version of the Couchbase Elasticsearch connector:
https://docs.couchbase.com/elasticsearch-connector/4.0/index.html
I do replicate my data correctly, but I need to run one command per bucket, using a different configuration file (.toml) each time.
By the way, I could not run the cbes command multiple times on the same server, as the metrics port 31415 is already in use.
Is there any way to handle many connector groups at once?
In version 4.0 a single connector process can replicate from only one bucket. This is because the indexing rules and all of the underlying network connections to Couchbase Server are scoped to the bucket level.
The current recommendation is to create multiple config files and run multiple connector processes. It's understood that this can be complicated to manage if you're replicating a large number of buckets.
If you're willing to get creative, you could use the same config file template for multiple buckets. The idea is that you'd write a config file with some placeholders in it, and then generate the actual config file by running a script that replaces the placeholders with the correct values for each connector.
The next update to the connector will add built-in support for environment variable substitution in the config file. This could make the templating approach easier.
Here are some options for avoiding the metrics port conflict:
Disable metrics reporting by setting the httpPort key in the [metrics] section to -1 (see the snippet below).
Or use a random port by setting it to 0.
Or use the templating idea described above, and plug a unique port number into each generated config file.
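For reference, the relevant piece of the generated .toml might look like this minimal sketch (the [metrics] section and httpPort key are the ones mentioned above; the value depends on which option you pick):

[metrics]
# -1 disables the HTTP metrics endpoint, 0 picks a random free port,
# or template in a unique fixed port per generated config file
httpPort = -1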
It's worth mentioning that a future version of the connector will support something we're calling "Autonomous Operations Mode". When the connector runs in this mode, the configuration will be stored in a central location (probably a Consul server). It will be possible to reconfigure a connector group on-the-fly, and add or remove workers to the group without having to stop all the workers and edit their config files. Hopefully this will simplify the management of large deployments.
I'm using Hadoop (via Spark) and need to access S3N content which is requester-pays. Normally, this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet I've set this, and Spark / Hadoop are ignoring it. Perhaps I'm putting the jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop / Spark / JetS3t to access requester-pays buckets?
UPDATE: This is needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays. So, a crude workaround is to run out of EC2.
The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could use System.getProperty() before the file operation to check if the JVM where the code runs has loaded the right config. You could even use System.setProperty() to directly set it at that point instead of figuring out the config files.
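Following that suggestion, here is a hedged sketch (the property name is the JetS3t key from the question; whether JetS3t actually picks it up from system properties depends on how it is initialized in your deployment):

// Call this before the S3N file operation, in whichever JVM actually runs it.
public final class RequesterPaysCheck {
    private static final String KEY = "httpclient.requester-pays-buckets-enabled";

    public static void ensureEnabled() {
        // See what this particular JVM has loaded.
        System.out.println(KEY + " = " + System.getProperty(KEY));
        // Set it directly instead of relying on jets3t.properties being on the classpath.
        if (!"true".equals(System.getProperty(KEY))) {
            System.setProperty(KEY, "true");
        }
    }
}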
Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")
I'm using spark-ec2 to run some Spark code. When I set master to
"local", it runs fine. However, when I set master to $MASTER,
the workers immediately fail with java.lang.NoClassDefFoundError for
the classes. The workers connect to the master, show up in the UI, and try to run the task, but immediately raise that exception as soon as they load their first dependency class (which is in the assembly jar).
I've used sbt-assembly to make a jar with the classes, confirmed using
jar tvf that the classes are there, and set SparkConf to distribute
the classes. The Spark Web UI indeed shows the assembly jar to be
added to the classpath:
http://172.x.x.x:47441/jars/myjar-assembly-1.0.jar
It seems that, despite the fact that myjar-assembly contains the
class, and is being added to the cluster, it's not reaching the
workers. How do I fix this? (Do I need to manually copy the jar file?
If so, to which dir? I thought that the point of the SparkConf add
jars was to do this automatically)
My attempts at debugging have shown:
The assembly jar is being copied to /root/spark/work/app-xxxxxx/1/
(Determined by ssh to worker and searching for jar)
However, that path doesn't appear on the worker's classpath
(Determined from logs, which show java -cp but lack that file)
So, it seems like I need to tell Spark to add the path to the assembly
jar to the worker's classpath. How do I do that? Or is there another culprit? (I've spent hours trying to debug this but to no avail!)
NOTE: EC2 specific answer, not a general Spark answer. Just trying to round out an answer to a question asked a year ago, one that has the same symptom but often different causes and trips up a lot of people.
If I am understanding the question correctly, you are asking, "Do I need to manually copy the jar file? If so, to which dir?" You say, "and set SparkConf to distribute the classes", but you are not clear whether this is done via spark-env.sh or spark-defaults.conf. So, making some assumptions, the main one being that you are running in cluster mode, meaning your driver runs on one of the workers and you don't know which one in advance... then...
The answer is yes, to the dir named in the classpath. In EC2 the only persistent data storage is /root/persistent-hdfs, but I don't know if that's a good idea.
In the Spark docs on EC2 I see this line:
To deploy code or data within your cluster, you can log in and use
the provided script ~/spark-ec2/copy-dir, which, given a directory
path, RSYNCs it to the same location on all the slaves.
SPARK_CLASSPATH
I wouldn't use SPARK_CLASSPATH because it's deprecated as of Spark 1.0 so a good idea is to use its replacement in $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath /path/to/jar/on/worker
This should be the option that works. If you need to do this on the fly, not in a conf file, the recommendation is "./spark-submit with --driver-class-path to augment the driver classpath" (from Spark docs about spark.executor.extraClassPath and see end of answer for another source on that).
BUT ... you are not using spark-submit ... I don't know how that works in EC2; looking at the script, I didn't figure out where EC2 lets you supply these parameters on a command line. You mention you already do this when setting up your SparkConf object, so stick with that if it works for you.
I see in Spark-years this is a very old question so I wonder how you resolved it? I hope this helps someone, I learned a lot researching the specifics of EC2.
I must admit, as a caveat on this, it confuses me that for spark.executor.extraClassPath the Spark docs say:
Users typically should not need to set this option
I assume they mean most people will set the classpath through a driver config option. I know most of the docs for spark-submit make it sound like the script handles moving your code around the cluster, but I think that's only in "standalone client mode", which I assume you are not using; I assume EC2 must be in "standalone cluster mode."
MORE / BACKGROUND ON SPARK_CLASSPATH deprecation:
More background that leads me to think SPARK_CLASSPATH is deprecated is this archived thread, and this one (which crosses the other thread), and this one about a WARN message when using SPARK_CLASSPATH:
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:
/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
You are required to register the jar with the Spark cluster while submitting your app; to make that possible, you can edit your code as follows.
val jars = new Array[String](1)  // assumes a single jar to ship; adjust as needed
jars(0) = "/usr/local/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar"
val conf: SparkConf = new SparkConf()
.setAppName("Busigence App")
.setMaster(sparkMasterUrl)
.setSparkHome(sparkHome)
.setJars(jars);
I'm new to Hadoop and now have to process an input file. I want to process each line, and the output should be one file for each line.
I surfed the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place. And I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs, as well as to interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will be run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (neither in 1.1.2 nor in 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some notes on using MultipleOutputs).
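For instance, here is a hedged sketch using the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output) to write one output file per key; it assumes a 1.x or 2.x Hadoop where that class exists, and the path derived from the key is just an example:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reducer that writes each key's values under a per-key path in the job output dir.
public class PerKeyReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The third argument is a base output path; deriving it from the key
            // plays the role of generateFileNameForKeyValue in the old API.
            mos.write(key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

On the driver side you set up the Job as usual; wrapping the output format with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) avoids empty default part files.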