Apache Storm 2.1.0 memory-related configurations

We are in the process of migrating to 2.1.0 from 1.1.x.
In our current setup we have the following memory configurations in storm.yaml:
nimbus.childopts: -Xmx2048m
supervisor.childopts: -Xmx2048m
worker.childopts: -Xmx16384m
I see many other memory-related configs in https://github.com/apache/storm/blob/master/conf/defaults.yaml, and have the following questions about them.
What is the difference between worker.childopts and topology.worker.childopts? If we are setting worker.childopts in storm.yaml, do we still have to override topology.worker.childopts?
If we are setting worker.childopts in storm.yaml, do we still have to override worker.heap.memory.mb? Is there a relationship between these two configs?
Should topology.component.resources.onheap.memory.mb be smaller than the -Xmx value in worker.childopts? How should we decide the value of topology.component.resources.onheap.memory.mb?
I would appreciate it if someone could explain these points.

I have recently fiddled with some of these configs myself, so I am sharing my insights here:
worker.childopts vs topology.worker.childopts: the first parameter sets the childopts for all workers on the cluster. The second parameter can be used to override them for individual topologies, e.g. by using conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "someJvmArgsHere");
The default value for worker.childopts is "-Xmx%HEAP-MEM%m -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump" according to the Storm Git repository. Pay attention to the first argument: it includes a replacement pattern, %HEAP-MEM%. This pattern is replaced with whatever you configure for worker.heap.memory.mb. You can override the latter parameter from inside a topology configuration in Java, so I assume it was built that way to allow quickly adjusting the Java heap for individual topologies (see the sketch after the next paragraph). One thing I noticed is that, when overriding, Storm only seems to use the override value if at least one spout or bolt is configured with .setMemoryLoad(int heapSize).
This highly depends on the individual topology's needs, but in general it is most likely a good idea to keep topology.component.resources.onheap.memory.mb smaller than whatever you have configured for -Xmx in worker.childopts. Finding a good value for topology.component.resources.onheap.memory.mb comes down to testing and knowledge of the memory consumption of your topology's components. For instance, I have a topology which receives tuples from Redis and emits them; if the bolts are busy, tuples may pile up in the spout, so I configure the spout with some memory headroom. However, I normally do not modify topology.component.resources.onheap.memory.mb but rather use the setMemoryLoad(int heapSize) method of a topology's components, as this allows setting different values for the individual components of the topology. The Storm documentation on the Resource Aware Scheduler covers this and related topics.
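As a minimal sketch of both override mechanisms (the spout and bolt classes, names and all numbers are hypothetical placeholders, not recommendations):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class HeapOverrideExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // RedisSpout and ProcessingBolt stand in for your own components.
        // setMemoryLoad(onHeapMb) requests on-heap memory per executor of that component.
        builder.setSpout("redis-spout", new RedisSpout(), 2).setMemoryLoad(768.0);
        builder.setBolt("processing-bolt", new ProcessingBolt(), 4)
               .shuffleGrouping("redis-spout")
               .setMemoryLoad(512.0);

        Config conf = new Config();
        // Per-topology JVM options; workers of other topologies keep worker.childopts from storm.yaml.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-XX:+UseG1GC");
        // Per-topology value substituted into the %HEAP-MEM% placeholder of worker.childopts
        // (same key as worker.heap.memory.mb in defaults.yaml).
        conf.put("worker.heap.memory.mb", 4096);

        StormSubmitter.submitTopology("heap-override-example", conf, builder.createTopology());
    }
}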

Related

How to update the configuration of an Apache NiFi processor without stopping it?

Good morning. I'm using Apache NiFi, and I wonder if anyone knows of a way to change the settings of a processor without having to stop it, or some viable alternative to prevent the loss of information.
Thanks
The configuration of a processor cannot be changed while the processor is running, and this is intentional. It provides a guarantee to the developer of a processor that, in the onTrigger method, all the properties have the same values that passed validation when the processor was started.
If you can describe your use-case more we might be able to come up with alternative approaches.
There is an alternative solution: duplicate the processor and set the duplicate's configuration to the desired one. The output of the duplicate is connected to the next processor. The original processor is then stopped, its queued connection is reconnected to the duplicate, and the duplicate is turned on.
One way or another the data flow has to be interrupted, but this way the changes that take more time to make in the processor can be made in the duplicate first, in order to reduce the impact of the interruption as much as possible.
Regards

How to set configurations to make a Spark/YARN job faster?

I am new to Spark. I have been reading about Spark config and the different properties to set so that we can optimize the job. But I am not sure how to figure out what I should set.
For example, I created a cluster of r3.8xlarge instances (1 master and 10 slaves).
How do I set:
spark.executor.memory
spark.driver.memory
spark.sql.shuffle.partitions
spark.default.parallelism
spark.driver.cores
spark.executor.cores
spark.memory.fraction
spark.executor.instances
Or should I just leave the defaults? But leaving the defaults makes my job very slow. My job has 3 group bys and 3 broadcasted maps.
Thanks
For tuning your application you need to know a few things:
1) You need to monitor your application: whether your cluster is under-utilized or not, and how many resources are used by the application you have created.
Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.
2) Based on the observed CPU and memory usage you can get a better idea of what kind of tuning is needed for your application.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
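As a rough, hedged starting point for the cluster in the question (assuming r3.8xlarge nodes with 32 vCPUs and 244 GiB RAM each, and the common heuristic of roughly 5 cores per executor; validate these against monitoring rather than treating them as a recipe):
spark.executor.cores 5
spark.executor.instances 59
spark.executor.memory 32g
spark.driver.memory 8g
spark.driver.cores 4
spark.default.parallelism 600
spark.sql.shuffle.partitions 600
The arithmetic behind this sketch: 6 executors per node times 10 slaves gives 60, minus one slot reserved for the YARN application master; executor memory is roughly (244 GiB minus OS/YARN overhead) divided by 6 executors, leaving room for the YARN memory overhead on top of each executor; parallelism and shuffle partitions are set to about 2x the total executor cores (59 * 5, roughly 300).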
For more details refer to http://spark.apache.org/docs/latest/tuning.html
Hope this Helps!!

Which config is applied when the number of matching run modes is the same

I am using OSGI config files to define configuration for different environments, as specified in OSGI Configuration. I have configurations for multiple run modes saved in the same repository. The documentation states
"If multiple configurations for the same PID are applicable, the
configuration with the highest number of matching run modes is
applied."
What is the mechanism if multiple configurations for the same PID are applicable and two or more configurations are tied for the highest number of matching run modes? Which one gets applied?
The order of OSGi configs is handled by Apache Sling. Sling has a system that determines priority for Installable Resources, which includes OSGi configurations.
Out of the box, the most influential factor in calculating the priority is the root folder: /apps vs /libs. See the JcrInstaller and its configuration on your localhost at http://localhost:4502/system/console/configMgr/org.apache.sling.installer.provider.jcr.impl.JcrInstaller. The difference between the /libs and /apps "points" is large, at 100 ({"/libs:100", "/apps:200"}).
After the root priority is determined, the Sling run modes are added up. See org.apache.sling.installer.provider.jcr.impl.FolderNameFilter#getPriority. Each run mode is valued at 1 "point" regardless of order. For example, at this point if you have run modes alpha and bravo, config.alpha.bravo is equal to config.bravo.alpha.
Priority then looks at certain things such as the Resource State, whether the resource is installed or not, and whether the resource is a SNAPSHOT version, which will probably apply more to bundles than to configurations in your project. Ultimately, the comparison of the OSGi configs comes down to a lexicographic string comparison of the URLs. Going back to our example, at this point config.alpha.bravo has a higher priority than config.bravo.alpha.
Should the OSGi configs be lexicographically equal, the final comparison is of an MD5 digest. See org.apache.sling.installer.provider.jcr.impl.ConfigNodeConverter#computeDigest.
See the full comparison function at org.apache.sling.installer.core.impl.RegisteredResourceImpl#compare.
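As a hypothetical illustration (the repository paths and PID are made up), with active run modes author and prod and the default folder points mentioned above, the priorities add up as follows:
/libs/example/config.author/com.example.MyService       -> 100 + 1 = 101
/apps/example/config.author/com.example.MyService       -> 200 + 1 = 201
/apps/example/config.prod.author/com.example.MyService  -> 200 + 2 = 202
/apps/example/config.author.prod/com.example.MyService  -> 200 + 2 = 202, and wins the tie against config.prod.author in the lexicographic comparison, following the alpha/bravo example above.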

Using Cassandra Secondary Index with Hadoop

Cassandra 1.1 documentation says that it is now possible to use secondary indexes to get a slice of rows for Hadoop processing. Does this mean that it is now possible to achieve this while using RandomPartitioner, unlike earlier versions where OrderedPartitioner was required? However, going through the ColumnFamilyInputFormat code I still see an assertion enforcing that an OrderedPartitioner must be in place for this to work. Any ideas on this?

How to Throttle DataStage

I work on a project where a number of DataStage sequences can be run in parallel; one in particular performs poorly and takes a lot of resources, impacting the shared environment. A performance tuning initiative is in progress but will take time.
In the meantime I was hoping that we could throttle DataStage to restrict the resources that can be used by this particular job/sequence; however, I'm not personally experienced with DataStage specifically.
Can anyone comment on whether this facility exists in DataStage (v8.5 I believe), and point me in the direction of some further detail?
Secondly, I know that we can throttle based on the user (I think this ties into AIX 'ulimit', but I'm not sure). Is it easy/possible to run different jobs/sequences as different users?
In this type of situation, resources for a particular job can be restricted by specifying the number of nodes and resources in a configuration file. This is possible in 8.5, and you may find something at www.datastagetips.com.
Revolution_In_Progress is right.
DataStage PX has the notion of a configuration file. That file can be specified for all the jobs you run, or it can be overridden on a job-by-job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with the performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
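As a rough sketch of what such a 2-node configuration file could look like (the hostname and paths are placeholders to adapt to your environment):
{
    node "node1"
    {
        fastname "etl-server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl-server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}
The problem job can then be pointed at this file through the APT_CONFIG_FILE environment variable (for example as a job parameter), leaving the default configuration file in place for the other jobs.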
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.parjob.tut.doc/module5/lesson5.1exploringtheconfigurationfile.html
A sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analysis to remove the jobs that need not run in parallel with critical jobs.
