deploy bolts/spout to a specific supervisor - apache-storm

We are running a Storm application using a single type of instance in AWS and a single topology to run our system.
This is causing some resource limitation issues.
The way we want to address this is by splitting our IO-intensive bolts across a cluster of a few dozen t1.small machines (for example) and putting all our CPU-intensive bolts on two large machines with lots of CPU and memory.
Basically, what I am asking is: is there a way to start all these supervisors and then deploy one topology that places the CPU-intensive bolts on the big machines and the IO bolts on the small machines?

You can write a custom scheduler by implementing the IScheduler interface.
See
http://www.exogeni.net/2015/04/enabling-site-aware-scheduling-for-apache-storm-in-exogeni/
https://dcvan24.wordpress.com/2015/04/07/metadata-aware-custom-scheduler-in-storm/
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/DemoScheduler.java
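The placement rule that such a scheduler applies can be sketched independently of Storm. The snippet below is only a model of the matching logic, not the Storm API: a real scheduler is a Java class implementing IScheduler, and the tags would typically come from supervisor metadata and topology configuration. All machine names, component names, and tags here are hypothetical.

```python
from collections import defaultdict


def place_components(component_tags, supervisor_tags):
    """Map each component to the supervisors whose tag matches its own.

    This models only the matching step of a metadata-aware scheduler;
    it does not talk to Storm.
    """
    by_tag = defaultdict(list)
    for supervisor, tag in supervisor_tags.items():
        by_tag[tag].append(supervisor)

    placement = {}
    for component, tag in component_tags.items():
        candidates = sorted(by_tag.get(tag, []))
        if not candidates:
            raise ValueError(f"no supervisor tagged {tag!r} for {component!r}")
        placement[component] = candidates
    return placement


# Hypothetical cluster: small machines tagged "io", large machines tagged "cpu".
supervisor_tags = {
    "small-1": "io",
    "small-2": "io",
    "big-1": "cpu",
    "big-2": "cpu",
}
# Hypothetical topology components and the tag each one asks for.
component_tags = {
    "kafka-spout": "io",
    "db-writer-bolt": "io",
    "scoring-bolt": "cpu",
}

placement = place_components(component_tags, supervisor_tags)
for component, supervisors in placement.items():
    print(component, "->", supervisors)
```

Inside a real scheduler, the same lookup would decide which supervisors' free slots are offered to the executors of each component.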

Related

NiFi Flow Files Stuck at Outbound Port - for remote connections

I maintain several high-capacity NiFi clusters that require regular maintenance. They are used a lot.
Recently I have run into an odd problem that I cannot fix, and I need some help.
For this example I am using Cluster A sending files to Cluster B in separate network domains.
Cluster B pulls files from Cluster A's outbound port C using a Remote Process Group.
Normally flow files arrive at C and wait there for a few seconds at most.
Recently, however, those same flow files have been hanging at C for several hours.
What is causing the files to hang for such a long time? I recently upgraded the cluster VMs to something more powerful, with more RAM and CPU cores. Do I need to change something in nifi.properties? Any help is appreciated.

Deploy 2 different topologies on a single Nimbus with 2 different hardware

I have 2 sets of Storm topologies in use today. One is up 24/7 and does its own work.
The other is deployed on demand and handles much bigger loads of data.
As of today, we have N supervisor instances, all on the same type of hardware (CPU/RAM). I'd like my on-demand topology to run on stronger hardware, but as far as I know, there's no way to control which supervisor is assigned to which topology.
So if I can't control it, it's possible that the 24/7 topology would be assigned one of the stronger workers.
Is there a way to do this? Any ideas?
Thanks in advance
Yes, you can control which topologies go where. This is the job of the scheduler.
You very likely want either the isolation scheduler or the resource aware scheduler. See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Storm-Scheduler.html and https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html.
The isolation scheduler lets you prevent Storm from running any other topologies on the machines you use to run the on demand topology. The resource aware scheduler would let you set the resource requirements for the on demand topology, and preferentially assign the strong machines to the on demand topology. See the priority section at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html#Topology-Priorities-and-Per-User-Resource.

Spark program to monitor the executors performance

I am working on a Spark program that monitors each executor's performance, for example by recording when an executor starts working and when it finishes its job. I am thinking of two ways to do that:
First, write the program so that when an executor starts work it writes the current time to a file, and when it finishes it writes that time to the same file. In the end, the "log" files will be spread across the whole cluster, on every machine except the driver.
Second, since executors report to the driver periodically, let the driver record everything: each time the driver receives a message from an executor, it checks whether the message contains "start" or "finish" information and logs it.
Is that possible?
There are many ways to monitor executor performance as well as application performance.
The best ways are the Spark Web UI and other open-source monitoring tools such as Ganglia.
You need to monitor whether your cluster is under-utilized and how many resources your application is using.
Monitoring can be done with various tools. From Ganglia, for example, you can get CPU, memory, and network usage. Based on the observed CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
Hope this helps!
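Besides the Web UI pages, a running Spark application also serves the same data as JSON under /api/v1 on the driver's UI port, which is close to the second idea in the question (let the driver hold the records). The sketch below assumes the executor list has already been fetched from /api/v1/applications/<app-id>/executors and that totalDuration is the time spent in tasks, in milliseconds. The sample data is made up, and fetch_executors is defined for illustration only and is never called here. For exact per-task start and finish times, a SparkListener registered on the driver (onTaskStart / onTaskEnd) would be the closer fit.

```python
import json
from urllib.request import urlopen


def fetch_executors(base_url, app_id):
    """Fetch executor summaries from the driver's REST API (not called below)."""
    url = f"{base_url}/api/v1/applications/{app_id}/executors"
    with urlopen(url) as resp:
        return json.load(resp)


def summarize(executors):
    """Reduce each executor summary to a few per-executor performance numbers."""
    rows = []
    for e in executors:
        done = e.get("completedTasks", 0)
        total_ms = e.get("totalDuration", 0)
        rows.append({
            "id": e["id"],
            "completed": done,
            "failed": e.get("failedTasks", 0),
            # Average task time; None when the executor has finished no tasks.
            "avg_task_ms": total_ms / done if done else None,
        })
    return rows


# Made-up sample in the shape returned by the executors endpoint.
sample = [
    {"id": "driver", "completedTasks": 0, "failedTasks": 0, "totalDuration": 0},
    {"id": "1", "completedTasks": 40, "failedTasks": 1, "totalDuration": 20000},
]

report = summarize(sample)
for row in report:
    print(row)
```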

How to select the CPU parameter for Marathon apps run on Mesos?

I've been playing with a Mesos cluster for a little bit, and I'm thinking of using Mesos in our production environment. One problem I can't seem to find an answer to: how do I properly schedule long-running apps that have varying load?
Marathon has a "CPUs" property, where you can set a weight for CPU allocation to a particular app. (I'm planning on running Docker containers.) But from what I've read, it is only a weight, not a reservation, allocation, or limitation that I am setting for the app. It can still use 100% of the CPU on the server if it's the only thing running. The problem is that for long-running apps, resource demands change over time. A web server's load, for example, is directly proportional to its traffic. Coupled with Mesos treating this setting as a "reservation," I am choosing between 2 evils: set it too low, and Mesos may start too many processes on the same host and all of them will suffer, with host CPU going past 100%. Set it too high, and the CPU will sit idle, because a reservation has been made (or so Mesos thinks) but nothing is using those resources.
How do you approach this problem? Am I missing something in how Mesos and Marathon handle resources?
I was thinking of an ideal way of doing this:
Specify weight for CPU for different apps (on the order of, say, 0.1 through 1), so that when going gets tough, higher priority gets more (as is right now)
Have Mesos slave report "Available LA" with its status (e.g. if 10 minute LA is 2, with 8 CPUs available, report 6 "Available LA")
Configure Marathon to require "Available LA" resource on the slave to schedule a task (e.g. don't start on particular host if Available LA is < 2)
When available LA goes to 0 (due to influx of traffic at the same time as some job was started on the same server before the influx) - have Marathon move jobs to another slave, one that has more "Available LA"
Is there a way to achieve any of this?
So far, I gather that I can possibly write a custom isolator module that will run on the slaves and report this custom metric to the master. Then I can use it in resource negotiation. Is this true?
I wasn't able to find anything on Marathon rescheduling tasks on different nodes if one becomes overloaded. Any suggestions?
As of Mesos 0.23.0 oversubscription is supported. Unfortunately it is not yet implemented in Marathon: https://github.com/mesosphere/marathon/issues/2424
In order to dynamically do allocation, you can use the Mesos slave metrics along with the Marathon HTTP API to scale, for example, as I've done here, in a different context. My colleague Niklas did related work with nibbler, which might also be of help.
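A minimal sketch of that approach, under some assumptions: the slave's /metrics/snapshot endpoint is taken to return a flat JSON object that includes "system/cpus_total" and "system/load_5min", and "Available LA" is computed as in the question (total CPUs minus load average). The threshold, the scale-up-by-one rule, the Marathon URL, and the app id are hypothetical. The request is only built here, never sent.

```python
import json
from urllib.request import Request


def available_la(metrics, window="system/load_5min"):
    """CPU headroom on one slave: total CPUs minus the load average."""
    return metrics["system/cpus_total"] - metrics[window]


def desired_instances(current, headrooms, threshold=2.0, max_instances=20):
    """Add one instance when any slave's headroom drops below the threshold."""
    if headrooms and min(headrooms) < threshold:
        return min(current + 1, max_instances)
    return current


def scale_request(marathon_url, app_id, instances):
    """Build (but do not send) a PUT /v2/apps/<app_id> request for Marathon."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        f"{marathon_url}/v2/apps/{app_id.strip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The example from the question: 8 CPUs with a load average of 2.
metrics = {"system/cpus_total": 8, "system/load_5min": 2.0}
headroom = available_la(metrics)

# One slave with plenty of headroom, one that is nearly saturated.
target = desired_instances(current=3, headrooms=[headroom, 0.5])
req = scale_request("http://marathon.example:8080", "/web", target)
print(req.get_method(), req.full_url, req.data)
```

Sending the request with urllib.request.urlopen(req) would ask Marathon to change the instance count. Marathon still decides where the new task lands, so this scales the app rather than moving a task off a specific overloaded host.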

How do you setup multiple Spark Streaming jobs with different batch durations?

We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (Why is this?). Do we need to therefore have multiple Spark Clusters, or, allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!
In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that variety, we need to configure each Spark Streaming job for its specific requirements: not only the batch interval but also resources such as memory and CPU, data partitioning, and the number of executing nodes (when the loads are network bound).
It follows that each Spark Streaming job becomes a separate job deployment on a Spark Cluster. That will also allow for monitoring and management of separate pipelines independently of each other and help in the further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.
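As a rough illustration of one deployment per pipeline, the sketch below builds a separate spark-submit command for each stream, with its own batch interval and resources. The jar name, main class, master URL, pipeline names, and the --batch-seconds application argument are all hypothetical. The application itself would have to read that argument and pass it to its StreamingContext.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    batch_seconds: int
    executor_memory: str
    max_cores: int


def submit_command(p, jar="etl.jar", main_class="com.example.EtlJob",
                   master="mesos://master.example:5050"):
    """Build the spark-submit argument list for one streaming pipeline."""
    return [
        "spark-submit",
        "--master", master,
        "--name", p.name,
        "--class", main_class,
        "--conf", f"spark.executor.memory={p.executor_memory}",
        "--conf", f"spark.cores.max={p.max_cores}",
        jar,
        # Hypothetical application argument, parsed by the job itself.
        "--batch-seconds", str(p.batch_seconds),
    ]


# Two hypothetical pipelines with different latency and resource needs.
pipelines = [
    Pipeline("clicks-to-hdfs", batch_seconds=60, executor_memory="2g", max_cores=4),
    Pipeline("alerts-to-kafka", batch_seconds=2, executor_memory="4g", max_cores=8),
]

commands = [submit_command(p) for p in pipelines]
for cmd in commands:
    print(" ".join(cmd))
```

Each command can then be handed to whatever keeps the jobs running, such as the Marathon setup mentioned above.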
