About the effect of parallelism in StormCrawler - apache-storm

I am currently working on a Storm Crawler based project. We have a fixed and limited amount of bandwidth for fetching page from the web. We have 8 worker with a large value for parallelism hint for different Bolt in the topology (i.e. 50). So lots of thread created for fetching the page. Is there any relation between increasing number of fetch_error and increasing parallelism_hint in the project? How can I determine the good value for the parallelism_hint in the Storm Crawler?

The parallelism hint is not something that should be applied to all bolts indiscriminately.
Ideally, you need one instance of FetcherBolt per worker, so in your case 8. As you've probably read in the WIKI or seen in the conf, the FetcherBolt handles internal threads for fetching. This is determined by the config fetcher.threads.number which is set to 50 in the archetypes' configurations (assuming this is what you used as a starting point).
Using too many FetcherBolt instances is counterproductive. It is better to change the value of fetcher.threads.number instead. If you have 50 Fetcher instances with a default number of threads of 50, that would give you 2500 fetching threads which might be too much for your available bandwidth.
As I mentioned before you want 1 FetcherBolt per worker, the number of internal fetching threads per bolt depends on your bandwidth. There is no hard rule for this, it depends on your situation.
One constant I have observed however is the ratio of parsing bolts to Fetcher bolts; usually, 4 parsers per fetcher works fine. Run Storm in deployed mode and check the capacity value for the parser bolts in the UI. If the value is 1 or above, try using more instances and see if it affects the capacity.
In any case, not all bolts need the same level of parallelism.

Related

Flink: how data is split in parallel tasks

I have a job with parallelism 2; it gets data from a kafka topic and, after keying, it handles timers in a stateful function.
I observed that sometimes one parallelized instance gets stuck: as a result timers do not trigger until a new message arrives, moving forward the current watermark for that parallel instance.
How does Flink split data between parallel instances?
Is there a metric to explore to get a quick view of how messages are split? (in percent or a count)
A part from reducing parallelism to 1, is there any other tip to solve this issue?
Thanks
With the Kafka source, it depends on the number of partitions. So setting the parallelism higher than the number of partitions will stop the watermark moving forward. In your case, as you mentioned it only gets stuck sometimes, probably one of the partitions didn't receive data for a bit which again stops the watermark.
To solve this issue, you can use withIdleness with your watermark strategy, more details can be found in the docs.

Is there a particular reason that the pendingEmits queue is limited to 1024 elements

I am working with Storm v1.2.1 on a cluster of 5 r4.xlarge EC2 nodes. Currently, I am crunching a network dataset that involves queries time-based sliding windows. After numerous trial-and-error cycles for figuring out a good-enough configuration for my use-case, I came across the Executor class, which maintains a member named pendingEmits of type MpscChunkedArrayQueue<AddressedTuple> (line 119 in storm-client module, class: org.apache.storm.executor.Executor). This queue has a hard-coded upper-bound of 1024 elements.
Every time I tried a configuration with my dataset, I would receive an IllegalStateException when Storm would attempt to add an acknowledgement tuple to pendingEmits with full capacity. In order to avoid getting the exception, I increased the hard-coded size of pendingEmits to 16534. This seems to be working (for now).
Why is pendingEmits's maximum size set to 1024? Is it because of performance, or was it a random decision?
I am skeptical about this decision, because if a window consists of more than 1024 tuples (in my case each window is about 2700 tuples), the queue will become full, and the IllegalStateException will be thrown.
By increasing pendingEmits maximum size, do I jeopardize other aspects (components) of Storm?
Thank you!
I'm not sure why 1024 exactly was picked (likely for performance as you mention), but if you pull the latest version of Storm, it should be fixed https://github.com/apache/storm/pull/2676.

Apache NiFi tuning issues

I've developed a NiFi flow prototype for data ingestion in HDFS. Now I would like to improve the overall performances but it seems I cannot really move forward.
​
​The flow takes in input csv files (each row has 80 fields), split them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into csv files, outputs them into HDFS. I've developed the processors in such a way the content of the flow file is accessed only once when each individual record is read and its fields are moved to flowfile attributes. Tests have been performed on a amazon EC2 m4.4xlarge instance (16 cores CPU, 64 GB RAM).
​​This is what I tried so far:
​​Moved the flowfile repository and the content repository on different SSD drives
Moved the provenance repository in memory (NiFi could not keep up with the events rate)
Configuring the system according to the ​configuration best practices
I've tried assigning multiple threads to each of the processors in order to reach different numbers of total threads
I've tried increasing the nifi.queue.swap.threshold and setting backpressure to never reach the swap limit
Tried different JVM memory settings from 8 up to 32 GB (in combination with the G1GC)
I've tried increasing the instance specifications, nothing changes
From the monitoring I've performed it looks like disks are not the bottleneck (they are basically idle a great part of the time, showing the computation is actually being performed in-memory) and the average CPU load is below 60%.
​The most I can get is 215k rows/minute, which is 3,5k rows/second. In terms of volume, it's just 4,7 MB/s. I am aiming to something definitely greater than this.
​
​Just as a comparison, I created a flow that reads a file, splits it in rows, merges them together in blocks and outputs on disk. Here I get 12k rows/second, or 17 MB/s. Doesn't look surprisingly fast too and let me think that probably I am doing something wrong.
​
​Does anyone has suggestions about how to improve the performances? How much will I benefit from running NiFi on cluster instead of growing with the instance specs? Thank you all
It turned out the poor performances were a combination of both the custom processors developed, and the merge content built-in processor. The same question mirrored on the hortonworks community forum got interesting feedback.
Regarding the first issue, a suggestion is to add the SupportsBatching annotation to the processors. This allows the processors to batch together several commits, and allows the NiFi user to favor latency or throughput with the processor execution from the configuration menu. Additional info can be found on the documentation here.
The other finding was that the MergeContent built-in processor doesn't seem to have optimal performances itself, therefore if possible one should consider modifying the flow and avoid the merging phase.

Impact of more executors than cpu/cores in Storm Cluster

I have started using Apache Storm recently. Right now focusing on some performance testing and tuning for one of my applications (pulls data from a NoSQL database, formats, and publishes to a JMS Queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology in terms of changing no. of bolts, MAX_SPENDING_SPOUT etc. and to throttle data flow within topology using some ticking approach.
I wanted to know what happens when we define more parallelism than the no of cores we have. In my case I have a single node, single worker topology and the machine has 32 cores. But total no of executors (for all the spouts and bolts) = 60. So my questions are:
Does this high number really helps processing requests or is it actually degrades the performance, since I believe there will more context switch between bolt tasks to utilize cores.
If I define 20 (just a random selection) executors for a Bolt and my code flow never needs to utilize the Bolt, will this be impacting performance? How does storm handles this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor fully utilizes a core completely, having more executors cannot give you any throughput improvements. If there is any impact, it might be negative (also with regard to contention of internally used queues to which all executers need to read from and write into for tuple transfer).
If you load is "small" and does not fully utilize your CPUs, it wound matter either -- you would not gain or loose anything -- as your cores are not fully utilized you have some left over head room anyway.
Furthermore, consider that Storm spans some more threads within each worker. Thus, if your executors fully utilize your hardware, those thread will also be impacted.
Overall, you should not run your topologies to utilize core completely anyway but leave form head room for small "spikes" etc. In operation, maybe 80% CPU utilization might be a good value. As a rule of thumb, one executor per core should be ok.

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have a existing setup where upstream systems send messages to us on a Message Queue and we process these messages.The content is xml and we simply unmarshal.This unmarshalling step is followed by a write to db (to put relevant values onto relevant columns).
The system is set to interface with many more upstream systems and our volumes are going to increase to a peak size of 40mm per day.
Our current way of processing is have listeners on the queues and then have a multiple threads of producers and consumers which do the unmarshalling and subsequent db write.
My question : Can this process fit into the Storm use case scenario?
I mean can MQ be my spout and I have 2 bolts one to unmarshal and this then becomes the spout for the next bolt which does the write to db?
If yes,what is the benefit that I can derive? Is it a goodbye to cumbersome multi threaded producer/worker pattern of code.
If its as simple as the above then where/why would one want to resort to the conventional multi threaded approach to producer/consumer scenario
My point being is there a data volume/frequency at which Storm starts to shine when compared to the conventional approach.
PS : I'm very new to this and trying to get a hang of this and want to ascertain if the line of thinking is right
Regards,
CVM
Definitely this scenario can fit into a storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over conventional multi threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer consumer patterns.
Specific data volume number is a very broad question since it depends on a large number of factors like hardware etc.

Resources