Running multiple topologies on StormCrawler with status index - apache-storm

My use case:
I have a number of domains to crawl, each with its own filter configuration, and each domain is currently running as its own topology.
I see that a few domains have crawled about 10M URLs and have another 50M in the status queue.
The other topologies are sitting idle after fetching just the seed URL.
Each topology has been given 2 GB of RAM, 10 threads per queue, a maximum of 50 buckets, and 100 URLs per bucket.
What could be the reason for the topologies to sit idle? I suspect the high number of URLs in the status queue in the "discovered" state.

What do you use as a backend? If it is ES, then you should be able to use Kibana to inspect the status index and see what happens to the seeds for those idle crawls. It could be that they are blocked by robots.txt and can't progress any further.
I'd use a single status index and a single topology for all the domains - that would be easier to manage and monitor. The URL filtering can easily be set per seed, e.g. by defining a separate URL filter file per domain within the filter's configuration, or even keeping all the rules in the same URL filter file.
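As a rough illustration only (the exact syntax depends on which filter class you configure in urlfilters.json and your StormCrawler version), a Nutch-style regex filter file scoped to one domain might look like this, with + meaning allow and - meaning deny; the file and domain names are hypothetical:

# regex-filters-domainA.txt - per-domain rules (hypothetical)
# skip URLs with query strings or session-style characters
-[?*!@=]
# skip this domain's login pages
-^https?://www\.domainA\.com/login
# accept everything else on the domain
+^https?://([a-z0-9-]+\.)*domainA\.com/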

Related

Is there any way to check the current bulk queue size in OpenSearch?

My OpenSearch sometimes returns the error "429 Too Many Requests" when writing data. I know there is a queue, and when that queue is full it returns this error. So is there any API to check the bulk queue status and current size...? Example: queue 150/200 (nearly full)
Yes, you can use the following API call
GET _cat/thread_pool?v
You will get something like this, where you can see the node name, the thread pool name (look for write), the number of active requests currently being carried out, the number of requests waiting in the queue and finally the number of rejected requests.
node_name name active queue rejected
node01 search 0 0 0
node01 write 8 2 0
The write thread pool can have as many active requests as 1 + the number of CPUs, i.e. that many requests can be processed at the same time. If all of them are busy and new requests come in, they go into the queue (default size 10000). If both the active slots and the queue are full, requests start to be rejected.
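If you want something closer to the "queue 150/200" view from the question, you can also ask the same _cat endpoint for the queue capacity, restricted to the write pool (the queue_size column is supported by Elasticsearch and recent OpenSearch versions):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,queue_size,rejected

Here queue is the current number of waiting requests and queue_size is the maximum the queue will accept before rejections start.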
Your mileage may vary, but when optimizing this, you're looking at:
keeping rejected at 0
minimizing the number of requests in the queue
making sure that active requests get carried out as fast as possible.
Instead of increasing the queue, it's usually preferable to increase the number of CPUs. If you have heavy ingest pipelines kicking in, it's often a good idea to add dedicated ingest nodes so that those pipelines run there instead of on the data nodes.

NiFi - data stuck in queues when load balancing is used

In Apache NiFi (dockerized version 1.15), a cluster of 3 NiFi nodes is created. When load balancing is used via the default port 6342, flow files get stuck in some of the queues, specifically the queues on which load balancing is enabled. But when "List queue" is tried, the message "The queue has no FlowFiles." is issued:
The part of the NiFi processor group where the issue happens:
Configuration of NiFi queue in which flow files seem to be stuck:
Another problem, maybe not related, is that after this happens, some of the flow files reach the subsequent NiFi processors, but get stuck before the MergeContent processors. This time, the queues can be listed:
The part of the flow where the second issue occurs:
The configuration of the queue:
The listing of the FlowFiles in the queue:
The MergeContent processor configuration. The parameter "max_num_for_merge_smxs" is set to 100:
Load balancing is used because data are gathered from the SFTP server, and that processor runs only on the Primary node.
If you need more information, please let me know.
Thank you in advance!
Edited:
I put the load-balancing queues between the ConsumeMQTT processor (running on the Primary node only) and the UpdateAttribute processors. Flow files seem to stay in the load-balancing queue, yet when the listing is done, the message is "The queue has no FlowFiles." Please check:
Changed position of the load-balancing queue:
The message that there are no flow files in the queues:
Take notice that the processors before and after the queue are stopped while doing "List queue".
Edit 2:
I changed the configuration in the nifi.properties to the following:
nifi.cluster.load.balance.connections.per.node=20
nifi.cluster.load.balance.max.thread.count=60
nifi.cluster.load.balance.comms.timeout=30 sec
I also restarted the NiFi containers, so I will monitor the behaviour. For now, there are no stuck flow files in the load-balancing queues; they go on to the processor that follows the queue.
"The queue has no FlowFiles" is normal behaviour of a queue that is feeding into a Merge - the flowfiles are pending to be merged.
The most likely cause of them being "stuck" before a Merge is that you have Round Robin distributed the FlowFiles across many nodes, and then you are setting a Minimum count on the Merge. This minimum is per node and there are not enough FlowFiles on each node to hit the Minimum, so they are stuck waiting for more FlowFiles to trigger the Merge.
-- Edit
"The queue has no FlowFiles" is also expected on a queue that is active - in your flow, the load balancing queue is drained immediately into the output queue of your merge PGs Input port - so there are no FFs sitting around in the load balancing queue. If you were to STOP the Input ports inside the merge PG, you should be able to list them on the LB queue.
It sounds like you are doing GetSFTP (Primary) and then distributing the files. The better approach would be to use ListSFTP (Primary) -> Load Balance -> FetchSFTP - this would avoid shuffling large files, and would instead load balance the file names between all nodes, with each node then fetching a subset of the files.
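A sketch of that pattern (ListSFTP and FetchSFTP are the standard NiFi processors; the load-balance setting lives on the connection between them):

ListSFTP      - Execution: Primary node only; emits one FlowFile per remote file name
   | success  - Load Balance Strategy: Round Robin
   v
FetchSFTP     - runs on every node; each node fetches only the file names routed to it
   | success
   v
downstream processing / MergeContent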
Secondly, I would review your Merge config - you have a parameter #{max_num_for_merge_xmsx} defined, but it is set as the Minimum Number of Entries for the Merge - so you are telling Merge to only ever merge once at least #{max_num_for_merge_xmsx} FlowFiles have accumulated.
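For example, a common way to configure MergeContent so bins still flush on nodes that never reach the minimum is to rely on Max Bin Age (the property names below are MergeContent's standard ones; the values are purely illustrative):

Merge Strategy:             Bin-Packing Algorithm
Minimum Number of Entries:  1
Maximum Number of Entries:  #{max_num_for_merge_xmsx}
Max Bin Age:                5 min   (forces a merge even if the bin never fills up)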

What happens if I give a large value to server.tomcat.max-threads to handle load on my application?

There are around 1000+ jobs running through our service in a day, with around 70-80 jobs starting at the same time and running in parallel.
To handle this, we thought that setting the server.tomcat.max-threads property of our Spring application to a large number should work, but I am not confident about the side effects of giving this property a huge value like 800.
Can you please help here?
The default installation of Tomcat sets the maximum number of HTTP servicing threads at 200. Effectively, this means that the system can handle a maximum of 200 simultaneous HTTP requests. When the number of simultaneous HTTP requests exceeds this count, the unhandled requests are placed in a queue, and the requests in this queue are serviced as processing threads become available. This default queue length is 100. At these default settings, a large web load that can generate over 300 simultaneous requests will surpass the thread availability, resulting in service unavailable (HTTP 503).
More reference: https://docs.bmc.com/docs/brid91/en/tomcat-container-workload-configuration-825210082.html
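If you do raise the limits, the relevant Spring Boot properties look roughly like this; note that the names changed in Spring Boot 2.3 (server.tomcat.max-threads became server.tomcat.threads.max), and the values below are illustrative rather than recommendations:

server.tomcat.threads.max=400        # max HTTP worker threads (default 200); pre-2.3: server.tomcat.max-threads
server.tomcat.threads.min-spare=20   # threads kept ready (default 10); pre-2.3: server.tomcat.min-spare-threads
server.tomcat.accept-count=200       # connection backlog once all threads are busy (default 100)
server.tomcat.max-connections=8192   # total connections accepted at one time (default 8192)

Every extra thread costs stack memory (roughly 1 MB by default) and more context switching, and it pushes the concurrency down to whatever the threads call (database pools, downstream APIs), so raising it far beyond what the CPU and those dependencies can serve tends to increase latency rather than throughput.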
How to run multiple servlet executions in parallel in Tomcat?
If this is a batch job like configuration, you can use spring batch.

Prevent ElasticSearch to Crash at Large Number of Concurrent Request

I am using Elasticsearch 6.2.1 with a single-node cluster. It works fine with my small indices and medium traffic. But when I test a large number of concurrent requests using Apache JMeter, ES goes down with an error message like the one below.
My requirement is to prevent ES from crashing even in such a high-traffic situation. It should discard requests after a certain point but not stop working altogether. Is there any option by which I can achieve this? Please advise.
If the spike in requests only lasts a few seconds, you can increase the queue size of the affected thread pool (for example the search thread pool); otherwise you should add nodes to the cluster.
(Please add some logs of Elasticsearch crashing. Do you have an out-of-memory exception?)
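If you do go the queue route on 6.x, the size is a static setting in elasticsearch.yml (it needs a node restart; the value is only an example):

thread_pool.search.queue_size: 2000

Bear in mind a bigger queue only buys time and keeps more requests in memory, so it will not save an undersized single node from memory pressure.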
Are you sure Elasticsearch is crashing? Here, it's saying the search thread pool is full.
More read at https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html.

Validate newly created server support the same load

We are creating a new hosted server for one of our APIs on managed containers (Kubernetes), and we're trying to validate that it can handle at least the same request load as the current one.
We've started with one of the APIs, where we would need to handle at least 140k requests per minute, all endpoints combined.
To verify this, I created a simple JMeter test as follows:
-Test Plan
---Thread Group Endpoint1
-----HTTP Request -> a GET request with query params for /path1
---Thread Group Endpoint2
-----HTTP Request -> a GET request with query params for /path2
For a local test, I used the following setup:
Thread Groups Endpoint1 and Endpoint2 are set to 200 threads (users), ramp-up period of 1s, loop count = forever and duration 60s.
Using a Summary Report listener when running the test gets me a total of ~9300 # Samples.
Using this approach, is it safe to just increase the number of threads (users) for the Thread Groups until I reach the desired 140k requests per minute?
Note: I only used JMeter a little before, so I'm aware that the entire approach may be wrong, therefore any suggestions and steering to the right path are more than welcomed.
Your approach is viable as long as it represents real-life application usage. If the API has 2 endpoints with an evenly distributed load, your setup is just fine. If there are more endpoints and some of them are used more than others, consider defining the workload accordingly, either with different Thread Groups or with another distribution mechanism such as the Throughput Controller.
Increasing the number of threads is also fine; however, consider increasing the load gradually, i.e. increase the ramp-up time so your test has:
Arrivals phase
Time to hold the load
Ramp-down phase
This way you will be able to correlate various metrics like increasing response time, throughput, number of errors, etc. with the increasing load. You will also be able to state the number of threads/requests per second at which the system reached its saturation/breaking point, and whether it recovers when the load drops back.
Also make sure you're following JMeter Best Practices, as ~2,300-2,500 requests per second is not something JMeter can support out of the box; you will need to do some tuning, at the very least increasing the JVM heap size allocated to JMeter.
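For example, with the Apache JMeter startup script you can override the heap via the HEAP environment variable and run the plan in non-GUI mode (the file names are placeholders; adjust the heap to your machine):

HEAP="-Xms4g -Xmx4g" ./jmeter -n -t test_plan.jmx -l results.jtl -e -o report/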
You may not be able to achieve the desired 140k requests per minute using a single JMeter machine; in that case you'll need a Distributed Load Testing approach.
refer: http://jmeter.apache.org/usermanual/jmeter_distributed_testing_step_by_step.html
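In that setup you start jmeter-server on each load generator and drive them all from the controller with the -R option (host names are placeholders):

./jmeter -n -t test_plan.jmx -R loadgen1,loadgen2,loadgen3 -l results.jtl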
Also, keeping the ramp-up period at 1 second will lead to a spike and an unrealistic load on the system, which will not give a proper result unless you've pre-warmed your server; you should gradually increase the load following the real/estimated traffic pattern.
