Long-running Scrapy crawl aborts - debugging

I'm running several spiders over different websites. Most runs take 2-3 days and many work fine, but sometimes the crawl just stops or crashes.
With:
scrapy crawl myspider > logs/myspider.log 2>&1 &
I'm redirecting the output into a file; for one crawl, for instance, the last entry is:
[scrapy.extensions.logstats] INFO: Crawled 1975 pages (at 1 pages/min), scraped 1907 items (at 1 items/min)
and it simply stops there: no stats were dumped, and it never got to the end of the run.
I assume this could be a network issue or something similar?
The machine has an average load of 0.10; I'm crawling with a 40-second delay and running 5-10 spiders. The hardware is old, but RAM and CPU are mostly idle in htop. I didn't change LOG_LEVEL, so it should be the default, DEBUG.
How can I find out what is happening?
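One way to find out where a stalled crawl is stuck is to make the process dump its thread stacks periodically, so the log shows what every thread was doing at the moment it hung. A minimal sketch using the standard-library faulthandler module from a small Scrapy extension (the module path, class name, and 600-second interval are my own choices, not Scrapy defaults):

# myproject/extensions.py  (hypothetical module path)
import faulthandler
import sys

class StackDumper:
    """Dump all thread stacks every 10 minutes so a hung crawl leaves a trace."""

    @classmethod
    def from_crawler(cls, crawler):
        # stderr already ends up in the log file via the shell redirection above.
        faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
        return cls()

# settings.py -- enable the extension:
# EXTENSIONS = {"myproject.extensions.StackDumper": 0}

If the process disappears entirely rather than hanging, also check dmesg for the kernel OOM killer, which ends processes without leaving anything in the application log.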

Related

While using Instaloader via command line, how can I force 429 errors to cause requests to be retried after a longer period of time?

I am using Instaloader via command line on Windows 11, with the following command:
.\instaloader --login=MYUSERNAME :saved --dirname-pattern="Saved_Posts\{profile}" --filename-pattern="{profile}-{shortcode}" --no-resume --no-metadata-json --slide 1 --no-captions --no-video-thumbnails --no-iphone
This attempts to download approximately 12,000 saved posts from a profile. Instaloader behaves as expected for several thousand posts, occasionally giving the following error:
Too many queries in the last time. Need to wait 15 seconds, until 13:19.
The process then resumes successfully for several hundred more posts. Eventually, however, I start encountering 429 errors:
JSON Query to graphql/query: 429 Too Many Requests [retrying; skip with ^C]
Number of requests within last 10/11/20/22/30/60 minutes grouped by type:
d6f4427fbe92d846298cf93df0b937d3: 0 0 0 0 0 0
f883d95537fbcd400f466f63d42bd8a1: 0 0 0 1 1 11
* 2b0673e0dc4580674a88d426fe00ea90: 59 64 121 134 191 709
Instagram responded with HTTP error "429 - Too Many Requests". Please
do not run multiple instances of Instaloader in parallel or within
short sequence. Also, do not use any Instagram App while Instaloader
is running.
The request will be retried in 7 seconds, at 14:01.
This error then repeats over and over, I believe until the default maximum-connection-attempts limit is reached and it moves on to the next post, which receives the same error. Importantly, this error does not go away after several hours of these 'slower' requests being made; it seems to persist as long as Instaloader stays open. I have seen these 429 errors with very few requests in the last 60 minutes (i.e. <100), which makes me think I am hitting quite a long shadowban.
I have tried setting the maximum connection attempts to 0 (i.e. retry indefinitely), but the wait time appears to be capped at 666 seconds, about 11 minutes. The error does not seem to clear even when leaving Instaloader to send requests every 11 minutes in this way; it is as though each individual request 'resets' the ban.
I am looking for a way of resolving this issue, which could include:
Adding a command to force 429 errors to be retried after successively longer periods of time (instead of the number of seconds being capped at 666)
Adding a command that 'preserves' wait times after each 429 error; e.g. if downloading Post 456 fails and retries after 5, then 10, then 15 seconds before succeeding, and downloading Post 457 then immediately fails, start the retry wait for Post 457 at 15 seconds or more, rather than going back to 5! (See the sketch after this list.)
Avoiding the 429 errors in the first place, if there appears to be an issue with my command line prompt
Breaking down the request into 'batches' and running one of those prompts every few days. e.g. is there a way to download Saved Posts 1-500, then 500-1000, and so on? (The Saved Posts are not necessarily in chronological order of the post date, which is what I've tried so far)
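As far as I know there is no command-line option for any of these, but Instaloader's Python module accepts a custom rate controller: the documented instaloader.RateController class has a handle_429() hook you can override. A sketch of ideas 1 and 2 (the doubling schedule and the 2-hour cap are my own assumptions, and nothing below resets the counter after a success):

import itertools
import instaloader

class PatientRateController(instaloader.RateController):
    """Back off for progressively longer after consecutive 429s."""

    def __init__(self, context):
        super().__init__(context)
        self.consecutive_429s = 0

    def handle_429(self, query_type):
        self.consecutive_429s += 1
        # Assumed schedule: 10 min, 20 min, 40 min, ... capped at 2 hours.
        self.sleep(min(600 * 2 ** (self.consecutive_429s - 1), 7200))
        super().handle_429(query_type)  # keep Instaloader's own bookkeeping

L = instaloader.Instaloader(rate_controller=lambda ctx: PatientRateController(ctx))
L.interactive_login("MYUSERNAME")

profile = instaloader.Profile.from_username(L.context, "MYUSERNAME")
# islice gives a crude version of idea 4: posts 0-499 now, 500-999 in a later run.
for post in itertools.islice(profile.get_saved_posts(), 0, 500):
    L.download_post(post, target="Saved_Posts")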
I have looked at several other posts on 429 errors, but the general theme seems to be one of:
Wait some time for the issue to clear — have tried this for up to 48 hours, but running the command again starts from post #1 and never makes it to the latter half of posts
Disable iPhone API requests — already done, which helps but does not solve the issue
The 429 errors simply should not be encountered during normal behaviour – well, they are!

Load testing: site now serves fewer users

I was load testing a website at 30 users/second and it worked fine, but now it cannot even serve 25 users/second. The website is a search-engine-style site. Between these two tests of 30 users/second and 25 users/second, we started the crawler to get some sites crawled and then stopped it again before the load testing; 30 users/second worked fine before the crawler was turned on. Elasticsearch is used as the database for the website, and it went down saying no nodes available.
I've used a standard Thread Group. Below is the configuration:
Total Samples: 250
Ramp-up (sec): 10
Loop count: 1
When I check the results in the table, everything shows green, but when we hit the site it throws a NoNodeAvailableException.
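Since JMeter shows green while the application throws NoNodeAvailableException, the failure is between the website and Elasticsearch rather than in the load test itself. A quick way to check whether the crawler left the cluster degraded is to probe the cluster health endpoint before and after each run (a sketch; localhost:9200 is an assumption about where Elasticsearch listens):

import json
import urllib.request

# _cluster/health is a standard Elasticsearch API.
with urllib.request.urlopen("http://localhost:9200/_cluster/health") as resp:
    health = json.load(resp)

# status is green/yellow/red; a falling node count would explain
# "no nodes available" on the application side.
print(health["status"], health["number_of_nodes"], health["unassigned_shards"])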

Grafana dashboard panel is taking 5 to 8 seconds to load

I am using Grafana 5.1.3 (commit 087143285) and InfluxDB 1.5.2, together with JMeter.
There are 13 panels, and a panel takes 5 to 8 seconds to load.
The query below runs for a panel (when I run the same query directly on the DB server it is very fast):
SELECT mean("startedThreads") FROM "virtualUsers" WHERE time >= 1537865329564ms and time <= 1537867129564ms GROUP BY time(60s) fill(null);
EXPLAIN ANALYZE
execution_time: 157.341µs
planning_time: 626.44µs
total_time: 783.781µs
SELECT count("responseTime")/60 FROM "requestsRaw" WHERE time >= 1537865329564ms and time <= 1537867129564ms GROUP BY time(60s) fill(null)
execution_time: 535.011µs
planning_time: 1.805892ms
total_time: 2.340903ms
Below are the memory and CPU details. InfluxDB and Grafana are hosted on the same server.
free -g
total used free shared buff/cache available
Mem: 15 3 11 0 1 12
Swap: 7 0 6
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
As per my initial understanding, Grafana's minimum memory requirement is 249 MB, so memory should not be a problem for Grafana.
Please let me know if you need more details.
It is odd that the query runs fast while Grafana takes a long time; panels should be displayed as soon as Grafana gets a response.
Since rendering is done in the browser (AFAIK), the browser could be the bottleneck. So if your browser runs on a Raspberry Pi 1, please try using a different computer.
It is not clear whether all panels take a long time to load or just one; try to find out whether the loading time is tied to a single panel.
Lastly, consider that all queries are sent at the same time, so making just one query to the server from the CLI may not be representative. You could try splitting the dashboard into multiple dashboards to improve the loading time.
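To narrow down whether the time is spent in InfluxDB, on the network, or in the browser, you can also time the panel's exact query through InfluxDB's HTTP API from the Grafana host (a sketch; the database name jmeter is an assumption, and port 8086 is the InfluxDB 1.x default):

import time
import urllib.parse
import urllib.request

query = ('SELECT mean("startedThreads") FROM "virtualUsers" '
         'WHERE time >= 1537865329564ms AND time <= 1537867129564ms '
         'GROUP BY time(60s) fill(null)')

url = ('http://localhost:8086/query?'
       + urllib.parse.urlencode({'db': 'jmeter', 'q': query}))

start = time.monotonic()
with urllib.request.urlopen(url) as resp:
    body = resp.read()
print(f'{time.monotonic() - start:.3f}s for {len(body)} bytes')

If that round trip is also fast, the slowness is in Grafana or the browser rather than in the datasource.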

java.net.SocketException: Connection reset in JMeter [duplicate]

I have the following test plan in JMeter:
In the screenshot you can see the settings for the first Thread Group, which accounts for 50% of the total number of requests in the test plan (each Thread Group contains 10 different subrequests).
So, on average, +1 request per second is added with these settings.
Then I ran this test and saw this picture (Error % column):
I save the errors to a file, and they all have the same text:
<sample t="30129" lt="0" ts="1356710138314" s="false" lb="WebService(SOAP) Request 1" rc="000" rm="Connection reset" tn="jp#gc - Stepping Thread Group1 3-247" dt="text" by="0"/>
Server's CPU screenshot:
and for database:
After the errors appeared, my computer started working slower and slower (although the errors stopped appearing)...
At the same time, the server's CPU usage progressively dropped to 0.
Could you tell me, please:
What is the reason for this error?
Have I reached the server timeout? (Because Max is more than 30 s in the table.)
UPD: I have rerun the test with the following settings: 1000 users over 02:46:40 (+1 Thread Group per 10 seconds and 10 requests inside each new thread in the loop).
I.e., I have halved the test duration and the total number of Thread Groups, but kept the same intensity of thread creation.
The results are the same (including CPU usage on the server).
I received the «Connection reset» error after 990 threads had started. Here are the screenshots:
Any idea?
First, WebService(SOAP) Request is not the best way to test web services in JMeter; it will be deprecated in the upcoming 2.9 version.
The HTTP Sampler is the one to choose, as it performs much better.
Second, 'Connection reset' means your server has cut the connection. It could be related to the CPU, which seems high, but that is not certain.
If what you call "my comp" (the computer hosting JMeter) is what started working slowly, then your JMeter instance is overwhelmed by the number of threads (2003 or more?) you've configured. This can come from a lot of factors; read this:
http://www.dzone.com/links/see_how_to_make_jmeter_run_thousands_of_threads_w.html
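To make the failure mode concrete: 'Connection reset' is what the client sees when the server (or something in between) aborts the TCP connection with an RST. A self-contained illustration, nothing JMeter-specific (the port is arbitrary):

import socket
import struct
import threading
import time

def abortive_server(port=9009):
    # Accept one connection, then close it with SO_LINGER=0, which makes
    # close() send a TCP RST instead of a normal FIN.
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', port))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
    conn.close()
    srv.close()

threading.Thread(target=abortive_server, daemon=True).start()
time.sleep(0.2)  # give the server thread time to start listening

client = socket.create_connection(('127.0.0.1', 9009))
try:
    print(client.recv(1024))  # blocks until the RST arrives ...
except ConnectionResetError as exc:
    print('client saw:', exc)  # ... [Errno 104] Connection reset by peer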

Error "Connection reset" in JMeter (SOAP XML web-service)

I have the next test plan in JMeter:
on the screenshot you can see the settings for the 1st ThreadGroup, wich has 50% of common amout of request in test plan (in each Thread Group are 10 different subrequests placed).
So, +1 request per second is added in average using these settings.
Then I ran this test and saw this picture (Error % column):
I save errors in file and all these errors have the same text:
<sample t="30129" lt="0" ts="1356710138314" s="false" lb="WebService(SOAP) Request 1" rc="000" rm="**Connection reset**" tn="jp#gc - Stepping Thread Group1 3-247" dt="text" by="0"/>
Server's cpu screenshot:
and for database:
After the errors have appeared my comp started work slowly and slowly (although the errors stopped to appear further)...
And in the same time the server's cpu progressively dropped to 0.
Could you tell me, please,
What is the reason of this error?
Have I reached the server timeout? (Because Max is more than 30s in the table).
UPD. I have rerun test with next settings: 1000 users per 02:46:40 (+1 Thread Group per 10 second and 10 requests inside each new Thread in the Loop).
I.e. I have reduced the time of test and total Thread Groups by 2 times, but save intensivity of Thead's adding.
The results are the same (including cpu usage on the server).
I've received the error «Connection reset» after 990 thread started. There are screenshots:
Any idea?
First, WebService(SOAP) Request is not the best way to test Webservices in JMeter, it will be deprecated in upcoming 2.9 version.
HTTP Sampler is the one to choose as it performs much better.
Second, Connection Reset means your server has cut connection. It could be coming from the CPU which seems high but it's not sure.
If what you call "my comp" is the computer hosting JMeter started working slowly then your JMeter instance is overwhelmed by the number of threads (2003 or more?) you've configured. It can come from a lot of factors, read this:
http://www.dzone.com/links/see_how_to_make_jmeter_run_thousands_of_threads_w.html
