Wait between wget downloads - shell

I'm trying to download a directory (including subdirectories) from a website.
I'm using:
wget -r -e robots=off --no-parent --reject "index.html*" http://example.com/directory1/
The problem is that the server refuses the connection after a while; I think there are too many connections within a short amount of time. So what I'd like to do is insert a wait time (5 seconds) between each download/lookup. Is that possible? If so, how?

You can use --wait. From wget(1):
-w seconds
--wait=seconds
Wait the specified number of seconds between the retrievals. Use
of this option is recommended, as it lightens the server load by
making the requests less frequent. Instead of in seconds, the time
can be specified in minutes using the "m" suffix, in hours using
"h" suffix, or in days using "d" suffix.
Specifying a large value for this option is useful if the network
or the destination host is down, so that Wget can wait long enough
to reasonably expect the network error to be fixed before the
retry. The waiting interval specified by this function is
influenced by "--random-wait", which see.
I didn't know this either, but I found this answer in 15 seconds by using the wget manpage:
Type man wget.
You can use / to search, so I used /wait.
It's the first hit!
Press q to quit.
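Applied to the command from the question, a 5-second pause between requests might look like this (adding --random-wait is optional; per the manpage excerpt above it just varies the delay around the value you give):
wget -r -e robots=off --no-parent --reject "index.html*" --wait=5 --random-wait http://example.com/directory1/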

Related

Time Limit for a command to run in cmd

I'm writing a batch file that runs several commands and prints the output of each one. However, some of the commands are not working, which makes the batch file take much longer to execute. I want every command to run for at most 10 seconds; if no output has come by then, abort that command and run the next command in the batch file.
curl URL1
curl URL2
curl URL3
curl URL4
If URL2 is not working, it takes a long time to execute. I want each curl command to be given 10 seconds, then aborted so the next curl command can run.
Since you say you're writing a batch file I'm going to assume that you're using the Windows port of the cURL commandline utility, not the alias curl for the PowerShell cmdlet Invoke-WebRequest.
The cURL utility has 2 parameters that control timeouts:
--connect-timeout <seconds>
Maximum time in seconds that you allow the connection to the server to take. This only limits the connection phase, once curl has connected this option is of no more use. See also the -m/--max-time option.
[...]
-m/--max-time <seconds>
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. See also the --connect-timeout option.
So you should be able to run your statements like this:
curl --connect-timeout 10 --max-time 10 URL
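Applied to the batch file from the question (URL1 through URL4 stand in for the real addresses), that might look like:
curl --connect-timeout 10 --max-time 10 URL1
curl --connect-timeout 10 --max-time 10 URL2
curl --connect-timeout 10 --max-time 10 URL3
curl --connect-timeout 10 --max-time 10 URL4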

Bash script for do loop that detects a time period elapsed and then continues

I have a script that runs through a list of servers, connects to each one, and grabs files over SCP to store locally.
Occasionally, for various reasons, one of the servers crashes and my script gets stuck for around 4 hours before moving on through the list.
I would like to be able to detect a connection issue, or that a certain period of time has elapsed since the command started, and then kill that command and move on to the next.
I suspect this would involve a wait or sleep and continue, but I am new to loops and bash.
#!/bin/bash
#
# Generate a list of backups to grab
df|grep backups|awk -F/ '{ print $NF }'>/tmp/backuplistsmb
# Get each backup in turn
for BACKUP in `cat /tmp/backuplistsmb`
do
    cd /srv/backups/$BACKUP
    scp -o StrictHostKeyChecking=no $BACKUP:* .
    sleep 3h
done
The above script works fine but does get stuck for 4 hours should there be a connection issue. It is worth noting that some of the transfers take 10 minutes and some 2.5 hours.
Any ideas or help would be very much appreciated.
Try to use the timeout program for that:
Usage:
timeout [OPTION] DURATION COMMAND [ARG]...
E.g. time (timeout 3 sleep 5) will show the command running for only 3 seconds instead of 5.
So in your code you can use:
timeout 300 scp -o StrictHostKeyChecking=no $BACKUP:* .
This limits the copy to 5 minutes.
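For illustration, dropping that into the script from the question could look like the sketch below. Since some of your transfers legitimately take up to 2.5 hours, a limit like 3h is probably more realistic than 300 seconds; the value here is just a placeholder to adjust, and the || continue guard on cd is a small defensive addition:
#!/bin/bash
#
# Generate a list of backups to grab
df|grep backups|awk -F/ '{ print $NF }'>/tmp/backuplistsmb
# Get each backup in turn, but give up on any host that hangs
for BACKUP in $(cat /tmp/backuplistsmb)
do
    cd "/srv/backups/$BACKUP" || continue
    timeout 3h scp -o StrictHostKeyChecking=no "$BACKUP:*" .
    sleep 3h    # pause between hosts, kept from the original script
done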

How to resume a bash session after an ssh disconnection?

I connected via ssh to a remote server and started a very long wget download. Then the ssh session broke, and after reconnecting a new copy of the interpreter was created. Now I see the wget process in ps, but is it possible to return control of it to the old interpreter? I know that the better solution is to use screen for long commands, but is there another way?
No, there is no way to reattach a process to a different terminal if it's not set up to do that (by way of screen / tmux / what have you) in the first place.
As a crude approximation, connecting a debugger to the running process may allow you to interact with it in some limited ways, but in this particular scenario, I don't think it will be beneficial.
If you want to know the progress of your currently running wget, check the size of the downloaded file; it should be growing. If it isn't, run killall wget and start over.
Next time, consider running wget --background to prevent the problem from happening. See the wget info page.
The following command tells Wget to work in the background and write its progress to the log file my.log.
The number of retries is set to 45 (the -t option):
wget -t 45 -o my.log http://upload.wikimedia.org/wikipedia/commons/5/51/Google.png &

Is there a way to create a bash script that will only run for X hours?

Is there a way to create a bash script that will only run for X hours? I'm currently setting up a cron job to initiate a script every night. The script essentially runs until a certain condition is met, exporting its status to a holding variable to keep track of 'where it is' after each iteration. The intention is to start the process every night, run for a few hours, and then stop, holding the status until the process starts up the next night.
Short of somehow collecting the start time, and checking it against the current time in each iteration of the loop, is there an easier way to do this? Bash scripting is not my forte (I know enough to get things done and be dangerous) and I have not done something like this before. Any help would be appreciated. Thanks.
Use GNU Coreutils
GNU coreutils contains an actual timeout binary, usually invoked like this:
# timeout after 5 seconds when sleeping for 30
/usr/bin/timeout 5s /bin/sleep 30
In your case, you'd want to specify hours instead of seconds, so to timeout in 2 hours use something like 2h instead of 5s. See timeout(1) or info coreutils 'timeout invocation' for additional options.
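For the scenario in the question (a nightly cron job that should stop after a couple of hours), that could look something like the following crontab entry; the 01:00 start time, the 2-hour limit, and the script path are all placeholders:
# start the job at 01:00 every night, but kill it if it is still running after 2 hours
0 1 * * * /usr/bin/timeout 2h /path/to/nightly_job.sh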
Hacks and Workarounds
Native timeouts or the GNU timeout command are really the best options. However, see the following for some ideas if you decide to roll your own:
How do I run a command, and have it abort (timeout) after N seconds?
The TMOUT variable using read and process or command substitution.
Doing it as you described is the cleanest way.
But if for some strange reason you want to kill the process after a set time, you can use the following:
./long_runner &
(sleep 5; kill $!; wait; exit 0) &
This will kill long_runner after 5 seconds.
By using the SIGALRM facility you can rig a signal to be sent after a certain time, but traditionally, this was not easily accessible from shell scripts (people would write small custom C or Perl programs for this). These days, GNU coreutils ships with a timeout command which does this by wrapping your command:
timeout 4h yourprogram

Need Help with Unix bash "timer" for mp3 URL Download Script

I've been using the following Unix bash script:
#!/bin/bash
mkdir -p ~/Desktop/URLs
n=1
while read mp3; do
curl "$mp3" > ~/Desktop/URLs/$n.mp3
((n++))
done < ~/Desktop/URLs.txt
to download and rename a bunch of mp3 files from URLs listed in "URLs.txt". It works well (thanks to StackOverflow users), but due to a suspected server quantity/time download limit, it only lets me access a range of 40 - 50 files from my URL list.
Is there a way to work around this by adding a "timer" inside the while loop so it downloads 1 file per "X" seconds?
I found another related question, here:
How to include a timer in Bash Scripting?
but I'm not sure where to add the "sleep [number of seconds]"... or even if "sleep" is really what I need for my script...?
Any help enormously appreciated — as always.
Dave
curl has some pretty awesome command-line options (documentation), for example, --limit-rate will limit the amount of bandwidth that curl uses, which might completely solve your problem.
For example, replace the curl line with:
curl --limit-rate 200K "$mp3" > ~/Desktop/URLs/$n.mp3
This would limit the transfers to an average of 200K per second, which would download a typical 5MB MP3 file in 25 seconds. You could experiment with different values until you find the maximum speed that works.
You could also try a combination of --retry and --retry-delay so that when and if a download fails, curl waits and then tries again after a certain amount of time.
For example, replace the curl line with:
curl --retry 30 "$mp3" > ~/Desktop/URLs/$n.mp3
This will transfer the file. If the transfer fails, it will wait a second and try again. If it fails again, it will wait two seconds. If it fails again, it will wait four seconds, and so on, doubling the waiting time until it succeeds. The "30" means it will retry up to 30 times, and it will never wait more than 10 minutes. You can learn this all at the documentation link I gave you.
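If you would rather have the fixed pause mentioned above than the doubling back-off, you can combine the two options; the 5-second delay here is an arbitrary choice:
curl --retry 30 --retry-delay 5 "$mp3" > ~/Desktop/URLs/$n.mp3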
#!/bin/bash
mkdir -p ~/Desktop/URLs
n=1
while read mp3; do
curl "$mp3" > ~/Desktop/URLs/$n.mp3 &
((n++))
if ! ((n % 4)); then
wait
sleep 5
fi
done < ~/Desktop/URLs.txt
This will spawn at most 4 instances of curl and then wait for them to complete before it spawns 4 more.
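If all you really want is the literal "1 file per X seconds" from the question, a minimal sequential variant just sleeps after each download; the 5 seconds is an arbitrary value to tune:
#!/bin/bash
mkdir -p ~/Desktop/URLs
n=1
while read mp3; do
    curl "$mp3" > ~/Desktop/URLs/$n.mp3
    ((n++))
    sleep 5    # pause between downloads
done < ~/Desktop/URLs.txt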
A timer?
Like your crontab?
man cron
You know roughly how much they let you download, so just count the disk usage of the files you did get.
That tells you the transfer you are allowed. You need that, and you will need the PID of your script:
ps aux | grep "$progname" | awk '{print $2}'
or something like that. The secret sauce here is that you can suspend with
kill -SIGSTOP PID
and resume with
kill -SIGCONT PID
So the general method would be (see the sketch after this list):
Put the URLs in an array or queue or whatever bash lets you have.
Process a URL.
Increment a transfer counter.
When the transfer counter gets close to the limit, kill -SIGSTOP MYPID.
You are now suspended.
From your crontab, resume the script (kill -SIGCONT) after a minute/hour/day, whatever.
Continue processing.
Repeat until done.
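A rough, untested sketch of that flow (urls.txt, the plain wget call, and the 40-file threshold are all placeholders; the SIGSTOP part is the point):
#!/bin/bash
MYPID=$$
count=0
while read -r url; do
    wget "$url"                   # process a URL
    count=$((count + 1))          # increment the transfer counter
    if [ "$count" -ge 40 ]; then  # getting close to the allowed transfer
        kill -SIGSTOP "$MYPID"    # suspend ourselves; cron sends SIGCONT later
        count=0                   # we have been resumed, start counting again
    fi
done < urls.txt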
Just don't log out or you'll need to do the whole thing over again (although if you used Perl it would be trivial).
Disclaimer: I am not sure if this is an exercise in bash or whatnot; I confess freely that I see the answer in Perl, which is always my choice outside of a REPL. Code in Bash long enough, or heaven forbid, Zsh (my shell), and you will see why Perl was so popular. Ah, memories...
Disclaimer 2: Untested, drive-by, garbage methodology here, only made possible because you have an idea what that transfer limit might be. Obviously, if you have ssh, use ssh -D PORT you@host and pull the mp3s out of the proxy half the time.
In my own defense, if you slow-pull the URLs with sleep you'll be connected for a while, and perhaps "they" might notice that. Suspend and resume, and you should only be connected while grabbing tracks, and gone otherwise.
Not so much an answer as an optimization: if you can consistently get the first few URLs but it times out on the later ones, perhaps you could trim your URL file as the mp3s are successfully received?
That is, as 1.mp3 is successfully downloaded, strip it from the list:
tail url.txt -n +2 > url2.txt; mv -f url2.txt url.txt
Then the next time the script runs, it'll begin from 2.mp3
If that works, you might just set up a cron job to periodically execute the script over and over, taking bites at a time.
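For illustration, a hypothetical crontab entry that re-runs the script every 30 minutes (the path and the interval are placeholders):
*/30 * * * * /path/to/grab_mp3s.sh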
It just occurred to me that you're programmatically numbering the mp3s, and curl might clobber some of them on restart, since every time the script runs it starts counting at 1.mp3 again. Something to be aware of if you go this route.
