Need Help with Unix bash "timer" for mp3 URL Download Script - bash

I've been using the following Unix bash script:
#!/bin/bash
mkdir -p ~/Desktop/URLs
n=1
while read mp3; do
curl "$mp3" > ~/Desktop/URLs/$n.mp3
((n++))
done < ~/Desktop/URLs.txt
to download and rename a bunch mp3 files from URLs listed in "URLs.txt". It works well (thanks to StackOverflow users), but due to a suspected server quantity/time download limit, it's only allowing me to access a range of 40 - 50 files from my URL list.
Is there a way to work around this by adding a "timer" inside the while loop so it downloads 1 file per "X" seconds?
I found another related question, here:
How to include a timer in Bash Scripting?
but I'm not sure where to add the "sleep [number of seconds]"... or even if "sleep" is really what I need for my script...?
Any help enormously appreciated — as always.
Dave

curl has some pretty awesome command-line options (documentation), for example, --limit-rate will limit the amount of bandwidth that curl uses, which might completely solve your problem.
For example, replace the curl line with:
curl --limit-rate 200K "$mp3" > ~/Desktop/URLs/$n.mp3
would limit the transfers to an average of 200K per second, which would download a typical 5MB MP3 file in 25 seconds, and you could experiment with different values until you found the maximum speed that worked.
You could also try a combination of --retry and --retry-delay so that when and if a download fails, curl waits and then tries again after a certain amount of time.
For example, replace the curl line with:
curl --retry 30 "$mp3" > ~/Desktop/URLs/$n.mp3
This will transfer the file. If the transfer fails, it will wait a second and try again. If it fails again, it will wait two seconds. If it fails again, it will wait four seconds, and so on, doubling the waiting time until it succeeds. The "30" means it will retry up to 30 times, and it will never wait more than 10 minutes. You can learn this all at the documentation link I gave you.

#!/bin/bash
mkdir -p ~/Desktop/URLs
n=1
while read mp3; do
curl "$mp3" > ~/Desktop/URLs/$n.mp3 &
((n++))
if ! ((n % 4)); then
wait
sleep 5
fi
done < ~/Desktop/URLs.txt
This will spawn at most 4 instances of curl and then wait for them to complete before it spawns 4 more.

A timer?
Like your crontab?
man cron
You know what they let you download, just count the disk usage of your files that you did get.
There is the transfer you are allowed. You need that, and you will need the PID of your script.
ps aux | grep $progname | print awk '{print $1}'
or something like that. The secret sauce here is that you can suspend with
kill -SIGSTOP PID
and resume with
kill -SIGCONT PID
So the general method would be
Urls on an array or queue or whatever
bash lets you have
Process an url.
increment transfer counter
When transfer counter gets close
kill -SIGSTOP MYPID
You are suspended.
in your crontab foreground your script after a minute/hour/day whatever
Continue processing
Repeat until done.
just don't log out or you'll need to do the whole thing over again, although if you used perl it would be trivial.
Disclaimer, I am not sure if this is an exercise in bash or whatnot, I confess freely that I see the answer in perl, which is always my choice outside of a REPL. Code in Bash long enough , or heaven forbid, Zsh ( my shell ) and you will see why Perl was so popular. Ah memories...
Disclaimer 2: Untested, drive by , garbage methodology here only made possible because you've an idea what that transfer might be. Obviously, if you have ssh , use ssh -D PORT you#host and pull the mp3's out of the proxy half the time.
In my own defense, if you slow pull the urls with sleep you'll be connected for a while. Perhaps "they" might notice that. Suspend and resume and you only should be connected while grabbing tracks, and gone otherwise.

Not so much an answer as an optimization. If you can consistently get the first few URLs, but it times out on the later ones, perhaps you could trim your URL file as the mp3s were successfully received?
That is, as 1.mp3 is successfully downloaded, strip it from the list:
tail url.txt -n +2 > url2.txt; mv -f url2.txt url.txt
Then the next time the script runs, it'll begin from 2.mp3
If that works, you might just set up a cron job to periodically execute the script over and over, taking bites at a time.
It just occurred to me that you're programatically numbering the mp3s, and curl might clobber some of them on restart, since every time it runs it'll start counting at 1.mp3 again. Something to be aware of, if you go this route.

Related

Bash script for do loop that detects a time period elapsed and then continues

I have a script that runs through a list of servers to connect to and grabs by SCP over files to store
Occasionally due to various reasons one of the servers crashes and my script gets stuck for around 4 hours before moving on through the list.
I would like to be able to detect a connection issue or a period of time elapsed after script has started and kill that command and move on to next.
I suspect that this would involve a wait or sleep and continue but I am new to loops and bash
#!/bin/bash
#
# Generate a list of backups to grab
df|grep backups|awk -F/ '{ print $NF }'>/tmp/backuplistsmb
# Get each backup in turn
for BACKUP in `cat /tmp/backuplistsmb`
do
cd /srv/backups/$BACKUP
scp -o StrictHostKeyChecking=no $BACKUP:* .
sleep 3h
done
The above script works fine but does get stuck for 4 hours should there be a connection issue. It is worth noting that some of the transfers take 10 mins and some 2.5 hours
Any ideas or help would very appreciated
Try to use the timeout program for that:
Usage:
timeout [OPTION] DURATION COMMAND [ARG]...
E.g. time (timeout 3 sleep 5) will run for 3 secs.
So in your code you can use:
timeout 300 scp -o StrictHostKeyChecking=no $BACKUP:* .
This limits the copy to 5 minutes.

Wait between wget downloads

I'm trying to download a directory (incl subs) from a website.
I'm using:
wget -r -e robots=off --no-parent --reject "index.html*" http://example.com/directory1/
Problem is, the server refuses connection after a bit, I think there's too many connections within a short amount of time. So what I'd like to do is insert a wait time (5 seconds) between each download/lookup. Is that possible? If so, how?
You can use --wait. From wget(1):
-w seconds
--wait=seconds
Wait the specified number of seconds between the retrievals. Use
of this option is recommended, as it lightens the server load by
making the requests less frequent. Instead of in seconds, the time
can be specified in minutes using the "m" suffix, in hours using
"h" suffix, or in days using "d" suffix.
Specifying a large value for this option is useful if the network
or the destination host is down, so that Wget can wait long enough
to reasonably expect the network error to be fixed before the
retry. The waiting interval specified by this function is
influenced by "--random-wait", which see.
I didn't know this either, but I found this answer in 15 seconds by using the wget manpage:
Type man wget.
You can use / to search, so I used /wait.
It's the first hit!
Press q to quit.

How to create an anonymous pipe between 2 child processes and know their pids (while not using files/named pipes)?

Please note that this questions was edited after a couple of comments I received. Initially I wanted to split my goal into smaller pieces to make it simpler (and perhaps expand my knowledge on various fronts), but it seems I went too far with the simplicity :). So, here I am asking the big question.
Using bash, is there a way one can actually create an anonymous pipe between two child processes and know their pids?
The reason I'm asking is when you use the classic pipeline, e.g.
cmd1 | cmd2 &
you lose the ability to send signals to cmd1. In my case the actual commands I am running are these
./my_web_server | ./my_log_parser &
my_web_server is a basic web server that dump a lot of logging information to it's stdout
my_log_parser is a log parser that I wrote that reads through all the logging information it receives from my_web_server and it basically selects only certain values from the log (in reality it actually stores the whole log as it received it, but additionally it creates an extra csv file with the values it finds).
The issue I am having is that my_web_server actually never stops by itself (it is a web server, you don't want that from a web server :)). So after I am done, I need to stop it myself. I would like for the bash script to do this when I stop it (the bash script), either via SIGINT or SIGTERM.
For something like this, traps are the way to go. In essence I would create a trap for INT and TERM and the function it would call would kill my_web_server, but... I don't have the pid and even though I know I could look for it via ps, I am looking for a pretty solution :).
Some of you might say: "Well, why don't you just kill my_log_parser and let my_web_server die on its own with SIGPIPE?". The reason why I don't want to kill it is when you kill a process that's at the end of the pipeline, the output buffer of the process before it, is not flushed. Ergo, you lose stuff.
I've seen several solutions here and in other places that suggested to store the pid of my_web_server in a file. This is a solution that works. It is possible to write the pipeline by fiddling with the filedescriptors a bit. I, however don't like this solution, because I have to generate files. I don't like the idea of creating arbitrary files just to store a 5-character PID :).
What I ended up doing for now is this:
#!/bin/bash
trap " " HUP
fifo="$( mktemp -u "$( basename "${0}" ).XXXXXX" )"
mkfifo "${fifo}"
<"${fifo}" ./my_log_parser &
parser_pid="$!"
>"${fifo}" ./my_web_server &
server_pid="$!"
rm "${fifo}"
trap '2>/dev/null kill -TERM '"${server_pid}"'' INT TERM
while true; do
wait "${parser_pid}" && break
done
This solves the issue with me not being able to terminate my_web_server when the script receives SIGINT or SIGTERM. It seems more readable than any hackery fiddling with file descriptors in order to eventually use a file to store my_web_server's pid, which I think is good, because it improves the readability.
But it still uses a file (named pipe). Even though I know it uses the file (named pipe) for my_web_server and my_log_parser to talk (which is a pretty good reason) and the file gets wiped from the disk very shortly after it's created, it's still a file :).
Would any of you guys know of a way to do this task without using any files (named pipes)?
From the Bash man pages:
! Expands to the process ID of the most recently executed back-
ground (asynchronous) command.
You are not running a background command, you are running process substitution to read to file descriptor 3.
The following works, but I'm not sure if it is what you are trying to achieve:
sleep 120 &
child_pid="$!"
wait "${child_pid}"
sleep 120
Edit:
Comment was: I know I can pretty much do this the silly 'while read i; do blah blah; done < <( ./my_proxy_server )'-way, but I don't particularly like the fact that when a script using this approach receives INT or TERM, it simply dies without telling ./my_proxy_server to bugger off too :)
So, it seems like your problem stems from the fact that it is not so easy to get the PID of the proxy server. So, how about using your own named pipe, with the trap command:
pipe='/tmp/mypipe'
mkfifo "$pipe"
./my_proxy_server > "$pipe" &
child_pid="$!"
echo "child pid is $child_pid"
# Tell the proxy server to bugger-off
trap 'kill $child_pid' INT TERM
while read
do
echo $REPLY
# blah blah blah
done < "$pipe"
rm "$pipe"
You could probably also use kill %1 instead of using $child_pid.
YAE (Yet Another Edit):
You ask how to get the PIDS from:
./my_web_server | ./my_log_parser &
Simples, sort of. To test I used sleep, just like your original.
sleep 400 | sleep 500 &
jobs -l
Gives:
[1]+ 8419 Running sleep 400
8420 Running | sleep 500 &
So its just a question of extracting those PIDS:
pid1=$(jobs -l|awk 'NR==1{print $2}')
pid2=$(jobs -l|awk 'NR==2{print $1}')
I hate calling awk twice here, but anything else is just jumping through hoops.

Is there a way to create a bash script that will only run for X hours?

Is there a way to create a bash script that will only run for X hours? I'm currently setting up a cron job to initiate a script every night. This script essentially runs until a certain condition is met, exporting it's status to a holding variable to keep track of 'where it is' after each iteration. The intention is to start-up the process every night, run for a few hours, and then stop, holding the status until the process starts up the next night.
Short of somehow collecting the start time, and checking it against the current time in each iteration of the loop, is there an easier way to do this? Bash scripting is not my forte (I know enough to get things done and be dangerous) and I have not done something like this before. Any help would be appreciated. Thanks.
Use GNU Coreutils
GNU coreutils contains an actual timeout binary, usually invoked like this:
# timeout after 5 seconds when sleeping for 30
/usr/bin/timeout 5s /bin/sleep 30
In your case, you'd want to specify hours instead of seconds, so to timeout in 2 hours use something like 2h instead of 5s. See timeout(1) or info coreutils 'timeout invocation' for additional options.
Hacks and Workarounds
Native timeouts or the GNU timeout command are really the best options. However, see the following for some ideas if you decide to roll your own:
How do I run a command, and have it abort (timeout) after N seconds?
The TMOUT variable using read and process or command substitution.
Do it as you described - it is the cleanest way.
But if for some strange reason want kill the process after a time, can use the next
./long_runner &
(sleep 5; kill $!; wait; exit 0) &
will kill the long_runner after 5 secs.
By using the SIGALRM facility you can rig a signal to be sent after a certain time, but traditionally, this was not easily accessible from shell scripts (people would write small custom C or Perl programs for this). These days, GNU coreutils ships with a timeout command which does this by wrapping your command:
timeout 4h yourprogram

Un*x shell script: what is the correct way to run a script for at most x milliseconds?

I'm not a scripting expert and I was wondering what was an acceptable way to run a script for at most x milliseconds (and yet finish before x milliseconds if the script is done before the timeout).
I solved that problem using Bash in a way that I think is very hacky and I wonder if there's a better way to do it.
Basically I've got one shell script called sleep_kill.sh that takes a PID as the first argument and a timeout as its second argument and that does this:
sleep $2
kill -9 $1 2> /dev/null 1> /dev/null
So if the PID corresponds to a script that finishes before timing out, nothing is going to be killed (I take it that the OS shall not have the time to be reusing this PID for another [unrelated] process seen that it's 'cycling' through all the process IDs before starting to reuse them).
Anyway, then I call my script that may "hang" or timeout:
command_that_may_hang.sh
PID=$!
sleep_kill.sh $PID .3
wait $PID > /dev/null 2>&1
And I'll be waiting at most 300 ms for command_that_may_hang.sh. Yet if command_that_may_hang.sh took only 10 ms to execute, I won't be "stuck" for 300 ms.
It would be great if some shell expert could explain the drawbacks of this approach and what should be done instead.
Have a look at this script: http://www.pixelbeat.org/scripts/timeout
Note timeouts of less that one second are pretty much nonsensical on most systems due to scheduling delays etc. Note also that newer coreutils has the timeout command included and it has a resolution of 1 second.

Resources