I have a web server that saves the log files of a web application with a numeric suffix. An example file name would be:
dbsclog01s001.log
dbsclog01s002.log
dbsclog01s003.log
The last 3 digits are the counter, and it can sometimes go up to 100.
I usually open a web browser, browse to the file like:
http://someaddress.com/logs/dbsclog01s001.log
and save the files. This of course gets a bit annoying when you get 50 logs.
I tried to come up with a BASH script for using wget and passing
http://someaddress.com/logs/dbsclog01s*.log
but I am having problems with my script.
Anyway, does anyone have a sample of how to do this?
thanks!
#!/bin/sh
if [ $# -lt 3 ]; then
    echo "Usage: $0 url_format seq_start seq_end [wget_args]" >&2
    exit 1
fi
url_format=$1
seq_start=$2
seq_end=$3
shift 3
printf "$url_format\n" $(seq "$seq_start" "$seq_end") | wget -i- "$@"
Save the above as seq_wget, give it execution permission (chmod +x seq_wget), and then run, for example:
$ ./seq_wget http://someaddress.com/logs/dbsclog01s%03d.log 1 50
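If you want to preview the URLs the script will feed to wget before downloading anything, you can run the printf/seq pipeline on its own (using the hypothetical host from the question):

```shell
# Each number from seq is formatted through the %03d in the URL template,
# producing one zero-padded URL per line (pipe this into `wget -i-` to fetch).
printf 'http://someaddress.com/logs/dbsclog01s%03d.log\n' $(seq 1 3)
```

wget -i - then reads exactly that list from standard input.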
Or, if you have Bash 4.0, you could just type
$ wget http://someaddress.com/logs/dbsclog01s{001..050}.log
Or, if you have curl instead of wget, you could follow Dennis Williamson's answer.
curl seems to support ranges. From the man page:
URL
The URL syntax is protocol dependent. You’ll find a detailed descrip‐
tion in RFC 3986.
You can specify multiple URLs or parts of URLs by writing part sets
within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
No nesting of the sequences is supported at the moment, but you can use
several ones next to each other:
http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
You can specify any amount of URLs on the command line. They will be
fetched in a sequential manner in the specified order.
Since curl 7.15.1 you can also specify step counter for the ranges, so
that you can get every Nth number or letter:
http://www.numericals.com/file[1-100:10].txt
http://www.letters.com/file[a-z:2].txt
You may have noticed that it says "with leading zeros"!
You can use echo-style brace-expansion sequences in the wget URL to download a series of numbered files...
wget http://someaddress.com/logs/dbsclog01s00{1..3}.log
This also works with letters
{a..z} {A..Z}
Not sure precisely what problems you were experiencing, but it sounds like a simple for loop in bash would do it for you.
for i in {1..999}; do
    wget -k http://someaddress.com/logs/dbsclog01s$(printf '%03d' $i).log -O your_local_output_dir_$i;
done
You can use a combination of a for loop in bash with the printf command (of course modifying echo to wget as needed):
$ for i in {1..10}; do echo "http://www.com/myurl`printf "%03d" $i`.html"; done
http://www.com/myurl001.html
http://www.com/myurl002.html
http://www.com/myurl003.html
http://www.com/myurl004.html
http://www.com/myurl005.html
http://www.com/myurl006.html
http://www.com/myurl007.html
http://www.com/myurl008.html
http://www.com/myurl009.html
http://www.com/myurl010.html
Interesting task, so I wrote a full script for you (combining several answers and more). Here it is:
#!/bin/bash
# fixed vars
URL=http://domain.com/logs/ # URL address 'till logfile name
PREF=logprefix # logfile prefix (before number)
POSTF=.log # logfile suffix (after number)
DIGITS=3 # how many digits logfile's number have
DLDIR=~/Downloads # download directory
TOUT=5 # timeout for quit
# code
for ((i=1; i<10**DIGITS; ++i))
do
    file=$PREF$(printf "%0${DIGITS}d" "$i")$POSTF   # local file name
    dl=$URL$file                                    # full URL to download
    echo "$dl -> $DLDIR/$file"                      # monitoring, can be commented out
    wget -T "$TOUT" -q "$dl" -O "$DLDIR/$file"
    if [ "$?" -ne 0 ]                               # test if we finished
    then
        exit
    fi
done
At the beginning of the script you can set the URL, the logfile prefix and suffix, how many digits the numbering part has, and the download directory. The loop will download every logfile it finds, and automatically exit at the first non-existent one (using wget's timeout).
Note that this script assumes that logfile indexing starts with 1, not zero, as shown in your example.
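The stop-at-first-failure logic can be seen in isolation with a stand-in test in place of wget (the cutoff value 3 is arbitrary, just for the demo):

```shell
# Pretend only files 1-3 exist; the loop counts "downloads" and stops
# as soon as the stand-in command fails, just as the script stops when
# wget returns non-zero for a missing logfile.
count=0
for ((i=1; i<=5; ++i)); do
    [ "$i" -le 3 ] || break   # stand-in for a failed wget
    count=$((count+1))
done
echo "$count files downloaded"   # prints: 3 files downloaded
```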
Hope this helps.
Here you can find a Perl script that looks like what you want
http://osix.net/modules/article/?id=677
#!/usr/bin/perl
$program="wget"; #change this to proz if you have it ;-)
my $count=1; #the lesson number starts from 1
my $base_url= "http://www.und.nodak.edu/org/crypto/crypto/lanaki.crypt.class/lessons/lesson";
my $format=".zip"; #the format of the file to download
my $max=24; #the total number of files to download
my $url;
for ($count = 1; $count <= $max; $count++) {
    if ($count < 10) {
        $url = $base_url."0".$count.$format; # insert a '0' and form the URL
    }
    else {
        $url = $base_url.$count.$format;     # no need to insert a zero
    }
    system("$program $url");
}
I just had a look at the wget manpage discussion of 'globbing':
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.
You may have to quote the URL to protect it from being expanded by your shell. Globbing makes Wget look for a directory listing, which is system-specific. This is why it currently works only with Unix FTP servers (and the ones emulating Unix "ls" output).
So wget http://... won't work with globbing.
Check to see if your system has seq, then it would be easy:
for i in $(seq -f "%03g" 1 10); do wget "http://.../dbsclog${i}.log"; done
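A quick way to sanity-check the padded counters before wiring them into wget, using the full URL from the question:

```shell
# seq -f "%03g" zero-pads each number to three digits:
for i in $(seq -f "%03g" 1 3); do
    echo "http://someaddress.com/logs/dbsclog01s${i}.log"
done
```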
If your system has the jot command instead of seq:
for i in $(jot -w "http://.../dbsclog%03d.log" 10); do wget $i; done
Oh! This is similar to a problem I ran into when learning bash to automate manga downloads.
Something like this should work:
for a in $(seq 1 999); do
    if [ ${#a} -eq 1 ]; then
        b="00"
    elif [ ${#a} -eq 2 ]; then
        b="0"
    else
        b=""
    fi
    echo "$a of 999"
    wget -q http://site.com/path/fileprefix$b$a.jpg
done
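The if/elif padding above can be collapsed into a single printf call; a sketch that echoes the file names instead of fetching them:

```shell
# %03d left-pads with zeros to 3 digits: 1 -> 001, 23 -> 023, 100 -> 100.
for a in $(seq 1 3); do
    printf 'fileprefix%03d.jpg\n' "$a"
done
```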
Late to the party, but a real easy solution that requires no coding is to use the DownThemAll Firefox add-on, which has the functionality to retrieve ranges of files. That was my solution when I needed to download 800 consecutively numbered files.
Related
Some time ago I asked what was wrong with a bash script I was trying to write, and I got a great solution: What's wrong with this youtube-dl automatic script?
I kept modifying the script to work with different youtube-dl option combinations and to deal with my very unstable Internet connection (that's the reason for the while/do loop), and it kept working flawlessly. But when I tried to use the same script structure to download YouTube playlists starting from a specific item in the list (e.g. item number 15), I got an error. I'm still pretty much a newbie at bash scripting (obviously), so I don't know what's wrong.
The script in question is this:
#!/bin/bash
function video {
youtube-dl --no-warnings -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' --socket-timeout 15 --hls-use-mpegts -R 64 --fragment-retries 64 --prefer-free-formats --all-subs --embed-subs -f 'bestvideo[height<=720]+bestaudio/best[height<=720]' "$@"
}
read -p "url: " url
video "$url"
while [ $? -ne 0 ]; do sleep 5 && video "$url" ; done
clear && echo completed!
So, for example, if I try to download a playlist, I just write in my Terminal:
printf https://www.youtube.com/playlist?list=PLS1QulWo1RIYmaxcEqw5JhK3b-6rgdWO_ | list720
("list720" is the name of the script, of course) The script runs without problems and does exactly what I expect it to do.
But if I run in the Terminal:
printf --playlist-start=15 https://www.youtube.com/playlist?list=PLS1QulWo1RIYmaxcEqw5JhK3b-6rgdWO_ | list720
I get the following error:
bash: printf: --: invalid option
printf: usage: printf [-v var] format [arguments]
ERROR: '' is not a valid URL. Set --default-search "ytsearch" (or run youtube-dl "ytsearch:" ) to search YouTube
If I invert the order (first the YouTube URL and then the --playlist-start=15 option), the script downloads the whole playlist and ignores "--playlist-start".
I tried just running the youtube-dl command string directly in the terminal and added the "--playlist-start" and URL at the end and it runs perfectly:
youtube-dl --no-warnings -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' --socket-timeout 15 --hls-use-mpegts -R 64 --fragment-retries 64 --prefer-free-formats --all-subs --embed-subs -f 'bestvideo[height<=720]+bestaudio/best[height<=720]' --playlist-start=15 https://www.youtube.com/playlist?list=PLS1QulWo1RIYmaxcEqw5JhK3b-6rgdWO_
...so I assume the problem is with the script.
Any help is welcome, thanks!
A much better design is to accept any options and the URL as command-line arguments. Scripts which require interactive I/O are pesky to include in bigger scripts and generally harder to use (you lose the ability to use your shell's tab completion, command-line history, etc.).
#!/bin/bash
# Don't needlessly use Bash-only syntax for declaring a function
# Indent the code
video () {
youtube-dl --no-warnings \
-o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' \
--socket-timeout 15 --hls-use-mpegts -R 64 --fragment-retries 64 \
--prefer-free-formats --all-subs --embed-subs \
-f 'bestvideo[height<=720]+bestaudio/best[height<=720]' "$@"
}
until video "$@"; do
sleep 5
done
Clearing the screen after finishing seems hostile so I took that out, too.
Now if you want to pass additional parameters to youtube-dl just include them as parameters to your script:
list720 --playlist-start=15 'https://www.youtube.com/playlist?list=PLS1QulWo1RIYmaxcEqw5JhK3b-6rgdWO_'
You should also usually quote any URLs in case they contain shell metacharacters. See also When to wrap quotes around a shell variable?
Notice how we always take care to use double quotes around "$@"; omitting them in this case is simply an error.
Notice also how, inside the function, "$@" refers to the function's arguments, whereas in the main script it refers to the script's command-line arguments.
Tangentially, using printf without a format string is problematic, too. If you pass in a string which contains a percent character, it will get interpreted as a format string.
bash$ printf 'http://example.com/%7Efnord'
http://example.com/0.000000E+00fnord
The proper solution is to always pass a format string as the first argument.
bash$ printf '%s\n' 'http://example.com/%7Efnord'
http://example.com/%7Efnord
But you don't need printf to pass something as standard input. Bash has "here strings":
list720 <<<'http://example.com/%7Efnord'
(This would of course only work with your old version which read the URL from standard input; the refactored script in this answer doesn't work that way.)
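For the curious, a here-string feeds standard input the same way the printf pipe did, so the original read call would pick it up; a minimal sketch:

```shell
# The here-string becomes standard input, so read captures it into url:
read -r url <<<'http://example.com/%7Efnord'
echo "$url"   # prints: http://example.com/%7Efnord
```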
SOLVED!
My brother (a "retired" programmer) took some time to evaluate how Bash scripts work, and we figured out a way to make the script work more simply, by just passing the youtube-dl options and the YouTube URL as arguments.
The script changed a little bit, now it looks like this:
#!/bin/bash
function video() {
youtube-dl --no-warnings -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' --socket-timeout 15 --hls-use-mpegts -R 64 --fragment-retries 64 --prefer-free-formats --all-subs --embed-subs -f 'bestvideo[height<=720]+bestaudio/best[height<=720]' "$@"
}
video "$@"
while [ $? -ne 0 ]; do sleep 5 && video "$@" ; done
clear && echo completed!
Now I just have to write in my terminal:
list720 --playlist-start=15 https://www.youtube.com/playlist?list=PLS1QulWo1RIYmaxcEqw5JhK3b-6rgdWO_
And it works exactly as I want it to.
Thank you very much for your help and suggestions!
In bash, an argument that begins with '-' is treated as an option by most commands; if you want printf to print --playlist... you have to keep it from being parsed as one.
Use a format string: printf '%s' '--playlist...'
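A quick check that the format-string form passes the option-like string through untouched:

```shell
# With '%s' as the format, the dashes are data, not printf options:
printf '%s\n' '--playlist-start=15'
```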
I am trying to write a script that uses agrep to loop through strings in one document and match them against another document. I believe this might need a nested loop, but I am not completely sure. From the template document, it needs to take one string, match it against the strings in another document, then move on to the next string and match again.
If you are unable to see the images for some odd reason, I have included the links at the bottom here as well. Also, if you need me to explain more, just let me know. This is my first post, so I am not sure how this will be perceived or if I used the correct terminology :)
Template agrep/highlighted- https://imgur.com/kJvySbW
Matching strings not highlighted- https://imgur.com/NHBlB2R
I have already looked on various websites regarding loops
#!/bin/bash
#agrep script
echo ${BASH_VERSION}
TemplateSpacers="/Users/kj/Documents/Research/Dr. Gage Research/Thesis/FastA files for AGREP test/Template/TA21_spacers.fasta"
MatchingSpacers="/Users/kj/Documents/Research/Dr. Gage Research/Thesis/FastA files for AGREP test/Matching/TA26_spacers.fasta"
for * in filename
do
agrep -3 * to file im comparing to
#potentially may need to use nested loop but not sure
Ok, I get it now, I think. This should get you started.
#!/bin/bash
document="documentToSearchIn.txt"
grep -v spacer fileWithSearchStrings.txt | while read srchstr ; do
    echo "Searching for $srchstr in $document"
    echo agrep -3 "$srchstr" "$document"
done
If that looks correct, remove the echo before agrep and run again.
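You can dry-run the loop offline with throw-away files; the file names and strings here are made up for the demo:

```shell
# Build a demo strings file (with a 'spacer' line that grep -v skips)
# and a demo document, then run the same loop structure:
printf 'spacer\nACGT\nTTGA\n' > strings_demo.txt
document="doc_demo.txt"
printf 'dummy document\n' > "$document"

grep -v spacer strings_demo.txt | while read srchstr ; do
    echo "Searching for $srchstr in $document"
done
```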
If, as you say in the comments, you want to store the script somewhere else, say in $HOME/bin, you can do this:
mkdir $HOME/bin
Save the script above as $HOME/bin/search. Now make it executable (only necessary one time) with:
chmod +x $HOME/bin/search
Now add $HOME/bin to your PATH. So, find the line starting:
export PATH=...
in your login profile, and change it to include the new directory:
export PATH=$PATH:$HOME/bin
Then start a new Terminal and you should be able to just run:
search
If you want to be able to specify the name of the strings file and the document to search in, you can change the code to this:
#!/bin/bash
# Pick up parameters, if supplied
# 1st param is name of file with strings to search for
# 2nd param is name of document to search in
str=${1:-""}
doc=${2:-""}
# Ensure name of strings file is valid
while : ; do
    [ -f "$str" ] && break
    read -p "Enter strings filename:" str
done
# Ensure name of document file is valid
while : ; do
    [ -f "$doc" ] && break
    read -p "Enter document name:" doc
done
echo "Search for strings from: $str, searching in document: $doc"
grep -v spacer "$str" | while read srchstr ; do
    echo "Searching for $srchstr in $doc"
    echo agrep -3 "$srchstr" "$doc"
done
Then you can run:
search path/to/file/with/strings path/to/document/to/search/in
or, if you run like this:
search
it will ask you for the 2 filenames.
This is my first post, so hopefully I will make my question clear.
I am new to shell scripts, and my task with this one is to add a new value to every line of a csv file. The value to be added is based on the first 3 characters of the filename.
A bit of background: the csv files I am receiving are eventually loaded into partitioned Oracle tables. The start of the file name (e.g. BATTESTFILE.txt) identifies the partitioned site, so I need to write a script that takes the first 3 characters of the filename (in this example, BAT) and adds them to the end of each line of the file.
The closest I have got so far is when I stripped the code to the bare basics of what I need to do:
build_files()
{
OLDFILE=${filename[@]}.txt
NEWFILE=${filename[@]}.NEW.txt
ABSOLUTE='path/scripts/'
FULLOLD=$ABSOLUTE$OLDFILE
FULLNEW=$ABSOLUTE$NEWFILE
sed -e s/$/",${j}"/ "${FULLOLD}" > "${FULLNEW}"
}
set -A site 'BAT'
set -A filename 'BATTESTFILE'
for j in ${site[@]}; do
for i in ${filename[@]}; do
build_files ${j}
done
done
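For reference, the sed expression in build_files simply appends the site code (the loop's ${j}) to the end of every line; a standalone sketch:

```shell
# s/$/,BAT/ anchors at end-of-line and appends the site code:
printf 'a,b,c\nd,e,f\n' | sed -e 's/$/,BAT/'
```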
Here I have set up an array, site, as there will be 6 'sites', and this will make it easy to add additional sites to the code as the files come through to me. The same goes for the filename array.
This code works, but it isn't as automated as I need. One of my most recent attempts is below:
build_files()
{
OLDFILE=${filename[@]}.txt
NEWFILE=${filename[@]}.NEW.txt
ABSOLUTE='/app/dss/dsssis/sis/scripts/'
FULLOLD=$ABSOLUTE$OLDFILE
FULLNEW=$ABSOLUTE$NEWFILE
sed -e s/$/",${j}"/ "${FULLOLD}" > "${FULLNEW}"
}
set -A site 'BAT'
set -A filename 'BATTESTFILE'
for j in ${site[@]}; do
for i in ${filename[@]}; do
trust=echo "$filename" | cut -c1-3
echo "$trust"
if ["$trust" = 'BAT']; then
${j} = 'BAT'
fi
build_files ${j}
done
done
I found the code trust=echo "$filename" | cut -c1-3 through another question on Stack Overflow as I was researching, but it doesn't seem to work for me. I added the echo to test what trust was holding, but it was empty.
I am getting 2 errors back:
Line 17 - BATTESTFILE: not found
Line 19 - test: ] missing
Sorry for the long-winded question. Hopefully it contains helpful info and shows the steps I have taken. Any questions, comment away. Any help or guidance is very much appreciated. Thanks.
When you are new to shells, try avoiding arrays.
In an if statement, use spaces before and after the [ and ] characters.
Get used to surrounding your shell variables with {}, like ${trust}.
I do not know how you fill your array; when the array is hardcoded, try replacing it with
SITE=file1
SITE="${SITE} file2"
And you must tell unix you want the right-hand side evaluated, with $(..) (better than backticks):
trust=$(echo "${filename}" | cut -c1-3)
Some guidelines and syntax help can be found at Google
Just use shell parameter expansion:
$ var=abcdefg
$ echo "${var:0:3}"
abc
Assuming you're using a reasonably capable shell like bash or ksh, for example
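Applied to the filename from the question, no external cut process is needed:

```shell
# ${var:offset:length} takes 3 characters starting at offset 0:
filename=BATTESTFILE
trust=${filename:0:3}
echo "$trust"   # prints: BAT
```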
Just in case it is useful for anyone else now or in the future, I got my code to work as desired by using the below. Thanks to Walter A for his answer to my main problem of getting the first 3 characters from the filename and using them as a variable.
This gave me the desired output of taking the first 3 characters of the filename, and adding them to the end of each line in my csv file.
## Get the current Directory and file name, create a new file name
build_files()
{
OLDFILE=${i}.txt
NEWFILE=${i}.NEW.txt
ABSOLUTE='/app/dss/dsssis/sis/scripts/'
FULLOLD=$ABSOLUTE$OLDFILE
FULLNEW=$ABSOLUTE$NEWFILE
## Take the 3 characters from the filename and
## add them onto the end of each line in the csv file.
sed -e s/$/";${j}"/ "${FULLOLD}" > "${FULLNEW}"
}
## Loop to take the first 3 characters from the file names held in
## an array to be added into the new file above
set -A filename 'BATTESTFILE'
for i in ${filename[@]}; do
trust=$(echo "${i}" | cut -c1-3)
echo "${trust}"
j="${trust}"
echo "${i} ${j}"
build_files ${i} ${j}
done
Hope this is useful for someone else.
I have a list of 9000 URLs in a file.txt to download, keeping the same directory structure as written in the URL list.
Each URL has the form http://domain.com/$dir/$sub1/$ID/img_$ID.jpg, where $dir and $sub1 are integers from 0 to 9.
I tried running
wget -i file.txt
but it saves every img_$ID.jpg into the local dir where I am, so I get all the files in the root, losing the $dir/$sub1/$ID folder structure.
I thought have to write a script which does
mkdir -p $dir/$sub1/$ID
wget -P $dir/$sub1/$ID # (correcting a typo in the original message, which left the full path pending as "wget -P $dir/$"; it is the same path as the previous mkdir command)
for each line in file.txt, but i have no idea on where to start.
I think simple shell loop with a bit of string processing should work for you:
while read line; do
    line2=${line%/*}    # removing filename
    line3=${line2#*//}  # removing "http://"
    path=${line3#*/}    # removing "domain.com/"
    mkdir -p "$path"
    wget -P "$path" "$line"
done <file.txt
(SO's editor mis-interprets # in the expressions and colors the rest of each string as a comment; don't mind it. The actual comments are on the far right.)
Notice that the wget command is not as you described (wget -P $dir/$), but rather the one that seems more correct (wget -P $dir/$sub1/$ID). If you insist on your version, please clarify what you mean by the terminal $.
Also, for the purpose of debugging, you might want to verify the processing before you run the actual script (the one that downloads the files). You could do something like this:
while read line; do
    echo "$line"
    line2=${line%/*}    # removing filename
    echo "$line2"
    line3=${line2#*//}  # removing "http://"
    echo "$line3"
    path=${line3#*/}    # removing "domain.com/"
    echo "$path"
done <file.txt
You'll see all string processing steps and will make sure the resulting path is correct.
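Concretely, walking one sample URL (made up to match the pattern in the question) through those expansions shows what each step strips:

```shell
line='http://domain.com/3/7/123/img_123.jpg'
line2=${line%/*}    # -> http://domain.com/3/7/123   (filename removed)
line3=${line2#*//}  # -> domain.com/3/7/123          ("http://" removed)
path=${line3#*/}    # -> 3/7/123                     ("domain.com/" removed)
echo "$path"        # prints: 3/7/123
```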
First time poster, here, so go easy on me. :)
Pretty sure, from researching this question, that nobody has asked this yet.
Short version: How can I tell a shell script to use one command versus another, depending on which box I run it on? Example: on Box 1, I want to run md5 file.txt. On Box 3, I want to run md5sum file.txt. I'm thinking it's an if statement where, if md5 fails, md5sum is used instead. I just don't know how to check whether md5 failed or not.
Long version: I have 3 boxes that I work with. Box 1 and 3 are the receivers of a file from Box 2, and they receive the file when I invoke a script on box 1/3 as follows: ftpget.sh file.txt
I have a shell script that does an FTP GET and grabs a file from Box 2. It then does an md5 on the source file from Box 2 and the destination file, which'll be on Box 1 or 3, depending on which one I executed the script from. The hashes must match, of course.
The problem is this: The code is written to use md5, and while Box 1 uses md5, Box 3 uses md5sum. So when I execute the script from Box 1, it works great. When I execute the script from Box 3, it fails because Box 3 uses md5sum, not md5.
So I was thinking: what's the best way to handle this? I can't install anything since I'm not an admin, and the people who manage the machine probably won't do it for me anyway. Could I just create an alias in my .profile which goes something like: alias md5="md5sum"? That way, when the script runs on Box 3, it'll execute md5 file.txt but the system will really execute md5sum file.txt since I created the alias.
Thoughts? Better ideas? :)
I don't know what shell you're using. This is for bash:
#!/bin/sh
md5=$(which md5)
if [ ! "${md5}" ] ; then
    md5=$(which md5sum)
    if [ ! "${md5}" ] ; then
        echo "neither md5 nor md5sum found"
        exit 1
    fi
fi
${md5} "$@"  # checksum whatever files were passed as arguments
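A variant of the same idea using the POSIX command -v builtin instead of which (a sketch; command -v is generally more reliable than which in scripts):

```shell
#!/bin/sh
# Fall back from md5 to md5sum; fail loudly if neither exists.
md5=$(command -v md5 || command -v md5sum) || {
    echo "neither md5 nor md5sum found" >&2
    exit 1
}
"$md5" "$@"
```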
There are many possible solutions, as is often the case. Perhaps the following script, called somesum(1) would suffice ...
#!/bin/sh
# ident "@(#)somesum.sh: Find a command to do a checksum of a file"
################################################################################
export CMD
CMD=''
# if can find the command in the search path and CMD is not set,
# set CMD to the command name ...
[ "$CMD" = '' -a -f "`which md5 2>/dev/null`" ] && CMD=md5
[ "$CMD" = '' -a -f "`which md5sum 2>/dev/null`" ] && CMD=md5sum
################################################################################
# if command was found execute it else complain could not find desired commands
if [ "$CMD" != '' ]
then
    $CMD "$@"
else
    echo "could not find md5sum or md5" 1>&2
    exit 1
fi
exit
Other options include installing the preferred command on the other platforms in your own search path, or using the hostname(1) command to figure out which platform you are on. I am assuming in the example that you are on a platform with a Bourne-style shell (sh/ksh/pdksh/bash/...).
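If you would rather branch on the platform than probe for the command, uname(1) is steadier than hostname(1) for this; a sketch (the OS-to-command mapping is my assumption: macOS and the BSDs ship md5, most Linux systems ship md5sum):

```shell
#!/bin/sh
# Choose the checksum command from the OS name reported by uname -s.
case "$(uname -s)" in
    Darwin|FreeBSD|OpenBSD) CMD=md5 ;;
    *)                      CMD=md5sum ;;
esac
echo "using: $CMD"
```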