wget jpg from url list keeping same structure - bash

I have a list of 9000 URLs in a file.txt to scrape, keeping the same directory structure as written in the URL list.
Each URL has the form http://domain.com/$dir/$sub1/$ID/img_$ID.jpg, where $dir and $sub1 are integers from 0 to 9.
I tried running
wget -i file.txt
but it drops every img_$ID.jpg into the local dir where I am, so I get all the files in one place, losing the $dir/$sub1/$ID folder structure.
I thought I would have to write a script which does
mkdir -p $dir/$sub1/$ID
wget -P $dir/$sub1/$ID   # correcting a typo in the message: I originally left the path incomplete; it should be the same path as in the mkdir command
for each line in file.txt, but I have no idea where to start.

I think a simple shell loop with a bit of string processing should work for you:
while read -r line; do
line2=${line%/*} # removing filename
line3=${line2#*//} # removing "http://"
path=${line3#*/} # removing "domain.com/"
mkdir -p "$path"
wget -P "$path" "$line"
done < file.txt
(SO's editor mis-interprets # in the expression and colors the rest of the string as comment - don't mind it. The actual comments are on the very right.)
Notice that the wget command is not the one you described (wget -P $dir/$), but rather the one that seems more correct (wget -P $dir/$sub1/$ID). If you insist on your version, please clarify what you mean by the terminal $.
Also, for the purpose of debugging, you might want to verify the processing before you run the actual script (the one that copies the files) - you could do something like this:
while read -r line; do
echo "$line"
line2=${line%/*} # removing filename
echo "$line2"
line3=${line2#*//} # removing "http://"
echo "$line3"
path=${line3#*/} # removing "domain.com/"
echo "$path"
done < file.txt
You'll see all the string processing steps and can make sure the resulting path is correct.
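For instance (using a made-up example URL of the form you described), http://domain.com/3/7/123/img_123.jpg would be processed like this:

line:  http://domain.com/3/7/123/img_123.jpg
line2: http://domain.com/3/7/123              # filename removed
line3: domain.com/3/7/123                     # "http://" removed
path:  3/7/123                                # "domain.com/" removed

so wget -P "$path" "$line" saves the file as 3/7/123/img_123.jpg, mirroring the structure in the URL.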

Related

Problems with escaping in heredocs

I am writing a Jenkins job that will move files between two chrooted directories on a remote server.
This uses a Jenkins multiline string variable to store one or more file names, one per line.
The following will work for files without special characters or spaces:
## Jenkins parameters
# accountAlias = "test"
# sftpDir = "/path/to/chrooted home"
# srcDir = "/path/to/get/files"
# destDir = "/path/to/put/files"
# fileName = "file names"  # multiline Jenkins shell parameter, one file name per line
#!/bin/bash
ssh user@server << EOF
#!/bin/bash
printf "\nCopying following file(s) from "${accountAlias}"_old account to "${accountAlias}"_new account:\n"
# Exit if no filename is given so Rsync does not copy all files in src directory.
if [ -z "${fileName}" ]; then
printf "\n***** At least one filename is required! *****\n"
exit 1
else
# While reading each line of fileName
while IFS= read -r line; do
printf "\n/"${sftpDir}"/"${accountAlias}"_old/"${srcDir}"/"\${line}" -> /"${sftpDir}"/"${accountAlias}"_new/"${destDir}"/"\${line}"\n"
# Rsync the files from old account to new account
# -v | verbose
# -c | replace existing files based on checksum, not timestamp or size
# -r | recursively copy
# -t | preserve timestamps
# -h | human readable file sizes
# -P | resume incomplete files + show progress bars for large files
# -s | Sends file names without interpreting special chars
sudo rsync -vcrthPs /"${sftpDir}"/"${accountAlias}"_old/"${srcDir}"/"\${line}" /"${sftpDir}"/"${accountAlias}"_new/"${destDir}"/"\${line}"
done <<< "${fileName}"
fi
printf "\nEnsuring all new files are owned by the "${accountAlias}"_new account:\n"
sudo chown -vR "${accountAlias}"_new:"${accountAlias}"_new /"${sftpDir}"/"${accountAlias}"_new/"${destDir}"
EOF
Using the file name "sudo bash -c 'echo "hello" > f.txt'.txt" as a test, my script will fail after the "sudo" in the file name.
I believe my problem is that my $line variable is not properly quoted or escaped, resulting in bash not treating the $line value as one string.
I have tried single quotes and using awk/sed to insert backslashes into the variable string, but this hasn't worked.
My theory is I am running into a problem with special chars and heredocs.
Although it's unclear to me from your description exactly what error you are encountering or where, you do have several problems in the script presented.
The main one might simply be the sudo command that you're trying to execute on the remote side. Unless the remote user has passwordless sudo privileges (rather dangerous), sudo will prompt for a password and attempt to read it from the user's terminal. You are not providing a password. You could probably just interpolate it into the command stream (in the here doc) if in fact you collect it. Nevertheless, there is still a potential problem with that, as you perform potentially many sudo commands, and they may or may not request passwords depending on the remote sudo configuration and the time between sudo commands. It would be best to structure the command stream so that only one sudo execution is required.
Additional considerations follow.
## Jenkins parameters
# accountAlias = "test"
# sftpDir = "/path/to/chrooted home"
# srcDir = "/path/to/get/files"
# destDir = "/path/to/put/files"
# fileName = "file names"  # multiline Jenkins shell parameter, one file name per line
#!/bin/bash
The #!/bin/bash there is not the first line of the script, so it does not function as a shebang line. Instead, it is just an ordinary comment. As a result, when the script is executed directly, it might or might not be bash that runs it, and if it is bash, it might or might not be running in POSIX compatibility mode.
ssh user@server << EOF
#!/bin/bash
This #!/bin/bash is not a shebang line either, because that applies only to scripts read from regular files. As a result, the following commands are run by user's default shell, whatever that happens to be. If you want to ensure that the rest is run by bash, then perhaps you should execute bash explicitly.
printf "\nCopying following file(s) from "${accountAlias}"_old account to "${accountAlias}"_new account:\n"
The two expansions of $accountAlias (by the local shell) result in unquoted text passed to printf in the remote shell. You could consider just removing the de-quoting, but that would still leave you susceptible to malicious accountAlias values that included double-quote characters. Remember that these will be expanded on the local side, before the command is sent over the wire, and then the data will be processed by a remote shell, which is the one that will interpret the quoting.
This can be resolved by
Outside the heredoc, preparing a version of the account alias that can be safely presented to the remote shell
accountAlias_safe=$(printf %q "$accountAlias")
and
Inside the heredoc, expanding it unquoted. I would furthermore suggest passing it as a separate argument instead of interpolating it into the larger string.
printf "\nCopying following file(s) from %s_old account to %s_new account:\n" ${accountAlias_safe} ${accountAlias_safe}
Similar applies to most of the other places where variables from the local shell are interpolated into the heredoc.
Here ...
# Exit if no filename is given so Rsync does not copy all files in src directory.
if [ -z "${fileName}" ]; then
... why are you performing this test on the remote side? You would save yourself some trouble by performing it on the local side instead.
Here ...
printf "\n/"${sftpDir}"/"${accountAlias}"_old/"${srcDir}"/"\${line}" -> /"${sftpDir}"/"${accountAlias}"_new/"${destDir}"/"\${line}"\n"
... the remote shell variable $line is used unquoted in the printf command. Its appearance should be quoted. Also, since you use the source and destination names twice each, it would be cleaner and clearer to put them in (remote-side) variables. And, if the directory names have the form presented in the comments in the script, then you are introducing excess / characters (though these are probably not harmful).
Good for you, documenting the meaning of all the rsync options used, but why are you sending all that over the wire to the remote side?
Also, you probably want to include rsync's -p option to preserve the same permissions. Possibly you want to include the -l option too, to copy symbolic links as symbolic links.
Putting all that together, something more like this (untested) is probably in order:
#!/bin/bash
## Jenkins parameters
# accountAlias = "test"
# sftpDir = "/path/to/chrooted home"
# srcDir = "/path/to/get/files"
# destDir = "/path/to/put/files"
# fileName = "file names"  # multiline Jenkins shell parameter, one file name per line
# Exit if no filename is given so Rsync does not copy all files in src directory.
if [ -z "${fileName}" ]; then
printf "\n***** At least one filename is required! *****\n"
exit 1
fi
accountAlias_safe=$(printf %q "$accountAlias")
sftpDir_safe=$(printf %q "$sftpDir")
srcDir_safe=$(printf %q "$srcDir")
destDir_safe=$(printf %q "$destDir")
fileName_safe=$(printf %q "$fileName")
IFS= read -r -p 'password for user@server: ' -s -t 60 password || {
echo 'password not entered in time' 1>&2
exit 1
}
# Rsync options used:
# -v | verbose
# -c | replace existing files based on checksum, not timestamp or size
# -r | recursively copy
# -t | preserve timestamps
# -h | human readable file sizes
# -P | resume incomplete files + show progress bars for large files
# -s | Sends file names without interpreting special chars
# -p | preserve file permissions
# -l | copy symbolic links as links
ssh user@server /bin/bash << EOF
printf "\nCopying following file(s) from %s_old account to %s_new account:\n" ${accountAlias_safe} ${accountAlias_safe}
sudo /bin/bash -c '
while IFS= read -r line; do
src=${sftpDir_safe}/${accountAlias_safe}_old${srcDir_safe}"/\${line}"
dest=${sftpDir_safe}/${accountAlias_safe}_new${destDir_safe}"/\${line}"
printf "\n\${src} -> \${dest}\n"
rsync -vcrthPspl "\${src}" "\${dest}"
done <<<'${fileName_safe}'
printf "\nEnsuring all new files are owned by the %s_new account:\n" ${accountAlias_safe}
chown -vR ${accountAlias_safe}_new:${accountAlias_safe}_new ${sftpDir_safe}/${accountAlias_safe}_new${destDir_safe}
'
${password}
EOF

Bash insert saved file name to variable

The script below downloads files using curl. Inside the loop I'm trying to save the file and also to store the saved file name in a variable, and then to print it.
My script downloads and saves the file, but it can't echo the saved file name:
for link in $url2; do
cd /var/script/twitter/html_files/ && file1=$({ curl -O $link ; cd -; })
echo $file1
done
Script explanation:
$url2 contains one or more URLs
curl -O writes the output to a file named like the remote file
Your code has several problems. Assuming $url2 is a list of valid URLs which do not require shell quoting, you can make curl print the output file name directly.
cd /var/script/twitter/html_files
for link in $url2; do
curl -s -w '%{filename_effective}\n' -O "$link"
done
Without the -w formatstring option, the output of curl does not normally contain the output file name in a conveniently machine-readable format (or actually at all). I also added an -s option to disable the download status output it prints by default.
There is no point in doing cd to the same directory over and over again inside the loop, or in capturing the output into a variable which you only use once, just to print to standard output the string which curl would otherwise have printed to standard output by itself.
Finally, the cd - does not seem to do anything useful here; even if it did something useful per se, you are doing it in a subshell, which doesn't change the current working directory of the script which contains the $(cd -) command substitution.
If your task is to temporarily switch to that directory and then switch back to where you started, you only need to cd once each way. You can use cd - in Bash, but a slightly more robust and portable solution is to run the fetch in a subshell.
( cd directory;
do things ...
)
# you are now back to where you were before the cd
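Applied to your loop, that pattern might look something like this (a sketch, keeping the /var/script/twitter/html_files directory from your question):

(
    cd /var/script/twitter/html_files || exit 1
    for link in $url2; do
        curl -s -w '%{filename_effective}\n' -O "$link"
    done
)
# back in the original working directory here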
If you genuinely need the variable, you can trivially use
for link in $url2; do
file1=$(curl -s -w '%{filename_effective}' -O "$link")
echo "$file1"
done
but obviously the variable will only contain the result from the last iteration after the loop (in the code after done). (The format string doesn't need the final \n here because the command substitution will trim off any trailing newline anyway.)
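If you really do need every saved name after the loop, one option (a sketch, not something from your original script) is to collect the names in an array instead of a scalar:

files=()
for link in $url2; do
    files+=( "$(curl -s -w '%{filename_effective}' -O "$link")" )
done
printf '%s\n' "${files[@]}"    # all saved file names, one per line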

no such file or directory error when using variables (works otherwise)

I am new to programming and just starting in bash.
I'm trying to print a list of directories and files to a txt file, and remove some of the path that gets printed to make it cleaner.
It works with this:
TODAY=$(date +"%Y-%m-%d")
cd
cd Downloads
ls -R ~/Music/iTunes/iTunes\ Media/Music | sed 's/\/Users\/BilPaLo\/Music\/iTunes\/iTunes\ Media\/Music\///g' > music-list-$TODAY.txt
But to clean it up I want to use variables like so,
# Creates a string of the date, format YYYY-MM-DD
TODAY="$(date +"%Y-%m-%d")"
# Where my music folders are
MUSIC="$HOME/Music/iTunes/iTunes\ Media/Music/"
# Where I want it to go
DESTINATION="$HOME/Downloads/music-list-"$TODAY".txt"
# Path name to be removed from text file
REMOVED="\/Users\/BilPaLo\/Music\/iTunes\/iTunes\ Media\/Music\/"
ls -R "$MUSIC" > "$DESTINATION"
sed "s/$REMOVED//g" > "$DESTINATION"
but it gives me a 'no such file or directory' error that I can't seem to get around.
I'm sure there are many other problems with this code but this one I don't understand.
Thank you everyone! I followed the much-needed formatting advice and @amo-ej1's answer and now this works:
# Creates a string of the date format YYYY-MM-DD
today="$(date +"%Y-%m-%d")"
# Where my music folders are
music="$HOME/Music/iTunes/iTunes Media/Music/"
# Where I want it to go
destination="$HOME/Downloads/music-list-$today.txt"
# Temporary file
temp="$HOME/Downloads/temp.txt"
# Path name to be removed of text file to only leave artist name and album
remove="\\/Users\\/BilPaLo\\/Music\\/iTunes\\/iTunes\\ Media\\/Music\\/"
# lists all children of music and writes it in temp
ls -R "$music" > "$temp"
# substitutes remove by nothing and writes it in destination
sed "s/$remove//g" "$temp" > "$destination"
rm $temp #deletes temp
First, when debugging bash it can be helpful to start bash with the -x flag (bash -x script.sh) or to add set -x inside the script; that way bash will print out the commands it is executing (with the variable expansions) and you can more easily spot errors.
In this specific snippet the ls output is being redirected to a file called $DESTINATION, and sed then reads from standard input while also writing to $DESTINATION. So however you intended to replace the pipe from your one-liner, this isn't it: sed has been left with no input. As a result the program will look as if it is blocked, but sed is simply waiting for input to arrive on standard input.
As for the 'no such file or directory', try executing with set -x and double-check the paths it is trying to access.
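For completeness, a minimal sketch of the pipe-based version of your one-liner (using the lower-case variable names from your edit; the quotes keep paths with spaces intact):

ls -R "$music" | sed "s/$remove//g" > "$destination"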

ftp script in bash

I have the following script that pushes files to remote location:
#!/usr/bin/bash
HOST1='a.b.c.d'
USER1='load'
PASSWD1='load'
DATE=`date +%Y%m%d%H%M`
DATE2=`date +%Y%m%d%H`
DATE3=`date +%Y%m%d`
FTPLOGFILE=/logs/Done.$DATE2.log
D_FOLDER='/dir/load01/input'
PUTFILE='file*un'
ls $PUTFILE | while read file
do
echo "${file} transfered at $DATE" >> /logs/$DATE3.log
done
ftp -n -v $HOST1 <<SCRIPT >> ${FTPLOGFILE} 2>&1
quote USER $USER1
quote PASS $PASSWD1
cd $D_FOLDER
ascii
prompt off
mput /data/file*un
quit
SCRIPT
mv *un test/
ls test/*un | awk '{print("mv "$1" "$1)}' | sed 's/\.un/\.processed/2' |sh
rm *unl
I am getting this error output:
200 PORT command successful.
553 /data/file1.un: A file or directory in the path name does not exist.
200 PORT command successful.
Some improvements:
#!/usr/bin/bash
HOST1='a.b.c.d'
USER1='load'
PASSWD1='load'
read Y m d H M <<<$(date "+%Y %m %d %H %M") # only one call to date
DATE="$Y$m$d$H$M"
DATE2="$Y$m$d$H"
DATE3="$Y$m$d"
FTPLOGFILE=/logs/Done.$DATE2.log
D_FOLDER='/dir/load01/input'
PUTFILE='file*un'
for file in $PUTFILE # no need for ls
do
echo "${file} transfered at $DATE"
done >> /logs/$DATE3.log # output can be done all at once at the end of the loop.
ftp -n -v $HOST1 <<SCRIPT >> ${FTPLOGFILE} 2>&1
quote USER $USER1
quote PASS $PASSWD1
cd $D_FOLDER
ascii
prompt off
mput /data/file*un
quit
SCRIPT
mv *un test/
for f in test/*un # no need for ls and awk
do
mv "$f" "${f/%.un/.processed}"
done
rm *unl
I recommend using lower case or mixed case variables to reduce the chance of name collisions with shell variables.
Are all those directories really directly off the root directory?
Ftp to the remote site and execute the ftp commands by hand. When the error occurs, look around to see what the cause is. (Use "help" if you don't know the ftp command line.)
Probably the /data directory does not exist. Has anyone reorganized the upload directory recently, or maybe moved the root directory of the ftp server?
The problem with scripting an FTP session is that the ftp client considers itself to have executed correctly as long as it has reported any errors to stdout. Consequently, it's devilishly hard to pick up errors, since it will only return a failing status on something catastrophic. If you need anything more than the most simple of command lists, you should really be using something like expect, or a Java or Perl program that can easily test the result of each action.
That said, you can run ftp as a coprocess, or set it up so that it runs in the background with its stdin and stdout fitted to named pipes, or some structure like that where you can read and parse the output from one command before deciding what to pass in for the next one.
A read loop that cycles on a case statement which tests for known responses and behaves accordingly is a passably acceptable all-bash version. If you always terminate every command block with something like an image command that returns a fixed and known reply, you can scan for known errors, check for the reply from that command in the case statement, and when you get the "sentinel" reply, loop back and read the next input. This makes for a largish and fairly complicated shell script, though.
Also, you need to check that when you match (for example) a 5[0-9][0-9] *) reply it isn't actually a transfer-statistics line like "553 bytes sent", because ftp screws you that way too.
Apologies for the length of the answer - I just wanted to mention some ideas and caveats that wouldn't fit readably in a comment.
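To make the coprocess-plus-sentinel idea above a bit more concrete, here is a rough, untested sketch (it needs bash 4 for coproc); the sentinel command (pwd, answered with a 257 reply) and the error handling are only illustrative, and the variables are the ones from the script above:

# Run ftp as a coprocess so each reply can be inspected before deciding what to send next.
coproc FTP { ftp -n -v "$HOST1" 2>&1; }

send() { printf '%s\n' "$*" >&"${FTP[1]}"; }

send "quote USER $USER1"
send "quote PASS $PASSWD1"
send "cd $D_FOLDER"
send "ascii"
send "prompt off"
send "mput /data/file*un"
send "pwd"          # sentinel: its 257 reply tells us the mput block has finished

while IFS= read -r reply <&"${FTP[0]}"; do
    case $reply in
        5[0-9][0-9]\ *) echo "FTP error: $reply" >&2 ;;   # permanent-failure reply codes
        257\ *)         break ;;                          # sentinel reached
    esac
done
send "quit"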

BASH script: Downloading consecutive numbered files with wget

I have a web server that saves the log files of a web application numbered sequentially. An example file name would be:
dbsclog01s001.log
dbsclog01s002.log
dbsclog01s003.log
The last 3 digits are the counter and it can sometimes go up to 100.
I usually open a web browser, browse to the file like:
http://someaddress.com/logs/dbsclog01s001.log
and save the files. This of course gets a bit annoying when you get 50 logs.
I tried to come up with a BASH script for using wget and passing
http://someaddress.com/logs/dbsclog01s*.log
but I am having problems with my script.
Anyway, does anyone have a sample of how to do this?
thanks!
#!/bin/sh
if [ $# -lt 3 ]; then
echo "Usage: $0 url_format seq_start seq_end [wget_args]"
exit
fi
url_format=$1
seq_start=$2
seq_end=$3
shift 3
printf "$url_format\\n" `seq $seq_start $seq_end` | wget -i- "$#"
Save the above as seq_wget, give it execution permission (chmod +x seq_wget), and then run, for example:
$ ./seq_wget http://someaddress.com/logs/dbsclog01s%03d.log 1 50
Or, if you have Bash 4.0, you could just type
$ wget http://someaddress.com/logs/dbsclog01s{001..050}.log
Or, if you have curl instead of wget, you could follow Dennis Williamson's answer.
curl seems to support ranges. From the man page:
URL
The URL syntax is protocol dependent. You’ll find a detailed description in RFC 3986.
You can specify multiple URLs or parts of URLs by writing part sets
within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
No nesting of the sequences is supported at the moment, but you can use
several ones next to each other:
http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
You can specify any amount of URLs on the command line. They will be
fetched in a sequential manner in the specified order.
Since curl 7.15.1 you can also specify step counter for the ranges, so
that you can get every Nth number or letter:
http://www.numericals.com/file[1-100:10].txt
http://www.letters.com/file[a-z:2].txt
You may have noticed that it says "with leading zeros"!
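So, for the file names in the question, something like this should fetch the whole zero-padded range in one go (untested against that server; adjust the upper bound as needed, and note the quotes keep the shell from touching the brackets):

curl -O "http://someaddress.com/logs/dbsclog01s[001-100].log"
# or name the local files explicitly using the #1 range variable:
curl "http://someaddress.com/logs/dbsclog01s[001-100].log" -o "dbsclog01s#1.log"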
You can use echo-type brace sequences in the wget URL to download a run of numbered files...
wget http://someaddress.com/logs/dbsclog01s00{1..3}.log
This also works with letters
{a..z} {A..Z}
Not sure precisely what problems you were experiencing, but it sounds like a simple for loop in bash would do it for you.
for i in {1..999}; do
wget -k http://someaddress.com/logs/dbsclog01s$i.log -O your_local_output_dir_$i;
done
You can use a combination of a for loop in bash with the printf command (of course modifying echo to wget as needed):
$ for i in {1..10}; do echo "http://www.com/myurl`printf "%03d" $i`.html"; done
http://www.com/myurl001.html
http://www.com/myurl002.html
http://www.com/myurl003.html
http://www.com/myurl004.html
http://www.com/myurl005.html
http://www.com/myurl006.html
http://www.com/myurl007.html
http://www.com/myurl008.html
http://www.com/myurl009.html
http://www.com/myurl010.html
Interesting task, so I wrote a full script for you (combining several answers and more). Here it is:
#!/bin/bash
# fixed vars
URL=http://domain.com/logs/ # URL address 'till logfile name
PREF=logprefix # logfile prefix (before number)
POSTF=.log # logfile suffix (after number)
DIGITS=3 # how many digits logfile's number have
DLDIR=~/Downloads # download directory
TOUT=5 # timeout for quit
# code
for((i=1;i<10**$DIGITS;++i))
do
file=$PREF`printf "%0${DIGITS}d" $i`$POSTF # local file name
dl=$URL$file # full URL to download
echo "$dl -> $DLDIR/$file" # monitoring, can be commented
wget -T "$TOUT" -q "$dl" -O "$DLDIR/$file"
if [ "$?" -ne 0 ] # test if we finished
then
exit
fi
done
At the beginning of the script you can set the URL, the log file prefix and suffix, how many digits the log file's number has, and the download directory. The loop will download all the logfiles it finds, and automatically exit at the first non-existent one (using wget's timeout).
Note that this script assumes that the logfile indexing starts at 1, not zero, as in your example.
Hope this helps.
Here you can find a Perl script that looks like what you want
http://osix.net/modules/article/?id=677
#!/usr/bin/perl
$program="wget"; #change this to proz if you have it ;-)
my $count=1; #the lesson number starts from 1
my $base_url= "http://www.und.nodak.edu/org/crypto/crypto/lanaki.crypt.class/lessons/lesson";
my $format=".zip"; #the format of the file to download
my $max=24; #the total number of files to download
my $url;
for($count=1;$count<=$max;$count++) {
if($count<10) {
$url=$base_url."0".$count.$format; #insert a '0' and form the URL
}
else {
$url=$base_url.$count.$format; #no need to insert a zero
}
system("$program $url");
}
I just had a look at the wget manpage discussion of 'globbing':
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.
You may have to quote the URL to protect it from being expanded by your shell. Globbing makes Wget look for a directory listing, which is system-specific. This is why it currently works only with Unix FTP servers (and the ones emulating Unix "ls" output).
So wget http://... won't work with globbing.
Check to see if your system has seq, then it would be easy:
for i in $(seq -f "%03g" 1 10); do wget "http://.../dbsclog${i}.log"; done
If your system has the jot command instead of seq:
for i in $(jot -w "http://.../dbsclog%03d.log" 10); do wget $i; done
Oh! This is a similar problem I ran into when learning bash to automate manga downloads.
Something like this should work:
for a in `seq 1 999`; do
if [ ${#a} -eq 1 ]; then
b="00"
elif [ ${#a} -eq 2 ]; then
b="0"
else
b="" # three digits already - no padding needed
fi
echo "$a of 231"
wget -q http://site.com/path/fileprefix$b$a.jpg
done
Late to the party, but a real easy solution that requires no coding is to use the DownThemAll Firefox add-on, which has the functionality to retrieve ranges of files. That was my solution when I needed to download 800 consecutively numbered files.
