How to download a file with wget and save it according to the http-reported filename? - bash

When you request a file with wget and that file is being served by some dynamic page (e.g. php), wget will try to use the path to that dynamic page (usually looking as if an angry child got hold of your keyboard: index.php?a8s7df6a8s=d6fa8sd6f90v78wg&l45i87ylqwiu45h=j76h2g461k326v).
However, these pages usually send an HTTP header with the file so that user agents can display a sensible file name. How do I get wget to listen to that and use it (instead of the url) to determine the name under which to save the file?

I found that a way to do this was to use the --server-response flag with --spider and invoke wget twice (there is certainly room for improvement, there!)
Assume the url to be in $link:
wget --quiet --server-response --spider -O /dev/null -- "$link" 2>&1 \
| sed -n 's/^.*filename=\([^;]*\)\(;.*\)\?$/\1/p' \
| while read name; do
wget -O "$name" -- "$link"
break
done
Seems to work like a charm for me.
Possibly, there is a direct way, though. This creates (completely unnecessarily) two connections to the server.

Related

Bash script to wget url starting with a specific character

I have a URL http://example.com/dir that has many subdirectories with files that I want to save. Because its size is very big I want to break this operation in parts
eg. download everything from subdirectories starting with A like
http://example.com/A
http://example.com/Aa
http://example.com/Ab
etc
I have created the following script
#!/bin/bash
for g in A B C
do wget -e robots=off -r -nc -np -R "index.html*" http://example.com/$g
done
but it tries to download only http://example.com/A and not http://example.com/A*
Look at this page, it has all you need to know:
https://www.gnu.org/software/wget/manual/wget.html
1) You could use:
--spider -nd -r -o outputfile <domain>
which does not download the files, it just checks if they are there.
-nd prevents wget from creating directories locally
-r to parse entire site
-o outputfile to send the output to a file
to get a list of URLs to download.
2) then parse the outputfile to extract the files, and create smaller lists of links you want to download.
3) then use -i file (== --input-file=file) to download each list, thus limiting how many you download in one execution of wget.
Notes:
- --limit-rate=amount can be used to slow down downloads, to spare your Internet link!

Check if a remote file exists in bash

I am downloading files with this script:
parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
Would it be possible to not download files, just check them on the remote side and if exists create a dummy file instead of downloading?
Something like:
if wget --spider $url 2>/dev/null; then
#touch img.file
fi
should work, but I don't know how to combine this code with GNU Parallel.
Edit:
Based on Ole's answer I wrote this piece of code:
#!/bin/bash
do_url() {
url="$1"
wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
#get filename from $url
url2=${url##*/}
wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url
parallel --progress -a urls.txt do_url {}
It works, but it fails for some files. I can not find consistency why it works for some files, why it fails for others. Maybe it has something with the last filename. Second wget tries to access the currect url, but the touch command after that simply does not create the desidered file. First wget always (correctly) downloads the main image without the _001.jpg, _002.jpg.
Example urls.txt:
http://host.com/092401.jpg (works correctly, _001.jpg.._005.jpg are downloaded)
http://host.com/HT11019.jpg (not works, only the main image is downloaded)
It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.
I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:
http://example.com/dira/foo.jpg
becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.
If the URL exists I want an empty file created.
(Version 1): I want the empty file created in a the similar directory structure in the dir images. This is needed because some of the images have the same name, but in different dirs.
So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.
So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images called the name from urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg
#!/bin/bash
do_url() {
url="$1"
# Version 1:
# If you want to keep the folder structure from the server (similar to wget -m):
wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$3"
# Version 3:
# If all the images have unique names when _###.jpg is removed and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url
parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100% you can run more jobs in parallel:
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
You can also "unroll" the loop. This will save 5 overheads per URL:
do_url() {
url="$1"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
}
export -f do_url
parallel -j0 do_url {.} :::: urls.txt
Finally you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
You may use curl instead to check if the URLs you are parsing are there without downloading any file as such:
if curl --head --fail --silent "$url" >/dev/null; then
touch .images/"${url##*/}"
fi
Explanation:
--fail will make the exit status nonzero on a failed request.
--head will avoid downloading the file contents
--silent will avoid status or errors from being emitted by the check itself.
To solve the "looping" issue, you can do:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if curl --head --silent --fail "$url" > /dev/null; then
touch .images/${url##*/}
fi
done
From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather on how to perform correct looping in a shell script.
Here is a simple solution for that:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if wget -q --method=HEAD "$url"; then
touch .images/${url##*/}
fi
done
What this does is that it invokes Wget with the --method=HEAD option. With the HEAD request, the server will simply report back whether the file exists or not, without returning any data.
Of course, with a large data set this is pretty inefficient. You're creating a new connection to the server for every file you're trying. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2, you can test all of these in parallel, and use the new --stats-server option to find a list of all the files and the specific return code that the server provided. For example:
$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:
http://example.com:
Status No. of docs
404 3
http://example.com/3 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
http://example.com/1 0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
http://example.com/2 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
200 1
http://example.com/ 0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)
You can even get this data printed as a CSV or JSON for easier parsing
Just loop over the names?
for uname in ${url%.jpg}_{001..005}.jpg
do
if wget --spider $uname 2>/dev/null; then
touch ./images/${uname##*/}
fi
done
You could send a command via ssh to see if the remote file exists and cat it if it does:
ssh your_host 'test -e "somefile" && cat "somefile"' > somefile
Could also try scp which supports glob expressions and recursion.

How to search for a string in a text file and perform a specific action based on the result

I have very little experience with Bash but here is what I am trying to accomplish.
I have two different text files with a bunch of server names in them. Before installing any windows updates and rebooting them, I need to disable all the nagios host/service alerts.
host=/Users/bob/WSUS/wsus_test.txt
password="my_password"
while read -r host
do
curl -vs -o /dev/null -d "cmd_mod=2&cmd_typ=25&host=$host&btnSubmit=Commit" "https://nagios.fqdn.here/nagios/cgi-bin/cmd.cgi" -u "bob:$password" -k
done < wsus_test.txt >> /Users/bob/WSUS/diable_test.log 2>&1
This is a reduced form of my current code which works as intended, however, we have servers in a bunch of regions. Each server name is prepended with a 3 letter code based on region (ie, LAX, NYC, etc). Secondly, we have a nagios server in each region so I need the code above to be connecting to the correct regional nagios server based on the server name being passed in.
I tried adding 4 test servers into a text file and just adding a line like this:
if grep lax1 /Users/bob/WSUS/wsus_text.txt; then
<same command as above but with the regional nagios server name>
fi
This doesn't work as intended and nothing is actually disabled/enabled via API calls. Again, I've done very little with Bash so any pointers would be appreciated.
Extract the region from host name and use it in the Nagios URL, like this:
while read -r host; do
region=$(cut -f1 -d- <<< "$host")
curl -vs -o /dev/null -d "cmd_mod=2&cmd_typ=25&host=$host&btnSubmit=Commit" "https://nagios-$region.fqdn.here/nagios/cgi-bin/cmd.cgi" -u "bob:$password" -k
done < wsus_test.txt >> /Users/bob/WSUS/diable_test.log 2>&1

using wget to download a directory

I'm trying to download all the files in an online directory. The command I'm using is:
wget -r -np -nH -R index.html
http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
Using this command I get an empty directory. If I specify file names at the end I can get one at a time, but I'd like to get them all at once. Am I just missing something simple?
output from command:
--2015-03-14 14:54:05-- http://www.oecd-nea.org/dbforms/data/evaevatapes/mendl_2/
Resolving www.oecd-nea.org... 193.51.64.80
Connecting to www.oecd-nea.org|193.51.64.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: âdbforms/data/eva/evatapes/mendl_2/index.htmlâdbforms/data/eva/evatapes/mendl_2/index.htmlârobots.txtârobots.txt
Add the depth of links you want to follow (-l1, since you only want to follow one link):
wget -e robots=off -l1 -r -np -nH -R index.html http://www.oecd-nea.org/dbforms/data/eva/evatapes/mendl_2/
I also added -e robots=off, since there is a robots.txt which would normally stop wget from going through that directory. For the rest of the world:
-r recursive,
-np no parent directory
-nH no spanning across hosts

Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:
http://mysite.com/configs/.vim/
.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?
You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:
wget --recursive --no-parent http://example.com/configs/.vim/
To avoid downloading the auto-generated index.html files, use the -R/--reject option:
wget -r -np -R "index.html*" http://example.com/configs/.vim/
To download a directory recursively, which rejects index.html* files and downloads without the hostname, parent directory and the whole directory structure :
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
For anyone else that having similar issues. Wget follows robots.txt which might not allow you to grab the site. No worries, you can turn it off:
wget -e robots=off http://www.example.com/
http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
You should use the -m (mirror) flag, as that takes care to not mess with timestamps and to recurse indefinitely.
wget -m http://example.com/configs/.vim/
If you add the points mentioned by others in this thread, it would be:
wget -m -e robots=off --no-parent http://example.com/configs/.vim/
Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):
wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/
If --no-parent not help, you might use --include option.
Directory struct:
http://<host>/downloads/good
http://<host>/downloads/bad
And you want to download downloads/good but not downloads/bad directory:
wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good
wget -r http://mysite.com/configs/.vim/
works for me.
Perhaps you have a .wgetrc which is interfering with it?
First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:
wget --recursive ${comment# self-explanatory} \
--no-parent ${comment# will not crawl links in folders above the base of the URL} \
--convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
--random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
--no-host-directories ${comment# do not create folders with the domain name} \
--execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
--level=inf --accept '*' ${comment# do not limit to 5 levels or common file formats} \
--reject="index.html*" ${comment# use this option if you need an exact mirror} \
--cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL
Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.
Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.
To fetch a directory recursively with username and password, use the following command:
wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/
This version downloads recursively and doesn't create parent directories.
wgetod() {
NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}
Usage:
Add to ~/.bashrc or paste into terminal
wgetod "http://example.com/x/"
The following option seems to be the perfect combination when dealing with recursive download:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
All you need is two flags, one is "-r" for recursion and "--no-parent" (or -np) in order not to go in the '.' and ".." . Like this:
wget -r --no-parent http://example.com/configs/.vim/
That's it. It will download into the following local tree: ./example.com/configs/.vim .
However if you do not want the first two directories, then use the additional flag --cut-dirs=2 as suggested in earlier replies:
wget -r --no-parent --cut-dirs=2 http://example.com/configs/.vim/
And it will download your file tree only into ./.vim/
In fact, I got the first line from this answer precisely from the wget manual, they have a very clean example towards the end of section 4.3.
It sounds like you're trying to get a mirror of your file. While wget has some interesting FTP and SFTP uses, a simple mirror should work. Just a few considerations to make sure you're able to download the file properly.
Respect robots.txt
Ensure that if you have a /robots.txt file in your public_html, www, or configs directory it does not prevent crawling. If it does, you need to instruct wget to ignore it using the following option in your wget command by adding:
wget -e robots=off 'http://your-site.com/configs/.vim/'
Convert remote links to local files.
Additionally, wget must be instructed to convert links into downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is using the mirror command.
Try this:
wget -mpEk 'http://your-site.com/configs/.vim/'
# If robots.txt is present:
wget -mpEk robots=off 'http://your-site.com/configs/.vim/'
# Good practice to only deal with the highest level directory you specify (instead of downloading all of `mysite.com` you're just mirroring from `.vim`
wget -mpEk robots=off --no-parent 'http://your-site.com/configs/.vim/'
Using -m instead of -r is preferred as it doesn't have a maximum recursion depth and it downloads all assets. Mirror is pretty good at determining the full depth of a site, however if you have many external links you could end up downloading more than just your site, which is why we use -p -E -k. All pre-requisite files to make the page, and a preserved directory structure should be the output. -k converts links to local files.
Since you should have a link set up, you should get your config folder with a file /.vim.
Mirror mode also works with a directory structure that's set up as an ftp:// also.
General rule of thumb:
Depending on the side of the site you are doing a mirror of, you're sending many calls to the server. In order to prevent you from being blacklisted or cut off, use the wait option to rate-limit your downloads.
wget -mpEk --no-parent robots=off --random-wait 'http://your-site.com/configs/.vim/'
But if you're simply downloading the ../config/.vim/ file you shouldn't have to worry about it as your ignoring parent directories and downloading a single file.
Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...
wget --recursive (...)
...only retrieves index.html instead of all files.
Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.
Recursive wget ignoring robots (for websites)
wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'
-e robots=off causes it to ignore robots.txt for that domain
-r makes it recursive
-np = no parents, so it doesn't follow links up to the parent folder
You should be able to do it simply by adding a -r
wget -r http://stackoverflow.com/

Resources