wget command to download a file and save as a different filename

I am downloading a file using the wget command. But when it downloads to my local machine, I want it to be saved as a different filename.
For example: I am downloading a file from www.examplesite.com/textfile.txt
I want to use wget to save the file textfile.txt on my local directory as newfile.txt. I am using the wget command as follows:
wget www.examplesite.com/textfile.txt

Use the -O file option.
E.g.
wget google.com
...
16:07:52 (538.47 MB/s) - `index.html' saved [10728]
vs.
wget -O foo.html google.com
...
16:08:00 (1.57 MB/s) - `foo.html' saved [10728]

Also notice the order of parameters on the command line. At least on some systems (e.g. CentOS 6):
wget -O FILE URL
works. But:
wget URL -O FILE
does not work.

You would use the command Mechanical snail listed. Notice the uppercase O. Full command line to use could be:
wget www.examplesite.com/textfile.txt --output-document=newfile.txt
or
wget www.examplesite.com/textfile.txt -O newfile.txt
Hope that helps.
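As a minimal sketch of the advice above, the option order that is reported to work everywhere (options first, URL last) can be wrapped in a small function. The function name save_as and the WGET override variable are hypothetical, introduced here only for illustration; the URL and filename are the ones from the question.

```shell
# Hypothetical helper: download URL ($1) and save it locally as NAME ($2).
# -O (capital letter O) comes before the URL, the order that works
# even on systems where "wget URL -O FILE" reportedly fails.
# WGET is an assumed override so the command can be dry-run with echo.
save_as() {
    src_url=$1
    new_name=$2
    ${WGET:-wget} -O "$new_name" "$src_url"
}

# Usage:
#   save_as www.examplesite.com/textfile.txt newfile.txt
# Dry run (prints the arguments instead of downloading):
#   WGET=echo save_as www.examplesite.com/textfile.txt newfile.txt
```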

wget -O yourfilename.zip remote-storage.url/theirfilename.zip
will do the trick for you.
Note:
a) It's a capital O.
b) Only the order wget -O filename url worked for me; putting -O last did not.

Either curl or wget can be used in this case. All 3 of these commands do the same thing, downloading the file at http://path/to/file.txt and saving it locally into "my_file.txt":
wget http://path/to/file.txt -O my_file.txt # my favorite--it has a progress bar
curl http://path/to/file.txt -o my_file.txt
curl http://path/to/file.txt > my_file.txt
Notice the first one's -O is the capital letter "O".
The nice thing about the wget command is it shows a nice progress bar.
You can prove the files downloaded by each of the 3 techniques above are exactly identical by comparing their sha512 hashes. Running sha512sum my_file.txt after running each of the commands above, and comparing the results, reveals all 3 files to have the exact same sha hashes (sha sums), meaning the files are exactly identical, byte-for-byte.
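The hash comparison described above could be scripted roughly as follows. This is only a sketch: the function name same_sha512 is made up for illustration, and it assumes sha512sum (GNU coreutils) is installed.

```shell
# Hypothetical helper: succeed if two files have the same SHA-512 hash.
# Returns 0 when identical, 1 when different, 2 when a file is unreadable.
same_sha512() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    hash_a=$(sha512sum < "$1" | awk '{ print $1 }')
    hash_b=$(sha512sum < "$2" | awk '{ print $1 }')
    [ "$hash_a" = "$hash_b" ]
}

# Usage, after saving the same URL with two of the commands above
# (the file names here are placeholders):
#   same_sha512 my_file_wget.txt my_file_curl.txt && echo identical
```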
See also: How to capture cURL output to a file?

Using CentOS Linux I found that the easiest syntax would be:
wget "link" -O file.ext
where "link" is the web address you want to save and "file.ext" is the filename and extension of your choice.

Related

How define output folder of curl while running script

I have a command that runs when my script finishes and downloads the files in a list. I use Termux on Android, and it says you can't use cd while running a script.
xargs -n 1 curl -O -C - <url
But it downloads every file into the folder I ran the script from. How can I change the output directory?
PS: Only curl, please. I will ignore answers using aria2c or wget.
Okay, this is the script I use now:
while read -r url
do
curl --create-dirs -o "path/$(basename "$url")" "$url"
done < url
I use the basename of the URL as the file name ("path" stands for the output directory, and the file "url" holds the list).
Please answer if you have better code.
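One possible tidy-up of the loop above, offered as a sketch rather than a definitive answer: compute the destination path in one helper and loop over the list in another. The names dest_for and fetch_list, and the CURL override variable, are hypothetical. Stripping the query string before taking the basename is an extra assumption that the original script does not make.

```shell
# Hypothetical helper: print the local path for a URL inside a target dir.
# Anything after "?" is dropped so the query string is not part of the name.
dest_for() {
    clean_url=${1%%\?*}
    printf '%s/%s\n' "$2" "$(basename "$clean_url")"
}

# Hypothetical helper: read URLs (one per line) from LIST and save each
# one under OUTDIR. --create-dirs makes curl create OUTDIR if needed.
# CURL is an assumed override so the loop can be dry-run with echo.
fetch_list() {
    url_list=$1
    out_dir=$2
    while IFS= read -r line_url; do
        [ -n "$line_url" ] || continue   # skip blank lines
        ${CURL:-curl} --create-dirs -o "$(dest_for "$line_url" "$out_dir")" "$line_url"
    done < "$url_list"
}

# Usage:
#   fetch_list url /sdcard/Download
```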

Bash script to wget url starting with a specific character

I have a URL http://example.com/dir that has many subdirectories with files that I want to save. Because it is very large, I want to split the operation into parts,
e.g. download everything from subdirectories starting with A, like
http://example.com/A
http://example.com/Aa
http://example.com/Ab
etc
I have created the following script
#!/bin/bash
for g in A B C
do wget -e robots=off -r -nc -np -R "index.html*" http://example.com/$g
done
but it tries to download only http://example.com/A and not http://example.com/A*
Look at this page, it has all you need to know:
https://www.gnu.org/software/wget/manual/wget.html
1) You could use:
--spider -nd -r -o outputfile <domain>
which does not download the files, it just checks if they are there.
-nd prevents wget from creating directories locally
-r to parse entire site
-o outputfile to send the output to a file
to get a list of URLs to download.
2) then parse the outputfile to extract the files, and create smaller lists of links you want to download.
3) then use -i file (== --input-file=file) to download each list, thus limiting how many you download in one execution of wget.
Notes:
- --limit-rate=amount can be used to slow down downloads, to spare your Internet link!
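Step 2 above (parsing the output file into smaller lists) might look something like this. It is a sketch under assumptions: the function name urls_with_prefix is hypothetical, and it assumes the wget log contains the URLs as plain http:// or https:// strings separated by spaces, which should be checked against a real log first.

```shell
# Hypothetical helper: print the unique URLs found in a wget log file
# that start with the given prefix. The prefix is matched literally
# (awk index), so dots in the host name are not treated as wildcards.
urls_with_prefix() {
    log_file=$1
    url_prefix=$2
    grep -Eo 'https?://[^ ]+' "$log_file" | sort -u |
        awk -v p="$url_prefix" 'index($0, p) == 1'
}

# Usage, following steps 1) to 3):
#   wget --spider -nd -r -np -e robots=off -o spider.log http://example.com/
#   urls_with_prefix spider.log http://example.com/A > list-A.txt
#   wget -i list-A.txt
```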

How to download a file using curl

I'm on mac OS X and can't figure out how to download a file from a URL via the command line. It's from a static page, so I thought copying the download link and then using curl would do the trick, but it doesn't.
I referenced this StackOverflow question but that didn't work. I also referenced this article which also didn't work.
What I've tried:
curl -o https://github.com/jdfwarrior/Workflows.git
curl: no URL specified!
curl: try 'curl --help' or 'curl --manual' for more information
.
wget -r -np -l 1 -A zip https://github.com/jdfwarrior/Workflows.git
zsh: command not found: wget
How can a file be downloaded through the command line?
The -o/--output option means curl writes output to the file you specify instead of stdout. Your mistake was putting the URL right after -o, so curl took the URL to be the name of the file to write to, and hence reported that no URL was specified. You need a file name after the -o, then the URL:
curl -o ./filename https://github.com/jdfwarrior/Workflows.git
And wget is not available by default on OS X.
curl -OL https://github.com/jdfwarrior/Workflows.git
-O: Writes the output to a local file named like the remote file. For this URL, that file would be Workflows.git.
-L: If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code), this option makes curl redo the request at the new location.
Ref: curl man page
The easiest solution for your question is to keep the original filename. In that case, you just need to use a capital o ("-O") as option (not a zero=0!). So it looks like:
curl -O https://github.com/jdfwarrior/Workflows.git
There are several options to make curl output to a file
# saves it to myfile.txt
curl http://www.example.com/data.txt -o myfile.txt -L
# Each #1 is replaced by the current match of the first glob ({...} or [...]) in the URL, so this saves file_data.txt and file_info.txt
curl "http://www.example.com/{data,info}.txt" -o "file_#1.txt" -L
# saves to data.txt, the filename extracted from the URL
curl http://www.example.com/data.txt -O -L
# saves to filename determined by the Content-Disposition header sent by the server.
curl http://www.example.com/data.txt -O -J -L
# -O Write output to a local file named like the remote file we get
# -o <file> Write output to <file> instead of stdout (variable replacement performed on <file>)
# -J Use the Content-Disposition filename instead of extracting filename from URL
# -L Follow redirects
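To tie the -o and -O variants together, here is a small sketch of a wrapper that picks between them. The function name fetch and the CURL override variable are hypothetical; only the -L, -o and -O flags come from the answers above.

```shell
# Hypothetical helper: fetch URL ($1), following redirects (-L).
# With a second argument the output goes to that file (-o NAME);
# without one, curl keeps the remote file name (-O).
# In both cases the URL comes last, so -o can never swallow it.
fetch() {
    src=$1
    name=${2-}
    if [ -n "$name" ]; then
        ${CURL:-curl} -L -o "$name" "$src"
    else
        ${CURL:-curl} -L -O "$src"
    fi
}

# Usage:
#   fetch http://www.example.com/data.txt myfile.txt   # saved as myfile.txt
#   fetch http://www.example.com/data.txt              # saved as data.txt
```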

Save file to specific folder with curl command

In a shell script, I want to download a file from some URL and save it to a specific folder. What is the specific CLI flag I should use to download files to a specific folder with the curl command, or how else do I get that result?
I don't think you can give a path to curl, but you can CD to the location, download and CD back.
cd target/path && { curl -O URL ; cd -; }
Or using subshell.
(cd target/path && curl -O URL)
Both ways will only download if path exists. -O keeps remote file name. After download it will return to original location.
If you need to set filename explicitly, you can use small -o option:
curl -o target/path/filename URL
The --output-dir option is available since curl 7.73.0:
curl --create-dirs -O --output-dir /tmp/receipes https://example.com/pancakes.jpg
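Since --output-dir depends on the curl version, a script might check the version and fall back to the subshell approach on older installs. The following is a rough sketch: version_ge and download_to_dir are hypothetical names, the version check relies on sort -V (a GNU extension), and it assumes the first line of curl --version has the form "curl X.Y.Z ...".

```shell
# Hypothetical helper: succeed if version $1 is >= version $2.
# Relies on sort -V (version sort), which is a GNU extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: save URL ($1) into directory ($2), keeping the
# remote file name. Uses --output-dir when curl is 7.73.0 or newer,
# otherwise changes directory inside a subshell.
download_to_dir() {
    dl_url=$1
    dl_dir=$2
    mkdir -p "$dl_dir" || return 1
    have=$(curl --version | awk 'NR == 1 { print $2 }')
    if version_ge "$have" 7.73.0; then
        curl -L -O --output-dir "$dl_dir" "$dl_url"
    else
        (cd "$dl_dir" && curl -L -O "$dl_url")
    fi
}

# Usage:
#   download_to_dir https://example.com/pancakes.jpg /tmp/receipes
```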
curl doesn't have an option to do that (without also specifying the filename), but wget does. The directory can be relative or absolute. Also, the directory will automatically be created if it doesn't exist.
wget -P relative/dir "$url"
wget -P /absolute/dir "$url"
This works for me:
curl http://centos.mirror.constant.com/8-stream/isos/aarch64/CentOS-Stream-8-aarch64-20210916-boot.iso --output ~/Downloads/centos.iso
where:
--output lets me set the path, name, and extension of the saved file.
Use redirection:
This works to drop a curl downloaded file into a specified path:
curl https://download.test.com/test.zip > /tmp/test.zip
Obviously "test.zip" is whatever arbitrary name you want to give the redirected file; it could be the same name or a different one.
I actually prefer @oderibas's solution, but this will get you around the issue until your distro supports curl version 7.73.0 or later.
For PowerShell on Windows, you can pass a relative path plus filename to the --output flag:
curl -L http://github.com/GorvGoyl/Notion-Boost-browser-extension/archive/master.zip --output build_firefox/master-repo.zip
Here build_firefox is a relative folder.
Use wget
wget -P /your/absolut/path "https://jdbc.postgresql.org/download/postgresql-42.3.3.jar"
For Windows, in PowerShell, curl is an alias of the cmdlet Invoke-WebRequest and this syntax works:
curl "url" -OutFile file_name.ext
For instance:
curl "https://airflow.apache.org/docs/apache-airflow/2.2.5/docker-compose.yaml" -OutFile docker-compose.yaml
Source: https://krypted.com/windows-server/its-not-wget-or-curl-its-iwr-in-windows/
Here is an example using Batch to create a safe filename from a URL and save it to a folder named tmp/. I do think it's strange that this isn't an option on the Windows or Linux Curl versions.
@echo off
set url=%1%
for /r %%f in (%url%) do (
set url=%%~nxf.txt
curl --create-dirs -L -v -o tmp/%%~nxf.txt %url%
)
The above Batch file will take a single input, a URL, and create a filename from the url. If no filename is specified it will be saved as tmp/.txt. So it's not all done for you but it gets the job done in Windows.

Using wget to recursively fetch a directory with arbitrary files in it

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:
http://mysite.com/configs/.vim/
.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?
You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:
wget --recursive --no-parent http://example.com/configs/.vim/
To avoid downloading the auto-generated index.html files, use the -R/--reject option:
wget -r -np -R "index.html*" http://example.com/configs/.vim/
To download a directory recursively, rejecting index.html* files and saving without the hostname, the parent directories, or the rest of the directory structure:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:
wget -e robots=off http://www.example.com/
http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
You should use the -m (mirror) flag, as that takes care to not mess with timestamps and to recurse indefinitely.
wget -m http://example.com/configs/.vim/
If you add the points mentioned by others in this thread, it would be:
wget -m -e robots=off --no-parent http://example.com/configs/.vim/
Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):
wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/
If --no-parent does not help, you might use the --include option.
Directory structure:
http://<host>/downloads/good
http://<host>/downloads/bad
And you want to download downloads/good but not downloads/bad directory:
wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good
wget -r http://mysite.com/configs/.vim/
works for me.
Perhaps you have a .wgetrc which is interfering with it?
First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:
wget --recursive ${comment# self-explanatory} \
--no-parent ${comment# will not crawl links in folders above the base of the URL} \
--convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
--random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
--no-host-directories ${comment# do not create folders with the domain name} \
--execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
--level=inf --accept '*' ${comment# do not limit to 5 levels or common file formats} \
--reject="index.html*" ${comment# use this option if you need an exact mirror} \
--cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
$URL
Afterwards, stripping the query params from URLs like main.css?crc=12324567 and running a local server (e.g. via python3 -m http.server in the dir you just wget'ed) to run JS may be necessary. Please note that the --convert-links option kicks in only after the full crawl was completed.
Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.
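The ${comment# ...} trick in the script above works because the variable comment is unset: removing a prefix from an empty value leaves nothing, and an unquoted empty expansion contributes no argument at all. A tiny demonstration, where count_args is a hypothetical helper that only reports how many arguments it received:

```shell
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

# The trick requires that "comment" is unset (or empty) and that the
# script is not running under "set -u".
unset comment

# Only the two real options are passed; both inline comments vanish.
count_args --recursive ${comment# follow links} \
           --no-parent ${comment# stay below the start URL}
# prints 2
```

Note that the comment text is parsed as a pattern by the shell, so it is safest to keep quotes and closing braces out of it.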
To fetch a directory recursively with username and password, use the following command:
wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/
This version downloads recursively and doesn't create parent directories.
wgetod() {
NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}
Usage:
Add to ~/.bashrc or paste into terminal
wgetod "http://example.com/x/"
The following option seems to be the perfect combination when dealing with recursive download:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
All you need is two flags: "-r" for recursion and "--no-parent" (or -np) so that wget does not ascend into the parent directory. Like this:
wget -r --no-parent http://example.com/configs/.vim/
That's it. It will download into the following local tree: ./example.com/configs/.vim .
However, if you do not want the first two directories (example.com and configs), add -nH to drop the host directory and --cut-dirs=1 to drop configs, along the lines suggested in earlier replies:
wget -r --no-parent -nH --cut-dirs=1 http://example.com/configs/.vim/
And it will download your file tree only into ./.vim/
In fact, I got the first line from this answer precisely from the wget manual, they have a very clean example towards the end of section 4.3.
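How -nH and --cut-dirs combine can be easier to see with a small model. The function below is a simplified sketch, not wget's actual implementation: local_dir is a hypothetical name, and it only predicts the local directory for a URL that points at a directory, following the examples in the wget manual's directory options section.

```shell
# Hypothetical model: print the local directory wget -r would use for
# a directory URL ($1), given -nH on/off ($2 = 1 or 0) and the
# --cut-dirs count ($3).
local_dir() {
    rest=${1#*://}          # drop the scheme
    no_host=$2
    cut=$3
    host=${rest%%/*}        # host name part
    case $rest in
        */*) path=${rest#*/} ;;
        *)   path= ;;
    esac
    path=${path%/}          # drop one trailing slash
    # --cut-dirs removes leading path components, never the host.
    while [ "$cut" -gt 0 ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   path= ;;
        esac
        cut=$((cut - 1))
    done
    if [ "$no_host" -eq 1 ]; then
        printf './%s\n' "$path"
    else
        printf './%s\n' "$host${path:+/$path}"
    fi
}

# Usage:
#   local_dir http://example.com/configs/.vim/ 0 0   # no extra options
#   local_dir http://example.com/configs/.vim/ 1 1   # -nH --cut-dirs=1
```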
It sounds like you're trying to get a mirror of your file. While wget has some interesting FTP and SFTP uses, a simple mirror should work. Just a few considerations to make sure you're able to download the file properly.
Respect robots.txt
Ensure that if you have a /robots.txt file in your public_html, www, or configs directory, it does not prevent crawling. If it does, instruct wget to ignore it by adding the -e robots=off option to your wget command:
wget -e robots=off 'http://your-site.com/configs/.vim/'
Convert remote links to local files.
Additionally, wget must be instructed to convert links into downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is using the mirror command.
Try this:
wget -mpEk 'http://your-site.com/configs/.vim/'
# If robots.txt is present:
wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'
# Good practice to only deal with the highest level directory you specify (instead of downloading all of `mysite.com` you're just mirroring from `.vim`)
wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'
Using -m instead of -r is preferred as it doesn't have a maximum recursion depth and it downloads all assets. Mirror is pretty good at determining the full depth of a site, however if you have many external links you could end up downloading more than just your site, which is why we use -p -E -k. All pre-requisite files to make the page, and a preserved directory structure should be the output. -k converts links to local files.
Since you should have a link set up, you should get your config folder with a file /.vim.
Mirror mode also works with a directory structure that's set up as an ftp:// also.
General rule of thumb:
Depending on the size of the site you are mirroring, you're sending many calls to the server. To avoid being blacklisted or cut off, use the wait option to rate-limit your downloads.
wget -mpEk --no-parent -e robots=off --random-wait 'http://your-site.com/configs/.vim/'
But if you're simply downloading the ../config/.vim/ file you shouldn't have to worry about it, as you're ignoring parent directories and downloading a single file.
Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...
wget --recursive (...)
...only retrieves index.html instead of all files.
Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.
Recursive wget ignoring robots (for websites)
wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'
-e robots=off causes it to ignore robots.txt for that domain
-r makes it recursive
-np = no parents, so it doesn't follow links up to the parent folder
You should be able to do it simply by adding a -r
wget -r http://stackoverflow.com/
