I am trying to run a command to download 3000 files in parallel. I am using Cygwin + Windows.
Downloading a single file via wget in the terminal:
wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.webpage.com/folder/document/download/1?type=file
allows me to download the file with ID 1 on its own, in the correct format (as long as --content-disposition is in the command).
I iterate over this REST API call to download the entire folder (3000 files). This works OK, but is quite slow.
FOR /L %i in (0,1,3000) do wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.webpage.com/folder/document/download/%i?type=file
Now I am trying to run the program in Cygwin, in parallel.
seq 3000 | parallel -j 200 wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.webpage.com/folder/document/download/{}?type=file
It runs, but the file name and format are lost (instead of "index.html", for example, we may get "4@type=file" as the file name).
Is there a way for me to fix this?
It is unclear what you would like them named. Let us assume you want them named: index.[1-3000].html
seq 3000 | parallel -j 200 wget -O index.{}.html --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.webpage.com/folder/document/download/{}?type=file
My guess is that it is caused by --content-disposition being experimental, and the wget used by Cygwin may be older than the wget used by the FOR loop. To check that, run:
wget --version
in Cygwin and outside Cygwin (i.e. where you would run the FOR loop).
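If the versions match and the odd file names persist, it is also worth quoting the URL so that the ?type=file part reaches wget untouched; a sketch using the same placeholder credentials and host as in the question:
seq 3000 | parallel -j 200 wget -O index.{}.html --no-check-certificate --content-disposition --load-cookies cookies.txt -p 'https://username:password@website.webpage.com/folder/document/download/{}?type=file'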
Related
Is there any way to make wget output everything in a single flat folder? Right now I'm running
wget --quiet -E -H -k -K -p -e robots=off #{url}
but I'm getting everything nested the same way as it is on the site. Is there any option to flatten the resulting folder structure (and also the source links in the index.html file)?
After reading the documentation and some examples I found that I was missing the -nd flag, which makes wget fetch just the files without recreating the directories.
Correct call:
wget --quiet -E -H -k -nd -K -p -e robots=off #{url}
I create a non-interactive cookie with the first command so that I can proceed passwordless afterwards to get all pages with the second wget command.
wget --no-check-certificate --save-cookies cookies.txt --keep-session-cookies --post-data 'user=me&password=mepassword' --delete-after https://site.address.com/php/index.php?site=abc&var=tor&dir=200219&date=2019-02-20
wget --no-check-certificate --load-cookies cookies.txt -r --spider https://site.address.com/php/index.php?site=abc&var=tor&dir=200219&date=2019-02-20
However, only directories are returned.
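One thing worth checking, as an aside (an assumption, since the post does not show how the commands are quoted): in most shells an unquoted & ends the command and runs it in the background, so everything from &var=tor onward never reaches wget. Quoting the URLs keeps the full query string intact:
wget --no-check-certificate --save-cookies cookies.txt --keep-session-cookies --post-data 'user=me&password=mepassword' --delete-after 'https://site.address.com/php/index.php?site=abc&var=tor&dir=200219&date=2019-02-20'
wget --no-check-certificate --load-cookies cookies.txt -r --spider 'https://site.address.com/php/index.php?site=abc&var=tor&dir=200219&date=2019-02-20'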
I am trying to download the files for a project using wget, as the SVN server for that project isn't running anymore and I am only able to access the files through a browser. The base URL for all the files is the same, like
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/*
How can I use wget (or any other similar tool) to download all the files in this repository, where the "tzivi" folder is the root folder and there are several files and sub-folders (up to 2 or 3 levels) under it?
You may use this in a shell:
wget -r --no-parent http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The Parameters are:
-r recursive download
--no-parent don't download anything from the parent directory
If you don't want to download the entire content, you may use:
-l1 download just the directory itself (tzivi in your case)
-l2 download the directory and all level-1 subfolders ('tzivi/something' but not 'tzivi/something/foo')
And so on. If you pass no -l option, wget will use -l 5 automatically.
With -l 0 you'll download the whole Internet, because wget will follow every link it finds.
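For example, a depth-limited sketch of the same call (the depth of 2 here is only an illustration):
wget -r --no-parent -l 2 http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/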
You can use this in a shell:
wget -r -nH --cut-dirs=7 --reject="index.html*" \
http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/
The Parameters are:
-r recursively download
-nH (--no-host-directories) cuts out hostname
--cut-dirs=X (cuts out X directories)
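The 7 simply matches the number of path components in the URL (projects/tzivi/repository/revisions/2/raw/tzivi). As a sketch, cutting one directory fewer keeps the final tzivi folder as the local root instead of saving files straight into the current directory:
wget -r -nH --cut-dirs=6 --reject="index.html*" \
     http://abc.tamu.edu/projects/tzivi/repository/revisions/2/raw/tzivi/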
This link just gave me the best answer:
$ wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off -U mozilla http://base.site/dir/
Worked like a charm.
wget -r --no-parent URL --user=username --password=password
The last two options are only needed if the download requires a username and password; otherwise leave them out.
You can see more options at https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
Use the command:
wget -m www.ilanni.com/nexus/content/
You can also use this command:
wget --mirror -pc --convert-links -P ./your-local-dir/ http://www.your-website.com
so that you get an exact mirror of the website you want to download.
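For reference, --mirror is shorthand for -r -N -l inf --no-remove-listing, so the command above is roughly equivalent to this longer form (same placeholder directory and site):
wget -r -N -l inf --no-remove-listing -p -c --convert-links -P ./your-local-dir/ http://www.your-website.com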
Try this working code (30-08-2021):
wget --no-clobber --convert-links --random-wait -r -p --level 1 -E -e robots=off --adjust-extension -U mozilla "your web directory, in quotation marks"
I can't get this to work.
Whatever I try, I just get some http file.
Just look at these commands, all for simply downloading a directory? There must be a better way.
wget seems the wrong tool for this task, unless it is a complete failure.
This works:
wget -m -np -c --no-check-certificate -R "index.html*" "https://the-eye.eu/public/AudioBooks/Edgar%20Allan%20Poe%20-%2"
This will help:
wget -m -np -c --level 0 --no-check-certificate -R "index.html*" http://www.your-websitepage.com/dir
I have three thousand files on a server. I can retrieve one at a time via a REST API call. I have written a command to retrieve these files. It works perfectly, except that my login times out after roughly 200 downloads.
I would like to download all of these files in parallel rather than serially. Ideally, I would like to retrieve files 1-200 as one batch, 201-400 as another, 401-600 as another, and so on.
So my attempt:
FOR /L %i in (0,1,200) do wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/%i
How can I convert this into the parallel call I want to create?
Thanks.
With Cygwin and GNU Parallel installed you can download the 3000 files with 200 parallel downloads running constantly using:
seq 3000 | parallel -j 200 wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/{}
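Because the original problem was a login that times out, it can also help to log every job and retry failures automatically; a sketch using GNU Parallel's --joblog and --retries options with the same cookies file and placeholder URL:
seq 3000 | parallel -j 200 --joblog download.log --retries 3 wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p 'https://username:password@website.APICall.com/download/{}'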
Don't go through the hassle of Cygwin; trying to turn Windows into UNIX compounds problems and adds layers of dependencies. Use PowerShell.
If you can get 200 files downloaded before timing out, break it up into three jobs:
invoke-command -asjob -scriptblock {$files = @(1..200);$files | foreach-object{ & wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/$_}};
invoke-command -asjob -scriptblock {$files = @(201..400);$files | foreach-object{ & wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/$_}};
invoke-command -asjob -scriptblock {$files = @(401..600);$files | foreach-object{ & wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/$_}};
Or get Invoke-Parallel and use it like this:
$filenames = @(1..600);
invoke-parallel -InputObject $filenames -throttle 200 -runspaceTimeout 30 -ScriptBlock { & wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p https://username:password@website.APICall.com/download/$_}
Another (and probably the best) option would be to use Invoke-WebRequest, but I don't know whether it will work with your cookie requirement here.
Disclaimer: working from memory as I don't have Windows or your URL accessible at the moment.
An alternative to the GNU parallel method is the good ol' xargs with the -P option:
$ seq 3000 | xargs -I '{}' -P 200 wget '<url_start>{}<url_end>'
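Here <url_start> and <url_end> stand for whatever surrounds the numeric ID; filled in with the pattern from the question (host and credentials are still placeholders), it would look roughly like:
seq 3000 | xargs -I '{}' -P 200 wget --no-check-certificate --content-disposition --load-cookies cookies.txt 'https://username:password@website.APICall.com/download/{}'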
I doubt your command works, because the iterator variable needs a double percent as far as I know, i.e. %i needs to be %%i.
Concerning the parallelization, you can try this:
FOR /L %%i IN (0,1,200) DO (
start wget --no-check-certificate --content-disposition --load-cookies cookies.txt -p "https://username:password@website.APICall.com/download/%%i"
)
It will, for your first 200 downloads, spawn a separate process (and shell window!) for every download. Doing so will put a lot of load on the server and I'm not sure this is really the way to go forward. But it does what you've asked for.
Edit: the note above applies when the command is used in a .bat file; if you're executing it directly on the command line, a single percent sign is sufficient.
I am using the following command to download a single webpage with all its images and JS using wget on Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It downloads the HTML as required, but when I tried to pass a text file containing a list of 3 URLs to download, it didn't give any output. Below is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this also:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
This text file already had http:// prepended to the URLs.
list.txt contains a list of 3 URLs which I need to download using a single command. Please help me resolve this issue.
From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or using the links from comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected.
First create a text file with the URLs that you need to download.
eg: download.txt
download.txt will look like this:
http://www.google.com
http://www.yahoo.com
Then use the command wget -i download.txt to download the files. You can add many URLs to the text file.
If you have a list of URLs separated on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
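Another way to feed wget from standard input without creating a file, in the same spirit (a sketch; the example.com URLs are the placeholders from above):
printf '%s\n' http://example.com/a http://example.com/b http://example.com/c | wget -i -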
Pedantic version:
for x in {'url1','url2'}; do wget "$x"; done
The advantage is that you can treat it as a single wget command for multiple URLs.