I'm trying to crawl a local site with wget -r but I'm unsuccessful: it just downloads the first page and doesn't go any deeper. By the way, I'm so unsuccessful that for whatever site I'm trying it doesn't work... :)
I've tried various options but nothing better happens. Here's the command I thought would do it:
wget -r -e robots=off --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/537.4" --follow-tags=a,ref --debug http://rocky:8081/obix
Really, I've no clue. Whatever site or documentation I read about wget tells me that it should simply work with wget -r so I'm starting to think my wget is buggy (I'm on Fedora 16).
Any idea?
EDIT: Here's the output I'm getting for wget -r --follow-tags=ref,a http://rocky:8081/obix/ :
wget -r --follow-tags=ref,a http://rocky:8081/obix/
--2012-10-19 09:29:51-- http://rocky:8081/obix/
Resolving rocky... 127.0.0.1
Connecting to rocky|127.0.0.1|:8081... connected.
HTTP request sent, awaiting response... 200 OK
Length: 792 [text/xml]
Saving to: “rocky:8081/obix/index.html”
100%[==============================================================================>] 792 --.-K/s in 0s
2012-10-19 09:29:51 (86,0 MB/s) - “rocky:8081/obix/index.html” saved [792/792]
FINISHED --2012-10-19 09:29:51--
Downloaded: 1 files, 792 in 0s (86,0 MB/s)
Usually there's no need to give the user-agent.
It should be sufficient to give:
wget -r http://stackoverflow.com/questions/12955253/recursive-wget-wont-work
To see why wget doesn't do what you want, look at the output it gives you and post it here.
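One detail worth checking in that output is the content type wget reports. Recursive mode extracts links from HTML (and CSS) documents, and the output posted in the question shows text/xml, so that may be why nothing past the first page is fetched. As a rough sketch, the helper below (content_type_from_log is a made-up name, not part of wget) pulls the type out of a wget log read on stdin, assuming the log is in English:

```shell
# Hypothetical helper: print the MIME type(s) wget reported.
# Reads a wget log on stdin and looks for lines such as
#   Length: 792 [text/xml]
content_type_from_log() {
    sed -n 's/.*Length: .*\[\([^]]*\)].*/\1/p'
}
```

It could be used as: wget -r http://rocky:8081/obix/ 2>&1 | content_type_from_log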
While writing a batch file to update NirSoft tools, I had a strange experience with wget.
First I downloaded a text file with pad links:
wget http://www.nirsoft.net/pad/pad-links.txt --backups=20 --append-output=C:\Path\Update\LOG\Nirsoft\%Timestamp%_NirSoft.log
After that, I used fart-js to delete the rows I did not need from pad-links.txt. I also used it to change the download links to https://www.nirsoft.net/utils and the file extensions to .zip.
fart ".\pad-links.txt" "http://www.nirsoft.net/pad" "http://www.nirsoft.net/utils" | tee --append C:\Path\Update\LOG\Nirsoft\%Timestamp%_NirSoft.log
and
fart ".\pad-links.txt" ".xml" ".zip" | tee --append C:\Path\Update\LOG\Nirsoft\%Timestamp%_NirSoft.log
Then, to download the programs, I used:
wget --timestamping --input-file=C:\Path\UtilSuit\NirLauncher\Download\pad-links.txt --append-output=C:\Path\Update\LOG\Nirsoft\%Timestamp%_NirSoft.log
Looking at the log file, I found that not all programs are stored in this location. For example, WirelessKeyView is stored at https://www.nirsoft.net/toolsdownload/wirelesskeyview.zip.
Fetching this file with wget produces a corrupt file of about 4 KB. The same happens with cURL and aria2. When I download it with Mozilla or IDM, I get the file without problems. So I tried wget --auth-no-challenge and wget --header="Accept: text/html" --user-agent="Mozilla/5.0 …"
I also tried cliget and the wget/aria2/curl lines it produced during a normal download with Mozilla:
wget --header 'Host: www.nirsoft.net' --user-agent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: de,en-US;q=0.7,en;q=0.3' --referer 'https://www.nirsoft.net/utils/wirelesskeyview.html' --header 'Upgrade-Insecure-Requests: 1' --header 'Sec-Fetch-Dest: document' --header 'Sec-Fetch-Mode: navigate' --header 'Sec-Fetch-Site: same-origin' --header 'Sec-Fetch-User: ?1' --header 'DNT: 1' --header 'Sec-GPC: 1' 'https://www.nirsoft.net/toolsdownload/wirelesskeyview.zip' --output-document 'wirelesskeyview.zip'
I googled and found this reference for PowerShell (same error), but I cannot reproduce the working answer in batch, as I am not familiar with PowerShell scripting.
So how is it possible to download the single wirelesskeyview.zip file with wget, curl or aria2 in a batch script?
A workaround I found is to download it directly from the pad Panel, but I want the .zip file, including the updated .chm file, and also the 64-bit versions, if available.
One more note, within my anti-virus tool the nirsoft site is exempted from scanning, so that is not the answer.
Any solutions?
Aah, this one is simple. If you look at the actual page downloaded, it's called "403.html". So, let's open it. The first thing that strikes you is this:
<title>Error 403: Missing HTTP referer in the HTTP request</title>
So, the server wants a Referer header. Sure, let's give it one:
$ wget --referer foo <URL>
And it downloads the zip file correctly as expected.
Now, really, the server should not be returning an HTTP 200 response with a file called 403. It really should have sent back an HTTP 403 response. But what can you do? There are broken servers everywhere.
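Since the server answers 200 even for the error page, the exit status of wget alone won't tell you whether you got the archive. Here is a minimal sketch of a check, assuming a POSIX shell rather than a batch file (only the wget line itself carries over to batch). The function names looks_like_zip and fetch_zip are invented for illustration; the check relies on ZIP archives starting with the bytes "PK".

```shell
# Succeed if the file ($1) starts with the ZIP signature bytes "PK".
looks_like_zip() {
    [ "$(head -c 2 "$1")" = "PK" ]
}

# Hypothetical wrapper: fetch URL ($1) into file ($2), sending referer ($3),
# and fail if what arrived is not a ZIP archive (e.g. an HTML error page).
fetch_zip() {
    wget --quiet --referer="$3" --output-document="$2" "$1" && looks_like_zip "$2"
}
```

For example: fetch_zip https://www.nirsoft.net/toolsdownload/wirelesskeyview.zip wirelesskeyview.zip https://www.nirsoft.net/utils/wirelesskeyview.html || echo "did not get a zip"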
I have the following command:
sudo wget --output-document=/dev/null http://speedtest.pixelwolf.ch
which outputs:
--2016-03-27 17:15:47-- http://speedtest.pixelwolf.ch/
Resolving speedtest.pixelwolf.ch (speedtest.pixelwolf.ch)... 178.63.18.88, 2a02:418:3102::6
Connecting to speedtest.pixelwolf.ch (speedtest.pixelwolf.ch)|178.63.18.88|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85 [text/html]
Saving to: `/dev/null`
100%[======================>] 85 --.-K/s in 0s
2016-03-27 17:15:47 (8.79 MB/s) - `/dev/null` saved [85/85]
I'd like to be able to parse the (8.79 MB/s) from the last line and store it in a file (or get it into a local PHP file some other easy way). I tried to store the full output by changing my command to --output-document=/dev/speedtest; however, this just saved "Could not reach website" in the file, not the terminal output of the command.
Not quite sure where to start with this, so any help would be awesome.
Not sure if it helps, but my intention is for this stored value (8.79 in this instance) to be read by a PHP file and handled there every 30 seconds, which I'll achieve with: while true; do (run speed test and save speed variable to a file cmd); php handleSpeedTest.php; sleep 5; done, where handleSpeedTest.php reads the stored value and handles it accordingly.
I changed the URL to one that works. wget writes its log to stderr, so I redirected stderr onto stdout, then used grep --only-matching (-o) with a regex to pick out the speed:
sudo wget -O /dev/null http://www.google.com 2>&1 | grep -o '\([0-9.]\+ [KM]B/s\)'
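To get that value into a file for the PHP script, the grep can be wrapped in a small function. This is only a sketch: extract_speed and speed.txt are names I made up, and it assumes the summary line has the "(8.79 MB/s)" shape shown in the question.

```shell
# Hypothetical helper: read wget's log on stdin and print the speed from the
# last "(<number> <unit>B/s)" summary, without the parentheses.
extract_speed() {
    grep -Eo '\([0-9.,]+ [KMG]?B/s\)' | tail -n 1 | tr -d '()'
}
```

Used like this, it leaves a line such as "8.79 MB/s" in speed.txt, which the PHP side can read with file_get_contents(): wget -O /dev/null http://speedtest.pixelwolf.ch 2>&1 | extract_speed > speed.txt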
curl -v -r 0-500 http://somefile -o localfile
It should download just the first 501 bytes, no? Instead, it just downloads the entire thing. All 67 megabytes. Thanks curl! Could my company's proxy servers be blocking this feature somehow? I am skeptical about that, since the downloads themselves do work, just not the range feature. Am I missing something?
As a client you could always abort the download when you have received what you want.
By using head, you can limit the download to the first 501 bytes (the range 0-500 is inclusive), even if the server does not accept the Range header:
curl -v -r 0-500 http://somefile | head -c 501 > localfile
It should download just the first 501 bytes, no?
It depends on the server. From man curl:
You should also be aware that many HTTP/1.1 servers do not have this feature enabled, so that when you attempt to get a range, you'll instead get the whole document.
As you can see in the response from the server, it's using HTTP/1.1. So it's not surprising that the range feature is not supported at the server side.
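One way to tell which case you are in is the status line: a server that honours the range answers 206 Partial Content, while one that ignores it answers 200 and sends the whole body. A small sketch of that check follows; range_honoured is a hypothetical name, and it assumes the server answers a HEAD request the same way it would answer a GET.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed only
# if the last status line seen is 206 (Partial Content).
range_honoured() {
    awk '/^HTTP\// { code = $2 } END { exit (code + 0 == 206 ? 0 : 1) }'
}
```

For example: curl -sI -r 0-500 http://somefile | range_honoured && echo "range supported"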
Please use the following command
curl -H "range: bytes=354-500" -O http://example.com/file.extension
How can I download OracleXE using wget and avoid the login?
I tried applying logic from this question for Oracle Java but I couldn't get it to work.
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
I get:
--2015-10-13 04:51:03-- http://download.oracle.com/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
Resolving download.oracle.com (download.oracle.com)... 206.248.168.160, 206.248.168.139, 206.248.168.160, ...
Connecting to download.oracle.com (download.oracle.com)|206.248.168.160|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://edelivery.oracle.com/akam/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip [following]
--2015-10-13 04:51:03-- https://edelivery.oracle.com/akam/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
Resolving edelivery.oracle.com (edelivery.oracle.com)... 23.9.117.183, 23.9.117.183
Connecting to edelivery.oracle.com (edelivery.oracle.com)|23.9.117.183|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://login.oracle.com/pls/orasso/orasso.wwsso_app_admin.ls_login?Site2pstoreToken=v1.2~CA55CD32~7E777A421E00059BE8321AEAF3C29C59D68A2F46E15A49137CE5AAF6D6B46A0C599A4560AD622CF26FFFCF23FF8FC274F021B7E57B08CEF2076FADB1A57BBFB02B991E320BB3A417DDF966B4406E225736912745DE8F5E660631675765D519A5E7FF61481F567ED9C582AEAAEEC6E2A6C59D046AD82EA1C7AA08E9A1EDAFC44D97F22C470FE530A0F58872A00CAFD27012DF4851AD4964085264393C7220CF07817E14ED0B2130ECF06758DB538644A119246C4B65963CD1C825650BE3B3C86C1670EC8F754E943853BE4C58F0A4FD89B1CE14E7110087134765A9EBAA170769C75645798E1D978B944D2D896A564E49CD42478328D8661794E3DC377DBEF9F7C27184E0DFF7EAAB [following]
--2015-10-13 04:51:03-- https://login.oracle.com/pls/orasso/orasso.wwsso_app_admin.ls_login?Site2pstoreToken=v1.2~CA55CD32~7E777A421E00059BE8321AEAF3C29C59D68A2F46E15A49137CE5AAF6D6B46A0C599A4560AD622CF26FFFCF23FF8FC274F021B7E57B08CEF2076FADB1A57BBFB02B991E320BB3A417DDF966B4406E225736912745DE8F5E660631675765D519A5E7FF61481F567ED9C582AEAAEEC6E2A6C59D046AD82EA1C7AA08E9A1EDAFC44D97F22C470FE530A0F58872A00CAFD27012DF4851AD4964085264393C7220CF07817E14ED0B2130ECF06758DB538644A119246C4B65963CD1C825650BE3B3C86C1670EC8F754E943853BE4C58F0A4FD89B1CE14E7110087134765A9EBAA170769C75645798E1D978B944D2D896A564E49CD42478328D8661794E3DC377DBEF9F7C27184E0DFF7EAAB
Resolving login.oracle.com (login.oracle.com)... 209.17.4.8, 209.17.4.8
Connecting to login.oracle.com (login.oracle.com)|209.17.4.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2051 (2.0K) [text/html]
Saving to: ‘oracle-xe-11.2.0-1.0.x86_64.rpm.zip’
100%[======================================================================================================================================================>] 2,051 --.-K/s in 0s
2015-10-13 04:51:03 (142 MB/s) - ‘oracle-xe-11.2.0-1.0.x86_64.rpm.zip’ saved [2051/2051]
To download the Oracle Linux zips from the URL directly to your server, you have to:
1 - Log in to Oracle.com with your credentials (https://login.oracle.com/mysso/signon.jsp)
2 - Export cookies.txt from the browser
3 - Copy this file to your server
scp cookies.txt root@url:/path/
4 - Go to the directory containing cookies.txt, copy the install link, and paste it into your server terminal like this:
wget --load-cookies=cookies.txt http://download.oracle.com/otn/linux/oracle12c/121020/linuxamd64_12102_database_1of2.zip
wget --load-cookies=cookies.txt http://download.oracle.com/otn/linux/oracle12c/121020/linuxamd64_12102_database_2of2.zip
Check the file size with ls -lah
Note that wget --header "Cookie: oraclelicense=accept-securebackup-cookie" breaks all other cookies, including authorization ones.
Instead, you can use a custom cookies.txt file and --user/--password (tested on Oracle Archive and OracleXE):
echo .oracle.com TRUE / FALSE 0 oraclelicense accept-securebackup-cookie >cookies.txt
wget -c --load-cookies cookies.txt --trust-server-names --user=SSO_USERNAME --password=SSO_PASSWORD URL
UPD: Attention! cookies.txt is tab-separated! To be sure of tabs, quote the string so that echo -e actually sees the \t escapes: echo -e '.oracle.com\tTRUE\t/\tFALSE\t0\toraclelicense\taccept-securebackup-cookie' >cookies.txt
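If you'd rather not depend on how your shell's echo treats -e, printf gives the same tab-separated line. A minimal sketch; the function name oracle_license_cookie is made up, and the field values are the ones from the echo above:

```shell
# Write one Netscape-format cookie line (7 tab-separated fields) to stdout.
# Field order: domain, include-subdomains flag, path, secure flag,
# expiry (0), cookie name, cookie value.
oracle_license_cookie() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        .oracle.com TRUE / FALSE 0 oraclelicense accept-securebackup-cookie
}
```

Then: oracle_license_cookie > cookies.txt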
I am using the following command in my bash script to trigger a Jenkins build:
wget --no-check-certificate "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx"
Output:
HTTP request sent, awaiting response... 201 Created
Length: 0
Saving to: “buildWithParameters?token=xxx”
[ <=> ] 0 --.-K/s in 0s
2015-02-20 10:10:46 (0.00 B/s) - “buildWithParameters?token=xxx” saved [0/0]
And then it creates an empty file: “buildWithParameters?token=xxx”
My question is: why does wget create this file, and how do I turn that off?
Most simply:
wget --no-check-certificate -O /dev/null http://foo
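This makes wget save the response to /dev/null, effectively discarding it. wget always writes the response body to a file, named after the last component of the URL by default, which is why an empty file appears even for a zero-length 201 response.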
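In a script you probably also want to know whether the trigger worked. A rough sketch, with trigger_build as a made-up name: it discards the body as above and passes wget's exit status through, which is non-zero when the server answers with an error.

```shell
# Hypothetical helper: request the URL ($1), discard the response body,
# and return wget's exit status (0 on success).
trigger_build() {
    wget --quiet --no-check-certificate --output-document=/dev/null "$1"
}
```

For example: trigger_build "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx" || echo "trigger failed"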