How to stream into wget? - bash

tac FILE | sed -n -e 's/^.*URL: //p' | SEND TO WGET HERE
The one-liner above gives a list of URLs from a file, one per line. I am trying to stream/pipe these into wget directly. Each URL points to a thumbnail picture, and I need to download them in bulk; I'm trying to write this one-liner to automate that.

The one-liner above gives a list of URLs from a file, one per line. I
am trying to (...) pipe these into wget directly.
To do so you can use the -i file option: if you give - as the file, wget reads the URLs from standard input. From the wget man page:
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (...) If this function is used, no URLs need be present on the command line. (...)
So in your case:
command | wget -i -
where command is any command whose output is one URL per line.
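Applied to the pipeline from the question, that gives (a minimal sketch; FILE and the "URL: " prefix come from the question, while -P thumbs is an assumed output directory you can drop or rename):
# Extract one URL per line and feed the list to wget on standard input
tac FILE | sed -n -e 's/^.*URL: //p' | wget -P thumbs -i -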

Use xargs to set the argument of a command from standard input:
tac FILE | sed -n -e 's/^.*URL: //p' | xargs wget
Here each whitespace-separated word that xargs reads from standard input is passed as a positional argument to wget.
Demo:
$ cat FILE
URL: https://google.com https://netflix.com
asdfdas URL: https://stackoverflow.com
$ tac FILE | sed -n -e 's/^.*URL: //p' | xargs wget
--2021-12-30 12:53:17-- https://stackoverflow.com/
Resolving stackoverflow.com (stackoverflow.com)... 151.101.65.69, 151.101.193.69, 151.101.129.69, ...
Connecting to stackoverflow.com (stackoverflow.com)|151.101.65.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.7’
index.html.7 [ <=> ] 175,76K 427KB/s in 0,4s
2021-12-30 12:53:18 (427 KB/s) - ‘index.html.7’ saved [179983]
--2021-12-30 12:53:18-- https://google.com/
Resolving google.com (google.com)... 142.250.184.142, 2a00:1450:4017:80c::200e
Connecting to google.com (google.com)|142.250.184.142|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.google.com/ [following]
--2021-12-30 12:53:18-- https://www.google.com/
Resolving www.google.com (www.google.com)... 142.250.187.100, 2a00:1450:4017:807::2004
Connecting to www.google.com (www.google.com)|142.250.187.100|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://consent.google.com/ml?continue=https://www.google.com/&gl=GR&m=0&pc=shp&hl=el&src=1 [following]
--2021-12-30 12:53:19-- https://consent.google.com/ml?continue=https://www.google.com/&gl=GR&m=0&pc=shp&hl=el&src=1
Resolving consent.google.com (consent.google.com)... 216.58.206.206, 2a00:1450:4017:80c::200e
Connecting to consent.google.com (consent.google.com)|216.58.206.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.8’
index.html.8 [ <=> ] 12,16K --.-KB/s in 0,01s
2021-12-30 12:53:19 (1,25 MB/s) - ‘index.html.8’ saved [12450]
--2021-12-30 12:53:19-- https://netflix.com/
Resolving netflix.com (netflix.com)... 54.155.246.232, 18.200.8.190, 54.73.148.110, ...
Connecting to netflix.com (netflix.com)|54.155.246.232|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.netflix.com/ [following]
--2021-12-30 12:53:19-- https://www.netflix.com/
Resolving www.netflix.com (www.netflix.com)... 54.155.178.5, 3.251.50.149, 54.74.73.31, ...
Connecting to www.netflix.com (www.netflix.com)|54.155.178.5|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.netflix.com/gr-en/ [following]
--2021-12-30 12:53:20-- https://www.netflix.com/gr-en/
Reusing existing connection to www.netflix.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.9’
index.html.9 [ <=> ] 424,83K 1003KB/s in 0,4s
2021-12-30 12:53:21 (1003 KB/s) - ‘index.html.9’ saved [435027]
FINISHED --2021-12-30 12:53:21--
Total wall clock time: 4,1s
Downloaded: 3 files, 613K in 0,8s (725 KB/s)
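One caveat: plain xargs splits its input on any whitespace (which is how the demo picks up two URLs from one line) and also treats quotes and backslashes specially, which can mangle unusual URLs. If the input is strictly one URL per line, a newline-delimited variant is more robust (a sketch assuming GNU xargs, which provides -d):
# One argument per input line, so each line becomes exactly one URL
tac FILE | sed -n -e 's/^.*URL: //p' | xargs -d '\n' wget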

Related

POST form login credentials through redirect using wget

I am trying to use bash and wget to POST form data to a login page and save the cookies after logging in. However, I have to go through an indirect URL that takes me to the login page through two redirects.
I've tried a variety of other methods using curl and wget, to no avail. They all reach the page but don't actually log in.
All of the StackOverflow questions and articles I've read on the subject claim it's as easy as the wget call below.
wget \
--save-cookies cookies.txt \
--keep-session-cookies \
--post-data "username=$username&password=$password" \
"$indirect_url"
Here is an example wget output:
--2021-07-09 15:57:21-- https://<<indirect.url.1>>/
Resolving <<indirect.url.1>> (<<indirect.url.1>>)... <<indirect.ip.1>>
Connecting to <<indirect.url.1>> (<<indirect.url.1>>)|<<indirect.ip.1>>|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://<<indirect.url.2>> [following]
--2021-07-09 15:57:22-- https://<<indirect.url.2>>
Reusing existing connection to <<indirect.url.2>>:443.
HTTP request sent, awaiting response... 302 Found
Location: https://<<login.url.2>> [following]
--2021-07-09 15:57:22-- https://<<login.url>>
Resolving <<login.url>> (<<login.url>>)... <<login.ip>>
Connecting to <<login.url>> (<<login.url>>)|<<login.ip>>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 6.32K --.-KB/s in 0s
2021-07-09 15:57:22 (284 MB/s) - ‘index.html’ saved [6473]
It seems to connect to the login page (200 response) and download it, but it doesn't actually log me in, nor is cookies.txt correct (e.g. it contains no post-login cookies).
Are the credentials not getting carried through the redirect?
Is there something else I'm doing wrong?
Any help would be appreciated,
Thank you.
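One likely culprit: the wget manual notes that if wget is redirected after a POST completes, it does not re-send the POST data to the redirected URL (a 301/302/303 is followed with a GET). So the credentials never reach the final login page. A workaround worth trying is to resolve the redirect chain first, then POST directly to the real login URL (a sketch; $login_url is a hypothetical variable you would fill in from the Location lines in the output above):
# Step 1: follow the redirects once to discover the real login URL
# and pick up any pre-login session cookies.
wget --save-cookies cookies.txt --keep-session-cookies -O /dev/null "$indirect_url"
# Step 2: POST the credentials straight to the final login URL,
# reusing and updating the same cookie jar.
wget --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
    --post-data "username=$username&password=$password" \
    "$login_url"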

Getting a list of urls with wget using regex

I'm starting with the page:
https://mysite/a
I'd like to spider the page, collecting the full URLs of any nested URLs below it that begin with the same stem (like https://mysite/a/b).
I've tried:
$ wget -r --spider --accept-regex "https://...*" 'https://.../' 2>test.txt
which produces a large amount of output, including what appear to be the URLs I'm after, like:
--2018-04-21 15:04:48-- https://mysite/a/
Reusing existing connection to mysite:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'a/index.html.tmp.tmp'
How do I just print out a list of the URLs?
Edit: as a test I changed it to
$ wget -r --spider 'https://mysite/a/' | grep 'https://mysite/a*' 2>test.txt
No output is being saved in test.txt; the file is empty.
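The catch is that wget writes its log to stderr, not stdout, so the pipe hands grep nothing, and 2>test.txt only redirects grep's (empty) stderr. Merging wget's stderr into the pipe fixes both at once (a sketch using the question's placeholder site):
# Route wget's log into the pipe, then keep only the matching URLs
wget -r --spider 'https://mysite/a/' 2>&1 | grep -o 'https://mysite/a[^ ]*' > test.txt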

Download OracleXE Using wget

How can I download OracleXE using wget and avoid the login?
I tried applying logic from this question for Oracle Java but I couldn't get it to work.
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
I get:
--2015-10-13 04:51:03-- http://download.oracle.com/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
Resolving download.oracle.com (download.oracle.com)... 206.248.168.160, 206.248.168.139, 206.248.168.160, ...
Connecting to download.oracle.com (download.oracle.com)|206.248.168.160|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://edelivery.oracle.com/akam/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip [following]
--2015-10-13 04:51:03-- https://edelivery.oracle.com/akam/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip
Resolving edelivery.oracle.com (edelivery.oracle.com)... 23.9.117.183, 23.9.117.183
Connecting to edelivery.oracle.com (edelivery.oracle.com)|23.9.117.183|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://login.oracle.com/pls/orasso/orasso.wwsso_app_admin.ls_login?Site2pstoreToken=v1.2~CA55CD32~7E777A421E00059BE8321AEAF3C29C59D68A2F46E15A49137CE5AAF6D6B46A0C599A4560AD622CF26FFFCF23FF8FC274F021B7E57B08CEF2076FADB1A57BBFB02B991E320BB3A417DDF966B4406E225736912745DE8F5E660631675765D519A5E7FF61481F567ED9C582AEAAEEC6E2A6C59D046AD82EA1C7AA08E9A1EDAFC44D97F22C470FE530A0F58872A00CAFD27012DF4851AD4964085264393C7220CF07817E14ED0B2130ECF06758DB538644A119246C4B65963CD1C825650BE3B3C86C1670EC8F754E943853BE4C58F0A4FD89B1CE14E7110087134765A9EBAA170769C75645798E1D978B944D2D896A564E49CD42478328D8661794E3DC377DBEF9F7C27184E0DFF7EAAB [following]
--2015-10-13 04:51:03-- https://login.oracle.com/pls/orasso/orasso.wwsso_app_admin.ls_login?Site2pstoreToken=v1.2~CA55CD32~7E777A421E00059BE8321AEAF3C29C59D68A2F46E15A49137CE5AAF6D6B46A0C599A4560AD622CF26FFFCF23FF8FC274F021B7E57B08CEF2076FADB1A57BBFB02B991E320BB3A417DDF966B4406E225736912745DE8F5E660631675765D519A5E7FF61481F567ED9C582AEAAEEC6E2A6C59D046AD82EA1C7AA08E9A1EDAFC44D97F22C470FE530A0F58872A00CAFD27012DF4851AD4964085264393C7220CF07817E14ED0B2130ECF06758DB538644A119246C4B65963CD1C825650BE3B3C86C1670EC8F754E943853BE4C58F0A4FD89B1CE14E7110087134765A9EBAA170769C75645798E1D978B944D2D896A564E49CD42478328D8661794E3DC377DBEF9F7C27184E0DFF7EAAB
Resolving login.oracle.com (login.oracle.com)... 209.17.4.8, 209.17.4.8
Connecting to login.oracle.com (login.oracle.com)|209.17.4.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2051 (2.0K) [text/html]
Saving to: ‘oracle-xe-11.2.0-1.0.x86_64.rpm.zip’
100%[======================================================================================================================================================>] 2,051 --.-K/s in 0s
2015-10-13 04:51:03 (142 MB/s) - ‘oracle-xe-11.2.0-1.0.x86_64.rpm.zip’ saved [2051/2051]
To download Oracle Linux zips directly to your server you have to:
1 - log in to Oracle.com with your credentials (https://login.oracle.com/mysso/signon.jsp)
2 - export cookies.txt with your browser
3 - copy this file to your server:
scp cookies.txt root@url:/path/
4 - go to the path containing your cookies.txt, copy the install link, and run wget on your server terminal:
wget --load-cookies=cookies.txt http://download.oracle.com/otn/linux/oracle12c/121020/linuxamd64_12102_database_1of2.zip
wget --load-cookies=cookies.txt http://download.oracle.com/otn/linux/oracle12c/121020/linuxamd64_12102_database_2of2.zip
Check the file sizes with ls -lah.
Note that wget --header "Cookie: oraclelicense=accept-securebackup-cookie" overrides all other cookies, including authorization ones.
Instead, you can use a custom cookies.txt file and --user/--password (tested on Oracle Archive and OracleXE):
echo .oracle.com TRUE / FALSE 0 oraclelicense accept-securebackup-cookie >cookies.txt
wget -c --load-cookies cookies.txt --trust-server-names --user=SSO_USERNAME --password=SSO_PASSWORD URL
UPD: Attention! cookies.txt is tab-separated! To be sure real tabs are written, quote the string so the shell passes the \t escapes through to echo -e:
echo -e '.oracle.com\tTRUE\t/\tFALSE\t0\toraclelicense\taccept-securebackup-cookie' > cookies.txt
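Alternatively, printf interprets \t itself in any POSIX shell, which avoids echo -e's portability quirks (a sketch writing the seven Netscape cookie-file fields: domain, include-subdomains flag, path, secure, expiry, name, value):
# Write one tab-separated cookie line for the Oracle license cookie
printf '.oracle.com\tTRUE\t/\tFALSE\t0\toraclelicense\taccept-securebackup-cookie\n' > cookies.txt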

Why is wget saving something when using parametrized url?

I am using following command in my bash script to trigger jenkins build:
wget --no-check-certificate "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx"
Output:
HTTP request sent, awaiting response... 201 Created
Length: 0
Saving to: “buildWithParameters?token=xxx”
[ <=> ] 0 --.-K/s in 0s
2015-02-20 10:10:46 (0.00 B/s) - “buildWithParameters?token=xxx” saved [0/0]
And then it creates an empty file: “buildWithParameters?token=xxx”
My question is: why does wget create this file, and how do I turn that behavior off?
Most simply:
wget --no-check-certificate -O /dev/null http://foo
This will make wget save the file to /dev/null, effectively discarding it.
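If the log lines are also unwanted in a script like this, -q silences them as well (a sketch reusing the question's Jenkins URL):
# Trigger the build, discarding both the response body and wget's log
wget --no-check-certificate -q -O /dev/null "http://<jenkins_url>/view/some_view/job/some_prj/buildWithParameters?token=xxx"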

Why does my wget in Bash result in a 400 Bad Request error?

I have this data in a text file:
-O BNU-ESM-pr-Historical-19560101-19601231.nc https://dataserver.nccs.nasa.gov/thredds/ncss/bypass/NEX-GDDP/bcsd/historical/r1i1p1/pr/BNU-ESM.ncml?var=pr&north=55&west=72&east=136&south=16&horizStride=1&time_start=1956-01-01T12%3A00%3A00Z&time_end=1960-12-31T12%3A00%3A00Z&timeStride=1
I am using this code for a .sh file:
#!/bin/bash
while read -r line; do wget $line; done < pr_china.txt
Result of the command in BASH:
ahmad#ahmad:/mnt/c/script_sh_files$ ./pr_china.sh
--2018-12-29 23:10:30-- https://dataserver.nccs.nasa.gov/thredds/ncss/bypass/NEX-GDDP/bcsd/historical/r1i1p1/pr/BNU-ESM.ncml?var=pr&north=55&west=72&east=136&south=16&horizStride=1&time_start=1956-01-01T12%3A00%3A00Z&time_end=1960-12-31T12%3A00%3A00Z&timeStride=1%0D
Resolving dataserver.nccs.nasa.gov (dataserver.nccs.nasa.gov)... 2001:4d0:2418:2800::a99a:9229, 169.154.146.41
Connecting to dataserver.nccs.nasa.gov (dataserver.nccs.nasa.gov)|2001:4d0:2418:2800::a99a:9229|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2018-12-29 23:10:33 ERROR 400: Bad Request.
The %0D at the end of the requested URL is a carriage return: pr_china.txt has Windows (CRLF) line endings, and read leaves the stray \r attached to each line. Run dos2unix on your pr_china.txt file before you use it.
See: How to remove %0D from end of URL when using wget?
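If dos2unix isn't available, the carriage return can also be stripped inside the loop (a sketch; $line is left unquoted on purpose, since each line carries both the -O option and the URL):
#!/bin/bash
while read -r line; do
    line=${line%$'\r'}   # drop a trailing Windows carriage return, if any
    wget $line           # unquoted: the line expands to "-O file URL"
done < pr_china.txt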
