wget to parse a webpage in shell

I am trying to extract URLs from a webpage using wget. I tried this:
wget -r -l2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'
It displays:
FINISHED
Downloaded: 18,472 bytes in 1 files
but it does not display the links. If I run the two steps separately:
wget -r -l2 --reject=gif -O out.html www.google.com
sed -n 's/.*href="\([^"]*\).*/\1/p' < out.html
Output
http://www.google.com/intl/en/options/
/intl/en/policies/terms/
It does not display all the links, such as:
http://www.google.com
http://maps.google.com
https://play.google.com
http://www.youtube.com
http://news.google.com
https://mail.google.com
https://drive.google.com
http://www.google.com
http://www.google.com
http://www.google.com
https://www.google.com
https://plus.google.com
Moreover, I want to get links from the 2nd level and deeper. Can anyone give a solution for this?
Thanks in advance

The -O file option captures the output of wget and writes it to the specified file, so there is no output going through the pipe to sed.
You can say -O - to direct wget output to standard output.
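
As a minimal sketch of that fix, assuming the sed expression from the question is kept unchanged: with -O - the page itself goes to standard output and reaches sed, and -q suppresses the progress report. The function name last_href_per_line is made up for illustration. Because the leading .* is greedy, this expression keeps only the last href on each line, which is one reason some links go missing.

```shell
# Hypothetical helper: print one href value per line of HTML read from stdin.
# The leading .* is greedy, so on a line with several links only the LAST
# href survives; lines without href print nothing.
last_href_per_line() {
    sed -n 's/.*href="\([^"]*\).*/\1/p'
}

# Usage (needs network access):
#   wget -q -O - www.google.com | last_href_per_line
```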

If you don't want to use grep, you can try
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
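
For comparison, here is a sketch of a grep-based variant that prints every href, even when several sit on one line. It assumes a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX), and extract_hrefs is an illustrative name, not something from the thread.

```shell
# Hypothetical helper: print every href value found in HTML read from stdin,
# one per line, accepting either single or double quotes around the value.
extract_hrefs() {
    # grep -o emits each match on its own line; sed then strips the
    # leading href=" (or href=') and the trailing quote.
    grep -o "href=[\"'][^\"']*[\"']" | sed "s/^href=[\"']//; s/[\"']\$//"
}

# Usage (needs network access):
#   wget -q -O - www.google.com | extract_hrefs
```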

Related

how to pipe multi commands to bash?

I want to check some files on a remote website.
Here is the bash command that generates the commands to calculate each file's md5:
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\;
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce | md5sum;
Then I piped the generated commands to bash, but only the first command was executed.
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\; | bash -v
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
3d2f0e76e04444f4ec456ef9f11289ec -
[root]#

Would you please try the following instead:
while read -r _ _ url _; do
wget -q -O - "$url"e | md5sum
done < <(head -n 3 zrcpathAll)
Note that we should not put -i in front of "$url" here.
[Explanation about -i option]
The wget manpage says:
-i file
--input-file=file
Read URLs from a local or external file. [snip]
If this function is used, no URLs need be present on the command line. [snip]
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html.
Furthermore, the file's location will be implicitly used as base
href if none was specified.
where the file will contain line(s) of url such as:
https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce
https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce
https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce
Whereas if we use the option as -i url, wget first
downloads the url as a file and expects it to contain lines of urls
as above. In our case, the url is itself the target to download,
not a list of urls, so wget reports an error: No URLs found in url.
Even if wget fails, why does the command output just one line rather than
three lines of md5sum results?
This seems to be because the head command immediately flushes the remaining
lines when the piped subprocess fails.
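
The loop above can also be wrapped as a small helper that reads one URL per line. This is only a sketch: md5_each and the FETCH override are hypothetical names introduced here so the loop can be tried without network access, and the unquoted $fetch relies on bash/sh word splitting. Redirecting the fetch command's stdin from /dev/null is a common precaution so that a command inside a read loop cannot consume the remaining input.

```shell
# Hypothetical helper: read one URL per line on stdin, print "<md5>  <url>".
# FETCH is an assumed override hook; by default the URL is fetched with wget.
md5_each() {
    fetch=${FETCH:-wget -q -O -}
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip empty lines
        # </dev/null keeps the fetch command away from the URL list on stdin
        sum=$($fetch "$url" </dev/null | md5sum | cut -d' ' -f1)
        printf '%s  %s\n' "$sum" "$url"
    done
}

# Usage (needs network access and the zrcpathAll file from the question):
#   head -n 3 zrcpathAll | awk '{print $3 "e"}' | md5_each
```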

Why does curl -o output contain sequences like "^[[38;5;250m", when "surf" output looks fine?

I want to write the output of wttr.in to a file with curl. The problem is that the output does not look the way it does when I just surf to wttr.in.
What I did is:
curl wttr.in -o ~/wt.tex and curl wttr.in -o ~/wt
The output looks like: <output>
It should look like https://wttr.in.

I solved it myself:
less -r -f -L wt.tex
-r displays raw control characters, so the terminal interprets the color sequences.
-f forces less to open the file without asking.
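
Sequences like ^[[38;5;250m are ANSI color escape codes (ESC, then [, numeric parameters, then m). If a plain-text file is wanted rather than a colored view, one option is to strip them before saving. The sketch below assumes only color sequences of that form are present; strip_ansi is an illustrative name.

```shell
# Hypothetical helper: remove ANSI color sequences (ESC [ ... m) from stdin.
strip_ansi() {
    esc=$(printf '\033')                  # the literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"
}

# Usage (needs network access):
#   curl -s wttr.in | strip_ansi > ~/wt
```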

wget command to download a file and save as a different filename

I am downloading a file using the wget command. But when it downloads to my local machine, I want it to be saved as a different filename.
For example: I am downloading a file from www.examplesite.com/textfile.txt
I want to use wget to save the file textfile.txt in my local directory as newfile.txt. I am using the wget command as follows:
wget www.examplesite.com/textfile.txt

Use the -O file option.
E.g.
wget google.com
...
16:07:52 (538.47 MB/s) - `index.html' saved [10728]
vs.
wget -O foo.html google.com
...
16:08:00 (1.57 MB/s) - `foo.html' saved [10728]

Also notice the order of parameters on the command line. At least on some systems (e.g. CentOS 6):
wget -O FILE URL
works. But:
wget URL -O FILE
does not work.

You would use the command Mechanical snail listed. Notice the uppercase O. Full command line to use could be:
wget www.examplesite.com/textfile.txt --output-document=newfile.txt
or
wget www.examplesite.com/textfile.txt -O newfile.txt
Hope that helps.

wget -O yourfilename.zip remote-storage.url/theirfilename.zip
will do the trick for you.
Note:
a) It's a capital O.
b) Only wget -O filename url will work. Putting -O last will not.
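
Since the answers here disagree on whether -O may come after the URL, a small wrapper that always puts the option first avoids the question. This is a sketch only; save_as is a made-up name.

```shell
# Hypothetical wrapper: download URL ($1) and save it under a chosen
# local file name ($2).
save_as() {
    url=$1
    name=$2
    # Option first, URL last: the ordering every answer in this thread
    # reports as working.
    wget -O "$name" "$url"
}

# Usage (needs network access):
#   save_as www.examplesite.com/textfile.txt newfile.txt
```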

Either curl or wget can be used in this case. All 3 of these commands do the same thing, downloading the file at http://path/to/file.txt and saving it locally into "my_file.txt":
wget http://path/to/file.txt -O my_file.txt # my favorite--it has a progress bar
curl http://path/to/file.txt -o my_file.txt
curl http://path/to/file.txt > my_file.txt
Notice the first one's -O is the capital letter "O".
The nice thing about the wget command is it shows a nice progress bar.
You can prove the files downloaded by each of the 3 techniques above are exactly identical by comparing their sha512 hashes. Running sha512sum my_file.txt after running each of the commands above, and comparing the results, reveals all 3 files to have the exact same sha hashes (sha sums), meaning the files are exactly identical, byte-for-byte.
See also: How to capture cURL output to a file?
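
The hash comparison described above can be sketched as a small check. It assumes sha512sum is installed (it ships with GNU coreutils); same_sha512 is an illustrative name.

```shell
# Hypothetical helper: succeed only if both files have the same SHA-512 hash.
same_sha512() {
    a=$(sha512sum < "$1" | cut -d' ' -f1)
    b=$(sha512sum < "$2" | cut -d' ' -f1)
    # An empty hash means a file could not be read; treat that as a mismatch.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage, after downloading the same URL with wget and with curl:
#   same_sha512 wget_copy.txt curl_copy.txt && echo identical
```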

Using CentOS Linux I found that the easiest syntax would be:
wget "link" -O file.ext
where "link" is the web address you want to save and "file.ext" is the filename and extension of your choice.

How to download multiple URLs using wget using a single command?

I am using the following command to download a single webpage with all its images and JS using wget on Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It downloads the HTML as required, but when I tried to pass a text file with a list of 3 URLs to download, it didn't give any output. Below is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this also:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
In this attempt, the URLs in the text file had http:// prepended.
list.txt contains a list of 3 URLs which I need to download using a single command. Please help me resolve this issue.

From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or using the links from comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected.

First create a text file with the URLs that you need to download,
e.g. download.txt.
download.txt will look like this:
http://www.google.com
http://www.yahoo.com
Then use the command wget -i download.txt to download the files. You can add as many URLs to the text file as you need.

If you have a list of URLs separated on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
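
Because wget -i - reads the list from standard input, any command that prints URLs can feed it, not just a here-string. A minimal sketch, assuming you want to skip blank lines and treat lines starting with # as comments (a convention of this sketch, not a documented wget feature); download_list is a made-up name.

```shell
# Hypothetical helper: pass the usable lines of a URL list file to wget.
download_list() {
    # Drop blank lines and '#' comment lines, then let wget read the
    # remaining URLs from stdin via "-i -".
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wget -i -
}

# Usage (needs network access):
#   download_list list.txt
```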

Pedantic version:
for x in {'url1','url2'}; do wget "$x"; done
The advantage is that you can treat each iteration as a single wget URL command.

get veehd url in bash/python?

Can anybody figure out how to get the .avi URL of a veehd[dot]com video, given the page of the video, in a script? It can be Bash, Python, or common programs in Ubuntu.
They make you install an extension, and I've tried looking at the code, but I can't figure it out.

This worked for me:
#!/bin/bash
URL=$1 # page with the video
# Find the player iframe path on the video page
FRAME=$(wget -q -O - "$URL" | sed -n -e '/playeriframe.*do=d/{s/.*src : "//;s/".*//p;q}')
# The first link on the iframe page is the stream URL
STREAM=$(wget -q -O - "http://veehd.com$FRAME" | sed -n -e '/<a/{s/.*href="//;s/".*//p;q}')
echo "$STREAM"
