Print links to all pdfs using bash - bash

I'm writing a bash script that should download an HTML page and extract all the links to PDF files from it.
I have to say that I'm a newbie to bash, so for now I can only grep all the lines that contain <a href and then grep those lines for the word pdf.
I can barely use awk, and I don't know how to write the right regex to get only the text in <a href="*.pdf">, where what I want is the *.pdf part.
EDIT: grep "... is not found.

Try applying this line to the whole HTML string. It works perfectly for me.
grep -io "<a[[:space:]]*href=\"[^\"]\+\.pdf\">" | awk 'BEGIN{FS="\""}{print $2}'
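For the full picture, here is a sketch of the download step feeding that extraction, assuming curl is available (wget -qO- would work the same way); the page address in $url is just a placeholder:

url="http://example.com/index.html"    # hypothetical page address
curl -s "$url" \
  | grep -io "<a[[:space:]]*href=\"[^\"]\+\.pdf\">" \
  | awk 'BEGIN{FS="\""}{print $2}'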

Related

BASH deleting HTML tags from the text file

I need to filter out all HTML tags from a text file (they could be any sequence between <...>).
I came up with this command: cat my_file | sed 's/<[^>]*>//', but it only deletes the first tag on each line. How do I delete all the tags? Is the problem with the regular expression?
From the sed manual:
The s command can be followed by zero or more of the following flags:
g
Apply the replacement to all matches to the regexp, not just the first.
So
cat my_file | sed 's/<[^>]*>//g'
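A quick check of the difference on a made-up line of HTML:

echo '<p>Hello <b>world</b>!</p>' | sed 's/<[^>]*>//'     # first tag only: Hello <b>world</b>!</p>
echo '<p>Hello <b>world</b>!</p>' | sed 's/<[^>]*>//g'    # all tags: Hello world!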
If your intent is to remove all tags and get only the text between them, use html2text or pup 'text{}':
https://github.com/ericchiang/pup
http://www.mbayer.de/html2text/
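For instance, a minimal sketch with those two (assuming both are installed; pup reads the HTML on stdin, html2text takes the file name as an argument):

pup 'text{}' < my_file
html2text my_file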
There are other tools like xidel and xmlstarlet too.

BASH extract links from youtube html file

I'm trying to make a YouTube music player on my Raspberry Pi, and I'm stuck at this point:
wget downloads a page, for example https://www.youtube.com/results?search_query=test, to the file output.html.
Links on that page are saved in strings like this: <a href="/watch?v=DDzfeTTigKo"
Now when I try to grep them with cat site | grep -B 0 -A 0 watch?v=
it prints me a wall of text from that file, and I just want the specific lines like the one I mention above. And I want the result saved to a file site2.
Is this possible?
Try this with GNU grep:
grep -o '"/watch?v=[^"]*"' file.html
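Building on that, a sketch that turns the matches into full URLs and writes them to site2, as asked; output.html is the file name from the question, and sort -u drops the duplicates:

grep -o '"/watch?v=[^"]*"' output.html \
  | sed 's/"//g; s|^|https://www.youtube.com|' \
  | sort -u > site2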

parse word from html file

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.
Do not use regex to parse HTML.
Use an HTML parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"
This is a task for awk.
I guess you have other lines in the same file, so a search for span id is a must.
echo '<span id="result">WORD</span>' | awk -F'[<>]' '/span id/ {print $3}'
WORD
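Applied to a file rather than an echoed string, the same idea would look like this (FILE as in the question):

awk -F'[<>]' '/span id="result"/ {print $3}' FILE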
You can try
awk -f ext.awk input.html
where input.html is your input html file, and ext.awk is
{
    # Collect the whole file into one string, keeping the line breaks
    line = line $0 RS
}
END {
    # Three-argument match() is a GNU awk (gawk) extension
    match(line, /<span id="result">([^<]*)<\/span>/, a)
    print a[1]
}
This will extract the contents even across line breaks.
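The same thing can be squeezed into a one-liner; this sketch, like the script above, relies on GNU awk for the three-argument match(), and uses the RS='^$' trick to slurp the whole file into $0:

gawk -v RS='^$' 'match($0, /<span id="result">([^<]*)<\/span>/, a) {print a[1]}' input.html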
Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+' FILE
The expression between parentheses is a lookbehind; it is not part of the match but serves as a test for the regex part that follows: if the expression appears right before it, the captured pattern is only \w+ here. Option -o outputs only the match; option -P enables Perl-compatible regular expressions, which provide lookahead and lookbehind assertions.
If you want to modify this regex, please note that with grep, a lookbehind must have a fixed length.
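A quick check on the sample line from the question:

echo '<span id="result">WORD</span>' | grep -Po '(?<=<span id="result">)\w+'
# prints: WORD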

Extract .co.uk urls from HTML file

I need to extract .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Is there any way to do that?
PS: I'm learning bash.
Edit:
Code sample:
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note that some repeat.
Important: I need all links, broken or 404 ones too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
thanks!
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+.[[:space:]]//' \
| grep '^https\?://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from links line by line, so the numbered table of links comes first.
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
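If the file is just saved HTML with the URLs sitting in quoted attribute values, as in the sample row above, a rougher grep-only sketch also does the job:

grep -oE 'https?://[^"]*\.co\.uk[^"]*' file.html | sort -u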
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique URLs, pipe the output into sort -u.
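For example, combining the two:

awk -F'[ "]' '{ for (i = 1; i <= NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u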
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time for its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for the URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
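Put together, a sketch of the whole pipeline (the same options as above, with the sed cleanup appended):

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 \
  | grep -e "^\-\-.*\\.co\\.uk/.*" \
  | sed 's/.*-- //'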
If you do not have wget, then you can get it here

Getting text from html page, shell

I am trying to get text from an HTML page in shell, as part of a script to show me the temperature in my local area.
However, I can't get my head around how to use grep properly.
Excerpt from web page
</div><div id="yw-forecast" class="night" style="height:auto"><em>Current conditions as of 8:18 PM GMT</em><div id="yw-cond">Light Rain Shower</div><dl><dt>Feels Like:</dt><dd>6 °C</dd><dt>Barometer:</dt><dd style="position:relative;">1,015.92 mb and steady</dd><dt>Humidity:</dt><dd>87 %</dd><dt>Visibility:</dt><dd>9.99 km</dd><dt>Dewpoint
Excerpt cut down further:
<dt>Feels Like:</dt><dd>6 °C</dd>
Trying to grab the 6 °C
I have tried a variety of different tactics, including grep and awk. Can a shell wizard help me out?
Try
grep -o -e "<dd>.*deg;C</dd>" the_html.txt
From the man page:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify
multiple search patterns, or to protect a pattern beginning with
a hyphen (-). (-e is specified by POSIX.)
...
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
If you want to get rid of <dd> and </dd> too, just append | cut -b 5-12.
Give this a try:
grep -Po '(?<=Feels Like:</dt><dd>).*?(?=</dd>)' | sed 's/ °/°/'
Result:
6°C
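If this is going into a script, the value can be captured in a variable; a sketch assuming the page was saved to the_html.txt, the file name used in the first answer:

temp=$(grep -Po '(?<=Feels Like:</dt><dd>).*?(?=</dd>)' the_html.txt | sed 's/ °/°/')
echo "Feels like: $temp"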
If x is your input file and the HTML source is as regularly formatted as you write, this should work --
grep deg x | sed -E 's#^.*>([0-9]{1,2} °[CF])<.*#\1#'
Seth
