Extract .co.uk urls from HTML file - bash

I need to extract the .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that?
P.S.: I'm learning bash.
edit:
code sample:
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note: some repeat.
Important: I need all links, broken or 404 too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
Thanks!

The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+\.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from links line by line, so the numbered table of links comes first.
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+\.[[:space:]]//' strips the leading numbers from the individual links.
grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
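The same pipeline should also work on a local file, since links accepts a path in place of a URL. A minimal sketch, assuming your file is named file.html (-anonymous is dropped here because it can restrict local file access):
links -dump file.html -html-numbered-links 1 \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+\.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'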

One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique URLs, pipe the output into sort -u.
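For example, to print each matching URL only once:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u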
HTH

Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time to look them up via DNS, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
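Putting both parts together, the whole pipeline might look like this (a sketch that simply combines the two commands above, reusing the same file name):
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 \
  | grep -e "^\-\-.*\\.co\\.uk/.*" \
  | sed 's/.*-- //'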
If you do not have wget, then you can get it here

Related

Grep title of a page which is written with spaces [duplicate]

This question already has answers here:
Parsing XML using unix terminal
(9 answers)
I am trying to get the meta title of some websites.
Some people write the title like this:
<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>
<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>
There are other variations too, but my current focus is on the three ways above.
I wrote a simple command; it only captures the second way of writing the title, and I am not sure how to grep the other ways:
curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'
I also made something (very bad, I guess) where I grep the line numbers, like:
% curl -s https://allheartweb.com/ | grep -n '<title>'
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology
% curl -s https://allheartweb.com/ | grep -n '</title>'
8:</title>
and then store those numbers and run a loop to get the title, which I guess is a bad idea...
Is there any way I can get the title in all of these cases?
Try this:
curl -s https://allheartweb.com/ | tr -d '\n' | grep -m 1 -oP '(?<=<title>).+?(?=</title>)'
You can remove newlines from HTML via tr because they have no meaning in the title. The next step returns the first match of the shortest string enclosed in <title> </title>.
This is quite a simple approach of course. xmllint would be better but that's not available to all platforms by default.
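For completeness, a sketch of the xmllint route mentioned above (assuming xmllint is installed; the trailing 2>/dev/null hides HTML parser warnings):
curl -s https://allheartweb.com/ | xmllint --html --xpath '//title/text()' - 2>/dev/null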
grep is not a very good tool for matching across multiple lines; it processes input line by line. You could hack around that by making your incoming text one line, like this:
curl -s https://allheartweb.com/ | xargs | grep -o -E "<title>.*</title>"
This is probably what you want.
Try this sed:
curl -s https://allheartweb.com/ | sed -n "{/<title>/,/<\/title>/p}"

How to search through links and save only those that contain a specific data?

I have a text file called mylinks.txt with thousands of hyperlinks in the format "URL = http://examplelink.com".
What I want to do is go through all of these links and check whether any of them contain certain keywords, like "2018" or "2017". If a link contains the keyword, I want to save it in the file "yes.txt"; if it doesn't, it goes into the file "no.txt".
So at the end, I would end up with two files: one with the links that send me to pages with the keywords I'm searching for, and the other with the links that don't.
I was thinking about doing this with curl, but I don't even know if it's possible, and I also don't know how to "filter" the links by keyword.
What I have got until now is:
curl -K mylinks.txt >> output.txt
But this only creates a super large file with the HTML of the links it fetches.
I've searched and read through various curl tutorials and haven't found anything that "selectively" searches pages and saves only the links (not the content) of the pages matching the criteria.
-Untested-
For the links on lines containing "2017" or "2018":
cat mylinks.txt | grep -E '2017|2018' | grep -o 'http[^ ]*' >> yes.txt
To get the URLs from lines that don't contain the keywords:
cat mylinks.txt | grep -vE '2017|2018' | grep -o 'http[^ ]*' >> no.txt
This is Unix piping: the | character takes the stdout of the program on its left and feeds it to the stdin of the program on its right.
In Unix-like computer operating systems, a pipeline is a sequence of
processes chained together by their standard streams, so that the
output of each process (stdout) feeds directly as input (stdin) to the
next one. https://en.wikipedia.org/wiki/Pipeline_(Unix)
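Applied to the line format from the question, the chain works like this (the URL below is just a made-up example):
echo 'URL = http://examplelink2018.com' | grep -E '2017|2018' | grep -o 'http[^ ]*'
output:
http://examplelink2018.com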
Here is my take on it (kind of tested on a URL file with a few examples).
This is supposed to be saved as a script; it's too long to type into the console directly.
#!/bin/bash
urlFile="/path/to/myLinks.txt"
cut -d' ' -f3 "$urlFile" | \
while read url
do
echo "checking url $url"
if (curl "$url" | grep "2017")
then
echo "$url" >> /tmp/yes.txt
else
echo "$url" >> /tmp/no.txt
fi
done
Explanation: the cut is necessary to cut away the prefix "URL = " in each line. Then the URLs are fed into the while-read loop. For each URL, we curl it, grep for the interesting keyword in it (in this case "2017"), and if the grep returns 0, we append this URL to the file with the interesting URLs.
Obviously, you should adjust the paths and the keyword.
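For example, you might save it under a hypothetical name such as filter-links.sh, make it executable, run it, and then inspect the two result files:
chmod +x filter-links.sh
./filter-links.sh
cat /tmp/yes.txt
cat /tmp/no.txt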

Is it possible to clean up an HTML file with grep to extract certain strings?

There is a website that I am a part of and I wanted to get the information out of the site on a daily basis. The page looks like this:
User1 added User2.
User40 added user3.
User13 added user71
User47 added user461
so on..
There's no JSON endpoint to get the information and parse it, so I have to wget the page and clean up the HTML to get lines like:
User1 added user2
Is it possible to clean this up even though the username always changes?
I would divide that problem into two:
How to clean up your HTML
Yes it is possible to use grep directly, but I would recommend using a standard tool to convert HTML to text before using grep. I can think of two (html2text is a conversion utility, and w3m is actually a text browser), but there are more:
wget -O - http://www.stackoverflow.com/ | html2text | grep "How.*\?"
w3m http://www.stackoverflow.com/ | grep "How.*\?"
These examples will get the homepage of StackOverflow and display all questions found on that page starting with How and ending with ? (it displays about 20 such lines for me, but YMMV depending on your settings).
How to extract only the desired strings
Concerning your username, you can just tune your expression to match different users (-E is necessary due to the extended regular expression syntax, -o will make grep print only the matching part(s) of each line):
[...] | grep -o -E ".ser[0-9]+ added .ser[0-9]+"
This however assumes that users always have a name matching .ser[0-9]+. You may want to use a more general pattern like this one:
[...] | grep -o -E "[[:graph:]]+[[:space:]]+added[[:space:]]+[[:graph:]]+"
This pattern will match added surrounded by any two other words, delimited by an arbitrary number of whitespace characters. Or simpler (assuming that a word may contain everything but blank, and the words are delimited by exactly one blank):
[...] | grep -o -E "[^ ]+ added [^ ]+"
Do you intend to just strip away the HTML tags?
Then try this:
sed 's/<[^>]*>//g' infile >outfile

grep or sed pattern matching of domain name and truncating of subdomain?

I am trying to extract a list of domain names from a httrack data stream using grep. I have it close to working, but the result also includes any and all sub-domains.
httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo "([0-9,a-z\.-]+)\.(com)"
Here is my current example result:
domain1.com
domain2.com
www.domain3.com
subdomain.domain4.com
whatever.domain5.com
Here is my desired example result.
domain1.com
domain2.com
domain3.com
domain4.com
domain5.com
Is there something I can add to this grep expression, or should I pipe it to a new sed expression to truncate any subdomains? And if so, how do I accomplish this task? I'm stuck. Any help is much appreciated.
Regards,
Wyatt
You could drop the . in the grep pattern. The following should work
httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
grep -iEo '[[:alnum:]-]+\.(com|net|org)'
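As a quick sanity check, running that pattern over the example hostnames from your question should give:
printf '%s\n' domain1.com www.domain3.com subdomain.domain4.com whatever.domain5.com | grep -iEo '[[:alnum:]-]+\.(com|net|org)'
output:
domain1.com
domain3.com
domain4.com
domain5.com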
If you just want .com domains, then the following will work: it removes http:// (with or without an s) and any subdomains in front of the domain. As you can see, though, it will only work for .com.
/(?:https?:\/\/[a-z0-9.]*?)([a-zA-Z0-9-]*\.com)/
Example Dataset
http://www.ilovefreestuff.com/
https://test.ilovefreestuff.com/
https://test.sub.ilovefreestuff.com/
REGEX101
That being said, it is generally bad practice to parse and/or validate domain names using regex, as there are a ton of variants that can never be fully accounted for; the exception is when the conditions for matching and/or the dataset are clearly defined and not all-encompassing. THIS post has more details on this process and covers a few more situations.
I use this code; it includes all domains & subdomains:
grep -oE '[[:alnum:]_.-]+[.][[:alnum:]_.-]+' file_name | sed -re 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}//g' | sort -u > test.txt

Getting text from html page, shell

I am trying to get text from an HTML page in shell, as part of a script to show me the temperature in my local area.
However, I can't get my head around how to use grep properly.
Excerpt from web page
</div><div id="yw-forecast" class="night" style="height:auto"><em>Current conditions as of 8:18 PM GMT</em><div id="yw-cond">Light Rain Shower</div><dl><dt>Feels Like:</dt><dd>6 °C</dd><dt>Barometer:</dt><dd style="position:relative;">1,015.92 mb and steady</dd><dt>Humidity:</dt><dd>87 %</dd><dt>Visibility:</dt><dd>9.99 km</dd><dt>Dewpoint
Excerpt cut down further
<dt>Feels Like:</dt><dd>6 °C</dd>
I'm trying to grab the 6 °C.
I have tried a variety of different tactics, including grep and awk. Can a shell wizard help me out?
Try
grep -o -e "<dd>.*deg;C</dd>" the_html.txt
From the man page:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify
multiple search patterns, or to protect a pattern beginning with
a hyphen (-). (-e is specified by POSIX.)
...
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
If you want to get rid of <dd> and </dd> too, just append | cut -b 5-12.
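That is, the combined command would be:
grep -o -e "<dd>.*deg;C</dd>" the_html.txt | cut -b 5-12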
Give this a try:
grep -Po '(?<=Feels Like:</dt><dd>).*?(?=</dd>)' | sed 's/ °/°/'
Result:
6°C
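As a complete invocation, assuming the page has been saved to a file first (weather.html is just a placeholder name):
grep -Po '(?<=Feels Like:</dt><dd>).*?(?=</dd>)' weather.html | sed 's/ °/°/'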
If x is your input file and the HTML source is as regularly formatted as you write, this should work:
grep deg x | sed -e "s#^.*>\([0-9]\{1,2\} \&deg;[CF]\)<.*#\1#"
Seth
