Parse HTML using shell - bash

I have an HTML file with a lot of data, and this is the part I am interested in:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I am trying to use awk; my current attempt is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"
but what I want is to have:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?

awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can perform XSL transformations. On Debian and Ubuntu, xmllint is provided by the libxml2-utils package and xsltproc by the xsltproc package.
You can also use a programming language that is able to parse HTML.

awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
while (getline > 0 && /<td /) {
gsub(/<b>/, ""); sub(/ .*/, "", $3)
print $3
}
exit
}' file

$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
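A minimal sketch of the same field-separator idea, restricted to the row that starts with "Total" so that other numeric cells elsewhere in the file are ignored. The helper name totals_from_html and the inline sample are assumptions for illustration, and the approach still relies on each cell sitting on its own line:

```shell
#!/bin/sh
# Sample input copied from the question; in practice this would be index.html.
html='<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>'

# totals_from_html (hypothetical name) reads HTML on stdin and prints the
# leading number of every cell in the row whose first cell is "Total".
totals_from_html() {
  awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '
    /<b>Total<\/b>/        { in_row = 1; next }  # start of the row of interest
    in_row && /<\/tr>/     { exit }              # stop at the end of that row
    in_row && $2 ~ /[0-9]/ { print $2 + 0 }      # "+ 0" keeps only the leading number
  '
}

printf '%s\n' "$html" | totals_from_html
```

With a real file the call would be `totals_from_html < index.html`.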

You really should use a real HTML parser for this job, for example:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
This requires perl with the Mojolicious package installed, which is easy to do with:
curl -L get.mojolicio.us | sh

BSD/GNU grep/ripgrep
For simple extraction you can use grep. With the data from the question:
$ grep -Eo "[0-9][^<]*" file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
54
1
0 (0/0)
0
Extracting outer html of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep, which has a similar syntax but is much faster.
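As a sketch of the line-joining trick mentioned above: grep matches one line at a time, so a tag pair that spans several lines only matches after the newlines have been replaced. The sample document below is made up and stands in for the curl output:

```shell
#!/bin/sh
# Hypothetical multi-line document standing in for the downloaded page.
page='<html>
<body>
<div>
<h1>Example Domain</h1>
</div>
</body>
</html>'

# Join all lines with spaces so that a single grep -o match can span them.
body=$(printf '%s\n' "$page" | tr '\n' ' ' | grep -o '<body>.*</body>')
printf '%s\n' "$body"
```

Because `.*` is greedy, this returns everything from the first `<body>` to the last `</body>`, which is only what you want when there is exactly one such pair.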

HTML-XML-utils
You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes many command-line tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example with stripping out <b> tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check the html-xml-utils documentation.
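The tag-stripping step can also be written without `\+`, which is a GNU sed extension. The following is a sketch under that assumption; strip_tags is a hypothetical helper name, and the sample lines imitate the hxselect output shown above. A final awk keeps only the first token of each line, so "0 (0/0)" becomes "0":

```shell
#!/bin/sh
# Sample lines shaped like the hxselect output shown above.
cells='<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>'

# strip_tags (hypothetical name): delete anything that looks like a tag.
# "<[^>]*>" avoids the GNU-only "\+" operator.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# Keep only the first whitespace-separated token of each line.
numbers=$(printf '%s\n' "$cells" | strip_tags | awk '{ print $1 }')
printf '%s\n' "$numbers"
```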

I was recently pointed to pup, which in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0

With xidel, a true HTML parser, and XPath:
$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0
$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0

ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use the ex editor to substitute on all lines (%): ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
Select from the beginning of the line up to a > (^[^>].*>) for removal, right before the 2nd part.
Capture the main part up to the next < (\([^<]\+\)).
Select everything else from that < onward for removal (<.*).
The whole matching line is replaced with \1, which refers to the pattern inside the escaped parentheses \( \).
After the substitution, delete every line that still contains a letter by using the global command: g/[a-zA-Z]/d.
Finally, print the current buffer to the screen with +%p.
Then silently (-s) quit without saving (-c "q!"), or save to the file (-c "wq").
Once tested, change -scq! to -scwq to modify the file in place.
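The same two steps (substitute, then delete lines containing letters) can be sketched with sed for comparison. This swaps ex for sed and writes `\+` as a repeated bracket expression so it does not depend on GNU extensions; extract_cells is a hypothetical helper name and the sample is the table row from the question:

```shell
#!/bin/sh
html='<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>'

# extract_cells (hypothetical name):
#   1st expression: replace the line with the text between a ">" and the next "<"
#   2nd expression: drop every line that still contains a letter
extract_cells() {
  sed -e 's/^[^>].*>\([^<][^<]*\)<.*/\1/' -e '/[a-zA-Z]/d'
}

cells=$(printf '%s\n' "$html" | extract_cells)
printf '%s\n' "$cells"
```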
Here is another simple example which removes style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, parsing HTML with regular expressions is not advised, so for a long-term solution you should use a language with a proper parser (such as Python, Perl or PHP DOM).
See also:
How to parse hundred HTML source code files in shell?
Extract data from HTML table in shell script?

What about:
lynx -dump index.html

Related

Bash sed command issue

I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:
##!/bin/bash
# fetches the links of the movie's imdb pages for a given actor
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ];
code="nm0000122"
then
code="nm0000050"
fi
curl "https://www.imdb.com/name/$code/#actor" | grep -Eo
'href="/title/[^"]*' | sed 's#^.*href=\"/#https://www.imdb.com/#g' |
sort -u > imdb_links.txt
#parses each of the link in the link text file and gets the details for
each of the movie. THis is followed by the cleaning process
for i in $(cat imdb_links.txt)
do
curl $i |
html2text |
sed -n '/Sign_In/,$p'|
sed -n '/YOUR RATING/q;p' |
head -n-1 |
tail -n+2
done > imdb_all.txt
The sample generated output is:
EN
⁰
* Fully supported
* English (United States)
* Partially_supported
* Français (Canada)
* Français (France)
* Deutsch (Deutschland)
* हिंदी (भारत)
* Italiano (Italia)
* Português (Brasil)
* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
How do I change the code to further parse the output so that I get only the data from the title of the movie up to the IMDb rating (in this case, from the line that contains the title 'Duck Soup' to the end)?
Here is the code:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^href="#https://www.imdb.com#g' |
sort -u |
while read link; do
# uncomment the next line to save links into file:
#echo "$link" >>imdb_links.txt
curl "$link" |
html2text -utf8 |
sed -n '/Sign_In/,/YOUR RATING/ p' |
sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
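To see what the two sed calls in the loop do, here is a small sketch run against made-up text shaped like the html2text output in the question. The first call keeps the range from Sign_In to YOUR RATING; the second drops the last line of that range and prints from the title line (six asterisks on each side) onward. trim_to_movie is a hypothetical helper name:

```shell
#!/bin/sh
# Made-up text shaped like the html2text output in the question.
dump='Sign_In
EN
* Fully supported
****** Duck Soup ******
* 19331933
* 1h 9m
IMDb RATING
7.8/10
YOUR RATING
Rate'

# trim_to_movie (hypothetical name) wraps the two sed calls from the loop.
trim_to_movie() {
  sed -n '/Sign_In/,/YOUR RATING/ p' |
    sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
}

movie=$(printf '%s\n' "$dump" | trim_to_movie)
printf '%s\n' "$movie"
```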
Please(!) have a look at the following urls on why it's a really bad idea to parse HTML with sed:
RegEx match open tags except XHTML self-contained tags
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
The thing you're trying to do can be done with the HTML/XML/JSON parser xidel and with just 1 call!
In this example I'll use the IMDB of Charlie Chaplin as source.
Extract all 94 "Actor" IMDB movie urls:
$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94
There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.
For the individual movie urls you could parse the HTML to get the text-nodes you want...
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
//h1,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
(//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10
...but with those class names I'd say that's a rather fragile endeavor. Instead I'd recommend parsing the JSON at the top of the HTML source within the <script> node:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
datePublished,
duration,
aggregateRating/ratingValue
)
'
A Countess from Hong Kong
1967-03-15
PT2H
6
...or to get a similar output as above:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
All combined:
$ xidel -s "https://www.imdb.com/name/nm0000122" \
-f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
-e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10

Bash: replace specific text with its translation

There is a huge file, in it I want to replace all the text between '=' and '\n' with its translation, here is an example:
input:
screen.LIGHT_COLOR=Lighting Color
screen.LIGHT_M=Light (Morning)
screen.AMBIENT_M=Ambient (Morning)
output:
screen.LIGHT_COLOR=Цвет Освещения
screen.LIGHT_M=Свет (Утро)
screen.AMBIENT_M=Эмбиент (Утро)
All I have managed to do until now is to extract and translate the targeted text.
while IFS= read -r line
do
echo $line | cut -d= -f2- | trans -b en:ru
done < file.txt
output:
Цвет Освещения
Свет (Утро)
Эмбиент (Утро)
*trans is short for translate-shell. It is slow, but does the job. -b for brief translation; en:ru means English to Russian.
If you have any suggestions or solutions I'll be glad to know, thanks!
Edit, in case someone needs it:
After discovering the translate-shell limitations I ended up going with @TaylorG.'s suggestion. It seems that translate-shell allows around 110 requests per some period of time. Processing each line separately results in 1300 requests, which breaks the script.
Long story short, it is faster to pack all the data into a single request. It's possible to reduce the processing time from a couple of minutes to mere seconds. Sorry for the messy code, it's my third day with:
cut -s -d = -f 1 en_US.lang > option_en.txt
cut -s -d = -f 2 en_US.lang > value_en.txt
# merge lines
sed ':a; N; $!ba; s/\n/ :: /g' value_en.txt > value_en_block.txt
trans -b en:ru -i value_en_block.txt -o value_ru_block.txt
sed 's/ :: /\n/g' value_ru_block.txt > value_ru.txt
paste -d = option_en.txt value_ru.txt > ru_RU.lang
# remove temporary files
rm option_en.txt value_en.txt value_en_block.txt value_ru.txt value_ru_block.txt
Thanks Taylor G., Armali and every commentator
Running a pipeline inside a large loop is expensive. You can try the following instead.
cut -s -d = -f 1 file.txt > name.txt
cut -s -d = -f 2- file.txt | trans -b en:ru > translate.txt
paste -d = name.txt translate.txt
It should be much faster than your current script. I'm not sure how your trans command is written. If it does not already process batch input, it needs to be updated to do so, e.g. using a while loop.
trans() {
while IFS= read -r line; do
# translate "$line" here and print the result; printing it unchanged is only a placeholder
printf '%s\n' "$line"
done
}
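A self-contained sketch of the cut/paste approach, for trying it out without network access. translate_stub is a hypothetical stand-in for `trans -b en:ru` that merely upper-cases its input, and translate_values is an assumed helper name. The point is that the translator is started once for all values instead of once per line:

```shell
#!/bin/sh
# translate_stub (hypothetical) stands in for "trans -b en:ru": it upper-cases
# its input so that the example runs offline.
translate_stub() {
  tr '[:lower:]' '[:upper:]'
}

# translate_values FILE: print FILE with everything after the first "=" translated.
translate_values() {
  names=$(mktemp) || return 1
  values=$(mktemp) || return 1
  cut -s -d = -f 1 "$1" > "$names"                     # keys, one per line
  cut -s -d = -f 2- "$1" | translate_stub > "$values"  # one translator call for all values
  paste -d = "$names" "$values"                        # stitch keys and values back together
  rm -f "$names" "$values"
}

sample=$(mktemp)
printf '%s\n' 'screen.LIGHT_COLOR=Lighting Color' 'screen.LIGHT_M=Light (Morning)' > "$sample"
result=$(translate_values "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

This only works if the translator returns exactly one output line per input line; otherwise paste will pair keys with the wrong values.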
You already did most of the work, though it can be optimized a bit. What's missing is just to output the first part of the line up to the equal sign together with the translation:
while IFS== read -r left right
do echo "$left=$(trans -b en:ru <<<"$right")"
done <file.txt

extract text between two words and in a specific line

I'm trying to make a Linux bash script to download an HTML page, extract numbers from this HTML page and assign them to variables.
The HTML page has several lines, but I'm interested in these:
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>
</table></td>
Every time I download the page I have to read the two values in rows 5 and 11, between <strong> and </strong> (61 and 65 in this example, but they are different each time).
The two values extracted from the HTML must then be assigned to two variables.
Thanks for any ideas.
Let's assume we have a page called page.html. You can first select the lines with grep, then extract the values with sed, and finally pick each value in turn with awk:
$ var0=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==1')
$ var1=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==0')
output:
$ echo $var0
61
$ echo $var1
65
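The two pipelines can also be collapsed into a single pass. The following is a sketch under the same assumption that the numeric values are the only all-digit `<strong>` contents on the page. The variable names time1 and time2 are made up, and the inline sample stands in for page.html:

```shell
#!/bin/sh
page='<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>'

# Print only the all-digit contents of <strong>...</strong>, one per line.
values=$(printf '%s\n' "$page" |
  sed -n 's/.*<strong>\([0-9][0-9]*\)<\/strong>.*/\1/p')

# Split the output lines into the positional parameters $1 and $2
# (left unquoted on purpose so the shell splits on the newline).
set -- $values
time1=$1
time2=$2
printf 'time1=%s time2=%s\n' "$time1" "$time2"
```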
This might work for you (GNU sed):
sed -rn '/TIME/{:a;N;5bb;11bb;ba;:b;s/.*TIME ([^<]*).*<strong>([^<]*).*/var\1=\2/p}' file
Use the integer associated with the TIME in the preceding code to differentiate the two variable names.

I can't figure out how to extract a string in bash

I am trying to make a bash script that will download a youtube page, see the latest video and find its url. I have the part to download the page except I can not figure out how to isolate the text with the url.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate a single part of an element.
The element is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.
With sed :
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To extract just the video href:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt | sort | uniq
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe the curl output straight into it:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' | sort | uniq
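An alternative sketch with grep -o and cut instead of sed, which also copes with several links on one line. The sample markup is invented and much shorter than the real page, and watch_links is a hypothetical helper name:

```shell
#!/bin/sh
# Invented sample standing in for YoutubePage.txt.
page='<a class="yt-uix-tile-link" title="First" href="/watch?v=q6TNODqcHWA">First</a>
<a class="thumb" href="/watch?v=q6TNODqcHWA"><img src="a.jpg"></a>
<a class="yt-uix-tile-link" href="/watch?v=ix4mTekl3MM">Second</a>
<a href="/user/h3h3Productions">Channel</a>'

# watch_links (hypothetical name) reads HTML on stdin and prints unique /watch hrefs.
watch_links() {
  grep -o 'href="/watch?[^"]*"' |  # in a basic regex, "?" is a literal question mark
    cut -d '"' -f 2 |              # keep the text between the double quotes
    sort -u                        # drop duplicates
}

links=$(printf '%s\n' "$page" | watch_links)
printf '%s\n' "$links"
```

Note that sort -u changes the order, so if "latest video first" matters, deduplicate with `awk '!seen[$0]++'` instead.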
You can use lynx, which is a terminal browser but has a -dump mode that outputs the parsed HTML as text, with the URLs extracted. This makes it easier to grep the URLs:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
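To try the sed expression without lynx installed, it can be run against text shaped like the reference list that lynx -dump appends. The sample below is made up, reusing the URLs from the output above:

```shell
#!/bin/sh
# Made-up excerpt in the style of the "References" list printed by lynx -dump.
dump='References

   1. https://www.youtube.com/
   2. https://www.youtube.com/watch?v=EBbLPnQ-CEw
   3. https://www.youtube.com/watch?v=2QOx7vmjV2E'

# Same expression as above: select lines containing /watch? and strip the index.
urls=$(printf '%s\n' "$dump" | sed -n '/\/watch?/s/^ *[0-9]*\. *//p')
printf '%s\n' "$urls"
```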

Looking for exact match using grep

Suppose that I have a file like this:
tst.txt
fName1 lName1-a 222
fname1 lName1-b 22
fName1 lName1 2
And I want to get the 3rd column only for "fName1 lName1", using this command:
var=`grep -i -w "fName1 lName1" tst.txt`
However, this returns every line that starts with "fName1 lName1". How can I look for the exact match?
Here you go:
#!/bin/bash
var=$(grep -Po '(?<=fName1 lName1 ).+' tst.txt)
echo "$var"
The trick is to use the o option of grep, which prints only the matched part of a line. The P option tells grep to interpret the pattern as a Perl-compatible regular expression, which is what makes the lookbehind (?<=...) available.
var=$(grep "fName1 lName1 " tst.txt |cut -d ' ' -f 3)
You can try this method:
grep -i -E "^fName1 lName1\s" tst.txt | cut -d ' ' -f3-
But you must be sure that the line starts with fName1 and that there is a space after lName1.
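Since the data is column-oriented, another option is to compare whole fields with awk instead of matching text with grep. This is a sketch using the sample rows from the question, and the comparison is case-sensitive:

```shell
#!/bin/sh
# Sample rows from the question; in practice awk would read tst.txt directly.
data='fName1 lName1-a 222
fname1 lName1-b 22
fName1 lName1 2'

# Compare fields 1 and 2 exactly, then print field 3.
var=$(printf '%s\n' "$data" | awk '$1 == "fName1" && $2 == "lName1" { print $3 }')
printf '%s\n' "$var"
```

Against the file, this becomes `awk '$1 == "fName1" && $2 == "lName1" { print $3 }' tst.txt`.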
