extract text beetwen two words and in a specific line - bash

I'm trying to make a linux bash script to download an html page, extract numbers from this html page and assign them to a variable.
the html page has several lines but I'm interested in these :
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>
</table></td>
Every time I download the page I have to read the two values ​​in row 5 and 11 between strong> and </strong (61 ad 65 in this example; 61 and 65 in this example, but each time they are different)
The two values ​​extracted from html must be able to assign them to two variables
Thanks for any idea

Let's assume we a page called page.html. You can firstly select the line with grep, then extract the value with sed and finally select values iteratively with awk:
$ var0=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==1')
$ var1=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==0')
output:
$ echo $var0
61
$ echo $var1
65

This might work for you (GNU sed):
sed -rn '/TIME/{:a;N;5bb;11bb;ba;:b;s/.*TIME ([^<]*).*<strong>([^<]*).*/var\1=\2/p}' file
Use the integer associated with the TIME in the preceding code to differentiate the two variable names.

Related

How to replace any text between html tags

i have text between html tags. For example:
<td>vip</td>
I will have any text between tags <td></td>
How can i cut any text from these tags and put any text between these tags.
I need to do it via bash/shell.
How can i do this ?
First of all, i tried to get this text, but without success
sed -n "/<td>/,/<\/td>/p" test.txt. But in a result i have
<td>vip</td>. but according to documentation, i should get only vip
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that it will only work for the <td> tags. It will replace everything between tags <td> (actually with them together and put the tags back) with TEXT_TO_REPLACE_BY.
You can use this to get the value vip
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
If you Input_file is same as shown example then following may help you too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
Simply printing the tag with echo then using awk to create a field separator >< then printing the 3rd field then which is your request.
d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>

Unterminated address regex - misapplying escape characters in bash sed script?

Just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take bunch of <tr>...</tr>s in an html table and appended it to the single table in another page. So I want to take the initial file, strip everything above the first time I use <tr> and everything from </table> on down, then insert it just above the </table> in the other file. So like below, except <tr> and </tr> are on their own lines, if it matters.
Input File: Target File:
<html><body> <html><body>
<p>Whatever...</p> <p>Other whatever...</p>
<table> <table>
<tr><td>4</td></tr> <thead>
<tr><td>5</td></tr> <tr><th>#</th></tr>
<tr><td>6</td></tr> </thead>
</table> <tbody>
</body></html> <tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes:
Input file Target File:
doesn't matter. <html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
#$1 is the first parameter and $2 is the second parameter being passed when calling the script. The variable filename will be used to refer to this.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?
Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This line will remove all lines containing <tr> in the block ranging from the first line matching <table> to the first line matching </table>. All those freshly cut lines are printed in the standard output.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command will add the lines read from stdin after the line matching </tbody>. Then we move the </tbody> by appending it after the new lines and removing the old one.
Another trick used here is to replace the default regex delimiter / by :, so that we can use '/' in our matching pattern.
Final sotuion:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!

Parse HTML using shell

I have a HTML with lots of data and part I am interested in:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I try to use awk which now is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"
but what I want is to have:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?
awk is not an HTML parser. Use xpath or even xslt for that. xmllint is a commandline tool which is able to execute XPath queries and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.
Also you can use a programming language which is able to parse HTML
awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
while (getline > 0 && /<td /) {
gsub(/<b>/, ""); sub(/ .*/, "", $3)
print $3
}
exit
}' file
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
You really should to use some real HTML parser for this job, like:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
But for this you need to have perl, and installed Mojolicious package.
(it is easy to install with:)
curl -L get.mojolicio.us | sh
BSD/GNU grep/ripgrep
For simple extracting, you can use grep, for example:
Your example using grep:
$ egrep -o "[0-9][^<]\?\+" file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail +2
54
1
0 (0/0)
0
Extracting outer html of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep which has similar syntax, but it's a way faster since it's written in Rust.
HTML-XML-utils
You may use htmlutils for parsing well-formatted HTML/XML files. The package includes a lot of binary tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example with stripping out <b> tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check the html-xml-utils.
I was recently pointed to pup, which in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0
With xidel, a true HTML parser, and XPath:
$ xidel -s "input.html" -e '//td[#align="right"]'
54
1
0 (0/0)
0
$ xidel -s "input.html" -e '//td[#align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[#align="right"]/extract(.,"\d+")'
54
1
0
0
ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi where you can jump between matching HTML tags, selecting/deleting inner/outer tags, and edit the content in-place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use ex in-place editor to substitute on all lines (%) by: ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
Select from the beginning of line till > (^[^>].*>) for removal, right before the 2nd part.
Select our main part till < (([^<]+)).
Select everything else after < for removal (<.*).
We replace the whole matching line with \1 which refers to pattern inside the brackets (()).
After substitution, we remove any alphanumeric lines by using global: g/[a-zA-Z]/d.
Finally, print the current buffer on the screen by +%p.
Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
When tested, to replace file in-place, change -scq! to -scwq.
Here is another simple example which removes style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, it's not advised to use regex for parsing your html, therefore for long-term approach you should use the appropriate language (such as Python, perl or PHP DOM).
See also:
How to parse hundred HTML source code files in shell?
Extract data from HTML table in shell script?
What about:
lynx -dump index.html

extract pattern with text editors

I have a URL source page like:
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
<td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>
I want use text editors like sed or awk in order to extract exactly pages that have .bz2 at the end of them...
like:
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Could you help me?
Sed and grep:
sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'
$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Use a proper parser. For example, using xsh:
open :F html input.html ;
for //a/#href['bz2' = xsh:matches(., '\.bz2$')]
echo (.) ;

How to use sed to insert in a line before the matching pattern

I am dealing with some html code and i got stucked in some problem. Here is the extract of some code and the format is exactly the same
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
Here i have to match with
<tr>
<td nowrap valign="top"
and insert something before <tr> .the problem occurs as i have to match a pattern in different lines.
i have tried
grep -c "<tr>\n<td nowrap valign="top"" test.html
grep -c "<tr>\n*<td nowrap valign="top"" test.html
grep -c "<tr>*<td nowrap valign="top"" test.html
to test but none of them works.So i have two dimension to figure out the problem:
Match <td nowrap valign="top" and insert in the line above
Match whole string
<tr>
<td nowrap valign="top"
Would anyone suggest a way to doing it in either way?
Using sed you can perfom replacement on multiple lines. Its also easy to substitute the match.
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }"
This cryptic line basically say:
match a line with (/\s*<tr>\s*/)
continue on next line (N)
substitute the matched pattern whit the insertion and the matched string, where & represent the matched string (s/.*<tr>\n\s*td.*/insertion\n&/)
Sed is very powerful to perform substitution, its a nice to know tool. See this manual if you want to learn more about sed:
http://www.grymoire.com/Unix/Sed.html
Try grep -P "tr>\s*\n\s*<td".
It's not clear how it will help you to insert something before <tr>, but anyway.
Quoted strings do not nest, you need to escape the quote characters, or use single quotes instead of double quotes.

Resources