I'm trying to make a Linux bash script to download an HTML page, extract numbers from it, and assign them to a variable.
The HTML page has many lines, but I'm interested in these:
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>
</table></td>
Every time I download the page I have to read the two values on lines 5 and 11, between <strong> and </strong> (61 and 65 in this example, but they are different each time).
I need to assign the two values extracted from the HTML to two variables.
Thanks for any idea.
Let's assume we have a page called page.html. You can first select the lines with grep, then extract the values with sed, and finally pick them alternately with awk:
$ var0=$(grep -Ee "<strong>[0-9]+</strong>" -o page.html |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==1')
$ var1=$(grep -Ee "<strong>[0-9]+</strong>" -o page.html |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==0')
output:
$ echo $var0
61
$ echo $var1
65
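The same extraction can also be done in a single pass. The sketch below assumes GNU grep (for `-P`) and bash 4+ (for `mapfile`); the here-document just recreates the relevant lines of page.html so the example is self-contained:

```shell
# Recreate the relevant part of page.html for a self-contained demo
cat > page.html <<'EOF'
<td width="180"><strong> TIME 1</strong></td>
<div align="right"><strong>61</strong></div></td>
<td width="180"><strong> TIME 2</strong></td>
<div align="right"><strong>65</strong></div></td>
EOF

# Lookbehind/lookahead keep only the digits that sit directly
# between <strong> and </strong>; mapfile reads them into an array
mapfile -t vals < <(grep -oP '(?<=<strong>)[0-9]+(?=</strong>)' page.html)
var0=${vals[0]}
var1=${vals[1]}
echo "$var0 $var1"    # -> 61 65
```

Note that the " TIME 1" cells do not match, because their digits are not immediately preceded by <strong>.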
This might work for you (GNU sed):
sed -rn '/TIME/{:a;N;5bb;11bb;ba;:b;s/.*TIME ([^<]*).*<strong>([^<]*).*/var\1=\2/p}' file
The integer following TIME in the input is used to differentiate the two variable names.
Related
I have text between HTML tags. For example:
<td>vip</td>
There can be any text between the <td></td> tags.
How can I cut the text from these tags and put other text between them?
I need to do it via bash/shell.
How can I do this?
First of all, I tried to get this text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt. But as a result I get
<td>vip</td>, although according to the documentation I should get only vip.
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that it will only work for <td> tags. It replaces everything between the <td> tags (actually the tags together with their content, putting the tags back) with TEXT_TO_REPLACE_BY.
You can use this to get the value vip
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
If your Input_file is the same as the example shown, then the following may help you too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This simply prints the tag with echo, then uses awk with > and < as field separators and prints the 3rd field, which is what you asked for.
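To see why it is the 3rd field, you can print every field the separator produces (a quick sketch; every occurrence of > or < is a field boundary, so the first field is the empty string before the leading <):

```shell
# Visualize how -F'[><]' splits the input into fields
echo "<td>vip</td>" | awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
```

This prints five fields: an empty one, td, vip, /td, and a final empty one.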
d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>
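Note that the three-argument form of match() used above is a gawk extension. If you only have a POSIX awk/sed, the same per-tag replacement can be sketched with plain sed (assuming the tag and its closing tag sit on one line):

```shell
# Portable alternative: replace the content of <td>...</td> only,
# leaving other tags such as <table> untouched
printf '%s\n' '<td>vip</td>' '<table>vip</table>' '<td>more data here</td>' |
  sed 's,\(<td>\)[^<]*\(</td>\),\1something\2,'
```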
Just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take a bunch of <tr>...</tr> rows in an HTML table and append them to the single table in another page. So I want to take the initial file, strip everything above the first <tr> and everything from </table> on down, then insert the result just above the </table> in the other file. So like below, except <tr> and </tr> are on their own lines, if it matters.
Input File:
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes:
Input file: doesn't matter.
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
# $1 and $2 are the parameters passed when calling the script.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?
Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This line selects, within the block ranging from the first line matching <table> to the first line matching </table>, all lines containing <tr>. All those freshly cut lines are printed on standard output.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command will add the lines read from stdin after the line matching </tbody>. Then we move the </tbody> by appending it after the new lines and removing the old one.
Another trick used here is to replace the default regex delimiter / by :, so that we can use '/' in our matching pattern.
Final solution:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!
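Putting it all together, here is an end-to-end sketch on the sample files from the question (GNU sed assumed, since it uses -i and relies on the queued r and a output appearing in command order):

```shell
#!/bin/bash
# Recreate the two sample files from the question
cat > input.html <<'EOF'
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
EOF
cat > target.html <<'EOF'
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
EOF

input=input.html
inserttarget=target.html

# Cut the <tr> rows from the input, feed them to the inserting sed via stdin
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' "${inserttarget}" < <(sed -n '\:<table>:,\:</table>:p' "${input}" | sed -n '\:<tr>:p')

cat target.html
```

After running, target.html contains rows 1-6 inside a single <tbody>.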
I have an HTML file with lots of data; the part I am interested in is:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I tried to use awk; my current attempt is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"
but what I want is to have:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?
awk is not an HTML parser. Use xpath or even xslt for that. xmllint is a commandline tool which is able to execute XPath queries and xsltproc can be used to perform XSL transformations. Both tools belong to the package libxml2-utils.
Also you can use a programming language which is able to parse HTML
awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
while (getline > 0 && /<td /) {
gsub(/<b>/, ""); sub(/ .*/, "", $3)
print $3
}
exit
}' file
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
You really should use some real HTML parser for this job, like:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
But for this you need to have perl and the Mojolicious package installed.
(it is easy to install with:)
curl -L get.mojolicio.us | sh
BSD/GNU grep/ripgrep
For simple extracting, you can use grep, for example:
Your example using grep:
$ egrep -o "[0-9][^<]*" file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail +2
54
1
0 (0/0)
0
Extracting outer html of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is way faster since it's written in Rust.
HTML-XML-utils
You may use HTML-XML-utils for parsing well-formed HTML/XML files. The package includes a number of command-line tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example with stripping out <b> tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check the html-xml-utils.
I was recently pointed to pup, which in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0
With xidel, a true HTML parser, and XPath:
$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0
$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0
ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi where you can jump between matching HTML tags, selecting/deleting inner/outer tags, and edit the content in-place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use ex in-place editor to substitute on all lines (%) by: ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
Select from the beginning of line till > (^[^>].*>) for removal, right before the 2nd part.
Select our main part till < (([^<]+)).
Select everything else after < for removal (<.*).
We replace the whole matching line with \1 which refers to pattern inside the brackets (()).
After substitution, we remove any alphanumeric lines by using global: g/[a-zA-Z]/d.
Finally, print the current buffer on the screen by +%p.
Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
Once tested, to replace the file in-place, change -scq! to -scwq.
Here is another simple example which removes style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, it's not advised to use regex for parsing your html, therefore for long-term approach you should use the appropriate language (such as Python, perl or PHP DOM).
See also:
How to parse hundred HTML source code files in shell?
Extract data from HTML table in shell script?
What about:
lynx -dump index.html
I have a URL source page like:
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
<td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>
I want to use text tools like sed or awk to extract exactly the URLs that end in .bz2...
like:
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Could you help me?
Sed and grep:
sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'
$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
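An alternative sketch using GNU grep's PCRE mode: \K discards the href=" prefix from the reported match, and the lookahead requires the closing quote right after .bz2. The here-document recreates part of the sample source so the example is self-contained:

```shell
cat > source.html <<'EOF'
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
EOF
# Keep only href values whose URL ends in .bz2
grep -oP 'href="\K[^"]*\.bz2(?=")' source.html
```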
Use a proper parser. For example, using xsh:
open :F html input.html ;
for //a/@href['bz2' = xsh:matches(., '\.bz2$')]
echo (.) ;
I am dealing with some HTML code and I got stuck on a problem. Here is an extract of the code, and the format is exactly the same:
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
Here i have to match with
<tr>
<td nowrap valign="top"
and insert something before <tr>. The problem occurs because I have to match a pattern spanning different lines.
I have tried
grep -c "<tr>\n<td nowrap valign="top"" test.html
grep -c "<tr>\n*<td nowrap valign="top"" test.html
grep -c "<tr>*<td nowrap valign="top"" test.html
to test, but none of them works. So I have two possible directions for solving the problem:
Match <td nowrap valign="top" and insert on the line above
Match the whole string
<tr>
<td nowrap valign="top"
Could anyone suggest a way of doing it either way?
Using sed you can perform replacements on multiple lines. It's also easy to substitute the match.
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }"
This cryptic line basically says:
match a line containing <tr> (/\s*<tr>\s*/)
continue on next line (N)
substitute the matched pattern with the insertion followed by the matched string, where & represents the matched string (s/.*<tr>\n\s*<td.*/insertion\n&/)
Sed is very powerful for performing substitutions; it's a nice tool to know. See this manual if you want to learn more about sed:
http://www.grymoire.com/Unix/Sed.html
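Here is the command run against the sample from the question (GNU sed assumed, for the \s and \n escapes):

```shell
# Recreate the sample snippet from the question
cat > test.html <<'EOF'
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
EOF
# N pulls the <td ...> line into the pattern space alongside <tr>,
# then the substitution prepends "insertion" before the pair
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }" test.html
```

The output starts with the inserted line, followed by the unchanged <tr> block.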
Try grep -P "tr>\s*\n\s*<td".
It's not clear how that will help you insert something before <tr>, but anyway.
Quoted strings do not nest, you need to escape the quote characters, or use single quotes instead of double quotes.