parse word from html file - bash

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.

Do not use regex to parse HTML.
Use an HTML parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"

This is a task for awk
I assume the file contains other lines as well, so matching on span id is a must.
echo '<span id="result">WORD</span>' | awk -F"[<>]" '/span id/ {print $3}'
WORD
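To see why $3 is the right field, it can help to print every field awk produces when both < and > act as separators. This is only a minimal sketch; show_fields is a hypothetical helper name, not part of the answer above.

```shell
# Print each field with its index, splitting on either < or >.
show_fields() {
    awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

printf '%s\n' '<span id="result">WORD</span>' | show_fields
# The text before the first < is an empty field 1, the tag is field 2,
# and the word itself lands in field 3.
```

If the line has leading text or more tags before the span, the field number shifts, which is one reason the parser-based answers are more robust.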

You can try
awk -f ext.awk input.html
where input.html is your input html file, and ext.awk is
{
line=line $0 RS
}
END {
match (line,/<span id="result">([^<]*)<\/span>/,a)
print a[1]
}
This will extract the contents even when they span line breaks. Note that the three-argument form of match() is a GNU awk (gawk) extension.
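Where gawk is not available, roughly the same idea can be sketched with the two-argument match(), which sets RSTART and RLENGTH. The helper name extract_result and the variables otag and ctag are made up for this illustration, and the sketch assumes there is no nested markup inside the span.

```shell
# Portable sketch: slurp the input, locate the span with match(),
# then cut the opening and closing tags off with substr().
extract_result() {
    awk '
        { text = text $0 "\n" }    # collect the whole input in one string
        END {
            otag = "<span id=\"result\">"
            ctag = "</span>"
            if (match(text, /<span id="result">[^<]*<\/span>/)) {
                inner_len = RLENGTH - length(otag) - length(ctag)
                print substr(text, RSTART + length(otag), inner_len)
            }
        }
    ' "$@"
}

printf '<p>\n<span id="result">WORD</span>\n</p>\n' | extract_result
```

Called with a file name (extract_result input.html) it reads that file instead of standard input.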

Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+' FILE
The expression in parentheses is a lookbehind assertion; it is not captured but serves as a test for the regex part that follows: the pattern matches only where that expression appears immediately before it, and the match itself is just \w+ here. Option -o outputs only the matched word; option -P enables Perl-compatible regular expressions, which provide lookahead and lookbehind assertions.
If you want to modify this regex, please note that with grep -P, a lookbehind assertion must have a fixed length.
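If the text before the word is not fixed in length (for example, the span carries extra attributes), a fixed-length lookbehind no longer fits. One possible workaround, assuming GNU grep built with PCRE support, is \K, which drops everything matched so far from the reported match. The function name result_word and the sample line are made up for illustration.

```shell
# Sketch: match the whole opening tag, then \K resets the start of the
# reported match so that only the word after the tag is printed.
result_word() {
    grep -Po '<span[^>]*\bid="result"[^>]*>\K\w+' "$@"
}

printf '%s\n' '<span class="x" id="result">WORD</span>' | result_word
```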

Related

Extract a string in linux shell script

Guys, I have a string like this:
variable='<partyRoleId>12345</partyRoleId>'
What I want is to extract the value, so the output is 12345.
Note the tag can be in any form:
<partyRoleId> or <ns1:partyRoleId>
Any idea how to get the tag value using grep or sed only?
Use an XML parser to extract the value:
echo "$variable" | xmllint -xpath '*/text()' -
You probably should use it for the whole XML document instead of extracting a single line from it into a variable, anyway.
To use only grep, you need a regex that finds the first closing bracket and then takes all the digits after it:
echo '<partyRoleId>12345</partyRoleId>'|grep -Po ">\K\d*"
-P enables PCRE (Perl-compatible regular expressions)
-o tells grep to show only the matched part
and the special \K tells grep to drop everything matched before it from the output.
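Since the question also allows sed, here is a sed-only sketch that tolerates an optional namespace prefix. It assumes one element per line and a value without nested tags; tag_value is a hypothetical helper name.

```shell
# Sketch: [^/>]* allows an optional prefix such as "ns1:" but cannot match
# the "/" of the closing tag, so only the opening tag is accepted.
# "|" is used as the s-command delimiter to avoid escaping slashes.
tag_value() {
    sed -n 's|.*<[^/>]*partyRoleId>\([^<]*\)<.*|\1|p'
}

printf '%s\n' '<partyRoleId>12345</partyRoleId>' | tag_value
printf '%s\n' '<ns1:partyRoleId>12345</ns1:partyRoleId>' | tag_value
```

Both calls print 12345. For a whole XML document, the parser-based answer remains the safer choice.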

Using sed or awk to select

I'm trying to select the lines between two markers in an HTML file. I've tried using sed and awk, but I think there's an issue with the way I'm escaping some of the characters. I have seen some similar questions and answers, but the examples given are simple, with no special characters. I need the lines between
<div class="bread crumb">
and
</div>
There is no other div within the block and there are multiple lines within the block.
Do I need to escape the characters <, > and ? as below?
sed -n -e '/^\<div class=\"bread crumb\"\>$/,/^\<\/div\>$/{ /^\<div class=\"bread crumb\">$/d; /^\<\/div>$/d; p; }'
My awk attempt :
awk '/\<div class=\"bread crumb\"\>/{flag=1;next}/\<\/div\>/{flag=0}flag'
Actually, you just need to escape the / in </div>; the rest is fine as it is.
sed -n '/<div class="bread crumb">/,/<\/div>/{//!p}'
You should use a html parser for that job.
If you still want to do it with sed, don't escape < and >: in GNU sed, \< and \> match word boundaries.
Try this:
sed -ne '/<div class="bread crumb">/,/<\/div>/{//!p;}' file
The //!p part outputs all the block except the lines matching the address patterns.
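A small end-to-end run may make the behaviour of //!p easier to see. The sample page below is invented for this sketch, and the variable names sample and crumbs are hypothetical.

```shell
# Build a small sample page in a temporary file (content is made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<body>
<div class="bread crumb">
<a href="/">Home</a>
<a href="/docs">Docs</a>
</div>
<p>other text</p>
</body>
EOF

# -n suppresses default output; inside the range, //!p prints every line
# that does not match the most recently used regex (the range addresses).
crumbs=$(sed -n '/<div class="bread crumb">/,/<\/div>/{//!p;}' "$sample")
printf '%s\n' "$crumbs"
rm -f "$sample"
```

Only the two anchor lines are printed; the opening and closing div lines are dropped.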
Just use string matches in awk:
awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1} ' file

How can I extract a word between a special character and other words

I am trying to find a way to extract a word that sits between a special character and other words.
Example of the text:
description "CST 500M TEST/VPNGW/11040 X {} // test"
description "test2-VPNGW-110642 -VPNGW"
I am trying to get a result like this, with only the word that includes VPNGW:
TEST/VPNGW/11040
test2-VPNGW-110642
I tried with grep and awk, but it looks like my knowledge is not good enough.
Printing with awk '{$1=""; $2=""; ... does not work because the word is not always in the same position.
Thanks for the help!
With grep you can output only the part of the string that matches the regex:
grep -o '[^ "]\+VPNGW[^ "]\+' file.name
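The same pattern can be tried against the two sample lines from the question. This sketch uses -E so that + needs no backslash (the \+ form in basic regexes is a GNU extension); vpngw_words is a hypothetical helper name.

```shell
# Sketch: -o prints only the matched part; the match must contain VPNGW
# with at least one non-space, non-quote character on each side.
vpngw_words() {
    grep -Eo '[^ "]+VPNGW[^ "]+' "$@"
}

printf '%s\n' \
    'description "CST 500M TEST/VPNGW/11040 X {} // test"' \
    'description "test2-VPNGW-110642 -VPNGW"' | vpngw_words
```

The trailing -VPNGW" on the second line is skipped because nothing but the closing quote follows VPNGW there.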
You could try something like:
grep -Eoi 'test.*[0-9]'
Of course this would be greedy and if there is another number after the ones in the required string it will grab up to there. Normally I would suggest an inverted test to stop at the thing you don't want:
grep -Eoi 'test[^ ]+'
The problem with this is like in your first example, there is more than one occurrence of the string 'test' and so the output for the first example is:
TEST/VPNGW/11040
test"
Of course, knowing what your real data looks like, you can make your own decision on what suits best.
You could go with the Perl regex engine in grep and use a look-ahead:
grep -Poi 'test[^ ]+(?= )'
Again though, if you have the string 'test' somewhere else on the line followed by a single space, this will still not work as desired.
Lastly, awk can do the job but you would need to cycle through each item or set RS to white space:
Option 1:
awk '{for(i=1;i<=NF;i++)if(tolower($i) ~ /test.*[0-9]/)print $i}'
Option 2:
awk 'tolower($0) ~ /test.*[0-9]/' RS="[[:space:]]+"
awk '/test2/{sub(/"/,""); print $2; next} {print $4}' file
TEST/VPNGW/11040
test2-VPNGW-110642
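The field loop from Option 1 can also be keyed on the VPNGW marker instead of on fixed positions. This is a sketch under the assumption that the wanted word always has at least one character on each side of VPNGW; vpngw_fields is a hypothetical helper name.

```shell
# Sketch: walk over every whitespace-separated field, strip double quotes,
# and print the fields that have VPNGW somewhere in the middle.
vpngw_fields() {
    awk '{
        for (i = 1; i <= NF; i++) {
            word = $i
            gsub(/"/, "", word)    # drop any double quotes
            if (word ~ /.VPNGW./) print word
        }
    }' "$@"
}

printf '%s\n' \
    'description "CST 500M TEST/VPNGW/11040 X {} // test"' \
    'description "test2-VPNGW-110642 -VPNGW"' | vpngw_fields
```

Because every field is checked, the position of the word on the line no longer matters.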

Using BASH sed command to strip a line

I have one line in an HTML file, which I located using
grep -m 1 'argument'
That line looks a lot like this
<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>
I need to extract Text1 and Text2 on separate lines. What do I do? I get that I need to use sed. I have removed tag and span at the beginning, leaving me with
Text1 </span> - <span...........</tag>
but I need only Text1, and I really don't know how to remove the non-static Text2.
If the lines always look exactly like the example you provided, you can do it with a regexp.
But in all other cases, you should really use an XML parser instead (for example, in Perl: XML::Twig, or others).
So here is a regexp, but you've been warned ^^
#replace each <...> with "|", so you can easily separate the fields
sed -e 's/<[^>]*>/|/g'
You can then fetch each section by using that new "simple" separator, |:
grep 'argument' | sed -e 's/<[^>]*>/|/g' | awk -F'|' '{print $3}' #shows Text1. Change $3 to $5 to fetch the Text2
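To check which field numbers hold the text, the pipeline can be run on the sample line with every non-empty field printed next to its index. This is a sketch only; split_tags is a hypothetical helper name, and the trimming of spaces is an addition that is not in the answer above.

```shell
line='<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>'

# Sketch: turn every tag into "|", split on "|", trim spaces around each
# field, and print the non-empty fields together with their field number.
split_tags() {
    sed -e 's/<[^>]*>/|/g' |
        awk -F'|' '{
            for (i = 1; i <= NF; i++) {
                gsub(/^ +| +$/, "", $i)
                if ($i != "") printf "%d: %s\n", i, $i
            }
        }'
}

printf '%s\n' "$line" | split_tags
```

Fields 1 and 2 are empty because the line starts with two adjacent tags, which is why the text shows up in fields 3 and 5.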
Here is a much shorter way to do that using grep and Perl regular expressions.
$ cat testfile # I've placed your line in this file
<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>
$ grep -Po '<span[^>]*>\K[^<]*' testfile
Text1
Text2
But if you want to get exactly Text1 and Text2 you need a bit more complicated regexp. And here it is:
$ grep -Po '<span[^>]*>( )?\s*\K.*?(?=\s*( )?</span>)' testfile
Text1
Text2
Some explanation:
This regex is using lookaround syntax or zero-width assertions. You can read about it here
\K might be unfamiliar too. It is very similar to zero-width assertions and is explained very well here. Here is a quote from that link:
There is a special form of this construct, called \K , which causes the regex engine to "keep" everything it had matched prior to the \K and not include it in $& . This effectively provides variable-length look-behind. The use of \K inside of another look-around assertion is allowed, but the behaviour is currently not well defined.
Ok, but why?
One of the greatest things about this approach is that, since you're already using grep, you can combine this regex with your search pattern so that you only need one grep command, unlike Oliver's answer, which uses grep, sed and awk.
But still, there are special tools to parse xml, please use them instead of this regex porn.

SED bash script Assistance

I'm trying to follow everyone my friend is following (all 1,522 of them),
and I got a text file from his Twitter page. I want to see just the last word of a line when that word begins with #.
Example:
Podcaster, broadcaster and tech pundit. The Tech Guy on the Premiere Radio
Networks. Live at live.twit.tv For my link feed follow #links_for_twit
(Line-wrapped to remove hateful horizontal scrollbar.)
I want that to turn into #links_for_twit.
Use awk instead:
awk '$NF ~ /^#/ {print $NF}'
You mean, like:
grep -o '#[a-zA-Z_0-9]*$' tweets.txt
?
If you're wanting to use sed, try this:
sed -n 's/.*\(#.*\)/\1/p'
-n: don't print anything unless asked
s/.*\(#.*\): capture everything after the last '#' in the line
/\1/: replace the whole line with the captured bit
p: print if a substitution was made
Hope that helps
EDIT: I just saw the complaint below about email addresses. You can add \s just before the # to ensure there's a space: sed -n 's/.*\s\(#.*\)/\1/p'
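A stricter variant of the sed approach can be sketched with POSIX character classes instead of \s, so it does not depend on GNU extensions. It assumes the tag is the last word on the line and is preceded by whitespace; last_hashtag is a hypothetical helper name, and the third sample line is invented.

```shell
# Sketch: keep only a final word that starts with # and follows whitespace.
# Lines without such a word print nothing because of -n and the p flag.
last_hashtag() {
    sed -n 's/.*[[:space:]]\(#[^[:space:]]*\)[[:space:]]*$/\1/p'
}

printf '%s\n' \
    'Podcaster, broadcaster and tech pundit. The Tech Guy on the Premiere Radio' \
    'Networks. Live at live.twit.tv For my link feed follow #links_for_twit' \
    'contact user#example.com for details' | last_hashtag
```

Only #links_for_twit is printed; the # inside user#example.com is ignored because it is neither preceded by whitespace nor part of the last word.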
If you have GNU grep, you could use a Perl-flavoured regex to ensure the # is at the start of a word:
grep -Po '(?<=^|\s)#\w+' filename

Resources