How to use sed to extract the specific substring?


<div class="panel-body" id="current-conditions-body">
<!-- Graphic and temperatures -->
<div id="current_conditions-summary" class="pull-left" >
<img src="newimages/large/sct.png" alt="" class="pull-left" />
<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>
I am trying to extract the "64" in line 6. I was thinking of using awk '/<p class="myforecast-current-lrg">/{print}', but this only gave me the full line. I think I need to use sed, but I don't know how to use it.

Assumptions:
input is nicely formatted as per the sample provided by OP so we can use some 'simple' pattern matching
Modifying OP's current awk code:
# use split() function to break line using dual delimiters ">" and "&"; print 2nd array entry
awk '/<p class="myforecast-current-lrg">/{ n=split($0,arr,"[>&]");print arr[2]}'
# define dual input field delimiter as ">" and "&"; print 2nd field in line that matches search string
awk -F'[>&]' ' /<p class="myforecast-current-lrg">/{print $2}'
Both of these generate:
64
One sed idea:
sed -En 's/.*<p class="myforecast-current-lrg">([^&]+)&deg.*/\1/p'
This generates:
64
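To try the sed command without fetching the page, here is a minimal sketch that feeds it the sample lines from the question. The variable names html and temp_f are illustrative, and the input is assumed to contain the literal &deg; entity that the pattern looks for.

```shell
# Sample input; assumed to contain the literal "&deg;" entity.
html='<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>'

# -n suppresses the default output; the trailing p prints only
# the lines on which the substitution succeeded.
temp_f=$(printf '%s\n' "$html" |
  sed -En 's/.*<p class="myforecast-current-lrg">([^&]+)&deg.*/\1/p')

echo "$temp_f"
```

Capturing the result with $(...) makes it easy to reuse the value in a script instead of only printing it.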

Related

Awk to get the attribute value from XML file

For getting the attribute value from the below mentioned xml for attribute code from tag c
random.xml
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
Currently the logic is:
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How does the above logic work to get the values of code from tag c?
Getting the expected output:
abc
efg

Firstly observe that
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
is of dubious quality, as
egrep does not require standard input; it can read the file itself, so this is a useless use of cat
the pattern is simple and works equally well in plain grep, so there is no need to summon extended grep; this usage is overkill
1 is set as the field separator in awk, but the code makes no use of fields
After fixing these issues, the code looks like this:
grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How it works: grep selects lines that contain <c followed by zero or more characters followed by />. awk is then told that records are separated by double quotes ("). When a record contains code=, the variable f is set to that record's number. A record is printed when f is non-zero and equals the current record number minus one, that is, the record directly after the one containing code=.
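To see the record splitting in action, it can help to print the record number next to each record. The sketch below uses a single line from random.xml as input (an assumption made to keep the output short); the variable names line and value are illustrative.

```shell
line='<c id="123" code="abc" date="12-12-2022"/>'

# With RS set to a double quote, every quote ends a record.
# Print each record with its number to see how the line is split.
printf '%s\n' "$line" | awk '{ printf "%d: [%s]\n", NR, $0 }' RS='"'

# The filter from the answer: remember the number of the record
# containing code=, then print the record that follows it.
value=$(printf '%s\n' "$line" | awk '/code=/ {f=NR} f&&NR-1==f' RS='"')

echo "$value"
```

The record containing code= is the text just before the opening quote of the attribute value, so the next record is the value itself.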
Note that GNU AWK is poorly suited to working with XML, and using regular expressions against XML is a very poor idea, as XML is not a regular (Chomsky Type 3) language.
If possible, use proper tools for working with XML data. For example, hxselect might be used in the following way. Let the content of file.xml be
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
then
hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml
gives output
abc
efg
Explanation: -c prints just the value rather than the name and value; -s '\n' separates matches with a newline, so each value is on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. This solution is more robust than the cat-egrep-awk pipeline because it is immune to, for example, different whitespace usage in the file (whitespace outside tags in XML is optional).

This might be an awk question but parsing XML should be done with XML tools.
Here's an example with Xidel (available for several operating systems) and a standard XPath expression:
xidel --xpath '//c[@code]/@code' random.xml
Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.
Output
abc
efg

If your input always looks like the sample XML, then you can make the code attribute itself a field separator and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:
awk -F' .*code="|" ' -vRS='<' '$1=="c"{print $2}'
Demo: https://awk.js.org/?snippet=Lz6yx7
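The one-liner above reads standard input, so it can also be tried locally. The following is a small sketch that pipes the sample XML into it; the variable names xml and codes are illustrative, and the input is assumed to be formatted exactly like the sample.

```shell
xml='<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>'

# RS='<' starts a new record at every tag. The field separator is either
# everything from the first space through code=" or a quote followed by a
# space, so $1 is the bare tag name only when a code attribute is present.
codes=$(printf '%s\n' "$xml" |
  awk -F' .*code="|" ' -v RS='<' '$1=="c"{print $2}')

echo "$codes"
```

The third c element has no code attribute, so its first field is not exactly c and it is skipped.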

How to replace any text between html tags

I have text between HTML tags. For example:
<td>vip</td>
There can be any text between the <td></td> tags.
How can I cut the text out of these tags and put other text between them?
I need to do it via bash/shell.
How can I do this?
First of all, I tried to get this text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt
As a result I get
<td>vip</td>
but according to the documentation, I should get only vip.

You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that it only works for <td> tags. It replaces everything from the opening <td> through the closing </td>, tags included, and writes the tags back with TEXT_TO_REPLACE_BY between them.
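Because .* is greedy, the command above treats everything from the first <td> to the last </td> on a line as a single match. A hedged variant, assuming cells contain no nested tags, uses [^<]* so that each cell is replaced separately. The sample row and the replacement text NEW are made up for illustration.

```shell
row='<tr><td>vip</td><td>guest</td></tr>'

# Greedy .* spans from the first <td> to the last </td> on the line.
greedy=$(printf '%s\n' "$row" | sed 's/<td>.*<\/td>/<td>NEW<\/td>/g')

# [^<]* stops at the next "<", so each cell is handled on its own.
per_cell=$(printf '%s\n' "$row" | sed 's/<td>[^<]*<\/td>/<td>NEW<\/td>/g')

echo "$greedy"
echo "$per_cell"
```

With one cell per line, as in the question, both forms behave the same; the difference only shows up when several cells share a line.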

You can use this to get the value vip
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'

If your Input_file is the same as the example shown, then the following may help you too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This prints the tag with echo, uses awk with either > or < as the field separator, and prints the 3rd field, which is the text you asked for.

d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>

Sed backreference not working properly

Test Data:
<img src=\"images/docs/mydash_grooms.png\" alt=\"\" />
Sed:
sed 's/<img\ssrc=\\"images\/docs\/\([[:graph:]]\)/<a class=\\"popup-image\\" href=\\"images\/docs\/\1\\"><img src=\\"images\/docs\/tn.\1/g' test.txt
Output from Sed:
<a class=\"popup-image\" href=\"images/docs/m\"><img src=\"images/docs/tn.mydash_grooms.png\" alt=\"\" />
Why is my backreference not working properly both times used?
Trying to accomplish:
Changing:
<img src=\"images/docs/mydash_grooms.png\" alt=\"\" />
to
<a class=\"popup-image\" href=\"images/docs/mydash_grooms.png\"><img src=\"images/docs/tn.mydash_grooms.png\" alt=\"\" />

You have to escape each \ so that it becomes "\\". You would also have to escape every /, which makes the expression hard to read. I suggest changing the sed delimiter from / to another character to keep the expression simple. For example, using #:
sed 's#<img src=\\"images/docs/\(.*\)\\" alt=\\"\\" />#<a class=\\"popup-image\\" href=\\"images/docs/\1\\"><img src=\\"images/docs/tn.\1\\" alt=\\"\\" />#g' test.txt
Furthermore, please replace the [[:graph:]]; it did not work for me.
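One likely reason the backreference in the question held only m: a bracket expression without a quantifier matches exactly one character. The sketch below contrasts that with a quantified class that stops at the next backslash or quote. The variable names are illustrative, and the [^\\"]* pattern is a suggestion, not part of the original answers.

```shell
line='<img src=\"images/docs/mydash_grooms.png\" alt=\"\" />'

# A single bracket expression with no quantifier captures one character.
one=$(printf '%s\n' "$line" | sed 's#.*images/docs/\([[:graph:]]\).*#\1#')

# [^\\"]* captures everything up to the next backslash or double quote,
# which here is the whole file name.
name=$(printf '%s\n' "$line" | sed 's#.*images/docs/\([^\\"]*\).*#\1#')

echo "$one"
echo "$name"
```

Simply adding * to [[:graph:]] would capture too much, because that class also matches backslashes and quotes.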

This might work for you (GNU sed):
sed -r 'h;s|img src(.*) alt.*|a class=\\"popup-image\\" href\1>|;G;s/\n//;s|(.*/)([^>])|\1tn.\2|' file
Save the line in the hold space then alter the line to replicate the first attribute. Append the original line and insert the tn. into the file name.

Fetching all those text that match a pattern using shell

I have an HTML file that I have to read line by line. I need to run a script that matches a class attribute of a span tag and then returns the text enclosed by the span and the line number on which it appears.
Following is a single line of my .html file:
<span id="L9_454" class="e"><span class="ln">454</span><span class="bar"></span> <span class="k">if</span> ( (strncmp(<span class="fm" value="2705">p_rout</span>-><span class="fm" value="186">source_corresp</span>.<span class="fm" value="105">name</span>, <span class="fm" value="5190">IL_LOWERING_INIT_ROUTINE_PREFIX</span>, strlen(<span class="fm" value="5190">IL_LOWERING_INIT_ROUTINE_PREFIX</span>)) == 0) </span>
I need to run the script on every line and check whether class="fm" is set on any span tag. If so, I need to dump the line number (454 in the example above) and the text of each span with class="fm" (p_rout, source_corresp, name, IL_LOWERING_INIT_ROUTINE_PREFIX and IL_LOWERING_INIT_ROUTINE_PREFIX) into an .xml file.
I know how to dump the data, but I don't know how to get the required text. I tried awk but couldn't work out which regex to match. Any other filter would also work. Please help.

awk '$1 ~ /fm/ {print $2}' RS=span FS='[<>]'
set Record Separator to span
set Field Separator to < or >
if field one contains fm print field two
Result
p_rout
source_corresp
name
IL_LOWERING_INIT_ROUTINE_PREFIX
IL_LOWERING_INIT_ROUTINE_PREFIX
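The question also asks for the line number. As a rough sketch under the same assumptions as the answer (one source line per HTML line, and an awk that accepts a multi-character RS, such as gawk), the value of the class="ln" span can be remembered and printed with each match. The shortened sample line and the output format are made up for illustration.

```shell
html='<span id="L9_454" class="e"><span class="ln">454</span><span class="bar"></span> <span class="k">if</span> (<span class="fm" value="2705">p_rout</span>-><span class="fm" value="186">source_corresp</span>)</span>'

result=$(printf '%s\n' "$html" | awk '
  # Remember the number held by the "ln" span.
  $1 ~ /class="ln"/ { ln = $2 }
  # Print that number with the text of every "fm" span.
  $1 ~ /class="fm"/ { print ln ": " $2 }
' RS=span FS='[<>]')

echo "$result"
```

Matching on class="fm" rather than just fm reduces the chance of a false hit on some other attribute that happens to contain those letters.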

Print text between two strings on the same line

I've been searching for a long time, and have not been able to find a working answer for my problem.
I have a line from an HTML file extracted with sed '162!d' skinlist.html, which contains the text
<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard">.
I want to extract the text Dwarf Red Beard, but that text is modular (can be changed), so I would like to extract the text between title=" and ".
I cannot, for the life of me, figure out how to do this.

awk 'NR==162 {print $4}' FS='"' skinlist.html
set field separator to "
print only line 162
print field 4

Solution in sed
sed -n '162 s/^.*title="\(.*\)".*$/\1/p' skinlist.html
Extracts line 162 in skinlist.html and captures the title attribute's contents in \1.

The shell's variable expansion syntax allows you to trim prefixes and suffixes from a string:
line="$(sed '162!d' skinlist.html)" # extract the relevant line from the file
temp="${line#* title=\"}" # remove from the beginning through the first match of ' title="'
if [ "$temp" = "$line" ]; then
echo "title not found in '$line'" >&2
else
title="${temp%%\"*}" # remove from the first '"' through the end
fi
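The two expansions can be tried on their own with the line from the question held in a variable. This is a minimal sketch that skips the file read and the not-found check:

```shell
line='<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard">'

temp="${line#* title=\"}"   # drop everything through the first ' title="'
title="${temp%%\"*}"        # drop everything from the first remaining '"'

echo "$title"
```

Both expansions are plain POSIX shell features, so no external process is needed once the line is in a variable.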

You can pass it through another sed or add expressions to that sed like -e 's/.*title="//g' -e 's/">.*$//g'

also sed
sed -n '162 s/.*"\([a-zA-Z ]*\)"./\1/p' skinlist.html
