Parse HTML snippet with awk

I am trying to parse an HTML document with awk.
The document contains several <div class="p_header_bottom"></div> blocks:
<div class="p_header_bottom">
<span class="fl_r"></span>
287,489 people
</div>
<div class="p_header_bottom">
<span class="fl_r"></span>
5 links
</div>
I am using
awk '/<div class="p_header_bottom">/,/<\/div>/'
to get all such divs.
How can I get the number 287,489 from the first one?
awk '/<\/span>/,/people/' doesn't work correctly.

With gawk, and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:
awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
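If a regex record separator is not available, a flag-based variant may also work in a plain POSIX awk. This is only a sketch: it assumes the number is the first field on a line ending in " people" inside a p_header_bottom block, and the variable names html, inside and people are made up for the illustration.

```shell
# Sample input copied from the question.
html='<div class="p_header_bottom">
<span class="fl_r"></span>
287,489 people
</div>
<div class="p_header_bottom">
<span class="fl_r"></span>
5 links
</div>'

# Track whether we are inside a p_header_bottom block and print the
# first field of the line that ends in " people".
people=$(printf '%s\n' "$html" |
  awk '/<div class="p_header_bottom">/ { inside = 1 }
       inside && / people$/            { print $1 }
       /<\/div>/                       { inside = 0 }')

echo "$people"
```

The flag is reset on every closing </div>, so a "people" line outside such a block would be ignored.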

Related

How to use sed to extract the specific substring?

<div class="panel-body" id="current-conditions-body">
<!-- Graphic and temperatures -->
<div id="current_conditions-summary" class="pull-left" >
<img src="newimages/large/sct.png" alt="" class="pull-left" />
<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>
I am trying to extract the "64" in line 6. I was thinking of using awk '/<p class="myforecast-current-lrg">/{print}', but this only gave me the full line. I think I need to use sed, but I don't know how.

Assumptions:
input is nicely formatted as per the sample provided by OP so we can use some 'simple' pattern matching
Modifying OP's current awk code:
# use split() function to break line using dual delimiters ">" and "&"; print 2nd array entry
awk '/<p class="myforecast-current-lrg">/{ n=split($0,arr,"[>&]");print arr[2]}'
# define dual input field delimiter as ">" and "&"; print 2nd field in line that matches search string
awk -F'[>&]' ' /<p class="myforecast-current-lrg">/{print $2}'
Both of these generate:
64
One sed idea:
sed -En 's/.*<p class="myforecast-current-lrg">([^&]+)&deg.*/\1/p'
This generates:
64
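The same idea can probably be written without -E, using basic regular expressions only. The sketch below assumes the raw HTML encodes the degree sign as &deg; (which is what the delimiters in the commands above imply); sample and temp_f are names invented for the example.

```shell
# Three lines of the sample, with the degree sign as the &deg; entity.
sample='<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>'

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. [0-9-]* also allows a leading minus sign.
temp_f=$(printf '%s\n' "$sample" |
  sed -n 's/.*<p class="myforecast-current-lrg">\([0-9-]*\)&deg;.*/\1/p')

echo "$temp_f"
```

On the left-hand side of s/// the & is an ordinary character, so it needs no escaping there.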

Replacing filename placeholder with file contents in sed

I'm trying to write a basic script to compile HTML file includes.
The premise goes like this:
I have 3 files
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
include1.html
<span>
banana
</span>
include2.html
<span>
apple
</span>
My desired output would be:
output.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
I've tried the following:
sed "s|#include \(.*\)|$(cat \1)|" test.html >output.html
This returns cat: 1: No such file or directory
sed "s|#include \(.*\)|cat \1|" test.html >output.html
This runs but gives:
output.html
<div>
cat include1.html
<div>content</div>
cat include2.html
</div>
Any ideas on how to run cat inside sed using group substitution? Or perhaps another solution.

I wrote this 15-20 years ago to recursively include files and it's included in the article I wrote about how/when to use getline under "Applications" then "d)". I tweaked it now to work with your specific "#include" directive, provide indenting to match the "#include" indentation, and added a safeguard against infinite recursion (e.g. file A includes file B and file B includes file A):
$ cat tst.awk
function read(file,indent) {
    if ( isOpen[file]++ ) {
        print "Infinite recursion detected" | "cat>&2"
        exit 1
    }
    while ( (getline < file) > 0) {
        if ($1 == "#include") {
            match($0,/^[[:space:]]+/)
            read($2,indent substr($0,1,RLENGTH))
        } else {
            print indent $0
        }
    }
    close(file)
    delete isOpen[file]
}
BEGIN{
    read(ARGV[1],"")
    exit
}
$ awk -f tst.awk test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Note that if include1.html itself contained a #include ... directive then it'd be honored too, and so on. Look:
$ for i in test.html include?.html; do printf -- '-----\n%s\n' "$i"; cat "$i"; done
-----
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
-----
include1.html
<span>
#include include3.html
</span>
-----
include2.html
<span>
apple
</span>
-----
include3.html
<div>
#include include4.html
</div>
-----
include4.html
<span>
grape
</span>
$ awk -f tst.awk test.html
<div>
<span>
<div>
<span>
grape
</span>
</div>
</span>
<div>content</div>
<span>
apple
</span>
</div>
With a non-GNU awk I'd expect it to fail after about 20 levels of recursion with a "too many open files" error so get gawk if you need to go deeper than that or you'd have to write your own file management code.

If you have GNU sed, you can use the e flag to the s command, which executes the current pattern space as a shell command and replaces it with the output:
$ sed 's/#include/cat/e' test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Notice that this doesn't take care of indentation, as the included files don't have any. An HTML prettifier like Tidy can help you further for this:
$ sed 's/#include/cat/e' test.html | tidy -iq --show-body-only yes
<div>
<span>banana</span>
<div>
content
</div><span>apple</span>
</div>
sed has a command to read a file, r, but the filename can't be generated on the fly.
As Ed points out in his comment, this is vulnerable to shell command injection: if you have something like
#include $(date)
you'll notice that the date command was actually run. This can be prevented, but the conciseness of the original solution is out the window then:
sed 's|#include \(.*\)|cat "$(/usr/bin/printf "%q" '\''\1'\'')"|e' test.html
This still replaces #include with cat, but additionally wraps the rest of the line into a command substitution with printf "%q", so a line such as
#include include1.html
becomes
cat "$(/usr/bin/printf "%q" 'include1.html')"
before being executed as a command. This expands to
cat include1.html
but if the file were named $(date), it becomes
cat '$(date)'
(note the single quotes), preventing the injected command from being executed.
Because s///e seems to use /bin/sh as its shell, you can't rely on Bash's %q format specification in printf to exist, hence the absolute path to the printf binary. For readability, I've changed the / delimiters of the s command to | (so I don't have to escape \/usr\/bin\/printf).
Lastly, the quoting mess around \1 is to get a single quote into a single quoted string: '\'' becomes '.

You may use this bash script, which uses a regex to detect a line starting with #include and grabs the include filename using a capture group:
re="#include +([^[:space:]]+)"
while IFS= read -r line; do
[[ $line =~ $re ]] && cat "${BASH_REMATCH[1]}" || echo "$line"
done < test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Alternatively you may use this awk script to do the same:
awk '$1 == "#include"{system("cat " $2); next} 1' test.html
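Because system("cat " $2) hands the filename to a shell, it has the same injection concern discussed for sed's e flag. One possible way around that, sketched here under the assumption that includes are not nested and that filenames contain no whitespace, is to read the included file with getline instead. The temporary directory and the variables tmpdir and output exist only for this demonstration.

```shell
# Recreate the three files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<div>' '#include include1.html' '<div>content</div>' \
  '#include include2.html' '</div>' > "$tmpdir/test.html"
printf '%s\n' '<span>' 'banana' '</span>' > "$tmpdir/include1.html"
printf '%s\n' '<span>' 'apple' '</span>' > "$tmpdir/include2.html"

# Read each included file with getline, so the filename is never
# interpreted by a shell. Non-recursive: nested #include lines are
# copied through unchanged.
output=$(cd "$tmpdir" && awk '
  $1 == "#include" {
      file = $2
      while ((getline line < file) > 0) print line
      close(file)
      next
  }
  { print }
' test.html)

echo "$output"
rm -rf "$tmpdir"
```

For nested includes, the recursive tst.awk shown earlier is the better fit.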

Multiple occurrences in sed substitution

I am trying to retrieve some data within a specific div tag in my html file.
My current html code is in the following format.
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract the text in just the div with class2, using the bash code
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
I get an output HTML file containing
some text some text </div> Some more text </div> Too much text
I want all the data after the first </div> to be removed, but instead only the final one is removed.
Can someone please explain my mistake?

You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file loop through the array, skipping the first and last lines.
Instead of making an array, you could check for the first and last lines using a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
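As for the mistake in the question: once the file is joined into one line, the group \(.*\) is greedy and takes everything up to the end of the line, so the trailing .* matches nothing and no </div> is cut off where expected. sed has no non-greedy quantifier, so one workaround is to delete from the first </div> onward in a second substitution. The following is only a sketch on the sample input; html and inner are hypothetical names, and tr is used in place of the :a;N;$!ba loop to flatten the whitespace.

```shell
# Sample input copied from the question.
html='<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>'

# 1. Squeeze all whitespace (including newlines) into single spaces.
# 2. Drop everything up to and including the class2 opening tag.
# 3. Drop everything from the first remaining </div> onward.
inner=$(printf '%s\n' "$html" | tr -s '[:space:]' ' ' |
  sed -e 's/.*<div class = "class2"> *//' -e 's/ *<\/div>.*//')

echo "$inner"
```

The second substitution works because a regex match starts at the leftmost possible position, which here is the first </div> left in the line.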

This works for retrieving text inside the div class = "class2" tags
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'

REGEX pattern to increment a counter on replacements

In a replacement pattern, is there any way to print the NUMBER of the replacement as in a counter?
I have a series of code blocks I need to process in an HTML file, but in each replaced block I need to increment the counter by 1.
So
<p class-"foo">Some text</p>
<p class-"foo">Other text</p>
needs to be
<p id="1">Some text</p>
<p id="2">Other text</p>
I have many lines and would love to avoid entering those numbers manually. What is the simplest way to do this?

In Perl:
my $html = <<END;
<p class="foo">Some text</p>
<p class="foo">Other text</p>
END
my $n = 0;
$html =~ s/<p class="foo">/'<p id="'.++$n.'">'/eg;
print $html;
OUTPUT
<p id="1">Some text</p>
<p id="2">Other text</p>
For a command-line version that reads from a file:
perl -pe 's/<p class="foo">/q(<p id=").++$n.q(">)/eg' myfile.html

You can write:
perl -pe 's/<p class-"foo">/"<p id=\"" . (++$count) . "\">"/eg'
using the /e flag to treat the replacement as an expression rather than a string.

Using awk:
{ cat -<<EOS
<p class-"foo">Some text</p>
<p class-"foo">Other text</p>
EOS
} | awk '/<p class/{sub(/class-"[^"]*"/, "id=\"" (++i) "\"");print}'
Output:
<p id="1">Some text</p>
<p id="2">Other text</p>
I hope this helps.
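Note that sub() replaces only the first match on each line. If several tags can share a line, a match() loop is one way to number every occurrence. This is a rough sketch: the sample input with two tags on one line is invented for the illustration, as are the names html, numbered, out and n.

```shell
# Hypothetical input: the first line holds two tags to be numbered.
html='<p class-"foo">Some text</p><p class-"foo">Same line</p>
<p class-"foo">Other text</p>'

# Consume the line piece by piece: copy the text before each match,
# append the numbered tag, then continue with the rest of the line.
numbered=$(printf '%s\n' "$html" |
  awk '{
    out = ""
    while (match($0, /<p class-"foo">/)) {
      out = out substr($0, 1, RSTART - 1) "<p id=\"" (++n) "\">"
      $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
  }')

echo "$numbered"
```

The counter n is never reset, so numbering continues across lines.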

perl -pwe 's/<p\s+\Kclass-"foo">/ $i++; qq(id="$i">) /e' yourfile
The \K is used to keep whatever comes before it, which is convenient in this case. Using a replacement that is evaluated and contains several (two) statements is also convenient, to avoid concatenating and complicating quoting. It is only the return value of the entire statement that is inserted, i.e. the last statement.
When you've tried it out and want to alter the files, you can simply add the -i option. I would recommend using backups, like so:
perl -i.bak -pwe '....etc'
(Backup in filename.ext.bak)

irb(main):001:0> s = %Q{<p class-"foo">Some text</p>\n<p class-"foo">Other text</p>}
irb(main):002:0> id=0; s.gsub(/class-"foo"/) { id+=1; %Q[id="#{id}"] }
=> "<p id=\"1\">Some text</p>\n<p id=\"2\">Other text</p>"

Shell: Extract some code from HTML

I have the following code snippet from a HTML file:
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
I want to extract the code
520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL
from the HTML.
Please note that <img src="" style="display:none;"/> must be used, because there are other similar URLs in the HTML file, but I only want the ones between <img src="" style="display:none;"/>.
My Code is:
cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'
Something seems to be wrong.

You can solve it by using positive look ahead / look behind:
cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*.jpg\" style=\"display:none;\"/>)"
Regexp breakdown:
.*? match all characters reluctantly
(?<=<img src=...ges/I/) preceded by <img .../I/
(?=\._...ne;\"/>) followed by ._...ne;\"/>

I assume you were looking for a lookbehind to start, which is what was throwing the error.
(?<=foo) not (?<foo).
This gives the result you specified, but I do not know whether you need everything up to the .jpg or not:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'
Everything up to, but excluding, the .jpg would be:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=\.jpg)'

And if you consider gawk as being a valid bash solution:
awk -F'[/._]' '/<img src=/{print $7}' file
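Where grep -P is not available, a sed substitution can probably do the same job while still requiring the style="display:none;" attribute that the question insists on. This is a sketch under the assumption that each <img> tag sits on its own line; the last <img> line in the sample is a made-up decoy to show that other URLs are skipped, and html and codes are hypothetical names.

```shell
# Sample from the question plus one invented decoy line (alt="visible").
html='<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/DECOY00000._AA30_.jpg" alt="visible"/>
</div>'

# Capture everything after /images/I/ up to the first dot, and print
# only lines where the whole hidden-image pattern matched.
codes=$(printf '%s\n' "$html" |
  sed -n 's|.*<img src="http://example\.com/images/I/\([^.]*\)\..*\.jpg" style="display:none;"/>.*|\1|p')

echo "$codes"
```

Using | as the s delimiter avoids escaping the slashes in the URL.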
