BASH - Select All Code Between A Multiline Div

I have a div on all of my eCommerce site's pages holding SEO content. I'd like to count the number of words in that div. It's for diagnosing empty pages in a large crawl.
The div always starts as follows:
<div class="box fct-seo fct-text
It then contains <h1>, <p> and <a> tags.
It then, obviously, closes with </div>.
How can I, using sed, awk, wc, etc., take all the code between the start of the div and its closing tag and count how many words occur? If it's 90% accurate, I'm happy.
You'd somehow have to tell it to stop scanning before the first closing </div> it finds.
Here's an example page to work with:
http://www.zando.co.za/women/shoes/
Much appreciated.
-P
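Before reaching for a parser, the range-then-count idea can be sketched with plain sed and wc (a rough sketch: page.html is a made-up stand-in for a saved page, and it assumes the closing </div> starts a line of its own, as the question describes):

```shell
# Stand-in for a saved page (the real div holds the SEO copy).
printf '%s\n' '<p>ignore me</p>' \
  '<div class="box fct-seo fct-text">' \
  '<h1>Shoes</h1>' \
  '<p>two words</p>' \
  '</div>' \
  '<p>ignore this too</p>' > page.html

# Print everything from the opening div line to the first </div> line,
# strip the HTML tags, then count the remaining words.
sed -n '/<div class="box fct-seo fct-text/,/<\/div>/p' page.html \
  | sed 's/<[^>]*>//g' \
  | wc -w
```

With this stand-in input the count is 3 (Shoes, two, words); anything outside the div is never scanned.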

When it gets more complicated (like divs nested within that div) the regex approach won't work anymore and you need an HTML parser, like in my Xidel. Then you can find the text
either with css:
xidel http://www.zando.co.za/women/shoes/ -e 'css(".fct-seo")' | wc -w
or pattern matching:
xidel http://www.zando.co.za/women/shoes/ -e '<div class="box fct-seo fct-text">{.}</div>' | wc -w
It will also only print the text, not the html tags. (if you/someone wanted them, you could add the --printed-node-format xml option)

In a Perl one-liner you can use the .. operator to specify the patterns that match the beginning and end of the region you're interested in:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html
You can then count the words with wc -w:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html | wc -w
If counting the ‘words’ in the HTML tags themselves is inflating the numbers enough to hurt the accuracy, you can remove those from the count with something like:
$ perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
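As a quick self-check of the flip-flop plus tag-stripping variant (a made-up shoes.html stand-in; the real page will differ):

```shell
# Stand-in page: three words inside the div, noise outside it.
printf '%s\n' '<p>skip</p>' \
  '<div class="box fct-seo fct-text">' \
  '<p>three little words</p>' \
  '</div>' \
  '<p>also skip</p>' > shoes.html

# Keep only lines inside the range, drop the tags, count the words.
perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
```

Here wc -w reports 3, the three words inside the div; the surrounding paragraphs are skipped by the flip-flop.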

Try:
grep -Pzo '(?<=<div)(.*?\n)*?.*?(?=</div)' -n inputFile.html | sed 's/^[^>]*>//'

How can I use hxselect to generate array-ish result?

I'm using hxselect to process an HTML file in bash.
In this file there are multiple divs defined with the '.row' class.
In bash I want to extract these 'rows' into an array. (The divs span multiple lines, so simply reading the file line by line is not suitable.)
Is it possible to achieve this? (With basic tools: awk, grep, etc.)
After assigning rows to an array, I want to further process it:
for row in ROWS_EXTRACTED; do
PROCESS1($row)
PROCESS2($row)
done
Thank you!
One possibility would be to put the content of the tags into a quoted-items string, then turn that into an array. For example:
# Create a string with '" "' as separator
array=`cat file.html | hxselect -i -c -s '" "' 'div.row'`
# Add " to the beginning of the string and remove the trailing one
array='"'${array%'"'}
# Turn the quoted string into a real bash array
eval "array=($array)"
Then, process it in a for loop:
for index in ${!array[*]}; do printf " %s\n\n" "${array[$index]}"; done
If the tags contain the quote character, another solution would be to use a separator character not found in the tags' content (§ in my example):
array=`cat file.html | hxselect -i -c -s '§' 'div.row'`
Then process it with awk:
# Keep only the separators to count them with ${#res}
res="${array//[^§]}"
for (( i=1; i<=${#res}; i++ ))
do
echo "$array" | awk -v i="$i" -F § '{print $i}'
echo "----------------------------------------"
done
The following instructs hxselect to separate matches with a tab, deletes all newlines, and then translates the tab separators to newlines. This enables you to iterate over the divs as lines with read:
#!/bin/bash
divs=$(hxselect -s '\t' .row < "$1" | tr -d '\n' | tr '\t' '\n')
while read -r div; do
echo "$div"
done <<< "$divs"
Given the following test input:
<div class="container">
<div class="row">
herp
derp
</div>
<div class="row">
derp
herp
</div>
</div>
Result:
$ ./test.sh test.html
<div class="row"> herp derp </div>
<div class="row"> derp herp </div>

Parse HTML using shell

I have an HTML file with lots of data; the part I am interested in is:
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
I am trying to use awk, which currently is:
awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"
but what I want is to have:
54
1
0
0
Right now I am getting:
'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'
Any suggestions?
awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can perform XSL transformations. Both tools belong to the libxml2-utils package.
You can also use a programming language that is able to parse HTML.
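For instance, xmllint (from libxml2-utils, mentioned above) can run the XPath query directly; a sketch, assuming the snippet is saved as index.html (--html makes libxml2 tolerate the unquoted attribute values):

```shell
# The snippet from the question, wrapped in a table so it parses.
printf '%s\n' '<table><tr valign=top>' \
  '<td><b>Total</b></td>' \
  '<td align=right><b>54</b></td>' \
  '<td align=right>0 (0/0)</td>' \
  '</tr></table>' > index.html

# Select every right-aligned cell; 2>/dev/null hides parser warnings.
xmllint --html --xpath '//td[@align="right"]' index.html 2>/dev/null
```

The node set is printed as-is (tags included); a follow-up `sed 's/<[^>]*>//g'` would reduce it to the bare values.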
awk -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file
Output:
54
1
0
0
Another:
awk -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
while (getline > 0 && /<td /) {
gsub(/<b>/, ""); sub(/ .*/, "", $3)
print $3
}
exit
}' file
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
You really should use a real HTML parser for this job, like:
perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'
prints:
54
1
0
0
But for this you need perl with the Mojolicious package installed
(it is easy to install with:)
curl -L get.mojolicio.us | sh
BSD/GNU grep/ripgrep
For simple extraction, you can use grep, for example:
Your example using grep:
$ egrep -o "[0-9][^<]*" file.html
54
1
0 (0/0)
0
and using ripgrep:
$ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
54
1
0 (0/0)
0
Extracting outer html of H1:
$ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
Other examples:
Extracting the body:
$ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
Instead of xargs you can also use tr '\n' ' '.
For multiple tags, see: Text between two tags.
If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is much faster since it's written in Rust.
HTML-XML-utils
You may use html-xml-utils for parsing well-formed HTML/XML files. The package includes a lot of binary tools to extract or modify the data. For example:
$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>
Here is the example with provided data:
$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>
Here is the final example with stripping out <b> tags:
$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0
For more examples, check the html-xml-utils documentation.
I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.
cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF
Prints:
54
1
0 (0/0)
0
With xidel, a true HTML parser, and XPath:
$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0
$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0
ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select/delete inner/outer tags, and edit the content in place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
Use ex in-place editor to substitute on all lines (%) by: ex +"%s/pattern/replace/g".
The substitution pattern consists of 3 parts:
Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
Select our main part up to < (\([^<]\+\)).
Select everything else after < for removal (<.*).
We replace the whole matching line with \1, which refers to the group captured inside \( \).
After the substitution, we remove any lines containing letters by using the global command: g/[a-zA-Z]/d.
Finally, print the current buffer on the screen by +%p.
Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
To replace the file in-place, change -scq! to -scwq.
Here is another simple example which removes the style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, it's not advisable to use regex for parsing HTML, so as a long-term approach you should use an appropriate language (such as Python, Perl, or PHP DOM).
See also:
How to parse hundred HTML source code files in shell?
Extract data from HTML table in shell script?
What about:
lynx -dump index.html

Hold buffer to rearrange texts

I don't know a good way to do this (sed/awk/perl); I combined multiple chapters of HTML files, and the result has the following structure:
title
title
title
<p>first chapter contents, multiple
pages</p>
title
title
title
<p>Second chapter contents, multiple pages
more informations</p>
title
title
title
<p>Third chapter contents, multiple pages
few more details</p>
I want them to reorganize like below
title
title
title
title
title
title
title
title
title
<p>first chapter contents, multiple
pages</p>
<p>Second chapter contents, multiple pages
more informations</p>
<p>Third chapter contents, multiple pages
few more details</p>
I have five chapters in an HTML file to reorganize. I was trying to use sed's hold buffer, but that seems difficult with my knowledge. I am not restricted to sed or awk. Any help will be highly appreciated, thanks.
Edit
Sorry, I altered the source file; it also has a few lines that don't start with either
<a or <p
Is there any way to do an inverse selection in sed, something like
/^<a!/p/
How about running sed twice, first outputting the <a> tags, then the <p> tags:
sed -n '/^<a/p' input.txt
sed -n '/^<p/p' input.txt
Using holdspace it could be done like this:
sed -n '/^<a/p; /^<p/H; ${g; s/\n//; p}' input.txt
Print all <a> tags, put all <p> tags into the hold space, and at the end of the document ($), get the hold space and print it. H always adds a newline before appending to the hold space; we don't want the first one, so we remove it with s/\n//.
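The behaviour described above can be checked on a tiny made-up input.txt:

```shell
# Two <a> lines interleaved with two <p> lines.
printf '%s\n' '<a>one</a>' '<p>first</p>' '<a>two</a>' '<p>second</p>' > input.txt

# <a> lines print immediately; <p> lines collect in the hold space
# and are appended after the last line is read.
sed -n '/^<a/p; /^<p/H; ${g; s/\n//; p}' input.txt
```

This prints the two <a> lines first, then the two held <p> lines, matching the reordering the question asks for.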
If you want to store the output, you can redirect it
sed -n '/^<a/p; /^<p/H; ${g; s/\n//; p}' input.txt > output.txt
To use sed -i directly, we need to restructure the code a bit:
sed -i '${x; G; s/\n//; p}; /^<p/{H;d}' input.txt
But this is getting a bit tedious.
If you have lines starting with other characters, and just want to move all lines starting with an <a> tag to the front, you can do
sed -n '/^<a/p; /^<a/! H; ${g; s/\n//; p}' input.txt
Grep works too:
(grep -F '<a' test.txt ; grep -F '<p' test.txt)
sed -n '/^ *<[aA]/ !H
/^ *<[aA]/ p
$ {x;s/\n//;p;}
' YourFile
If a line does not begin with <a (the [aA] allows both capital and small variants), it is appended to the hold buffer.
If it does begin with <a, the line is printed.
At the end, load the hold buffer, remove the first newline (we start with an append, so a newline precedes the first kept line) and print the content.
Using awk
awk '{if ($0~/<a/) a[NR]=$0; else b[NR]=$0} END {for (i=1;i<=NR;i++) if (a[i]) print a[i];for (j=1;j<=NR;j++) if (b[j]) print b[j]}' file
title
title
title
title
title
title
title
title
title
<p>first chapter contents, multiple
pages</p>
<p>Second chapter contents, multiple pages
more informations</p>
<p>Third chapter contents, multiple pages
few more details</p>

sed multi-line replacement with line merging

This may be a bit complex, but here it goes:
Assuming I have an XML that looks as follows:
<a>
<b>000</b>
<c>111</c>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
How can I, using sed on a mac, get a resulting an XML that looks as follows:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
Basically:
Matching 2 consecutive lines that are of the form <b>...</b> followed by <c>...</c>
Taking the value between <c>...</c> and placing it (plus a space character) right after <b> on the line before it
Removing the second line <c>...</c>
Thank you.
If sed is too much for this, please advise anything else as long as I can run it from a mac shell.
Not the most beautiful solution but it seems to work :-)
$ tr '\n' '#' < input | sed 's/<b>\([0-9][0-9]*\)<\/b>#<c>\([0-9][0-9]*\)<\/c>/<b>\2 \1<\/b>/g' | tr '#' '\n'
output:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
or a bit more general:
$ tr '\n' '#' < f1 | sed 's/<b>\([^<]*\)<\/b>#<c>\([^<]*\)<\/c>/<b>\2 \1<\/b>/' | tr '#' '\n'
using [^<]* to match anything between the tags
Ruby would support multi-line patterns:
ruby -e 'print gets(nil).sub(/<b>([^\n]*)<\/b>\n<c>([^\n]*)<\/c>/m,"<b>\\2 \\1</b>")' file.txt
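Another possible sketch uses sed's N command to pull the next line into the pattern space whenever a one-line <b>…</b> is seen (it assumes each tag pair sits on a single line, as in the sample):

```shell
# The sample input from the question.
printf '%s\n' '<a>' '<b>000</b>' '<c>111</c>' '<b>222</b>' '<d>333</d>' '<c>444</c>' '</a>' > input.xml

# On a <b> line, append the next line (N); if that line is a <c>,
# merge the two values into one <b> element, swapping their order.
sed -e '/<b>.*<\/b>/{N
s/<b>\(.*\)<\/b>\n<c>\(.*\)<\/c>/<b>\2 \1<\/b>/
}' input.xml
```

If the line after a <b> is not a <c> (as with <b>222</b> followed by <d>333</d>), the substitution fails and both lines pass through unchanged.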

Use the contents of a file to replace a string using SED

What would be the sed command for mac shell scripting that would replace all occurrences of the string "fox" with the entire content of myFile.txt?
myFile.txt would be HTML content with line breaks and all kinds of characters. An example would be:
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
Thanks!
EDIT 1
This is my actual code:
sed -i.bkp '/Q/{
s/Q//g
r /Users/ericbrotto/Desktop/question.txt
}' $file
When I run it I get:
sed in place editing only works for regular files.
And in my files the Q is replaced by a ton of Chinese characters (!). Bizarre!
You can use the r command. When you find a 'fox' in the input...
/fox/{
...replace it for nothing...
s/fox//g
...and read the input file:
r f.html
}
If you have a file such as:
$ cat file.txt
the
quick
brown
fox
jumps
over
the lazy dog
fox dog
the result is:
$ sed '/fox/{
s/fox//g
r f.html
}' file.txt
the
quick
brown
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
jumps
over
the lazy dog
dog
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
EDIT: to alter the file being processed, just pass the -i flag to sed:
sed -i '/fox/{
s/fox//g
r f.html
}' file.txt
Some sed versions (such as mine) require you to pass an extension to the -i flag, which will be the extension of a backup file holding the old content of the file:
sed -i.bkp '/fox/{
s/fox//g
r f.html
}' file.txt
And here is the same thing as a one-liner, which is also compatible with use in a Makefile:
sed -i -e '/fox/{r f.html' -e 'd}'
Ultimately, what I went with, which is a lot simpler than many of the solutions I found online:
str=xxxx
sed -e "/$str/r FileB" -e "/$str/d" FileA
Supports templating like so:
str=xxxx
sed -e "/$str/r $fileToInsert" -e "/$str/d" $fileToModify
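The r-then-d pattern can be exercised with throwaway files (the names FileA/FileB come from the answer above):

```shell
printf '%s\n' 'before' 'fox' 'after' > FileA
printf '%s\n' '<div>replacement</div>' > FileB

# r queues FileB's contents for output at the end of the cycle;
# d then deletes the matched line, but the queued read still prints.
sed -e '/fox/r FileB' -e '/fox/d' FileA
```

The "fox" line is gone from the output and FileB's content appears in its place, which is exactly why the r/d ordering matters.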
Another method (a minor variation on the other solutions):
If your filenames are also variables (e.g. $file is f.html and the file you are updating is $targetFile):
sed -e "/fox/ {" -e "r $file" -e "d" -e "}" -i "$targetFile"
