How can I use hxselect to generate array-ish result? - bash

I'm using hxselect to process an HTML file in bash.
In this file there are multiple divs defined with the '.row' class.
In bash I want to extract these 'rows' into an array. (The divs are multi-line, so simply reading the file line by line is not suitable.)
Is it possible to achieve this? (With basic tools, awk, grep, etc.)
After assigning rows to an array, I want to further process it:
for row in "${ROWS_EXTRACTED[@]}"; do
    PROCESS1 "$row"
    PROCESS2 "$row"
done
Thank you!

One possibility would be to put the content of the tags in an array with each item enclosed in quotes. For example:
# Create array with " " as separator
array=`cat file.html | hxselect -i -c -s '" "' 'div.row'`
# Add " to the beginning of the string and remove the last
array='"'${array%'"'}
Then, processing in a for loop
for index in ${!array[*]}; do printf " %s\n\n" "${array[$index]}"; done
If the tags contain the quote character, another solution is to use a separator character that does not occur in the tags' content (§ in my example):
array=`cat file.html | hxselect -i -c -s '§' 'div.row'`
Then process it with awk:
# Keep only the separators to count them with ${#res}
res="${array//[^§]}"
for (( i=1; i<=${#res}; i++ ))
do
    # $array is deliberately unquoted: newlines collapse to spaces, so awk sees a single record
    echo $array | awk -v i="$i" -F § '{print $i}'
    echo "----------------------------------------"
done
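Both snippets above still hold the hxselect output in one string variable. As a hedged sketch of getting a real bash array out of separator-delimited text, the loop below peels off one item per separator using parameter expansion. The sample string stands in for hxselect output, and the names split_rows, rows and sample are made up for this illustration.

```shell
#!/usr/bin/env bash
# Split a string on a separator into the global array "rows".
# split_rows, rows and sample are illustrative names, not part of hxselect.
split_rows() {
    local data=$1 sep=$2
    rows=()
    while [[ $data == *"$sep"* ]]; do
        rows+=("${data%%"$sep"*}")   # text before the first separator
        data=${data#*"$sep"}         # drop that text and the separator
    done
    # Keep any trailing text that was not followed by a separator
    if [[ -n $data ]]; then
        rows+=("$data")
    fi
}

# Stand-in for: hxselect -i -c -s '§' 'div.row' < file.html
sample=$'herp\nderp§derp\nherp§'
split_rows "$sample" '§'

for row in "${rows[@]}"; do
    printf '[%s]\n' "$row"
done
```

Each element keeps its embedded newlines, so a later `for row in "${rows[@]}"` loop sees one multi-line div per iteration.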

The following instructs hxselect to separate matches with a tab, deletes all newlines, and then translates the tab separators to newlines. This enables you to iterate over the divs as lines with read:
#!/bin/bash
divs=$(hxselect -s '\t' .row < "$1" | tr -d '\n' | tr '\t' '\n')
while read -r div; do
echo "$div"
done <<< "$divs"
Given the following test input:
<div class="container">
<div class="row">
herp
derp
</div>
<div class="row">
derp
herp
</div>
</div>
Result:
$ ./test.sh test.html
<div class="row"> herp derp </div>
<div class="row"> derp herp </div>
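Since the question asks for an array, the flattened lines can also be loaded with bash's mapfile builtin (bash 4 or later). A minimal sketch, where the divs string is hard-coded to mirror the result above rather than produced by hxselect, and rows is an assumed variable name:

```shell
#!/usr/bin/env bash
# Stand-in for the flattened hxselect output shown above
divs='<div class="row"> herp derp </div>
<div class="row"> derp herp </div>'

# -t strips the trailing newline from each element
mapfile -t rows <<< "$divs"

echo "count: ${#rows[@]}"
for row in "${rows[@]}"; do
    echo "row: $row"
done
```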

Related

Replacing filename placeholder with file contents in sed

I'm trying to write a basic script to compile HTML file includes.
The premise goes like this:
I have 3 files
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
include1.html
<span>
banana
</span>
include2.html
<span>
apple
</span>
My desired output would be:
output.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
I've tried the following:
sed "s|#include \(.*\)|$(cat \1)|" test.html >output.html
This returns cat: 1: No such file or directory
sed "s|#include \(.*\)|cat \1|" test.html >output.html
This runs but gives:
output.html
<div>
cat include1.html
<div>content</div>
cat include2.html
</div>
Any ideas on how to run cat inside sed using group substitution? Or perhaps another solution.
I wrote this 15-20 years ago to recursively include files and it's included in the article I wrote about how/when to use getline under "Applications" then "d)". I tweaked it now to work with your specific "#include" directive, provide indenting to match the "#include" indentation, and added a safeguard against infinite recursion (e.g. file A includes file B and file B includes file A):
$ cat tst.awk
function read(file,indent) {
    if ( isOpen[file]++ ) {
        print "Infinite recursion detected" | "cat>&2"
        exit 1
    }
    while ( (getline < file) > 0) {
        if ($1 == "#include") {
            match($0,/^[[:space:]]+/)
            read($2,indent substr($0,1,RLENGTH))
        } else {
            print indent $0
        }
    }
    close(file)
    delete isOpen[file]
}
BEGIN{
    read(ARGV[1],"")
    exit
}
.
$ awk -f tst.awk test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Note that if include1.html itself contained a #include ... directive then it'd be honored too, and so on. Look:
$ for i in test.html include?.html; do printf -- '-----\n%s\n' "$i"; cat "$i"; done
-----
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
-----
include1.html
<span>
#include include3.html
</span>
-----
include2.html
<span>
apple
</span>
-----
include3.html
<div>
#include include4.html
</div>
-----
include4.html
<span>
grape
</span>
.
$ awk -f tst.awk test.html
<div>
<span>
<div>
<span>
grape
</span>
</div>
</span>
<div>content</div>
<span>
apple
</span>
</div>
With a non-GNU awk I'd expect it to fail after about 20 levels of recursion with a "too many open files" error so get gawk if you need to go deeper than that or you'd have to write your own file management code.
If you have GNU sed, you can use the e flag to the s command, which executes the current pattern space as a shell command and replaces it with the output:
$ sed 's/#include/cat/e' test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Notice that this doesn't take care of indentation, as the included files don't have any. An HTML prettifier like Tidy can help you further for this:
$ sed 's/#include/cat/e' test.html | tidy -iq --show-body-only yes
<div>
<span>banana</span>
<div>
content
</div><span>apple</span>
</div>
GNU sed has a command to read a file, r, but the filename can't be generated on the fly.
As Ed points out in his comment, this is vulnerable to shell command injection: if you have something like
#include $(date)
you'll notice that the date command was actually run. This can be prevented, but the conciseness of the original solution goes out the window:
sed 's|#include \(.*\)|cat "$(/usr/bin/printf "%q" '\''\1'\'')"|e' test.html
This still replaces #include with cat, but additionally wraps the rest of the line into a command substitution with printf "%q", so a line such as
#include include1.html
becomes
cat "$(/usr/bin/printf "%q" 'include1.html')"
before being executed as a command. This expands to
cat include1.html
but if the file were named $(date), it becomes
cat '$(date)'
(note the single quotes), preventing the injected command from being executed.
Because s///e seems to use /bin/sh as its shell, you can't rely on Bash's %q format specification in printf to exist, hence the absolute path to the printf binary. For readability, I've changed the / delimiters of the s command to | (so I don't have to escape \/usr\/bin\/printf).
Lastly, the quoting mess around \1 is to get a single quote into a single quoted string: '\'' becomes '.
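The '\'' idiom can be tried on its own. A tiny sketch (the variable name quoted is arbitrary):

```shell
# '\'' closes the quoted string, adds an escaped quote, and reopens the string
quoted='it'\''s'
printf '%s\n' "$quoted"
```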
You may use this bash script, which uses a regex to detect lines starting with #include and grabs the include filename using a capture group:
re='^[[:space:]]*#include +([^[:space:]]+)'
while IFS= read -r line; do
    # if/else rather than && ||, so a failing cat does not also print the #include line
    if [[ $line =~ $re ]]; then
        cat "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$line"
    fi
done < test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Alternatively you may use this awk script to do the same:
awk '$1 == "#include"{system("cat " $2); next} 1' test.html
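The system("cat " $2) call hands the file name to a shell, so it shares the injection risk discussed earlier for sed's e flag. A hedged alternative sketch reads the included file with awk's own getline instead. It handles one level of includes only, and the temporary files are created just for the demonstration:

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so the demo does not touch real files
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '<div>\n#include include1.html\n<div>content</div>\n</div>\n' > test.html
printf '<span>\nbanana\n</span>\n' > include1.html

# Read the named file with getline rather than passing it to a shell
result=$(awk '
$1 == "#include" {
    file = $2
    while ((getline line < file) > 0) print line
    close(file)
    next
}
{ print }
' test.html)

printf '%s\n' "$result"

# Clean up the demo directory
cd / && rm -rf "$tmp"
```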

How to replace digit between two html tags

I would like to replace a digit between two HTML tags, but my sed command does not work:
string to replace - <p key=SaveFile>0</p>
new string - <p key=SaveFile>1</p>
Code:
sed -i 's/\<p key\=SaveFile\>0\<\/p\>/<p key=SaveFile>1<\/p>/' newfile.xml
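In GNU sed, \< and \> match word boundaries, so escaping the angle brackets stops the pattern from matching; < and > need no escaping (nor does =). It's also easier if you use another delimiter for s, like | or #: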
echo "<p key=SaveFile>0</p>" | sed 's|<p key=SaveFile>0</p>|<p key=SaveFile>1</p>|'
If you want to replace any number between the two tags simply use [0-9]\+ or [0-9]+ (with option -r):
echo "<p key=SaveFile>1234</p>" | sed 's|<p key=SaveFile>[0-9]\+</p>|<p key=SaveFile>1</p>|'
Output:
<p key=SaveFile>1</p>
Applied to the file:
sed -i 's|<p key=SaveFile>0</p>|<p key=SaveFile>1</p>|' newfile.xml
Or with the g flag, to replace every occurrence on a line:
sed -i 's|<p key=SaveFile>0</p>|<p key=SaveFile>1</p>|g' newfile.xml
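The unescaped [0-9]+ form is mentioned above but not shown. A small sketch of that variant using -E, which both GNU and BSD sed accept for extended regular expressions (-r is the older GNU spelling); the variable names input and output are only for illustration:

```shell
input='<p key=SaveFile>1234</p>'
# With -E, + works as a quantifier without a backslash
output=$(printf '%s\n' "$input" | sed -E 's|<p key=SaveFile>[0-9]+</p>|<p key=SaveFile>1</p>|')
printf '%s\n' "$output"
```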

tr command not working as expected

I have the following string:
"Last updated Unknown </DIV> </DIV></DIV></TD></TR></TABLE></FORM></DIV></BODY></HTML>"
and I am trying a simple example that replaces HTML with test,
but I get unexpected results:
echo "Last updated Unknown </DIV> </DIV></DIV></TD></TR></TABLE></FORM></DIV></BODY></HTML>" | tr "HTML" "test"
Result:
tast updated Unknown </DIV> </DIV></DIV></eD></eR></eABtE></FORs></DIV></BODY></test>
tr is used to translate or delete characters. Try sed instead:
sed 's/HTML/test/g'
tr "HTML" "test" replaces H by t, T by e, M by s and L by t.
You could use sed instead.
$ echo "Last updated Unknown </DIV> </DIV></DIV></TD></TR></TABLE></FORM></DIV></BODY></HTML>" | sed 's/HTML/test/g'
Last updated Unknown </DIV> </DIV></DIV></TD></TR></TABLE></FORM></DIV></BODY></test>
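To see the character-by-character mapping in isolation, here is a short sketch (the sample words are arbitrary):

```shell
# tr pairs the two sets position by position: H->t, T->e, M->s, L->t
mapped=$(printf 'TABLE FORM HTML\n' | tr 'HTML' 'test')
printf '%s\n' "$mapped"

# sed replaces the whole string "HTML" and leaves other letters alone
replaced=$(printf 'TABLE FORM HTML\n' | sed 's/HTML/test/g')
printf '%s\n' "$replaced"
```

This prints eABtE FORs test followed by TABLE FORM test.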

BASH - Select All Code Between A Multiline Div

I have a div on all of my eCommerce site's pages holding SEO content. I'd like to count the number of words in that div. It's for diagnosing empty pages in a large crawl.
The div always starts as follows:
<div class="box fct-seo fct-text
It then contains <h1>, <p> and <a> tags.
It then, obviously, closes with </div>.
How can I, using sed, awk, wc, etc., take all the code between the start of the div and its closing </div> and count how many words occur? If it's 90% accurate, I'm happy.
You'd somehow have to tell it to stop scanning before the first closing </div> it finds.
Here's an example page to work with:
http://www.zando.co.za/women/shoes/
Much appreciated.
-P
When it gets more complicated (like divs nested within that div), the regex approach won't work anymore and you need an HTML parser, such as the one in my Xidel. Then you can find the text
either with css:
xidel http://www.zando.co.za/women/shoes/ -e 'css(".fct-seo")' | wc -w
or pattern matching:
xidel http://www.zando.co.za/women/shoes/ -e '<div class="box fct-seo fct-text">{.}</div>' | wc -w
It will also only print the text, not the html tags. (if you/someone wanted them, you could add the --printed-node-format xml option)
In a Perl one-liner you can use the .. operator to specify the patterns that match the beginning and end of the region you're interested in:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html
You can then count the words with wc -w:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html | wc -w
If counting the ‘words’ in the HTML tags themselves is affecting the numbers enough to affect the accuracy, you can remove those from the count with something like:
$ perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
Try:
grep -Pzo '(?<=<div)(.*?\n)*?.*?(?=</div)' -n inputFile.html | sed 's/^[^>]*>//'
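For pages where the div contains no nested divs, the same range idea can be sketched with awk and sed. This is an approximation like the Perl version; the sample file below is invented for the demonstration, and shoes.html is assumed to be a saved copy of the page:

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cat > "$tmp/shoes.html" <<'EOF'
<div class="header">ignore these words</div>
<div class="box fct-seo fct-text">
<h1>Women's Shoes</h1>
<p>Browse <a href="/heels">high heels</a> online.</p>
</div>
<p>footer text</p>
EOF

# Print from the opening div to the first </div>, strip tags, count words
count=$(awk '/<div class="box fct-seo fct-text/,/<\/div>/' "$tmp/shoes.html" |
        sed 's/<[^>]*>//g' | wc -w)
count=$((count))   # normalise any padding some wc versions add
echo "$count"
rm -rf "$tmp"
```

With this sample the count is 6 (Women's Shoes, Browse high heels online.).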

Use the contents of a file to replace a string using SED

What would be the sed command for Mac shell scripting that would replace all occurrences of the string "fox" with the entire content of myFile.txt?
myFile.txt would be html content with line breaks and all kinds of characters. An example would be
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
Thanks!
EDIT 1
This is my actual code:
sed -i.bkp '/Q/{
s/Q//g
r /Users/ericbrotto/Desktop/question.txt
}' $file
When I run it I get:
sed in place editing only works for regular files.
And in my files the Q is replaced by a ton of Chinese characters (!). Bizarre!
You can use the r command. When you find a 'fox' in the input...
/fox/{
...replace it with nothing...
s/fox//g
...and read the input file:
r f.html
}
If you have a file such as:
$ cat file.txt
the
quick
brown
fox
jumps
over
the lazy dog
fox dog
the result is:
$ sed '/fox/{
s/fox//g
r f.html
}' file.txt
the
quick
brown
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
jumps
over
the lazy dog
dog
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
EDIT: to alter the file being processed, just pass the -i flag to sed:
sed -i '/fox/{
s/fox//g
r f.html
}' file.txt
Some sed versions (such as my own one) require you to pass an extension to the -i flag, which will be the extension of a backup file with the old content of the file:
sed -i.bkp '/fox/{
s/fox//g
r f.html
}' file.txt
And here is the same thing as a one-liner, which is also compatible with a Makefile:
sed -i -e '/fox/{r f.html' -e 'd}' file.txt
Ultimately, I went with this, which is a lot simpler than many of the solutions I found online:
str=xxxx
sed -e "/$str/r FileB" -e "/$str/d" FileA
Supports templating like so:
str=xxxx
sed -e "/$str/r $fileToInsert" -e "/$str/d" $fileToModify
Another method (minor variation to other solutions):
If your filenames are also variables (e.g. $file is f.html and the file you are updating is $targetFile):
sed -e "/fox/ {" -e "r $file" -e "d" -e "}" -i "$targetFile"

Resources