Replacing filename placeholder with file contents in sed - bash

I'm trying to write a basic script to compile HTML file includes.
The premise goes like this:
I have 3 files:
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
include1.html
<span>
banana
</span>
include2.html
<span>
apple
</span>
My desired output would be:
output.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
I've tried the following:
sed "s|#include \(.*)|$(cat \1)|" test.html >output.html
This returns cat: 1: No such file or directory
sed "s|#include \(.*)|cat \1|" test.html >output.html
This runs but gives:
output.html
<div>
cat include1.html
<div>content</div>
cat include2.html
</div>
Any ideas on how to run cat inside sed using group substitution? Or perhaps another solution.

I wrote this 15-20 years ago to recursively include files; it's included in the article I wrote about how/when to use getline, under "Applications", item "d)". I tweaked it now to work with your specific "#include" directive, to indent the included content to match the "#include" line's indentation, and to add a safeguard against infinite recursion (e.g. file A includes file B and file B includes file A):
$ cat tst.awk
function read(file,indent) {
    if ( isOpen[file]++ ) {
        print "Infinite recursion detected" | "cat>&2"
        exit 1
    }
    while ( (getline < file) > 0) {
        if ($1 == "#include") {
            match($0,/^[[:space:]]+/)
            read($2,indent substr($0,1,RLENGTH))
        } else {
            print indent $0
        }
    }
    close(file)
    delete isOpen[file]
}
BEGIN{
    read(ARGV[1],"")
    exit
}
$ awk -f tst.awk test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
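Since the test.html above has no leading whitespace on its #include lines, the indentation handling doesn't show here; with an indented copy (a hypothetical indented.html) it would look something like this:
$ cat indented.html
<div>
    #include include1.html
</div>
$ awk -f tst.awk indented.html
<div>
    <span>
    banana
    </span>
</div>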
Note that if include1.html itself contained a #include ... directive then it'd be honored too, and so on. Look:
$ for i in test.html include?.html; do printf -- '-----\n%s\n' "$i"; cat "$i"; done
-----
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
-----
include1.html
<span>
#include include3.html
</span>
-----
include2.html
<span>
apple
</span>
-----
include3.html
<div>
#include include4.html
</div>
-----
include4.html
<span>
grape
</span>
$ awk -f tst.awk test.html
<div>
<span>
<div>
<span>
grape
</span>
</div>
</span>
<div>content</div>
<span>
apple
</span>
</div>
With a non-GNU awk I'd expect it to fail after about 20 levels of recursion with a "too many open files" error, so get gawk if you need to go deeper than that, or you'd have to write your own file-management code.

If you have GNU sed, you can use the e flag to the s command, which executes the current pattern space as a shell command and replaces it with the output:
$ sed 's/#include/cat/e' test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Notice that this doesn't take care of indentation, as the included files don't have any. An HTML prettifier like Tidy can help you further for this:
$ sed 's/#include/cat/e' test.html | tidy -iq --show-body-only yes
<div>
<span>banana</span>
<div>
content
</div><span>apple</span>
</div>
GNU sed has a command to read a file, r, but the filename can't be generated on the fly.
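For reference, this is what r looks like when the filename is known in advance; a sketch that handles only one hard-coded include file, which is exactly the limitation:
sed -e '/#include include1\.html/{r include1.html' -e 'd}' test.html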
As Ed points out in his comment, the s///e approach is vulnerable to shell command injection: if you have something like
#include $(date)
you'll notice that the date command was actually run. This can be prevented, but the conciseness of the original solution is out the window then:
sed 's|#include \(.*\)|cat "$(/usr/bin/printf "%q" '\''\1'\'')"|e' test.html
This still replaces #include with cat, but additionally wraps the rest of the line into a command substitution with printf "%q", so a line such as
#include include1.html
becomes
cat "$(/usr/bin/printf "%q" 'include1.html')"
before being executed as a command. This expands to
cat include1.html
but if the file were named $(date), it becomes
cat '$(date)'
(note the single quotes), preventing the injected command from being executed.
Because s///e seems to use /bin/sh as its shell, you can't rely on Bash's %q format specification in printf to exist, hence the absolute path to the printf binary. For readability, I've changed the / delimiters of the s command to | (so I don't have to escape \/usr\/bin\/printf).
Lastly, the quoting mess around \1 is to get a single quote into a single quoted string: '\'' becomes '.
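If that '\'' idiom is unfamiliar, here it is in isolation (plain shell quoting, nothing sed-specific):
$ echo 'a'\''b'
a'b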

You may use this bash script, which uses a regex to detect lines starting with #include and grabs the include filename using a capture group:
re="#include +([^[:space:]]+)"
while IFS= read -r line; do
    [[ $line =~ $re ]] && cat "${BASH_REMATCH[1]}" || echo "$line"
done < test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Alternatively you may use this awk script to do the same:
awk '$1 == "#include"{system("cat " $2); next} 1' test.html

Related

How can I use hxselect to generate an array-ish result?

I'm using hxselect to process an HTML file in bash.
In this file there are multiple divs defined with the '.row' class.
In bash I want to extract these 'rows' into an array. (The divs span multiple lines, so simply reading the file line by line is not suitable.)
Is it possible to achieve this? (With basic tools: awk, grep, etc.)
After assigning rows to an array, I want to further process it:
for row in ROWS_EXTRACTED; do
PROCESS1($row)
PROCESS2($row)
done
Thank you!
One possibility would be to put the content of the tags in an array with each item enclosed in quotes. For example:
# Create a string of matches with '" "' as the separator
array=`cat file.html | hxselect -i -c -s '" "' 'div.row'`
# Add a " to the beginning of the string and remove the trailing one
array='"'${array%'"'}
Then, process it in a for loop:
for index in ${!array[*]}; do printf " %s\n\n" "${array[$index]}"; done
If the tags contain the quote character, another solution would be to place a separator character not found in the tag contents (§ in my example):
array=`cat file.html | hxselect -i -c -s '§' 'div.row'`
Then process it with awk:
# Keep only the separators to count them with ${#res}
res="${array//[^§]}"
for (( i=1; i<=${#res}; i++ ))
do
    echo "$array" | awk -v i="$i" -F '§' '{print $i}'
    echo "----------------------------------------"
done
The following instructs hxselect to separate matches with a tab, deletes all newlines, and then translates the tab separators to newlines. This enables you to iterate over the divs as lines with read:
#!/bin/bash
divs=$(hxselect -s '\t' .row < "$1" | tr -d '\n' | tr '\t' '\n')
while read -r div; do
echo "$div"
done <<< "$divs"
Given the following test input:
<div class="container">
<div class="row">
herp
derp
</div>
<div class="row">
derp
herp
</div>
</div>
Result:
$ ./test.sh test.html
<div class="row"> herp derp </div>
<div class="row"> derp herp </div>

Parse HTML snippet with awk

I am trying to parse an HTML document with awk.
The document contains several <div class="p_header_bottom"></div> blocks:
<div class="p_header_bottom">
<span class="fl_r"></span>
287,489 people
</div>
<div class="p_header_bottom">
<span class="fl_r"></span>
5 links
</div>
I am using
awk '/<div class="p_header_bottom">/,/<\/div>/'
to extract all such divs.
How can I get the number 287,489 from the first one?
Actually, awk '/<\/span>/,/people/' doesn't work correctly.
With gawk, and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:
awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
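If you only need the number from the first matching block in a shell variable, the same idea with an early exit is one option; a sketch (still gawk, because of the regex RS):
people=$(awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print; exit}' file.txt)
echo "$people"    # 287,489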

REGEX pattern to increment a counter on replacements

In a replacement pattern, is there any way to print the NUMBER of the replacement as in a counter?
I have a series of code blocks I need to process in an HTML file, but in each replaced block I need to increment the counter by 1.
So
<p class-"foo">Some text</p>
<p class-"foo">Other text</p>
needs to be
<p id="1">Some text</p>
<p id="2">Other text</p>
I have many lines; I would love to avoid manually entering those numbers. How can I do this the simplest way?
In Perl:
my $html = <<END;
<p class="foo">Some text</p>
<p class="foo">Other text</p>
END
my $n = 0;
$html =~ s/<p class="toc0">/'<p class="foo" id="'.++$n.'">'/eg;
print $html;
OUTPUT
<p id="1">Some text</p>
<p id="2">Other text</p>
For a command-line version to read from a file
perl -pe 's/<p class="toc0">/q(<p class="foo" id=").++$n.q(">)/eg' myfile.html
You can write:
perl -pe 's/<p class-"foo">/"<p id=\"" . (++$count) . "\">"/eg'
using the /e flag to treat the replacement as an expression rather than a string.
Using awk:
{ cat - <<EOS
<p class-"foo">Some text</p>
<p class-"foo">Other text</p>
EOS
} | awk '/<p class/{sub(/class-".*"/, "id=\"" ++i "\""); print}'
output:
<p id="1">Some text</p>
<p id="2">Other text</p>
I hope this helps.
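If you want to update the HTML file in place rather than piping a here-document, GNU awk 4.1+ has an inplace extension; a sketch along the same lines, assuming gawk and a file named myfile.html:
gawk -i inplace '/<p class-"foo">/{ sub(/class-".*"/, "id=\"" ++i "\"") } 1' myfile.html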
perl -pwe 's/<p\s+\Kclass-"foo">/ $i++; qq(id="$i">) /e' yourfile
The \K is used to keep whatever comes before it, which is convenient in this case. Using a replacement that is evaluated and contains several (two) statements is also convenient, to avoid concatenating and complicating quoting. It is only the return value of the entire statement that is inserted, i.e. the last statement.
When you've tried it out and want to alter the files, you can simply add the -i option. I would recommend using backups, like so:
perl -i.bak -pwe '....etc'
(Backup in filename.ext.bak)
irb(main):001:0> s = %Q{<p class-"foo">Some text</p>\n<p class-"foo">Other text</p>}
irb(main):002:0> id=0; s.gsub(/class-"foo"/) { id+=1; %Q[id="#{id}"] }
=> "<p id=\"1\">Some text</p>\n<p id=\"2\">Other text</p>"

Use the contents of a file to replace a string using SED

What would be the sed command for mac shell scripting that would replace all occurrences of the string "fox" with the entire contents of myFile.txt?
myFile.txt would be HTML content with line breaks and all kinds of characters. An example would be:
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
Thanks!
EDIT 1
This is my actual code:
sed -i.bkp '/Q/{
    s/Q//g
    r /Users/ericbrotto/Desktop/question.txt
}' $file
When I run it I get:
sed in place editing only works for regular files.
And in my files the Q is replaced by a ton of Chinese characters (!). Bizarre!
You can use the r command. When you find a 'fox' in the input...
/fox/{
...replace it with nothing...
s/fox//g
...and read the file you want to insert:
r f.html
}
If you have a file such as:
$ cat file.txt
the
quick
brown
fox
jumps
over
the lazy dog
fox dog
the result is:
$ sed '/fox/{
    s/fox//g
    r f.html
}' file.txt
the
quick
brown
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
jumps
over
the lazy dog
dog
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
EDIT: to alter the file being processed, just pass the -i flag to sed:
sed -i '/fox/{
    s/fox//g
    r f.html
}' file.txt
Some sed versions (such as mine) require you to pass an extension to the -i flag, which becomes the extension of a backup file holding the old contents of the file:
sed -i.bkp '/fox/{
    s/fox//g
    r f.html
}' file.txt
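On macOS (BSD sed), which the question asks about, the extension argument to -i is mandatory but may be empty if you don't want a backup file; a sketch of the same script:
sed -i '' '/fox/{
    s/fox//g
    r f.html
}' file.txt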
And here is the same /fox/ script as a one-liner, which is also compatible with a Makefile:
sed -i -e '/fox/{r f.html' -e 'd}' file.txt
Ultimately, here is what I went with, which is a lot simpler than many of the solutions I found online:
str=xxxx
sed -e "/$str/r FileB" -e "/$str/d" FileA
Supports templating like so:
str=xxxx
sed -e "/$str/r $fileToInsert" -e "/$str/d" $fileToModify
Another method (a minor variation on the other solutions):
If your filenames are also variable (e.g. $file is f.html and the file you are updating is $targetFile):
sed -e "/fox/ {" -e "r $file" -e "d" -e "}" -i "$targetFile"

Shell: Extract some code from HTML

I have the following code snippet from an HTML file:
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
I want to extract the code
520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL
from the HTML.
Please note that <img src="" style="display:none;"/> must be used, because there are other similar URLs in the HTML file but I only want the ones inside <img src="" style="display:none;"/> tags.
My Code is:
cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'
Something seems to be wrong.
You can solve it by using a positive lookahead / lookbehind:
cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*.jpg\" style=\"display:none;\"/>)"
Regexp breakdown:
.*? match all characters reluctantly
(?<=<img src=...ges/I/) preceded by <img .../I/
(?=\._...ne;\"/>) succeeded by ._...ne;\"/>
I assume you were looking for a lookbehind to start, which is what was throwing the error.
(?<=foo) not (?<foo).
This gives the result you specified, but I do not know whether you need everything up to the .jpg or not:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'
Up to and excluding the .jpg would be:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=.jpg)'
And if you consider gawk a valid bash solution:
awk -F'[/|\._]' -v img='/<img src="" style="display:none;"\/>/' '/img/{print $7}' file
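If you'd rather avoid PCRE lookarounds and awk field splitting, a plain sed substitution keyed on the style="display:none;" attribute is another option; a sketch:
sed -n 's|.*<img src="http://example\.com/images/I/\([^.]*\)\..*" style="display:none;"/>.*|\1|p' HTML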
