Replace token in a file with any content - bash

As a final result I want to copy several lines of text from file input.html to output.html.
input.html
<body>
<h1>Input File</h1>
<!-- START:TEMPLATES -->
<div>
<p>Lorem Ipsum & Lorem Ipsum</p>
<span>Path: /home/users/abc.txt</span>
</div>
<!-- END:TEMPLATES -->
<body>
template.html
<body>
<h1>Template File</h1>
<!-- INSERT:TEMPLATES -->
<p>This is a Text with & /</p>
<body>
I tried different things in PowerShell and Bash to get this done, but without success.
Getting the input into a variable works with:
content="$(sed -e '/START:TEMPLATES/,/END:TEMPLATES/!d' input.html)"
But replacing the placeholder in another file with that content seems impossible. I tried sed and awk. Both have a lot of problems if the variable contains special characters such as & or /.
output.html
<body>
<h1>Output File</h1>
<div>
<p>Lorem Ipsum & Lorem Ipsum</p>
<span>Path: /home/users/abc.txt</span>
</div>
<p>This is a Text with & /</p>
<body>
Thank you for any input that helps solve my problem.

If the START/END comments are on separate lines I'd build a simple parser for the input file like this:
$inTemplate = $false
$template = switch -Wildcard -File '.\input.html' {
'*<!-- START:TEMPLATES -->*'{
$inTemplate = $true
}
'*<!-- END:TEMPLATES -->*'{
$inTemplate = $false
}
default{
if($inTemplate){
$_
}
}
}
Now we can do the same thing for the template file:
$output = switch -Wildcard -File '.\template.html' {
'*<!-- INSERT:TEMPLATES -->*'{
# return our template input
$template
}
default{
# otherwise return the input string as is
$_
}
}
# output to file
$output |Set-Content output.html

Solution with awk.
Read all lines of input.html between <!-- START:TEMPLATES --> and <!-- END:TEMPLATES --> and store them in an array insert_var.
In the END section, read template.html line by line in a while loop. If a line contains <!-- INSERT:TEMPLATES -->, print the contents of insert_var in its place; otherwise print the line unchanged.
The output is redirected to output.html.
As far as I know, awk does not mangle those special characters here, since the lines are only stored and printed.
awk -v temp_file="template.html" '
BEGIN{input_line_num=1}
/<!-- END:TEMPLATES -->/{linestart=0}
{ if(( linestart >= 1)) {insert_var[input_line_num]=$0; input_line_num++}}
/<!-- START:TEMPLATES -->/{linestart=1}
END{ while ((getline<temp_file) > 0)
{if (( $0 ~ "<!-- INSERT:TEMPLATES -->"))
{for ( i = 1;i < input_line_num; i++) {print insert_var[i]}}
else { print } }}
' input.html > output.html
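The same idea can also be written as a single pass over both files. The following is a minimal sketch rather than the answer's own code: it assumes the marker comments sit on their own lines as in the question, recreates the sample files in a scratch directory purely for illustration, and the names inBlock and block are made up here. The captured lines are only concatenated and printed, never used as a pattern or a replacement string, which is why & and / pass through untouched.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory (illustrative only).
work=$(mktemp -d) && cd "$work" || exit 1

cat > input.html <<'EOF'
<body>
<h1>Input File</h1>
<!-- START:TEMPLATES -->
<div>
<p>Lorem Ipsum & Lorem Ipsum</p>
<span>Path: /home/users/abc.txt</span>
</div>
<!-- END:TEMPLATES -->
<body>
EOF

cat > template.html <<'EOF'
<body>
<h1>Template File</h1>
<!-- INSERT:TEMPLATES -->
<p>This is a Text with & /</p>
<body>
EOF

awk '
# First file (input.html): collect the lines strictly between the markers.
NR == FNR {
    if (/<!-- END:TEMPLATES -->/)   inBlock = 0
    if (inBlock)                    block = block $0 ORS
    if (/<!-- START:TEMPLATES -->/) inBlock = 1
    next
}
# Second file (template.html): swap the placeholder line for the collected block.
/<!-- INSERT:TEMPLATES -->/ { printf "%s", block; next }
{ print }
' input.html template.html > output.html

cat output.html
```

One caveat of the NR == FNR idiom: if input.html is empty, awk treats template.html as the first file, so an emptiness check would be needed in a real script.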

Related

How to extract a fragment of text from file?

I have a file with the following text:
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
I want to do the following:
Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > symbol after <a class='a' and </a> — it must be stored in a[1];
Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > symbol after <a class='b' and </a> — it must be stored in b[1];
Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using the space symbol as separator);
If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate three current values of a[1], b[1] and t (using ; as separator) and print the result to the output file, then set flag to 0, then clear the variable t.
I used the following code (for gawk 4.0.1):
gawk 'BEGIN {flag = 0; t = ""; }
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ ) {
match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a) };
if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ ) {
match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b) };
if ($0 ~ /<div class=\x27start\x27>/ ) {
match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
t = s[1];
flag = 1 };
if (flag == 1) {
if ($0 ~ /^<br><br><b>end<\/b>/) {
str = a[1] ";" b[1] ";" t;
print(str) > "output.txt";
flag = 0; str = ""; t = "" }
else {
t = t " " $0 }
}
}' input.txt
I was expecting the following output:
a1;b2;123 <br>ghij. <br>klmn
But the output is:
;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn
Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How to fix the code to obtain the expected output?
Here are the answers to your specific questions:
Q) Why are a[1] and b[1] empty?
A) They aren't when I try your script with gawk 5.1.1, so most likely either there's a bug in your awk version, or some of the white space in your input isn't blanks as your script requires (maybe it's tabs), or you have some control chars, or your awk version doesn't like using \x27 instead of \047 for 's.
Q) Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output?
A) Because you forgot a next in the block that matches on div, so the following block also executes for that line and appends $0 from the div line.
Q) How to fix the code to obtain the expected output?
A) Here's how I'd approach your problem, using GNU awk for the 3rd arg to match() and \s shorthand for [[:space:]]:
$ cat tst.sh
#!/usr/bin/env bash
gawk '
BEGIN { OFS=";" }
match($0, /^<b>(.):<\/b>\s+<a\s+class=\047.\047\s+href=\047\/.\/[0-9]+\/?\047>(.*)<\/a>/, arr) {
vals[arr[1]] = arr[2]
}
match($0, /^.*<div\s+class=\047start\047>(.*)/, arr) {
vals["div"] = arr[1]
inDiv = 1
next
}
inDiv {
if ( /^<br><br><b>end<\/b>/ ) {
print vals["a"], vals["b"], vals["div"]
delete vals
inDiv = 0
}
else {
vals["div"] = vals["div"] " " $0
}
}
' 'input.txt' > 'output.txt'
$ ./tst.sh
$ cat output.txt
a1;b2;123 <br>ghij. <br>klmn
So
I'm using a single match() to capture all values for lines that look like your a, b, c lines for consistency, conciseness, and maintainability.
I'm always saving the match results in an array named arr rather than different arrays per occurrence so I don't have to remember to keep deleting those arrays and the code that uses the matches can all be homogenized.
I'm using a single associative array vals[] to hold all values indexed by the letter after <b> so we don't need to test those letters and create separate variables, it's easy to clear the data by just deleting the array rather than having to set multiple variables to null, and it's easy to add the c or any other similar values to the output later if desired.
I'm using \s+ instead of a single blank char for every space in the input to be agnostic about the actual space char(s) and number of spaces used.
I'm using \047 instead of \x27 to match 's for portability and robustness, see http://awk.freeshell.org/PrintASingleQuote.
I'm letting the shell handle all input/output rather than including output redirection in the awk script for consistency and improved robustness in error scenarios like files that can't be opened.
I named my flag variable inDiv rather than flag so it tells us what it means, i.e. that we're in the div block of the input, for improved clarity and ease of future maintenance. Naming a flag variable flag is like naming a numeric variable number instead of sum, count, ave, tot, diff or something else meaningful that'd improve your script. When you see people use f for the name of a flag variable, that f is shorthand for found, not for flag.
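The effect of the missing next can be reproduced in isolation. This is a small sketch with a made-up two-line input and a made-up <start> marker (neither comes from the question); it only uses POSIX awk features, so it should behave the same in gawk and other awks.

```shell
#!/usr/bin/env bash
# Two input lines: the line that opens the block, then one continuation line.
input=$(printf '%s\n' 'x <start>123' 'more')

# Without "next", the append rule also fires for the opening line itself.
without_next=$(printf '%s\n' "$input" | awk '
/<start>/ { t = substr($0, index($0, "<start>") + 7); inBlock = 1 }
inBlock   { t = t " " $0 }
END       { print t }')

# With "next", awk skips the remaining rules and moves on to the following line.
with_next=$(printf '%s\n' "$input" | awk '
/<start>/ { t = substr($0, index($0, "<start>") + 7); inBlock = 1; next }
inBlock   { t = t " " $0 }
END       { print t }')

printf 'without next: %s\nwith next:    %s\n' "$without_next" "$with_next"
```

The first variant reproduces the symptom from the question: the whole opening line is appended after the text that was just extracted from it.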
Demonstrating that gawk's regexes don't match perl's
perl:
$ echo aaaab | perl -nE '/a*(a+b)/ && say $1'
ab
$ echo aaaab | perl -nE '/a*?(a+b)/ && say $1'
aaaab
a*? matched the shortest sequence of zero or more a's, and the greedy a+ consumed the rest.
gawk
$ echo aaaab | gawk 'match($0, /a*(a+b)/, m) {print m[1]}'
ab
$ echo aaaab | gawk 'match($0, /a*?(a+b)/, m) {print m[1]}'
ab
Not the same behaviour: a*? is still greedy.
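Since POSIX-style extended regular expressions have no lazy quantifiers, the usual substitute is a negated bracket expression that cannot run past the next delimiter. The sketch below shows the idea with bash's =~ operator, which uses the same regex family; the sample line is hypothetical and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sample line with two anchors, so greedy and bounded captures differ.
line="<a class='a' href='/a/1'>a1</a><br><a class='z' href='/z/9'>z9</a>"

greedy_re="<a [^>]*>(.*)</a>"      # .* runs on to the last </a> on the line
bounded_re="<a [^>]*>([^<]*)</a>"  # [^<]* cannot cross into the next tag

[[ $line =~ $greedy_re ]]  && greedy=${BASH_REMATCH[1]}
[[ $line =~ $bounded_re ]] && bounded=${BASH_REMATCH[1]}

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The same bracket-expression trick works inside gawk's match(), as long as the captured text itself never contains the excluded character.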
Why(...)a[1](...)empty?
The match function returns 0 if no match was found, which makes it easy to check whether that is the case. I took the part that fills the a array and altered it a bit:
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}>(.*)<\/a>/, a);
}
}
then ran it against
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
and got output
2 0
So the condition in the if worked as expected, since the line with <b>a:</b>... is the 2nd line, but no match was found. This means your regular expression is wrong. On examination, it is missing one single quote; it should be
/^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/
then
{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
print a[1];
}
}
gives the output
2 1
a1
(tested in gawk 4.2.1)

Inherit indentation when using cat in bash function to print multiple lines

In the middle of doing cat <<, if we invoke a bash function that uses cat << as well, the indentation is only inherited for the first line.
This is better explained using a simple example script:
#!/bin/bash
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
return
}
cat << _EOF_
<html>
    $(write_multiple_lines)
</html>
_EOF_
The result is as follows (the <p> doesn't follow <h1>'s indentation).
<html>
    <h1>Header</h1>
<p>Paragraph</p>
</html>
while the desired result is
<html>
    <h1>Header</h1>
    <p>Paragraph</p>
</html>
I was expecting the indentation to be inherited when cat << is used. Is there any workaround for this (other than manually adding indentation to subsequent lines, as pointed out by @bob dylan in the comment)?
The only way to 'preserve' it is to change your input file. The reason why only <h1> is indented is that the indentation comes from the line containing the command substitution, so it is emitted just once:
    $(write_multiple_lines)
Since you don't want to change your input, e.g.
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
return
}
You could change it to emit the spaces for you as it prints each line, e.g.
#!/bin/bash
write_multiple_lines() {
# IFS= and -r keep each line exactly as written
while IFS= read -r p; do
printf '    %s\n' "$p"
done <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
return
}
cat << _EOF_
<html>
$(write_multiple_lines)
</html>
_EOF_
output:
<html>
    <h1>Header</h1>
    <p>Paragraph</p>
</html>
Though this is less dynamic / obvious than if you formatted it verbatim, so I'd stick with my original suggestion before doing something like this.
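Another option that leaves the function untouched is to indent its output at the call site. This is only a sketch: the four-space width and the render wrapper function are assumptions made for the example, and sed is used purely to prefix each line.

```shell
#!/usr/bin/env bash
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
}

# Hypothetical wrapper: the substitution starts at column 0 and sed adds the indent
# to every line of the function output, not just the first one.
render() {
cat <<_EOF_
<html>
$(write_multiple_lines | sed 's/^/    /')
</html>
_EOF_
}

page=$(render)
printf '%s\n' "$page"
```

Because the indent is applied by the caller, the same function can be reused at different nesting depths by changing only the sed prefix.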

Bash insert content into n th html tag

My Template.html contains two <pre> tags to which content from two different files needs to be inserted. The following inserts file content for all matches. How to insert only into 1st or 2nd <pre> tag?
sed -i -e '/<pre>/r file1.txt' Template.html
Template.html:
<html>
<body>
<h1>
<pre>
</pre>
<div>
<pre>
</pre>
</body>
</html>
file1.txt
hello
world
file2.txt
may
june
Expected Result:
<html>
<body>
<h1>
<pre>
hello
world
</pre>
<div>
<pre>
may
june
</pre>
</body>
</html>
sed is for doing simple s/old/new, that is all. It sounds like what you want would be something like this (set tgt to 1 or 2 or whichever <pre> you want the block to be inserted after):
awk -v tgt=1 '
NR==FNR { rec = rec $0 ORS; next }
{ print }
/<pre>/ && (++cnt == tgt) { printf "%s", rec }
' file1.txt Template.html
but with neither an example of file1.txt nor the expected output it's just an untested guess.
This might work for you (GNU sed):
sed -e '/<pre>/{x;s/^/x/;/^x\{1\}$/{x;r file1.txt' -e 'x};x}' Template.html
On encountering a line with the required tag, increment a counter in the hold space.
If the counter matches the required number (in this case 1) append the text file.
Thus the following will append the file after the third occurrence of the tag.
sed -e '/<pre>/{x;s/^/x/;/^x\{3\}$/{x;r file1.txt' -e 'x};x}' Template.html
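To fill both <pre> tags in one run, the counting approach can be extended so that the n-th tag receives the n-th content file. The sketch below is an untested-in-production illustration under a few assumptions: the sample files are recreated in a scratch directory, neither content file is empty (FNR == 1 is used to detect the start of each file), and the names fileNo, body and cnt are made up.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory (illustrative only).
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' hello world > file1.txt
printf '%s\n' may june    > file2.txt
printf '%s\n' '<html>' '<body>' '<h1>' '<pre>' '</pre>' '<div>' '<pre>' '</pre>' '</body>' '</html>' > Template.html

result=$(awk '
FNR == 1    { fileNo++ }                                  # a new input file starts
fileNo <= 2 { body[fileNo] = body[fileNo] $0 ORS; next }  # buffer file1.txt and file2.txt
            { print }                                     # echo every template line
/<pre>/     { printf "%s", body[++cnt] }                  # after the n-th <pre>, emit file n
' file1.txt file2.txt Template.html)

printf '%s\n' "$result"
```

Any further <pre> tags beyond the second would simply get nothing inserted, since body[3] and up are empty.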

Replacing filename placeholder with file contents in sed

I'm trying to write a basic script to compile HTML file includes.
The premise goes like this:
I have 3 files
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
include1.html
<span>
banana
</span>
include2.html
<span>
apple
</span>
My desired output would be:
output.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
I've tried the following:
sed "s|#include \(.*)|$(cat \1)|" test.html >output.html
This returns cat: 1: No such file or directory
sed "s|#include \(.*)|cat \1|" test.html >output.html
This runs but gives:
output.html
<div>
cat include1.html
<div>content</div>
cat include2.html
</div>
Any ideas on how to run cat inside sed using group substitution? Or perhaps another solution.
I wrote this 15-20 years ago to recursively include files and it's included in the article I wrote about how/when to use getline under "Applications" then "d)". I tweaked it now to work with your specific "#include" directive, provide indenting to match the "#include" indentation, and added a safeguard against infinite recursion (e.g. file A includes file B and file B includes file A):
$ cat tst.awk
function read(file,indent) {
if ( isOpen[file]++ ) {
print "Infinite recursion detected" | "cat>&2"
exit 1
}
while ( (getline < file) > 0) {
if ($1 == "#include") {
match($0,/^[[:space:]]+/)
read($2,indent substr($0,1,RLENGTH))
} else {
print indent $0
}
}
close(file)
delete isOpen[file]
}
BEGIN{
read(ARGV[1],"")
exit
}
.
$ awk -f tst.awk test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Note that if include1.html itself contained a #include ... directive then it'd be honored too, and so on. Look:
$ for i in test.html include?.html; do printf -- '-----\n%s\n' "$i"; cat "$i"; done
-----
test.html
<div>
#include include1.html
<div>content</div>
#include include2.html
</div>
-----
include1.html
<span>
#include include3.html
</span>
-----
include2.html
<span>
apple
</span>
-----
include3.html
<div>
#include include4.html
</div>
-----
include4.html
<span>
grape
</span>
.
$ awk -f tst.awk test.html
<div>
<span>
<div>
<span>
grape
</span>
</div>
</span>
<div>content</div>
<span>
apple
</span>
</div>
With a non-GNU awk I'd expect it to fail after about 20 levels of recursion with a "too many open files" error so get gawk if you need to go deeper than that or you'd have to write your own file management code.
If you have GNU sed, you can use the e flag to the s command, which executes the current pattern space as a shell command and replaces it with the output:
$ sed 's/#include/cat/e' test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Notice that this doesn't take care of indentation, as the included files don't have any. An HTML prettifier like Tidy can help you further for this:
$ sed 's/#include/cat/e' test.html | tidy -iq --show-body-only yes
<div>
<span>banana</span>
<div>
content
</div><span>apple</span>
</div>
sed has a command to read a file, r, but the filename can't be generated on the fly.
As Ed points out in his comment, this is vulnerable to shell command injection: if you have something like
#include $(date)
you'll notice that the date command was actually run. This can be prevented, but the conciseness of the original solution is out the window then:
sed 's|#include \(.*\)|cat "$(/usr/bin/printf "%q" '\''\1'\'')"|e' test.html
This still replaces #include with cat, but additionally wraps the rest of the line into a command substitution with printf "%q", so a line such as
#include include1.html
becomes
cat "$(/usr/bin/printf "%q" 'include1.html')"
before being executed as a command. This expands to
cat include1.html
but if the file were named $(date), it becomes
cat '$(date)'
(note the single quotes), preventing the injected command from being executed.
Because s///e seems to use /bin/sh as its shell, you can't rely on Bash's %q format specification in printf to exist, hence the absolute path to the printf binary. For readability, I've changed the / delimiters of the s command to | (so I don't have to escape \/usr\/bin\/printf).
Lastly, the quoting mess around \1 is to get a single quote into a single quoted string: '\'' becomes '.
You may use this bash script that uses a regex to detect line starting with #include and grabs include filename using a capture group:
re="#include +([^[:space:]]+)"
while IFS= read -r line; do
[[ $line =~ $re ]] && cat "${BASH_REMATCH[1]}" || echo "$line"
done < test.html
<div>
<span>
banana
</span>
<div>content</div>
<span>
apple
</span>
</div>
Alternatively you may use this awk script to do the same:
awk '$1 == "#include"{system("cat " $2); next} 1' test.html
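Note that system("cat " $2) passes the file name through a shell, so it is open to the same command injection discussed for the sed e flag. A non-recursive sketch that reads the file inside awk with getline avoids the shell entirely; the sample files are recreated in a scratch directory for illustration, and the variable names file and line are arbitrary.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory (illustrative only).
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' '<div>' '#include include1.html' '<div>content</div>' '#include include2.html' '</div>' > test.html
printf '%s\n' '<span>' 'banana' '</span>' > include1.html
printf '%s\n' '<span>' 'apple' '</span>'  > include2.html

result=$(awk '
$1 == "#include" {
    file = $2
    # Read the named file inside awk, so no shell ever interprets the name.
    while ((getline line < file) > 0) print line
    close(file)
    next
}
{ print }
' test.html)

printf '%s\n' "$result"
```

Unlike the recursive tst.awk above, this version does not expand #include lines found inside the included files, and a missing include file is silently skipped.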

Multiple occurrences in sed substitution

I am trying to retrieve some data within a specific div tag in my html file.
My current html code is in the following format.
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract just the content of the div with class2, using the following sed command
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
I get an output HTML file containing
some text some text </div> Some more text </div> Too much text
I want all the data after the first </div> to be removed, but instead only the final one is removed.
Can someone please explain my mistake?
You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file loop through the array, skipping the first and last lines.
Instead of making an array, you could check for the first and last lines using a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
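As for the sed attempt in the question: .* is greedy, so a capture written as \(.*\) runs on to the end of the flattened input instead of stopping at the first </div>. A possible fix, sketched below, bounds the capture with [^<]* so it cannot pass the next tag. This assumes the class2 div contains no nested tags and that the attribute is spelled with spaces around = as in the sample; the variable names are illustrative.

```shell
#!/usr/bin/env bash
html='<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>'

# Flatten to one line, capture up to the first "<" after the class2 tag,
# then trim the leading and trailing blanks left over from the newlines.
inner=$(printf '%s\n' "$html" | tr '\n' ' ' |
    sed -e 's/.*<div class = "class2">\([^<]*\)<\/div>.*/\1/' \
        -e 's/^ *//' -e 's/ *$//')

printf '%s\n' "$inner"
```

For anything beyond simple, regular markup like this, a real HTML parser is the more robust choice.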
This works for retrieving text inside the div class = "class2" tags
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'

Resources