Match the same pattern n times on the same line using sed - shell

I have an input file div.txt that looks like this:
<div>a</div>b<div>c</div>
<div>d</div>
Now I want to pick all the div tags and the text between them using sed:
sed -n 's:.*\(<div>.*</div>\).*:\1:p' < div.txt
The result I get:
<div>c</div>
<div>d</div>
What I really want:
<div>a</div>
<div>c</div>
<div>d</div>
So the question is, how to match the same pattern n times on the same line?
(Please do not suggest Perl or Python.)

This might work for you (GNU sed):
sed 's/\(<\/div>\)[^<]*/\1\n/;/^</P;D' file
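Replace the first </div>, together with any following characters that are not a <, by the </div> itself and a newline. P then prints the first line of the pattern space, but only if it begins with a <, and D deletes that first line and reruns the script on whatever remains.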
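As a quick check, the command can be run against the sample input from the question. This is a minimal sketch assuming GNU sed (the \n in the replacement is not portable); extract_divs is just a hypothetical wrapper name used for illustration.

```shell
# extract_divs: hypothetical wrapper around the GNU sed command quoted above.
extract_divs() {
  # s: terminate the first </div> (plus trailing non-< text) with a newline
  # P: print the first line of the pattern space if it starts with <
  # D: delete that first line and rerun the script on the remainder
  sed 's/\(<\/div>\)[^<]*/\1\n/;/^</P;D'
}

# Sample input taken from the question.
printf '%s\n' '<div>a</div>b<div>c</div>' '<div>d</div>' | extract_divs
```

Each div ends up on its own line because D restarts the script on the rest of the line instead of reading new input.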

Sed is not the right tool to handle HTML.
But if you really insist, and you know your input will always have properly closed pairs of div tags, you can just replace everything that's not inside a div by a newline:
sed 's=</div>.*<div>=</div>\n<div>='
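One caveat worth checking before relying on this: .* is greedy, so with more than two divs on a line it spans from the first </div> to the last <div> and drops everything in between. A small sketch, assuming GNU sed (for \n in the replacement); split_divs_greedy is a made-up name for illustration.

```shell
# split_divs_greedy: hypothetical wrapper around the command above.
split_divs_greedy() {
  sed 's=</div>.*<div>=</div>\n<div>='
}

# Two divs on a line: both survive.
printf '%s\n' '<div>a</div>b<div>c</div>' | split_divs_greedy

# Three divs on a line: the greedy .* swallows the middle one.
printf '%s\n' '<div>a</div>b<div>c</div>d<div>e</div>' | split_divs_greedy
```

So this variant is only safe when no line holds more than two divs.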

Related

Use SED to insert newline at specific points in .html file that is only one very long line

This is probably a duplicate, but I can't find the right question...
I have an HTML file that is just one line long, and I'd like to insert a newline just after each <div class=".+ uiBoxWhite noborder">, where .+ is a series of words which include special characters, and seem to be mostly random.
I thought that
sed -r 's/<div class=".+ uiBoxWhite noborder">/\n<div class="uiBoxWhite noborder">/g' old.html > new.html
would work, but it hasn't. Am I using the wrong wildcard? Or the wrong newline character?
This might work for you (GNU sed):
sed -E 's/(<div class=")[^>]*(uiBoxWhite noborder">)/\n\1\2/g' oldFile > newFile
Use [^>]* to restrict the remaining match within the current div.
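To see why [^>]* helps where .+ did not, here is a small sketch on an invented one-line sample (the class names are made up, and split_boxes is a hypothetical wrapper name). It assumes GNU sed for -E and for \n in the replacement.

```shell
# split_boxes: hypothetical wrapper around the command above.
split_boxes() {
  # [^>]* cannot cross a >, so each match stays inside one opening tag,
  # whereas .+ would run greedily to the last matching tag on the line.
  sed -E 's/(<div class=")[^>]*(uiBoxWhite noborder">)/\n\1\2/g'
}

printf '%s\n' '<p>x</p><div class="a1 b-2 uiBoxWhite noborder">one</div><div class="c_3 uiBoxWhite noborder">two</div>' | split_boxes
```

Note that, as in the question's own replacement, the random leading class words are dropped from the output.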

Using sed or awk to select

I'm trying to select the lines between two markers in an html file. I've tried using sed and awk but I think there's an issue with the way I'm escaping some of the characters. I have seen some similar questions and answers, but the examples given are simple, with no special characters. I think my escaping is the issue. I need the lines between
<div class="bread crumb">
and
</div>
There is no other div within the block and there are multiple lines within the block.
Do I need to escape the characters <, > and " as below?
sed -n -e '/^\<div class=\"bread crumb\"\>$/,/^\<\/div\>$/{ /^\<div class=\"bread crumb\">$/d; /^\<\/div>$/d; p; }'
My awk attempt :
awk '/\<div class=\"bread crumb\"\>/{flag=1;next}/\<\/div\>/{flag=0}flag'
Actually, you just need to escape the / in </div>; the rest is fine as is.
sed -n '/<div class="bread crumb">/,/<\/div>/{//!p}'
You should use a html parser for that job.
If you still want to do it with sed, don't escape < and >: in GNU sed, \< and \> are word-boundary anchors, not literal angle brackets.
Try this:
sed -ne '/<div class="bread crumb">/,/<\/div>/{//!p;}' file
The //!p part outputs all the block except the lines matching the address patterns.
Just use string matches in awk:
awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1} ' file
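The sed and awk approaches can be compared on a small invented sample (the HTML lines below are made up for illustration, as are the function names). This is a sketch, not a substitute for a real HTML parser.

```shell
# sample: hypothetical input with one bread-crumb block.
sample() {
  printf '%s\n' \
    '<p>before</p>' \
    '<div class="bread crumb">' \
    '<a href="/">Home</a>' \
    '<a href="/docs">Docs</a>' \
    '</div>' \
    '<p>after</p>'
}

# Range address plus //!p: print the block without its two marker lines.
crumbs_sed() {
  sed -n '/<div class="bread crumb">/,/<\/div>/{//!p;}'
}

# Exact string comparison: no regex escaping needed at all.
crumbs_awk() {
  awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1}'
}

sample | crumbs_sed
sample | crumbs_awk
```

Both commands print only the two <a> lines. The awk version requires the marker lines to match exactly, with no surrounding whitespace.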

Need help in displaying the output using shell script [duplicate]

Please help me in solving the below issue. I have a file:
mat rat
mat dog
mat matress
I need to display
rat
dog
matress
I used this sed command to produce the output: sed "s/$up//g"
($up contains mat). But with this command, I get the following output:
rat
dog
ress
What do I do to resolve this?
Please help.
The /g flag tries to apply the substitution command multiple times for each line. First two lines are fine because the word only appears once, but for the third line it will remove both.
You can solve it by being more specific, using zero-width assertions like ^ or the GNU extension \b:
sed "s/^$up//g"
or
sed "s/$up\b//g"
Although the easiest fix may be to drop the flag:
sed "s/$up//"
In all three cases the result is the same, at least for simple examples like this one.
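The effect of the g flag can be checked directly. This is a small sketch; strip_all and strip_first are made-up helper names, and the word is passed as the first argument instead of through $up.

```shell
# strip_all: remove every occurrence of the word ($1) on each line.
strip_all() {
  sed "s/$1//g"
}

# strip_first: remove only the first occurrence on each line.
strip_first() {
  sed "s/$1//"
}

printf '%s\n' 'mat rat' 'mat dog' 'mat matress' | strip_all mat
printf '%s\n' 'mat rat' 'mat dog' 'mat matress' | strip_first mat
```

With g, the third line loses both occurrences of mat; without it, matress is left intact. Both variants leave the space that followed the removed word at the start of each line.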
Using awk
awk '{print $NF}' inputFile
Test:
$ cat text
mat rat
mat dog
mat matress
$ awk '{print $NF}' text
rat
dog
matress
Your current command will remove all instances of $up anywhere, including multiple occurrences in a line and occurrences in the middle of a line.
If you want to match only $up at the very beginning of a line, and only when it is a whole (whitespace-delimited) word, try the following command:
sed "s/^$up\>//"
In GNU sed, ^ matches the beginning of a line, and \> matches the end of a word (the zero-width position between a word character and a following non-word character, or the end of the line).
If there might be whitespace before $up, you can use
sed "s/^\(\s*\)$up\>/\1/"
This will remove just the $up and preserve all whitespace.
If you don't want to keep the whitespace between $up and the text after it, you can replace \> with \s\+, which matches one or more (\+) whitespace characters (\s); i.e.,
sed "s/^$up\s\+//"
sed "s/^\(\s*\)$up\s\+/\1/"
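A short sketch of the anchored variant on the question's data, assuming GNU sed (\s and \+ are GNU extensions). strip_leading_word is a hypothetical name, and the word is interpolated into the regex unescaped, so it should not contain regex metacharacters.

```shell
# strip_leading_word: remove the word ($1) and the whitespace after it,
# but only when the word is the first thing on the line.
strip_leading_word() {
  sed "s/^$1\s\+//"
}

printf '%s\n' 'mat rat' 'mat dog' 'mat matress' | strip_leading_word mat

# A line that merely starts with the same letters is left alone.
printf '%s\n' 'matress mat' | strip_leading_word mat
```

Because \s\+ demands whitespace right after the word, a line beginning with matress is not touched.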
sed 's/^mat //' /path/to/file should do the trick. Note that there is no g: it's s/foo/bar/, not s/foo/bar/g. Also, the ^ anchors the match to the beginning of each line.
If you are indeed assigning a variable such as $up, you can use sed "s/^$up//" /path/to/file.

sed copy substring from fixed position and copy it in front of line

I'm dealing with many csv files and I can't find a way with sed to select a substring at a fixed position (chars 9-16) and copy it at the beginning of the line.
This is what I have:
ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT
This is what I need:
00692434;ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT
The following code in sed gives the substring I need (00692434) but overwrites the whole line:
sed 's/^.{8}(.{8}).*/\1/'
I'm already using sed to "clean" the linestrings and inserting some variables, called in a bash script that at the end imports data in postgres. This is why I would prefer to remain within sed, but any hint will be greatly appreciated as I'm not a real expert.
You need to escape the curly braces (\{\}) as well as the parentheses (\(\)) and also append the original string (&) in the replacement:
text="ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT"
echo "$text" | sed "s/^.\{8\}\(.\{8\}\).*/\1;&/"
Output:
00692434;ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT
Since you want to extract a fixed-length substring at a fixed position, you could also do this with just bash-builtins:
text="ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT"
echo "${text:8:8};$text"
This might work for you (GNU sed):
sed -r 's/^.{8}(.{8})/\1;&/' file
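For reference, the same substitution can be written with basic regular expressions only, so it does not depend on -r. A minimal sketch; copy_chars_9_16 is a made-up function name.

```shell
# copy_chars_9_16: prepend characters 9-16 of each line, followed by ;.
# & stands for the matched text (the first 16 characters); the rest of
# the line is untouched because the pattern does not consume it.
copy_chars_9_16() {
  sed 's/^.\{8\}\(.\{8\}\)/\1;&/'
}

printf '%s\n' 'ABC09638006924340017;SOME_TEXT;SOME_OTHER_TEXT' | copy_chars_9_16
```

Lines shorter than 16 characters do not match and pass through unchanged.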

Using sed to replace the first instance of an entire line beginning with string

I am attempting to write a bash script that will use sed to replace an entire line in a text file beginning with a given string, and I only want it to perform this replacement for the first match.
For example, in my text file I may have:
hair=brown
age=25
eyes=blue
age=35
weight=177
And I may want to simply replace the first occurrence of a line beginning with "age" with a different number without affecting the 2nd instance of age:
hair=brown
age=55
eyes=blue
age=35
weight=177
So far, I've come up with
sed -i "0,/^PATTERN/s/^PATTERN/PATTERN=XY/" test.txt
but this will only replace the string "age" itself rather than the entire line. I've been trying to throw a "\c" in there somewhere to change the entire line but nothing is working so far. Does anyone have any ideas as to how this can be resolved? Thanks.
Like @ruakh suggests, you can use
sed -i "0,/^PATTERN/ s/^PATTERN=.*$/PATTERN=XY/" test.txt
A shorter and less repetitive way of doing the same would be
sed -i '0,/^\(PATTERN=\).*/s//\1XY/' test.txt
which takes advantage of backreferences and of the fact that an empty pattern in an s command reuses the most recently used regular expression.
The 0,/regex/ address range only works in GNU sed. An alternative might be to let sed quit at the first match and have cat copy the rest of the redirected input:
{ sed '/^\(PATTERN=\).*/{s//\1VAL/;q;}'; cat; } < file
or use awk:
awk '$1=="LABEL" && !n++ {$2="VALUE"}1' FS=\\= OFS=\\= file
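Applied to the age example from the question, the two main approaches might look like the sketch below. The function names and the key/value arguments are illustrative assumptions. The sed variant needs GNU sed for the 0,/re/ address, and both variants treat the key as literal text without regex metacharacters.

```shell
# replace_first_sed KEY VALUE: GNU sed only, because of the 0,/re/ address.
# The empty pattern in s// reuses the regex from the address.
replace_first_sed() {
  sed "0,/^$1=.*/s//$1=$2/"
}

# replace_first_awk KEY VALUE: split on =, rewrite field 2 of the first
# line whose first field equals KEY, then print every line.
replace_first_awk() {
  awk -F= -v OFS== -v key="$1" -v val="$2" '$1 == key && !n++ { $2 = val } 1'
}

printf '%s\n' 'hair=brown' 'age=25' 'eyes=blue' 'age=35' 'weight=177' | replace_first_sed age 55
printf '%s\n' 'hair=brown' 'age=25' 'eyes=blue' 'age=35' 'weight=177' | replace_first_awk age 55
```

Both change only the first age line to age=55 and leave age=35 as it is.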
