My problem: How to find lines with unmatched left angle brackets and replace these brackets with their HTML equivalents.
Example input:
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid <10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
Expected output by substituting the unmatched '<10%' string:
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid &lt;10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
There are German 'Umlaute' included in my example text just in case they could 'mess something up'...
I would like to use sed or awk if possible.
I have read:
Use sed with regex and (, How to decrement (substract) number in file with sed and
sed - regex square brackets detection in Linux and other Q&A but I can't seem to get my head around regexes. Sorry!
Thanks a lot for your help!
This is a dangerous proposal, because sed works on a line-by-line basis, and for each line, there are several cases to consider:
There could be only the less-than character without any html tags:
<p>
x < 10
</p>
There could be, as in your example, a html tag after the less-than character
<p> x < 10 </p>
The less-than character could be inside a html tag.
<img src="..." alt="Graph for x < 10">
It could be a really long html tag which is closed in a later line.
<img
src="..."
alt="..."
>
What I'd do is first assume that only the first two cases are present, then use something like this:
sed -i.orig -r 's/<([^>]*($|<))/\&lt;\1/g' file
This will keep a backup of the original file with the new extension .orig, so that you can then run a diff program over both to see what has changed.
As for how this works:
s/AAA/BBB/g replaces any occurrence of AAA with BBB
s/A(CC)/B\1/g replaces ACC with BCC, that is the part in the parenthesis is inserted for the \1
[^>]* means zero or more of any characters other than >
($|<) is either the end of line or <, whichever comes first.
So it searches for a < that is not followed by a > before either the next < or the end of the line, and replaces that part with &lt; followed by everything it found after the initial <.
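As a quick check of the explanation above, the following sketch runs the command on the first two cases. The sample text is made up for illustration, and -E is used in place of -r (both enable extended regular expressions in GNU sed).

```shell
# Hypothetical sample input covering the first two cases listed above.
input='<p>
x < 10
</p>
<p> x < 10 </p>'

# Escape a "<" that is not closed by ">" before the next "<" or the end of line.
# The "&" is written as "\&" because a bare "&" means "the whole match" in sed.
escaped=$(printf '%s\n' "$input" | sed -E 's/<([^>]*($|<))/\&lt;\1/g')
printf '%s\n' "$escaped"
```

Feeding it the third case (the img tag with x < 10 inside the alt attribute) would escape the opening < of the tag instead, which is why comparing the result against the .orig backup is worthwhile.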
This might be good enough:
$ sed -E 's/<([^>]+<)/\&lt;\1/g' file
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid &lt;10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
If not, then edit your question to provide a more complete (but still concise and testable) example that truly represents your real input.
There's nothing special about an umlaut or any other input character, by the way.
Situation: Using a shell script (bash/ksh), there is a message that should be shown in the console log, and subsequently sent via email.
Problem: There are newline characters in the message.
Example below:
ErrMsg="File names must be unique. Please correct and rerun.
Duplicate names are listed below:
File 1.txt
File 1.txt
File 2.txt
File 2.txt
File 2.txt"
echo "${ErrMsg}"
# OK. After showing the message in the console log, send an email
Question: How can these newline characters be translated into HTML line breaks for the email?
Constraint: We must use HTML email. Downstream processes (such as Microsoft Outlook) are too inconsistent for anything else to be of use. Simple text email is usually a good choice, but off the table for this situation.
To be clear, the newlines do not need to be completely removed, but HTML line breaks must be inserted wherever there is a newline character.
This question is being asked because I have already attempted to use several commands, such as sed, tr, and awk with varying degrees of success.
TL;DR: The following snippet will do the job:
ErrMsg=`echo "$ErrMsg"|awk 1 ORS='<br/>'`
Just make sure there are double quotes around the variable when using echo.
This turned out to be a tricky situation. Some notes of explanation are below.
Using sed
Turns out, sed reads through input line by line, which makes finding and replacing those newlines somewhat outside the norm. There were several clever tricks that appeared to work, but I felt they were far too complicated to apply appropriately to this rather simple situation.
Using tr
According to this answer the tr command should work. Unfortunately, this only translates character by character. The two character strings are not the same length, and I am limited to translating the newline into a space or other single character.
For the following:
ErrMsg="Line 1
Line 2
"
ErrMsg=`printf '%s' "$ErrMsg" | tr '\n' 'BREAK'`
# You might expect:
# "Line 1BREAKLine 2BREAK"
# But instead you get:
# "Line 1BLine 2B"
echo "${ErrMsg}"
Using awk
Using awk according to this answer initially appeared to work, but due to some other circumstances with echo there was a subtle problem. The solution is noted in this forum.
You must have double quotes around your variable, or the shell will split the value into words and echo will join them with spaces, losing all newlines. (Of course, awk will receive the text with a newline at the end, because echo appends one after its output.)
This snippet is good: (line breaks in the middle are preserved and replaced correctly)
ErrMsg=`echo "$ErrMsg"|awk 1 ORS='<br/>'`
This snippet is bad: (the unquoted variable loses its newlines, which become spaces; only one line break at the end)
ErrMsg=`echo $ErrMsg|awk 1 ORS='<br/>'`
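The difference between the two snippets can be seen side by side in a minimal sketch. The two-line message is a made-up example, and a POSIX-style shell such as bash or ksh is assumed.

```shell
# Hypothetical two-line message.
ErrMsg='Line 1
Line 2'

# Quoted: the newline survives until awk, which prints each line followed by ORS.
good=$(echo "$ErrMsg" | awk 1 ORS='<br/>')

# Unquoted: the shell splits the value into words, so awk sees a single line.
bad=$(echo $ErrMsg | awk 1 ORS='<br/>')

printf '%s\n' "$good"   # Line 1<br/>Line 2<br/>
printf '%s\n' "$bad"    # Line 1 Line 2<br/>
```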
You can wrap your message in HTML using <pre>, something like
<pre>
${ErrMsg}
and more.
</pre>
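A minimal sketch of the <pre> approach in a script might look like the following. The variable name HtmlBody and the surrounding html/body tags are assumptions for illustration, and the message is a shortened, made-up one; if the message could contain < or &, those characters would need escaping first.

```shell
# Hypothetical shortened message.
ErrMsg='File names must be unique.
File 1.txt
File 2.txt'

# Wrap the message in <pre> so the newlines are kept as-is in the HTML body.
HtmlBody=$(printf '<html><body>\n<pre>\n%s\n</pre>\n</body></html>\n' "$ErrMsg")
printf '%s\n' "$HtmlBody"
```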
I have a simple sed script that dynamically replaces a bunch of lines in my application using a variable; the variable comes from a list of strings. My function works but does not keep the original indentation. The function deletes the line if it contains a certain string and replaces it with a completely new line; I could not do a substitution due to certain syntax restrictions.
How do I keep my original indentation when the line is replaced?
Can I capitalize my variable and remove the underscore on the fly? That is, the title is a capitalized version of the variable name with the underscore removed. The list of items in the variable array is really long, so I am trying to do this in one shot.
Ex: I want report_type -> Report Type done mid-process.
Is there a better way to solve this with sed? Thanks, any input is much appreciated.
sed function is as follows
variableName=$1
sed -i "/name\=\"${variableName}\.name\" value\=model\.${variableName}\.name options\=\#lists\./c\\{\{\> \_dropdown title\=\"${variableName}\" required\=true name\=\"${variableName}\"\}\}" test
SAMPLE INPUT
{{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}
SAMPLE EXPECTED OUTPUT
{{> _dropdown title="Report Type" required=true name="report_type" value=model.report_type.name}}
sample input variable
report_type
Try this:
sed -E "s/^(\s+).*name\=\"(report_type)\.name\" value\=model\.report_type\.name options\=\#lists\..*$/\1\{\{\> \_dropdown title\=\"\2\" required\=true name\=\"\2\"\}\}/;T;s/\"(\w+)_(\w+)\"/\"\u\1 \u\2\"/g" input.txt > output.txt
I used "report_type" instead of ${variableName} for testing as an sed one-liner.
Please change back to ${variableName}.
Then go back to using -i (in addition to -E, which is for extended regex).
I am not sure whether I can do it without extended regex, let me know if that is necessary.
use s/// to replace fine tuned line
first capture group for the white space making the indentation
second capture group for the variable name
stop if that did not replace anything, T;
another s///
look for something consisting of only letters between "",
with a "_" between two parts,
seems safe enough because this step is only done on the already replaced line
replace by two parts, without "_"
\u uppercases the first letter of each part, giving title case (a GNU sed extension)
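The second substitution from the steps above can be tried in isolation. This is a small sketch assuming GNU sed (for \w and \u); the input string is made up.

```shell
# Hypothetical fragment containing a quoted name with one underscore.
quoted='title="report_type"'

# Split the quoted name at the underscore and uppercase the first letter of each part.
titled=$(printf '%s\n' "$quoted" | sed -E 's/"(\w+)_(\w+)"/"\u\1 \u\2"/g')
printf '%s\n' "$titled"   # title="Report Type"
```

A name with more than one underscore would need additional handling, since the pattern only splits at a single underscore.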
Note:
Doing this on your sample input creates two very similar lines.
I assume that is intentional. Otherwise please provide desired output.
Using GNU sed version 4.2.1.
Interesting line of output:
{{> _dropdown title="Report Type" required=true name="Report Type"}}
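If the name attribute should keep the raw variable name, as in the expected output in the question, one option is to compute the title in the shell first and then do a single substitution. This is only a sketch under assumptions: GNU sed (for \u), a variable name made of lowercase words separated by underscores, a hypothetical sample line indented by four spaces, and the same replacement text as in the command above.

```shell
variableName=report_type

# Build the title once: report_type -> Report Type (\u is a GNU sed extension).
title=$(printf '%s\n' "$variableName" | sed -E 's/(^|_)([a-z])/ \u\2/g; s/^ //')

# Hypothetical input line with four spaces of indentation.
line='    {{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}'

# \1 restores the captured indentation; name keeps the raw variable name.
replaced=$(printf '%s\n' "$line" |
  sed -E "s/^([[:space:]]*).*name=\"${variableName}\.name\".*\$/\1{{> _dropdown title=\"${title}\" required=true name=\"${variableName}\"}}/")
printf '%s\n' "$replaced"
```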
I have a bit of html that looks like this:
`<p>Flannel</p><p>Plaid</p><p>Red</p>`
I want to strip the <p> and </p> tags and replace with a newline character so I end up with something like this:
Flannel
Plaid
Red
I am attempting to use this tr command:
tr '<[^>]*>' '\n'
but it is only removing the outer < and >, so I end up with this instead:
p
Flannel
/p
p
Plaid
/p
p
Red
/p
How can I modify it to remove the entire tag?
Note: I don't care if I end up with multiple newlines between the entries, those are easy to strip away later if necessary.
Unless this is a quick-and-dirty script, you should definitely use an HTML parser to handle all the intricacies of the HTML language.
A quick-and-dirty solution could be to apply this sed command :
sed 's/<[^>]*>/\n/g'
I think it does what you need with your specific example :
$ echo "<p>Flannel</p><p>Plaid</p><p>Red</p>" | sed 's/<[^>]*>/\n/g'
Flannel
Plaid
Red
Your solution doesn't work because tr operates on sets of characters, not on strings or regular expressions: it simply replaces each of the characters <, [, ^, >, ], and * that it finds with a newline, disregarding the fact that you attempted to write a regular expression.
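If the empty lines left between the entries are unwanted, a second sed can drop them. A minimal sketch, assuming GNU sed (which accepts \n in the replacement) and using the variable names html and items purely for illustration:

```shell
html='<p>Flannel</p><p>Plaid</p><p>Red</p>'

# First sed turns every tag into a newline; second sed deletes the empty lines.
items=$(printf '%s\n' "$html" | sed 's/<[^>]*>/\n/g' | sed '/^$/d')
printf '%s\n' "$items"
```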
Try this -
echo "<p>Flannel</p><p>Plaid</p><p>Red</p>"|awk '{gsub(/<[^>]*>/,"\n"); print }'
Flannel
Plaid
Red
Put them all inside the same <p> tag, then use <br> tags between each one to add a line break.
So, the code should be something like this:
<p>Flannel<br>Plaid<br>Red</p>
I have this variable in slim:
- foo = 'my \n desired multiline <br/> string'
#{foo}
When I parse the output using the slimrb command line command the contents of the variable are encoded:
my \n desired multiline &lt;br/&gt; string
How can I have slimrb output the raw contents in order to generate multi-line strings?
Note that neither .html_safe nor .raw are available.
There are two issues here. First, Ruby strings in single quotes (') don't convert \n to newlines; they remain a literal \ and n. You need to use double quotes. This applies to Slim too.
Second, Slim HTML escapes the result of interpolation by default. To avoid this use double braces around the code. Slim also HTML escapes Ruby output by default (using =). To avoid escaping in that case use double equals (==).
Combining these two, your code will look something like:
- foo = "my \n desired multiline <br/> string"
td #{{foo}}
This produces:
<td>my
desired multiline <br/> string</td>
An easier way is to use the verbatim text line indicator, | (see the Slim documentation on line indicators). For example:
p
| This line is on the left margin.
This line will have one space in front of it.
I've got some source code like the following where I call a function in C:
void myFunction (
&((int) table[1, 0]),
&((int) table[2, 0]),
&((int) table[3, 0])
);
...the only problem is that the function has >300 parameters (it's an auto-generated wrapper for initialising and calling a whole module; it was given to me and I cannot change it). And as you can see, I began accessing the array with a 1 instead of a 0... Great times: modifying all 300 parameters by hand, i.e. decreasing the x-coordinate of the array 300 times.
The solution I am looking for is how I could get sed to do the work for me ;)
EDIT: Please note that the syntax above for accessing a two-dimensional array in C is wrong anyway! Of course it should be [1][0]... (so don't just copy-and-paste ;))
Basically, the command I came up with, was the following:
sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge' inputfile.c > outputfile.c
Well, this does not look very intuitive at first sight, and I was missing good explanations for nearly every example I found.
So I will try to give a detailed explanation on this:
sed
--> basic command
-r
--> most examples you find use -e; however, the -r parameter (GNU sed only) enables extended regular expressions, which support an unescaped + in a regex. The + means "one or more occurrences of the preceding element".
's/input/output/ge'
--> this is the basic replacement syntax. It basically means "replace 'input' by 'output'". The /g is a "global" flag, i.e. sed will replace all occurrences and not only the first one. You can add an additional e (a GNU sed extension) to execute the result as a shell command. This is what we want to do here to handle the calculation.
(.*)
--> this matches "everything" from the start of the line up to the next part of the pattern
(table\[)
--> the \ is to escape the bracket. This part of the expression will match Strings like table[
([0-9]+)
--> this one matches numbers with at least one digit, however, it can also match higher numbers with more than only one digit.
(,)
--> this simply matches the comma ,
(.*)
--> and again: the rest of the line
And now the interesting part:
echo "\1\2$((\3-1))\4\5"
the echo is a bash command
the \N references (you can use every value from \1 up to \9) are some kind of "variables" for the inputs: \1 will contain the first match, \2 the second match, ... --> this helps you to preserve parts of the input string
the $((1+1)) is a simple bash syntax to calculate the value of the term inside the double brackets (in the complete sed command above, the \3 will of course be automatically replaced by the 3rd match, i.e. the 1st part inside the brackets to access the table's cells)
please note that we use quotation marks around the echo content to also be able to process lines with characters like & which would otherwise not work
The already mentioned e of /ge at the end triggers the execution of the result in the shell, but only for lines on which the substitution matched. The first line of the example source code, void myFunction (, contains no table[ and therefore passes through unchanged; the second line produces the following shell statement:
echo " &((int) table[$((1-1)), 0]),"
which is executed and, together with the unchanged first line, results in the following output:
void myFunction (
&((int) table[0, 0]),
...which is exactly what I wanted :)
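To try the command without touching a real source file, a short sketch like the following can be used. It assumes GNU sed (the e flag is a GNU extension), and the three-line excerpt is made up. Because the replacement is handed to a shell, this is only reasonable for trusted input that contains no double quotes, backticks or $ characters.

```shell
# Hypothetical three-line excerpt of the generated call.
input='void myFunction (
    &((int) table[1, 0]),
    &((int) table[12, 0])'

# Matching lines become an echo command with $((N-1)) and are run by the shell;
# lines without "table[" pass through unchanged.
result=$(printf '%s\n' "$input" |
  sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge')
printf '%s\n' "$result"
```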
BTW:
text > output.c
is simple bash syntax to output text (or in this case the sed-processed source code) to a file called output.c.
Good links about this topic are:
sed basics
regular expressions basics
Ahh and one more thing: You can also use sed in the git-Bash on Windows - if you are "forced" to use Windows at work like me ;)
PS: In the meantime I could have easily done this by hand but using sed was a lot more fun ;)
Here's another way you could do it, using Perl:
perl -pe 's/(table\[)(\d+)(,)/$1.($2-1).$3/e' file.c
This uses the e modifier to execute an expression in the replacement. The capture groups are concatenated together but the middle group has 1 subtracted from its value.
This will output to standard output so you can check that it does what you want. When you're happy, you can add the -i switch to overwrite the original file.