how to replace character in html attribute value (shell / bash)? - bash

Sorry for the stupid question, but I have been stuck all afternoon with this simple problem. So I have a sample text file containing:
<product productId="123456" description="good apple, very green" publicPriceTTC="5,07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12,47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5,07" brand-id="154" />
And I'd like to modify this file into:
<product productId="123456" description="good apple, very green" publicPriceTTC="5.07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12.47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5.07" brand-id="154" />
Basically, I need to replace the "," (comma) by a "." (point) in all values of "publicPriceTTC". The trick here is that other attributes might have commas in their values ("description" in this example). I guess sed or awk can do that but I was unable to achieve it.
Can someone help me? Thank you very much for any help.

If you simply search for a comma and replace it with a point, you will be doing a very coarse search/replace. Try something more specific. With sed, assuming your input file is called xml:
sed -E 's/(publicPriceTTC="[0-9]+),([0-9]+")/\1.\2/' xml
You probably know that sed has the command s/<what you search>/<replacement>. We use that.
The -E option enables extended regular expressions. The pattern matches the attribute name, the equals sign and the quoted number; each pair of parentheses captures the part of the match inside it so that it can be reused in the replacement. \1 stands for the text captured by the first pair of parentheses, \2 for the second.
You could of course make the search more robust to cope with whitespace between the tag and the equal sign and so on.
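As a rough sketch of what "more robust" could look like, the variant below also tolerates whitespace around the equals sign and, thanks to the g flag, handles several products on one line. The function name fix_price and the here-document are only illustrative, and this is still a regex working on text rather than an XML parser:

```shell
# fix_price is a hypothetical helper name; it filters stdin to stdout.
fix_price() {
  # [[:space:]]* allows optional whitespace on either side of "=".
  sed -E 's/(publicPriceTTC[[:space:]]*=[[:space:]]*"[0-9]+),([0-9]+")/\1.\2/g'
}

# Sample line from the question, with spaces added around "=".
fix_price <<'EOF'
<product productId="123456" description="good apple, very green" publicPriceTTC = "5,07" brand-id="152" />
EOF
```

The comma in the description is left alone because the pattern only matches a comma sitting between digits inside the publicPriceTTC value.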

An awk solution to this might be:
awk '/<product/{for(i=1;i<=NF;i++){if($i~/^publicPriceTTC="/)sub(/,/,".",$i)}}1' file.xml
This steps through every whitespace-separated "field" on every <product> line, looking for "words" that begin with the attribute you're trying to modify. If one is found, sub() replaces the first comma in that field with a period (use gsub() to replace every comma). Note that modifying a field makes awk rebuild that line with single spaces between fields.
A simpler awk solution to emulate what others are doing with sed would be nice, except that standard awk does not support backreferences to parenthesized subexpressions (i.e. \1) in the replacement string. Gawk supports them in the gensub() function, so the following might suffice:
gawk '{print gensub(/(publicPriceTTC="[0-9]+),/,"\\1.","g")}' file.xml
But ... you are solving the wrong problem here. Tools like sed and awk, which process files based on regular expressions, are not XML parsers. Either Javier's sed solution or my awk solutions could garble things accidentally, or miss certain things that are in perfectly valid XML files. Regex cannot be used to parse XML safely.
I recommend that you look into using python or perl or ruby or php or some other language with native XML support.
For example, turning your input into actual XML like this:
<p>
<product productId="123456" description="good apple, very green" publicPriceTTC="5,07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12,47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5,07" brand-id="154" />
</p>
We could run a PHP one-liner:
php -r '$x=new SimpleXMLElement(file_get_contents("file.xml")); foreach($x->product as $p) { $p["publicPriceTTC"]=str_replace(",",".",$p["publicPriceTTC"]); } print $x->asXML();'
Or split out for easier reading (and commenting):
<?php
// Read an XML file into an object
$x=new SimpleXMLElement(file_get_contents("file.xml"));
// Step through the object, fixing attributes as we find them
foreach($x->product as $p) {
    $p["publicPriceTTC"] = str_replace(",",".",$p["publicPriceTTC"]);
}
// Print the result
print $x->asXML();

This works with GNU sed (it uses only basic regular expressions, so other POSIX seds should accept it too):
sed 's/\(publicPriceTTC="[0-9]*\),/\1./' fileName

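Here, using sub in awk is enough, provided the price is always the seventh whitespace-separated field, as it is in the sample (this breaks as soon as a description has a different number of words):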
awk '{sub(/,/,".",$7)}1' file

Related

sed to remove section of text from a variable

So I think I've cracked the regex but just can't crack how to get sed to make the changes. I have a variable which is this:
MAKEVAR = EPICS_BASE=$CI_PROJECT_DIR/3.16/base IPAC=$CI_PROJECT_DIR/3.16/support/ipac SNCSEQ=$CI_PROJECT_DIR/3.16/support/seq
(All one line). But I want to delete the particular section defining IPAC so my regex looks like this:
(IPAC.+\s)
I know from using this tool that that should be correct:
https://www.regextester.com/98103
However when I run different iterations of trying out sed like:
sed 's/(IPAC.+\s)/\&/g' <<< "$MAKEVAR"
And then echo out MAKEVAR, the IPAC section still exists.
How can I update a particular section of text in a shell variable to remove a section beginning with IPAC up until the next space?
Thanks in advance
regextester (or any other online tool) is a great way to verify that a regexp works in that online tool. Unfortunately, that doesn't mean it'll work in any given command-line tool. In particular, your regexp includes \s, which is specific to PCREs and some GNU tools, and it uses (...) to delineate capture groups, which only works in EREs and PCREs, not in the BREs that sed uses by default, where you'd have to write \(...\). Your replacement text is \&, which tells sed to replace the string that matches the regexp with a literal &, when in fact you just want to remove it.
This is how to do what I think you're trying to do using any sed:
$ sed 's/IPAC[^ ]* //' <<< "$MAKEVAR"
EPICS_BASE=$CI_PROJECT_DIR/3.16/base SNCSEQ=$CI_PROJECT_DIR/3.16/support/seq
Nevermind, found a workaround:
MAKEVAR=$(sed -E 's/(IPAC.+ipac)//' <<<"$MAKEVAR")
A shorter version:
MAKEVAR=$(sed 's/IPAC.*ipac//' <<< "$MAKEVAR")
IPAC.*ipac matches all the way from the first IPAC to the last ipac, and the matched text is removed. The space that followed the match stays, so two consecutive spaces are left behind.
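One thing worth spelling out: sed only writes its result to standard output and never changes the shell variable, which is why echoing MAKEVAR still showed the IPAC entry. A minimal sketch of the round trip, with made-up paths standing in for $CI_PROJECT_DIR and a hypothetical helper name remove_ipac:

```shell
# remove_ipac is a hypothetical helper: prints its argument without the IPAC=... entry.
remove_ipac() {
  # Delete "IPAC", everything up to the next space, and that space.
  printf '%s\n' "$1" | sed 's/IPAC[^ ]* //'
}

# Made-up paths stand in for $CI_PROJECT_DIR.
MAKEVAR='EPICS_BASE=/ci/3.16/base IPAC=/ci/3.16/support/ipac SNCSEQ=/ci/3.16/support/seq'

# Capture sed's output and assign it back to the variable.
MAKEVAR=$(remove_ipac "$MAKEVAR")
printf '%s\n' "$MAKEVAR"
```

This pattern expects a space after the IPAC entry, so an IPAC entry at the very end of the string would be left untouched.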

Extract a substring (value of an HTML node tag) in a bash/zsh script

I'm trying to extract a tag value of an HTML node that I already have in a variable.
I'm currently using Zsh but I'm trying to make it work in Bash as well.
The current variable has the value:
<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>
and I would like to get the value of data-count (in this case 0, but could be any length integer).
I have tried using cut, sed and the variables expansion as explained in this question but I haven't managed to adapt the regexs, or maybe it has to be done differently for Zsh.
There is no reason why sed would not work in this situation. For your specific case, I would do something like this:
sed 's/.*data-count="\([0-9]*\)".*/\1/g' file_name.txt
Basically, sed looks for a line containing data-count="...", saves the digits matched inside the parentheses \(...\) as \1, and prints \1 in place of the match (which is the full line, due to the .* on both sides).
Could you please try following.
awk 'match($0,/data-count=[^ ]*/){print substr($0,RSTART+12,RLENGTH-13)}' Input_file
Explanation: awk's match function tests the regex data-count=[^ ]*, i.e. everything from data-count= up to the next space. If a match is found, the built-in variables RSTART and RLENGTH are set to its start position and length. substr then skips the 12 characters of data-count=" and drops the closing quote as well, leaving only the value of data-count.
With sed could you please try following.
sed 's/.*data-count=\"\([^"]*\).*/\1/' Input_file
Explanation: the \(...\) group captures everything after data-count=" up to the next double quote. Because the pattern matches the whole line, the s (substitute) command replaces the line with \1, the captured value.
As was said before, to be on the safe side and handle any syntactically valid HTML tag, a parser would be strongly advised. But if you know in advance, what the general format of your HTML element will look like, the following hack might come handy:
Assume that your variable is called "html"
html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
First adapt it a bit:
htmlx="tag ${html%??}"
This will add the string tag in front and remove the final />
Now make an associative array:
declare -A fields
fields=( ${=$(tr = ' ' <<<$htmlx)} )
The tr turns the equal sign into a space and the ${= handles word splitting. You can now access the values of your attributes by, say,
echo $fields[data-count]
Note that this still has the surrounding double quotes. You can easily remove them with
echo ${${fields[data-count]%?}#?}
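Of course, once you do this hack, you have access to all attributes in the same way. Be aware that ${=...}, the unbraced $fields[...] subscript and the nested ${${...}} expansion are Zsh syntax, so this will not run in Bash as written, and that it breaks if an attribute value contains a space or an equals sign.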
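Since the question also mentions parameter expansion, here is a sketch that uses only POSIX prefix and suffix removal, so it should behave the same in Bash and Zsh. The function name get_data_count is made up, and the code assumes the data-count attribute is present and double-quoted:

```shell
# get_data_count is a hypothetical helper: prints the data-count value of its argument.
get_data_count() {
  # Strip the shortest prefix ending in data-count=" ...
  gdc_rest=${1#*data-count=\"}
  # ... then strip everything from the next double quote onwards.
  gdc_value=${gdc_rest%%\"*}
  printf '%s\n' "$gdc_value"
}

html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
get_data_count "$html"   # prints 0
```

No external process is started here, which can matter when this runs for many elements in a loop.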

Sed keep original indentation and camel-casing a variable

I have a simple sed script that dynamically replaces a bunch of lines in my application using a variable; the variable comes from a list of strings. My function works but does not keep the original indentation. The function deletes the line if it contains a certain string and replaces it with a completely new line; I could not do a plain substitution due to certain syntax restrictions.
How do I keep my original indentation when the line is replaced
Can I capitalize my variable and remove the underscore on the fly, i.e. the title is a capitalize and underscore removed version of the variableName, the list of items in the variable array is really long so I am trying to do this in one shot.
Ex: I want report_type -> Report Type done mid process
Is there a better way to solve this with sed? Thanks for any inputs much appreciated.
sed function is as follows
variableName=$1
sed -i "/name\=\"${variableName}\.name\" value\=model\.${variableName}\.name options\=\#lists\./c\\{\{\> \_dropdown title\=\"${variableName}\" required\=true name\=\"${variableName}\"\}\}" test
SAMPLE INPUT
{{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}
SAMPLE EXPECTED OUTPUT
{{> _dropdown title="Report Type" required=true name="report_type" value=model.report_type.name}}
sample input variable
report_type
Try this:
sed -E "s/^(\s+).*name\=\"(report_type)\.name\" value\=model\.report_type\.name options\=\#lists\..*$/\1\{\{\> \_dropdown title\=\"\2\" required\=true name\=\"\2\"\}\}/;T;s/\"(\w+)_(\w+)\"/\"\u\1 \u\2\"/g" input.txt > output.txt
I used "report_type" instead of ${variableName} for testing as an sed one-liner.
Please change back to ${variableName}.
Then go back to using -i (in addition to -E, which is for extended regex).
I am not sure whether I can do it without extended regex, let me know if that is necessary.
use s/// to replace fine tuned line
first capture group for the white space making the indentation
second capture group for the variable name
stop if that did not replace anything, T;
another s///
look for something consisting of only letters between "",
with a "_" between two parts,
seems safe enough because this step is only done on the already replaced line
replace by two parts, without "_"
\u to capitalize the first letter of each part (title case rather than camel case; \u is a GNU sed extension)
Note:
Doing this on your sample input creates two very similar lines.
I assume that is intentional. Otherwise please provide desired output.
Using GNU sed version 4.2.1.
Interesting line of output:
{{> _dropdown title="Report Type" required=true name="Report Type"}}
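If the aim is to keep name="report_type" unchanged while only the title becomes "Report Type", one option is to build the title in the shell before calling sed. The following is only a sketch: to_title and convert_line are made-up helper names, the pattern assumes the exact line layout of the sample input, and the replacement follows the expected output from the question:

```shell
# to_title: report_type -> Report Type
to_title() {
  printf '%s\n' "$1" | awk -F_ '{
    # Capitalize each underscore-separated part; awk rejoins them with spaces.
    for (i = 1; i <= NF; i++) $i = toupper(substr($i, 1, 1)) substr($i, 2)
    print
  }'
}

# convert_line VARIABLE_NAME: filters stdin, rewriting the matching _select line.
convert_line() {
  cl_title=$(to_title "$1")
  # \1 holds the leading whitespace, so the original indentation survives.
  sed -E "s/^([[:space:]]*)\{\{> _select .* name=\"$1\.name\" value=model\.$1\.name options=#lists\.[^}]*\}\}/\1{{> _dropdown title=\"${cl_title}\" required=true name=\"$1\" value=model.$1.name}}/"
}

convert_line report_type <<'EOF'
    {{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}
EOF
```

Computing the title outside sed avoids the second substitution, so the name attribute is never touched. The variable name is pasted into the sed script as-is, so it should contain only letters, digits and underscores.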

Find and replace in file with script

I want to find and replace the VALUE in an XML file:
<test name="NAME" value="VALUE"/>
I have to filter by name (because there are lot of lines like that).
Is it possible ?
Thanks for your help.
Since you tagged the question "bash", I assume that you're not trying to use an XML library (although I think an XML expert might be able to give you something like an XSLT processor command that solves this question very robustly), but that you're simply interested in doing search & replace from the commandline.
I am using perl for this:
perl -pi -e 's#VALUE#replacement#g' *.xml
See the perlrun man page. Very briefly: -p makes perl loop over the input line by line and print each line after running your code, -i stands for "in-place", and -e lets you specify an expression to apply to all lines of input.
Also note (if you are not too familiar with that already) that you may use other characters than # (common ones are %, a comma, etc.) that don't clash with your search & replacement strings.
There is one small caveat: perl will read & write all files given on the commandline, even those that did not change. Thus, the files' modification times will be updated even if they did not change. (I usually work around that with some more shell magic, e.g. using grep -l or grin -l to select files for perl to work on.)
EDIT: If I understand your comments correctly, you also need help with the regular expression to apply. Let me briefly suggest something like this then:
perl -pi -e 's,(name="NAME" value=)"[^"]*",$1"NEWVALUE",g' *.xml
Related: bash XHTML parsing using xpath
You can use sed:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/' test.xml
where test.xml is the XML document containing the given node. This is very fragile, and you can work to make it more flexible if you need to do this substitution multiple times. For instance, the current statement is case sensitive, so it won't substitute the value on a node with name="name", but you can add a case-insensitivity flag (I, a GNU sed extension) to the end of the statement, like so:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/I' test.xml
Another option would be to use XSLT, but it would require you to download an external library. It's pretty versatile, and could be a viable option for more complex modifications to an XML document.
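Since the question asks to filter by name, the substitution can be parameterized. A minimal sketch, where set_value is a hypothetical helper and both arguments are assumed to be free of /, &, backslashes, double quotes and regex metacharacters:

```shell
# set_value NAME NEWVALUE: filters stdin, changing value="..." only on
# <test> elements whose name attribute equals NAME.
set_value() {
  sed "s/\(<test name=\"$1\" value=\"\)[^\"]*\"/\1$2\"/"
}

set_value NAME replacement <<'EOF'
<test name="OTHER" value="VALUE"/>
<test name="NAME" value="VALUE"/>
EOF
```

Unlike a plain search for VALUE, this replaces whatever the current value is, and only on the line with the matching name. It still relies on name and value appearing in exactly this order with single spaces.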

in bash, bash remove punctuation between pattern matches?

I am struggling with a conversion of a data file to csv when there is punctuation in the title field.
I have a bash script that obtains the file and processes it, and it almost works. What gets me is when there are commas in a free text title field, which then create extra fields.
I have tried some sed examples to replace between patterns but I have not gotten any of them to work. What I want to do is work between two patterns and replace commas with either nothing or perhaps a semicolon.
Taking this string:
name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,
Replacing with this:
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,
I should probably use "title:" and ",current_" to denote the start and end of the block where I want to make the change to avoid situations like this:
name:A100040,title:Re-title current periodicals, recent books,current_balance:50000,
So far I have not gotten the substitution to match. In this case I am using !! to make the change obvious:
teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
echo $teststring |sed '/title:/,/current_/s/,/!!/g'
name:A100040!!title:Oatmeal is better with raisins!! dates!! and sugar!!current_balance:50000!!
Any help appreciated.
This is one way which could undoubtedly be refined:
perl -ple 'm/(.*?)(title:.*?)(current_balance:.*)/; $save = $part = $2; $part =~ s/,/!!/g; s/$save/$part/'
First, using sed or awk to parse CSV is almost always the wrong thing to do, because they do not allow field delimiters to be quoted. That said, it seems like a better approach would be to quote the fields so that your output would be:
name:"A100040",title:"Oatmeal ... , dates, and sugar",current_balance:50000
Using sed you can try: (this is fragile)
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
If you insist on trying to parse the csv with "standard" tools and you consider perl to be standard, you could try:
perl -pe '1 while s/,([^,:]*),/ $1,/g'
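The attempt in the question fails because /title:/,/current_/ is an address range that selects whole lines, not a region inside a line, so s/,/!!/g still runs over the entire line. One way to restrict the change to the title field with plain sed is a small loop, sketched here under the assumption that the title is always followed by ,current_balance: (strip_title_commas is a made-up name):

```shell
strip_title_commas() {
  # Remove one comma between "title:" and ",current_balance:" per pass;
  # "ta" jumps back to label "a" for as long as a substitution succeeded.
  sed -e ':a' -e 's/\(title:[^,]*\),\(.*,current_balance:\)/\1\2/' -e 'ta'
}

teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
printf '%s\n' "$teststring" | strip_title_commas
```

The comma directly in front of current_balance: is kept because the second group requires it, so the field delimiter survives.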
