How to add "," after every sed match? - bash

I have this code:
cat response_error.xml | sed -ne 's#\s*<[^>]*>\s*##gp' >> response_error.csv
but all the sed matches from the XML are run together, for example:
084521AntonioCallas
I want to get this effect
084521,Antonio,Callas,
is it possible?
I must write a script that collects the XML documents from the previous day, extracts only the data (without the <...> tags) and saves it to a CSV file like this: 084521,Antonio,Callas - values separated by commas. The XML looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>
<Documento>
<TipoDocumento>
<Codigo>01</Codigo>
<Descripcion>NIF</Descripcion>
</TipoDocumento>
<NumeroDocumento>000000000</NumeroDocumento>
<PaisDocumento>
<Codigo>000</Codigo>
<Descripcion>ESPAÑA</Descripcion>
</PaisDocumento>
</Documento>
<Resumen>
<Nombre>
<Nombre1>XXX</Nombre1>
<Nombre2>XXX</Nombre2>
<ApellidosRazonSocial>XXX</ApellidosRazonSocial>
</Nombre>
<Direccion>
<Direccion>XXX</Direccion>
<NombreLocalidad>XXX</NombreLocalidad>
<CodigoLocalidad/>
<Provincia>
<Codigo>39</Codigo>
<Descripcion>XXX</Descripcion>
</Provincia>
<CodigoPostal>39012</CodigoPostal>
</Direccion>
<NumeroTotalOperacionesImpagadas>1</NumeroTotalOperacionesImpagadas>
<NumeroTotalCuotasImpagadas>0</NumeroTotalCuotasImpagadas>
<PeorSituacionPago>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPago>
<PeorSituacionPagoHistorica>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPagoHistorica>
<ImporteTotalImpagado>88.92</ImporteTotalImpagado>
<MaximoImporteImpagado>88.92</MaximoImporteImpagado>
<FechaMaximoImporteImpagado>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaMaximoImporteImpagado>
<FechaPeorSituaiconPagoHistorica>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaPeorSituaiconPagoHistorica>
<FechaAltaOperacionMasAntigua>
<DD>16</DD>
<MM>12</MM>
<AAAA>2015</AAAA>
</FechaAltaOperacionMasAntigua>
<FechaUltimaActualizacion>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaUltimaActualizacion>
</Resumen>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>

You can extract the IdSuscriptor using the following command :
xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml
And the ReferenciaConsulta using the following command :
xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml
To produce the desired IdSuscriptor,FirstName,LastName I would use the following script :
id_suscriptor=$(xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml)
referencia_consulta=$(xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml)
first_name=$(echo "$referencia_consulta" | cut -d' ' -f1)
last_name=$(echo "$referencia_consulta" | cut -d' ' -f2)
echo "$id_suscriptor,$first_name,$last_name"
Note that this assumes the ReferenciaConsulta field will always contain a string starting with the first name and last name separated with a space.
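As a sketch of that assumption, the split can also be done with plain parameter expansion instead of cut. The two values below are hard-coded stand-ins for what xmllint would return on the sample document, and the variable name rest is only there for illustration.

```shell
# Stand-in values (assumed): what the two xmllint calls return for the sample XML.
id_suscriptor='084521'
referencia_consulta='Antonio Callas 00000000'

# Everything before the first space is the first name.
first_name=${referencia_consulta%% *}
# Drop the first word, then keep everything before the next space.
rest=${referencia_consulta#* }
last_name=${rest%% *}

csv_line="$id_suscriptor,$first_name,$last_name"
echo "$csv_line"
```

If ReferenciaConsulta ever holds a single word, first_name and last_name end up identical, so the caveat above still applies.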

If you want to parse XML, use a dedicated XML parser like Saxon.
If you want to parse a strange text file with some funny unrelated angle brackets, try this:
#! /bin/sed -nf
s/^<IdSuscriptor>\([0-9]\+\)<\/IdSuscriptor>/\1,/
t match1
b next
: match1
h
b
: next
s/^<ReferenciaConsulta>\([^ ]\+\) \([^ ]\+\) [0-9]\+<\/ReferenciaConsulta>/\1,\2,/
t match2
b
: match2
H
g
s/\n//
p
Explanation
t jumps to match1 if the preceding s command did a replacement. Otherwise b jumps to next.
In case of a match h copies the matching string into the hold space and b stops the processing of the current line.
The second s command works the same way with the difference, that in case of no match b continues with the next line.
In case of the second match H appends the pattern space to the hold space, g copies the hold space to the pattern space, s removes the newline between the two matches and p prints the result.
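The same hold-space flow can be sketched with address blocks instead of labels, which some readers may find easier to follow. This is a variant, not the script above: it uses [0-9][0-9]* in place of the GNU-only \+, and the three-line sample is a cut-down stand-in for the real document.

```shell
# Cut-down stand-in for the XML above (assumed sample).
sample='<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Codigo>0000</Codigo>'

result=$(printf '%s\n' "$sample" | sed -n '
  # First tag: reduce the line to "number," and park it in the hold space.
  /^<IdSuscriptor>\([0-9][0-9]*\)<\/IdSuscriptor>$/{
    s//\1,/
    h
    d
  }
  # Second tag: reduce to "First,Last,", append to the held text, join and print.
  /^<ReferenciaConsulta>\([^ ][^ ]*\) \([^ ][^ ]*\) [0-9][0-9]*<\/ReferenciaConsulta>$/{
    s//\1,\2,/
    H
    g
    s/\n//
    p
  }
')
echo "$result"
```

The empty pattern in s// reuses the address regex that just matched, so each expression is written only once.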
Conclusion
If you do not know how to do it with sed don't try it. Try to learn a real programming language like Perl or JavaScript or Python. sed is a relic of bygone times.

If your data is in a file named d, try GNU sed:
sed -Ez 's/<[^>]*>//g;s/\n+|\s+/,/g;' d
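A variant that does not rely on the GNU-only -z flag might look like the sketch below: strip the tags line by line, drop the lines that end up empty, and join the rest with commas using paste. Unlike the one-liner above, it leaves spaces inside a value alone. The four-line sample is an assumed excerpt of the XML.

```shell
# Assumed excerpt of the XML document.
sample='<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>'

csv=$(printf '%s\n' "$sample" |
  sed 's/<[^>]*>//g' |            # remove every <...> tag
  grep -v '^[[:space:]]*$' |      # drop lines that held only tags
  paste -sd, -)                   # join the remaining lines with commas
echo "$csv"
```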

Related

Unix sed command - global replacement is not working

I have a scenario where we want to replace runs of multiple double quotes inside the data with a single double quote. Because the input is comma-delimited and every column is enclosed in double quotes, I ran into the issue explained below:
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
I tried the approach below, but only the first occurrence is handled and I am not sure what the issue is:
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
replacing all ---> from ,"", to ,"NULL",
replacing all multiple occurrences of ---> from """ or "" or """" to " (single occurrence)
replacing 1 step changes back to original ---> from ,"NULL", to ,"",
But, only first occurrence is getting changed and remaining looks same as below:
If input is :
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"
You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakup:
("")+: Match 1+ pairs of double quotes
(?="): If those pairs are followed by a single "
Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
In awk with your shown samples please try following awk code. Written and tested in GNU awk, should work in any version of awk.
awk '
BEGIN{ FS=OFS="," }
{
for(i=1;i<=NF;i++){
if($i!~/^""$/){
gsub(/"+/,"\"",$i)
}
}
}
1
' Input_file
Explanation: Set the field separator and the output field separator to , for all lines of Input_file. Then loop over the fields of each line; if a field is not empty (that is, not exactly ""), globally replace every run of one or more " in it with a single ". Finally print the line.
With sed you could repeat 1 or more times sets of "" using a group followed by matching a single "
Then in the replacement use a single "
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
sed s'#"""#"#' file
That works. I will demonstrate another method though, which you may also find useful in other situations.
#!/bin/sh -x
cat > ed1 <<EOF
4s/"""/"/
wq
EOF
cp file stack
cat stack | tr ',' '\n' > f2
ed -s f2 < ed1
cat f2 | tr '\n' ',' > stack
rm -v ./f2
rm -v ./ed1
The point of this is that if you have a big csv record all on one line and you want to edit a specific field whose number you know, you can convert all the commas to newlines and use the field number as a line number to substitute, append after it, or insert before it with ed, and then convert back to csv.
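The same field-as-line idea can be sketched without ed or temporary files. This variant swaps in sed for the line-numbered edit and paste to rejoin the fields; it assumes no field contains a comma of its own, and that the field to fix is number 4 as in the sample record.

```shell
record='"int","","123","abd"""sf123","top"'

fixed=$(printf '%s\n' "$record" |
  tr ',' '\n' |          # one field per line
  sed '4s/"""/"/' |      # edit only line 4, i.e. field 4
  paste -sd, -)          # rejoin the lines with commas
echo "$fixed"
```

Using paste instead of tr for the last step avoids the trailing comma that tr '\n' ',' leaves behind.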

sed/awk between two patterns in a file: pattern 1 set by a variable from lines of a second file; pattern 2 designated by a specified character

I have two files. One file contains a pattern that I want to match in a second file. I want to use that pattern to print between that pattern (included) up to a specified character (not included) and then concatenate into a single output file.
For instance,
File_1:
a
c
d
and File_2:
>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH
I have been using variations of this loop:
while read $id;
do
sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1
hoping to obtain something like the following output:
>a
MEEL
>c
MEHL
>d
MLWL
But have had no such luck. I have played around with grep/fgrep awk and sed and between the three cannot seem to get the right (or any output). Would someone kindly point me in the right direction?
Try:
$ awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=$2 in a} f' file1 file2
>a
MEEL
>c
MEHL
>d
MLWL
How it works
-F'>'
This sets the field separator to >.
FNR==NR{a[$1]; next}
While reading in the first file, this creates a key in array a for every line in file1.
NF==2{f=$2 in a}
For every line in file 2 that has two fields, this sets variable f to true if the second field is a key in a or false if it is not.
f
If f is true, print the line.
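One consequence of keeping the flag f between lines is that records spanning several lines are printed whole. The sketch below checks this on made-up data; the temporary directory and the extra sequence line MKKL are assumptions for the demonstration only.

```shell
tmp=$(mktemp -d)
printf '%s\n' a c > "$tmp/file1"
# Record "a" has two sequence lines here (made-up data).
printf '%s\n' '>a' MEEL MKKL '>b' MLPK '>c' MEHL > "$tmp/file2"

# f is re-evaluated only on header lines, so it stays set for every line of a wanted record.
selected=$(awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=$2 in a} f' "$tmp/file1" "$tmp/file2")
rm -r "$tmp"
echo "$selected"
```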
A plain (GNU) sed solution. Files are read only once. It is assumed that the characters in File_1 do not need to be escaped in a sed expression.
pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2
Explanation:
The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid repeatedly reading the entire File_2 for each line of File_1. It just "slurps" File_1 and replaces newline characters with | characters, so the sample File_1 becomes a string with the value a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d for this example) matches (the | alternation needs extended regular expressions, which is why the second call uses -E).
The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:
begin:
read next line (from File_2) or quit on end-of-file
label_a:
if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
print the line
read next line (from File_2) or quit on end-of-file
if line begins with `>` goto label_a else goto label_b
else goto begin
Let me try to explain why your approach does not work well:
You need to say while read id instead of while read $id.
The sed command />$id/,/>/{//!p;} will exclude the lines which start with >.
Then you might want to say something like:
while read id; do
sed -n "/^>$id/{N;p}" File_2
done < File_1
Output:
>a
MEEL
>c
MEHL
>d
MLWL
But the code above is inefficient because it reads File_2 as many times as the count of the id's in File_1.
Please try the elegant solution by John1024 instead.
If ed is available, and since the shell is already involved:
#!/usr/bin/env bash
mapfile -t to_match < file1.txt
ed -s file2.txt <<-EOF
g/\(^>[${to_match[*]}]\)/;/^>/-1p
q
EOF
This runs ed only once, not once for every line of file1 that matches. For example, if file1 holds the letters a to z, ed does not run 26 times.
Requires bash4+ because of mapfile.
How it works
mapfile -t to_match < file1.txt
Saves the entry/value from file1 in an array named to_match
ed -s file2.txt points ed to file2 with the -s flag, which suppresses the byte count ed would otherwise print when opening the file (the same number you get from wc -c file)
<<-EOF A here document, shell syntax.
g/\(^>[${to_match[*]}]\)/;/^>/-1p
g means search the whole file aka global.
( ) capture group, it needs escaping because ed only supports BRE, basic regular expression.
^> If line starts with a > the ^ is an anchor which means the start.
[ ] is a bracket expression match whatever is inside of it, in this case the value of the array "${to_match[*]}"
; Include the next address/pattern
/^>/ Match a leading >
-1 steps back one line from that match, so the range ends just before the next header.
p print whatever was matched by the pattern.
q quit ed

How to add a character to the end of a line, when a find and replace is done to the beginning?

I am creating a simple script that converts a custom markup to TeX macros:
? What are four kinds of animals?
- elephants
- tigers
- bears
- fish
e
This becomes:
\QUESTION{What are four kinds of animals?}{
\ANSWER{elephants}
\ANSWER{tigers}
\ANSWER{bears}
\ANSWER{fish}
}
I have used a simple syntax to replace the items at the front:
sed 's#^? #\\QUESTION{#' file > temp1
sed 's#^\- #\\ANSWER{#' temp1 > temp2
sed 's#^e #\}{#' temp2 > temp3
How do I get it to also add the }{ to the end when "?" is found at the beginning, and add } to the end when "-" is found at the beginning of the line?
Match the whole line instead of its beginning, and use a replacement pattern referencing the content of the line :
sed -e 's#^? \(.*\)#\\QUESTION{\1}{#' -e 's#^- \(.*\)#\\ANSWER{\1}#' -e 's#^e#}#' file
In this command \(...\) are capturing groups and \1 refers to their content.
I also took the liberty of regrouping your multiple substitutions in a single sed command.
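To see the capture groups at work, the approach can be tried on a shortened copy of the sample. This is only a sketch: the here-string sample is an assumption, each s command is terminated with its closing # delimiter, and the last pattern is tightened to ^e$ so that only the lone closing e is replaced.

```shell
markup='? What are four kinds of animals?
- elephants
- tigers
e'

tex=$(printf '%s\n' "$markup" | sed \
  -e 's#^? \(.*\)#\\QUESTION{\1}{#' \
  -e 's#^- \(.*\)#\\ANSWER{\1}#' \
  -e 's#^e$#}#')
echo "$tex"
```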
Like this:
sed -E 's/^(\? )(.*)/\\QUESTION{\2}{/;t;s/^- (.*)/\\ANSWER{\1}/;t;s/^e/}/' file
Explanation:
s/^(\? )(.*)/\\QUESTION{\2}{/ Handle lines starting with ?
t means no further actions if the above s command replaced something
s/^- (.*)/\\ANSWER{\1}/ Handle lines starting with -
t means no further actions if the above s command replaced something
s/^e/}/ Handle lines starting with e.
You can "speed it up" a bit by reordering the commands by the complexity of the search pattern, like this:
sed -E 's/^e/}/;t;s/^- (.*)/\\ANSWER{\1}/;t;s/^(\? )(.*)/\\QUESTION{\2}{/;' file
But yeah, probably micro-optimization.
You can try this sed too :
sed '/^- /s//\\ANSWER{/;/^e/s///;s/$/}/;/^? /{s//\\QUESTION{/;s/$/{/}' infile
sed '
/^- /s//\\ANSWER{/ # line with -
/^e/s/// # line with e
s/$/}/ # add } at the end of each line
/^? / { # line with ?
s//\\QUESTION{/
s/$/{/
}
' infile

Splitting on : with sed

I have a file that contains data like this
word0:secondword0
word1:secondword1
word2:secondword2
word3:secondword3
word4:secon:word4
I'd like to use sed to split that content to give me only the second word after the first colon.
The end result would look like
secondword0
secondword1
secondword2
secondword3
secon:word4
Notice how the last word has a second colon that is part of the word.
How would I write such a script that splits on only the first colon but retains the rest?
The following sed command could help you with that.
sed 's/\([^:]*\):\(.*\)/\2/' Input_file
Output will be as follows.
secondword0
secondword1
secondword2
secondword3
secon:word4
This can be done with gnu grep
grep -Po ':\K.*' <<END
word0:secondword0
word1:secondword1
word2:secondword2
word3:secondword3
word4:secon:word4
END
: matches the first occurrence of :, \K keeps the : out of the match, .* matches the rest of the line, and -o outputs only the match.
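Two more options that need neither GNU grep nor a regular expression, shown as a sketch on a two-line excerpt of the data: cut with an open-ended field range, and shell parameter expansion. The variable names are only for illustration.

```shell
input='word0:secondword0
word4:secon:word4'

# cut: print field 2 through the last field, so later colons are kept.
with_cut=$(printf '%s\n' "$input" | cut -d: -f2-)

# Parameter expansion: "#*:" removes the shortest prefix that ends in a colon.
with_expansion=$(printf '%s\n' "$input" |
  while IFS= read -r line; do printf '%s\n' "${line#*:}"; done)

printf '%s\n' "$with_cut"
```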

I need to remove with bash two characters from one long line xml string

I'm reading from stdin line by line strings like:
<xml version="1.0" encoding="UTF-8">\n<Datanode ....
I need to get rid of that \n , it is not a newline, just a nasty sequence.
I need to read it from a pipe, process it and pipe it further.
Usually I got help from tr or cut but against this sequence I cannot find the way, they either do not remove it, or remove some other "n"s from XML string as well.
So you want to remove the string made of '\' followed by 'n' ok?
Something like this should work:
... | sed 's/\\n//' | ...
or this if you want to remove multiple sequences:
... | sed 's/\\n//g' | ...
And, if you want to anchor the sequence to be removed:
... | sed 's/>\\n</></' | ...
UPDATE
In case you don't want to remove the sequence \n but replace it with a real newline (and I did notice your tag osx), you might want to use the following:
... | sed -e 's/\\n/\'$'\n/' | ...
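A small check of the removal on a made-up line, next to what tr does with the same input, may make the difference clearer. tr works on sets of single characters, so it deletes every backslash and every n, while sed matches the two-character sequence. The sample line is an assumption.

```shell
# Made-up input line containing a literal backslash followed by n.
line='<xml version="1.0" encoding="UTF-8">\n<Datanode>'

# sed removes only the two-character sequence.
cleaned=$(printf '%s\n' "$line" | sed 's/\\n//g')

# tr treats its argument as the set {backslash, n} and deletes each of them everywhere.
mangled=$(printf '%s\n' "$line" | tr -d '\\n')

printf '%s\n' "$cleaned" "$mangled"
```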
I'm assuming here that your document isn't valid XML on account of containing a text node outside the root, which would explain why you can't use conventional XML-centric tools.
To truly use only bash, and do this in a manner that's safe against corrupting your file (performs the replacement only for the exact header text only on the very first line):
correct_xml_header() {
local bad_header correct_header content
bad_header='<xml version="1.0" encoding="UTF-8">\n'
correct_header='<?xml version="1.0" encoding="UTF-8"?>'
IFS= read -r -d '' content
if [[ $content = "$bad_header"* ]]; then
content=${correct_header}${content#"$bad_header"}
fi
printf '%s' "$content"
}
You can then pipe through this function:
generate_bad_xml | correct_xml_header | consume_good_xml
If you want to add a literal newline, add $'\n' to the end of the definition of correct_header, as in:
correct_header='<?xml version="1.0" encoding="UTF-8"?>'$'\n'
Note that I'm also changing <xml ...> to <?xml ...?>, which is a change similarly necessary to make this tool's output parse correctly with XML-compliant tools.

Resources