I need to get the value of the code attribute from the c tags in the XML below.
random.xml
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
Currently the logic is:
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How does the above logic work to get the values of code from tag c?
Getting the expected output:
abc
efg
Firstly observe that
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
is of dubious quality, because:
egrep does not need standard input; it can read the file itself, so the cat is useless
the pattern is simple enough to work equally well in plain grep, so there is no need to summon extended grep
1 is set as the field separator in awk (-F1), but the code makes no use of fields
After fixing these issues, the code looks like this:
grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How it works: grep selects the lines that contain <c followed by zero or more characters followed by />. awk is then told (RS='"') that records are separated by double quotes ("). When a record contains code=, f is set to that record's number (NR). A record is printed when f is non-zero and the current record number minus one equals f, which means: print the record that comes directly after a record containing code=.
Observe that GNU AWK is poorly suited for working with XML, and using regular expressions against XML is a very poor idea, as XML is not a regular (Chomsky Type 3) language.
If possible, use proper tools for working with XML data. For example, hxselect might be used in the following way. Let the content of file.xml be
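The record-by-record logic described above can be spelled out with a named flag instead of the NR arithmetic. This is only a sketch under the same assumptions as the original pipeline (one c tag per line, double-quoted attribute values); extract_code_values is a hypothetical helper name, and the pattern is tightened to " code=" at the end of a record so that an attribute such as barcode= is not picked up.

```shell
# Hypothetical helper: print the value of every code="..." attribute
# found on lines that contain a self-closing <c .../> tag.
extract_code_values() {
  grep '<c.*/>' | awk -v RS='"' '
    take { print; take = 0 }   # previous record ended with code=, so this record is the value
    / code=$/ { take = 1 }     # this record ends with the attribute name: take the next one
  '
}

printf '%s\n' \
  '<c id="123" code="abc" date="12-12-2022"/>' \
  '<c id="123" code="efg" date="12-12-2022"/>' \
  '<c id="123" date="12-12-2022"/>' | extract_code_values
```

With RS='"' each quoted value becomes a record of its own, which is why the value always sits in the record right after the one ending in code=.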
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
then
hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml
gives output
abc
efg
Explanation: -c outputs just the value rather than name and value; -s '\n' separates the results with a newline, so each value is on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. This solution is more robust than the cat-egrep-awk pipeline because it is immune to, for example, different whitespace usage in the file (whitespace outside tags is optional in XML).
This might be an awk question but parsing XML should be done with XML tools.
Here's an example with Xidel (available here for a few OSs) and a standard XPath expression:
xidel --xpath '//c[@code]/@code' random.xml
note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.
Output
abc
efg
If your input always looks like the sample XML, then you can make the code attribute itself a field separator and < the record separator, so that the value is simply the second field whenever the first field is the tag name c:
awk -F' .*code="|" ' -vRS='<' '$1=="c"{print $2}'
Demo: https://awk.js.org/?snippet=Lz6yx7
Related
<div class="panel-body" id="current-conditions-body">
<!-- Graphic and temperatures -->
<div id="current_conditions-summary" class="pull-left" >
<img src="newimages/large/sct.png" alt="" class="pull-left" />
<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>
I am trying to extract the "64" in line 6. I was thinking of using awk '/<p class="myforecast-current-lrg">/{print}', but this only gave me the full line. I think I need to use sed, but I don't know how.
Assumptions:
input is nicely formatted as per the sample provided by OP so we can use some 'simple' pattern matching
Modifying OP's current awk code:
# use split() function to break line using dual delimiters ">" and "&"; print 2nd array entry
awk '/<p class="myforecast-current-lrg">/{ n=split($0,arr,"[>&]");print arr[2]}'
# define dual input field delimiter as ">" and "&"; print 2nd field in line that matches search string
awk -F'[>&]' ' /<p class="myforecast-current-lrg">/{print $2}'
Both of these generate:
64
One sed idea:
sed -En 's/.*<p class="myforecast-current-lrg">([^&]+)&deg;.*/\1/p'
This generates:
64
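A variant that does not depend on what follows the number is to capture only the leading digits. This is a minimal sketch in portable basic regular expressions (no -E needed); current_temp_f is a hypothetical name, and it assumes the temperature is a non-negative integer placed directly after the opening tag.

```shell
# Hypothetical helper: print the integer that directly follows the
# <p class="myforecast-current-lrg"> opening tag.
current_temp_f() {
  sed -n 's/.*<p class="myforecast-current-lrg">\([0-9][0-9]*\).*/\1/p'
}

printf '%s\n' '<p class="myforecast-current-lrg">64&deg;F</p>' | current_temp_f
```

Because only digits are captured, the same command works whether the page writes the degree sign as an entity or as a literal character.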
I have this code:
cat response_error.xml | sed -ne 's#\s*<[^>]*>\s*##gp' >> response_error.csv
but all the values sed extracts from the XML are run together, for example:
084521AntonioCallas
I want to get this result:
084521,Antonio,Callas,
Is it possible?
I must write a script which collects the XML documents from the previous day, extracts only the data without the <...> tags, and saves this information to a CSV file like this: 084521,Antonio,Callas (values separated by commas). The XML looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>
<Documento>
<TipoDocumento>
<Codigo>01</Codigo>
<Descripcion>NIF</Descripcion>
</TipoDocumento>
<NumeroDocumento>000000000</NumeroDocumento>
<PaisDocumento>
<Codigo>000</Codigo>
<Descripcion>ESPAÑA</Descripcion>
</PaisDocumento>
</Documento>
<Resumen>
<Nombre>
<Nombre1>XXX</Nombre1>
<Nombre2>XXX</Nombre2>
<ApellidosRazonSocial>XXX</ApellidosRazonSocial>
</Nombre>
<Direccion>
<Direccion>XXX</Direccion>
<NombreLocalidad>XXX</NombreLocalidad>
<CodigoLocalidad/>
<Provincia>
<Codigo>39</Codigo>
<Descripcion>XXX</Descripcion>
</Provincia>
<CodigoPostal>39012</CodigoPostal>
</Direccion>
<NumeroTotalOperacionesImpagadas>1</NumeroTotalOperacionesImpagadas>
<NumeroTotalCuotasImpagadas>0</NumeroTotalCuotasImpagadas>
<PeorSituacionPago>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPago>
<PeorSituacionPagoHistorica>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPagoHistorica>
<ImporteTotalImpagado>88.92</ImporteTotalImpagado>
<MaximoImporteImpagado>88.92</MaximoImporteImpagado>
<FechaMaximoImporteImpagado>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaMaximoImporteImpagado>
<FechaPeorSituaiconPagoHistorica>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaPeorSituaiconPagoHistorica>
<FechaAltaOperacionMasAntigua>
<DD>16</DD>
<MM>12</MM>
<AAAA>2015</AAAA>
</FechaAltaOperacionMasAntigua>
<FechaUltimaActualizacion>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaUltimaActualizacion>
</Resumen>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>
You can extract the IdSuscriptor using the following command :
xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml
And the ReferenciaConsulta using the following command :
xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml
To produce the desired IdSuscriptor,FirstName,LastName I would use the following script:
id_suscriptor=$(xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml)
referencia_consulta=$(xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml)
first_name=$(echo "$referencia_consulta" | cut -d' ' -f1)
last_name=$(echo "$referencia_consulta" | cut -d' ' -f2)
echo "$id_suscriptor,$first_name,$last_name"
Note that this assumes the ReferenciaConsulta field will always contain a string starting with the first name and last name separated with a space.
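The splitting step can also be done with the shell's read builtin instead of two cut calls. A small sketch, assuming the same "first name, space, last name, rest" layout; to_csv_row is a hypothetical helper that takes the two strings already extracted from the XML.

```shell
# Hypothetical helper: $1 is the IdSuscriptor text, $2 the ReferenciaConsulta text.
# read splits $2 on whitespace; anything after the second word lands in _rest.
to_csv_row() {
  printf '%s\n' "$2" | {
    read -r first_name last_name _rest
    printf '%s,%s,%s\n' "$1" "$first_name" "$last_name"
  }
}

to_csv_row 084521 'Antonio Callas 00000000'
```

In a real script the two arguments would be the id_suscriptor and referencia_consulta variables.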
If you want to parse XML, use a dedicated XML parser like Saxon.
If you want to parse a strange text file with some funny unrelated angle brackets, try this:
#! /bin/sed -nf
s/^<IdSuscriptor>\([0-9]\+\)<\/IdSuscriptor>/\1,/
t match1
b next
: match1
h
b
: next
s/^<ReferenciaConsulta>\([^ ]\+\) \([^ ]\+\) [0-9]\+<\/ReferenciaConsulta>/\1,\2,/
t match2
b
: match2
H
g
s/\n//
p
Explanation
t jumps to match1 if the preceding s command made a replacement. Otherwise b jumps to next.
In case of a match, h copies the matching string into the hold space and b stops the processing of the current line.
The second s command works the same way, with the difference that in case of no match b continues with the next line.
In case of the second match, H appends the pattern space to the hold space, g copies the hold space to the pattern space, s removes the newline between the two matches and p prints the result.
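For comparison, the same "remember the first match, print when the second one arrives" logic can be sketched in awk, where a variable plays the role of the hold space. This assumes, like the sed script, that each element sits on its own line; id_and_name is a hypothetical name.

```shell
# Hypothetical helper: with < and > as field separators, $2 is the tag name
# and $3 is the text between the opening and closing tag.
id_and_name() {
  awk -F'[<>]' '
    $2 == "IdSuscriptor"       { id = $3 }                # remember the id (like h)
    $2 == "ReferenciaConsulta" { split($3, w, " ")        # first and last name
                                 print id "," w[1] "," w[2] "," }
  '
}

printf '%s\n' \
  '<IdSuscriptor>084521</IdSuscriptor>' \
  '<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>' | id_and_name
```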
Conclusion
If you do not know how to do it with sed don't try it. Try to learn a real programming language like Perl or JavaScript or Python. sed is a relic of bygone times.
If your data is in a file named d, try GNU sed:
sed -Ez 's/<[^>]*>//g;s/\n+|\s+/,/g;' d
I have text between HTML tags. For example:
<td>vip</td>
There can be any text between the <td></td> tags.
How can I cut out the text between these tags and put other text in its place?
I need to do it via bash/shell.
How can I do this?
First of all, I tried to get this text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt
As a result I get
<td>vip</td>
but according to the documentation I should get only vip.
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that it only works for <td> tags. It replaces everything between <td> and </td> with TEXT_TO_REPLACE_BY (it actually matches the tags as well and puts them back).
You can use this to get the value vip:
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
If your Input_file is the same as the example shown, then the following may help you too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This prints the tag with echo, then uses awk with > and < as field separators and prints the 3rd field, which is the text you asked for.
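To see why it is the 3rd field, it helps to print every field with its number. A short sketch; show_fields is a hypothetical helper name.

```shell
# Hypothetical helper: split each input line on < or > and list the numbered fields.
show_fields() {
  awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

echo "<td>vip</td>" | show_fields
```

Field 1 is the empty string before the first <, field 2 is the tag name td, and field 3 is the text between the tags.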
$ d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
$ echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
$ awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
$ awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>
I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
Note that this also keeps a value of exactly 0.10. You could add a + after the second character group to support additional digits (i.e. 0.NNNNN), but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
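I would use awk. Since awk supports string comparisons, you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields at each ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields). The comparison is done on text, not on numbers, so it relies on AF being the second-to-last field and being written as a plain decimal.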
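If the position of AF inside the key=value list is not guaranteed, a sketch that looks the key up by name and compares numerically may be safer. It uses only POSIX awk features; filter_af is a hypothetical name, and the threshold argument (default 0.1) is an addition for illustration.

```shell
# Hypothetical helper: keep lines whose AF value in the last
# whitespace-separated column is greater than the threshold in $1.
filter_af() {
  awk -v min="${1:-0.1}" '{
    n = split($NF, kv, ";")                    # kv[1..n] are the key=value pairs
    for (i = 1; i <= n; i++)
      if ((kv[i] ~ /^AF=/) && (substr(kv[i], 4) + 0 > min + 0)) { print; next }
  }'
}

printf '%s\n' \
  '1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV' \
  '4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV' | filter_af 0.1
```

The + 0 forces a numeric comparison, so values such as 0.5 or 1.25 are handled without relying on string ordering.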
This may be a bit complex, but here it goes:
Assuming I have an XML that looks as follows:
<a>
<b>000</b>
<c>111</c>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
How can I, using sed on a Mac, get a resulting XML that looks as follows:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
Basically:
Matching 2 consecutive lines that are of the form <b>...</b> followed by <c>...</c>
Taking the value between <c>...</c> and placing it (plus a space character) right after <b> on the line before it
Removing the second line <c>...</c>
Thank you.
If sed is too much for this, please advise anything else as long as I can run it from a mac shell.
Not the most beautiful solution, but it seems to work :-)
$ tr '\n' @ < input | sed 's#<b>\([0-9][0-9]*\)</b>@<c>\([0-9][0-9]*\)</c>#<b>\2 \1</b>#g' | tr @ '\n'
output:
<a>
<b>111 000</b>
<b>222</b>
<d>333</d>
<c>444</c>
</a>
or a bit more general:
$ tr '\n' @ < f1 | sed 's#<b>\([^<]*\)</b>@<c>\([^<]*\)</c>#<b>\2 \1</b>#' | tr @ '\n'
using [^<] to match anything between the tags. Both commands assume that @ does not occur in the data.
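Another option that avoids the placeholder character is to let awk hold back each <b> line until the next line is known. This is only a sketch under the question's assumptions (each element alone on its line, values without & or <); merge_c_into_b and emit_held are hypothetical names, and only POSIX awk features are used.

```shell
# Hypothetical helper: when a <b>...</b> line is directly followed by a
# <c>...</c> line, move the c value (plus a space) to the front of the b value.
merge_c_into_b() {
  awk '
    function emit_held() { if (have) print held; have = 0 }
    /^[ \t]*<b>[^<]*<\/b>[ \t]*$/ { emit_held(); held = $0; have = 1; next }
    have && /^[ \t]*<c>[^<]*<\/c>[ \t]*$/ {
      val = $0
      sub(/^[ \t]*<c>/, "", val); sub(/<\/c>[ \t]*$/, "", val)   # keep only the value
      sub(/<b>/, "<b>" val " ", held)                            # prepend it inside <b>
      emit_held(); next                                          # the <c> line is dropped
    }
    { emit_held(); print }
    END { emit_held() }
  '
}

printf '%s\n' '<a>' '<b>000</b>' '<c>111</c>' '<b>222</b>' '<d>333</d>' '<c>444</c>' '</a>' | merge_c_into_b
```

A <c> line that does not directly follow a <b> line, such as <c>444</c> here, is printed unchanged.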
Ruby would support multi-line patterns:
ruby -e 'print gets(nil).sub(/<b>([^\n]*)<\/b>\n<c>([^\n]*)<\/c>/m,"<b>\\2 \\1</b>")' file.txt