How to read text with a space in XPath

I have a list box with the value "This is_a_test"
(there is a space after "This").
$x('//li[contains(@class,"myClass")][text()="This is_a_test"]')
When I run it, I get an empty list [].
I also tried
$x('//li[contains(@class,"myClass")][text()="This<b></b> <b></bis_a_test"]')
What do I need to change in my expression?
The XML
<li .... >
"This"
<b></b>
<b></b>
"is_a_test"
</li>

The browser "shows" a space, but on the command line you get
xmllint --html --xpath "//li[contains(@class,'myclass')]/text()" test.html
"This"
"is_a_test"
So the result contains 3 newlines, which are also part of the text() output.
With the newlines removed from the HTML, this XPath works (quotes reversed for simplicity):
echo '<li class="myclass">"This"<b></b> <b></b>"is_a_test"</li>' | \
xmllint --html --xpath "//li[contains(@class,'myclass')][.='\"This\" \"is_a_test\"']/text()" -
Result:
"This" "is_a_test"
Please note the dot (.) used instead of text().
It's not easy to represent newlines in an XPath expression. Also, you may want to check this answer for more info on the difference between the dot and text().
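The difference between text() and the dot can also be seen outside XPath. Below is a minimal sketch in Python, assuming the list item is well-formed enough for the standard library's xml.etree.ElementTree; the snippet is a stand-in for the real page, not the asker's actual markup. Here `li.text` behaves like the first text() node, while joining `itertext()` approximates the string value that the dot compares against. The whitespace collapsing at the end is roughly what normalize-space(.) does in XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Stand-in for the <li> from the question, newlines included (an assumption).
snippet = """<li class="myClass">
This
<b></b>
<b></b>
is_a_test
</li>"""

li = ET.fromstring(snippet)

# Only the text before the first child element, similar to the first text() node.
first_text_node = li.text

# All descendant text concatenated, similar to the string value of "." in XPath.
string_value = "".join(li.itertext())

# Collapse runs of whitespace (including newlines) into single spaces.
normalized = " ".join(string_value.split())

print(repr(first_text_node))
print(repr(normalized))
```

Comparing against a single text node fails because the value is spread over several nodes; comparing against the whole string value succeeds once whitespace is normalized.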

Related

Awk to get the attribute value from XML file

I want to get the value of the code attribute of tag c from the XML below.
random.xml
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
Currently the logic is:
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How does the above logic work to get the values of code from tag c?
Getting the expected output:
abc
efg
First, observe that
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
is of dubious quality, because:
egrep does not need standard input; it can read the file itself, so this is a useless use of cat
the pattern is simple enough to work equally well in plain grep, so extended grep is overkill
1 is set as the field separator in awk, but the code makes no use of fields
After fixing these issues, the code looks like this:
grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How it works: grep selects the lines that contain <c followed by zero or more characters followed by />. awk is then told that records are separated by double quotes ("). When a record contains code=, the variable f is set to that record's number (NR). A record is printed when f is non-zero and NR-1 equals f, that is, when it directly follows a record containing code=. That following record is the attribute value.
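To make the record-by-record logic easier to follow, here is a rough Python re-enactment of the grep and awk steps. It is only a sketch that mimics the pipeline on the sample input. It does not replace the pipeline, and the variable names are illustrative.

```python
import re

xml_text = """<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
"""

# grep "<c.*/>": keep only the lines containing a self-closing c tag.
kept = [line for line in xml_text.splitlines(keepends=True)
        if re.search(r"<c.*/>", line)]

# RS='"': awk splits the remaining input into records at every double quote.
records = "".join(kept).split('"')

values = []
f = 0
for nr, record in enumerate(records, start=1):
    if "code=" in record:   # /code=/ {f=NR}
        f = nr
    if f and nr - 1 == f:   # f&&NR-1==f, the default action prints the record
        values.append(record)

print("\n".join(values))
```

The record right after ` code=` is the quoted value, which is why the off-by-one comparison picks out abc and efg.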
Observe that GNU AWK is poorly suited for working with XML, and using regular expressions against XML is a very poor idea, as XML is not a Chomsky Type 3 (regular) language.
If possible, use proper tools for working with XML data. For example, hxselect might be used as follows. Let the content of file.xml be
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
then
hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml
gives output
abc
efg
Explanation: -c gets just the value rather than name and value; -s '\n' separates the results with a newline, so each value is on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. Observe that this solution is more robust than the peculiar cat-egrep-awk pipeline, as it is immune to, for example, different whitespace usage in the file (whitespace outside tags in XML is optional).
This might be an awk question but parsing XML should be done with XML tools.
Here's an example with Xidel (available for several operating systems) and a standard XPath expression:
xidel --xpath '//c[@code]/@code' random.xml
Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.
Output
abc
efg
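The same "use an XML parser" idea can be sketched in Python with the standard library, in case neither hxselect nor Xidel is installed. This assumes the sample document from the question; `.//c[@code]` is the limited XPath subset that xml.etree.ElementTree supports.

```python
import xml.etree.ElementTree as ET

xml_text = """<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>"""

root = ET.fromstring(xml_text)

# Select every c element that carries a code attribute, then read its value.
codes = [c.get("code") for c in root.findall(".//c[@code]")]

print("\n".join(codes))
```

Because the parser works on elements rather than lines, the result does not depend on how the file is indented or wrapped.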
If your input always looks like the sample XML, then you can make the code attribute itself a field separator, and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:
awk -F' .*code="|" ' -vRS='<' '$1=="c"{print $2}'
Demo: https://awk.js.org/?snippet=Lz6yx7

Look for more than one value using xmllint

I need to retrieve more than one value from several XML blocks inside an XML file. How can I use xmllint to do this?
I noticed this solution (xml_grep get attribute from element) and tried to extend it, so far without any luck.
xmllint --xpath 'string(//identity/@name @placeofbirth @photo)' file.xml
Example XML file:
<eid>
<identity>
<name>Menten</name>
<firstname>Kasper</firstname>
<middlenames>Marie J</middlenames>
<nationality>Belg</nationality>
<placeofbirth>Sint-Truiden</placeofbirth>
<photo>base64-string</photo>
</identity>
<identity>
<name>Herbal</name>
<firstname>Jane</firstname>
<middlenames>Helena</middlenames>
<nationality>Frans</nationality>
<placeofbirth>Paris</placeofbirth>
<photo>notavailable</photo>
</identity>
</eid>
Output wanted
Kasper, Sint-Truiden, base64-string
Jane, Paris, notavailable
One way to do that is
# Read xml into variable
xmlStr=$(cat test.xml)
# Count identity nodes
nodeCount=$(echo "$xmlStr" | xmllint --xpath "count(//identity)" -)
# Iterate the nodeset by index
for i in $(seq 1 $nodeCount);do
echo "$xmlStr" | xmllint --xpath "concat((//identity)[$i]/name,', ',(//identity)[$i]/placeofbirth, ', ', (//identity)[$i]/photo)" - ; echo
done
Result:
Menten, Sint-Truiden, base64-string
Herbal, Paris, notavailable
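If calling xmllint once per node becomes slow, the loop can also be written with a parser that keeps the document in memory. The following is a minimal sketch in Python's standard library, not part of the xmllint answer. It assumes the sample file from the question and picks firstname, placeofbirth and photo, which is what the wanted output shows.

```python
import xml.etree.ElementTree as ET

xml_text = """<eid>
<identity>
<name>Menten</name>
<firstname>Kasper</firstname>
<middlenames>Marie J</middlenames>
<nationality>Belg</nationality>
<placeofbirth>Sint-Truiden</placeofbirth>
<photo>base64-string</photo>
</identity>
<identity>
<name>Herbal</name>
<firstname>Jane</firstname>
<middlenames>Helena</middlenames>
<nationality>Frans</nationality>
<placeofbirth>Paris</placeofbirth>
<photo>notavailable</photo>
</identity>
</eid>"""

# Child elements to report for each identity block, in output order.
wanted = ("firstname", "placeofbirth", "photo")

root = ET.fromstring(xml_text)
rows = [
    ", ".join(identity.findtext(tag, default="") for tag in wanted)
    for identity in root.iter("identity")
]

print("\n".join(rows))
```

Swapping "firstname" for "name" in `wanted` gives the same fields as the xmllint loop above.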

How to add "," after every sed match?

I have this code:
cat response_error.xml | sed -ne 's#\s*<[^>]*>\s*##gp' >> response_error.csv
but all sed matches from the XML are joined together, for example:
084521AntonioCallas
I want to get this effect
084521,Antonio,Callas,
Is it possible?
I must write a script that collects the XML documents from the previous day, extracts only the data without the <...> tags, and saves it to a CSV file with the values separated by commas, like this: 084521,Antonio,Callas. The XML looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>
<Documento>
<TipoDocumento>
<Codigo>01</Codigo>
<Descripcion>NIF</Descripcion>
</TipoDocumento>
<NumeroDocumento>000000000</NumeroDocumento>
<PaisDocumento>
<Codigo>000</Codigo>
<Descripcion>ESPAÑA</Descripcion>
</PaisDocumento>
</Documento>
<Resumen>
<Nombre>
<Nombre1>XXX</Nombre1>
<Nombre2>XXX</Nombre2>
<ApellidosRazonSocial>XXX</ApellidosRazonSocial>
</Nombre>
<Direccion>
<Direccion>XXX</Direccion>
<NombreLocalidad>XXX</NombreLocalidad>
<CodigoLocalidad/>
<Provincia>
<Codigo>39</Codigo>
<Descripcion>XXX</Descripcion>
</Provincia>
<CodigoPostal>39012</CodigoPostal>
</Direccion>
<NumeroTotalOperacionesImpagadas>1</NumeroTotalOperacionesImpagadas>
<NumeroTotalCuotasImpagadas>0</NumeroTotalCuotasImpagadas>
<PeorSituacionPago>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPago>
<PeorSituacionPagoHistorica>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPagoHistorica>
<ImporteTotalImpagado>88.92</ImporteTotalImpagado>
<MaximoImporteImpagado>88.92</MaximoImporteImpagado>
<FechaMaximoImporteImpagado>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaMaximoImporteImpagado>
<FechaPeorSituaiconPagoHistorica>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaPeorSituaiconPagoHistorica>
<FechaAltaOperacionMasAntigua>
<DD>16</DD>
<MM>12</MM>
<AAAA>2015</AAAA>
</FechaAltaOperacionMasAntigua>
<FechaUltimaActualizacion>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaUltimaActualizacion>
</Resumen>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>
You can extract the IdSuscriptor using the following command :
xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml
And the ReferenciaConsulta using the following command :
xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml
To produce the desired IdSuscriptor,FirstName,LastName output, I would use the following script:
id_suscriptor=$(xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml)
referencia_consulta=$(xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml)
first_name=$(echo "$referencia_consulta" | cut -d' ' -f1)
last_name=$(echo "$referencia_consulta" | cut -d' ' -f2)
echo "$id_suscriptor,$first_name,$last_name"
Note that this assumes the ReferenciaConsulta field will always contain a string starting with the first name and last name separated with a space.
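For comparison, here is a hedged sketch of the same extraction with Python's standard library. It assumes a cut-down copy of the SOAP response (only the two relevant elements are kept) and uses the namespace URI from the question instead of local-name(). The prefix v2 is an arbitrary label chosen for this example.

```python
import xml.etree.ElementTree as ET

# Reduced stand-in for response_error.xml (an assumption for illustration).
soap = """<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>"""

# Map an arbitrary prefix to the default namespace of InformeResumen.
ns = {"v2": "http://experian.servicios.CAIS.V2"}

root = ET.fromstring(soap)
id_suscriptor = root.findtext(".//v2:IdSuscriptor", namespaces=ns)
referencia = root.findtext(".//v2:ReferenciaConsulta", namespaces=ns)

# Same assumption as the shell script: first name and last name come first,
# separated by a space.
first_name, last_name = referencia.split()[:2]

csv_line = ",".join([id_suscriptor, first_name, last_name])
print(csv_line)
```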
If you want to parse XML, use a dedicated XML parser like Saxon.
If you want to parse a strange text file with some funny unrelated angle brackets, try this:
#! /bin/sed -nf
s/^<IdSuscriptor>\([0-9]\+\)<\/IdSuscriptor>/\1,/
t match1
b next
: match1
h
b
: next
s/^<ReferenciaConsulta>\([^ ]\+\) \([^ ]\+\) [0-9]\+<\/ReferenciaConsulta>/\1,\2,/
t match2
b
: match2
H
g
s/\n//
p
Explanation
t jumps to match1 if the preceding s command did a replacement. Otherwise b jumps to next.
In case of a match, h copies the matching string into the hold space and b stops the processing of the current line.
The second s command works the same way, with the difference that in case of no match b continues with the next line.
In case of the second match H appends the pattern space to the hold space, g copies the hold space to the pattern space, s removes the newline between the two matches and p prints the result.
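The hold-space bookkeeping may be easier to see as an ordinary loop. This is a rough Python equivalent of the script's control flow, assuming (as the sed patterns do) that each element sits at the start of its own line. The function name extract_rows is made up for this sketch.

```python
import re

ID_RE = re.compile(r"<IdSuscriptor>([0-9]+)</IdSuscriptor>")
REF_RE = re.compile(
    r"<ReferenciaConsulta>([^ ]+) ([^ ]+) [0-9]+</ReferenciaConsulta>"
)

def extract_rows(lines):
    """Mimic the sed script: remember the id, emit it with the next name."""
    hold = ""   # plays the role of sed's hold space
    rows = []
    for line in lines:
        m = ID_RE.match(line)          # first s command, anchored like ^
        if m:
            hold = m.group(1) + ","    # h: store "id," and go to the next line
            continue
        m = REF_RE.match(line)         # second s command
        if m:
            # H; g; s/\n//; p: join the held id with "first,last," and print
            rows.append(hold + m.group(1) + "," + m.group(2) + ",")
    return rows

sample = [
    "<IdSuscriptor>084521</IdSuscriptor>",
    "<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>",
    "<Error>",
]
print("\n".join(extract_rows(sample)))
```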
Conclusion
If you do not know how to do it with sed don't try it. Try to learn a real programming language like Perl or JavaScript or Python. sed is a relic of bygone times.
If your data is in a file named d, try GNU sed:
sed -Ez 's/<[^>]*>//g;s/\n+|\s+/,/g;' d

How do I get a selection from the output of a grep

I have the following text in a file :
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div><script type="text/javascript">document.getElementById('duration').innerHTML = "Finished in <strong>1m31.846s seconds</strong>";</script><script type="text/javascript">document.getElementById('totals').innerHTML = "1
What I want to do is obtain the part after src, i.e. Logs/P2P2014-04-10_14-24-49.txt. I tried the following, putting the result into a variable in Ruby:
text = `grep 'Logs\/.*txt\"'`
But that returns the entire line instead of only the matching text. How do I get this done?
Try to use
text=$(grep -o 'Logs\/.*txt\"')
It should return only matching part of the line.
With Nokogiri the problem is easy to solve:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div>
html
doc.at('#img_1')['src'] # => "Logs/P2P2014-04-10_14-24-49.txt"
Read tutorials to understand and learn Nokogiri.
Using sed
sed -n 's/.*src="\([^"]*\)".*/\1/p' file
Using GNU grep, if it supports the -P option
grep -Po '(?<=src=")[^"]*' file
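The lookbehind used with grep -P works the same way in other regex engines. As an illustration only, here is the same pattern in Python's re module, applied to a shortened copy of the line from the question.

```python
import re

# Shortened copy of the input line (an assumption for illustration).
line = ('<img id="img_1" style="display: none" '
        'src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div>')

# (?<=src=") requires src=" right before the match without including it;
# [^"]* then takes everything up to the closing quote.
match = re.search(r'(?<=src=")[^"]*', line)
src = match.group(0) if match else None

print(src)
```

Stopping at the first closing quote avoids the greedy `.*` problem, where a match can run on to a later `txt"` on the same line.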

BASH - Select All Code Between A Multiline Div

I have a div on all of my eCommerce site's pages holding SEO content. I'd like to count the number of words in that div. It's for diagnosing empty pages in a large crawl.
The div always starts as follows:
<div class="box fct-seo fct-text
It then contains <h1>, <p> and <a> tags.
It then, obviously, closes with </div>
How can I, using sed, awk, wc, etc., take all the code between the start of the div and its closing </div> and count how many words occur? If it's 90% accurate, I'm happy.
You'd somehow have to tell it to stop scanning before the first closing </div> it finds.
Here's an example page to work with:
http://www.zando.co.za/women/shoes/
Much appreciated.
When it gets more complicated (like divs nested within that div), the regex approach won't work anymore and you need an HTML parser, like the one in my Xidel. Then you can find the text
either with css:
xidel http://www.zando.co.za/women/shoes/ -e 'css(".fct-seo")' | wc -w
or pattern matching:
xidel http://www.zando.co.za/women/shoes/ -e '<div class="box fct-seo fct-text">{.}</div>' | wc -w
It will also only print the text, not the html tags. (if you/someone wanted them, you could add the --printed-node-format xml option)
In a Perl one-liner you can use the .. operator to specify the patterns that match the beginning and end of the region you're interested in:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html
You can then count the words with wc -w:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html | wc -w
If counting the ‘words’ in the HTML tags themselves is affecting the numbers enough to affect the accuracy, you can remove those from the count with something like:
$ perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
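For readers less familiar with Perl's range operator, the same start-to-end scan can be sketched as an explicit flag in Python. The sample lines below are invented for illustration, and like the one-liner this stops at the first </div>, so nested divs are not handled.

```python
import re

START = '<div class="box fct-seo fct-text'
END = "</div>"

def count_seo_words(html_lines):
    """Count words between the SEO div's opening line and the first </div>."""
    inside = False
    words = 0
    for line in html_lines:
        if not inside and START in line:
            inside = True                        # left side of the range matched
        if inside:
            text = re.sub(r"<.*?>", "", line)    # strip tags, like s/<.*?>//g
            words += len(text.split())
            if END in line:
                inside = False                   # right side of the range matched
    return words

# Hypothetical page fragment, one HTML line per list item.
sample = [
    "<p>header</p>",
    '<div class="box fct-seo fct-text">',
    "<h1>Women's Shoes</h1>",
    '<p>Browse our <a href="#">latest range</a> today.</p>',
    "</div>",
    "<p>footer text</p>",
]
print(count_seo_words(sample))
```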
Try:
grep -Pzo '(?<=<div)(.*?\n)*?.*?(?=</div)' -n inputFile.html | sed 's/^[^>]*>//'
