How to select only nodes that are not only spaces using Nokogiri? - ruby

I have the following XML document:
<w:p w14:paraId="572705D7" w14:textId="77777777" w:rsidP="00CA0169" w:rsidR="00CA0169" w:rsidRDefault="00CA0169" w:rsidRPr="00777A35">
<w:r>
<w:t xml:space="preserve"/>
</w:r>
<w:r>
<w:t>synthesized in cyanobacteria under unsuitable condition</w:t>
</w:r>
</w:p>
I currently select all nodes that begin with <w:r> as follows:
text_nodes = p.xpath('w:r')
However, I would like to select only those nodes that contain text and are not only spaces, unlike the first node in the XML sample above.
I have extended the String Class to test for spaces as follows:
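class String
  # True when the whole string consists only of whitespace.
  # \A and \z anchor to the string, not to individual lines as ^ and $ do.
  def spaces?
    match?(/\A\s+\z/)
  end
end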
So I can do:
element.text.spaces?
I just don't know how to combine this with p.xpath('w:r') to select only nodes that are NOT only spaces.

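Using
w:r[normalize-space(.) != '']
as your XPath expression should do it. normalize-space() strips leading and trailing whitespace and collapses internal runs, so a run that contains only spaces yields an empty string and is filtered out.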

Related

xpath: How to select only nearby parent siblings of current child node?

I'm using XPath 1.0 and I want to select the siblings of the parent of the current child node.
The current node is <w:instrText>. How can I get the parent's sibling <w:r> nodes that lie only between <!-- SELECT FROM HERE --> and <!-- SELECT TILL HERE -->?
<w:p>
<w:pPr>
<w:rPr>
<w:lang w:val='en-US'/>
</w:rPr>
</w:pPr>
<w:r>
<w:t>TEXT 1</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:t xml:space='preserve'> </w:t>
</w:r>
<!-- SELECT FROM HERE -->
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='begin'/>
</w:r>
<w:r w:rsidR='002D5530'>
<w:instrText xml:space='preserve'> SOME DYNAMIC TEXT THAT CAN CHANGE </w:instrText> <!-- My cursor is here! -->
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='separate'/>
</w:r>
<w:r w:rsidR='00E5783C'>
<w:t>SOME DYNAMIC TEXT</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='end'/>
</w:r>
<!-- SELECT TILL HERE -->
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='begin'/>
</w:r>
<w:r w:rsidR='002D5530'>
<w:instrText xml:space='preserve'> SOME DYNAMIC TEXT THAT CAN CHANGE </w:instrText>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='separate'/>
</w:r>
<w:r w:rsidR='00E5783C'>
<w:t>SOME DYNAMIC TEXT</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='end'/>
</w:r>
</w:p>
This is what I tried, but I got only 3 nodes:
../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'] | ../following-sibling::w:r[w:fldChar/@w:fldCharType='end']
Given the limitations of XPath 1.0, and assuming that the order of the sets of five w:r elements is consistent, you could use a somewhat brute-force approach like:
(../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'])[last()]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='separate'])[position() = 1]
| (../following-sibling::w:r[w:t])[position() = 1]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='end'])[position() = 1]
This forms a union of four elements (XPath 1.0 has no comma-separated sequences, so the parts are joined with |):
the last of the preceding "begin" elements
the first of the following "separate" elements
the first of the following "t" elements (w:r containing a w:t)
the first of the following "end" elements
If you also need to include the immediate parent of your context node, add .. (or parent::* or parent::w:r) to the union:
(../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'])[last()]
| ..
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='separate'])[position() = 1]
| (../following-sibling::w:r[w:t])[position() = 1]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='end'])[position() = 1]
However, this may not be efficient if your input document is huge or performance is a significant factor.
If you are curious why your attempt returned three elements instead of two, it is probably because it selected both "end" elements that follow your context node. The answer here may not be precisely what you need, but I suspect the last()/position() predicate examples will set you in the right direction.
BTW, if you ever have the opportunity to use XQuery for something like this, it is a perfect use case for the XQuery 3 "tumbling window" syntax which makes it very easy to identify the boundaries of sibling groups and iterate over the groups.

How to select an element only if the next element is of a specific type using Nokogiri

Given the following XML:
<w:p>
<w:pPr>
<w:lang w:val="en-CA"/>
</w:pPr>
<w:ins>
<w:t>sections</w:t>
</w:ins>
<w:pPr>
<w:lang w:val="en-CA"/>
</w:pPr>
<w:r>
<w:t>I am</w:t>
</w:r>
</w:p>
I want to select the w:pPr elements that are followed by either w:ins or w:del elements.
I tried:
doc.xpath("w:pPr[following-sibling::w:ins[1] or following-sibling::w:del[1]]")
This still returns the second w:pPr element, which is followed by a w:r element, so it's not what I'm looking for.
How can I do this?
Try this XPath expression:
//w:pPr[following-sibling::*[position()=1][name()='w:del' or name()='w:ins']]

How to select only paragraphs that contain certain child elements with nokogiri?

I have the following XML:
<w:p w14:paraId="07E73137" w14:textId="77777777" w:rsidP="00D279DF" w:rsidR="00D279DF" w:rsidRDefault="00D279DF">
</w:p>
<w:p w14:paraId="07E73138" w14:textId="77777777" w:rsidP="00D279DF" w:rsidR="00D279DF" w:rsidRDefault="00D279DF>
<w:r w:rsidRPr="00922473">
<w:t xml:space="preserve">Visual attributes </w:t>
</w:r>
<w:ins w:author="RKH RKH" w:date="2016-12-17T16:40:00Z" w:id="0">
<w:r>
<w:t>an</w:t>
</w:r>
</w:ins>
<w:del w:author="RKH RKH" w:date="2016-12-17T16:40:00Z" w:id="1">
<w:r w:rsidDel="008B2A6A">
<w:delText>the</w:delText>
</w:r>
</w:del>
</w:p>
The first <w:p> element does not contain any <w:ins> and <w:del> children elements.
However, the second <w:p> does contain <w:ins> and <w:del> elements.
I am currently selecting all paragraph elements using the following:
@all_paragraph_nodes = @file.xpath('//w:p')
I would like to select only paragraph elements that contain at least one <w:ins> element or <w:del> element.
How can I do this using Nokogiri?
You can use:
@all_paragraph_nodes = @file.xpath('//w:p[w:ins or w:del]')
Note that you have a typo in the 3rd line of your XML:
w:rsidRDefault="00D279DF
is missing its closing quote.

How to get the first node prior to the current node that contains text using Nokogiri?

I have a node that contains the text 'The f':
<w:r w:rsidR="00BC78BF">
<w:t>e takes out his phone and calls a friend.</w:t>
</w:r>
<w:r w:rsidR="00CB49B6">
<w:t xml:space="preserve"/>
</w:r>
<w:ins w:author="Mitchell Gould" w:date="2016-11-14T14:23:00Z" w:id="8">
<w:r w:rsidR="00BC7F15">
<w:t>The f</w:t>
</w:r>
</w:ins>
I want to get the first occurrence of text that exists before this text node.
I tried using:
node.previous_element.text
=> " "
and
previous_node = node.xpath('preceding-sibling::w:r').last
=> " "
This is because sometimes the previous_element is just a space as shown above, and it is possible that there could be many of these elements that are just spaces.
How can I get the first prior sibling that contains text?
I'd start with:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<r>
<t>e takes out his phone and calls a friend.</t>
</r>
<r>
<t/>
</r>
<ins>
<r>
<t>The f</t>
</r>
</ins>
</xml>
EOT
doc.search('//text()').map { |t| t.text.strip }.reject(&:empty?)
# => ["e takes out his phone and calls a friend.", "The f"]
Then it becomes a question of identifying the element prior to "The f", which I'll leave as a task for you. It isn't hard but, in a big document, it could definitely affect performance.
//text() is the XPath way to find all the text nodes in a document; // matches nodes at any depth below the starting point. A text node isn't just content like "The f"; it can also be the newline following a closing tag in a pretty-printed XML file.
text.strip followed by reject removes the XML formatting between nodes: spaces and empty lines.

Extract attributes conditionally and recursively for word document xml using xpath in ruby

I have a Word document XML like this:
<w:document>
<w:body>
<w:p w14:paraId="5B6351BB" w14:textId="0D9644FF" w:rsidR="00432348" w:rsidRDefault="00432348" w:rsidP="00432348">
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:bookmarkStart w:id="27" w:name="_Toc435537885"/>
<w:r>
<w:t>TESTPLAN</w:t>
</w:r>
<w:bookmarkEnd w:id="26"/>
<w:r w:rsidR="00B46E57">
<w:t xml:space="preserve"> – PART I</w:t>
</w:r>
<w:bookmarkEnd w:id="27"/>
</w:p>
</w:body>
</w:document>
I want to extract the text for Heading1, so I wrote the following code, but it does not seem to work.
@doc.xpath('//w:document//w:body//w:p[w:pPr//w:pStyle[@val]="Heading1"]//w:r//w:t')
In place of @val, I have tried @w:val, and in place of "Heading" I have tried 'Heading' for the comparison. It still returns nil.
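The w:pStyle[@w:val]="Heading1" part is wrong: it compares the string value of the w:pStyle element (which is empty) with "Heading1" instead of comparing the attribute. It should be w:pStyle/@w:val="Heading1". Then it will work.
Correct XPath:
//w:document//w:body//w:p[w:pPr//w:pStyle/@w:val="Heading1"]//w:r//w:t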
