I have an xml document with fragments like the following:
<x>
abcd
<z>ef</z>
ghij
</x>
I want to find the text "defg" inside the node, and modify that node to the following:
<x>
abc
<y>
d<z>ef</z>g
</y>
hij
</x>
This means creating a new node that contains part of x's text along with other child nodes.
I can find the node which includes the text, but I don't know how to break it up, and wrap just the matching section inside the <y> tags.
Any ideas that can point me in the right direction are most appreciated. Thanks.
What about turning it into a string, using a regex to change it, and then parsing it with Nokogiri again?
string = some_xml.to_s
# => '<x>abcd<z>ef</z>ghij</x>'
# Match the character before <z>, the <z> element itself, and the character after it
splits = string.match(/(.)(<z>.*?<\/z>)(.)/)
# Wrap only the first occurrence of the matched span in <y>
new_string = string.sub(splits[0], "<y>#{splits[0]}</y>")
Nokogiri::XML(new_string)
I have this HTML fragment:
<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>
I need to replace the word plane with a link (the same word wrapped in an <a> element), but only when it occurs outside an <a> anchor element and outside a heading (<h1> to <h6>) element.
This is what I've tried:
require 'nokogiri'
h = '<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)
# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content
# Try 2: The below line removes headings permanently - I need them to remain
# doc.search(".//h2").remove
# Try 3: This just comes out empty - why?
# doc.xpath('text()')
# doc.xpath('//text()')
# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html
I tried various other variations of xpath to no avail. What am I doing wrong?
After some playing around, it appears you needed to use the XPath selector p/text(). Things then got more complicated because you're trying to replace normal text with a link element.
When I just tried using gsub, Nokogiri was escaping the new link, so I needed to split the text element into multiple sibling elements where I could replace some of the siblings with link elements instead of text nodes.
doc.xpath('p/text()').grep(/plane/) do |node|
  # Split around the match; the capture group keeps 'plane' in the result
  node_content, *remaining_texts = node.content.split(/(plane)/)
  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane'
      # Insert the match as a link element; put the real target in href
      node = node.add_next_sibling('<a href="...">plane</a>').last
    else
      # Everything else goes back in as plain text
      node = node.add_next_sibling(text).last
    end
  end
end
puts doc
# <p>Yes. No. Both. Maybe a <a href="...">plane</a>?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes. No. Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird? Is it a <a href="...">plane</a>? No, it’s Superman.</p>
A more general purpose XPath selector for all elements, except headings and links, might be:
*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()
You may need to tweak this some as I'm not an XML or Nokogiri expert, but it appears to me to be working for the provided example, at least, so it should get you going.
I'm using Ruby, XPath and Nokogiri and trying to retrieve d1 from the following XML:
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
This is my code in a loop:
rs = doc.xpath("//a/b1/c/d1").inner_text
puts rs
It returns nothing (No error).
I want to get the text in <d1>.
You don't ask for the text content in your xpath query:
rs = doc.xpath('//a/b1/c/d1/text()')
You're misusing XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
EOT
doc.at('/a/b1/c/d1').text # => "01/11/2001"
doc.at('//d1').text # => "01/11/2001"
// in XPath-ese means start at the top and look anywhere in your document. Instead, if you're supplying an explicit/absolute selector, start at the top of the document and drill down using '/a/b1/c/d1'. Or, do the simple thing and let the parser search through the document for that particular node using //d1. You can do that if you know there's a single instance of that node.
In my code above, I used at instead of xpath. at returns the first matching node, which is similar to using xpath('//d1').first. xpath returns a NodeSet, which is like an array of nodes, whereas at returns a Node only. Using inner_text on a NodeSet is likely to not give you the results you want, which would be the text of a particular node, so be careful there.
doc.xpath('/a/b1/c/d1/text()').class # => Nokogiri::XML::NodeSet
doc.xpath('//c').inner_text # => "\n 01/11/2001\n 02/02/2004\n "
doc.xpath('/a/b1/c/d1').first.text # => "01/11/2001"
Look at the following lines. Instead of using XPath selectors, I used CSS, which tends to be more readable. Nokogiri supports both.
doc.at('d1').text # => "01/11/2001"
doc.at('a b1 c d1').text # => "01/11/2001"
Also, notice the type of data returned from these two lines:
doc.at('/a/b1/c/d1/text()').class # => Nokogiri::XML::Text
doc.at('/a/b1/c/d1').text.class # => String
While it might seem good/smart to tell the parser to locate the text() node inside <d1>, what will be returned isn't text, and will need to be accessed further to make it usable, so consider forgoing the use of text() unless you know exactly why you need it:
doc.at('/a/b1/c/d1/text()').text # => "01/11/2001"
Finally, Nokogiri has many methods used for locating nodes. As I said above, xpath returns a NodeSet and at returns a Node. xpath is really an XPath-specific version of Nokogiri's search method. search, css and xpath all return NodeSets. at, at_css and at_xpath all return Nodes. The CSS and XPath variants are useful when you have an ambiguous selector that you need to be used as CSS or XPath specifically. Most of the time Nokogiri can figure whether it's CSS or XPath on its own and will do the right thing, so it's OK to use the generic search and at for the majority of your coding. Use the specific versions when you have to specify one or the other.
I have the following XML tree and need to get out the first name and surname only for the contrib tags with child xref nodes of ref-type "corresp".
<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <contrib-group>
          <contrib contrib-type="author">
            <name>
              <surname>Wereszczynski</surname>
              <given-names>Jeff</given-names>
            </name>
            <xref rid="aff1" ref-type="aff"/>
          </contrib>
          <contrib contrib-type="author">
            <name>
              <surname>Andricioaei</surname>
              <given-names>Ioan</given-names>
            </name>
            <xref rid="aff1" ref-type="aff"/>
            <xref ref-type="corresp" rid="cor1">*</xref>
          </contrib>
        </contrib-group>
      </article-meta>
    </front>
  </article>
</pmc-articleset>
I saw "Getting the siblings of a node with Nokogiri" which points out the CSS sibling selectors that can be used in Nokogiri, but, following the example given, my code gives siblings indiscriminately.
require "net/http"
require "nokogiri"
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=PMC1637560&db=pmc"
xml_data = Net::HTTP.get_response(URI.parse(url)).body
parsedoc = Nokogiri::XML.parse(xml_data)
corrdetails = parsedoc.at('contrib:has(xref[text()="*"])')
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
=> WereszczynskiAndricioaei
=> JeffIoan
I only want the sibling nodes under the condition that the contrib contains <xref ref-type="corresp">*</xref>, that is, an output of:
=> Andricioaei
=> Ioan
I've currently implemented this without referring to ref-type but rather selecting the asterisk within the xref tag (either is appropriate).
The problem is actually with your XPath for getting the surname and given name, i.e., the XPath is incorrect in these lines:
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
Starting the XPath with // means to look for the node anywhere in the document. You only want to look within the corrdetails node, which means the XPath needs to start with a dot, e.g., .//.
Change the two lines to:
puts surname = corrdetails.xpath( ".//surname" ).text
puts givennames = corrdetails.xpath(".//given-names").text
Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set will first be converted to its string-value. If no arguments are passed to normalize-space, it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. For example, the following JavaScript can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace-only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document whose id is 'theNode'.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  "<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
  "<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
 </xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html")  # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()  # Join the text nodes in document order
print(final_txt)  # 'This is an example bolded text'
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part, though. You might have to experiment with it.
The usual expression is
//div[@id='theNode']
which gets all the text. If the text comes back split into pieces, try
//div[@id='theNode']/text()
I'm not sure, but if you provide the link I will try it.
I am trying to find a certain text in any text node in a document, so far my statement looks like this:
doc.xpath("//text() = 'Alliance Consulting'") do |node|
...
end
This obviously does not work, can anyone suggest a better alternative?
The expression //text() = 'Alliance Consulting' evaluates to a boolean.
In case of this test sample:
<r>
<t>Alliance Consulting</t>
<s>
<p>Test string
<f>Alliance Consulting</f>
</p>
</s>
<z>
Alliance Consulting
<y>
Other string
</y>
</z>
</r>
It will return true, of course.
The expression you need should evaluate to a node-set, so use:
//text()[. = 'Alliance Consulting']
For example, the expression:
count(//text()[normalize-space() = 'Alliance Consulting'])
evaluated against the above document returns 3.
To select text nodes which contain 'Alliance Consulting' in the whole string value (e.g. 'Alliance Consulting provides great services') use:
//text()[contains(.,'Alliance Consulting')]
Note that adjacent text nodes are merged into a single node when the parser reads the document.