I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts every node everything
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But does not work.
Anyone has any suggestions how to extract exactly the body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element with name wcm:element that is a child of the wcm:root element which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to is to get the values of the wcm:element element where the attribute name is set to the value Body. You access attributes in XPath by prefixing them with an # sign and to express a where condition you use [...] - a predicate. You XPath statement needs to be:
/wcm:root/wcm:element[#name = 'Body']
I'm assuming that your XPath execution environment is fine the namespace prefixes (wcm) because you say that your first query returned content.
Related
I have the following XML -
<d><m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
<d:AllTexts/>
<d:BomFlag/>
<d:OrderNumber>9489</d:OrderNumber>
<d:LineNumber>000000</d:LineNumber>
<d:VcFlag>Y</d:VcFlag>
<d:PricingFlag/>
<d:TextType>H</d:TextType>
<d:TextId>ZC01</d:TextId>
<d:TextLineNo>1</d:TextLineNo>
<d:TextLine>ecom header text 1</d:TextLine>
and trying to retrieve the TextLine nodelist as based on TextId = ZC01 -
<TextLine>ecom header text1</TextLine>
when I applied the xpath as --> //m:properties[d:TextId = 'ZC01']/d:TextLine
I get the output as -
<d:TextLine xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">ecom header text 1</d:TextLine>
how can I remove the prefix and namespace? I tried using local-name(), but that didn't work
May be used it wrong way.
Thank you for your help!
Thanks
Sugata
XPath is a selection language: it can only retrieve nodes that are actually there, it can't change them in any way. If the selected element has a prefix and namespace in the original, then it will have a prefix and namespace in the result.
However, you need to distinguish what the XPath selects (a node) from the way it the result is displayed. This depends on the application that is evaluating the XPath. The two popular ways of displaying a node selected by an XPath expression are (a) by serialising the node as XML (which is what we see in your case), and (b) by showing a path to the selected node, such as /d/m:properties/d:TextLine. You haven't told us how you are evaluating the XPath expression or displaying its result, and you may have options here.
But perhaps you should consider XSLT or XQuery, which (unlike XPath) allow you to construct new XML that differs from your original.
I have a certain XPATH-query which I use to get the height from a certain HTML-element which returns me perfectly the desired value when I execute it in Chrome via the XPath Helper-plugin.
//*/div[#class="BarChart"]/*[name()="svg"]/*[name()="svg"]/*[name()="g"]/*[name()="rect" and #class="bar bar1"]/#height
However, when I use the same query via the Get Element Attribute-keyword in the Robot Framework
Get Element Attribute//*/div[#class="BarChart"]/*[name()="svg"]/*[name()="svg"]/*[name()="g"]/*[name()="rect" and #class="bar bar1"]/#height
... then I got an InvalidSelectorException about this XPATH.
InvalidSelectorException: Message: u'invalid selector: Unable to locate an
element with the xpath expression `//*/div[#class="BarChart"]/*[name()="svg"]/*
[name()="svg"]/*[name()="g"]/*[name()="rect" and #class="bar bar1"]/`
So, the Robot Framework or Selenium removed the #-sign and everything after it. I thought it was an escape -problem and added and removed some slashes before the #height, but unsuccessful. I also tried to encapsulate the result of this query in the string()-command but this was also unsuccessful.
Does somebody has an idea to prevent my XPATH-query from getting broken?
It looks like you can't include the attribute axis in the XPath itself when you're using Robot. You need to retrieve the element by XPath, and then specify the attribute name outside that. It seems like the syntax is something like this:
Get Element Attribute xpath=(//*/div[#class="BarChart"]/*[name()="svg"]/*[name()="svg"]/*[name()="g"]/*[name()="rect" and #class="bar bar1"])#height
or perhaps (I've never used Robot):
Get Element Attribute xpath=(//*/div[#class="BarChart"]/*[name()="svg"]/*[name()="svg"]/*[name()="g"]/*[name()="rect" and #class="bar bar1"])[1]#height
This documentation says
attribute_locator consists of element locator followed by an # sign and attribute name, for example "element_id#class".
so I think what I've posted above is on the right track.
You are correct in your observation that the keyword seems to removes everything after the final #. More correctly, it uses the # to separate the element locator from the attribute name, and does this by splitting the string at that final # character.
No amount of escaping will solve the problem as the code isn't doing any parsing at this point. This is the exact code (as of this writing...) that performs that operation:
def _parse_attribute_locator(self, attribute_locator):
parts = attribute_locator.rpartition('#')
...
The simple solution is to drop that trailing slash, so your xpath will look like this:
//*/div[#class="BarChart"]/... and #class="bar bar1"]#height`
I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There is nearly always multiple users assigned to a unit and I need to make sure a particular user is assigned to any of the units on the table. I've used XPath for nearly all of my selectors and I'm half way there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[#id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
In this XML snippet I need to replace the data in the UID for some of the blocks. The actual file contains more than 100 similar blocks.
Although I have been able to extract subsets based on name="Track (Timeline)", I am struggling to reduce this subset to the specific block I need by also using the data in the <TrackID>, if name="Track (TimeLine)" and the text of <TrackID> is 0x1200 then set UID to xxxx.
I am new to Nokogiri and, although I write test scripts, I do not consider myself a programmer.
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
<EditRate>25/1</EditRate>
<Origin>0</Origin>
<Sequence>32-04-25-67-E7-A7-86-4A-9B-28-53-6F-66-74-65-6C</Sequence>
<TrackID>0x1200</TrackID>
<TrackName>Softel VBI Data</TrackName>
<TrackNumber>0x17010101</TrackNumber>
<UID>34-C1-B9-B9-5F-07-A4-4E-8F-F4-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
<StructuralMetadata key="06.0E.2B.34.02.53.01.01.0D.01.01.01.01.01.3B.00" length="116" name="Track (TimeLine)">
<EditRate>25/1</EditRate>
<Origin>0</Origin>
<Sequence>35-12-2D-86-E6-74-0B-4C-B4-24-53-6F-66-74-65-6C</Sequence>
<TrackID>0x1300</TrackID>
<TrackName>Softel VBI Data</TrackName>
<TrackNumber>0x0</TrackNumber>
<UID>37-0C-80-34-4C-8D-CE-41-85-F3-53-6F-66-74-65-6C</UID>
</StructuralMetadata>
Using xpath:
//StructuralMetadata
will select all StructuralMetadata elements in your XML. The double slash at the start means to select nodes wherever they appear in the document.
You don't want all the nodes though, you can filter the ones you want with a predicate:
//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]
This will select all StructuralMetadata elements that have a name attribute with the value Track (TimeLine), and a TrackID child element with contents 0x1200.
As you're interested in the UID element, you can further refine the expression:
//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]/UID
This expression will match all the UID elements that are children of StructuralMetadata elements that match the predicate described above.
Putting this to use:
require 'nokogiri'
# Parse the document, assuming xml_file is a File object containing the XML
doc = Nokogiri::XML(xml_file)
# I'm assuming there is only one element in the document that matches
# the criteria, so I'm using at_xpath
node = doc.at_xpath('//StructuralMetadata[#name="Track (TimeLine)" and TrackID="0x1200"]/UID')
# At this point, doc contains a representation of the xml, and node points to
# the UID node within that representation. We can update the contents of
# this node
node.content = 'XXX'
# Now write out the updated XML. This just writes it to standard output,
# you could write it to a file or elsewhere if needed
puts doc.to_xml
A great way to approach this problem is with the ‘map reduce’ style of programming, which works to take a large list of things and narrow it down and combine it into the result you're after. Specifically, Array#find and Array#select are really useful for this sort of problem. Check out this example:
require 'nokogiri'
xml = Nokogiri::XML.parse(File.read "sample.xml")
element = xml.css('StructuralMetadata').find { |item|
item['name'] == "Track (TimeLine)" and item.css('TrackID').text == "0x1200"
}
puts element.to_xml
This little program first uses the CSS selector to get all of the <StructuralMetadata> elements in the document. It returns an array, which we can filter to just what we want using the Array#find method. Array#select is its cousin which returns an array of all the matching objects instead of the first one it happens to find.
Inside the block we have a test to check if the <StructuralMetadata> tag is the one we’re after. Then it puts the element.to_xml string to the console so you can see which thing it found if you run this as a command-line script. Now you can find the element, you can modify it in the usual way and save out a new XML file or whatever.
I have not found any documentation nor tutorial for that. Does anything like that exist?
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with double //? Why there is /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.
Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?
td[3]/div[1]/a/text()' <--- extracts text
How can I extract other things?
Seems you need to read a XPath Tutorial
Your //table/tbody[#id="threadbits_forum_251"]/tr expression means:
// - Anywhere in your XML document
table/tbody - take a table element with a tbody child
[#id="threadbits_forum_251"] - where id attribute are equals to "threadbits_forum_251"
tr - and take its tr elements
So, basically, you need to know:
attributes begins with #
conditions go inside [] brackets
If I correcly understood that API, you can go with doc.xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/#href if there is just one <a> element.
Your XPath is correct and you seem to have answered your own question's first part (almost):
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
"the code above will get me any table table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"
// means the following element can appear anywhere in the document.
/tr at the end means, get the tr node of the matching element.
You dont need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and get the attributes using:
theNode['href']
theNode['src']
Where theNode is your Nokogiri Node object.
Edit:
Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.
doc.xpath("td[3]/div[1]/a").each do |anchor|
puts anchor['href']
puts anchor['src']
...
end