I’m using Rails 4.2.7 with Nokogiri. I’m having trouble creating a child node. I have the following code:
general = doc.xpath("//lomimscc:general")
description = Nokogiri::XML::Node.new "lomimscc:description", doc
string = Nokogiri::XML::Node.new "lomimscc:string", doc
string.content = scenario.abstract
string['language'] = 'en'
description << string
general << description
I want the “description” element to be a child element of the “general” element (and similarly I want the “string” element to be a child of the “description” element). However what is happening is that the description element is appearing as a sibling of the general element. How do I make the element appear as a child instead of a sibling?
The tutorials show how to do this in "Creating new nodes", but the simple example is:
require 'nokogiri'
doc = Nokogiri::XML('<root/>')
doc.at('root').add_child('<foo/>')
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo/>\n</root>\n"
Nokogiri makes it easy to build nodes using a string that contains the markup or nodes you want to add.
You should be able to build upon this easily.
This is also noted throughout the Node documentation any place you see "node_or_tags".
When I changed
general = doc.xpath("//lomimscc:general")
to
general = doc.xpath("//lomimscc:general").first
then everything worked as far as creating child nodes.
While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(@class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")
I'm trying to scrape the cell values from an HTML table. Randomly, some of these cells are empty, and I can't guess which ones with any reliability.
Is there a way to fill a default value in for Nokogiri when it comes across an empty cell?
Thanks for any advice you can provide. Here's my code:
def scrape_stats
  stats = []
  (2002..2012).each do |year|
    url = "website/#{year}"
    doc = Nokogiri::HTML(open(url))
    rows = doc.at_css("body tbody").text.split(" ")
    (rows.count / 25).times do |i| # there are 25 columns per row
      stats << rows.shift(25)
    end
  end
  stats # return the collected rows
end
It sounds like you want something like:
doc.search('td:empty').each{|n| n.content = 'default value'}
This would basically involve using the Nokogiri::XML::Node#add_child method (or the shorter version, Nokogiri::XML::Node#<<) to add a new child node containing the text you want to add to the empty cell.
See this question for an example:
How to add child nodes in NodeSet using Nokogiri
I'm trying to add a few elements to an already existing XML document. The following code is successful at adding the desired nodes and content, however it doesn't format the inserted elements. All the added elements end up on one line instead of with line breaks and indentations after each element.
Any suggestions about how I could add this formatting?
The code is:
doc.xpath("//tei:div[@xml:id='versionlog']", {"tei" => "http://www.tei-c.org/ns/1.0"}).each do |node|
  new_entry = Nokogiri::XML::Node.new "div", doc
  new_entry["xml:id"] = "v_#{ed_no}"
  head = Nokogiri::XML::Node.new "head", doc
  head.content = "Description of changes for #{ed_no}"
  new_entry.add_child(head)
  para = Nokogiri::XML::Node.new "p", doc
  para.content = "#{version_description}"
  new_entry.add_child(para)
  node.add_child(new_entry)
end
Why is it important that the XML not be on one line? It's purely cosmetic having "pretty-printed" XML, and not required by the XML spec or the parser when the XML is reloaded. Personally, I'd recommend having no formatting for your transfer speed and reduced disk size, but YMMV.
You can either run the XML through an XML beautifier, or play a game with Nokogiri along the lines of:
new_entry.add_child(para.to_xml + "\n")
The line break will be added as a text node between the tags, but it's benign and not significant to XML's ability to deliver its payload.
If you insist, "How do I pretty-print HTML with Nokogiri?" describes how to get there.
I have a webpage whose DOM structure I do not know, but I know the text I need to find in it. To get its XPath, I do this:
doc = Nokogiri::HTML(webpage)
path = [] # collects the XPath of every matching text node
doc.traverse { |node|
  if node.text?
    if node.content == "my text"
      path << node.path
    end
  end
}
puts path
Now suppose I get an output like:
html/body/div[4]/div[8]/div/div[38]/div/p/text()
so that later, when I access this webpage again, I can do this:
doc.xpath("#{path[0]}")
instead of traversing the whole DOM tree every time I want the text.
I want to do some further processing. For that, I need to know which of the element nodes in the above XPath output have attributes associated with them, and what their attribute values are. How would I achieve that? The output that I want is:
#=> output desired
{ p => p_attr_value , div => div_attr_value , div[38] => div[38]_attr_value.....so on }
I am not having trouble finding the node where "my text" lies. I wanted the full XPath of the "my text" node, which is why I did the whole traversal. Now that I have the full XPath, I want the attributes associated with each element node that I passed through on the way to the "my text" node.
One constraint: I can't use any of the developer tools available in a web browser.
PS: I am a newbie in Ruby and Nokogiri.
To select all attributes of an element that is selected using the XPath expression someExpr, you need to evaluate a new XPath expression:
someExpr/@*
where someExpr must be substituted with the real XPath expression used to select the particular element.
This selects all attributes of all elements (we assume there is just one) that are selected by the XPath expression someExpr.
For example, if the element we want is selected by:
/a/b/c
then all of its attributes are selected by:
/a/b/c/@*
I'm pretty confused about this one. Given the following xml:
<sch:eventList>
<sch:event>
<sch:eventName>Event One</sch:eventName>
<sch:locationName>Location One</sch:locationName>
</sch:event>
<sch:event>
<sch:eventName>Event Two</sch:eventName>
<sch:locationName>Location Two</sch:locationName>
</sch:event>
</sch:eventList>
When I query it with JDOM using the following code:
XPath eventNameExpression = XPath.newInstance("//sch:eventName");
XPath eventLocationExpression = XPath.newInstance("//sch:locationName");
XPath eventExpression = XPath.newInstance("//sch:event");
List<Element> elements = eventExpression.selectNodes(requestElement);
for (Element e : elements) {
    System.out.println(eventNameExpression.valueOf(e));
    System.out.println(eventLocationExpression.valueOf(e));
}
The console shows this:
Event One
Location One
Event One
Location One
What am I missing?
Don't use '//': it always starts searching at the root node, regardless of the context node you pass in. Use a relative path such as './sch:eventName', which is evaluated relative to the current node.