I'm learning XPath with Nokogiri. The XPath is like this:
xml_doc = Nokogiri::XML(open("test.xml"))
result = xml_doc.xpath("//x:foo", 'x' => 'www.example.com')
I could get the results. But when I perform this call:
result = xml_doc.xpath("//x:node()", 'x' => 'www.example.com')
I get an error:
Nokogiri::XML::XPath::SyntaxError: Invalid expression: //x:node()
Am I doing something wrong?
Unlike with elements, you don't need a namespace prefix to match by node(). The following returns all nodes, in any namespace, just fine:
result = xml_doc.xpath("//node()")
There are several types of nodes in XPath: text nodes, comment nodes, element nodes, and so on. node() is a node test that simply returns true for any node type whatsoever. Compare it to text(), another node test, which returns true only for text nodes. (See "w3.org > XPath > Node Tests")
In my understanding, the notions of local name and namespace exist only for nodes that have names, such as elements and attributes. node() tests the node type, not the name, so using a namespace prefix with it simply doesn't make sense.
If you meant to select all elements in a specific namespace, use * instead of node():
result = xml_doc.xpath("//x:*", 'x' => 'www.example.com')
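To see the three cases side by side, here is a minimal sketch in Python with lxml (which uses the same libxml2 XPath engine as Nokogiri). The sample document and the absolute namespace URI are assumptions made up for illustration, not taken from the question's test.xml.

```python
from lxml import etree

# Hypothetical sample document; the namespace URI is a stand-in.
XML = """<root xmlns:x="http://www.example.com">
  <x:foo>in namespace</x:foo>
  <bar>no namespace</bar>
  <!-- a comment -->
</root>"""

doc = etree.fromstring(XML)
ns = {"x": "http://www.example.com"}

# Name test with a prefix: only elements in that namespace.
elements = doc.xpath("//x:*", namespaces=ns)
print([el.tag for el in elements])

# Node-type test: elements, text nodes, and comments alike.
all_nodes = doc.xpath("//node()")
print(len(all_nodes))

# A prefix on a node-type test is not valid XPath, so the engine rejects it.
rejected = False
try:
    doc.xpath("//x:node()", namespaces=ns)
except etree.XPathError as exc:
    rejected = True
    print("rejected:", type(exc).__name__)
```

The prefixed wildcard behaves like any other name test, while node() never takes a prefix.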
While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(@class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your XPath to the following to select the div element instead of its text() child node:
country = tree.xpath("//div[contains(., 'Australia')]")
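If you would rather keep the text() query, another option is to ask the text result for its owning element. The sketch below assumes a tiny stand-in page (the HTML string is made up so it runs offline) and relies on the getparent() method that lxml attaches to string results returned from XPath.

```python
from lxml import etree

# Hypothetical stand-in for the wine page, so the sketch runs offline.
HTML = """<html><body>
  <div class="row wine-attributes"><div>Australia</div></div>
</body></html>"""

tree = etree.fromstring(HTML, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)

country = tree.xpath("//div/text()[contains(., 'Australia')]")
text_result = country[0]

# The text result is a string subclass, not an element,
# but it remembers which element it came from.
parent = text_result.getparent()
path = tree_struct.getpath(parent)
print(path)  # /html/body/div/div
```

getpath() only accepts elements, so the path you get is that of the div holding the text, not of the text node itself.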
I have many statements like this in my test.xml file
<House name="bla"><Room id="bla" name="black" ></Room></House>
How do I print all Rooms with name="black"? I am using a CSS selector, but only House and Room attributes are taken by the selector.
I started by trying to print all names, whether on House or Room:
nodes = doc.css("name")
But it gives null as the output, so I am not able to proceed.
In CSS you have a syntax for matching elements by an attribute key-val pair:
nodes = doc.css("[name='black']")
For future reference, you can also chain attribute selectors:
nodes = doc.css(".my-class[name='black'][foo='bar']")
Or omit the val and match any element where the attribute is present:
nodes = doc.css("[name]")
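For comparison, the same attribute matches can be written as XPath. This is only a sketch in Python with lxml; the Houses wrapper element and the second House are assumptions added so the difference between the queries is visible.

```python
from lxml import etree

# Hypothetical sample; only the first House line comes from the question.
XML = """<Houses>
  <House name="bla"><Room id="bla" name="black"></Room></House>
  <House name="black"><Room id="r2" name="white"></Room></House>
</Houses>"""

doc = etree.fromstring(XML)

# Any element with name="black", roughly css("[name='black']").
any_black = doc.xpath("//*[@name='black']")
print([el.tag for el in any_black])  # ['Room', 'House']

# Only Room elements with name="black", roughly css("Room[name='black']").
black_rooms = doc.xpath("//Room[@name='black']")
print([el.get("id") for el in black_rooms])  # ['bla']

# Any element that has a name attribute at all, roughly css("[name]").
with_name = doc.xpath("//*[@name]")
print(len(with_name))  # 4
```

Putting the element name in front of the attribute test is what restricts the match to Rooms.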
I have a webpage whose DOM structure I do not know, but I know the text I need to find in that page. To get its XPath, I do this:
doc = Nokogiri::HTML(webpage)
path = []
doc.traverse do |node|
  # Record the XPath of every text node whose content matches exactly
  path << node.path if node.text? && node.content == "my text"
end
puts path
Now suppose I get output like this:
html/body/div[4]/div[8]/div/div[38]/div/p/text()
so that later, when I access this webpage again, I can do this:
doc.xpath("#{path[0]}")
instead of traversing the whole DOM tree every time I want the text.
I want to do some further processing. For that, I need to know which of the element nodes in the above XPath have attributes, and what their attribute values are. How would I achieve that? The output I want is:
#=> output desired
{ p => p_attr_value , div => div_attr_value , div[38] => div[38]_attr_value.....so on }
I have no problem finding the node where "my text" lies. I wanted the full XPath of the "my text" node, which is why I did the whole traversal. Now that I have the full XPath, I want the attributes of each element node I pass through on the way to the "my text" node.
Constraint: I can't use any of the developer tools available in a web browser.
PS: I am a newbie in Ruby and Nokogiri.
To select all attributes of an element that is selected using the XPath expression someExpr, you need to evaluate a new XPath expression:
someExpr/@*
where someExpr must be substituted with the real XPath expression used to select the particular element.
This selects all attributes of all elements (we assume there is just one) selected by the XPath expression someExpr.
For example, if the element we want is selected by:
/a/b/c
then all of its attributes are selected by:
/a/b/c/@*
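To collect the attributes of every element along the path, one possible approach is to walk up from the matched text node. Here is a rough sketch in Python with lxml rather than Nokogiri; the HTML snippet and the attrs_by_path name are made up for illustration.

```python
from lxml import etree

# Hypothetical page containing the text we are looking for.
HTML = """<html><body>
  <div id="main" class="wrap"><p class="note">my text</p></div>
</body></html>"""

root = etree.fromstring(HTML, parser=etree.HTMLParser())
tree = etree.ElementTree(root)

# Attribute values of a single element, selected with @*.
values = root.xpath("/html/body/div/p/@*")
print(values)  # ['note']

# Find the text node, then step to the element that owns it.
text_node = root.xpath("//text()[. = 'my text']")[0]
element = text_node.getparent()

# Map the path of the element and of each ancestor to its attributes.
attrs_by_path = {}
for el in [element, *element.iterancestors()]:
    attrs_by_path[tree.getpath(el)] = dict(el.attrib)

for path, attrs in attrs_by_path.items():
    print(path, attrs)
```

Elements without attributes show up with an empty dict, so you can filter them out if you only care about the ones that have any.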
How can I get H1,H2,H3 contents in one single xpath expression?
I know I could do this.
//html/body/h1/text()
//html/body/h2/text()
//html/body/h3/text()
and so on.
Use:
/html/body/*[self::h1 or self::h2 or self::h3]/text()
The following expression is incorrect:
//html/body/*[local-name() = "h1"
or local-name() = "h2"
or local-name() = "h3"]/text()
because local-name() ignores the namespace, so it may also select text nodes that are children of unwanted:h1, different:h2, someWeirdNamespace:h3.
Another recommendation: Always avoid using // when the structure of the XML document is statically known. Using // most often results in significant inefficiencies because it causes the complete document (sub)tree rooted in the context node to be traversed.
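The difference between the two predicates only shows up when a namespaced element is present. A small sketch in Python with lxml, using a made-up document and a hypothetical "other" namespace:

```python
from lxml import etree

# Hypothetical document with one heading in a foreign namespace.
XML = """<html xmlns:other="http://example.com/other">
  <body>
    <h1>Title</h1>
    <other:h2>Foreign heading</other:h2>
    <h3>Subsection</h3>
    <p>Paragraph</p>
  </body>
</html>"""

doc = etree.fromstring(XML)

# self:: compares the full name, namespace included.
by_self = doc.xpath("/html/body/*[self::h1 or self::h2 or self::h3]/text()")
print(by_self)  # ['Title', 'Subsection']

# local-name() drops the namespace, so other:h2 slips through.
by_local = doc.xpath(
    "/html/body/*[local-name() = 'h1' or local-name() = 'h2'"
    " or local-name() = 'h3']/text()"
)
print(by_local)  # ['Title', 'Foreign heading', 'Subsection']
```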
I'm pretty confused about this one. Given the following xml:
<sch:eventList>
<sch:event>
<sch:eventName>Event One</sch:eventName>
<sch:locationName>Location One</sch:locationName>
</sch:event>
<sch:event>
<sch:eventName>Event Two</sch:eventName>
<sch:locationName>Location Two</sch:locationName>
</sch:event>
</sch:eventList>
When using JDOM with the following code:
XPath eventNameExpression = XPath.newInstance("//sch:eventName");
XPath eventLocationExpression = XPath.newInstance("//sch:locationName");
XPath eventExpression = XPath.newInstance("//sch:event");
List<Element> elements = eventExpression.selectNodes(requestElement);
for(Element e: elements) {
System.out.println(eventNameExpression.valueOf(e));
System.out.println(eventLocationExpression.valueOf(e));
}
The console shows this:
Event One
Location One
Event One
Location One
What am I missing?
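Don't use '//': a path that starts with it always searches from the root node, whatever the context node is, so valueOf returns the first match in the whole document every time. Use e.g. './sch:eventName', which is relative to the current node.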
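The same effect can be reproduced outside JDOM. Below is a minimal sketch in Python with lxml; the namespace URI is an assumption, since the question does not show the sch declaration.

```python
from lxml import etree

# The namespace URI is hypothetical; the question omits the declaration.
XML = """<sch:eventList xmlns:sch="http://example.com/sch">
  <sch:event>
    <sch:eventName>Event One</sch:eventName>
    <sch:locationName>Location One</sch:locationName>
  </sch:event>
  <sch:event>
    <sch:eventName>Event Two</sch:eventName>
    <sch:locationName>Location Two</sch:locationName>
  </sch:event>
</sch:eventList>"""

ns = {"sch": "http://example.com/sch"}
root = etree.fromstring(XML)

absolute = []
relative = []
for event in root.xpath("//sch:event", namespaces=ns):
    # '//' restarts at the document root, so the first eventName wins each time.
    absolute.append(event.xpath("string(//sch:eventName)", namespaces=ns))
    # './' starts at the current event element.
    relative.append(event.xpath("string(./sch:eventName)", namespaces=ns))

print(absolute)  # ['Event One', 'Event One']
print(relative)  # ['Event One', 'Event Two']
```

string() takes the first node of the selected set in document order, which is why the absolute version repeats the first event.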