I played around with Nokogiri in Ruby and its XML searching feature, e.g.:
a = Nokogiri.XML(File.open('a.xml'))
x = a.search('//div[@class="foo"]').text
which works quite nicely.
But how can I specify that I want to match the next sibling element on the same level (and only the next one)?
For example for this input:
<div>
<div>...</div>
<div>...</div>
<div class="foo"></div>
<div>EXTRACT ME</div>
...
</div>
The actual input is some non-XHTML HTML, but so far Nokogiri.XML does not complain.
Btw, what filter syntax does a.search actually expect? XPath?
Taking the hint from Brian Agnew and DevNull, I guess that a.search actually expects XPath syntax. Using the following-sibling axis, the following expression matches what was asked:
x = a.search('//div[@class="foo"]/following-sibling::div[1]')
I think you want XPath's following-sibling axis.
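To see what the [1] adds, here is a minimal sketch. It assumes REXML (shipped with Ruby) instead of Nokogiri so that it runs without extra gems, and the sample markup is a well-formed variant of the input above with one extra trailing div. The same XPath strings can be passed to a.search.

```ruby
require 'rexml/document'

# Well-formed sample modelled on the question, plus one extra trailing div.
xml = <<~XML
  <div>
    <div>first</div>
    <div>second</div>
    <div class="foo"></div>
    <div>EXTRACT ME</div>
    <div>not this one</div>
  </div>
XML

doc = REXML::Document.new(xml)

# following-sibling::div selects every later sibling div of the "foo" div.
all_later = REXML::XPath.match(doc, '//div[@class="foo"]/following-sibling::div')

# Adding [1] keeps only the nearest of those siblings.
nearest = REXML::XPath.first(doc, '//div[@class="foo"]/following-sibling::div[1]')

puts all_later.size
puts nearest.text
```

Without the positional predicate, every later sibling div is returned, not just the next one.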
I have HTML with div elements whose class names all contain the text menu1 and whose text is either
Berlin
or
Berlin Germany
When I use the code below, it finds ambiguous elements:
find(:xpath, "//div[contains(text(), \"Berlin\") and contains(@class, \"menu1\")]")
Note: I want both the class and the text to be in my XPath.
Suggestions will be appreciated, thanks in advance.
If by partial class name you mean something like <div class="blah menu1 other">Berlin</div> then you could just do it in a readable way with something like
find('div.menu1', exact_text: 'Berlin')
or
find('div.menu1', text: 'Berlin', exact: true)
If it's more like <div class="blah menu1_part other">Berlin</div> you can still do it with a more readable CSS selector like
find('div[class*=menu1]', exact_text: 'Berlin')
If you actually need to do it all in one XPath for performance reasons (a LOT of div.menu1 elements on the page, where you can't scope to a limited section of the page for some crazy reason) then you could do something like
find(:xpath, './/div[text()="Berlin"][contains(@class, "menu1")]')
Note the leading . in the XPath expression. 99.9% of the time when using Capybara, and manually writing your own XPath expressions, you want to start your XPath expressions with .//, otherwise you are defeating any scoping you have done - see https://github.com/teamcapybara/capybara#beware-the-xpath--trap
Another option is to use the xpath gem Capybara uses internally for generating XPaths, which would be something like
find(:xpath, XPath.css('div.menu1')[XPath.string.n.is('Berlin')], exact: true)
or
find(:xpath, XPath.css('div[class *= "menu1"]')[XPath.string.n.is('Berlin')], exact: true)
depending on exactly what you mean by partial class name. The benefit of doing it this way is that the meaning of the is method changes from contains to equals depending on the value of the exact option, and it also handles all the normalizing and escaping of strings, which matters if your strings aren't as simple as 'Berlin'.
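The difference between the substring test in the question and the exact test in the XPath answer can be checked outside Capybara. This is a rough sketch that assumes REXML (shipped with Ruby) as the XPath engine and an invented three-div sample; it says nothing about Capybara's own matching options.

```ruby
require 'rexml/document'

# Invented sample: two menu1 divs with overlapping text, one unrelated div.
xml = <<~XML
  <section>
    <div class="blah menu1 other">Berlin</div>
    <div class="blah menu1 other">Berlin Germany</div>
    <div class="unrelated">Berlin</div>
  </section>
XML

section = REXML::Document.new(xml).root

# contains(text(), ...) is a substring test, so both menu1 divs match.
loose = REXML::XPath.match(
  section, './/div[contains(text(), "Berlin")][contains(@class, "menu1")]'
)

# text()="Berlin" requires a text node that equals the string exactly.
exact = REXML::XPath.match(
  section, './/div[text()="Berlin"][contains(@class, "menu1")]'
)

puts loose.size
puts exact.size
```

Both expressions start with .// so they are evaluated relative to the section element that is passed in as the context node.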
<div class="main">
<p>Peter got some troubles.</p>
<p>I gave him my hand.</p>
<p>But Sam didn't.</p>
</div>
How can I extract all the text in div.main with XPath?
I've tried string(//div[@class="main"]/p), but it only extracted the first line:
Peter got some troubles.
But I would like to get all the lines:
Peter got some troubles.
I gave him my hand.
But Sam didn't.
The string value of the div element should give you what you want. In other words, take off the /p at the end of your XPath expression. The problem with your expression is that string() takes only the first node in the nodeset.
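If a host language is available, a different route is to select all the p nodes and take the text of each one there, which also keeps the lines separate. A minimal sketch, assuming Ruby with REXML (shipped with Ruby); this is an alternative to string(), not the approach described above.

```ruby
require 'rexml/document'

xml = <<~XML
  <div class="main">
    <p>Peter got some troubles.</p>
    <p>I gave him my hand.</p>
    <p>But Sam didn't.</p>
  </div>
XML

doc = REXML::Document.new(xml)

# Select every p node set member, then read each node's text on the Ruby side.
lines = REXML::XPath.match(doc, '//div[@class="main"]/p').map(&:text)

puts lines
```

Here lines is an array with one string per p element, so the caller decides how to join them.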
This is for XPath 1.0.
Here is an example of the markup that I am matching against. The actual number of elements is not known ahead of time and thus varies, but it follows this sort of pattern:
<div class="entry">
<p><iframe /></p>
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
<p><iframe /></p>
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
</div>
I am trying to match every <p> that does not contain an <iframe>, up until the next <p> that does contain an <iframe> or until the end of the enclosing <div> element.
To make things slightly more complicated, for specific reasons I need to use each <iframe> as the base, a la //div[@class='entry']//iframe, so that each nodeset is based from
(//div[@class='entry']//iframe)[1]
(//div[@class='entry']//iframe)[2]
...
and thus, in this case, matching
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
and
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
respectively.
I tried some of the following for testing to no avail:
(//div[@class='entry']//iframe)/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(or for testing):
(//div[@class='entry']//iframe)[1]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(//div[@class='entry']//iframe)[2]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
and some variations thereof, but what happens for the first set is that it gets all <iframe>-less <p> elements all the way to the end instead of stopping at the next <p> that contains an <iframe>.
I've been at this for a while and even though I'm usually quite handy with this sort of thing, I can't quite work my way through this one, and none of the search results from Google and such have helped.
Thanks. Any help is always appreciated.
Edit: It can be assumed that there is only one occurrence of <div class="entry"> in the document.
What you are asking for can't be done in one single XPath 1.0 expression without help. The problem is that the question you want to ask is
Starting from an element X (the p-containing-an-iframe), find the other p elements for which that element's nearest preceding p-with-an-iframe is the original node X
If we had a variable $x holding a reference to the top-level context node (the p[iframe] we're starting from) then you could say something like the following (in XPath 2.0)
following-sibling::p[not(iframe)][preceding-sibling::p[iframe][1] is $x]
XPath 1.0 doesn't have an is operator to compare node identity but there are other proxies you can use for this, for example
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe])
= (count($x/preceding-sibling::p[iframe]) + 1)]
i.e. those following p elements that have one more preceding-sibling::p[iframe] than $x has.
The nub of the problem then is how to get at the outer context node from inside the inner predicate - pure XPath 1.0 has no way to do this. In XSLT you have the current() function, but otherwise you have two basic choices:
If your XPath library allows you to provide variable bindings to your expressions, then inject a variable $x containing the context node and use the expression I've given above.
If you can't inject variables then use two separate XPath queries in sequence.
First execute the expression
count(preceding-sibling::p[iframe]) + 1
with the relevant p[iframe] as context node, and take the result as a number. Alternatively, if you're already iterating over these p[iframe] elements in your host language, just take the iteration number from there directly; you don't need to count it up using XPath. Either way, you can then build a second expression dynamically:
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe]) = N]
(where N is the result of the first expression/iteration counter) and evaluate that with the same context node, taking the final result as a node set.
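The two-query variant can be sketched as follows. This assumes Ruby with REXML (shipped with Ruby) as the host environment, which the question does not specify, and uses the iteration index as N instead of running the count expression separately. The names markers and groups are invented for the example.

```ruby
require 'rexml/document'

xml = <<~XML
  <div class="entry">
    <p><iframe /></p>
    <p>Text 1</p>
    <p>Text 2</p>
    <p>Test 3</p>
    <p><iframe /></p>
    <p><a>Test 4</a><br /><a>Test 5</a></p>
  </div>
XML

doc = REXML::Document.new(xml)

# Step 1: the p elements that contain an iframe, in document order.
markers = REXML::XPath.match(doc, '//div[@class="entry"]/p[iframe]')

# Step 2: for the N-th marker, keep the following iframe-less p elements
# that have exactly N preceding p[iframe] siblings.
groups = markers.each_with_index.map do |marker, i|
  n = i + 1
  expr = "following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe]) = #{n}]"
  REXML::XPath.match(marker, expr)
end

groups.each { |group| puts group.size }
```

Each entry of groups holds the p elements that belong to one iframe, so the first group stops at the second p[iframe] instead of running to the end of the div.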
I'm not sure I understood completely, but sometimes it helps to comment on an attempted solution rather than trying to explain.
Please try the following XPath expression:
//div[@class='entry']//iframe//p[not(descendant::iframe)]
And let me know if this yields the correct result.
If not,
explain how the result differs from what you need
please show a more complete HTML sample: a reasonable document with multiple div elements, and more than one where div[@class = 'entry'] - and otherwise covering all the complexity you describe.
explain why you added [1] and [2] to your expressions
give more details about the platform you're using XPath with, perhaps post code
I've looked around and can't seem to find the answer for this.
Very simplified:
<a>
<b>
<div class=label>
<label="here"/>
</div>
</b>
<div id="something">
<b>
<div class=label>
<label="here"/>
</div>
</div>
</a>
so I'm trying to grab the second "here" label. What I want to do is use the id to get to the "something" part
//.[@id="something"]
and then from that point search for the label with something like
//.[@class="label" and label="here"]
But from reading a few other replies it doesn't appear that something like
//.[@id="something"][@class="label" and label="here"]
works, and I was wondering whether I'm missing something about how it works. I know I can get the above really simply with another method; it's just an example to ask how to apply two predicates one after the other (if that is indeed possible).
Thanks!
I think you need something like this instead:
//.[@id="something"]//.[@class="label" and label="here"]
The point is that // means: select nodes in the document from the current node that match the selection, no matter where they are.
ref : http://www.w3schools.com/xpath/xpath_syntax.asp
The syntax //*[@x='y'] is more idiomatic than //.[@x='y'], probably because it's valid in both XPath 1.0 and XPath 2.0, whereas the latter is only allowed in XPath 2.0. Disallowing predicates after "." was probably an accidental restriction in XPath 1.0, and I think some implementations may have relaxed the restriction, but it's there in the spec: a predicate can only follow either a NodeTest or a PrimaryExpr, and "." is neither.
In XPath 2.0, //* selects all element nodes in the tree, while //. selects all nodes of all kinds (including the document root node), but in this example the effect is the same because the predicate [@x='y'] can only be matched by an element node (for all other node kinds, @x selects nothing and therefore cannot be equal to anything).
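A small sketch of why two predicates on one step differ from two steps. It assumes Ruby with REXML (shipped with Ruby), and the markup is a well-formed stand-in for the simplified sample in the question: the label value is modelled as element text, which is a guess at what the question meant.

```ruby
require 'rexml/document'

# Well-formed stand-in for the question's markup (assumed shape).
xml = <<~XML
  <a>
    <b>
      <div class="label"><label>here</label></div>
    </b>
    <div id="something">
      <b>
        <div class="label"><label>here</label></div>
      </b>
    </div>
  </a>
XML

doc = REXML::Document.new(xml)

# Two predicates on ONE step must both hold for the same element: no match,
# because the element with id="something" has no class attribute.
same_step = REXML::XPath.match(doc, '//*[@id="something"][@class="label"][label="here"]')

# A second // step searches the descendants of the id="something" element.
scoped = REXML::XPath.match(doc, '//*[@id="something"]//*[@class="label"][label="here"]')

# Without the id step, both label divs are found.
everywhere = REXML::XPath.match(doc, '//*[@class="label"][label="here"]')

puts same_step.size
puts scoped.size
puts everywhere.size
```

Chaining predicates is valid XPath; it just narrows the same node set further rather than descending into it.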
I'm trying to extract elements with an attribute, and not extract the descendants separately that have the same attribute.
Using the following html:
<html><body>
<div box>
some text
<div box>
some more text
</div>
</div>
<div box>
this needs to be included as well
</div>
</body></html>
I want to be able to extract the two outer <div box> and its descendants including the inner <div box>, but don't want to have the inner <div box> extracted separately.
I have tried using all sorts of different expressions but think I am missing something quite fundamental. The main expression I have been trying is //[@box and not(ancestor::@box)], but this still returns three elements.
I am trying to do this using the 'Hpricot' (0.8.3) Gem in Ruby 1.9.2 as follows:
# Assuming html is set to the html above
doc = Hpricot(html)
elements = doc.search('//[@box and not(ancestor::@box)]')
# The following is returning 3 instead of 2
elements.size
Any help on this would be great.
Your XPath is invalid. You have to address something in order to use a predicate filter (i.e. []). Otherwise, there isn't anything to filter.
This XPath works:
//div[@box and not(ancestor::div/@box)]
If the elements aren't all guaranteed to be <div>, you can use a more generic match for elements:
//*[@box and not(ancestor::*/@box)]
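The generic expression can be tried out with a short sketch. It assumes REXML (shipped with Ruby) rather than Hpricot, and because REXML parses XML, the bare box attribute from the question is written as box="box" here.

```ruby
require 'rexml/document'

# XML requires attribute values, so the bare `box` attribute becomes box="box".
xml = <<~XML
  <html><body>
    <div box="box">
      some text
      <div box="box">some more text</div>
    </div>
    <div box="box">this needs to be included as well</div>
  </body></html>
XML

doc = REXML::Document.new(xml)

# Every element carrying the attribute, nested ones included.
all_boxes = REXML::XPath.match(doc, '//*[@box]')

# Only elements with the attribute that have no ancestor with it.
outer_boxes = REXML::XPath.match(doc, '//*[@box and not(ancestor::*/@box)]')

puts all_boxes.size
puts outer_boxes.size
```

The inner div is still part of the first outer element's subtree; it is just no longer returned as a separate result.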
Using elements = doc.search('//[@box and not(ancestor::@box)]') isn't correct.
Use elements = doc.at('//div[@box]') which will find the first occurrence.
I'd recommend using Nokogiri over Hpricot. Nokogiri is well supported, very flexible and more robust.
EDIT: Added because original question changed:
Thanks, that worked perfectly, except I forgot to mention that I want to return multiple outer elements. Sorry about that, I have updated the question. I will look into Nokogiri further; I didn't choose it originally because Hpricot seemed more approachable.
Remember that XPath acts like accessing a file in a directory at its simplest form, so you can drill down and search in "subdirectories". If you only want the outer <div> tags, then look inside the <body> level and no further:
doc.search('/html/body/div')
or, if you might have unadorned div tags along with the targets:
doc.search('/html/body/div[@box]')
Regarding Hpricot seeming more approachable:
Nokogiri implements a superset of Hpricot's accessors, allowing you to drop it into place for most uses. It supports XPath and CSS accessors allowing more intuitive ways of getting at data if you live in CSS and HTML and don't grok XPath. In addition there are many methods to find your desired target:
doc.search('body > div[box]')
(doc / 'body > div[box]')
doc.css('body > div[box]')
Nokogiri supports the at and % synonym found in Hpricot also, along with css_at, if you only want the first occurrence of something.
I started using Nokogiri after running into some situations where Hpricot exploded because it couldn't handle malformed news feeds in the wild.