<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
Hello, I want to extract the "posted this 4 minutes ago" or just "4 minutes" using xpath. Can anybody help me? Thank you
Setting aside whitespace-only text nodes, the div whose class equals postbodytop contains three child nodes: an a element, a text node, and a span. Your path should start at the div and then select the child text node, for which the appropriate test is text().
div/text()
Of course this is just a fragment of a bigger page, so your XPath may need something at the start, e.g. /html/body/ etc. If there are other div elements at the same level as the <div class="postbodytop">, you should be more specific about the div, e.g. div[@class="postbodytop"] instead of just div in that XPath expression.
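As a rough illustration of the same idea outside a pure XPath engine, here is a minimal Python sketch using only the standard library. It assumes the fragment is well-formed XML, and the class and href values on the link are placeholders. ElementTree supports only a subset of XPath and has no text() test, so the text node that follows the a element is read from that element's tail instead.

```python
import re
import xml.etree.ElementTree as ET

# Sample fragment; the link's class and href are placeholder values.
fragment = """<div class="postbodytop">
<a class="user" href="#">tonyd</a>
posted this 4 minutes ago
<span class="hidden-xs"> </span>
</div>"""

div = ET.fromstring(fragment)

# ElementTree stores the text that follows an element in its .tail,
# which plays the role of the text node selected by div/text().
link = div.find("a")
posted = (link.tail or "").strip()

# Narrow the sentence down to just the elapsed time.
match = re.search(r"\d+\s+minutes?", posted)
elapsed = match.group(0) if match else None

print(posted)
print(elapsed)
```

With a full XPath 1.0 engine the expression div/text() from the answer above does the selection directly; the regular expression is only needed for the shorter "4 minutes" form.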
There are a number of labels, I want to specify them in xpath and then grab the text after them, example:
<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>
So in this case, I want to say "after the span named 'Variant', grab the p tag":
Result: 750ml
I tried:
//span[text()='Variant:']/following-sibling::p
and variations of this but to no avail.
The following-sibling axis selects all siblings after the current node.
The span with the text 'Variant:' has no siblings, so the search has to go up to the span's parent div and look at that div's siblings instead.
Here is an example which will work
//span[text()='Variant:']/ancestor::div[@class="info-label"]/following-sibling::div/p
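The label-to-value lookup can be sketched in Python with the standard library. This is only an approximation of the expression above: ElementTree has no ancestor or following-sibling axes, so the sketch climbs with `..` and then picks the info-content div by its class, which is an assumption about the markup rather than a translation of the original path.

```python
import xml.etree.ElementTree as ET

# The wrapping <root> element is added here only to give the row a parent.
doc = ET.fromstring("""
<root>
  <div class="info-row">
    <div class="info-label"><span>Variant:</span></div>
    <div class="info-content">
      <p>750 ml</p>
    </div>
  </div>
</root>
""")

# span -> parent (info-label) -> parent (info-row) -> info-content -> p
path = ".//span[.='Variant:']/../../div[@class='info-content']/p"
paragraph = doc.find(path)
value = paragraph.text if paragraph is not None else None

print(value)
```

The key point is the same as in the answer: the p is not a sibling of the span, so the path has to climb to a common ancestor before descending again.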
With the help of this SO question I have an almost working xpath:
//div[contains(@class, 'measure-tab') and contains(., 'someText')]
However, this matches two divs: in one, someText sits in a child td; in the other, it sits in a child span.
How do I narrow it down to the one with the span?
<div class="measure-tab">
<!-- table html omitted -->
<td> someText</td>
</div>
<div class="measure-tab"> <-- I want to select this div (and use contains @class)
<div>
<span> someText</span> <-- that contains a deeply nested span with this text
</div>
</div>
To find a div of a certain class that contains a span at any depth containing certain text, try:
//div[contains(@class, 'measure-tab') and contains(.//span, 'someText')]
That said, this solution looks extremely fragile. If the table happens to contain a span with the text you're looking for, the div containing the table will be matched, too. I'd suggest finding a more robust way of filtering the elements, for example by using IDs or top-level document structure.
You can use ancestor. I find that this is easier to read because the element you are actually selecting is at the end of the path.
//span[contains(text(),'someText')]/ancestor::div[contains(@class, 'measure-tab')]
You could use the xpath :
//div[@class="measure-tab" and .//span[contains(., "someText")]]
Input :
<root>
<div class="measure-tab">
<td> someText</td>
</div>
<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>
</root>
Output :
Element='<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>'
You can change your second condition to check only the span element:
...and contains(div/span, 'someText')]
If the span isn't always inside another div you can also use
...and contains(.//span, 'someText')]
This searches for the span anywhere inside the div.
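The same filter can be sketched in Python with the standard library. ElementTree's XPath subset has no contains() function, so the "span at any depth containing the text" test is written as a small helper; the helper name and the table markup around the td are made up for this example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<root>
  <div class="measure-tab">
    <table><tr><td> someText</td></tr></table>
  </div>
  <div class="measure-tab">
    <div>
      <span> someText</span>
    </div>
  </div>
</root>
""")

def has_span_with(div, needle):
    """True if any span below div (at any depth) contains needle."""
    return any(needle in "".join(span.itertext()) for span in div.iter("span"))

# Select by class first, then keep only divs with a matching nested span.
candidates = doc.findall(".//div[@class='measure-tab']")
matches = [div for div in candidates if has_span_with(div, "someText")]

print(len(candidates), len(matches))
```

Only the second div survives the filter, because the text in the first one lives in a td rather than a span.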
<div>
<p>BBC Radio 1</p>
<p>BBC Radio 1Xtra</p>
</div>
I want to locate the first element (containing the text BBC Radio 1) using an XPath that matches on the paragraph text. Something like: "//p[contains(text(),'BBC Radio 1')]".
However, this XPath points to both <p> nodes. Is there a way to point to the first <p> node only, using the node text, in this situation?
You can limit the result of your XPath to the first match by wrapping it in parentheses and adding the index 1:
(your_initial_xpath_here)[1]
(//p[contains(text(),'BBC Radio 1')])[1]
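To see why the substring test matches twice, here is a small Python sketch with the standard library. ElementTree does not support contains() or parenthesized expressions, so the substring check is done in Python; the exact-match predicate `[.='...']` is part of ElementTree's XPath subset and corresponds to //p[text()='BBC Radio 1'] in full XPath.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><p>BBC Radio 1</p><p>BBC Radio 1Xtra</p></div>")

# Substring test: "BBC Radio 1Xtra" also contains "BBC Radio 1",
# so both paragraphs match; taking index 0 mirrors (...)[1] in XPath.
containing = [p for p in root.iter("p") if "BBC Radio 1" in (p.text or "")]
first = containing[0]

# Exact comparison: only the first paragraph has exactly this text.
exact = root.findall("p[.='BBC Radio 1']")

print(len(containing), first.text, len(exact))
```

If the goal is really "the paragraph whose text is exactly BBC Radio 1", the exact comparison is less dependent on document order than taking the first of several matches.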
I'm trying to select the content inside a div. This div has some text inside and some additional tags. I don't want to select the first div inside it. I was trying with this selector, but it only gives me the tags, without the text:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/*[not(@class='subHeadline')]
the div that is giving me problems is this one:
<div class="viewHalfWidthSize">
.......
</div>
<div class="viewHalfWidthSize">
<div class="subHeadline firefinder-match">The Fine Print</div> <----------Except this div I want everything inside of this div!!
<strong class="firefinder-match">Validity: </strong>
Expires 27 June 2013.
<br class="firefinder-match">
<strong class="firefinder-match">Purchase: </strong>
Limit 1 per 2 people. May buy multiple as gifts.
<br class="firefinder-match">
<strong class="firefinder-match">Redemption: </strong>
Booking required online at
<a target="_blank" href="http://grouponbookings.co.uk/lautre-pied-march/" class="firefinder-match">http://grouponbookings.co.uk/lautre-pied-march/</a>
. 48-hour cancellation policy; late cancellation incurs a £30 surcharge per person.
<br class="firefinder-match">
<strong class="firefinder-match">Further information: </strong>
Valid Mon-Sun midday-2.45pm; Mon-Wed 6pm-10.45pm. Must be 18 or older, ID may be requested. Valid only on set tasting menu only; menu is dependent on market changes and seasonality and is subject to change. Max. two hours seating time. Discretionary service charge will be added to the bill based on original price. Original value verified 19 March 2013 at 9.01am.
<br class="firefinder-match">
<a target="_blank" href="http://www.groupon.co.uk/universal-fine-print" style="color: #339933;" class="firefinder-match">See the rules</a>
that apply to all deals.
</div>
The * matches element nodes and not text nodes. Try replacing * with node() to select all node types.
To break down what your XPath is doing:
You are looking anywhere in the document (//) for a div with class 'contentDealDescriptionFacts cf'.
Then you are looking for the 2nd div under that which also has the class viewHalfWidthSize. Note, this is not the 2nd div that has the class but the div that is 2nd AND has that class, so if the divs with that class are the 3rd and 4th it wouldn't match anything as the 2nd div with the class has position() = 4. If you want the 2nd viewHalfWidthSize div then you'll want [@class='viewHalfWidthSize'][position()=2].
Finally, you are returning a nodelist of all elements without the class subHeadline. If you change the * to node() then you will get a nodelist of all nodes.
The following XPath:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[not(name(.)='div' and position() = 1)]
should return what you want as long as the first child node is the div you want to ignore.
If you change it to:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[position() != count(../div[1]/preceding-sibling::node()) + 1]
then it should work regardless. It returns your nodelist, then works out how many preceding nodes there are before the first div, and checks the position isn't one greater than that (i.e. position of first div) and excludes that from the list.
As yet another alternative you could just modify your original solution but instead of doing not(@class='subHeadline') you should do
not(contains(concat(' ', @class, ' '), ' subHeadline '))
which will check if the class attribute contains subHeadline anywhere in the string on the assumption that your classes are space separated. This would then match your fragment which has the class "subHeadline firefinder-match"
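The two ideas in this answer, collecting text as well as elements and matching a class as a space-separated token, can be sketched in Python with the standard library. The fragment is shortened from the question, the br tags are self-closed so that it parses as XML, and the helper names are made up for this example. ElementTree has no node() test, so text nodes are read from each element's text and tail.

```python
import xml.etree.ElementTree as ET

fragment = ET.fromstring("""
<div class="viewHalfWidthSize">
  <div class="subHeadline firefinder-match">The Fine Print</div>
  <strong class="firefinder-match">Validity: </strong>
  Expires 27 June 2013.
  <br class="firefinder-match"/>
  <strong class="firefinder-match">Purchase: </strong>
  Limit 1 per 2 people.
</div>
""")

def has_class(elem, token):
    """Token match on a space-separated class attribute."""
    return token in elem.get("class", "").split()

def text_without(parent, skipped_class):
    """Text of parent, leaving out children that carry skipped_class."""
    parts = [parent.text or ""]
    for child in parent:
        if not has_class(child, skipped_class):
            parts.append("".join(child.itertext()))
        # The tail is text that follows the child, so it is always kept.
        parts.append(child.tail or "")
    return " ".join(" ".join(parts).split())

result = text_without(fragment, "subHeadline")
print(result)
```

Splitting the attribute on whitespace has the same effect as the concat(' ', @class, ' ') trick: subHeadline is matched as a whole token, so a class such as subHeadlineExtra would not be excluded by accident.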
I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the ending item name because the order of the items differs between the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').
Any help on how to adapt my XPath query would be greatly appreciated.
Found it!
//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]
Indeed your solution, Aleh, won't work for tags inside the text.
Now, the one remaining case is the last item, which is not followed by an element with class=header, so it will include all text found until the end of the document. Ideas?
//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work for <strong> cases. Will update my answer.
EDIT2: Used @Michiel's part. Looks like omg but works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
Seems that this should be solved with a better solution :)
For the sake of completeness, the final query, composed of various suggestions throughout the thread:
//*[
  @class='textfield' and position() = 1
]
//text()[
  preceding::*[
    @class='header' and contains(text(),'First item')
  ]
][
  following::*[
    preceding::*[
      @class='header'
    ][1][
      contains(text(),'First item')
    ]
  ]
]
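As an alternative to doing everything in one expression, the grouping can be sketched procedurally in Python with the standard library. This is not the query from the thread; it is a different approach that walks the children of the first textfield div once and assigns each piece of text to the most recent header. The function name is made up for this example, and the sketch assumes the page parses as XML.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>""")

def sections(container):
    """Map each header's text to the text that follows it in container."""
    collected = {}
    current = None
    for child in container:
        if child.get("class") == "header":
            current = (child.text or "").strip()
            collected[current] = [child.tail or ""]
        elif current is not None:
            # Inline elements such as strong or span belong to the item.
            collected[current].append("".join(child.itertext()))
            collected[current].append(child.tail or "")
    return {name: " ".join("".join(parts).split())
            for name, parts in collected.items()}

# find() returns the first matching child, i.e. the first textfield div.
items = sections(root.find("div[@class='textfield']"))
for name, text in items.items():
    print(name, "->", text)
```

Because the walk never leaves the container, the last item stops at the end of its own textfield div instead of running on into the footer, and the order of the headers does not need to be known in advance.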