I would like to match the main paragraph content of the following code, omitting the child nodes p, div, h3.
<div class="content">
sunday, monday, tuesday,
<br>
<br>
wednesday, thursday,
<br>
friday, saturday
<div class ="tags">sunday</div>
<h3>Days</h3>
<p>....</p>
<div class="style">monday to friday</div>
</div>
I tried XPaths like //div[@class="content"]/*[not(self::p)] and //div[@class="content"]/*[not(name()="p")], but neither of them works. Then I tried //div[@class="content"]/node()[not(div)] and //div[@class="content"]/node()[not(h3)], but they only matched the first text.
I need the text below
sunday, monday, tuesday,
<br>
<br>
wednesday, thursday,
<br>
friday, saturday
by omitting the children div class ="tags", h3, p, div class = style.
This should do the trick:
//div[@class="content"]/*[not(self::p) and not(self::h3) and not(self::div)]|//div[@class="content"]/text()
Explanation:
//div[@class="content"] selects the node in question.
*[not(self::p) and not(self::h3) and not(self::div)] selects its child elements while omitting h3, p, and div
(or, instead of excluding every div, use and not(self::div[@class="style"]) and not(self::div[@class="tags"]) if you only need to filter out div class="tags" and div class="style").
|//div[@class="content"]/text() then forms the union with the div's own text() child nodes.
Actually, this is a bit complicated. You may be better off just selecting the text, or doing some DOM manipulation on the node.
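As a rough illustration of the "DOM manipulation" route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes the fragment has been made well-formed (br written as <br/>), and the helper name main_text is made up for this example. ElementTree stores the text before the first child in .text and the text after each child in that child's .tail, so collecting those gives the same strings as div/text() without touching the contents of p, h3, or the nested divs.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is well-formed XML (<br/> instead of <br>),
# otherwise the stdlib parser rejects it.
FRAGMENT = """<div class="content">
sunday, monday, tuesday,
<br/>
<br/>
wednesday, thursday,
<br/>
friday, saturday
<div class="tags">sunday</div>
<h3>Days</h3>
<p>....</p>
<div class="style">monday to friday</div>
</div>"""


def main_text(content):
    """Return the non-blank text nodes that are direct children of `content`."""
    # Text before the first child element.
    chunks = [content.text or ""]
    # Text that follows each child element; the child's own text is not read,
    # so the contents of p, h3 and the nested divs are left out.
    for child in content:
        chunks.append(child.tail or "")
    return [c.strip() for c in chunks if c.strip()]


root = ET.fromstring(FRAGMENT)
lines = main_text(root)
print(lines)
```

The br elements are dropped here; if they matter, they can be re-inserted between the collected strings.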
<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
Hello, I want to extract the "posted this 4 minutes ago" or just "4 minutes" using xpath. Can anybody help me? Thank you
The div whose class equals postbodytop contains three child nodes (ignoring whitespace-only text): an a element, a text node, and a span. Your path should start at the div and then select the child text node, for which the appropriate test is text().
div/text()
Of course this is just a fragment of a bigger page, so your XPath may need something at the start, e.g. /html/body/ etc. If there are other div elements at the same level as the <div class="postbodytop">, then you should be more specific about the div, e.g. div[@class="postbodytop"] instead of just div in that XPath expression.
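For the "or just 4 minutes" part of the question, a small Python sketch follows. It uses the standard library's ElementTree, whose XPath subset has no text() test, so the text node after the a element is read from that element's .tail instead. The surrounding quotes are assumed to be literally present in the page, as shown in the question, and the regular expression is only one guess at the shape of the timestamp.

```python
import re
import xml.etree.ElementTree as ET

FRAGMENT = """<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>"""

div = ET.fromstring(FRAGMENT)

# The text node that follows <a> is stored as the tail of <a>.
raw = div.find("a").tail

# Drop surrounding whitespace and the literal quote characters.
posted = raw.strip().strip('"').strip()

# Assumption: the age is always "<number> <unit> ago".
match = re.search(r"(\d+\s+\w+)\s+ago", posted)
age = match.group(1) if match else None

print(posted)
print(age)
```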
I would like to show an example.
This is how the page looks:
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Hello</span>
</div>
</a>
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Pick Delivery Location</span>
</div>
</a>
I want to select anchor tags that have a child (direct or non-direct) span that has the text 'Hello'.
Right now, I do something like this:
//a[@class='aclass'][div/span[text() = 'Hello']]
I want to be able to select without having to select direct children (div in this case), like this:
//a[@class='aclass'][//span[text() = 'Hello']]
However, the second one finds all the anchor tags with the class 'aclass' rather than the one with the span with 'Hello' text.
I hope I worded my question clearly. Please feel free to edit if necessary.
In your attempt, // goes back to the root of the document. Effectively you are saying "Give me the a elements for which there is a span with the text 'Hello' anywhere in the document", which is why you get them all.
What you need is the descendant axis:
//a[@class='aclass' and descendant::span[text() = 'Hello']]
Note I have joined the conditions with and, but two separate conditions would also work.
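To make the difference concrete, here is a small Python sketch with the standard library's ElementTree. Its XPath subset does not support the descendant axis inside predicates, so the two span tests are written as plain Python: one searches the whole document (what //span does inside the predicate), the other searches only below each a element (what descendant::span does). The body wrapper and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<a class="aclass">
  <div class="divclass"></div>
  <div id="innerclass"><span class="spanclass">Hello</span></div>
</a>
<a class="aclass">
  <div class="divclass"></div>
  <div id="innerclass"><span class="spanclass">Pick Delivery Location</span></div>
</a>
</body>"""

root = ET.fromstring(PAGE)
anchors = root.findall(".//a[@class='aclass']")

# Mirrors [//span[text() = 'Hello']]: the test ignores the current <a>
# and asks whether such a span exists anywhere in the document.
doc_wide = [a for a in anchors
            if any(s.text == "Hello" for s in root.iter("span"))]

# Mirrors [descendant::span[text() = 'Hello']]: the test only looks
# at spans below the current <a>.
scoped = [a for a in anchors
          if any(s.text == "Hello" for s in a.iter("span"))]

print(len(doc_wide), len(scoped))
```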
There are a number of labels. I want to specify them in XPath and then grab the text after them, for example:
<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>
So in this case, I want to say "after the span named 'Variant:', grab the p tag":
Result: 750 ml
I tried:
//span[text()='Variant:']/following-sibling::p
and variations of this but to no avail.
The following-sibling axis selects all siblings after the current node.
The span with the text 'Variant:' has no siblings, so the correct approach is to search the siblings of the span's parent.
Here is an example which will work
//span[text()='Variant:']/ancestor::div[@class="info-label"]/following-sibling::div/p
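The same "go up to the common container, then down to the value" idea can be sketched in Python with the standard library's ElementTree. That module has no ancestor or following-sibling axes, so this version starts from each info-row and checks the label and the content from there. The helper name value_for is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>"""


def value_for(root, label):
    """Return the <p> text of the info-row whose label span equals `label`."""
    # iter() also visits `root` itself, so a lone info-row works too.
    for row in root.iter("div"):
        if row.get("class") != "info-row":
            continue
        # Label and content are siblings inside the row, not inside the span.
        if row.findtext("div[@class='info-label']/span") == label:
            return row.findtext("div[@class='info-content']/p")
    return None


root = ET.fromstring(FRAGMENT)
print(value_for(root, "Variant:"))
```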
I'm trying to select the content inside a div. This div has some text inside and some additional tags. I don't want to select the first div inside. I was trying with this selector, but it only gives me the tags, without the text:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/*[not(@class='subHeadline')]
the div that is giving me problems is this one:
<div class="viewHalfWidthSize">
.......
</div>
<div class="viewHalfWidthSize">
<div class="subHeadline firefinder-match">The Fine Print</div> <----------Except this div I want everything inside of this div!!
<strong class="firefinder-match">Validity: </strong>
Expires 27 June 2013.
<br class="firefinder-match">
<strong class="firefinder-match">Purchase: </strong>
Limit 1 per 2 people. May buy multiple as gifts.
<br class="firefinder-match">
<strong class="firefinder-match">Redemption: </strong>
Booking required online at
<a target="_blank" href="http://grouponbookings.co.uk/lautre-pied-march/" class="firefinder-match">http://grouponbookings.co.uk/lautre-pied-march/</a>
. 48-hour cancellation policy; late cancellation incurs a £30 surcharge per person.
<br class="firefinder-match">
<strong class="firefinder-match">Further information: </strong>
Valid Mon-Sun midday-2.45pm; Mon-Wed 6pm-10.45pm. Must be 18 or older, ID may be requested. Valid only on set tasting menu only; menu is dependent on market changes and seasonality and is subject to change. Max. two hours seating time. Discretionary service charge will be added to the bill based on original price. Original value verified 19 March 2013 at 9.01am.
<br class="firefinder-match">
<a target="_blank" href="http://www.groupon.co.uk/universal-fine-print" style="color: #339933;" class="firefinder-match">See the rules</a>
that apply to all deals.
</div>
The * matches element nodes and not text nodes. Try replacing * with node() to select all node types.
To break down what your XPath is doing:
You are looking anywhere in the document (//) for a div with class 'contentDealDescriptionFacts cf'.
Then you are looking for the 2nd div under that which also has the class viewHalfWidthSize. Note, this is not the 2nd div that has the class but the div that is 2nd AND has that class, so if the divs with that class are the 3rd and 4th it wouldn't match anything, because the 2nd div with the class has position() = 4. If you want the 2nd viewHalfWidthSize div then you'll want [@class='viewHalfWidthSize'][position()=2].
Finally, you are returning a nodelist of all elements without the class subHeadline. If you change the * to node() then you will get a nodelist of all nodes.
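The position() point is easy to get wrong, so here is a small Python model of the two predicate forms. It does not evaluate XPath; it only imitates how a predicate sees positions. The list of child divs and their ids are invented for the illustration, matching the "3rd and 4th" case described above.

```python
# Hypothetical div children of the parent, in document order.
divs = [
    {"id": "d1", "cls": "intro"},
    {"id": "d2", "cls": "other"},
    {"id": "d3", "cls": "viewHalfWidthSize"},
    {"id": "d4", "cls": "viewHalfWidthSize"},
]


def and_position(divs, cls, n):
    """Model of div[@class=cls and position()=n]: one predicate,
    position counted among all div children."""
    return [d["id"] for i, d in enumerate(divs, start=1)
            if d["cls"] == cls and i == n]


def filter_then_position(divs, cls, n):
    """Model of div[@class=cls][position()=n]: the second predicate
    counts positions within the already filtered list."""
    matching = [d for d in divs if d["cls"] == cls]
    return [d["id"] for i, d in enumerate(matching, start=1) if i == n]


print(and_position(divs, "viewHalfWidthSize", 2))
print(filter_then_position(divs, "viewHalfWidthSize", 2))
```

With this data the single-predicate form selects nothing, while the chained form selects the second div carrying the class.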
The following XPath:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[not(name(.)='div' and position() = 1)]
should return what you want as long as the first child node is the div you want to ignore.
If you change it to:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[position() != count(../div[1]/preceding-sibling::node()) + 1]
then it should work regardless. It returns your nodelist, then works out how many preceding nodes there are before the first div, and checks the position isn't one greater than that (i.e. position of first div) and excludes that from the list.
As yet another alternative you could just modify your original solution but instead of doing not(@class='subHeadline') you should do
not(contains(concat(' ', @class, ' '), ' subHeadline '))
which will check if the class attribute contains subHeadline anywhere in the string on the assumption that your classes are space separated. This would then match your fragment which has the class "subHeadline firefinder-match"
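The padding trick can be checked outside XPath. The following Python sketch reproduces the string logic of contains(concat(' ', @class, ' '), ' subHeadline '); the function name has_class is made up for the example, and like the XPath version it assumes class names are separated by single spaces.

```python
def has_class(class_attr, name):
    """Same string test as contains(concat(' ', @class, ' '), ' name ')."""
    # Padding both sides with spaces means the name only matches as a
    # whole token, not as part of a longer class name.
    return f" {name} " in f" {class_attr} "


attr = "subHeadline firefinder-match"

print(attr == "subHeadline")                         # exact comparison, as in @class='subHeadline'
print(has_class(attr, "subHeadline"))                # padded token test
print(has_class("subHeadlineExtra", "subHeadline"))  # longer name does not match
print("subHeadline" in "subHeadlineExtra")           # a plain contains() would match here
```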
The divs below appear in that order in the HTML I am parsing.
//div[contains(@class,'top-container')]//font/text()
I'm using the xpath expression above to try to get any data in the first div below in which a hyphen is used to delimit the data:
Wednesday - Chess at Higgins Stadium
Thursday - Cook-off
The problem is I am getting data from the second div below such as:
Monday 10:00 - 11:00
Tuesday 10:00 - 11:00
How do I only retrieve the data from the first div? (I also want to exclude any elements in the first div that do not contain this hyphenated data)?
<div class="top-container">
<div dir="ltr">
<div dir="ltr"><font face="Arial" color="#000000" size="2">Wednesday - Chess at Higgins Stadium</font></div>
<div dir="ltr"><font face="Arial" size="2">Thursday - Cook-off</font></div>
<div dir="ltr"><font face="Arial" size="2"></font> </div>
<div dir="ltr"> </div>
<div dir="ltr"><font face="Arial" color="#000000" size="2"></font> </div>
</div>
<div dir="ltr">
<div RE><font face="Arial">
<div dir="ltr">
<div RE><font face="Arial" size="2"><strong>Alex Dawkin </strong></font></div>
<div RE><font face="Arial" size="2">Monday 10:00 - 11:00 </font></div>
<div RE><font size="2">Tuesday 10:00 - 11:00 </font></div>
<div RE>
<div RE><font face="Arial" size="2"></font></div><font face="Arial" size="2"></font></div>
<div RE> </div>
<div RE> </div>
Your XPath was matching on any font element that is a descendant of <div class="top-container">.
div[1] will address the first div child element of the "top-container" element. If you add that to your XPath, it will return the desired results.
//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()
If you want to ensure that only text() nodes that contain "-" are addressed, then you should also add a predicate filter to the text().
//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[contains(.,'-')]
Instead of checking only for nodes that contain "-", how would you modify the last expression to just check for non-empty strings?
If you want to return any text() node with a value, then the predicate filter on text() is not necessary. A text node always has at least one character of data, so a font element without content simply has no text() child to select.
However, if you only want to select text() nodes that contain text other than whitespace, you could use this expression:
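//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[normalize-space()]
normalize-space() strips leading and trailing whitespace and collapses internal runs of whitespace into a single space. So, if a text() node contained only whitespace, the result would be an empty string, which evaluates to false() in the predicate filter, and only text() nodes containing something other than whitespace will be selected. Note that a non-breaking space (U+00A0) is not whitespace as far as normalize-space() is concerned, so a text node holding only that character would still be selected.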
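To see which text nodes survive a [normalize-space()] predicate, here is a small Python emulation. It is a sketch of the XPath 1.0 rules, not a call into an XPath engine: whitespace is taken to be space, tab, carriage return, and line feed only, so a non-breaking space (U+00A0) is treated as ordinary content. The sample strings are invented, loosely based on the markup in the question.

```python
import re


def normalize_space(s):
    """Emulate XPath 1.0 normalize-space(): collapse runs of
    space/tab/CR/LF to one space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


# Hypothetical text node values.
texts = [
    "Wednesday - Chess at Higgins Stadium",
    "  \n ",      # whitespace only
    "\u00a0",     # a lone non-breaking space
    "Thursday - Cook-off",
]

# A predicate like [normalize-space()] keeps a node when the
# resulting string is non-empty.
kept = [t for t in texts if normalize_space(t)]
print(kept)
```

The whitespace-only string is filtered out, while the lone non-breaking space is kept; filtering it as well would need an extra step such as translate() in the XPath.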