XPATH Exclude a class or start scraping after it - xpath

<p class="region-list"><a class="parent xh-highlight" href="/mediterranean-yacht-charters-1548.htm" title="Mediterranean Yachts for Charter - Summer">Mediterranean</a>Croatia, Italy, Montenegro</p>
I have this query:
//div[@class='hide-for-small']/div/div/div/div[1]/div/div/div/p[@class='region-list']/a
Which returns:
Mediterranean
Croatia
Italy
Montenegro
However, I want to exclude the parent, which is "Mediterranean", so I want to either:
1) Skip the first <a> and grab the rest, OR 2) exclude the <a class="parent">.
I have been wrestling with [@class!="parent"] but can't seem to get it to work.

Actually, you can do both:
Skip the first <a> node:
//p[@class='region-list']/a[position()>1]/text()
or skip the <a> node with the specific class attribute value:
//p[@class='region-list']/a[not(@class='parent xh-highlight')]/text()
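Both expressions can be checked with a short script. This is a minimal sketch, assuming the third-party lxml library is installed and that each region sits in its own <a> element; the sample markup is a hypothetical stand-in, not the real page.

```python
from lxml import etree

# Hypothetical markup: one "parent" link followed by three region links.
snippet = """
<p class="region-list">
  <a class="parent xh-highlight" href="#">Mediterranean</a>
  <a href="#">Croatia</a>
  <a href="#">Italy</a>
  <a href="#">Montenegro</a>
</p>
"""
doc = etree.fromstring(snippet)

# Option 1: skip the first <a> child by position.
by_position = doc.xpath("//p[@class='region-list']/a[position()>1]/text()")

# Option 2: skip the <a> whose class attribute equals the full string.
by_class = doc.xpath(
    "//p[@class='region-list']/a[not(@class='parent xh-highlight')]/text()"
)

print(by_position)
print(by_class)
```

The second form compares the entire attribute value, so it only excludes the link while its class is exactly 'parent xh-highlight'.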

Related

XPATH - grab content of div after named element

There are a number of labels, I want to specify them in xpath and then grab the text after them, example:
<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>
So in this case, I want to say "after the span named 'Variant', grab the p tag":
Result: 750ml
I tried:
//span[text()='Variant:']/following-sibling::p
and variations of this but to no avail.
The following-sibling axis selects all siblings after the current node.
The span with the text 'Variant:' has no siblings, so the search has to start from the span's parent div instead.
Here is an example which will work
//span[text()='Variant:']/ancestor::div[@class="info-label"]/following-sibling::div/p
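To see the difference between the two attempts, here is a small sketch that runs both against the markup from the question. It assumes the third-party lxml library is installed; the variable names are only illustrative.

```python
from lxml import etree

snippet = """
<div class="info-row">
  <div class="info-label"><span>Variant:</span></div>
  <div class="info-content">
    <p>750 ml</p>
  </div>
</div>
"""
doc = etree.fromstring(snippet)

# The span has no siblings at all, so this selects nothing.
failed = doc.xpath("//span[text()='Variant:']/following-sibling::p")

# Climb to the label div first, then move across to its sibling div.
found = doc.xpath(
    "//span[text()='Variant:']/ancestor::div[@class='info-label']"
    "/following-sibling::div/p"
)
values = [p.text for p in found]

print(failed, values)
```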

XPath Exclude Text From Child Element

I'm looking to get the output:
50ml milk
From the following code:
<ul class="ingredients-list__group">
<li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
<div class="tooltip">
<h2
class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
<p>One of the most widely used ingredients, milk is often referred to as a complete food. While cow…</p>
</div>
</a>
</li>
</ul>
Currently I'm using the XPATH:
//ul[@class="ingredients-list__group"]/li
But getting:
50ml milk Milk mill-kOne of the most widely used ingredients, milk is often referred to as a complete food. While cow…
How do I exclude the stuff within the div/tooltip?
With xpath 2.0:
//ul[@class="ingredients-list__group"]/li/concat(./text()[1], ./a/text()[1])
With xpath 1.0:
concat(//ul[@class="ingredients-list__group"]/li/text()[1], //ul[@class="ingredients-list__group"]/li/a/text()[1])
You can select the relevant text nodes using
//ul[@class="ingredients-list__group"]//text()[not(ancestor::div[@class='tooltip'])]
If you're in XPath 2.0 you can then put this in a call of string-join() to join these into a single string. If you're stuck with 1.0, you'll have to return multiple text nodes to the calling application and concatenate them together in the host language code.
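For the XPath 1.0 case, the host-language concatenation could look roughly like this. It is a sketch that assumes Python with the third-party lxml library installed; the tooltip paragraph is shortened, and joining on single spaces after stripping whitespace is my own choice, not part of the answer.

```python
from lxml import etree

snippet = """
<ul class="ingredients-list__group">
<li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
<div class="tooltip">
<h2 class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
<p>One of the most widely used ingredients, milk is often referred to as a complete food.</p>
</div>
</a>
</li>
</ul>
"""
doc = etree.fromstring(snippet)

# XPath 1.0 returns every text node that is not inside the tooltip div.
text_nodes = doc.xpath(
    "//ul[@class='ingredients-list__group']"
    "//text()[not(ancestor::div[@class='tooltip'])]"
)

# Drop whitespace-only nodes and join the rest in the host language.
ingredient = " ".join(t.strip() for t in text_nodes if t.strip())
print(ingredient)
```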

Xpath - matching based on node() contains() content

I have the following HTML structure (there are many blocks using the same architecture):
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
Now, I want to get the content of <b> using Xpath which I tried like so:
//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]
which doesn't match anything. How can I match this using the node()[1] text as anchor?
Regarding the XPath you tried: node()[1] is already a text node, and text() looks for text node children, which a text node doesn't have. Simply use . instead:
//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]
For the ultimate goal, I'd suggest this XPath :
//span[@id="mySpan"]/i[contains(.,"Price")]/b
or if you want specifically to match against the first node within <i> :
//span[@id="mySpan"]/i[contains(node(),"Price")]/b
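The three variants can be compared side by side. This is a sketch assuming the third-party lxml library is installed; collapsing the whitespace in the final price string is an extra step of mine, not part of the answer.

```python
from lxml import etree

snippet = """
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
"""
doc = etree.fromstring(snippet)

# text() asks a text node for its text children; it has none, so no match.
original = doc.xpath('//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]')

# "." is the string value of the context node itself.
fixed = doc.xpath('//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]')

# Anchor on <i> and return its <b> child.
bold = doc.xpath('//span[@id="mySpan"]/i[contains(.,"Price")]/b')
price = " ".join("".join(bold[0].itertext()).split())

print(original, [t.strip() for t in fixed], price)
```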

xpath getting the name in a certain pattern

I want to get a class name like the following:
class="hostHostGrid0_body"
The integer in between hostHostGrid and _body can change, but everything else I want it just like that in the order.
How can I achieve this?
In XPath 1.0 you can use this:
//*[starts-with(@class,'hostHostGrid') and substring-after(@class,'_') = 'body']
to select any element whose class attribute holds just that one class. It will match tags in any context. It will match all three elements below:
<div class="hostHostGrid0_body">
<span class="hostHostGrid123_body"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
Limitations: it doesn't restrict what is between them to a number. It can be anything, including spaces (ex: it will also match this: class="hostHostGrid xyz abc_body")
This one allows for the class occurring among other classes:
//*[contains(substring-before(@class,'_body'),'hostHostGrid')]
It will match:
<div class="other-class hostHostGrid0_body">
<span class="hostHostGrid123_body other-class"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
(it also has the same limitations - will match anything between 'hostHostGrid' and '_body')
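XPath 1.0 has no regular-expression functions, so one way around this limitation is to select candidates loosely with XPath and apply the strict check in the host language. A minimal sketch in Python using only the standard library; the helper name is hypothetical, and it assumes at least one digit sits between 'hostHostGrid' and '_body'.

```python
import re

# One class token: "hostHostGrid", one or more digits, then "_body".
GRID_BODY = re.compile(r"hostHostGrid\d+_body")

def has_grid_body_class(class_attr: str) -> bool:
    """Return True if any whitespace-separated class token fits the pattern."""
    return any(GRID_BODY.fullmatch(token) for token in class_attr.split())

for value in (
    "hostHostGrid0_body",
    "other-class hostHostGrid123_body",
    "hostHostGrid xyz abc_body",
):
    print(value, has_grid_body_class(value))
```

The last value is the false positive mentioned above; the token-level check rejects it.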

xpath trying to select content inside a div except one, with text included

I'm trying to select the content inside a div. This div has some text inside and some additional tags. I don't want to select the first div inside. I was trying with this selector, but it only gives me the tags, without the text:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/*[not(@class='subHeadline')]
the div that is giving me problems is this one:
<div class="viewHalfWidthSize">
.......
</div>
<div class="viewHalfWidthSize">
<div class="subHeadline firefinder-match">The Fine Print</div> <----------Except this div I want everything inside of this div!!
<strong class="firefinder-match">Validity: </strong>
Expires 27 June 2013.
<br class="firefinder-match">
<strong class="firefinder-match">Purchase: </strong>
Limit 1 per 2 people. May buy multiple as gifts.
<br class="firefinder-match">
<strong class="firefinder-match">Redemption: </strong>
Booking required online at
<a target="_blank" href="http://grouponbookings.co.uk/lautre-pied-march/" class="firefinder-match">http://grouponbookings.co.uk/lautre-pied-march/</a>
. 48-hour cancellation policy; late cancellation incurs a £30 surcharge per person.
<br class="firefinder-match">
<strong class="firefinder-match">Further information: </strong>
Valid Mon-Sun midday-2.45pm; Mon-Wed 6pm-10.45pm. Must be 18 or older, ID may be requested. Valid only on set tasting menu only; menu is dependent on market changes and seasonality and is subject to change. Max. two hours seating time. Discretionary service charge will be added to the bill based on original price. Original value verified 19 March 2013 at 9.01am.
<br class="firefinder-match">
<a target="_blank" href="http://www.groupon.co.uk/universal-fine-print" style="color: #339933;" class="firefinder-match">See the rules</a>
that apply to all deals.
</div>
The * matches element nodes and not text nodes. Try replacing * with node() to select all node types.
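A quick way to see the difference is to run both node tests against a cut-down fragment. This is a sketch assuming the third-party lxml library is installed; the markup is a simplified stand-in with the class values shortened so that the exact-match predicate applies.

```python
from lxml import etree

# Simplified stand-in for the deal markup; class values are shortened.
doc = etree.fromstring(
    '<div class="viewHalfWidthSize">'
    '<div class="subHeadline">The Fine Print</div>'
    '<strong>Validity: </strong>Expires 27 June 2013.<br/>'
    '</div>'
)

# "*" keeps element children only, so the loose text is lost.
elements = doc.xpath("/div/*[not(@class='subHeadline')]")

# node() keeps elements and text nodes alike.
nodes = doc.xpath("/div/node()[not(@class='subHeadline')]")

element_tags = [e.tag for e in elements]
node_summary = [n if isinstance(n, str) else n.tag for n in nodes]
print(element_tags)
print(node_summary)
```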
To break down what your XPath is doing:
You are looking anywhere in the document (//) for a div with class 'contentDealDescriptionFacts cf'.
Then you are looking for the 2nd div under that which also has the class viewHalfWidthSize. Note that this is not the 2nd div that has the class, but the div that is 2nd AND has that class. If the divs with that class are the 3rd and 4th, it won't match anything, because the 2nd div with the class has position() = 4. If you want the 2nd viewHalfWidthSize div, use [@class='viewHalfWidthSize'][position()=2].
Finally, you are returning a nodelist of all elements without the class subHeadline. If you change the * to node() then you will get a nodelist of all nodes.
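The position point in the second step is easy to verify. A minimal sketch, assuming the third-party lxml library is installed; the four-div layout and the root element are hypothetical and chosen so that the classed divs are the 3rd and 4th.

```python
from lxml import etree

# Hypothetical layout: the two classed divs are the 3rd and 4th div children.
doc = etree.fromstring("""
<root>
  <div>a</div>
  <div>b</div>
  <div class="viewHalfWidthSize">c</div>
  <div class="viewHalfWidthSize">d</div>
</root>
""")

# Position is counted among all div children, and the 2nd div has no class.
combined = doc.xpath("/root/div[@class='viewHalfWidthSize' and position()=2]")

# The second predicate counts only the divs that passed the first one.
chained = doc.xpath("/root/div[@class='viewHalfWidthSize'][position()=2]")
chained_text = [d.text for d in chained]

print(len(combined), chained_text)
```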
The following XPath:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[not(name(.)='div' and position() = 1)]
should return what you want as long as the first child node is the div you want to ignore.
If you change it to:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[position() != count(../div[1]/preceding-sibling::node()) + 1]
then it should work regardless. It returns your nodelist, then works out how many preceding nodes there are before the first div, and checks the position isn't one greater than that (i.e. position of first div) and excludes that from the list.
As yet another alternative you could just modify your original solution, but instead of doing not(@class='subHeadline') you should do
not(contains(concat(' ', @class, ' '), ' subHeadline '))
which will check if the class attribute contains subHeadline anywhere in the string on the assumption that your classes are space separated. This would then match your fragment which has the class "subHeadline firefinder-match"
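The logic of the padded contains() test can be mirrored in plain Python to show why it behaves differently from an exact comparison. This is only an illustration of the string logic, not XPath itself; both helper names are hypothetical.

```python
def exact_match(class_attr: str, name: str) -> bool:
    # Mirrors @class='name': the whole attribute must equal the name.
    return class_attr == name

def token_match(class_attr: str, name: str) -> bool:
    # Mirrors contains(concat(' ', @class, ' '), ' name ').
    return f" {name} " in f" {class_attr} "

attr = "subHeadline firefinder-match"
print(exact_match(attr, "subHeadline"))                # exact comparison fails
print(token_match(attr, "subHeadline"))                # padded test succeeds
print(token_match("subHeadlineExtra", "subHeadline"))  # no partial-name match
```

Padding both sides with spaces is what keeps a longer class name such as the made-up 'subHeadlineExtra' from matching.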
