Xpath extraction between two nodes - xpath

I have several web pages that have the following structure
<div id="w_33086" class = "eight columns">
<h2 id="about">About<span itemprop="name">Name of Place</span></h2>
<p style="margin-top: 0px;">
.
.
.
<p class="contactAdvisor"><a href="http://www.example.com/contact">
The text and markup between the two paragraphs is quite variable, as it is created by individual users. In some cases it includes markup and in some cases it does not. When it does include markup, the markup can vary considerably.
I'm trying to select all of the text and markup between these two <p> elements but have not been successful.
The best result I've achieved comes from //div[@id='w_33086']/node()
However, that is dropping the <p> tags when those are present. It also picks up the <h2> tag and the <p class="contactAdvisor"> that I would rather exclude.
I'm using Google Sheets (and/or Screaming Frog) to apply the XPath.

If the two <p> elements are siblings of each other, then you could try
//div[@id='w_33086']/p[@style="margin-top: 0px;"]/
following-sibling::node()[following-sibling::p[@class="contactAdvisor"]]
This is assuming that there is only one <p> under <div id="w_33086"> that has the style or class attribute value (respectively) that we're using to identify them.
Please note that XPath does not select any "markup". It selects nodes such as text nodes, elements (which have descendants attached), and attributes. Those nodes can be serialized as markup, but that's not XPath's business.
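As a rough illustration of the "between two siblings" selection and of serializing the selected nodes afterwards, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree, which does not support the following-sibling axis, so the selection is emulated by slicing the parent's child list. The sample document and the helper name elements_between are assumptions made up for this example, and only element nodes are returned (ElementTree keeps bare text between elements in each element's .tail rather than as separate nodes).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure in the question.
DOC = """<div id="w_33086" class="eight columns">
<h2 id="about">About<span itemprop="name">Name of Place</span></h2>
<p style="margin-top: 0px;">Intro</p>
<p>User text with <b>markup</b>.</p>
<ul><li>item</li></ul>
<p class="contactAdvisor"><a href="http://www.example.com/contact">Contact</a></p>
</div>"""


def elements_between(parent, start, end):
    """Return the child elements of parent strictly between start and end."""
    children = list(parent)
    return children[children.index(start) + 1:children.index(end)]


root = ET.fromstring(DOC)
start = root.find("p[@style='margin-top: 0px;']")
end = root.find("p[@class='contactAdvisor']")

between = elements_between(root, start, end)

# Serialization is a separate step from selection: the nodes become markup here.
serialized = [ET.tostring(e, encoding="unicode").strip() for e in between]
for markup in serialized:
    print(markup)
```

A tool with full XPath 1.0 support can evaluate the following-sibling expression directly; the slicing above is only a stand-in for it.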

Related

Detect first non-empty element

After reading the most relevant Xpath questions about detecting empty nodes, I still can not find the first non-empty element. The dataset looks like:
<div>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>   </elem>
</p>
<p>
<elem>Application</elem>
</p>
<p>
<elem>Other text that should not be detected.</elem>
</p>
<p>
<elem> </elem>
</p>
<p>
<elem>Second application</elem>
</p>
</div>
Basically the empty elements should not be taken into account, and we only want to detect the first Application element. We've been testing a lot with normalize-space and related functions but cannot get this working.
The main problem is the empty elements. The check we have right now solves the positioning flawlessly, but fails once the HTML contains such elements:
/div/p[position() < 3]//*[normalize-space()='Application']
So, how can we ignore empty elements? Is this only possible via an additional step in between?
In my definition an empty element does not have any child nodes, so //*[not(node())] would select all empty elements by that definition. If you want to allow certain text content then you could check normalize-space after removing it, e.g. //*[not(*) and not(normalize-space(translate(., '&#160;', '')))]. Basically you need to list, in the second argument of the translate call, all characters that you want to remove before checking with normalize-space. The XPath expression I have written would work inside XSLT, where the numeric character reference is parsed by an XML parser; in general, how to escape characters depends on the host language you use XPath with.
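The same "strip the unwanted characters, then test what is left" idea can be sketched in Python. This is only an emulation, assuming the blank elements hold a non-breaking space (U+00A0) or ordinary whitespace; the standard library's ElementTree has no normalize-space or translate, so the check is written by hand in a helper named is_empty, which is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumed sample: some <elem> hold a non-breaking space, some plain spaces.
XML = """<div>
<p><elem>\u00a0</elem></p>
<p><elem>   </elem></p>
<p><elem>Application</elem></p>
<p><elem>Other text that should not be detected.</elem></p>
<p><elem>\u00a0</elem></p>
<p><elem>Second application</elem></p>
</div>"""


def is_empty(element):
    """Mimic not(*) and not(normalize-space(translate(., NBSP, '')))."""
    if len(element) > 0:  # has child elements, so not(*) is false
        return False
    text = "".join(element.itertext())
    # Remove the non-breaking space first, then trim XML whitespace only.
    return text.replace("\u00a0", "").strip(" \t\r\n") == ""


root = ET.fromstring(XML)
non_empty = [e for e in root.iter("elem") if not is_empty(e)]
first = non_empty[0].text
print(first)
```

The explicit replace call matters when mirroring XPath, because normalize-space trims only space, tab, carriage return and line feed, not U+00A0.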

XPath expression -hierarchy

<div class="summary-item">
<label >Price</label>
<div class="value">
0.99 GBP
</div>
</div>
<div class="summary-item">
<label >Other info</label>
<div class="value">
All languages
</div>
</div>
I am trying to get the "0.99 GBP" using an XPath expression, so far I have reached the label using this (note there is another class by the name summary-item, therefore I need to uniquely identify with the label name Price)
sel.xpath('//*/div[@class="summary-item"]/label[text()="Price"]').extract()
However, I am unable to get to the class, I tried using following-sibling, but I did not succeed, any help will be appreciated.
The existence of child nodes can be part of the predicate. Put the test for label into a predicate for the parent, either as a separate predicate (adding the target node as well):
//div[@class="summary-item"][label[text()="Price"]]/div[@class="value"]
or joined with and:
//div[@class="summary-item" and label[text()="Price"]]/div[@class="value"]
(Note you don’t need //*/div at the start.)
You could use following-sibling if you wanted, it would look like this:
//div[@class="summary-item"]/label[text()="Price"]/following-sibling::div[@class="value"]
(here the label div isn’t part of the predicate).
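For a quick check of the "predicate on the parent" form outside Scrapy, a minimal sketch with Python's standard library follows. ElementTree supports only a subset of XPath, so the nested label[text()="Price"] test is written in its shorthand [label='Price'] (a child named label whose text equals Price), and the two predicates are chained rather than joined with and. The wrapping body element is an assumption added to make the snippet a single well-formed document.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div class="summary-item">
<label >Price</label>
<div class="value">
0.99 GBP
</div>
</div>
<div class="summary-item">
<label >Other info</label>
<div class="value">
All languages
</div>
</div>
</body>"""

root = ET.fromstring(HTML)

# Pick the summary-item whose label is "Price", then step down to its value div.
value_div = root.find(
    ".//div[@class='summary-item'][label='Price']/div[@class='value']"
)
price = value_div.text.strip()
print(price)
```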
One more thing to be aware of: using XPath to select HTML classes doesn’t work the same as using CSS. XPath will only match the exact attribute string, whereas CSS matches even if the element is in more than one class. In this case it works out okay, but you should watch out for it. Search StackOverflow if it will be an issue; there are a few answers describing it.
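The difference between exact-string matching and class-token matching can be seen in a small Python sketch. The second div with two classes is an assumption invented for the demonstration, as is the helper name has_class. In full XPath 1.0 the usual workaround is a test such as contains(concat(' ', normalize-space(@class), ' '), ' value '), which the standard library's ElementTree cannot evaluate, so the token check is done in Python here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<body>"
    '<div class="value">0.99 GBP</div>'
    '<div class="value highlight">1.49 GBP</div>'
    "</body>"
)

# Exact attribute comparison, like @class="value" in XPath.
exact = [d.text for d in root.findall(".//div[@class='value']")]


def has_class(element, name):
    """CSS-like test: is name one of the whitespace-separated class tokens?"""
    return name in element.get("class", "").split()


by_token = [d.text for d in root.iter("div") if has_class(d, "value")]

print(exact)
print(by_token)
```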

Microdata markup with properties on multiple pages

I'm creating a web page and currently I'm adding Microdata markup to the code. I’m using schema.org’s MusicGroup.
I have an index.html page from where I'd like to take the name and the image properties for this band:
<div class="container" itemscope itemtype="http://schema.org/MusicGroup">
...
<img itemprop="image" src="img/logo.png" alt="logo" />
<p>We are <span itemprop="name">NAME OF THE BAND</span>.</p>
...
</div>
However on the about_us.html page there is a short description which I'd also like to use:
<div class="container" itemscope itemtype="http://schema.org/MusicGroup">
...
<p itemprop="description">A description of the band.</p>
...
</div>
When I use the code like this, search engines (understandably) treat them as two different MusicGroups:
MusicGroup 1:
Image: .../img/logo.png
Name: NAME OF THE BAND
MusicGroup 2:
Description: A description of the band.
How can I link these properties into one item?
Microdata’s name-value pairs are per webpage, not per website.
So on a website about a music group, it can be expected that each page contains its own MusicGroup item, even though each of these items is actually about the same music group. But from the Microdata or schema.org perspective, these different items would not be semantically connected (consumers might guess the connection, however, e.g. by comparing property values).
Microdata’s itemid attribute could be used to uniquely identify each item. But it is required that the used vocabulary supports "global identifiers for items" (itemid is used for some types on schema.org (e.g., in the example for MedicalScholarlyArticle), but it’s not clear to me if it’s really supported as required by Microdata for other types, like MusicGroup).
So in your case, you could:
leave it as it is
duplicate the information, so that each item has all relevant content (possibly using meta/link elements)
move all information on one page (possibly using itemref)
(if it should be allowed for general use with schema.org) use itemid to state that several items are actually about the same thing

xpath accessing information in nodes

I need to scrape information from a website containing property details.
<div class="inner">
<div class="col">
<h2>House in Digana </h2>
<div class="meta">
<div class="date"></div>
<span class="category">Houses</span>,
<span class="location">Kandy</span>
</div>
</div>
<div class="attr polar">
<span class="data">Rs. 3,600,000</span>
</div>
What is the XPath notation for "Kandy" and "Rs. 3,600,000"?
It is not wise to address text nodes directly using text() because of nuances in an XML document.
Rather, addressing an element node directly returns the concatenation of all descendant text nodes as the element value, which is what people usually want (and think they are getting when they address text nodes).
The canonical example I use in the classroom is this example of OCR'ed content as XML:
<cost>39<!--that 9 may be an 8-->.22</cost>
The value of the element using the XPath address cost is "39.22", but in XSLT 1.0 the value of the XPath address cost/text() is "39", which is not complete. In XSLT 2.0 (which is how the question is tagged), you get two text nodes, "39" and ".22", which look correct if you concatenate them. But if you pass them to a function requiring a singleton argument, you will get a run-time error. When you address an element, the text returned is concatenated into a single string, which is suitable for a singleton argument.
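The split into two text nodes can be observed outside XSLT as well. The following is a small sketch using Python's xml.dom.minidom, chosen here only because it keeps comments in the tree; it is not an XSLT processor, so it shows the node structure rather than the XSLT behaviour itself. Joining the child text nodes stands in for the element's string value, which is sufficient for this flat example.

```python
from xml.dom import minidom

doc = minidom.parseString("<cost>39<!--that 9 may be an 8-->.22</cost>")
cost = doc.documentElement

# The comment splits the character data into two separate text nodes.
text_nodes = [n.data for n in cost.childNodes if n.nodeType == n.TEXT_NODE]

first_text = text_nodes[0]          # what taking only the first text node yields
element_value = "".join(text_nodes)  # concatenation, like the element's value

print(text_nodes)
print(first_text)
print(element_value)
```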
I tell students that in all of my professional work there are only very (very!) few times that I ever have to use text() in my stylesheets.
So //span[@class='location' or @class='data'] would find the two fields if those were the only such elements in the entire document. You may need to use ".//span" from a location inside of the document tree.
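A minimal Python sketch of the same selection follows, addressing the span elements and reading their whole text content rather than individual text nodes. The standard library's ElementTree does not support or inside a predicate, so the class test is done in Python; the closing tag for the outer div is an assumption added to make the fragment well-formed.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="inner">
<div class="col">
<h2>House in Digana </h2>
<div class="meta">
<div class="date"></div>
<span class="category">Houses</span>,
<span class="location">Kandy</span>
</div>
</div>
<div class="attr polar">
<span class="data">Rs. 3,600,000</span>
</div>
</div>"""

root = ET.fromstring(HTML)
wanted = {"location", "data"}

# Take each matching element's full text content (all descendant text joined).
values = [
    "".join(span.itertext()).strip()
    for span in root.iter("span")
    if span.get("class") in wanted
]
print(values)
```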

Google Spreadsheet importxml timestamp

I have been trying for over 2 hours to import the timestamps from a zap2it.com link into my Google spreadsheet.
Here is the link I am trying to importxml from.
http://affiliate.zap2it.com/tvlistings/ZCGrid.do?zipcode=78238&lineupId=DISH641:-
Here is what I am trying to import
Here is what I tried so far
=importxml("http://affiliate.zap2it.com/tvlistings/ZCGrid.do?aid=dish&pkg=8388608&fromProvider=true&zipcode=78238&x=52&y=18"&B1,"//body//div[3]/div/div/div[3]/div/div")
EDIT
I was able to improve and get better results
//body//div[3]/div/div/div[1]//*
but it shows timestamps from all over the page, which is not exactly what I need.
[The first complication is that the data stream returned from dereferencing that URI is not actually XML; it has several thousand well-formedness errors (unescaped ampersands in URIs, unescaped ampersands and less-than signs in scripts, some embedded HTML, some miscellaneous errors). Since you're not reporting problems from that, however, I'll assume that somewhere between the server and your XPath expression someone is doing some tidying.]
I think you'll get better results if you use the id and class attributes that are extensively used in the document. The material you want looks like this in the source (you can use any browser-based debugging tool to find it; I used the 'Web Inspector' in Safari); I have indented to make the structure more visible, and fixed some well-formedness errors in one of the a elements (missing whitespace between attribute-value pairs).
<div class="zc-tn" id="zc-tn-top">
<div class="zc-tn-i">
<a href="ZCGrid.do?fromTimeInMillis=1355781600000"
class="zc-tn-l"
title="Move the grid three hours earlier"></a>
<div class="zc-tn-c">
<span class="zc-tn-z"
title="Central Standard Time">CST</span>
<div class="zc-tn-t">7:00 PM</div>
<div class="zc-tn-t">7:30 PM</div>
<div class="zc-tn-t">8:00 PM</div>
<div class="zc-tn-t">8:30 PM</div>
<div class="zc-tn-t">9:00 PM</div>
<div class="zc-tn-t">9:30 PM</div>
</div>
<a href="ZCGrid.do?fromTimeInMillis=1355803200000"
class="zc-tn-r"
title="Advance the grid three hours"></a>
</div>
</div>
A simple search verifies that the value zc-tn-top is indeed unique as an ID value in the document. Given that, a simple XPath expression to retrieve all the elements whose display is circled in your image is (assuming xhtml is bound to the XHTML namespace):
//xhtml:div[@id='zc-tn-top']//xhtml:div[@class='zc-tn-t']
It looks from your question as if your XPath evaluator is namespace-challenged or namespace-oblivious, so you may need to write this as
//div[@id='zc-tn-top']//div[@class='zc-tn-t']
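To try the unprefixed expression away from the spreadsheet, here is a small Python sketch run against a trimmed copy of the fragment above. It assumes the markup has already been tidied into well-formed XML with no namespace, and the wrapping body element is added only so that the div with the id is a descendant of the root. ElementTree's XPath subset is enough for this particular expression.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the time-navigation fragment (links omitted).
root = ET.fromstring("""<body>
<div class="zc-tn" id="zc-tn-top">
<div class="zc-tn-i">
<div class="zc-tn-c">
<span class="zc-tn-z" title="Central Standard Time">CST</span>
<div class="zc-tn-t">7:00 PM</div>
<div class="zc-tn-t">7:30 PM</div>
<div class="zc-tn-t">8:00 PM</div>
</div>
</div>
</div>
</body>""")

# Anchor on the unique id first, then pick the time cells by class.
cells = root.findall(".//div[@id='zc-tn-top']//div[@class='zc-tn-t']")
times = [cell.text for cell in cells]
print(times)
```

Anchoring on the id keeps the match inside the one navigation bar, which avoids picking up timestamps from elsewhere on the page.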
