Is it possible to run two completely different XPath expressions together?

Can anyone please help me here?
I want to run two XPath expressions together and store the result, but I am not sure if it is possible.
One XPath expression fetches the city and the second fetches the state:
//div[(text()='city')]/following-sibling::div
//div[contains(text(),'state')]/following-sibling::div
As the XPath expressions indicate, the names of the city and the state are in the div that follows the "City" and "state" labels. I want to run both and capture the output as a string.
On a side note: both XPath expressions work fine for me.
<div>
<div>City</div>
<div>London</div>
</div>
<!-- In between: some other elements such as p, section, and other divs -->
<div>
<div>state</div>
<div>England</div>
</div>

It sounds like you want to convert the results of the two XPath expressions to strings, and concatenate those strings. The expression below concatenates them (with a single space between) using the XPath concat function.
concat(
//div[(text()='city')]/following-sibling::div,
' ',
//div[contains(text(),'state')]/following-sibling::div
)
One other thing: note that in your example XML the text of the first div is "City" rather than "city". Make sure the strings in your XPath expression match the text exactly, because the expression 'City'='city' evaluates to false.
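To try the concat() expression outside a browser, here is a minimal sketch using Python's lxml library. That library choice is an assumption, since the question does not say which tool runs the XPath. The wrapping root element and the p element are invented so the sample parses as XML, and the first literal is written as 'City' to match the sample markup.

```python
from lxml import etree

# Sample markup from the question, wrapped in an invented <root> element
# so that it is well-formed XML.
doc = etree.fromstring("""
<root>
  <div>
    <div>City</div>
    <div>London</div>
  </div>
  <p>other content</p>
  <div>
    <div>state</div>
    <div>England</div>
  </div>
</root>
""")

# concat() turns each node-set argument into the string-value of its
# first node, then joins the pieces into one string.
expr = (
    "concat("
    "//div[text()='City']/following-sibling::div, "
    "' ', "
    "//div[contains(text(),'state')]/following-sibling::div"
    ")"
)
result = doc.xpath(expr)
print(result)
```

Because the expression returns a string rather than nodes, the value can be stored directly without any further extraction step.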


Xpath to strip text using substring-after

I have the following which is the second span in html with the class of 'ProductListOurRef':
<span class="ProductListOurRef">Product Code: 60076</span>
I've tried the following XPath:
(//span[@class="ProductListOurRef"])[2]
But that returns 'Product Code: 60076'. I need XPath to strip the 'Product Code: ' prefix and give me just '60076'.
I believe 'substring-after' should do it, but I don't know how to write it.
If you are using XPath 1.0, then the result of an XPath expression must be either a node-set, a single string, a single number, or a single boolean.
As shown in comments on the question, you can write a query using substring-after(), whose result is a string.
However, some applications expect the result of an XPath expression always to be a node-set, and it looks as if you are stuck with such an application. Because you can't construct new nodes in XPath (you can only select nodes that are already present in the input), there is no way around this.
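As a rough illustration of both routes, the sketch below uses Python's lxml (an assumption, since the question names no tool). The first span, with code 60075, is invented so that the question's span is the second one with that class. The first query returns a string via substring-after(); the second selects the node and strips the prefix in the host language, which is the usual fallback when a tool only accepts node-set results.

```python
from lxml import etree

# Hypothetical markup: the first span is invented for illustration.
doc = etree.fromstring(
    '<div>'
    '<span class="ProductListOurRef">Product Code: 60075</span>'
    '<span class="ProductListOurRef">Product Code: 60076</span>'
    '</div>'
)

# Route 1: a string-valued XPath 1.0 expression.
code = doc.xpath(
    "substring-after((//span[@class='ProductListOurRef'])[2], 'Product Code: ')"
)

# Route 2: select the node, then remove the prefix outside XPath.
node = doc.xpath("(//span[@class='ProductListOurRef'])[2]")[0]
code_from_node = node.text.split("Product Code: ", 1)[1]

print(code, code_from_node)
```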

How are these two XPath expressions different?

I'm parsing a website using XPath. I've got two queries, one which finds the node I'm looking for:
//td[.//text()[contains(., "Date Filed:")]]
And one which doesn't:
//td[contains(.//text(), "Date Filed:")]
I don't understand how these are different. I'd read them both to mean, "Find td nodes which have a descendant text node containing Date Filed."
Can anybody explain how these are different?
Here's the HTML, though I don't think it's relevant to the question:
<td width="40%" valign="top">
<br><br><br><br><br>
<b>Date Filed:</b> 11/13/2008<br>
<b>Jury Demand: </b> No<br><br>
<br><b>Date Terminated: </b><br>
<br><b>Date Reopened: </b><br>
<br><b>Does this action raise an issue of constitutionality?: </b>Y<br>
</td>
(Don't look at me that way. I didn't make the website, the U.S. Gov't did.)
That is how string conversion works in XPath:
In the second query, contains(.//text(), "Date Filed:"), you call the contains function. It accepts two arguments of type string, but your first argument, .//text(), is a node-set, so the string function is called internally to convert the node-set to a string. In this case string(.//text()) returns the string-value of only the first text node. If you change your second query to //td[contains(., "Date Filed:")], it will select the wanted td.
In XPath 1.0, if you supply a node-set to a function like contains() that expects a string, it uses the string-value of the first node in the node-set (in document order).
In XPath 2.0 and later versions, if you supply a node-set to a function like contains() that expects a string, the node-set is atomized, and if the result contains more than one string (which will normally be the case when more than one node is selected), then you get a type error XPTY0004.
When you ask questions about XSLT or XPath on StackOverflow, please always say which version you are talking about.
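The XPath 1.0 behaviour described above can be checked with a short sketch in Python's lxml, which evaluates XPath 1.0. The markup is a trimmed, XML-safe version of the question's table cell (br written as a self-closing tag), and the table and tr wrappers are invented for the example.

```python
from lxml import etree

# The td starts with a whitespace-only text node, as in the question's HTML.
doc = etree.fromstring(
    "<table><tr><td>\n<br/><b>Date Filed:</b> 11/13/2008<br/>\n</td></tr></table>"
)

# Tests every descendant text node individually.
per_node = doc.xpath('//td[.//text()[contains(., "Date Filed:")]]')

# Converts the node-set to a string first, so only the first text node
# (the leading newline) is tested.
first_only = doc.xpath('//td[contains(.//text(), "Date Filed:")]')

# Tests the string-value of the whole td.
whole_td = doc.xpath('//td[contains(., "Date Filed:")]')

print(len(per_node), len(first_only), len(whole_td))
```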

How do I use an AND statement in XPATH?

I have this query: //*[@id="test"]/div/[not(contains(.,'/explore'))]
I want to add a second 'not contains' condition to this:
//*[@id="test"]/div/[not(contains(.,'/locations'))]
And maybe even a third one. Does anyone know how to do this?
Neither of the expressions you posted is valid XPath. If you meant to filter the div elements so that only those that don't contain a certain string, say "/explore", are returned, you can do it this way instead:
//*[@id="test"]/div[not(contains(.,'/explore'))]
And here is another XPath example that checks that the div contains neither of two strings, "/explore" and "/locations":
//*[@id="test"]/div[not(contains(.,'/explore')) and not(contains(.,'/locations'))]
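Here is a small sketch of the combined predicate, run with Python's lxml (an assumed tool). The markup and the div texts are invented for illustration. A third condition would be appended with another "and not(contains(...))" in the same predicate.

```python
from lxml import etree

# Hypothetical markup: three divs under the element with id="test".
doc = etree.fromstring(
    '<body><section id="test">'
    '<div>/explore/cities</div>'
    '<div>/locations/london</div>'
    '<div>/about</div>'
    '</section></body>'
)

# Keep only divs whose string-value contains neither substring.
kept = doc.xpath(
    "//*[@id='test']/div"
    "[not(contains(.,'/explore')) and not(contains(.,'/locations'))]"
)
kept_texts = [d.text for d in kept]
print(kept_texts)
```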

Xpath: why normalize-space could not remove the empty space and \n?

For the following code:
<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>
I used this XPath in Scrapy:
response.xpath('.//a[@class="title"]//text()[normalize-space()]').extract()
I got the following result:
u'\n \n Low price ', u'computer', u' you should not miss'
Why were the two \n and the many spaces before 'Low price' not removed by normalize-space() in this example?
Another question: how can I combine the three parts into one scraped item, u'Low price computer you should not miss'?
Please try this:
'normalize-space(.//a[@class="title"])'
I already had the same problem, try this:
[item.strip() for item in response.xpath('.//a[@class="title"]//text()').extract()]
Your call to normalize-space() is in a predicate. That means you are selecting text nodes where (the effective boolean value of) normalize-space() is true. You aren't selecting the result of normalize-space: for that you would want
.//a[@class="title"]//text()/normalize-space()
(which needs XPath 2.0)
The second part of your question: just use
string(.//a[@class="title"])
(assuming scrapy-spider allows you to use an XPath expression that returns a string, rather than one that returns nodes).
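The difference between using normalize-space() as a predicate and as the outer function can be seen in the sketch below. It uses Python's lxml directly instead of a Scrapy response (an assumption made to keep the example self-contained), and the wrapping div is invented.

```python
from lxml import etree

# The question's markup, wrapped in an invented <div>.
doc = etree.fromstring("""<div><a class="title" href="the link">
    Low price
    <strong>computer</strong>
    you should not miss
</a></div>""")

# As a predicate, normalize-space() only filters out text nodes that are
# empty after normalization; the selected nodes come back unchanged.
raw_nodes = doc.xpath('.//a[@class="title"]//text()[normalize-space()]')

# As the outer function, it normalizes the string-value of the whole <a>.
combined = doc.xpath('normalize-space(.//a[@class="title"])')

print(raw_nodes)
print(combined)
```

The second expression also answers the follow-up question, since the string-value of the a element already includes the text of its strong child.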

Is it safe to concatenate two XPath 1.0 queries?

If I have two XPath queries where the second one is meant to further drill down into the result of the first, can I safely let my script combine them into a single query by...
placing parentheses around the first query,
prefixing the second query with a slash, and then
simply concatenating the two strings?
Context
The concrete use case that sparked this question involves extracting information from XML/XHTML documents according to externally supplied pairs of "CSS selector + attribute name", using XPath behind the scenes.
For example the script may get the following as input:
selector: a#home, a.chapter
attribute: href
It then compiles the selector to an XPath query using the HTML::Selector::XPath Perl module, and the attribute by simply prefixing a @ ... which in this case would yield:
XPath query 1: //a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]
XPath query 2: @href
And then it repeatedly passes those queries to libxml2's XPath engine to extract the requested information (in this example, a list of URLs) from the XML documents in question.
It works, but I would prefer to combine the two queries into a single one, which would simplify the code for invoking them and reduce the performance overhead:
XPath query: (//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')])/@href
(note the added parentheses and slash)
But is this safe to do programmatically, for arbitrary input queries?
In general, no, you can't concatenate two arbitrary XPath expressions in this way, especially not in XPath 1.0. It's easy to find counter-examples: in XPath 1.0 you can't even have a union expression on the RHS of '/', so concatenating "/a" and "(b|c)" would fail.
In XPath 2.0, the result will always be syntactically valid, but it may contain type errors, e.g. if the expressions are "count(a)" and "b". The LHS operand of "/" must evaluate to a sequence of nodes.
Sure, this should work. However, you will always have to respect the correct context. If the elements in your example in the first query have no href attribute, you will get an empty result set.
Also, you will have to take care of e.g. a leading slash in front of your second query, so that you don't end up with a descendant-or-self axis step, which might not be what you want. Apart from that, this should always work. The worst that can happen is that it is not logically correct (i.e. you don't get the expected result), but it should always be valid XPath.
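To make the two answers concrete, here is a minimal sketch of the combination step in Python with lxml, whose XPath engine is libxml2's XPath 1.0 implementation. The combine helper and the sample markup are hypothetical. The sketch runs the question's combined query and then the "/a" plus "(b|c)" style counter-example from the first answer, which the XPath 1.0 engine rejects.

```python
from lxml import etree

def combine(first: str, second: str) -> str:
    """Hypothetical helper: parenthesize the first query, append the second."""
    return f"({first})/{second}"

# Invented markup matching the selectors in the question.
doc = etree.fromstring(
    '<body>'
    '<a id="home" href="/">Home</a>'
    '<a class="chapter intro" href="/ch1">One</a>'
    '<a class="other" href="/x">X</a>'
    '</body>'
)

query1 = "//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]"
combined = combine(query1, "@href")
hrefs = doc.xpath(combined)

# Counter-example: a parenthesized union on the right-hand side of '/'
# is not allowed by the XPath 1.0 grammar.
rejected = False
try:
    doc.xpath(combine("/body", "(a|p)"))
except etree.XPathError:
    rejected = True

print(combined, hrefs, rejected)
```

So the string concatenation works for step-like second queries such as @href, but a script that accepts arbitrary second queries would need to handle evaluation errors like the one caught here.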
