How to scrape data using xpath contains? - xpath

How can I exclude elements from being scraped using contains with OR? The XPath I currently use is not working:
//div/li[contains(text(), 'Night') OR contains(text(), 'Big')

To complete @Sergii Dmytrenko's answer: also use a lowercase or operator, since XPath keywords are case sensitive.
//div/li[contains(text(), 'Night') or contains(text(), 'Big')]
The preceding XPath will output li elements containing the text "Night" or "Big" (case sensitive).
In order to exclude elements, you can use the not() function as previously described.
Side note: using != (not equal) with the and operator is also a way to exclude elements:
//div/li[text()!='Night' and text()!='Big']
This excludes elements whose text is exactly "Night" or "Big" (with no other text).
EDIT : Assuming you have :
<div>
<h2>Night of the living dead</h2>
<h2>Big fish</h2>
<h2>Save the last dance</h2>
<h2>Tomorrow never die</h2>
<h2>Australia nuclear war</h2>
</div>
To select elements which don't contain "Night","Big", or "Australia", you have two options :
Using or operators inside a not condition :
//div/h2[not(contains(text(),'Night') or contains(text(),'Big') or contains(text(),'Australia'))]
Using multiple not with and operators :
//div/h2[not(contains(text(),'Night')) and not(contains(text(),'Big')) and not(contains(text(),'Australia'))]
Output : 2 nodes :
Save the last dance
Tomorrow never die
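As a quick check of the two options above, here is a minimal Python sketch. It assumes the third-party lxml library is installed (the thread does not name a scraping tool), and the helper name titles is made up for illustration. It runs both expressions against the sample markup and collects the text of the surviving h2 elements.

```python
from lxml import etree  # third-party; assumed to be installed (pip install lxml)

# Sample markup copied from the answer above.
DOC = etree.fromstring(
    "<div>"
    "<h2>Night of the living dead</h2>"
    "<h2>Big fish</h2>"
    "<h2>Save the last dance</h2>"
    "<h2>Tomorrow never die</h2>"
    "<h2>Australia nuclear war</h2>"
    "</div>"
)

# Option 1: a single not() wrapped around or-ed conditions.
WITH_OR = (
    "//div/h2[not(contains(text(),'Night') or contains(text(),'Big')"
    " or contains(text(),'Australia'))]"
)

# Option 2: several not() calls joined with and.
WITH_AND = (
    "//div/h2[not(contains(text(),'Night')) and not(contains(text(),'Big'))"
    " and not(contains(text(),'Australia'))]"
)


def titles(xpath):
    """Hypothetical helper: text of every element matched by xpath."""
    return [node.text for node in DOC.xpath(xpath)]


kept_with_or = titles(WITH_OR)
kept_with_and = titles(WITH_AND)
print(kept_with_or)
print(kept_with_and)
```

Both lists should be identical, since the two forms are logically equivalent (De Morgan's laws).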

Your XPath expression (once the typos are corrected: li[contains(text(), 'Night') or contains(text(), 'Big')]) will return li elements having the text "Night" or "Big".
To exclude these, the correct expression should be
//div/li[not(contains(text(), 'Night') or contains(text(), 'Big'))]
or you may try
//div/li[not(contains(text(), 'Night')) and not(contains(text(), 'Big'))]

Your XPath should end with ']'; as written it is invalid.
If you would like to exclude 'Night' and 'Big' you may try this:
//div/li[not(contains(text(), 'Night') or contains(text(), 'Big'))]

Related

XPath without specifying the tag? [duplicate]

Given this XML, what XPath returns all elements whose prop attribute contains Foo (the first three nodes):
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
//a[contains(@prop,'Foo')]
Works if I use this XML to get results back.
<bla>
<a prop="Foo1">a</a>
<a prop="Foo2">b</a>
<a prop="3Foo">c</a>
<a prop="Bar">a</a>
</bla>
Edit:
Another thing to note: while the XPath above returns the correct answer for that particular XML, if you want to guarantee that you only get the "a" elements inside the "bla" element, you should, as others have mentioned, also use
/bla/a[contains(@prop,'Foo')]
This one searches all "a" elements in your entire XML document, regardless of whether they are nested in a "bla" element:
//a[contains(@prop,'Foo')]
I added this for the sake of thoroughness and in the spirit of stackoverflow. :)
This XPath will give you all nodes that have attributes containing 'Foo' regardless of node name or attribute name:
//attribute::*[contains(., 'Foo')]/..
Of course, if you're more interested in the contents of the attribute themselves, and not necessarily their parent node, just drop the /..
//attribute::*[contains(., 'Foo')]
descendant-or-self::*[contains(@prop,'Foo')]
Or:
/bla/a[contains(@prop,'Foo')]
Or:
/bla/a[position() <= 3]
Dissected:
descendant-or-self::
The Axis - search through every node underneath and the node itself. It is often better to say this than //. I have encountered some implementations where // means anywhere (descendant-or-self of the root node). Others use the default axis.
* or /bla/a
The Tag - a wildcard match, and /bla/a is an absolute path.
[contains(@prop,'Foo')] or [position() <= 3]
The condition within [ ]. @prop is shorthand for attribute::prop, as attribute is another search axis. Alternatively you can select the first 3 by using the position() function.
Have you tried something like:
//a[contains(@prop, "Foo")]
I've never used the contains function before but suspect that it should work as advertised...
John C is the closest, but XPath is case sensitive, so the correct XPath would be:
/bla/a[contains(@prop, 'Foo')]
If you also need to match the content of the link itself, use text():
//a[contains(@href,"/some_link")][text()="Click here"]
/bla/a[contains(@prop, "foo")]
try this:
//a[contains(@prop,'foo')]
that should work for any "a" tags in the document
For the code above...
//*[contains(@prop,'foo')]
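To compare the variants suggested in this thread, here is a small Python sketch, again assuming the third-party lxml library (not mentioned by any answer). The helper name props is hypothetical. It uses the second sample document from above and shows why the case of 'Foo' matters.

```python
from lxml import etree  # third-party; assumed to be installed

# Second sample document from the thread.
DOC = etree.fromstring(
    "<bla>"
    '<a prop="Foo1">a</a>'
    '<a prop="Foo2">b</a>'
    '<a prop="3Foo">c</a>'
    '<a prop="Bar">a</a>'
    "</bla>"
)


def props(xpath):
    """Hypothetical helper: prop attribute of every matched element."""
    return [el.get("prop") for el in DOC.xpath(xpath)]


# contains() is case sensitive: 'Foo' matches, 'foo' does not.
exact_case = props("/bla/a[contains(@prop, 'Foo')]")
lower_case = props("/bla/a[contains(@prop, 'foo')]")

# Wildcard element name: any element whose prop attribute contains 'Foo'.
any_tag = props("//*[contains(@prop, 'Foo')]")

# Selecting the attributes themselves; lxml returns them as strings.
attr_values = DOC.xpath("//attribute::*[contains(., 'Foo')]")

print(exact_case, lower_case, any_tag, attr_values)
```

On this input, the lowercase 'foo' variants match nothing, which is the point made in the case-sensitivity answer.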

Select distinct values with Xpath

I'm using this XPath query
//li[contains(@class, 'cmil_header')]/span[contains(@class, 'cmil_theatre')] and the result of this query is:
Park
Saga Tokey
Latvia
Latvia
Skande
Paramount
Paramount
Paramount
Oslo
Oslo
...
I have been searching and have come to the conclusion that there is an option to select unique or distinct node values/items with XPath, but I can't get it to work.
I have managed to select a specific item with //li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')] (Park in this case), and I thought //li[contains(@class, 'cmil_header')][distinct-values()]/span[contains(@class, 'cmil_theatre')] would work, but it does not.
My question:
How would my query be to reproduce:
Park
Saga Tokey
Latvia
Skande
Paramount
Oslo
...
Edit: pastebin with sample
http://pastebin.com/a3x7hRFu
XPath 1.0 solution (where there is no distinct-values function) that relies on the duplicates being sequential:
//li[contains(@class, 'cmil_header')]/span[contains(@class, 'cmil_theatre') and (not(../preceding-sibling::li[contains(@class, 'cmil_header')]) or ../preceding-sibling::li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')]/text() != ./text())]
find all li nodes that contain the cmil_header class: //li[contains(@class, 'cmil_header')]
find the child span nodes that contain the cmil_theatre class: /span[contains(@class, 'cmil_theatre') and
where there is no previous li node containing the cmil_header class: (not(../preceding-sibling::li[contains(@class, 'cmil_header')])
or the previous li node containing the cmil_header class has a span node child that contains the cmil_theatre class: or ../preceding-sibling::li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')]
and the text content of that span is not the same as the text content of... : /text() !=
...this span: ./text())]
i thought //li[contains(@class, 'cmil_header')][distinct-values()]/span[contains(@class, 'cmil_theatre')] would work, but not.
No, there is no way this could work. I find it hard to know what you were imagining. The most basic error is that distinct-values() expects an argument. More subtly, you really don't seem to have understood how predicates (expressions in square brackets) work.
What would work -- assuming your XPath processor supports XPath 2.0 -- is
distinct-values(//li[contains(@class, 'cmil_header')]/
span[contains(@class, 'cmil_theatre')])
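For processors limited to XPath 1.0, a rough Python sketch of both routes follows. It assumes the third-party lxml library, which implements XPath 1.0 only, so distinct-values() is unavailable there. The markup is a hypothetical stand-in for the pastebin sample (which is not reproduced here), with a non-adjacent repeat of "Park" added on purpose to show the limitation of the sequential approach.

```python
from lxml import etree  # third-party; assumed to be installed (XPath 1.0 only)

# Hypothetical stand-in for the real page: one header li per theatre name.
NAMES = ["Park", "Latvia", "Latvia", "Oslo", "Oslo", "Park"]
DOC = etree.fromstring(
    "<ul>"
    + "".join(
        f'<li class="cmil_header"><span class="cmil_theatre">{name}</span></li>'
        for name in NAMES
    )
    + "</ul>"
)

HEADER = "li[contains(@class, 'cmil_header')]"
THEATRE = "span[contains(@class, 'cmil_theatre')]"

# The plain query from the question.
ALL = f"//{HEADER}/{THEATRE}"

# The XPath 1.0 answer: keep a span only if the nearest preceding header
# is missing or carries a different theatre name.
SEQUENTIAL = (
    f"//{HEADER}/span[contains(@class, 'cmil_theatre') and ("
    f"not(../preceding-sibling::{HEADER}) or "
    f"../preceding-sibling::{HEADER}[1]/{THEATRE}/text() != ./text())]"
)

all_names = [span.text for span in DOC.xpath(ALL)]
sequential = [span.text for span in DOC.xpath(SEQUENTIAL)]

# Alternative: deduplicate in Python, preserving first-seen order.
distinct = list(dict.fromkeys(all_names))

print(sequential)
print(distinct)
```

The sequential expression only collapses adjacent repeats, so the trailing "Park" survives; deduplicating in the host language does not depend on ordering.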

Xpath expression returns null

I have plenty of links like this:
<b>Edit issue >></b>
To extract the href content, I use this XPath expression:
//a[contains(@href,'/edit_flat')]
but it returns null. What am I doing wrong?
//a[contains(@href,'/edit_flat')] selects a elements anywhere in the document tree that have an href attribute containing the '/edit_flat' string.
These matching elements do have this very "href" attribute, but the XPath expression you are using returns "only" the a elements, if there are any.
To actually return the matching elements' attribute's values, you need an extra step, with / and @href. So what you want is:
//a[contains(@href,'/edit_flat')]/@href
Suggestion:
What you probably want is to select links whose href begins with the substring "/edit_flat", so it's safer to use:
.//a[starts-with(@href,'/edit_flat')]/@href
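The difference between selecting elements and selecting attribute values can be seen in a short Python sketch. It assumes the third-party lxml library, and the link markup and href values are invented for illustration, since the original links are not shown in full.

```python
from lxml import etree  # third-party; assumed to be installed

# Hypothetical links; the real href values are not given in the question.
DOC = etree.fromstring(
    "<div>"
    '<a href="/edit_flat/1"><b>Edit issue &gt;&gt;</b></a>'
    '<a href="/view/edit_flat/2"><b>View</b></a>'
    '<a href="/other/3"><b>Other</b></a>'
    "</div>"
)

# Without the final step the result is a list of <a> elements.
elements = DOC.xpath("//a[contains(@href,'/edit_flat')]")

# With /@href the result is the attribute values themselves.
hrefs = DOC.xpath("//a[contains(@href,'/edit_flat')]/@href")

# starts-with() is stricter: the substring must be a prefix of the href.
prefixed = DOC.xpath(".//a[starts-with(@href,'/edit_flat')]/@href")

print([el.tag for el in elements])
print(hrefs)
print(prefixed)
```

Here contains() also picks up "/view/edit_flat/2", while starts-with() keeps only the link that begins with "/edit_flat".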

XPATH selections containing some strings but not others

I have the following XPATH that selects elements containing certain strings ("video" or "color" or "black and white"). The issue I am having is that one of the elements that is selected contains a string "video reprints" and although it's correct, I do not want this particular element selected. I thought I could specify NOT in the XPATH as in the following...
//div/A[contains(., 'video') or contains(., 'color') or contains(., 'black and white') and (not (contains(., 'reprint')))]
Any thoughts on how I can remove any selection that contains the string "reprints" from the selections above?
This is a precedence issue. Just wrap all the or-ed conditions into parentheses:
[( ... or ... or ...) and (not(...))]
Really it's just because of the way you have your parentheses, so this will work:
//div/A[(contains(., 'video') or contains(., 'color') or contains(., 'black and white')) and not(contains(., 'reprints'))]
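The precedence issue (and binds more tightly than or) can be reproduced with a small Python sketch. It assumes the third-party lxml library, and the sample link texts are made up for illustration.

```python
from lxml import etree  # third-party; assumed to be installed

# Hypothetical link texts standing in for the real page.
DOC = etree.fromstring(
    "<div>"
    "<A>video tutorials</A>"
    "<A>video reprints</A>"
    "<A>color slides</A>"
    "<A>black and white reprints</A>"
    "<A>audio</A>"
    "</div>"
)

# As asked: 'and' binds tighter than 'or', so not(...) only guards
# the 'black and white' test.
AS_ASKED = (
    "//div/A[contains(., 'video') or contains(., 'color') or "
    "contains(., 'black and white') and (not (contains(., 'reprint')))]"
)

# As answered: parentheses make not(...) apply to all three alternatives.
AS_ANSWERED = (
    "//div/A[(contains(., 'video') or contains(., 'color') or "
    "contains(., 'black and white')) and not(contains(., 'reprints'))]"
)

asked = [a.text for a in DOC.xpath(AS_ASKED)]
answered = [a.text for a in DOC.xpath(AS_ANSWERED)]

print(asked)
print(answered)
```

With the original grouping "video reprints" is still selected, because the 'video' branch never sees the not(...) condition.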

use YQL with substring-before in xpath

I am trying to get the string before '--' within a paragraph in an HTML page using XPath and send it to YQL.
For example, I want to get the date from the following article:
<div>
<p>Date --- the body of the article</p>
</div>
I tried this query in yql:
select * from html where url="article url" and xpath="//div/p/text()/[substring-before(.,'--')]"
But it does not work.
How can I get the date of the article, which is the text before the '--'?
You can simply use:
substring-before(//div/p,'--')
Use:
substring-before(/div/p/text(), '--')
This XPath expression evaluates to the string immediately preceding '--' in the first text node in the XML document, that is a child of a p that is a child of the div top element.
In case you want to get this value for every such text node, you have to use an expression like:
substring-before((//div/p/text())[$k], '--')
and evaluate this expression $N times, for $k = 1,2, ..., $N
where $N is count(//div/p/text())
Do note: avoid the // XPath pseudo-operator whenever the structure of the XML document is statically known. Using // usually results in major inefficiency (O(N^2)), which is especially painful on big XML documents.
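Outside YQL, the same expressions can be tried with a short Python sketch. It assumes the third-party lxml library, and the two paragraphs and their dates are invented sample data. The loop mirrors the "evaluate once per $k" suggestion above, written with position() = $k, and passes k as an XPath variable.

```python
from lxml import etree  # third-party; assumed to be installed

# Hypothetical articles; the dates are made up for illustration.
DOC = etree.fromstring(
    "<div>"
    "<p>2013-05-01 --- the body of the first article</p>"
    "<p>2013-05-02 --- the body of the second article</p>"
    "</div>"
)

# A node-set argument is converted to the string value of its first node,
# so this returns the text before '--' in the first paragraph only.
first = DOC.xpath("substring-before(/div/p/text(), '--')").strip()

# One evaluation per text node, for k = 1 .. N, with N taken from count().
n = int(DOC.xpath("count(/div/p/text())"))
all_dates = [
    DOC.xpath(
        "substring-before((/div/p/text())[position() = $k], '--')", k=k
    ).strip()
    for k in range(1, n + 1)
]

print(first)
print(all_dates)
```

The strip() calls remove the space that sits between the date and the dashes in the sample text.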

Resources