Could you help me understand why the XPath expression below does not work in Scrapy?
I get "Invalid Xpath".
//*[@id="Text_Body"//..[not(contains(@type,"text/css"))]/text()
The above XPath expression works in Firebug.
I made a mistake in my XPath expression: I forgot a ]. I should have written
//*[@id="Text_Body"]//..[not(contains(@type,"text/css"))]/text()
I was also able to make it work in Scrapy with this reformulation:
//*[@id="Text_Body"]//text()[not(ancestor::style)]
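To sanity-check what the reformulated expression is meant to select, here is a minimal standard-library sketch that mimics it without Scrapy. The sample markup and the helper name text_outside_style are assumptions made up for illustration; the helper walks the tree by hand because xml.etree.ElementTree does not support the ancestor axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from the original question).
SAMPLE = """<html><body>
<div id="Text_Body"><style type="text/css">p { color: red; }</style><p>Hello</p> world</div>
</body></html>"""


def text_outside_style(element):
    """Collect text nodes under `element`, skipping anything inside <style>."""
    if element.tag == "style":
        return []
    parts = [element.text] if element.text else []
    for child in element:
        parts.extend(text_outside_style(child))
        if child.tail:  # text after a child belongs to `element`, not the child
            parts.append(child.tail)
    return parts


root = ET.fromstring(SAMPLE)
body = root.find(".//*[@id='Text_Body']")
texts = [t.strip() for t in text_outside_style(body) if t.strip()]
print(texts)
```

The CSS rule inside the style element is dropped, while the text of the paragraph and the text that follows it are kept.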
I tested these two XPath expressions:
//*[not(@class='android.view.ViewGroup[2]')]
and
//*[not(contains(@class,'android.view.ViewGroup[2]'))]
But neither works. I want to filter out unavailable days using not() or another operator, but I can't get it right. Thanks for any help.
Try the following:
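//*[not(@class='android.view.ViewGroup')][2]
or, a shorter version:
//*[@class!='android.view.ViewGroup'][2]
Both select the second element, among its siblings, that satisfies the condition. Note that the two are not strictly equivalent: the second form only matches elements that actually have a class attribute.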
I am trying to get the first and second <td> of every <tr> in this table with XPath, but I am doing something wrong. It returns [INVALID XPATH EXPRESSION]
//table[@id='thetable']/tbody/tr/concat(td[1],'-',td[2])
Try:
string-join(//table[@id='thetable']/tbody/tr/td[position() le 2]/string(), "-")
Using concat() on the right hand side of "/" requires an XPath 2.0 engine. The error message suggests you are trying to run this using an XPath 1.0 engine. The string-join version also needs XPath 2.0.
In fact any expression that returns a sequence of strings is going to need XPath 2.0 because the XPath 1.0 type system doesn't have any such data type.
If you want an XPath 2.0 implementation that runs in the browser you could try Saxon-JS. (In fact that will give you XPath 3.1).
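If only an XPath 1.0 style engine is available, one common workaround is to select the rows with XPath and build the strings in the host language. The following is a minimal sketch using Python's standard library; the sample table content is invented for illustration, and only the table id comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the id 'thetable' is from the question.
SAMPLE = """<html><body><table id="thetable"><tbody>
<tr><td>a1</td><td>b1</td><td>c1</td></tr>
<tr><td>a2</td><td>b2</td><td>c2</td></tr>
</tbody></table></body></html>"""

root = ET.fromstring(SAMPLE)

# Select the rows with a simple path, then join the first two cells per row.
rows = root.findall(".//table[@id='thetable']/tbody/tr")
pairs = ["-".join((tr.findtext("td[1]"), tr.findtext("td[2]"))) for tr in rows]
print(pairs)
```

This yields one "first-second" string per row, which is the sequence of strings that XPath 1.0 cannot return by itself.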
I am trying to get the following information extracted out of a link using XPath, for example:
I have LINK TEXT HERE
I would like to select the href value of the link but only anything following ag_num=
So I would end up with 470 for the link above. Any ideas are truly appreciated, thank you!!
You can use below XPath expression to get required value:
substring-after(//a/@href, "ag_num=")
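Keep in mind that substring-after returns everything after the marker, so any further query parameters would be included. A small Python sketch of the difference follows; the URL is a made-up placeholder, since the original link is not shown, and the extra lang parameter is an assumption added to make the point visible.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical href; the real link from the question is not available.
href = "http://example.com/page.php?ag_num=470&lang=en"

# Same behaviour as substring-after(@href, "ag_num="): keeps the whole tail.
after_marker = href.partition("ag_num=")[2]

# Parsing the query string isolates just the value of ag_num.
ag_num = parse_qs(urlparse(href).query)["ag_num"][0]

print(after_marker)
print(ag_num)
```

If ag_num is always the last parameter, the two results are identical and the XPath expression alone is enough.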
I want to write an XPath expression to check whether a node contains '#'
<node1>
<node11>Some text</node11>
<node11>#2o11 PickMe</node12>
</node1>
I want to write an XPath expression like "//node11[contains(,'#\d+')]". What is the correct way to check for #?
The correct XPath expression is:
//node11[contains(., '#')]
In your XML, the closing tag of the second subnode should be </node11> instead of </node12>.
If you are using xpath 2.0 you should be able to use something like:
"//node11[matches(.,'#\d+')]"
However, if you aren't using 2.0 you won't have regex support directly. If you are using 1.0 then you won't be able to match using \d+. But this will work:
"//node11[contains(.,'#')]"
Or even:
"//node11[starts-with(.,'#')]"
Use:
/*/node11[contains(., '#')]
Note: It is recommended to avoid using the // pseudo-operator because this most often leads to very slow evaluation of the XPath expression.
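Since regular expressions are only available from XPath 2.0 onwards, another option with a 1.0-style engine is to select the candidate nodes and apply the regex in the host language. A minimal sketch with Python's standard library, using the XML from the question with the closing tag corrected:

```python
import re
import xml.etree.ElementTree as ET

# XML from the question, with the second closing tag fixed to </node11>.
SAMPLE = "<node1><node11>Some text</node11><node11>#2o11 PickMe</node11></node1>"

root = ET.fromstring(SAMPLE)
pattern = re.compile(r"#\d+")  # '#' followed by at least one digit

# Keep only the node11 elements whose text matches the pattern.
matching = [n.text for n in root.iter("node11") if n.text and pattern.search(n.text)]
print(matching)
```

Here the regex plays the role of matches(., '#\d+'), while the element selection stays a plain tag lookup.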
I want to extract "Date: 2009-09-25, 1:54PM EDT" from this webpage
http://auburn.craigslist.org/sha/1392067187.html
But I don't understand how to write Xpath expressions for that.
Can anyone help me in that.
I am getting other fields also from this page.
Why don't you just run a regexp like the one below?
'Date:\s+([0-9]{4}-[0-9]{2}-[0-9]{2}.+?)<'
It seems to be the easiest way. And if you don't want to work on the raw text, you can use XPath 2.0, which has support for regexps (fn:matches).
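As a rough illustration of the regexp approach in Python, here is a sketch run against an invented HTML fragment; the real page markup may differ. It uses a variant of the pattern in which the trailing < sits outside the capture group, so the captured value does not include it.

```python
import re

# Hypothetical fragment; the actual page source is an assumption.
html = "<hr>\nDate: 2009-09-25, 1:54PM EDT<br>\nReply to: see below<br>"

# Capture from the ISO-style date up to (but not including) the next '<'.
pattern = re.compile(r"Date:\s+([0-9]{4}-[0-9]{2}-[0-9]{2}.+?)<")

match = pattern.search(html)
date_text = match.group(1) if match else None
print(date_text)
```

The non-greedy .+? stops at the first tag after the date, which is why the rest of the page is not swallowed.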
Are you running the HTML through Tidy or some other process to turn it into XHTML? If not, how are you able to execute XPath against that HTML?
If the document were well-formed, you could probably use the following XPath:
/html/body/hr[1]/following-sibling::text()[1]
It finds the first HR element in the document, then selects the first text() node following it (which contains the string "Date: 2009-09-25, 1:54PM EDT").
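The same idea can be tried with Python's standard library, assuming the page has already been converted to well-formed XHTML. The sample document below is invented for illustration. ElementTree has no following-sibling axis, but the text directly after an element is exposed as its tail attribute, which corresponds to the first following text node in this layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML standing in for the tidied page.
SAMPLE = (
    "<html><body><h2>Title</h2><hr/>"
    "Date: 2009-09-25, 1:54PM EDT<br/>Some body text<hr/>footer"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)

# find() returns the first matching element, i.e. the first <hr> under <body>.
first_hr = root.find("./body/hr")

# .tail holds the text between this element and the next sibling element.
date_line = (first_hr.tail or "").strip()
print(date_line)
```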