XPath: Need help getting part of href value preceding specific characters

I am trying to get the following information extracted out of a link using XPath, for example:
I have LINK TEXT HERE
I would like to select the href value of the link but only anything following ag_num=
So I would end up with 470 for the link above. Any ideas are truly appreciated, thank you!!

You can use the XPath expression below to get the required value:
substring-after(//a/@href, "ag_num=")
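To try the same idea outside a full XPath engine, here is a minimal sketch in Python using only the standard library. The markup and URL are hypothetical, since the original link is not shown in the question. Because xml.etree.ElementTree supports only a subset of XPath and has no string functions, the substring-after step is emulated with str.partition rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real link from the question is not available.
snippet = '<p><a href="http://example.com/agent?ag_num=470">LINK TEXT HERE</a></p>'


def substring_after(text, marker):
    """Mimic XPath substring-after(): text after the first marker, or '' if absent."""
    return text.partition(marker)[2]


root = ET.fromstring(snippet)
href = root.find(".//a").get("href")  # counterpart of //a/@href
ag_num = substring_after(href, "ag_num=")
print(ag_num)
```

Like the XPath function, the helper returns an empty string when the marker does not occur in the href.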

Related

How to get element value in xpath

I am trying to get an XML element value out of a string, but it doesn't work yet.
Normally I only extract XPath text values, but now I want to get the EAN code from the following string:
<div data-retailrocket-markup-block="60c8602d97a528373cb51d77" data-product-id="8710429017146"></div>
Is this possible? And if so how? Thanks a lot!
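No answer is recorded here. One possible approach, offered as an assumption rather than a confirmed solution, is to select the attribute itself, for example with //div/@data-product-id in a full XPath engine. The sketch below does the equivalent with Python's standard library, assuming the data-product-id attribute is what holds the EAN; the wrapping body element is hypothetical.

```python
import xml.etree.ElementTree as ET

# The div is taken from the question; the <body> wrapper is a hypothetical container.
snippet = (
    '<body><div data-retailrocket-markup-block="60c8602d97a528373cb51d77" '
    'data-product-id="8710429017146"></div></body>'
)

root = ET.fromstring(snippet)
# Find the first div that carries the attribute, then read the attribute value.
div = root.find(".//div[@data-product-id]")
ean = div.get("data-product-id") if div is not None else None
print(ean)
```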

Xpath for the Text under the br tag

I have been trying to find the text after the br tag, but so far I have not been able to. Any help with this?
//div[@class='dl-result-item']/*[2]
The expression above returns the correct Name, but I cannot figure out how to get the Address.
Try the expression below to get the required content:
//li/div[@class='dl-result-item']/*[not(name()=("a", "div"))]/text()[string-length()>0]
Try:
//div[@class='dl-result-item']/*[2]/following-sibling::text()[1]
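As a rough illustration of what the following-sibling::text() step picks up, here is a standard-library Python sketch. The markup is hypothetical, since the asker's HTML is not shown. ElementTree has no following-sibling axis, so the sketch walks the sibling elements and reads their tail text instead; unlike the XPath expression, it skips whitespace-only text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: second child holds the name, the address follows a <br/>.
snippet = (
    '<ul><li><div class="dl-result-item">'
    '<a href="#">1</a><span>Jane Doe</span><br/>12 Example Street'
    '</div></li></ul>'
)


def first_following_text(parent, start_index):
    """Return the first non-blank text that follows the child at start_index."""
    for sibling in list(parent)[start_index:]:
        if sibling.tail and sibling.tail.strip():
            return sibling.tail.strip()
    return None


root = ET.fromstring(snippet)
div = root.find(".//div[@class='dl-result-item']")
name = list(div)[1].text                 # counterpart of /*[2]
address = first_following_text(div, 1)   # text after the name element
print(name, "|", address)
```

In ElementTree, text that follows an element inside the same parent is stored on that element's tail, which is why the address is found on the br element.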

Getting "XPath error: Invalid predicate", while trying to use an Xpath that contains greek letters

I am trying to build a crawler that extracts "Μακεδονία > Ν. Ημαθίας > Δ. Δοβρά" from the "Περιοχή:" field of a webpage.
To do this, I intend to use XPath to locate "Περιοχή:" and then use the following-sibling axis to access and extract the text "Μακεδονία > Ν. Ημαθίας > Δ. Δοβρά", because the td that contains it can be in a different location on other pages (but always after the tr with the text "Περιοχή:") or even missing.
In scrapy shell I am testing the following:
x = response.xpath(u"//th[@text()=u'Περιοχή:']/text()").extract()
expecting to get x = [u"Περιοχή:"]
but instead I am getting an error:
ValueError: XPath error: Invalid predicate in //th[@text()=u'\u03a0\u03b5\u03c1\u03b9\u03bf\u03c7\u03ae:']/text()
What am I doing wrong?
Thanks in advance.
You are specifying the unicode prefix twice. You shouldn't put it inside the XPath expression, since the whole expression is already a unicode string.
i.e.
# this:
u"//th[@text()=u'Περιοχή:']/text()"
# should be this:
u"//th[text()='Περιοχή:']/text()"
Notice there is no u before the quoted text, and you don't need @ before text() either, because text() is an XPath node test, not an attribute.
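A small sketch of the corrected predicate idea, written for Python 3 with only the standard library (the thread itself uses Scrapy on Python 2). The table is hypothetical, and it assumes the th and the wanted td sit in the same tr. ElementTree supports the [tag='text'] predicate, so the row is matched by the text of its th.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; the real page layout is not shown in the question.
snippet = """<table>
  <tr><th>Τιμή:</th><td>100</td></tr>
  <tr><th>Περιοχή:</th><td>Μακεδονία &gt; Ν. Ημαθίας &gt; Δ. Δοβρά</td></tr>
</table>"""

root = ET.fromstring(snippet)
# Python 3 strings are already Unicode, so the Greek label is written as-is,
# with no u prefix inside or outside the expression.
cell = root.find(".//tr[th='Περιοχή:']/td")
region = cell.text if cell is not None else None
print(region)
```

The None check covers the case mentioned in the question where the field is missing from a page.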

Remove or replace some text from XPath string

Is it possible to remove or replace text in an XPath string?
Using XPath I get a URL that starts with http://www and I want to remove http://www, so that the same XPath query returns only the link without it. I can't find anything about removing or replacing text in an XPath string.
Is it possible?
If so, how to do this?
Have you tried substring-after?
substring-after('http://www.stackoverflow.com', 'http://www.')
Example:
<demo>http://www.stackoverflow.com</demo>
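XPath (a function call as a path step requires XPath 2.0; in XPath 1.0 write substring-after(//demo, 'http://www.')):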
//demo/substring-after(., 'http://www.')
Yields:
stackoverflow.com
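One practical difference is worth illustrating: the XPath 2.0 form yields one result per demo element, whereas the XPath 1.0 form substring-after(//demo, 'http://www.') uses only the first matching node. The sketch below reproduces the per-element behaviour in Python with the standard library; the second demo element is a hypothetical addition, and the prefix stripping is done with str.partition because ElementTree has no XPath string functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second <demo> element is added for illustration.
snippet = """<root>
  <demo>http://www.stackoverflow.com</demo>
  <demo>http://www.example.org</demo>
</root>"""

prefix = "http://www."
root = ET.fromstring(snippet)
# One result per <demo>, like //demo/substring-after(., 'http://www.')
hosts = [demo.text.partition(prefix)[2] for demo in root.iter("demo")]
print(hosts)
```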

XPath expression?

I want to extract "Date: 2009-09-25, 1:54PM EDT" from this webpage
http://auburn.craigslist.org/sha/1392067187.html
But I don't understand how to write an XPath expression for that.
Can anyone help me with that?
I am also extracting other fields from this page.
Why don't you just run a regexp like the one below?
'Date:\s+([0-9]{4}-[0-9]{2}-[0-9]{2}.+?\<)'
It seems to be the easiest way. And if you don't want to work on the plain text, you can use XPath 2.0, which supports regular expressions (fn:matches).
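A minimal sketch of the regex route in Python follows. The HTML fragment is hypothetical, since the page source is not reproduced here. The pattern is a slight variation on the one above: it uses [^<]* so that the captured group stops before the next tag instead of including the trailing "<".

```python
import re

# Hypothetical fragment of the page source; the real HTML is not reproduced here.
html = "<hr>\nDate: 2009-09-25, 1:54PM EDT<br>\nReply to: ..."

# [^<]* consumes everything up to, but not including, the next "<".
pattern = re.compile(r"Date:\s+([0-9]{4}-[0-9]{2}-[0-9]{2}[^<]*)")

match = pattern.search(html)
date_text = match.group(1).strip() if match else None
print(date_text)
```

Use match.group(0) instead of group(1) if the "Date:" label should be part of the result.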
Are you running the HTML through Tidy or some other process to turn it into XHTML? If not, how are you able to execute XPath against that HTML?
If the document were well-formed, you could probably use the following XPath:
/html/body/hr[1]/following-sibling::text()[1]
It finds the first hr element in the document, then selects the first text() node following it (which contains the string "Date: 2009-09-25, 1:54PM EDT").
