How to filter certain words from selected text using XPath?

To select the text here:
Alpha Bravo Charlie Delta Echo Foxtrot
from this HTML structure:
<div id="entry-2" class="item-asset asset hentry">
<div class="asset-header">
<h2 class="asset-name entry-title">
<a rel="bookmark" href="http://blahblah.com/politics-democrat">Pelosi Q&A</a>
</h2>
</div>
<div class="asset-content entry-content">
<div class="asset-body">
<p>Alpha Bravo Charlie Delta Echo Foxtrot</p>
</div>
</div>
</div>
I apply the following XPath expression to select the text inside asset-body:
//div[contains(
    div/h2[
        contains(concat(' ',@class,' '),' asset-name ')
        and
        contains(concat(' ',@class,' '),' entry-title ')
    ]/a[@rel='bookmark']/@href
,'democrat')
]/div/div[
    contains(concat(' ',@class,' '),' asset-body ')
]//text()
How would I sanitize the following words from the text:
Alpha
Charlie
Echo
So that I end up with only the following text in this example:
Bravo Delta

With XPath 1.0, assuming unique NMTokens:
concat(substring-before(concat(' ',$Node,' '),' Alpha '),
       substring-after(concat(' ',$Node,' '),' Alpha '))
As you can see, this becomes very verbose (and performs poorly) as the word list grows.
With XPath 2.0:
string-join(tokenize($Node,' ')[not(.=('Alpha','Charlie','Echo'))],' ')

This can't be done in XPath 1.0 alone -- you'll need to get the text in the host language and do the replacement there.
In XPath 2.0 one can use the replace() function:
replace(replace(replace($vText, ' Alpha ', ''), ' Charlie ', ''), ' Echo ', '')
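With an XPath 1.0-only engine, the practical route is the one described above: pull the text out and filter it in the host language. A minimal Python sketch of that approach, mirroring the XPath 2.0 tokenize()/string-join() one-liner (the text and word list are taken from the example above):

```python
# Filter unwanted words from extracted text in the host language.
text = "Alpha Bravo Charlie Delta Echo Foxtrot"
banned = {"Alpha", "Charlie", "Echo"}

# Split on whitespace, drop the banned tokens, rejoin with single spaces.
result = " ".join(word for word in text.split() if word not in banned)
print(result)  # Bravo Delta Foxtrot
```

Note that filtering the three listed words from the sample text also leaves Foxtrot in place.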

Related

Xpath text between tags

Any idea how I would get the text between two tags using XPath? Specifically the 3, bd, 1, ba.
<p class="MuiTypography-root RoofCard__RoofCardNameStyled-niegej-8 hukPZu MuiTypography-body1" xpath="1">
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md">$65,000</span></p>
**"3" == $0
" bd, " == $0
"1" == $0
" ba | " == $0**
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md" xpath="1">926</span>
tried:
In fact, from your sample that's a simple text() node after the p:
//p/following-sibling::text()[1]
but of course you'll need to parse it. This will return almost exactly what you need:
values = response.xpath('//p/following-sibling::text()[1]').re(r'"([^"]+)"')
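Scrapy's .re() shortcut is essentially re.findall() applied to the extracted text, so the same quoted values can be pulled out with the stdlib alone (the input string below mimics the text node shown in the question):

```python
import re

# The text node sitting between the two tags, as shown in the question.
tail_text = '"3" == $0 " bd, " == $0 "1" == $0 " ba | " == $0'

# Same pattern as the .re() call above: capture anything between quotes.
values = re.findall(r'"([^"]+)"', tail_text)
print(values)  # ['3', ' bd, ', '1', ' ba | ']
```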

replacing html tag and its content using ruby gsub

I am trying to replace a <p>..</p> tag content in html content with empty string by doing the following.
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "
When I did
string.gsub!(/<p.*?>|<\/p>/, '')
It just replaced the <p> and </p> with empty string but the content remained. How can I remove both the tag and its content ?
Apparently, your regex does not match <p>...</p> (<p> and its content). Try this:
string.gsub!(/<p>.*<\/p>/, '')
test = '\n <img alt=\"testing artice breaking news\" src=\"something.com" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "'
test.gsub(/<p>.*<\/p>/, '')
Return
"\\n <img alt=\\\"testing artice breaking news\\\" src=\\\"something.com\" />\\n \\n \""
Also, please consider @Tom Lord's comment: you can use Nokogiri to manipulate HTML.
First of all, consider using HTML parsers when parsing HTML, see How do I remove a node with Nokogiri?.
If you want to do it with a regex, you can use
string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
See the Rubular regex demo. This will work with tags that cannot be nested. Details:
<p(?:\s[^>]*)?> - <p, and an optional sequence of a whitespace and zero or more chars other than > (as many as possible), and then >
.*? - due to /m, any zero or more chars as few as possible
<\/p> - </p> string.
If the tags can be nested, you still can use a regex:
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
See the Rubular regex demo. Details:
<#{tagname} - < and tag name
(?:\s[^>]*)?> - an optional sequence of whitespace and then zero or more chars other than >, and then >
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)* - zero or more occurrences of
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)* - zero or more chars other than < and then zero or more sequences of < that is not followed with tag name + > or whitespace or / + tag name + > followed with zero or more chars other than < chars
|
\g<0> - the whole regex pattern recursed
<\/#{tagname}> - </ + tag name + >.
See a Ruby demo:
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"

Why does xpath inside a selector loop still return a list in the tutorial

I am learning scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html
When I run the following example script from the tutorial, I found that even though it was already looping through the selector list, the title I got from sel.xpath('a/text()').extract() was still a list containing one string, like [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example the list is assigned to the item as item['title'] = sel.xpath('a/text()').extract(), which I think is not logically correct.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
However if I use the following code:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)
the link is a string rather than a list.
Is this a bug or intended?
.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.
See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract
(SelectorList) .extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
.extract_first() is what you are looking for (which is poorly documented)
Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :
If you want to extract only first matched element, you can call the selector .extract_first()
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
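The list-vs-string behaviour can be sketched with a toy model of the two classes involved (a simplification; the real implementations live in parsel):

```python
class Selector:
    """A single matched node; extract() yields one string."""
    def __init__(self, data):
        self.data = data

    def extract(self):
        return self.data


class SelectorList(list):
    """What .xpath()/.css() return; extract() yields a list."""
    def extract(self):
        # One string per selector in the list.
        return [sel.extract() for sel in self]

    def extract_first(self, default=None):
        # Convenience: first match as a plain string, or a default.
        return self[0].extract() if self else default


sl = SelectorList([Selector('Python 3 Object Oriented Programming')])
print(sl.extract())        # a one-element list
print(sl.extract_first())  # a plain string
```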
In your other example:
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)
each href in the loop will be a Selector object. Calling .extract() on it will get you a single Unicode string back:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]:
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 ...
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]
so .css() on the response returns a SelectorList:
In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList
Looping on that object gives you Selector instances:
In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print href
   ...:
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
And calling .extract() gives you a single Unicode string:
In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print type(href.extract())
   ...:
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
(...)
Note: .extract() on Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (which is the same as Scrapy selectors, and used under the hood in scrapy 1.1+)

Filter/Exclude xPath extraction via "pattern"

This is what I have to work with:
<div class="Pictures zoom">
<a title="Productname 1" class="zoomThumbActive" rel="{gallery: 'gallery1', smallimage: '/images/2.24198/little_one.jpeg', largeimage: '/images/76.24561/big-one-picture.jpeg'}" href="javascript:void(0)" style="border-width:inherit;">
<img title="Productname 1" src="/images/24.245/mini-doge-picture.jpeg" alt="" /></a>
<a title="Productname 1" rel="{gallery: 'gallery1', smallimage: '/images/2.24203/small_one.jpeg', largeimage: '/images/9.5664/very-big-one-picture.jpeg'}" href="javascript:void(0)" style="border-width:inherit;">
<img title="Productname 1" src="/images/22.999/this-picture-is-very-small.jpeg" alt="" /></a>
</div>
Using the following XPath:
/html//div[@class='Pictures zoom']/a/@rel
The output becomes:
{gallery: 'gallery1', smallimage: '/images/2.24198/little_one.jpeg', largeimage: '/images/76.24561/big-one-picture.jpeg'}
{gallery: 'gallery1', smallimage: '/images/2.24203/small_one.jpeg', largeimage: '/images/9.5664/very-big-one-picture.jpeg'}
Is it possible to filter the extraction, so instead of the above, I only get these:
/images/76.24561/big-one-picture.jpeg
/images/9.5664/very-big-one-picture.jpeg
I only wish to keep everything between largeimage: ' and '}
Best regards,
Liu Kang
Use substring-before and substring-after to cut off the parts you do not want.
Using XPath 1.0, this can only be done for single results (so you cannot fetch all URLs contained in one document with a single XPath call). This query will return the first URL:
substring-before(substring-after((//@rel)[1], "largeimage: '"), "'")
XPath 2.0 allows you to run functions as axis steps. This query will return all URLs you're looking for as single tokens:
//@rel/substring-before(substring-after(., "largeimage: '"), "'")
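If only an XPath 1.0 engine is available, the equivalent is to fetch all the rel attributes and slice out the URL in the host language. A Python sketch using the two rel values from the question:

```python
import re

# The two rel attribute values returned by the XPath above.
rels = [
    "{gallery: 'gallery1', smallimage: '/images/2.24198/little_one.jpeg', "
    "largeimage: '/images/76.24561/big-one-picture.jpeg'}",
    "{gallery: 'gallery1', smallimage: '/images/2.24203/small_one.jpeg', "
    "largeimage: '/images/9.5664/very-big-one-picture.jpeg'}",
]

# Keep only what sits between "largeimage: '" and the closing "'".
large = [re.search(r"largeimage: '([^']+)'", rel).group(1) for rel in rels]
print(large)
# ['/images/76.24561/big-one-picture.jpeg', '/images/9.5664/very-big-one-picture.jpeg']
```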

How do I parse multiple strings from HTML using Nokogiri?

I need to parse this HTML with Nokogiri, saving "Piso en Calle Antonio Pascual" in one variable and "Peñíscola" in another.
<h1 class="title g13_24">
Piso en Calle Antonio Pascual
<span class="title-extra-info">Peñíscola</span>
</h1>
require 'nokogiri'
doc = Nokogiri::HTML.parse(<<-HTML)
<h1 class="title g13_24">
Piso en Calle Antonio Pascual
<span class="title-extra-info">Peñíscola</span>
</h1>
HTML
h1 = doc.at_css('h1.title')
str1 = h1.children[0].text.strip
# => "Piso en Calle Antonio Pascual"
str2 = h1.at_css('.title-extra-info').text.strip
# => "Peñíscola"
But frankly, the Nokogiri documentation would have told you the same.
