Why does xpath inside a selector loop still return a list in the tutorial?

I am learning Scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html
When I run the following example script from the tutorial, I find that even though it is already looping through the selector list, the title I get from sel.xpath('a/text()').extract() is still a list containing one string, like [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example the list is assigned to the item as item['title'] = sel.xpath('a/text()').extract(), which I think is not logically correct.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
However, if I use the following code:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)
the link is a string rather than a list.
Is this a bug or intended?

.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.
See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract
(SelectorList) .extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
.extract_first() is what you are looking for (which is poorly documented)
Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :
If you want to extract only first matched element, you can call the selector .extract_first()
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
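Applied to the tutorial spider from the question, a minimal sketch of the parse() callback using .extract_first() (not from the tutorial itself) would give you a single string per field:

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        # .extract_first() returns the first match as a unicode string,
        # or None if there is no match, instead of a list
        title = sel.xpath('a/text()').extract_first()
        link = sel.xpath('a/@href').extract_first()
        desc = sel.xpath('text()').extract_first()
        print title, link, desc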
In your other example:
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)
each href in the loop will be a Selector object. Calling .extract() on it will get you a single Unicode string back:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]:
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 ...
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]
so .css() on the response returns a SelectorList:
In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList
Looping on that object gives you Selector instances:
In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print href
   ...:
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
And calling .extract() gives you a single Unicode string:
In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print type(href.extract())
   ...:
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
Note: .extract() on a Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (parsel is the same as Scrapy's selectors, and is used under the hood in Scrapy 1.1+).

Related

How to empty a folder ("Delete All") in an Outlook plugin

Is there a way to delete all items in a folder? I want to clear the trash folder ("deleteditems"). When I used EWS I got this error:
The requested web method is unavailable to this caller or application.
Is there any workaround to make it work in a web plugin?
Full code here:
var xml =
'<?xml version="1.0" encoding="utf-8"?>' +
'<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \n' +
' xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" \n' +
' xmlns:t="http://schemas.microsoft.com/exchange/services/2006/types" \n' +
' xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">\n' +
' <soap:Header>' +
' <t:RequestServerVersion Version="Exchange2010_SP1"/>' +
' </soap:Header>' +
' <soap:Body>' +
' <m:EmptyFolder DeleteType="HardDelete" DeleteSubFolders="true">' +
' <m:FolderIds>' +
' <t:DistinguishedFolderId Id="deleteditems" />' +
' </m:FolderIds>' +
' </m:EmptyFolder>' +
' </soap:Body>' +
'</soap:Envelope>';
Office.context.mailbox.makeEwsRequestAsync(xml, function (result) {
console.log(result);
});
For Office.context.mailbox.makeEwsRequestAsync, we only support a small subset of EWS operations, all documented here. EmptyFolder is not one of the supported operations; that's why you are getting the error. An alternative approach is to use Microsoft Graph.
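For illustration, here is a rough sketch of that Graph-based alternative in Python (not Office.js): it pages through the messages in the deleteditems well-known folder and deletes them one by one. The access token is a placeholder you would obtain from your own OAuth flow with the Mail.ReadWrite scope, and this deletes per message rather than through a single empty-folder call:

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
# Placeholder: obtain a token with the Mail.ReadWrite scope via your own OAuth flow
headers = {"Authorization": "Bearer <ACCESS_TOKEN>"}

# Page through the Deleted Items folder and delete each message.
# This is a simple sketch; re-run it if messages remain after a pass.
url = GRAPH + "/me/mailFolders/deleteditems/messages?$select=id"
while url:
    page = requests.get(url, headers=headers).json()
    for msg in page.get("value", []):
        requests.delete(GRAPH + "/me/messages/" + msg["id"], headers=headers)
    url = page.get("@odata.nextLink")  # follow pagination until exhausted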

How can I select multiple values from elements with the same name with XPath?

I want the last two <dd> elements, so that the output reads: Talstrasse 2A 01816 Berggiesshübel
Here is the HTML snippet:
<dt>Öffnungszeiten:</dt>
<dd>10:00 - 17:00</dd>
<dt>Veranstaltungsart:</dt>
<dd> Herbstmarkt</dd>
<dt>Veranstaltungsort:</dt>
<dd> Besucherbergwerk "Marie Louise Stolln" Berggiesshübel</dd>
<dt>Strasse:</dt>
<dd>Talstrasse 2A,</dd>
<dt>PLZ / Ort:</dt>
<dd> 01816 Berggiesshübel </dd>
Here is the XPath my software suggests:
//div[contains(concat (" ", normalize-space(@class), " "), " container ")]/section[contains(concat (" ", normalize-space(@class), " "), " row event-details ")]/div[1]/div[3][contains(concat (" ", normalize-space(@class), " "), " bg-normal pal mbm ")]/div[1][contains(concat (" ", normalize-space(@class), " "), " row ")]/div[1]/dl[1][contains(concat (" ", normalize-space(@class), " "), " dl-horizontal event-detail-dl ")]/dd[6] | //html/body/div/section/div[1]/div[3]/div[1]/div[1]/dl[1]/dd[6]
can anyone help me?
The last two dd elements would be (//dd)[position() >= last() - 1]. In XPath 2.0 and higher you can get a single string using the string-join function, e.g. string-join((//dd)[position() >= last() - 1], ' ').
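If you are running this from Python with lxml (which only supports XPath 1.0), string-join is not available, but you can select the nodes with the same predicate and join them in the host language. A minimal sketch, using a shortened version of the snippet above:

from lxml import html

snippet = '''
<dl>
  <dd>10:00 - 17:00</dd>
  <dd> Herbstmarkt</dd>
  <dd>Talstrasse 2A,</dd>
  <dd> 01816 Berggiesshübel </dd>
</dl>
'''
doc = html.fromstring(snippet)
# Select only the last two <dd> elements, then join their text in Python
last_two = doc.xpath("(//dd)[position() >= last() - 1]/text()")
print(" ".join(t.strip() for t in last_two))  # Talstrasse 2A, 01816 Berggiesshübel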

I need help with the title in sweetalert

I get an error (or nothing shows) when I put text containing quotes, e.g. when I refer to inches (" or ´´):
swal({
    text: ' Product X 10" '
})
or
swal({
    text: ' Product X 10' '
})
What might be the cause of this?

loading value from element using xpath

I'm kinda new to XPath, so I don't know how to do this. I have this file:
<map>
    <object name="object (1)">
        <position>564.014893 -7424.033691 35.448875</position>
        <rotation>0.000000 0.000000 0.000000</rotation>
        <model>3494</model>
    </object>
</map>
As you can see in position:
564.014893 -7424.033691 35.448875
564.014893 is X
-7424.033691 is Y
35.448875 is Z
How do I load X (or Y/Z)?
If you have XPath 2.0 support, you can use tokenize to split the string on spaces. The X, Y, and Z values would respectively be:
tokenize(map/object/position, ' ')[1]
tokenize(map/object/position, ' ')[2]
tokenize(map/object/position, ' ')[3]
If you only have XPath 1.0 support, you can use the substring-before and substring-after functions. The X, Y, and Z values would respectively be:
substring-before(map/object/position, " ")
substring-before(substring-after(map/object/position, " "), " ")
substring-after(substring-after(map/object/position, " "), " ")
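From a host language such as Python with lxml (XPath 1.0 only), the same expressions can be evaluated directly. A minimal sketch, where map.xml is just a placeholder for the document shown above, and splitting the string in Python is an equally valid shortcut:

from lxml import etree

tree = etree.parse("map.xml")  # placeholder for the <map> document shown above
pos = '/map/object/position'
x = tree.xpath('substring-before(%s, " ")' % pos)
y = tree.xpath('substring-before(substring-after(%s, " "), " ")' % pos)
z = tree.xpath('substring-after(substring-after(%s, " "), " ")' % pos)
print(x, y, z)  # 564.014893 -7424.033691 35.448875

# Simpler alternative: pull the whole string and split it in Python
x, y, z = tree.xpath('string(%s)' % pos).split()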

How to filter certain words from selected text using XPath?

To select the text here:
Alpha Bravo Charlie Delta Echo Foxtrot
from this HTML structure:
<div id="entry-2" class="item-asset asset hentry">
<div class="asset-header">
<h2 class="asset-name entry-title">
<a rel="bookmark" href="http://blahblah.com/politics-democrat">Pelosi Q&A</a>
</h2>
</div>
<div class="asset-content entry-content">
<div class="asset-body">
<p>Alpha Bravo Charlie Delta Echo Foxtrot</p>
</div>
</div>
</div>
I apply the following XPath expression to select the text inside asset-body:
//div[contains(
  div/h2[
    contains(concat(' ',@class,' '),' asset-name ')
    and
    contains(concat(' ',@class,' '),' entry-title ')
  ]/a[@rel='bookmark']/@href
  ,'democrat')
]/div/div[
  contains(concat(' ',@class,' '),' asset-body ')
]//text()
How would I sanitize the following words from the text:
Alpha
Charlie
Echo
So that I end up with only the following text in this example:
Bravo Delta
With XPath 1.0, supposing unique NMTokens:
concat(substring-before(concat(' ',$Node,' '),' Alpha '),
substring-after(concat(' ',$Node,' '),' Alpha '))
As you can see, this becomes very verbose (and performs badly).
With XPath 2.0:
string-join(tokenize($Node,' ')[not(.=('Alpha','Charlie','Echo'))],' ')
How would I sanitize the following words from the text:
Alpha
Charlie
Echo
So that I end up with only the following text in this example:
Bravo Delta
This can't be done in XPath 1.0 alone -- you'll need to get the text in the host language and do the replacement there.
In XPath 2.0 one can use the replace() function:
replace(replace(replace($vText, ' Alpha ', ''), ' Charlie ', ''), ' Echo ', '')
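As noted above, with XPath 1.0 the filtering has to happen in the host language. A minimal sketch in Python with lxml, assuming the page is saved locally as page.html (a simplified class test stands in for the full expression above):

from lxml import html

doc = html.parse("page.html")  # placeholder for the page containing the snippet above
text = doc.xpath('string(//div[contains(@class, "asset-body")]/p)')
banned = {'Alpha', 'Charlie', 'Echo'}
print(' '.join(word for word in text.split() if word not in banned))  # Bravo Delta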
