I'm new to scraping with Scrapy and, unfortunately, I can't retrieve data through a request that simulates the AJAX request the site makes.
I read other topics, but they didn't help me resolve my issue.
The website I would like to crawl is auchan.fr; it has a dynamic search box driven by Algolia.
Here is my spider for a "nutella" search (a POST request):
class AjaxspiderSpider(scrapy.Spider):
    name = "ajaxspider"
    allowed_domains = ["auchandirect.fr/recherche"]
    #start_urls = ['https://www.auchandirect.fr/recherche/']

    def start_requests(self):
        full_url = "/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225"
        yield FormRequest('https://tn96v7lrxc-dsn.algolia.net' + full_url, callback=self.parse, formdata={"params":"query=nutella&facets=%5B%22loopr_shelf%22%5D&hitsPerPage=50"})

    def parse(self, response):
        with open('data_content', 'w') as file:
And here is the log I got:
2017-02-03 15:14:35 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://tn96v7lrxc-dsn.algolia.net/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225> (referer: None)
2017-02-03 15:14:35 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://tn96v7lrxc-dsn.algolia.net/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225>: HTTP status code is not handled or not allowed
2017-02-03 15:14:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-03 15:14:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 545,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 338,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 3, 14, 14, 35, 216807),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 2, 3, 14, 14, 34, 977436)}
2017-02-03 15:14:35 [scrapy.core.engine] INFO: Spider closed (finished)
Thank you for any information.
This is not an Ajax-related question but a site-specific one. You are passing the search parameters string the wrong way: you send it as formdata, while it should be passed as the raw body of the POST request. So it should look like this:
yield Request('https://tn96v7lrxc-dsn.algolia.net' + full_url,
callback=self.parse, method='POST',
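A minimal sketch of the raw-body approach, built with the standard library only so that it runs without Scrapy installed. The helper name build_algolia_request is hypothetical, the URL and parameters are copied from the question, and it assumes the endpoint expects a JSON object with a single "params" key as the request body.

```python
import json
from urllib.parse import quote

BASE_URL = "https://tn96v7lrxc-dsn.algolia.net"
FULL_URL = (
    "/1/indexes/articles_article_11228/query"
    "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4"
    "&x-algolia-application-id=TN96V7LRXC"
    "&x-algolia-api-key=46a121512cba9c452df318ffca231225"
)


def build_algolia_request(query, hits_per_page=50):
    """Return keyword arguments for scrapy.Request (hypothetical helper)."""
    # The search parameters stay one URL-encoded string, as in the question.
    params = "query={}&facets=%5B%22loopr_shelf%22%5D&hitsPerPage={}".format(
        quote(query), hits_per_page
    )
    return {
        "url": BASE_URL + FULL_URL,
        "method": "POST",
        # Raw JSON body instead of formdata, which would form-encode the dict.
        "body": json.dumps({"params": params}),
    }


kwargs = build_algolia_request("nutella")
print(kwargs["body"])
# Inside start_requests() this could be used as:
#     yield scrapy.Request(callback=self.parse, **kwargs)
```

Keeping the body construction in a plain function makes it easy to check the payload before sending any request.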
I just upgraded StreamSets using Cloudera Manager (5.8.2) and can no longer log in to StreamSets: I get "login failed". The new version seems to use a different LDAP lookup method.
My logs before the update look like this:
Mar 15, 10:42:07.799 AM INFO com.streamsets.datacollector.http.LdapLoginModule
Searching for users with filter: '(&(objectClass={0})({1}={2}))' from base dn: DC=myComp,DC=Statistics,DC=ComQ,DC=uk
Mar 15, 10:42:07.826 AM INFO com.streamsets.datacollector.http.LdapLoginModule
Found user?: true
Mar 15, 10:42:07.826 AM INFO com.streamsets.datacollector.http.LdapLoginModule
Attempting authentication: CN=UserDV,OU=London,OU=ComQ,DC=ComQ,DC=Statistics,DC=comQ,DC=uk
My logs after the update look like this:
Mar 15, 11:10:21.406 AM INFO com.streamsets.datacollector.http.LdapLoginModule
Accessing LDAP Server: ldaps://comQ.statisticsxxx.com:3269 startTLS: false
Mar 15, 11:10:22.086 AM INFO org.ldaptive.auth.SearchDnResolver
search for user=[org.ldaptive.auth.User#1573608120::identifier= userdv, context=null] failed using filter=[org.ldaptive.SearchFilter#1129802876::filter=(&(objectClass=user)(uid={user})), parameters={context=null, user=userdv}]
Mar 15, 11:10:22.087 AM INFO com.streamsets.datacollector.http.LdapLoginModule
Found user?: false
Mar 15, 11:10:22.087 AM ERROR com.streamsets.datacollector.http.LdapLoginModule
Result code: null - DN cannot be null
You should change ldap.userFilter in Cloudera Manager from uid={user} to name={user}
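To see what the change does, here is a small sketch of how a {user} placeholder expands into the search filter shown in the log. The function render_filter is hypothetical and only mimics the substitution; it is not StreamSets or ldaptive code. It assumes the directory entries on this server carry the login in the name attribute rather than in uid.

```python
def render_filter(user_filter, user, object_class="user"):
    """Expand an ldap.userFilter-style template (hypothetical helper)."""
    # Mirrors the filter shape seen in the log: (&(objectClass=...)(<userFilter>))
    return "(&(objectClass={})({}))".format(
        object_class, user_filter.replace("{user}", user)
    )


old_filter = render_filter("uid={user}", "userdv")
new_filter = render_filter("name={user}", "userdv")
print(old_filter)  # the filter that failed in the log after the update
print(new_filter)  # the filter produced by the suggested setting
```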
I'm trying to create a new data source in Oracle BI Visual Analyzer, but the dialog displays a timeout error message.
This is the message from the log:
<Feb 18, 2017, 9:38:55,614 PM EET> <Error> <oracle.bi.datasource> <BEA-000000> <Failed to write to output stream
java.util.concurrent.TimeoutException null
<Feb 18, 2017, 9:38:55,639 PM EET> <Error> <oracle.bi.datasource.trace> <BEA-000000> <TIMEOUT_ERROR Request timed out: Request Headers
Accept=text/html, image/gif, image/jpeg, */*; q=.2
Cause - The request did not complete within the allotted time.
Action - Check the log for details and increase the connection pool timeout or fix the root cause.
<Feb 18, 2017, 9:38:55,642 PM EET> <Error> <oracle.bi.web.datasetsvc> <BEA-000000> <getObjectList failed for 'weblogic'.'HiveDS'>
I ran telnet to oracle-VirtualBox 9508 and the service on that port is responding (it's the Cluster Controller there), so I'm lost as to why it is:
1) connecting there at all (I suppose it should try to connect to Hive directly)
2) failing to do its job
Has anyone had the same experience?
I'm running the CoreNLP dedicated server on AWS and trying to make a request from Ruby. The server seems to receive the request correctly, but it seems to ignore the input annotators list and always defaults to all annotators. My Ruby code to make the request looks like this:
uri = URI.parse(URI.encode('http://ec2-************.compute.amazonaws.com//?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}'))
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new("/v1.1/auth")
request.add_field('Content-Type', 'application/json')
request.body = text
response = http.request(request)
json = JSON.parse(response.body)
In the nohup.out logs on the server I see the following:
[/] API call w/annotators tokenize,ssplit,pos,depparse,lemma,ner,mention,coref,natlog,openie
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.0 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
PreComputed 100000, Elapsed Time: 2.259 (s)
Initializing dependency parser done [5.1 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [7.2 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 83 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 267 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 25 rules
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator mention
Using mention detector type: dependency
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
etc etc.
When I run test queries using wget on the command line it seems to work fine.
wget --post-data 'the quick brown fox jumped over the lazy dog' 'ec2-*******.compute.amazonaws.com/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}' -O -
Any help as to why this is happening would be appreciated, thanks!
It turns out the request was being constructed incorrectly. The path should be passed as the argument to Post.new. Corrected code below in case it helps anyone:
host = "http://ec2-***********.us-west-2.compute.amazonaws.com"
path = '/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}'
encoded_path = URI.encode(path)
uri = URI.parse(URI.encode(host))
http = Net::HTTP.new(uri.host, uri.port)
# request = Net::HTTP::Post.new("/v1.1/auth")
request = Net::HTTP::Post.new(encoded_path)
request.add_field('Content-Type', 'application/json')
request.body = text
response = http.request(request)
json = JSON.parse(response.body)
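The same idea, sketched in Python for illustration only: the properties query string has to be part of the path that is actually sent with the POST, not just of the URL used to look up the host. The helper build_path is hypothetical, and the host is left out because it is masked in the post.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_path(properties):
    """Build the request path with percent-encoded properties (hypothetical helper)."""
    # Serialize the properties as JSON, then encode them into the query string.
    return "/?" + urlencode({"properties": json.dumps(properties)})


properties = {
    "tokenize.whitespace": "true",
    "annotators": "tokenize,ssplit,pos",
    "outputFormat": "json",
}
path = build_path(properties)

# Decoding the query again shows which annotators the server would receive.
decoded = json.loads(parse_qs(urlsplit(path).query)["properties"][0])
print(decoded["annotators"])

# With the standard library, the path would then go into the request itself:
#     conn = http.client.HTTPConnection(host)
#     conn.request("POST", path, body=text)
```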
I'm using Scrapy to crawl a website, but I don't know how to parse the page and find a piece of text.
The following is the website; I want to find "hello I'm here".
This is my xpath code:
Html part:
<div class="sort hottest_dishes1">
<ul class="sort_title">
<li class="current">按默认排序</li>
<li class="">按人气排序</li>
<ol class="sort_content">
<li class="show">
<div class="sort_yi">
<div class="sort_left">
<p class="li_title">
<strong class="span_left ">
hello I'm here<span class="restaurant_list_hot"></span>
<span> (川菜) </span>
<span class="span_d_right3" title="馋嘴牛蛙特价只要9.9元,每单限点1份">馋嘴牛蛙特价9块9</span>
<p class="consume">
<p class="sign">
<span>水煮鲶鱼 馋嘴牛蛙 酸梅汤 钵钵鸡 香辣土豆丝 毛血旺 香口猪手 ……</span>
<div class="sort_right">
<div class="sort_all" >
Using response.css in the shell works, but in Scrapy it returns nothing. Did I write the code wrong?
The following is my code:
def parse_torrent(self, response):
    torrent = TorrentItem()
    torrent['url'] = response.url
    torrent['name'] = response.xpath("//div[@class='sort_left']/p/strong/a[1]").extract()[1]
    torrent['description'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/strong[1]/following-sibling::span[1]").extract()
    torrent['size'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/span[1]").extract()
    return torrent
I personally find CSS selectors much easier than XPath for locating content. For the response object that you get when crawling the given document, why don't you try response.css('p[class="li_title"] a::text')[0].extract()?
(I tested it and it works in scrapy shell. The output: u"hello I'm here")
This can be an example of what you need to do:
def parse_torrent(self, response):
    print(response.xpath('//div[@class="sort_left"]/p/strong/a/text()').extract()[0])
2014-12-19 10:58:29+0100 [linkedin] DEBUG: Crawled (200) <GET file:///C:/1.html> (referer: None)
hello I'm here
2014-12-19 10:58:29+0100 [linkedin] INFO: Closing spider (finished)
2014-12-19 10:58:29+0100 [linkedin] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 241000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 213000)}
2014-12-19 10:58:29+0100 [linkedin] INFO: Spider closed (finished)
You can see that "hello I'm here" appears in the output.
You are referring to
You need to add text() to your XPath and, as your <a> has a <span> inside, you need to take element [0] and not [1]. So then you need to change it to
I can't see a <div> in your HTML excerpt with an id attribute of 'list_content', so the [@id='list_content'] predicate filters out everything, whatever the rest of your XPath expression is. The result of the expression evaluation is an empty sequence.
After the question edit:
There is no <href> element in your HTML, so the .../a/href subexpression selects nothing.
href is an attribute of <a>; use .../a/@href instead to process the href attribute contents.
However, if you still want to find the 'hello I'm here' text, then you need to reach the <a> element contents: use .../a/text().
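As a quick check of the difference between the element text and the href attribute, here is a sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of Scrapy's selectors. The markup is a trimmed, well-formed stand-in for the page: the <a> element and its href value are assumptions, since the excerpt in the question does not show them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the relevant part of the page.
SAMPLE = """
<body>
  <div class="sort_left">
    <p class="li_title">
      <strong class="span_left">
        <a href="/example">hello I'm here<span class="restaurant_list_hot"></span></a>
        <span> (川菜) </span>
      </strong>
    </p>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Attribute predicates use @, e.g. [@class='sort_left'].
link = root.find(".//div[@class='sort_left']/p/strong/a")

# .text holds the text before the first child element (the empty <span>).
name = link.text
# Attributes are read separately from the element's text content.
href = link.get("href")

print(name)
print(href)
```

In Scrapy itself the equivalent selections would be .../a/text() and .../a/@href on the response.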
I'm trying to build a spider to catch images. The spider runs, but it doesn't download anything and doesn't raise any errors.
from urlparse import urljoin
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider
from nasa.items import NasaItem
class NasaImagesSpider(BaseSpider):
    name = "nasa.gov"
    start_urls = ('http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        item = NasaItem()
        baseLink = xxs.select('//link/text()').extract()[0]
        imageLink = xxs.select('//tn/text()').extract()
        imgList = []
        for img in imageLink:
            imgList.append(urljoin(baseLink, img))
        item['image_urls'] = imgList
        return item
It runs through the page, and it captures the urls correctly. I pass it down the pipeline, but.. no pics.
The settings file:
BOT_NAME = 'nasa.gov'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/home/usr1/Scrapy/spiders/nasa/images'
SPIDER_MODULES = ['nasa.spiders']
NEWSPIDER_MODULE = 'nasa.spiders'
and the items file:
from scrapy.item import Item, Field
class NasaItem(Item):
    image_urls = Field()
    images = Field()
and the output log:
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Crawled (200) <GET http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml> (referer: None)
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Scraped from <200 http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml>
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Closing spider (finished)
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Dumping spider stats:
{'downloader/request_bytes': 227,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2526,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 802477),
'item_scraped_count': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 682005)}
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Spider closed (finished)
2012-11-12 07:47:29-0500 [scrapy] INFO: Dumping global stats:
{'memusage/max': 104132608, 'memusage/startup': 104132608}
I'm stuck. Any suggestions as to what I'm doing wrong?
[EDITED] Added output log, changed settings bot name.
#pipeline file
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
class PaulsmithPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
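One thing worth checking in the settings from the question: Scrapy's ImagesPipeline reads its target directory from IMAGES_STORE (plural), not IMAGE_STORE, and the pipeline is not enabled when that setting is missing. A sketch of the settings file with that name, keeping every other value as posted:

```python
# settings.py (sketch; values copied from the question except the setting name)
BOT_NAME = 'nasa.gov'

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

# ImagesPipeline looks up IMAGES_STORE to decide where to save downloads.
IMAGES_STORE = '/home/usr1/Scrapy/spiders/nasa/images'

SPIDER_MODULES = ['nasa.spiders']
NEWSPIDER_MODULE = 'nasa.spiders'
```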