I'm using Scrapy to crawl a website, but I don't know how to parse the page and find the word I want.
The following is the page; I want to find "hello I'm here".
This is my XPath expression:
//div[@class='sort_left']/p/strong/a/href/text()
HTML part:
<div class="sort hottest_dishes1">
<ul class="sort_title">
<li class="current">按默认排序</li>
<li class="">按人气排序</li>
</ul>
<ol class="sort_content">
<li class="show">
<div class="sort_yi">
<div class="sort_left">
<p class="li_title">
<strong class="span_left ">
hello I'm here<span class="restaurant_list_hot"></span>
<span> (川菜) </span>
</strong>
<span class="span_d_right3" title="馋嘴牛蛙特价只要9.9元,每单限点1份">馋嘴牛蛙特价9块9</span>
</p>
<p class="consume">
<strong>人均消费:</strong>
<b><span>¥70</span>元</b>
看网友点评
</p>
<p class="sign">
<strong>招牌菜:</strong>
<span>水煮鲶鱼 馋嘴牛蛙 酸梅汤 钵钵鸡 香辣土豆丝 毛血旺 香口猪手 ……</span>
</p>
</div>
<div class="sort_right">
看菜谱
</div>
<div class="sort_all" >
<strong>送达时间:</strong><span>60分钟</span>
</div>
</div>
Using response.css in the shell gives the right result, but in my Scrapy spider it returns nothing. Did I write the code wrong?
The following is my code:
def parse_torrent(self, response):
    torrent = TorrentItem()
    torrent['url'] = response.url
    torrent['name'] = response.xpath("//div[@class='sort_left']/p/strong/a[1]").extract()[1]
    torrent['description'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/strong[1]/following-sibling::span[1]").extract()
    torrent['size'] = response.xpath("//div[@id='list_content']/div/div/ol/li/div/div/p/span[1]").extract()
    return torrent
I personally find CSS selectors much easier than XPath for locating content. For the response object that you get when crawling the given document, try response.css('p[class="li_title"] a::text')[0].extract().
(I tested it and it works in the scrapy shell. The output: u"hello I'm here")
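If that selector works in the shell, a minimal sketch of the same idea inside a spider callback could look like this (the TorrentItem fields are taken from the question and are only illustrative):

def parse_torrent(self, response):
    torrent = TorrentItem()
    torrent['url'] = response.url
    # same CSS selector as above: the text of the <a> inside <p class="li_title">
    torrent['name'] = response.css('p[class="li_title"] a::text')[0].extract()
    return torrent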
This can be an example of what you need to do:
def parse_torrent(self, response):
    print response.xpath('//div[@class="sort_left"]/p/strong/a/text()').extract()[0]
output:
2014-12-19 10:58:28+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: skema_crawler)
2014-12-19 10:58:28+0100 [scrapy] INFO: Optional features available: ssl, http11
2014-12-19 10:58:28+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'skema_crawler.spiders', 'SPIDER_MODULES': ['skema_crawler.spiders'], 'BOT_NAME': 'skema_crawler'}
2014-12-19 10:58:28+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-19 10:58:29+0100 [scrapy] INFO: Enabled item pipelines:
2014-12-19 10:58:29+0100 [linkedin] INFO: Spider opened
2014-12-19 10:58:29+0100 [linkedin] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-19 10:58:29+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-19 10:58:29+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-19 10:58:29+0100 [linkedin] DEBUG: Crawled (200) <GET file:///C:/1.html> (referer: None)
hello I'm here
2014-12-19 10:58:29+0100 [linkedin] INFO: Closing spider (finished)
2014-12-19 10:58:29+0100 [linkedin] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1599,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 241000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 12, 19, 9, 58, 29, 213000)}
2014-12-19 10:58:29+0100 [linkedin] INFO: Spider closed (finished)
You can see that hello I'm here appears in the output.
You are referring to
response.xpath("//div[@class='sort_left']/p/strong/a[1]").extract()[1]
You need to add text() to your XPath and, as your <a> has a <span> inside, you need to get element [0] and not [1]. So you need to change it to
response.xpath("//div[@class='sort_left']/p/strong/a/text()").extract()[0]
I can't see a <div> in your HTML excerpt which has an attribute with value 'list_content' – so the [@id='list_content'] predicate filters out everything, whatever the rest of your XPath expression is. The result of evaluating the expression is an empty sequence.
After the question edit:
There is no <href> element in your HTML, so the .../a/href subexpression selects nothing.
href is an attribute of <a> – use .../a/@href instead to process the href attribute contents.
However, if you still want to find the 'hello I'm here' text, then you need to reach the <a> element's contents – use .../a/text().
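To make the attribute-versus-text distinction concrete, here is a small self-contained sketch using Scrapy's Selector on a fragment loosely modeled on the question's markup (the <a href="/restaurant/123"> element and its URL are hypothetical, added only so both expressions have something to match):

from scrapy.selector import Selector

html = """
<div class="sort_left">
  <p class="li_title">
    <strong class="span_left">
      <a href="/restaurant/123">hello I'm here<span class="restaurant_list_hot"></span></a>
    </strong>
  </p>
</div>
"""

sel = Selector(text=html)
# text() selects the text node directly inside <a>
print(sel.xpath("//div[@class='sort_left']/p/strong/a/text()").extract()[0])
# @href selects the value of the href attribute of <a>
print(sel.xpath("//div[@class='sort_left']/p/strong/a/@href").extract()[0])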
I've tried to start my first instance of a Substrate node using the Create Your First Substrate Chain tutorial.
On running the command
./target/release/node-template --dev --tmp
we got a panic:
main WARN sc_cli::commands::run_cmd Running in --dev mode, RPC CORS has been disabled.
2020-09-01 11:32:26.633 main INFO sc_cli::runner Substrate Node
2020-09-01 11:32:26.633 main INFO sc_cli::runner ✌️ version 2.0.0-rc6-c9fda53-x86_64-macos
2020-09-01 11:32:26.633 main INFO sc_cli::runner ❤️ by Substrate DevHub <https://github.com/substrate-developer-hub>, 2017-2020
2020-09-01 11:32:26.633 main INFO sc_cli::runner 📋 Chain specification: Development
2020-09-01 11:32:26.633 main INFO sc_cli::runner 🏷 Node name: yummy-increase-5727
2020-09-01 11:32:26.633 main INFO sc_cli::runner 👤 Role: AUTHORITY
2020-09-01 11:32:26.633 main INFO sc_cli::runner 💾 Database: RocksDb at /var/folders/k4/8vkq36gd4dv2npf7pzfpt9mm0000gn/T/substrateuBAgDv/chains/dev/db
2020-09-01 11:32:26.633 main INFO sc_cli::runner ⛓ Native runtime: node-template-1 (node-template-1.tx1.au1)
2020-09-01 11:32:26.699 main INFO sc_service::client::client 🔨 Initializing Genesis block/state (state: 0xa2b5…3bab, header-hash: 0x0bea…49e8)
2020-09-01 11:32:26.700 main INFO afg 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.
2020-09-01 11:32:26.725 main INFO sc_consensus_slots ⏱ Loaded block-time = 6000 milliseconds from genesis on first-launch
2020-09-01 11:32:26.726 main WARN sc_service::builder Using default protocol ID "sup" because none is configured in the chain specs
2020-09-01 11:32:26.726 main INFO sub-libp2p 🏷 Local node identity is: 12D3KooWMZTpWokAWBBuKTuv3EuUpf4f8PnctCgsCCs46tzMZ1ZN (legacy representation: QmSVUxS4iwXroNRbqs9zNGDsJhskXsc66c7CapZWVjjyht)
====================
Version: 2.0.0-rc6-c9fda53-x86_64-macos
0: backtrace::backtrace::trace
1: backtrace::capture::Backtrace::new
2: sp_panic_handler::set::{{closure}}
3: std::panicking::rust_panic_with_hook
4: _rust_begin_unwind
5: core::panicking::panic_fmt
6: core::option::expect_none_failed
7: hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector>::new
8: sc_offchain::api::http::SharedClient::new
9: sc_offchain::OffchainWorkers<Client,Storage,Block>::new
10: sc_service::builder::build_offchain_workers
11: node_template::service::new_full
12: sc_cli::runner::Runner<C>::run_node_until_exit
13: node_template::command::run
14: node_template::main
15: std::rt::lang_start::{{closure}}
16: std::rt::lang_start_internal
17: _main
Thread 'main' panicked at 'cannot access native cert store: Custom { kind: Other, error: Error { code: -25262, message: "The Trust Settings Record was corrupted." } }', /Users/rmp/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-rustls-0.21.0/src/connector.rs:46
This is a bug. Please report it at:
support.anonymous.an
Any ideas? It's a pretty basic tutorial and not a lot to get wrong.
Running on Mac 10.14.6, Node 12.18.3, Yarn 1.22.5
Edit:
Looking back at the compile, I noticed that I already had rust/rustup installed and the compile threw a warning that it ignored. Possibly related?
Additional logging:
RUST_LOG=debug RUST_BACKTRACE=1 ./target/release/node-template -lruntime=debug --dev --tmp
2020-09-02 06:00:57.058 main WARN sc_cli::commands::run_cmd Running in --dev mode, RPC CORS has been disabled.
2020-09-02 06:00:57.058 main INFO sc_cli::runner Substrate Node
2020-09-02 06:00:57.058 main INFO sc_cli::runner ✌️ version 2.0.0-rc6-c9fda53-x86_64-macos
2020-09-02 06:00:57.058 main INFO sc_cli::runner ❤️ by Substrate DevHub <https://github.com/substrate-developer-hub>, 2017-2020
2020-09-02 06:00:57.058 main INFO sc_cli::runner 📋 Chain specification: Development
2020-09-02 06:00:57.058 main INFO sc_cli::runner 🏷 Node name: educated-tub-7928
2020-09-02 06:00:57.058 main INFO sc_cli::runner 👤 Role: AUTHORITY
2020-09-02 06:00:57.058 main INFO sc_cli::runner 💾 Database: RocksDb at /var/folders/k4/8vkq36gd4dv2npf7pzfpt9mm0000gn/T/substratelHH2Ba/chains/dev/db
2020-09-02 06:00:57.058 main INFO sc_cli::runner ⛓ Native runtime: node-template-1 (node-template-1.tx1.au1)
2020-09-02 06:00:57.132 main INFO sc_service::client::client 🔨 Initializing Genesis block/state (state: 0xa2b5…3bab, header-hash: 0x0bea…49e8)
2020-09-02 06:00:57.132 main DEBUG db DB Commit 0x0beaa5a0e87b3bd60a9e16630bcd9c27544a4d9f7b8bfb7e39d6f432eac049e8 (0), best = true
2020-09-02 06:00:57.134 main INFO afg 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.
2020-09-02 06:00:57.154 main DEBUG wasm-runtime Prepared new runtime version Some(RuntimeVersion { spec_name: RuntimeString::Owned("node-template"), impl_name: RuntimeString::Owned("node-template"), authoring_version: 1, spec_version: 1, impl_version: 1, apis: [([223, 106, 203, 104, 153, 7, 96, 155], 3), ([55, 227, 151, 252, 124, 145, 245, 228], 1), ([64, 254, 58, 212, 1, 248, 149, 154], 4), ([210, 188, 152, 151, 238, 208, 143, 21], 2), ([247, 139, 39, 139, 229, 63, 69, 76], 2), ([221, 113, 141, 92, 197, 50, 98, 212], 1), ([171, 60, 5, 114, 41, 31, 235, 139], 1), ([237, 153, 197, 172, 178, 94, 237, 245], 2), ([188, 157, 137, 144, 79, 91, 146, 63], 1), ([55, 200, 187, 19, 80, 169, 162, 168], 1)], transaction_version: 1 }) in 20 ms.
2020-09-02 06:00:57.155 main DEBUG wasm-runtime Allocated WASM instance 1/8
2020-09-02 06:00:57.158 main INFO sc_consensus_slots ⏱ Loaded block-time = 6000 milliseconds from genesis on first-launch
2020-09-02 06:00:57.158 main WARN sc_service::builder Using default protocol ID "sup" because none is configured in the chain specs
2020-09-02 06:00:57.158 main INFO sub-libp2p 🏷 Local node identity is: 12D3KooWHX7yTCJP8wZn53w98pnvJ76HuHvYrNLhCmJbnkbx4ew1 (legacy representation: QmUVjga4dsGvTZJfxisxLogR3fqPDLxbZTUgLhRgZBZJdS)
2020-09-02 06:00:57.160 main DEBUG libp2p_websocket::framed /ip6/::/tcp/30333 is not a websocket multiaddr
2020-09-02 06:00:57.161 main DEBUG libp2p_websocket::framed /ip4/0.0.0.0/tcp/30333 is not a websocket multiaddr
====================
Version: 2.0.0-rc6-c9fda53-x86_64-macos
0: backtrace::backtrace::trace
1: backtrace::capture::Backtrace::new
2: sp_panic_handler::set::{{closure}}
3: std::panicking::rust_panic_with_hook
4: _rust_begin_unwind
5: core::panicking::panic_fmt
6: core::option::expect_none_failed
7: hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector>::new
8: sc_offchain::api::http::SharedClient::new
9: sc_offchain::OffchainWorkers<Client,Storage,Block>::new
10: sc_service::builder::build_offchain_workers
11: node_template::service::new_full
12: sc_cli::runner::Runner<C>::run_node_until_exit
13: node_template::command::run
14: node_template::main
15: std::rt::lang_start::{{closure}}
16: std::rt::lang_start_internal
17: _main
Thread 'main' panicked at 'cannot access native cert store: Custom { kind: Other, error: Error { code: -25262, message: "The Trust Settings Record was corrupted." } }', /Users/rmp/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-rustls-0.21.0/src/connector.rs:46
I am using the pycorenlp client in order to talk to the Stanford CoreNLP Server. In my setup I am setting pipelineLanguage to german like this:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
text = 'Das große Auto.'
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json',
    'pipelineLanguage': 'german'
})
However, from the output it looks like it's not working:
output['sentences'][0]['tokens']
will return:
[{'after': ' ',
'before': '',
'characterOffsetBegin': 0,
'characterOffsetEnd': 3,
'index': 1,
'originalText': 'Das',
'pos': 'NN',
'word': 'Das'},
{'after': ' ',
'before': ' ',
'characterOffsetBegin': 4,
'characterOffsetEnd': 9,
'index': 2,
'originalText': 'große',
'pos': 'NN',
'word': 'große'},
{'after': '',
'before': ' ',
'characterOffsetBegin': 10,
'characterOffsetEnd': 14,
'index': 3,
'originalText': 'Auto',
'pos': 'NN',
'word': 'Auto'},
{'after': '',
'before': '',
'characterOffsetBegin': 14,
'characterOffsetEnd': 15,
'index': 4,
'originalText': '.',
'pos': '.',
'word': '.'}]
This should be more like
Das große Auto
POS: DT JJ NN
It seems to me that setting 'pipelineLanguage': 'de' does not work for some reason.
I've executed
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
in order to start the server.
I am getting the following from the logger:
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] ERROR CoreNLP - Failure to load language specific properties: StanfordCoreNLP-german.properties for german
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:60700] API call w/annotators tokenize,ssplit,pos,depparse,parse
Das große Auto.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 8.645 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [9.8 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.3 sec].
Apparently the server is loading the models for the English language - without warning me about that.
Alright, I just downloaded the models jar for German from the website and moved it into the directory where I extracted the server e.g.
~/Downloads/stanford-corenlp-full-2017-06-09 $
After re-running the server, the model was successfully loaded.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger ... done [5.1 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/UD_German.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99984, Elapsed Time: 11.419 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [12.2 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ... done [1.0 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz ... done [0.7 sec].
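With the German models on the classpath, re-running the original pycorenlp snippet should now pick up German tags. A quick check along these lines (a sketch; the STTS tags in the comment are what I would expect, not output copied from the question):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate('Das große Auto.', properties={
    'annotators': 'tokenize,ssplit,pos',
    'outputFormat': 'json',
    'pipelineLanguage': 'german'
})

# With the German tagger loaded, the POS values should come from the STTS
# tag set (roughly ART ADJA NN $.) instead of 'NN' for every token.
for token in output['sentences'][0]['tokens']:
    print('{} {}'.format(token['word'], token['pos']))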
I'm new to scraping with Scrapy and, unfortunately, I can't access the data through a request (meant to simulate the AJAX request the site makes).
I read other topics, but they didn't help me resolve my issue.
The website I would like to crawl is auchan.fr; it has a dynamic search box driven by Algolia.
Here is my spider for a Nutella query (hence the POST):
class AjaxspiderSpider(scrapy.Spider):
    name = "ajaxspider"
    allowed_domains = ["auchandirect.fr/recherche"]
    #start_urls = ['https://www.auchandirect.fr/recherche/']

    def start_requests(self):
        full_url = "/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225"
        yield FormRequest('https://tn96v7lrxc-dsn.algolia.net' + full_url, callback=self.parse, formdata={"params":"query=nutella&facets=%5B%22loopr_shelf%22%5D&hitsPerPage=50"})

    def parse(self, response):
        with open('data_content', 'w') as file:
            file.write(response.content)
And here is the log I got:
2017-02-03 15:14:34 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: ajax)
2017-02-03 15:14:34 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['ajax.spiders'], 'NEWSPIDER_MODULE': 'ajax.spiders', 'BOT_NAME': 'ajax'}
2017-02-03 15:14:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-02-03 15:14:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-03 15:14:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-03 15:14:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-02-03 15:14:34 [scrapy.core.engine] INFO: Spider opened
2017-02-03 15:14:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-03 15:14:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-03 15:14:35 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://tn96v7lrxc-dsn.algolia.net/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225> (referer: None)
2017-02-03 15:14:35 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://tn96v7lrxc-dsn.algolia.net/1/indexes/articles_article_11228/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4&x-algolia-application-id=TN96V7LRXC&x-algolia-api-key=46a121512cba9c452df318ffca231225>: HTTP status code is not handled or not allowed
2017-02-03 15:14:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-03 15:14:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 545,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 338,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 3, 14, 14, 35, 216807),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 2, 3, 14, 14, 34, 977436)}
2017-02-03 15:14:35 [scrapy.core.engine] INFO: Spider closed (finished)
Thank you for any piece of information.
This is not an Ajax-related but a site-specific issue: you are just passing the search-parameters string the wrong way, as formdata, while it should be passed as the raw body of the POST request. It should look like this:
yield Request('https://tn96v7lrxc-dsn.algolia.net' + full_url,
callback=self.parse, method='POST',
body='{"params":"query=nutella&facets=%5B%22loopr_shelf%22%5D&hitsPerPage=50"}')
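A fuller sketch of the same idea, with the JSON body built via json.dumps and the response decoded in the callback ('hits' is the usual Algolia result array; the product field name in the yielded item is only a guess):

import json
import scrapy

class AjaxspiderSpider(scrapy.Spider):
    name = "ajaxspider"

    def start_requests(self):
        full_url = ("/1/indexes/articles_article_11228/query"
                    "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.20.4"
                    "&x-algolia-application-id=TN96V7LRXC"
                    "&x-algolia-api-key=46a121512cba9c452df318ffca231225")
        payload = {"params": "query=nutella&facets=%5B%22loopr_shelf%22%5D&hitsPerPage=50"}
        yield scrapy.Request('https://tn96v7lrxc-dsn.algolia.net' + full_url,
                             method='POST',
                             body=json.dumps(payload),  # raw JSON body instead of formdata
                             headers={'Content-Type': 'application/json'},
                             callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)      # Algolia answers with a JSON document
        for hit in data.get('hits', []):      # 'hits' holds the matching records
            yield {'name': hit.get('name')}   # field name is a guess, inspect a hit first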
I'm running the CoreNLP dedicated server on AWS and trying to make a request from Ruby. The server seems to be receiving the request correctly, but it seems to ignore the input annotators list and always defaults to all annotators. My Ruby code to make the request looks like this:
uri = URI.parse(URI.encode('http://ec2-************.compute.amazonaws.com//?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}'))
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new("/v1.1/auth")
request.add_field('Content-Type', 'application/json')
request.body = text
response = http.request(request)
json = JSON.parse(response.body)
In the nohup.out logs on the server I see the following:
[/38.122.182.107:53507] API call w/annotators tokenize,ssplit,pos,depparse,lemma,ner,mention,coref,natlog,openie
....
INPUT TEXT BLOCK HERE
....
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.0 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
PreComputed 100000, Elapsed Time: 2.259 (s)
Initializing dependency parser done [5.1 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [7.2 sec].
[pool-1-thread-1] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 83 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 267 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Read 25 rules
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator mention
Using mention detector type: dependency
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
etc etc.
When I run test queries using wget on the command line it seems to work fine.
wget --post-data 'the quick brown fox jumped over the lazy dog' 'ec2-*******.compute.amazonaws.com/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}' -O -
Any help as to why this is happening would be appreciated, thanks!
It turns out the request was being constructed incorrectly. The path (including the properties query string) should be passed as the argument to Post.new. Corrected code below in case it helps anyone:
host = "http://ec2-***********.us-west-2.compute.amazonaws.com"
path = '/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}'
encoded_path = URI.encode(path)
uri = URI.parse(URI.encode(host))
http = Net::HTTP.new(uri.host, uri.port)
http.set_debug_output($stdout)
# request = Net::HTTP::Post.new("/v1.1/auth")
request = Net::HTTP::Post.new(encoded_path)
request.add_field('Content-Type', 'application/json')
request.body = text
response = http.request(request)
json = JSON.parse(response.body)
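For anyone doing the same thing from Python, a rough equivalent using the requests library (hypothetical, just to show that the properties JSON goes into the URL query string while the text goes into the raw POST body; the redacted host is the same placeholder as above):

import json
import requests

host = "http://ec2-***********.us-west-2.compute.amazonaws.com"   # placeholder host from the answer
props = {"tokenize.whitespace": "true",
         "annotators": "tokenize,ssplit,pos",
         "outputFormat": "json"}

# The properties travel in the query string; the text to annotate is the raw body.
resp = requests.post(host + "/",
                     params={"properties": json.dumps(props)},
                     data="the quick brown fox jumped over the lazy dog".encode("utf-8"))
print(resp.json())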
I'm trying to build a spider to grab images. The spider itself runs; it just... doesn't fetch any images and doesn't error out.
Spider:
from urlparse import urljoin
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider
from nasa.items import NasaItem
class NasaImagesSpider(BaseSpider):
    name = "nasa.gov"
    start_urls = ('http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml',)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        item = NasaItem()
        baseLink = xxs.select('//link/text()').extract()[0]
        imageLink = xxs.select('//tn/text()').extract()
        imgList = []
        for img in imageLink:
            imgList.append(urljoin(baseLink, img))
        item['image_urls'] = imgList
        return item
It runs through the page and captures the URLs correctly. I pass them down the pipeline, but... no pics.
The settings file:
BOT_NAME = 'nasa.gov'
BOT_VERSION = '1.0'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGE_STORE = '/home/usr1/Scrapy/spiders/nasa/images'
LOG_LEVEL = "DEBUG"
SPIDER_MODULES = ['nasa.spiders']
NEWSPIDER_MODULE = 'nasa.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
and the items file:
from scrapy.item import Item, Field
class NasaItem(Item):
    image_urls = Field()
    images = Field()
and the output log:
2012-11-12 07:47:28-0500 [scrapy] INFO: Scrapy 0.14.4 started (bot: nasa)
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Enabled item pipelines:
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Spider opened
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-12 07:47:29-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Crawled (200) <GET http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml> (referer: None)
2012-11-12 07:47:29-0500 [nasa.gov] DEBUG: Scraped from <200 http://www.nasa.gov/multimedia/imagegallery/iotdxml.xml>
#removed output of every jpg link
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Closing spider (finished)
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Dumping spider stats:
{'downloader/request_bytes': 227,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2526,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 802477),
'item_scraped_count': 1,
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 11, 12, 12, 47, 29, 682005)}
2012-11-12 07:47:29-0500 [nasa.gov] INFO: Spider closed (finished)
2012-11-12 07:47:29-0500 [scrapy] INFO: Dumping global stats:
{'memusage/max': 104132608, 'memusage/startup': 104132608}
I'm stuck. Any suggestions as to what I'm doing wrong?
[EDITED] Added output log, changed settings bot name.
#pipeline file
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class PaulsmithPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item["image_paths"] = image_paths
        return item
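As a reference point (not a confirmed fix), here is a sketch of the settings block Scrapy's images pipeline expects. The documented setting name is IMAGES_STORE (plural), whereas the settings above use IMAGE_STORE, and the images pipeline stays disabled when its store setting is missing, which would match the empty "Enabled item pipelines:" line in the log:

# settings.py sketch, assuming the Scrapy 0.14-era module paths used in the question
BOT_NAME = 'nasa.gov'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['nasa.spiders']
NEWSPIDER_MODULE = 'nasa.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

# Either the stock pipeline or your subclass, e.g. 'nasa.pipelines.PaulsmithPipeline'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']

# Note the plural: IMAGES_STORE, not IMAGE_STORE
IMAGES_STORE = '/home/usr1/Scrapy/spiders/nasa/images'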