Newbie here, running scrapy in windows. How to avoid showing the extracted links and crawled items in the command window? I found comments in the "parse" section on this linkhttp://doc.scrapy.org/en/latest/topics/commands.html, not sure if it's relevant and how to apply it if so. Here is more detail with part of the code, starting from my second Ajax request (In the first Ajax request, the callback function is "first_json_response":
def first_json_response(self, response):
try:
data = json.loads(response.body)
meta = {'results': data['results']}
yield Request(url=url, callback=self.second_json_response,headers={'x-requested-with': 'XMLHttpRequest'}, meta = meta)
def second_json_response(self, response):
meta = response.meta
try:
data2 = json.loads(response.body)
...
The "second_json_response" is to retrieve the response from the requested result in first_json_response, as well as to load the new requested data. "meta" and "data" are then both used to define items that need to be crawled. Currently, the meta and links are shown in the windows terminal where I submitted my code. I guess it is taking up some extra time for computer to show them on the screen, and thus want them to disappear. I hope by running scrapy on a kinda-of batch mode will speed up my lengthy crawling process.
Thanks! I really appreciate your comment and suggestion!
From scrapy documentation:
"You can set the log level using the –loglevel/-L command line option, or using the LOG_LEVEL setting."
So append to your scray crawl etc command -loglevel='ERROR' . That should make all the info disappear from your command line, but I don't think this will speed things much.
In your pipelines.py file, try using something like:
import json
class JsonWriterPipeline(object):
def __init__(self):
self.file = open('items.jl', 'wb')
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
This way, when you yield an item from your spider class, it will print it out to items.jl.
Hope that helps.
Related
I am actually trying to parse a website using the requests module, and extract some text out of it.
Url : https://www.icsi.in/student/Members/MemberSearch.aspx
after hitting the url in the Cp Number text field input : 16803
hit search,
on the bottom you can see some data, I want that data, let's say a name.
I am successfully able to get the data using selenium, but can't able to get it using requests module.
I have tried the requests module giving parameters, sessions, cookies etc.
but nothing worked.
url = "https://www.icsi.in/student/Members/MemberSearch.aspx"
ss = {'dnn$ctr410$MemberSearch$txtCpNumber':'16803',
'__EVENTTARGET':'dnn$ctr410$MemberSearch$btnSearch',
'__VIEWSTATEGENERATOR':'6A295697',
'dnn$ctlHeader$dnnSearch$Search':'SiteRadioButton'}
session = requests.Session()
cookies = session.cookies.get_dict()
for cookie in cookies:
session.cookies.set(cookie['name'], cookie['value'])
response = requests.post(url, data=ss)
print(response)
HTMLTree = html.fromstring(response.content)
name = HTMLTree.xpath('//div[#class="name_head"]//text()')
print(name)
I expect the output of the name of the person.
Anyone out there please help me.
If you don't mind using C# code I would be more than happy to help you otherwise it's a very lengthy process. If you choose that python is the only road you're willing to take then you should try grabbing the encrypted value within C:\User[USERNAME]\Appdata\Local\Google\Chrome\User Data\Default\Cookies You can change the file path accordingly to your OS. You can use SQLite to read and modify the encrypted values.
cookie = Decrypt(Encoding.Default.GetBytes(SQLDatabase1.GetValue(i, "encrypted_value")
if (cookie.Contains(".ASPXANONYMOUS")):
Step1 = cookie + "END"
Step2 = (step1 + ".ASPXANONYMOUS")
The following code above may help you with your journey.
I am using prawnpdf/pdf-inspector to test that content of a PDF generated in my Rails app is correct.
I would want to check that the PDF file contains a link with certain URL. I looked at yob/pdf-reader but haven't found any useful information related to this topic
Is it possible to test URLs within PDF with Ruby/RSpec?
I would want the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
The https://github.com/yob/pdf-reader contains a method for each page called text.
Do something like
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is at the first page
Since pdf-inspector seems only to return text, you could try to use the pdf-reader directly (pdf-inspector uses it anyways).
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
puts page.raw_content # This should also give you the link
end
Anyway I only did a quick look at the github page. I am not sure what raw_content exactly returns. But there is also a low-level method to directly access the objects of the pdf:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it surely is possible to get the url.
I am trying to get the transcription info from some Khan Academy videos using scrapy.
For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number
When I Tried to select the Transcript button through xpath response.xpath('//div[contains(#role, "tablist")]/a').extract() I only got the information about the tab has the aria-selected="true" which is the About section. I would need to use scrapy to change the aria-selected from false to true in the Transcript button and then retrieve the necessary information.
Could anyone please clarify how I would be able to accomplish this?
Much appreciated !
If you take a look at your network inspect you can see that an AJAX request is being made to retrieve the transcript once the page loads:
In this case it's https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en
It seems to use youtube video url id to create this api url. So you can recreate it really easily:
import json
import scrapy
class MySpider(scrapy.Spider):
#...
transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'
def parse(self, response):
# find youtube id
youtube_id = response.xpath("//meta[#property='og:video']/#content").re_first('v/(.+)')
# create transcript API url using the youtube id
url = self.transcript_url_template.format(youtube_id)
# download the data and parse it
yield Request(url, self.parse_transript)
def parse_transcript(self, response):
# convert json data to python dictionary
data = json.loads(response.body)
# parse your data!
I can successfully upload a single file using a Mechanize form like this:
def add_attachment(form, attachments)
attachments.each_with_index do |attachment, i|
form.file_uploads.first.file_name = attachment[:path]
end
end
where form is a mechanize form. But if attachments has more than one element, the last one overwrites the previous ones. This is obviously because I'm using the first accessor which always returns the same element of the file_uploads array.
To fix this, I tried this, which results an error, because there is only one element in this array.
def add_attachment(form, attachments)
attachments.each_with_index do |attachment, i|
form.file_uploads[i].file_name = attachment[:path]
end
end
If I try to create a new file_upload object, it also doesn't work:
def add_attachment(form, attachments)
attachments.each_with_index do |attachment, i|
form.file_uploads[i] ||= Mechanize::Form::FileUpload.new(form, attachment[:path])
form.file_uploads[i].file_name = attachment[:path]
end
end
Any idea how I can upload multiple files using Mechanize?
So, I solved this issue, but not exactly how I imagined it would work out.
The site I was trying to upload files to was a Redmine project. Redmine is using JQueryUI for the file uploader, which confused me, since Mechanize doesn't use Javascipt. But, it turns out that Redmine degrades nicely if Javascript is disabled and I could take advantage of this.
When Javascript is disabled, only one file at time can be uploaded in the edit form, but going to the 'edit' url for the issue that was just created gives the chance to upload a second file. My solution was to simply attach a file, upload the form and then click the 'Update' link on the resulting page, which presented a page with a new form and another upload field, which I could then use to attach the next file to. I did this for all attachments but the last, so that the form processing could be completed and then uploaded for a final time. Here is the relavant bit of code:
def add_attachment(agent,form, attachments)
attachments.each_with_index do |attachment, i|
form.file_uploads.first.file_name = attachment[:path]
if i < attachments.length - 1
submit_form(agent, form)
agent.page.links_with(text: 'Update').first.click
form = get_form(agent)
end
end
form
end
I used the following
form.file_uploads[0].file_name = "path to the first file that to be uploaded" form.file_uploads[1].file_name = "path to the second file that to be uploaded" form.file_uploads[2].file_name = "path to the third file that to be uploaded".
and worked fine. Hope this helps.
I am scraping a sequence of urls. The code is working but scrapy is not parsing the urls in sequential order. E.g. Although I am trying to parse url1, url2,...,url100, scrapy parses url2, url10,url1...etc.
It parses all the urls but when a specific url does not exist (e.g example.com/unit.aspx?b_id=10) Firefox shows me the result of my previous request. As I want to make sure that I don´t have duplicates, I need to ensure that the loop is parsing the urls sequentially and not "at will".
I tried "for n in range(1,101) and also a "while bID<100" the result is the same. (see below)
thanks in advance!
def check_login_response(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if "Welcome!" in response.body:
self.log("Successfully logged in. Let's start crawling!")
print "Successfully logged in. Let's start crawling!"
# Now the crawling can begin..
self.initialized()
bID=0
#for n in range(1,100,1):
while bID<100:
bID=bID+1
startURL='https://www.example.com/units.aspx?b_id=%d' % (bID)
request=Request(url=startURL ,dont_filter=True,callback=self.parse_add_tables,meta={'bID':bID,'metaItems':[]})
# print self.metabID
yield request #Request(url=startURL ,dont_filter=True,callback=self.parse2)
else:
self.log("Something went wrong, we couldn't log in....Bad times :(")
# Something went wrong, we couldn't log in, so nothing happens.
You could try something like this. I'm not sure if it's fit for purpose on the basis that I haven't seen the rest of the spider code but here you go:
# create a list of urls to be parsed, in reverse order (so we can easily pop items off)
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(99,1,-1)]
def check_login_response(self, response):
"""Check the response returned by a login request to see if we are successfully logged in.
"""
if "Welcome!" in response.body:
self.log("Successfully logged in. Let's start crawling!")
print "Successfully logged in. Let's start crawling!"
# Now the crawling can begin..
self.initialized()
return Request(url='https://www.example.com/units.aspx?b_id=1',dont_filter=True,callback=self.parse_add_tables,meta={'bID':1,'metaItems':[]})
else:
self.log("Something went wrong, we couldn't log in....Bad times :(")
# Something went wrong, we couldn't log in, so nothing happens.
def parse_add_tables(self, response):
# parsing code here
if self.crawl_urls:
next_url = self.crawl_urls.pop()
return Request(url=next_url,dont_filter=True,callback=self.parse_add_tables,meta={'bID':int(next_url[-1:]),'metaItems':[]})
return items
You can use use priority attribute in Request object. Scrapy guarantees the urls are crawled in DFO by default. But it does not ensure that the urls are visited in the order they were yielded within your parse callback.
Instead of yielding Request objects you want to return an array of Requests from which objects will be popped till it is empty.
For more info you can see here
Scrapy Crawl URLs in Order