I'm working with Scrapy. I want to rotate proxies on a per-request basis and get a proxy from an API I have that returns a single proxy. My plan is to make a request to the API, get a proxy, then use it to set the proxy based on:
http://stackoverflow.com/questions/4710483/scrapy-and-proxies
where I would assign it using:
request.meta['proxy'] = 'your.proxy.address'
I have the following:
class ContactSpider(Spider):
    name = "contact"

    def parse(self, response):
        for i in range(1, 3, 1):
            PR = Request('http://myproxyapi.com', headers=self.headers)
            newrequest = Request('http://sitetoscrape.com', headers=self.headers)
            newrequest.meta['proxy'] = PR
but I'm not sure how to use the Scrapy Request object to perform the API call. I'm not getting a response to the PR request while debugging. Do I need to do this in a separate function and use a yield statement, or is my approach wrong?
Do I need to do this in a separate function and use a yield statement or is my approach wrong?
Yes. Scrapy uses a callback model. You would need to:
1. Yield the PR objects back to the Scrapy engine.
2. Parse the response of PR, and in its callback, yield newrequest.
A quick example:
def parse(self, response):
    for i in range(1, 3, 1):
        PR = Request(
            'http://myproxyapi.com',
            headers=self.headers,
            meta={'newrequest': Request('http://sitetoscrape.com', headers=self.headers)},
            callback=self.parse_PR
        )
        yield PR

def parse_PR(self, response):
    newrequest = response.meta['newrequest']
    proxy_data = get_data_from_response(response)
    newrequest.meta['proxy'] = proxy_data
    yield newrequest
See also: http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
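Note that get_data_from_response above is just a placeholder for however your proxy API formats its reply. A minimal sketch, assuming the API returns a JSON body with a hypothetical proxy key, might be:

import json

def get_data_from_response(response):
    # Hypothetical helper: assumes the proxy API returns JSON such as
    # {"proxy": "http://1.2.3.4:8080"}. Adapt the key and parsing to
    # whatever your API actually returns.
    data = json.loads(response.text)
    return data['proxy']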
I'm trying to make a GET API call through DRF.
My URL looks like this:
http://127.0.0.1:8000/patha/pathb/pathc/pathd/cd7701b?name=IT&size=20&workflow=rv
But it's producing an error:
if response.get('X-Frame-Options') is not None:
AttributeError: 'NoneType' object has no attribute 'get'
but when I'm just sending two query params:
http://127.0.0.1:8000/patha/pathb/pathc/pathd/cd007b?name=IT&size=20
or any combination of two of those three params, the request reaches the view method.
My view method looks like this:
class MyRtList(generics.ListAPIView, customMixin):
    ...
    # Here I'd like to use all three inputs:
    # cd007b
    # `name` and `size`
And urls.py:
url(r'patha/pathb/pathc/(?P<name>[^/]+)?$',views.MyRtList.as_view()),
I was going through the docs; the example shown uses param1 and param2.
Is there a maximum of two query params?
Is it forbidden to mix data like cd007b (a URL parameter) and query params?
It gets stuck with 3 params but not with 2 params.
Where is it going wrong?
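For what it's worth, DRF itself places no limit on the number of query params, and mixing a URL kwarg with query params is fine. A minimal sketch of reading all three inputs in a ListAPIView, with hypothetical model and field names, might look like this:

from rest_framework import generics

class MyRtList(generics.ListAPIView):
    serializer_class = MySerializer  # hypothetical serializer

    def get_queryset(self):
        # URL kwarg captured by the URL conf (the cd007b part)
        record_id = self.kwargs.get('name')
        # Query string parameters: ?name=IT&size=20&workflow=rv
        name = self.request.query_params.get('name')
        size = self.request.query_params.get('size')
        workflow = self.request.query_params.get('workflow')
        # Filtering below is illustrative; adapt the field names to your model.
        qs = MyModel.objects.all()  # hypothetical model
        if workflow:
            qs = qs.filter(workflow=workflow)
        return qs

The AttributeError itself comes from middleware calling .get() on whatever the view returned, so it is also worth checking that every code path (including the customMixin) actually returns a Response.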
This was the solution in my case:
If you forgot to add render to your view in views.py, then make sure to add render.
views.py
Before:
from django.shortcuts import render

# Create your views here.
def test1(request):
    kontext = {}
    # This will give an error
    return (request, 'appName/test1.html', kontext)
After:
from django.shortcuts import render

# Create your views here.
def test1(request):
    kontext = {}
    return render(request, 'appName/test1.html', kontext)
I have a REST api url endpoint that represents a Song within an Album:
/api/album/(?P<album_id>)/song/(?P<id>)/
and I want to refer to it from another resource, e.g. Chart that contains Top-1000 songs ever. Here's an implementation of ChartSerializer:
class ChartSerializer(HyperlinkedModelSerializer):
    songs = HyperlinkedRelatedField(
        queryset=Song.objects.all(),
        view_name='api:song-detail',
        lookup_field='id'
    )

    class Meta:
        model = Chart
        fields = ('songs', )
Clearly, I can pass id as lookup_field, but it seems to me that I won't be able to pass album_id by any means. I'm looking into HyperlinkedModelSerializer.get_url() method:
def get_url(self, obj, view_name, request, format):
    """
    Given an object, return the URL that hyperlinks to the object.

    May raise a `NoReverseMatch` if the `view_name` and `lookup_field`
    attributes are not configured to correctly match the URL conf.
    """
    # Unsaved objects will not yet have a valid URL.
    if hasattr(obj, 'pk') and obj.pk in (None, ''):
        return None

    lookup_value = getattr(obj, self.lookup_field)
    kwargs = {self.lookup_url_kwarg: lookup_value}
    return self.reverse(view_name, kwargs=kwargs, request=request, format=format)
As you can see, it constructs the kwargs for the reverse URL lookup from scratch and doesn't allow passing additional parameters to it. Am I right that this is not supported?
UPDATE:
Found a reference to this problem in the issue list of DRF: https://github.com/tomchristie/django-rest-framework/issues/3204
So, the answer is YES. There is even a paragraph about this issue in the DRF documentation:
http://www.django-rest-framework.org/api-guide/relations/#custom-hyperlinked-fields
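Following the pattern described there, a rough sketch of a custom field that supplies both URL kwargs could look like this (it assumes Song has an album foreign key, so album_id is available on the instance):

from rest_framework import serializers
from rest_framework.reverse import reverse

class SongHyperlinkedField(serializers.HyperlinkedRelatedField):
    view_name = 'api:song-detail'
    queryset = Song.objects.all()

    def get_url(self, obj, view_name, request, format):
        # Build both kwargs instead of the single lookup_field DRF uses by default.
        url_kwargs = {
            'album_id': obj.album_id,  # assumes Song has an album FK
            'id': obj.pk,
        }
        return reverse(view_name, kwargs=url_kwargs, request=request, format=format)


class ChartSerializer(serializers.HyperlinkedModelSerializer):
    songs = SongHyperlinkedField(many=True)

    class Meta:
        model = Chart
        fields = ('songs', )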
I have been working on this for the past few hours, but cannot figure out what I'm doing wrong. When I run my XPath statements using the selector in the Scrapy shell, they work as expected. When I try to use the same statement in my spider, however, I get back an empty set. Does anyone know what I am doing wrong?
from scrapy.spider import Spider
from scrapy.selector import Selector

from TFFRS.items import Result

class AthleteSpider(Spider):
    name = "athspider"
    allowed_domains = ["www.tffrs.org"]
    start_urls = ["http://www.tffrs.org/athletes/3237431/",]

    def parse(self, response):
        sel = Selector(response)
        results = sel.xpath("//table[@id='results_data']/tr")
        items = []
        for r in results:
            item = Result()
            item['event'] = r.xpath("td[@class='event']").extract()
            items.append(item)
        return items
When viewed by the spider, your URL contains no content. To debug this kind of problem you should use scrapy.shell.inspect_response in the parse method, like so:
from scrapy.shell import inspect_response

class AthleteSpider(Spider):
    # all your code

    def parse(self, response):
        inspect_response(response, self)
then when you do
scrapy crawl <your spider>
you will get a shell from within your spider. There you should do:
In [1]: view(response)
This will display this particular response as it looks for this particular spider.
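From that same shell you can also run the selector directly against what the spider actually received, for example (assuming the intended table id):

In [2]: response.xpath("//table[@id='results_data']/tr")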
Try using HtmlXPathSelector for extracting xpaths.
Remove http from the start_urls section. Also the table id is something you are not entering correctly in your xpath. Try using inspect element to get a proper xpath for the data you want to scrape.
Also consider changing the function name; from the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
Scrapy spiders must implement specific methods; examples are parse and start_requests, but there are others in the docs.
So if you don't implement these methods correctly, you will have problems. In my case the problem was that I had a typo and my function name was start_request instead of start_requests!
So make sure your skeleton is something like this:
class MySpider(scrapy.Spider):
    name = "name"
    allowed_domains = ["example.com"]
    start_urls = ['https://example.com/']

    def start_requests(self):
        # start_requests method
        ...

    def parse(self, response):
        # parse method
        ...
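As a minimal sketch of what those two methods might contain (the URL and the extraction logic below are placeholders, not part of the original answer):

import scrapy

class MySpider(scrapy.Spider):
    name = "name"
    allowed_domains = ["example.com"]
    start_urls = ['https://example.com/']

    def start_requests(self):
        # Yield one Request per start URL; each response is passed to self.parse.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder extraction: yield the page title.
        yield {'title': response.xpath('//title/text()').get()}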
I am building an API wrapper and am writing some tests for it and I have a couple of questions.
1) How do I write an assert for calls where data doesn't exist? For example, looking up a member by id using the API but the user won't exist yet.
2) How do I write an assert for testing PUT and DELETE requests?
I already have a grasp on testing GET and POST requests, just not sure about the other 2 verbs.
For your question part 1...
You have a couple choices for data that doesn't exist:
You can create the data ahead of time, for example by using a test seed file, or a fixture, or a factory. I like this choice for larger projects with more sophisticated data arrangements. I also like this choice for getting things working first because it's more straightforward to see the data.
You can create a test double, such as a stub method or fake object. I like this choice for fastest test performance and best isolation. For fastest tests, I intercept calls as early as possible. The tradeoff is that I'm not doing end-to-end testing.
For your question part 2...
You should edit your question to show your actual code; this will help people here answer you.
Is your VCR code something like this?
VCR.use_cassette('test_unit_example') do
  response = Net::HTTP.get_response('localhost', '/', 7777)
  assert_equal "Hello", response.body
end
If so, you change the HTTP GET to PUT, something like this:
uri = URI.parse(...whatever you want...)
json = "...whatever you want..."
req = Net::HTTP::Put.new(uri)
req["content-type"] = "application/json"
req.body = json
res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
Same for HTTP delete:
Net::HTTP::Delete.new(uri)
A good blog post is the Net::HTTP cheat sheet (http://www.rubyinside.com/nethttp-cheat-sheet-2940.html), excerpted here:
# Basic REST.
# Most REST APIs will set semantic values in response.body and response.code.
require "net/http"
http = Net::HTTP.new("api.restsite.com")
request = Net::HTTP::Post.new("/users")
request.set_form_data({"users[login]" => "quentin"})
response = http.request(request)
# Use nokogiri, hpricot, etc to parse response.body.
request = Net::HTTP::Get.new("/users/1")
response = http.request(request)
# As with POST, the data is in response.body.
request = Net::HTTP::Put.new("/users/1")
request.set_form_data({"users[login]" => "changed"})
response = http.request(request)
request = Net::HTTP::Delete.new("/users/1")
response = http.request(request)
I need some help in doing this: I have to build the following URL in order to perform a query against an Apache Solr instance:
http://localhost:8080/solr/select?q=*%3A*&fq=deal_discount%3A[20+TO+*]&fq=deal_price%3A[*+TO+100]&fq={!geofilt+pt%3D45.6574%2C9.9627+sfield%3Dlocation_latlng+d%3D600}
As you can see, the URL contains the parameter named "fq" three times. What I'm wondering is how to pass "fq" three times within the Hash that is the second argument of Net::HTTP.post_form() when building the URI with URI.parse().
Here's a simple snippet:
path = 'http://localhost:8080/solr/select'
pars = { 'fq' => 'deal_price [* TO 100]', 'fq' => '{!geofilt pt=45.6574,9.9627 sfield=location_latlng d=600}' } # This is obviously wrong!
res = Net::HTTP::post_form( URI.parse(path), pars )
The solution would be passing the full URL as a String, but I cannot find a method that provides this kind of signature.
Could you please post a simple solution to my problem? Thanks in Advance.
Thanks for your help. Yes, you're right... a GET method was what I needed. Anyway, I had to make a little change to your code because Net::HTTP.get() threw an exception "Unknown method hostname":
uri = URI(solrUrl)
req = Net::HTTP::Get.new(uri.request_uri)
res = Net::HTTP.start(uri.hostname, uri.port) {|http|
  http.request(req)
}
This solved my problem. Thanks indeed.
Your URL suggests that you should use HTTP GET to query Solr, while your snippet uses POST, so that is one thing to change. But I think your main problem is with the parameters: a Hash may only contain one entry per key, so you can't use a Hash in this case. One easy way is to construct the URL by hand.
params_array = ['deal_price [* TO 100]',
                '{!geofilt pt=45.6574,9.9627 sfield=location_latlng d=600}']
base_url = "http://localhost:8080/solr/select"
query_string = "?fq=#{params_array.join('&fq=')}"
url = base_url + query_string
result = Net::HTTP.get url
A bit compact maybe - a more readable version may be (according to taste):
params_array = ['deal_price [* TO 100]',
                '{!geofilt pt=45.6574,9.9627 sfield=location_latlng d=600}']
url = "http://localhost:8080/solr/select"
params_array.each_with_index do |param, i|
  url << (i.zero? ? "?" : "&") << "fq=#{param}"
end
result = Net::HTTP.get url