I am implementing a Scrapy spider to crawl a website that contains real estate offers. Each offer page has the real estate agent's telephone number, which can be retrieved with an AJAX POST request. The request yielded by Scrapy returns an error from the server, while the same request sent from Postman returns the desired data.
Here's the site URL: https://www.otodom.pl/oferta/piekne-mieszkanie-na-mokotowie-do-wynajecia-ID3ezHA.html
I recorded the request using the Network tab in Chrome's dev tools. The URL of the AJAX request is: https://www.otodom.pl/ajax/misc/contact/phone/3ezHA/ The only data needed to send the request is the CSRF token contained in the page's source, which changes periodically. In Postman, sending only the CSRFToken as form-data gives the expected answer.
This is how I construct the request in scrapy:
token_input = response.xpath('//script[contains(./text(), "csrf")]/text()').extract_first()
csrf_token = token_input[23:-4]
offerID_input = response.xpath('//link[@rel="canonical"]/@href').extract_first()
offerID = (offerID_input[:-5])[-7:]
form_data = {'CSRFToken' : csrf_token}
request_to_send = scrapy.Request(url='https://www.otodom.pl/ajax/misc/contact/phone/3ezHA/', headers = {"Content-Type" : "application/x-www-form-urlencoded"}, method="POST", body=urllib.urlencode(form_data), callback = self.get_phone)
yield request_to_send
Unfortunately, I get an error, though everything looks correct. Does anybody have any idea what the problem might be? Is it maybe connected with encoding? The site uses UTF-8.
You can find the token in page source:
<script type="text/javascript">
var csrfToken = '0ec80a520930fb2006e4a3e5a4beb9f7e0d6f0de264d15f9c87b572a9b33df0a';
</script>
And you can get it quite easily with this regular expression:
re.findall("csrfToken = '(.+?)'", response.text)
To get the whole thing you can use scrapy's FormRequest which can make a correct post request for you:
import re
from scrapy import FormRequest

def parse(self, response):
    # response.text is the decoded page; the token sits in an inline <script>
    token = re.findall("csrfToken = '(.+?)'", response.text)[0]
    yield FormRequest('https://www.otodom.pl/ajax/misc/contact/phone/3ezHA/',
                      formdata={'CSRFToken': token},
                      callback=self.parse_phone)

def parse_phone(self, response):
    print(response.body)
    # '{"value":"515 174 616"}'
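The token extraction and form encoding can also be checked outside Scrapy. The following is a minimal standard-library sketch, not the site's or Scrapy's own code: `extract_csrf_token` and `build_form_body` are hypothetical helper names, and the sample HTML simply mirrors the snippet shown above.

```python
import re
from urllib.parse import urlencode

# Matches the inline script shown above: var csrfToken = '<hex string>';
CSRF_PATTERN = re.compile(r"csrfToken\s*=\s*'(.+?)'")


def extract_csrf_token(html):
    """Return the CSRF token embedded in the page source, or None if absent."""
    match = CSRF_PATTERN.search(html)
    return match.group(1) if match else None


def build_form_body(token):
    """URL-encode the single form field that the phone endpoint expects."""
    return urlencode({"CSRFToken": token})


sample_html = """<script type="text/javascript">
var csrfToken = '0ec80a520930fb2006e4a3e5a4beb9f7e0d6f0de264d15f9c87b572a9b33df0a';
</script>"""

token = extract_csrf_token(sample_html)
body = build_form_body(token)
```

Returning None instead of indexing `[0]` lets a spider skip a page whose markup has changed rather than fail with an IndexError.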
You can debug your Scrapy requests by inserting an inspect_response call and looking at the request object:
def parse_phone(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
    # the shell opens here and the spider is paused
    # now check `request.body` and `request.headers`, and match them to what you see in your browser
Related
I am trying to scrape some AJAX content from http://lsgelection.kerala.gov.in/lbtrend2015/views/lnkResultsGrama.php
Once the two dropdowns are selected and submitted, the Chrome Network tab shows:
Request URL: http://lsgelection.kerala.gov.in/lbtrend2015/includes/detailed_results_grama_ajax.php
Request Method: POST
FormData
token: 9fd54c089d36035c3ce2b5cf08f38982
process: getGramaWonCandData
cno: 46
districtCode: D02001
Panchayat: G02069
I tried to scrape the data from the Scrapy shell together with Splash, since cno comes from JS:
scrapy shell 'http://localhost:8050/render.html?url=http://lsgelection.kerala.gov.in/lbtrend2015/views/lnkResultsGrama.php'
token = response.xpath('//input[@id="token"]/@value').extract_first()
cno = response.xpath('//input[@id="cno"]/@value').extract_first()
Then I tried to fetch the form response using:
fetch(scrapy.FormRequest.from_response(response,url='http://lsgelection.kerala.gov.in/lbtrend2015/includes/detailed_results_grama_ajax.php',method='POST',formdata={'token':token,'process': 'getGramaWonCandData','cno':cno,'districtCode': 'D02001','Panchayat': 'G02069'},headers={'Content-Type': 'json/...'}))
When I tried to get response.text or response.body, it returned b'\n\n\n\n\n\n\n'
Where have I gone wrong?
I am trying to configure the AjaxAppender of log4javascript in Django. I have created a file, frontendlog.json, where I want to write the logs coming from the front end. This is the script in myPage.html:
<script type="text/javascript" src="/static/js/log4javascript.js"></script>
<script language="javascript">
var url = '/frontEndLog/';
var log = log4javascript.getLogger("serverlog");
var ajaxAppender = new log4javascript.AjaxAppender(url);
ajaxAppender.addHeader("Content-Type", "application/json");
var jsonLayout = new log4javascript.JsonLayout();
ajaxAppender.setLayout(jsonLayout);
log.addAppender(ajaxAppender);
window.onerror = function(errorMsg, url, lineNumber){
    log.fatal("Uncaught error "+errorMsg+" in "+url+", line "+lineNumber);
};
log.info("Front End Log");
alert('!!')
</script>
In my Django urls.py I have this entry:
url(r'^frontEndLog/$', 'TryOn.views.frontEndLog'),
and in my Django views I have this view function:
def frontEndLog(request):
    LOGGER.info("frontEndLog")
    return render_to_response('frontEndLog.json', mimetype="text/json")
So I expected the front-end log to be written to frontEndLog.json, in the same location as the other HTML templates in Django. However, it tells me that the XMLHttpRequest request to the URL returned status code 500. Can somebody please tell me where I am going wrong, and whether this is the correct way to use log4javascript in Django?
I solved it. I printed the Django request object in views.py, and there I found the log messages in request.POST. They appear in the form of a dictionary, since the payload is JSON-ified. You can access the logs with this:
clientLogs = request.POST.get('data')
'data' is the key in the key : value pair here. (You can easily see that when you look at the POST object.)
Whether you want to print the logs in views.py itself or write them to a txt file is up to you. So all this while the logs were actually being sent, and I just was not able to find them! I guess I should have read the documentation better.
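The same lookup can be reproduced outside Django to see what such a request body looks like once decoded. This is only a sketch under assumptions: the body is taken to be form-encoded with a single `data` field, and the sample JSON below is a made-up payload, not captured output from log4javascript. `extract_client_logs` is a hypothetical helper.

```python
import json
from urllib.parse import parse_qs, urlencode


def extract_client_logs(raw_body):
    """Decode a form-encoded body and parse the JSON stored under 'data'."""
    fields = parse_qs(raw_body)
    # parse_qs maps each key to a list of values; take the first 'data' entry
    payload = fields.get("data", ["[]"])[0]
    return json.loads(payload)


# Hypothetical payload: the field names are assumptions for illustration only
sample_entries = [{"level": "INFO", "message": "Front End Log"}]
raw_body = urlencode({"data": json.dumps(sample_entries)})

logs = extract_client_logs(raw_body)
for entry in logs:
    print(entry["level"], entry["message"])
```

In a Django view the equivalent would be `json.loads(request.POST.get('data'))`, after which each entry can be handed to a server-side logger or written to a file.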
I can do GET requests, but when I do a POST, Chrome developer tools show: "Failed to load resource: the server responded with a status of 500 (INTERNAL SERVER ERROR)"
I thought the problem was Django's csrf_token, so I found this solution:
.config(function($httpProvider){
    $httpProvider.defaults.headers.common['X-CSRFToken'] = CSRF_TOKEN;
});
In my index.html, in <head> I have:
<script>
CSRF_TOKEN = '{{ csrf_token }}';
</script>
But it still raises a 500 error. Am I doing something wrong, or is the problem not related to CSRF?
P.S. CSRF_TOKEN is declared before
<script src="{{ STATIC_URL }}lib/angular/angular.js"></script>
and other scripts.
I've figured out the problem.
Django by default appends a slash to your URL. If you enter http://mydjangosite.com/page, Django will redirect you to http://mydjangosite.com/page/.
Angular's $resource removes the trailing slash (you can read about it on GitHub: https://github.com/angular/angular.js/issues/992).
Django has the APPEND_SLASH setting, which uses an HTTP redirect (a 301 by default) to append a slash to URLs that lack one. This works with the GET method but not with others (POST, PUT, DELETE), because the redirect cannot and will not pass the data on to the new URL.
So, there are two options:
1) Use $http instead of $resource
or
2) In Django's settings.py add this line:
APPEND_SLASH = False
and in your urls.py remove all trailing slashes
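Either way, the client and urls.py have to agree on the exact path so that no redirect happens on a POST. As a rough illustration for a Python client or test script (not for the Angular code itself), the hypothetical helper `ensure_trailing_slash` below normalizes a URL before the request is sent. It assumes the Django URL patterns end with a slash.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_trailing_slash(url):
    """Return url with a trailing slash on its path, keeping query and fragment."""
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


fixed = ensure_trailing_slash("http://mydjangosite.com/page")
with_query = ensure_trailing_slash("http://mydjangosite.com/page?x=1")
unchanged = ensure_trailing_slash("http://mydjangosite.com/page/")
```

Only the path component is touched, so a query string stays where it belongs instead of ending up in front of the added slash.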
Alternatively, simply escape the trailing slash with a backslash, like: /custom_api/get_nearest_hotels/:eventId\/
(From: http://pragmaticstartup.wordpress.com/2013/04/27/some-lessons-learnt-from-messing-with-django-and-angularjs/)
As you know, you need to dump a dictionary into your HttpResponse object.
Sometimes, in your view, you try to put something into your dict that cannot be serialized; that is, Python/Django cannot serialize that object to JSON.
Examples are a form object, a model object, etc.,
so you need to keep those out.
context = {}
context['office_form'] = OfficeCompleteForm(request.POST)
This cannot be serialized, and you will get a 500 error.
Feel free to add data like the following:
context['success'] = {"msg": "successfully updated. "}
context['error'] = {"msg": "error can not update. "}
Finally, do not forget to return your response like this:
return HttpResponse(json.dumps(context), content_type="application/json")
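The failure is easy to reproduce without Django. In this minimal sketch, `OfficeForm` is a stand-in class (an assumption, not the real form) and `json_safe` is a hypothetical helper that drops any entry `json.dumps` cannot handle before the response is built.

```python
import json


class OfficeForm:
    """Stand-in for a form or model instance; json cannot serialize it."""


def json_safe(context):
    """Keep only the entries of context that json.dumps can serialize."""
    safe = {}
    for key, value in context.items():
        try:
            json.dumps(value)
        except TypeError:
            continue  # drop form objects, model instances, and similar
        safe[key] = value
    return safe


context = {
    "office_form": OfficeForm(),
    "success": {"msg": "successfully updated. "},
}

try:
    json.dumps(context)
    raised = False
except TypeError:
    raised = True  # inside a view, this uncaught error becomes the 500

payload = json.dumps(json_safe(context))
```

Silently dropping entries is a convenience for illustration; in a real view it is usually clearer to build the response dict from plain values only, for example the form's error messages rather than the form itself.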
I have a JSP page called CreateProcessGroup.jsp, and I use an annotated controller to map requests for CreateProcessGroup.htm to that page. But I'm having an interesting issue: when I request the page from the browser it works, but when I send a request using jQuery's $.get method I get a 404 (CreateProcessGroup.htm not found). Is there a difference between the two requests?
My JSP page is directly under the WebContent dir and the JS file is under WebContent/Jquery. The function sending the request looks like this:
function SendCreateProcessGroupRequest()
{
    var pid = $('#pid').val();
    var description = $('#processGroupDescription').val();
    var x = "/CreateProcessGroup.htm";
    alert(x);
    $.get(x, { pid: 62, description: description },
        function(data){
            alert("Data Loaded: " + data);
        });
}
Do I need to give the URL as ../CreateProcessGroup.htm? I have already tried:
/CreateProcessGroup.htm
../CreateProcessGroup.htm
/../CreateProcessGroup.htm
../../CreateProcessGroup.htm
/../../CreateProcessGroup.htm
My guess is that DispatcherServlet cannot map Ajax requests to controllers, but that would be stupid, wouldn't it?
How can I resolve this?
Thanks all.
Try this instead:
var x = "CreateProcessGroup.htm";
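If the page you're requesting is beside the one making the request, there's no need for a path in front; by default the request goes to the same path, just with that page/handler on the end. A leading slash resolves against the server root, which bypasses any context path the application is deployed under.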
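How the variants resolve can be checked with Python's `urljoin`, which applies the standard relative-reference resolution rules. The page URL below is hypothetical: it assumes the application is deployed under a `/MyApp` context path, which the question does not state.

```python
from urllib.parse import urljoin

# Hypothetical page URL: the app is assumed to be deployed under /MyApp
page_url = "http://localhost:8080/MyApp/Index.htm"

# No leading slash: resolved next to the current page
relative = urljoin(page_url, "CreateProcessGroup.htm")

# Leading slash: resolved against the server root, losing /MyApp
root_relative = urljoin(page_url, "/CreateProcessGroup.htm")

# Parent directory: also climbs out of /MyApp
parent = urljoin(page_url, "../CreateProcessGroup.htm")

print(relative)
print(root_relative)
print(parent)
```

Under that assumption only the first form stays inside the application, which would explain why every variant with a leading slash or `../` returned a 404.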
I need to know how to start a session via Ajax in Django. I'm doing exactly as described below, but it is not working! The request is sent correctly, but no session is started. If I make the request directly, without Ajax, it works! What is going on?
# urls
r'^logout/$', 'autenticacao.views.logout_view'
# view of login
def login_view(request):
    username = request.GET.get('username', '')
    password = request.GET.get('password', '')
    user = authenticate(username=username, password=password)
    if user is not None:
        if user.is_active:
            login(request, user)
            return HttpResponse(user.get_profile().sos_user.name)
    return HttpResponse('user invalido')
# ajax in a html page
$(function(){
    $.get('http://localhost:8000/logout/?username=usuario?>&password=senha', function(data){
        alert(data);
    });
});
You're not calling login_view. Your Ajax request is going to the /logout/ URL, which calls autenticacao.views.logout_view.
Also, the ?> after username=usuario doesn't look right in your GET URL.
My guess is you should be doing something like http://localhost:8000/login/?username=usuario&password=senha. (but I'd need to see your login url mapping to be sure).
Also, you should be POSTing the login information and using HTTPS for security reasons, but that's a different issue.
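The effect of the stray `?>` can be seen by parsing the query string with the standard library. This is a small sketch: the `/login/` path is an assumption, since the question does not show the login URL mapping.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The URL from the question, including the stray "?>"
bad_url = "http://localhost:8000/logout/?username=usuario?>&password=senha"
bad_params = parse_qs(urlsplit(bad_url).query)

# The login path is an assumption; the question does not show its URL mapping
login_base = "http://localhost:8000/login/"
good_url = login_base + "?" + urlencode({"username": "usuario", "password": "senha"})
good_params = parse_qs(urlsplit(good_url).query)

print(bad_params["username"])
print(good_params["username"])
```

With the original URL the username value carries the extra `?>` characters, so authentication would not match the intended user even if the request reached the login view.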