I have just started out with web scraping.
The data I need seems to be returned by an AJAX POST request. POST requests are rarely covered by scraping tutorials and seem to come with lots of gotchas for new users like me.
I copied the request from Chrome DevTools into Postman as cURL and then generated the Python requests code. The request uses a peculiar set of query parameters... I have repeated this process, however, and the only parameter that changes is the session ID.
The problem is that the request stops working after some time has elapsed (HTTP 500 Internal Server Error). I then have to copy the request from the site again with the new session ID.
Any pointers in the right direction would be appreciated.
import requests
url = "https://online.natis.gov.za/gateway/PreBooking?_flowId=PreBooking&_flowExecutionKey=e1s2&flowName=[object%20Object]&_eventId_next=Next?dtoName=perSummaryDetailDto&viewId=perSummaryDetail&flowExecutionKey=e1s2&flowExecutionUrl=%2Fgateway%2FPreBooking%3F_flowId%3DPreBooking%26_flowExecutionKey%3De1s2&sessionId=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS&surname={SURNAME}&initials=R&firstName1={FIRSTNAME}&emailAddress={EMAIL}&cellN={CELL}&isWithinPriorityDate=false&viewPrioritySlots=false&showPrioritySlotsModal=false&provcdt=4&supportUser=false"
payload = {}
headers = {
    'Connection': 'keep-alive',
    'Content-Length': '0',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'Accept': 'application/json, text/plain, */*',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Origin': 'https://online.natis.gov.za',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://online.natis.gov.za/',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cookie': 'JSESSIONID=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS.master:gateway_3; Gateway=R35619282; ROUTEID.33f40c02f95309866c572c0def16f016=.node1; JSESSIONID=BadmtwJ7c8YWEz73xe6Wu165Q7gapmm4WTY6at-p.master:gateway_3; Gateway=R35619282',
    'dnt': '1',
    'sec-gpc': '1',
}

response = requests.request(
    "POST", url, headers=headers, data=payload, verify=False)
print(response.text)
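One way to avoid chasing the expiring session ID (a sketch, not verified against this site): let a requests.Session collect cookies from a fresh page load and derive the sessionId parameter from the live JSESSIONID, instead of hardcoding a copied value. That the sessionId query parameter is simply the JSESSIONID cookie minus its .master:gateway_3 suffix is an inference from the captured request, not a confirmed fact.

import requests

BASE = "https://online.natis.gov.za"

with requests.Session() as s:
    # Hit the entry point first so the server issues a fresh JSESSIONID.
    s.get(BASE + "/gateway/PreBooking", params={"_flowId": "PreBooking"}, verify=False)

    # The captured cookie looked like "<id>.master:gateway_3"; the sessionId
    # query parameter appeared to be the part before the first dot (assumption).
    jsessionid = s.cookies.get("JSESSIONID", "")
    session_id = jsessionid.split(".")[0]

    params = {
        "_flowId": "PreBooking",
        "sessionId": session_id,
        # ... the remaining query parameters copied from DevTools ...
    }
    # verify=False mirrors the original snippet; drop it if TLS verification works.
    r = s.post(BASE + "/gateway/PreBooking", params=params, verify=False)
    print(r.status_code, r.text[:200])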
Related
I have been trying to make a simple web request using Python, POSTing data. The response is a 400 error:
"The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing)."
The code below posts a request using the Python requests library; a 400 response is received when it is executed. Could this be due to header syntax or format issues?
Code:
import json
import requests

headers = {
    'Host': 'host.url',
    'Content-Length': '1847',
    'Sec-Ch-Ua': '"Chromium";v="95", ";Not A Brand";v="99"',
    'Accept': 'application/json, text/plain, */*',
    'Content-Type': 'application/json',
    'Authorization': 'auth-key',
    'Sec-Ch-Ua-Mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Origin': 'origin.url',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'referer.url',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'close',
}
data = {}
json_object = json.dumps(data, indent=4)
response = requests.post('url', data=json_object, headers=headers, verify=False)
print(response.text)
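A common cause of a 400 here is the hand-copied Content-Length: 1847 no longer matching the body actually sent (whether that is the culprit in this case cannot be confirmed from the snippet). A minimal sketch with placeholder URL, auth key, and payload: pass the dict via json= so requests computes Content-Type and Content-Length itself, and don't pin them by hand.

import requests

headers = {
    'Accept': 'application/json, text/plain, */*',
    'Authorization': 'auth-key',  # placeholder
}
payload = {'example_field': 'example_value'}  # hypothetical body

# json= serializes the dict and sets Content-Type/Content-Length automatically,
# so no hand-copied header can disagree with the body that is actually sent.
response = requests.post('https://host.url/endpoint', json=payload,
                         headers=headers, verify=False)
print(response.status_code, response.text)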
I am trying to replicate in Python 3 the AJAX requests I can see in Chrome DevTools at this address when changing the time slot of the data in the EPG grid.
This is my code:
import re
import requests

session = requests.Session()

url_list = ['http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599019200&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
            'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599033600&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
            'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599048000&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
            'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599062400&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
            'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599076800&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
            'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599004800&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5']

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'tvmds.tvpassport.com',
    'Referer': 'http://www.canada.com/entertainment/television/tv-listings/index.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}

for url in url_list:
    r = session.get(url=url, headers=headers)
    data2 = r.content.decode('utf-8', errors='ignore')
    chan_data = re.findall(' channel row -->(.*?)<!--', data2, re.S)
    for chan in chan_data:
        if "'tvm_chan_50'" in chan:
            print(url)
            print(chan)
            print('-' * 150)

session.close()  # close once, after all requests
The code runs without error, but the st parameter appears to have no effect; it is the timestamp for various times throughout the day.
I keep getting roughly the same time slot of data returned, about 7am to 9:30am.
What am I doing wrong?
Thanks
Edit:
For clarity, the code itself is not falling over. My problem is that the AJAX requests I see in Chrome DevTools when using the time-toggle buttons on the referenced EPG do not seem to produce the same effect when the AJAX API is run in code or pasted directly into a web browser...
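One way to narrow this down (a sketch, assuming the endpoint really does key off st): build the query string from a dict so that only the timestamp varies, and print r.request.url to confirm exactly what was sent. If identical data still comes back for different st values, the server is likely ignoring st in favour of something else (a cookie, or a server-side default).

import requests

base_url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php'
common = {
    'subid': 'postmedia', 'lu': '36625D', 'wd': '940', 'ht': '445',
    'mode': 'json', 'style': 'wh', 'tz': 'America/Toronto',
    'lang': 'en', 'ctrlpos': 'top', 'items': '90', 'numhours': '5',
}

with requests.Session() as session:
    for st in (1599004800, 1599019200, 1599033600):
        params = dict(common, st=st)  # only the timestamp changes
        r = session.get(base_url, params=params,
                        headers={'User-Agent': 'Mozilla/5.0'})
        # r.request.url shows the exact query string that went over the wire
        print(r.status_code, r.request.url)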
On a government site I managed to log in with my credentials (specified as a Python dictionary in login_data) as follows:
with requests.Session() as s:
    url = 'https:......../login'
    r = s.get(url, data=login_data, headers=headers, verify=False)
    r = s.post(url, data=login_data, headers=headers, verify=False)
    print(r.content)
which displays HTML:
b'<!DOCTYPE html..... — and if I search for my username I find <span class="rich-messages-label msg-def-inf-label">Welcome, USER..XYZ!<, from which I conclude a successful login.
Next I want to proceed to the search subsite (url = 'https:......../search') of the site I'm now logged in to. This subsite lets me search the government records for an incident (incident-ID) on a given date (start_date, end_date).
Because of the login success I tried the following:
with requests.Session() as s:
    url = 'https:......../search'
    r = s.get(url, data=search_data, headers=headers, verify=False)
    r = s.post(url, data=search_data, headers=headers, verify=False)
    print(r.content)
I defined search_data in advance, using the Google Chrome inspector's Network and Headers views:
search_data = {
    'AJAXREQUEST': '_viewRoot',
    'theSearchForm': 'theSearchForm',
    'incident-ID': '12345',
    'start_date': '05/03/2019 00:00:00 +01:00',
    'end_date': '05/03/2019 23:59:59 +01:00',
}
and I specified headers to include more than just the user agent:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSIONID=8351xxxxxxxxxxxxFD5; _ga=GA1.2.xxxxxxx.xxxxxxxx',
    'Host': 'somehost...xyz.eu',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
}
So far the setup should be fine, no? But I ran into a problem: print(r.content) doesn't give me the HTML as it did after the login, but only this disappointingly short response: b'<?xml version="1.0" encoding="UTF-8"?>\n<html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Ajax-Response" content="redirect" /><meta name="Location" content="home.seam?cid=3774801" /></head></html>'
It's a pity, because I can see in the inspector that the response to the POST request in the browser contains exactly the data I am looking for. Similarly, the first POST request yields exactly the same data as my Python command r = s.post(url, data=login_data, headers=headers, verify=False). But print(r.content), as already said, seems to be a redirect that only brings me back to the login site, stating that I'm already logged in.
To sum up:
The first requests.Session get and post worked (I get the same response HTML as in the Google Chrome inspector).
The second requests.Session post doesn't work; it just yields a strange redirect (yet I get the correct response in the Google Chrome inspector).
What am I missing? Please help!
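The body coming back is a Seam/RichFaces-style AJAX redirect: the server is telling the client-side script to navigate to home.seam?cid=3774801. One thing worth trying, as a sketch only (the base URL is elided as in the question, and whether following the redirect yields the results page is an assumption): extract the Location from the meta tag and GET it within the same session, reusing the search_data and headers dicts defined above.

import re
import requests

with requests.Session() as s:
    base = 'https:........'  # elided, as in the question
    r = s.post(base + '/search', data=search_data, headers=headers, verify=False)
    body = r.content.decode('utf-8', errors='ignore')

    # Pull "home.seam?cid=..." out of <meta name="Location" content="..."/>
    m = re.search(r'<meta name="Location" content="([^"]+)"', body)
    if m:
        follow = s.get(base + '/' + m.group(1), verify=False)
        print(follow.content[:500])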
I used both F12 (Chrome) and Postman to check the request and its details on the site
http://www.zhihu.com/
(email: jianguo.bai@hirebigdata.cn, password: wsc111111), then went to
http://www.zhihu.com/people/hynuza/columns/followed
I want to get all the columns the user Hynuza has followed, which is currently 105. When the page opens there are only 20 of them; I then need to scroll down to load more. Each time I scroll down, the details of the request look like this:
Remote Address:60.28.215.70:80
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection:keep-alive
Content-Length:157
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/hynuza/columns/followed
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
method:next
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"}
_xsrf:f1460d2580fbf34ccd508eb4489f1097
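For reference, the captured request is straightforward to replay with plain requests before wiring it into Scrapy; note that params travels as a JSON string inside the urlencoded form body. A sketch using the captured _xsrf and hash_id, which will of course expire:

import json
import requests

data = {
    'method': 'next',
    # params is a JSON *string* inside the urlencoded body, not a nested dict
    'params': json.dumps({'offset': 20, 'limit': 20,
                          'hash_id': '18c79c6cc76ce8db8518367b46353a54'}),
    '_xsrf': 'f1460d2580fbf34ccd508eb4489f1097',
}
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'http://www.zhihu.com/people/hynuza/columns/followed',
}
r = requests.post('http://www.zhihu.com/node/ProfileFollowedColumnsListV2',
                  data=data, headers=headers)
print(r.status_code)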
Then I used Postman to simulate the request, like this:
As you can see, it got what I wanted, and it worked even when I was logged out of the site.
Based on all this, I wrote my spider like this:
# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)
Then I got a 403, like this:
2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed)
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed
So it never enters parse_more.
I've been working on this for two days and still have nothing; any help or advice would be appreciated.
The login sequence is correct. However, the parse_followed_columns() method completely corrupts the session.
You cannot use hardcoded values for data['_xsrf'] and params['hash_id'].
You should find a way to read this information directly from the HTML content of the previous page and inject the values dynamically.
Also, I suggest you remove the headers parameter from this request, as it can only cause trouble.
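A minimal sketch of that approach, to replace the spider's parse_followed_columns() (the selectors are assumptions about zhihu's markup at the time, not verified):

import json
import scrapy

# inside PostSpider:
def parse_followed_columns(self, response):
    # _xsrf is normally present as a hidden input on the page
    xsrf = response.xpath("//input[@name='_xsrf']/@value").extract()[0]
    # hash_id tends to sit in an inline JSON blob; regex it out of the page
    hash_id = response.selector.re(r'"hash_id"\s*:\s*"([0-9a-f]+)"')[0]

    yield scrapy.FormRequest(
        "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
        formdata={
            'method': 'next',
            # params must be a JSON string, exactly as the browser sends it
            'params': json.dumps({'offset': 20, 'limit': 20, 'hash_id': hash_id}),
            '_xsrf': xsrf,
        },
        callback=self.parse_more,
    )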
I tried to POST a username and password to an API, but it looks like it doesn't work as simply as a jQuery post. I keep getting this 400 error.
Code:
$http({
    method: 'POST',
    url: apiLink + '/general/dologin.json',
    data: {"username": "someuser", "password": "somepass"}
}).success(function(response) {
    console.log(response)
}).error(function(response) {
    console.log(response)
});
But if I add this line:
$http.defaults.headers.post["Content-Type"] = "application/x-www-form-urlencoded";
and change data to:
data: "username=someuser&password=somepass"
it works. But the thing is that I have to use JSON.
And the detailed information from Google Chrome:
Request URL:http://coldbox.abak.si:8080/general/dologin.json
Request Method:POST
Status Code:400 Bad Request
Request Headers
Accept:application/json, text/plain, */*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en,sl;q=0.8,en-GB;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:57
Content-Type:application/x-www-form-urlencoded
Host:coldbox.abak.si:8080
Origin:http://localhost:8888
Referer:http://localhost:8888/
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
Form Data
{"username":"someuser","password":"somepass"}:
Response Headers
Access-Control-Allow-Origin:*
Connection:close
Content-Length:49
Content-Type:application/json;charset=utf-8
Date:Wed, 02 Apr 2014 07:50:00 GMT
Server:Apache-Coyote/1.1
Set-Cookie:cfid=b5bbcbe2-e2df-4eef-923f-d7d13e5aea42;Path=/;Expires=Thu, 31-Mar-2044 15:41:30 GMT;HTTPOnly
Set-Cookie:cftoken=0;Path=/;Expires=Thu, 31-Mar-2044 15:41:30 GMT;HTTPOnly
I'm betting it's a CORS issue if your Angular app isn't on the exact same domain as the server you're posting your JSON to.
See this answer for details: AngularJS performs an OPTIONS HTTP request for a cross-origin resource
Try
data: {username: "someuser", password: "somepass"}
without the quotes around the username and password keys and see if that makes a difference.
You would have to transform the data with JSON.stringify when you assign it to data.
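To see the difference between the two encodings from outside the browser, here is a hedged Python sketch against the endpoint from the question (whether the server parses a JSON body at all is exactly what is in doubt):

import requests

url = 'http://coldbox.abak.si:8080/general/dologin.json'
creds = {'username': 'someuser', 'password': 'somepass'}

# What the working call sends: username=someuser&password=somepass
# with Content-Type: application/x-www-form-urlencoded
r1 = requests.post(url, data=creds)

# A genuine JSON body with Content-Type: application/json --
# this only succeeds if the server is written to parse JSON
r2 = requests.post(url, json=creds)

print(r1.status_code, r2.status_code)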