I used both F12 (Chrome DevTools) and Postman to inspect the request and its details on the site
http://www.zhihu.com/
(email: jianguo.bai@hirebigdata.cn, password: wsc111111), then went to
http://www.zhihu.com/people/hynuza/columns/followed
I want to get all the columns the user Hynuza has followed, which is currently 105. When the page opens, only 20 of them are shown, and I have to scroll down to load more. Each time I scroll down, the details of the request look like this:
Remote Address:60.28.215.70:80
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection:keep-alive
Content-Length:157
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/hynuza/columns/followed
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
method:next
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"}
_xsrf:f1460d2580fbf34ccd508eb4489f1097
Then I used Postman to simulate the request, like this:
As you can see, it returned what I wanted, and it worked even after I logged out of the site.
Based on all of this, I wrote my spider like this:
# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)
Then I got a 403 like this:
2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed)
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed
So it never enters parse_more.
I've been working on this for two days and still have nothing; any help or advice would be appreciated.
The login sequence is correct. However, the parse_followed_columns() method completely breaks the session.
You cannot use hardcoded values for data['_xsrf'] and params['hash_id'].
You should find a way to read this information directly from the HTML content of the previous page and inject the values dynamically.
Also, I suggest you remove the headers parameter from this request; it can only cause trouble.
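For example, here is a minimal sketch of parse_followed_columns that reads both values from the page instead of hardcoding them. The XPath and the regex are assumptions about where Zhihu exposes _xsrf and hash_id, so check them against the actual HTML; note also that the browser sends params as a JSON string, so json.dumps is needed rather than a Python dict:

import json
import re

def parse_followed_columns(self, response):
    # _xsrf is normally carried in a hidden form input on the page (assumption)
    _xsrf = response.xpath("//input[@name='_xsrf']/@value").extract()[0]
    # hash_id appears somewhere in the profile page markup; this pattern is a guess,
    # inspect the page source to confirm where it really lives
    hash_id = re.search(r'hash_id&quot;:&quot;([0-9a-f]{32})', response.body).group(1)
    yield scrapy.FormRequest(
        "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
        formdata={
            'method': 'next',
            # the browser sends params as a JSON string, not a Python dict
            'params': json.dumps({"offset": 20, "limit": 20, "hash_id": hash_id}),
            '_xsrf': _xsrf,
        },
        callback=self.parse_more,
    )

FormRequest url-encodes the body and sets the Content-Type for you, and Scrapy's cookie middleware keeps the session cookies, so no manual Cookie header is needed.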
Related
I have been trying to make a simple web request using Python POST data, and the response is a 400 (Bad Request):
The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).
The code below posts a request using the Python requests library. A 400 response is received when it is executed. Could this issue be due to header syntax or format issues?
code:
import json
import requests

headers = {
    'Host': 'host.url',
    'Content-Length': '1847',
    'Sec-Ch-Ua': '"Chromium";v="95", ";Not A Brand";v="99"',
    'Accept': 'application/json, text/plain, */*',
    'Content-Type': 'application/json',
    'Authorization': 'auth-key',
    'Sec-Ch-Ua-Mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Origin': 'origin.url',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'referer.url',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'close',
}
data = {}
json_object = json.dumps(data, indent=4)
response = requests.post('url', data=json_object, headers=headers, verify=False)
print(response.text)
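A 400 usually means the server considers the body or framing of the request invalid rather than the header formatting. With data = {} the POST carries an essentially empty JSON object, so the first thing to compare is the payload the browser actually sent. As a general cleanup, requests manages Host, Content-Length and Connection itself, and json= serializes the body and sets the Content-Type in one go. A minimal sketch along those lines (the URL, auth value, Origin and Referer are placeholders carried over from the question):

import requests

headers = {
    'Accept': 'application/json, text/plain, */*',
    'Authorization': 'auth-key',   # placeholder from the question
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'),
    'Origin': 'origin.url',        # placeholder from the question
    'Referer': 'referer.url',      # placeholder from the question
}

# the payload should mirror the JSON body captured in the browser's network tab
payload = {}

# json= serializes the body and lets requests compute the length itself
response = requests.post('url', json=payload, headers=headers, verify=False)
print(response.status_code, response.text)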
I have just started out with web scraping.
The data I need seems to be returned by an AJAX POST request. POST requests are rarely covered by scraping tutorials and seem to come with lots of gotchas for new users like myself.
I copied the request from Chrome dev tools into Postman using cURL and then generated the Python request code. The request uses a peculiar set of query parameters. I have, however, repeated this process, and the only parameter that changes is the session ID.
The problem is that the request stops working after some time has elapsed (500 Internal Server Error). I would then have to copy the request from the site again with the new session ID.
Any pointers in the right direction would be appreciated.
import requests
url = "https://online.natis.gov.za/gateway/PreBooking?_flowId=PreBooking&_flowExecutionKey=e1s2&flowName=[object%20Object]&_eventId_next=Next?dtoName=perSummaryDetailDto&viewId=perSummaryDetail&flowExecutionKey=e1s2&flowExecutionUrl=%2Fgateway%2FPreBooking%3F_flowId%3DPreBooking%26_flowExecutionKey%3De1s2&sessionId=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS&surname={SURNAME}&initials=R&firstName1={FIRSTNAME}&emailAddress={EMAIL}&cellN={CELL}&isWithinPriorityDate=false&viewPrioritySlots=false&showPrioritySlotsModal=false&provcdt=4&supportUser=false"
payload = {}
headers = {
    'Connection': 'keep-alive',
    'Content-Length': '0',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'Accept': 'application/json, text/plain, */*',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Origin': 'https://online.natis.gov.za',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://online.natis.gov.za/',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cookie': 'JSESSIONID=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS.master:gateway_3; Gateway=R35619282; ROUTEID.33f40c02f95309866c572c0def16f016=.node1; JSESSIONID=BadmtwJ7c8YWEz73xe6Wu165Q7gapmm4WTY6at-p.master:gateway_3; Gateway=R35619282',
    'dnt': '1',
    'sec-gpc': '1'
}

response = requests.request(
    "POST", url, headers=headers, data=payload, verify=False)
print(response.text)
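One thing visible in the captured request itself: the sessionId query parameter is the JSESSIONID cookie value minus its .master:gateway_3 routing suffix, and that is exactly the part that expires. Rather than re-copying it from the browser, a Session can fetch a fresh one first. This is only a sketch under that assumption; the landing URL is guessed from the flow name, and _flowExecutionKey may also need to be re-read from the returned page:

import requests

BASE = 'https://online.natis.gov.za'

with requests.Session() as s:
    # assumption: opening the PreBooking flow issues a fresh JSESSIONID cookie
    s.get(f'{BASE}/gateway/PreBooking?_flowId=PreBooking', verify=False)

    jsessionid = s.cookies.get('JSESSIONID', '')
    session_id = jsessionid.split('.')[0]  # drop the ".master:gateway_3" routing suffix

    url = (
        f'{BASE}/gateway/PreBooking?_flowId=PreBooking&_flowExecutionKey=e1s2'
        f'&_eventId_next=Next&sessionId={session_id}'
        f'&surname=SURNAME&initials=R&firstName1=FIRSTNAME&emailAddress=EMAIL&cellN=CELL'
        # ...remaining query parameters exactly as captured from the browser...
    )
    r = s.post(url, verify=False)
    print(r.status_code, r.text[:200])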
I am trying to scrape an API that returns a JSON object, but it only returns the JSON the very first time; after that it returns nothing. I am using the "if-none-match" header together with cookies, but I want to do it without cookies because I have lots of APIs of this category to scrape.
Here is my spider code:
import scrapy
from scrapy import Spider, Request
import json
from scrapy.crawler import CrawlerProcess

header_data = {
    'authority': 'shopee.com.my',
    'method': 'GET',
    'scheme': 'https',
    'accept': '*/*',
    'if-none-match-': '*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-shopee-language': 'en',
    'Cache-Control': 'max-age=0',
}


class TestSales(Spider):
    name = "testsales"
    allowed_domains = ['shopee.com', 'shopee.com.my', 'shopee.com.my/api/']
    cookie_string = {'SPC_U': '-', 'SPC_IA': '-1', 'SPC_EC': '-', 'SPC_F': '7jrWAm4XYNNtyVAk83GPknN8NbCMQEIk', 'REC_T_ID': '476673f8-eeb0-11ea-8919-48df374df85c', '_gcl_au': '1.1.1197882328.1599225148', '_med': 'refer', '_fbp': 'fb.2.1599225150134.114138691', 'language': 'en', '_ga': 'GA1.3.1167355736.1599225151', 'SPC_SI': 'mall.gTmrpiDl24JHLSNwnCw107mao3hd8qGP', 'csrftoken': '2ntG40uuWzOLUsjv5Sn8glBUQjXtbGgo', 'welcomePkgShown': 'true', '_gid': 'GA1.3.590966412.1602427202', 'AMP_TOKEN': '%24NOT_FOUND', 'SPC_CT_21c6f4cb': '1602508637.vtyz9yfI6ckMZBdT9dlICuAYf7crlEQ6NwQScaB2VXI=', 'SPC_CT_087ee755': '1602508652.ihdXyWUp3wFdBN1FGrKejd91MM8sJHEYCPqcgmKqpdA=', '_dc_gtm_UA-61915055-6': '1', 'SPC_R_T_ID': 'vT4Yxil96kYSRG2GIhtzk8fRJldlPJ1/szTbz9sG21nTJr4zDoOnnxFEgYe2Ea+RhM0H8q0m/SFWBMO7ktpU5Kim0CJneelIboFavxAVwb0=', 'SPC_T_IV': 'hhHcCbIpVvuchn7SbLYeFw==', 'SPC_R_T_IV': 'hhHcCbIpVvuchn7SbLYeFw==', 'SPC_T_ID': 'vT4Yxil96kYSRG2GIhtzk8fRJldlPJ1/szTbz9sG21nTJr4zDoOnnxFEgYe2Ea+RhM0H8q0m/SFWBMO7ktpU5Kim0CJneelIboFavxAVwb0='}
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': 'True',
        # The initial download delay
        'AUTOTHROTTLE_START_DELAY': '0.5',
        # The maximum download delay to be set in case of high latencies
        'AUTOTHROTTLE_MAX_DELAY': '10',
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        'AUTOTHROTTLE_TARGET_CONCURRENCY': '1.0',
        # 'DNSCACHE_ENABLED' : 'False',
        # 'COOKIES_ENABLED': 'False',
    }

    def start_requests(self):
        subcat_url = '/Baby-Toddler-Play-cat.27.23785'
        id = subcat_url.split('.')[-1]
        header_data['path'] = f'/api/v2/search_items/?by=sales&limit=50&match_id={id}&newest=0&order=desc&page_type=search&version=2'
        header_data['referer'] = f'https://shopee.com.my{subcat_url}?page=0&sortBy=sales'
        url = f'https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id={id}&newest=0&order=desc&page_type=search&version=2'
        yield Request(url=url, headers=header_data,  # cookies=self.cookie_string,
                      cb_kwargs={'subcat': 'baby tobbler play cat', 'category': 'baby and toys'})

    def parse(self, response, subcat, category):
        try:
            jdata = json.loads(response.body)
        except Exception as e:
            print(f'exception: {e}')
            print(response.body)
            return None
        items = jdata['items']
        for item in items:
            name = item['name']
            image_path = item['image']
            absolute_image = f'https://cf.shopee.com.my/file/{image_path}_tn'
            print(f'this is absolute image {absolute_image}')
            subcategory = subcat
            monthly_sold = 'pending'
            price = float(item['price']) / 100000
            total_sold = item['sold']
            location = item['shop_location']
            stock = item['stock']
            print(name)
            print(price)
            print(total_sold)
            print(location)
            print(stock)


app = CrawlerProcess()
app.crawl(TestSales)
app.start()
This is the page URL, which you can open in a browser: https://shopee.com.my/Baby-Toddler-Play-cat.27.23785?page=0&sortBy=sales
This is the API URL, which you can also find in the developer tools for that page: https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=23785&newest=0&order=desc&page_type=search&version=2
Please tell me how to handle the cache or 'if-none-match', because I can't work out how to deal with it.
Thanks in Advance!
All you need to generate the API GET requests is the category identifier, which is match_id, and the start item number, which is the newest parameter.
Using the link template https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id={category_id}&newest={start_item_number}&order=desc&page_type=search&version=2 you can fetch any category endpoint of the API.
There's no need to manage cookies or even headers in this case. The API is not restrictive at all.
UPDATE:
This worked for me in scrapy shell:
from scrapy import Request
url = 'https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=23785&newest=50&order=desc&page_type=search&version=2'
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.5",
"X-Requested-With": "XMLHttpRequest",
}
request = Request(
url=url,
method='GET',
dont_filter=True,
headers=headers,
)
fetch(request)
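Building on that, stepping newest in increments of the limit walks the whole category. A small sketch under the same assumptions (match_id 23785, pages of 50, JSON returned without cookies); the field handling mirrors the parse method in the question:

import json

import scrapy


class SalesPagesSpider(scrapy.Spider):
    name = "salespages"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
        "Accept": "*/*",
        "X-Requested-With": "XMLHttpRequest",
    }

    def start_requests(self):
        match_id = 23785                      # category identifier from the URL
        for newest in range(0, 500, 50):      # first 10 pages; adjust as needed
            url = (f'https://shopee.com.my/api/v2/search_items/?by=sales&limit=50'
                   f'&match_id={match_id}&newest={newest}&order=desc&page_type=search&version=2')
            yield scrapy.Request(url, headers=self.headers, dont_filter=True)

    def parse(self, response):
        # each page is a JSON document with an "items" list
        for item in json.loads(response.body).get('items') or []:
            yield {'name': item['name'], 'price': float(item['price']) / 100000}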
I am trying to replicate in Python 3 the ajax requests I can see in Chrome Tools at this address when changing the time slot of data in the EPG grid.
This is my code:
import sys
import os
import re
import requests

session = requests.Session()

url_list = [
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599019200&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599033600&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599048000&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599062400&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599076800&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
    'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lu=36625D&wd=940&ht=445&mode=json&style=wh&st=1599004800&tz=America%2FToronto&lang=en&ctrlpos=top&items=90&numhours=5',
]

for url in url_list:
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'tvmds.tvpassport.com',
        'Referer': 'http://www.canada.com/entertainment/television/tv-listings/index.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    r = session.get(url=url, headers=headers)
    status_code = r.status_code
    data2 = r.content.decode('utf-8', errors='ignore')
    session.close()
    chan_data = re.findall(' channel row -->(.*?)<!--', data2, re.S)
    for chan in chan_data:
        if "\'tvm_chan_50\'" in chan:
            print(url)
            print(chan)
            print('-' * 150)
The code runs without error, but the st parameter appears to have no effect. It is the timestamp for various times throughout the day.
I keep getting the same time slot of data back, roughly 7 am to 9:30 am.
What am I doing wrong?
Thanks
Edit:
For clarity, the code itself is not falling over. My problem is that the AJAX requests I see in Chrome tools when using the time toggle buttons on the referenced EPG do not seem to produce the same effect when the AJAX API is either run in code or copied and pasted directly into a web browser.
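To rule out a unit or timezone mix-up before blaming the endpoint, it can help to decode each st value locally. Assuming st is a plain Unix timestamp meant to be read in the zone given by the tz parameter, a quick check looks like this (zoneinfo needs Python 3.9+; pytz would do the same on older versions):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

toronto = ZoneInfo('America/Toronto')
st_values = [1599019200, 1599033600, 1599048000, 1599062400, 1599076800, 1599004800]

for st in st_values:
    # convert the epoch value to Toronto local time and print it next to the raw value
    local = datetime.fromtimestamp(st, tz=timezone.utc).astimezone(toronto)
    print(st, local.strftime('%Y-%m-%d %H:%M %Z'))

If these print the slots you expect, the request parameters are at least self-consistent and the server is ignoring st for some other reason.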
On a government site I managed to log in with my credentials (specified as a Python dictionary in login_data) as follows:
with requests.Session() as s:
    url = 'https:......../login'
    r = s.get(url, data=login_data, headers=headers, verify=False)
    r = s.post(url, data=login_data, headers=headers, verify=False)
    print(r.content)
which displays HTML:
b'<!DOCTYPE html..... and if I search for my username I find <span class="rich-messages-label msg-def-inf-label">Welcome, USER..XYZ!<, from which I conclude a successful login.
Next I want to proceed to the search subsite (url = 'https:......./search') of the site I'm now logged in to. This subsite allows me to search the government records for an incident (incident-ID) on a given date (start_date, end_date).
Because of the login success, I tried the following:
with requests.Session() as s:
    url = 'https:......../search'
    r = s.get(url, data=search_data, headers=headers, verify=False)
    r = s.post(url, data=search_data, headers=headers, verify=False)
    print(r.content)
Beforehand I defined search_data using the Google Chrome Inspector (Network and Headers tabs):
search_data = {
    'AJAXREQUEST': '_viewRoot',
    'theSearchForm': 'theSearchForm',
    'incident-ID': '12345',
    'start_date': '05/03/2019 00:00:00 +01:00',
    'end_date': '05/03/2019 23:59:59 +01:00',
}
and I specified headers to include more than just the user agent:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSIONID=8351xxxxxxxxxxxxFD5; _ga=GA1.2.xxxxxxx.xxxxxxxx',
    'Host': 'somehost...xyz.eu',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
}
So far the setup should be fine, no? But I ran into a problem: print(r.content) doesn't give me the HTML as it did after the login, but only something disappointingly short: b'<?xml version="1.0" encoding="UTF-8"?>\n<html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Ajax-Response" content="redirect" /><meta name="Location" content="home.seam?cid=3774801" /></head></html>
It's a pity, because I can see in the Inspector that the response to the POST request in the browser yields exactly the data I am looking for. Similarly, the first POST request yields exactly the same data as my Python command r = s.post(url, data=login_data, headers=headers, verify=False). But print(r.content), as already said, seems to be a redirect that only brings me back to the login site, stating that I'm already logged in.
To sum up:
The first requests.Session get & post worked (I get the same response HTML as in the Google Chrome Inspector).
The second requests.Session post doesn't work, as it just yields a weird redirect (but I get the correct response in the Google Chrome Inspector).
What am I missing? Please help!
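The 'AJAXREQUEST': '_viewRoot' field and the Ajax-Response redirect to home.seam suggest a JSF/Seam (RichFaces) application. Such applications typically reject an AJAX POST that does not carry the current javax.faces.ViewState (and sometimes the cid conversation id) taken from the page being posted from, which would explain being bounced back to the start. The following is only a sketch under that assumption; login_url, search_url and login_data stand for the redacted values above, and the hidden input name must be checked against the real search page:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    s.post(login_url, data=login_data, verify=False)            # log in as before

    # load the search page first and lift the current view state out of it
    page = s.get(search_url, verify=False)
    soup = BeautifulSoup(page.content, 'html.parser')
    state = soup.find('input', {'name': 'javax.faces.ViewState'})  # assumption: hidden JSF field

    search_data = {
        'AJAXREQUEST': '_viewRoot',
        'theSearchForm': 'theSearchForm',
        'incident-ID': '12345',
        'start_date': '05/03/2019 00:00:00 +01:00',
        'end_date': '05/03/2019 23:59:59 +01:00',
    }
    if state is not None:
        search_data['javax.faces.ViewState'] = state['value']

    r = s.post(search_url, data=search_data, verify=False)
    print(r.content)

Dropping the hand-copied Cookie header is also advisable here: the Session already carries the JSESSIONID it received at login, and the one pasted from the browser belongs to a different session.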