I am trying to send a request to this URL using requests-html.
Here is my code:
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
session = HTMLSession()
while True:
try:
r = session.get("https://www.size.co.uk/product/white-fila-v94m-low/119095/",headers=headers,timeout=40)
r.html.render()
print(r.html.text)
except Exception as e:
print(e)
Here is the error I am receiving:
HTTPSConnectionPool(host='www.size.co.uk', port=443): Read timed out. (read timeout=40)
I thought setting the user agent would fix the problem, but I am still receiving the error. Increasing the timeout hasn't done the trick either.
You can do this with an async session:
from requests_html import AsyncHTMLSession

s = AsyncHTMLSession()

async def main():
    r = await s.get('https://www.size.co.uk/product/white-fila-v94m-low/119095/')
    await r.html.arender()
    print(r.html.text)

s.run(main)
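If the custom User-Agent and timeout from the question are still needed, they can be passed to the async get as well; this is a sketch, assuming requests-html forwards these keyword arguments to the underlying requests call the same way the synchronous session does:

from requests_html import AsyncHTMLSession

headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
s = AsyncHTMLSession()

async def main():
    # headers and timeout are passed through to the underlying requests call
    r = await s.get('https://www.size.co.uk/product/white-fila-v94m-low/119095/',
                    headers=headers, timeout=40)
    await r.html.arender()
    print(r.html.text)

s.run(main)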
I used discord.py to create a Discord bot that runs all day. The bot crawls a community site once every 10 seconds. After about 3 hours of execution, the "Heartbeat blocked for more than 10 seconds" warning was displayed every 10 seconds.
My bot code:
import os
import discord
import requests
from bs4 import BeautifulSoup
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler

intents = discord.Intents.default()
intents.message_content = True
intents.members = True
client = discord.Client(intents=intents)

token = "token"
url = "site url"
params = {
    'id': 'id',
}
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
headers = {
    'User-Agent': user_agent,
}

async def read_site():
    resp = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    article_nums = soup.find("tbody").find_all("tr", class_="ub-content us-post")
    with open("recent.txt", "r") as f:
        recent = int(f.read())
    new_flag = False
    for val in article_nums:
        article_num = val.select_one("td", class_='num').get_text()
        if not article_num.isdecimal():
            continue
        article_num = int(article_num)
        if article_num > recent:
            new_flag = True
            recent = article_num
    if new_flag:
        channel = await client.fetch_channel(channel_id)
        await channel.send(recent)
        with open("recent.txt", "w") as f:
            f.write(str(recent))
...

@client.event
async def on_ready():
    await client.wait_until_ready()
    scheduler = AsyncIOScheduler()
    scheduler.add_job(read_site, 'interval', seconds=3, misfire_grace_time=5)
    scheduler.start()
    print(f'We have logged in as {client.user}')

client.run(token)
If anyone knows about this error, I would appreciate your help. I saw a similar question on Stack Overflow, so I set misfire_grace_time in the add_job call, but it didn't solve the problem.
"Heartbeat blocked for more than 10 seconds" can be caused by the blocking in your code. Then if the execution of this part of blocking code takes longer time then discord gateway will show this warning.
From the code provided, the line of requests.get(...) is definitely a blocking line of code since requests library is all synchronous. Consider changing this part of the code to aiohttp.
Also, there might be other libraries that could be blocking. Replace them with similar asynchronous libraries.
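A minimal sketch of what that change could look like, reusing the url, params, and headers already defined in the question (aiohttp has to be installed separately):

import aiohttp

async def read_site():
    # aiohttp awaits the request instead of blocking the event loop,
    # so discord.py's heartbeat keeps running while the page downloads
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params, headers=headers) as resp:
            html = await resp.text()
    soup = BeautifulSoup(html, 'html.parser')
    # ...the rest of the parsing and Discord logic stays the same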
I'm trying to download an image, but the server's response is 403. I have tried using other user agents in case that helps, but it doesn't work. The code works fine with other links. I don't know how I can get around the server's block.
import requests
import shutil

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

r = requests.get('https://s8.mkklcdnv6temp.com/mangakakalot/b1/bd926355/chapter_3/1.jpg',
                 headers=headers,
                 stream=True)

if r.status_code == 200:
    with open("img.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
else:
    print('failure')
I tried to do an HTTP GET for the website http://www.shopyourway.com using Ruby's Net::HTTP.get, but I got an error with code 512. I also tried a GET over SSL for the URL "https://www.shopyourway.com", but it just followed a redirect back to the URL without SSL.
code is as below:
uri = URI('https://www.shopyourway.com')
body = Net::HTTP.get(uri)
I can browse the URL in a browser, so why can't I do an HTTP GET for it?
Thanks
I finally got this to work; I needed to add a couple of headers to the GET request.
uri = URI('http://www.shopyourway.com/today')
req = Net::HTTP::Get.new(uri)
req['Upgrade-Insecure-Requests'] = '1'
req['Connection'] = 'keep-alive'
req['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'

# Send the request and read the body
res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
body = res.body
I want to get the contents of 'a.com/a.html' and 'a.com/b.html' with the same request.
My code is:
uri = URI.parse("http://www.sample.com/sample1.html")
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
# request.initialize_http_header({"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"})
result = http.request(request).body
Should I change the path of the request, or is there another way to do this?
You can't fetch multiple resources at once, but you can reuse an HTTP connection to fetch multiple resources from the same server (one after another):
require 'net/http'

Net::HTTP.start('a.com') do |http|
  result_a = http.get('/a.html').body
  result_b = http.get('/b.html').body
end
From the docs:
::start immediately creates a connection to an HTTP server which is kept open for the duration of the block. The connection will remain open for multiple requests in the block if the server indicates it supports persistent connections.
I've used this formula in Google Spreadsheets in the past to input the number of search results into a cell. The formula's not working anymore and I know I need to fix the xpath query.
Any ideas?
Current formula:
=importXML("http://www.google.com/search?num=100&q="&A2,"//p[@id='resultStats']/b[3]")
Spreadsheet for public testing:
https://spreadsheets8.google.com/ccc?key=tJzmFllp7Sk1lt23cXSVXFw&authkey=CM625OUO#gid=0
=ImportXML(A2,"//*[@id='resultStats']")
It doesn't produce just the number, but it's a start.
Set C2 to =ImportXML(A2,"//*[@id='resultStats']")
Then use =REGEXREPLACE(C2;"^About(.*) results";"$1")
import requests
from bs4 import BeautifulSoup
import re

def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))

print(google_results('test'))