Effects of passing headers in a request? - image

I want to know what difference it makes when you pass headers to requests.get, i.e. the difference between requests.get(url, headers) and requests.get(url).
I have these two pieces of code:
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
page = requests.get(url)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]
The output of this is a URL, as you'd expect. But this:
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]
has an output that starts with data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoIC
I'm guessing this is the actual image without the rendering, just plain data. Any idea how I could keep it in URL form? In what other ways does the presence of a header affect the response we get?
Thank You

Save the first code's response to an HTML file and open it in your browser:
as you can see, you are blocked by Amazon when no headers are sent.
Use this XPath:
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@data-old-hires'
Output:
type: <class 'lxml.etree._ElementStringResult'>
http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg
This is the raw HTML data:
<img alt=".." src="
data:image/webp;base64,UklGRuYIAABXRUJQVlA4INoIAACQQQCdASosAcsAPrFWpEqkIqQhIxN6gIgWCek6r4bUf/..."
data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg"
The picture URL is in the data-old-hires attribute.
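Putting the two pieces together, here is a minimal sketch (assuming the product page from the question and the same browser User-Agent; the page layout may have changed since) that sends headers and reads the full-size image URL from data-old-hires:

from lxml import html
import requests

# URL and User-Agent taken from the question above; the page layout is an assumption.
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)

# data-old-hires holds the full-size image URL; src may be an inlined base64 blob.
image_source = tree.xpath('//*[@id="main-image-container"]//img/@data-old-hires')
if image_source:
    print(image_source[0])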

Related

Discord.py Warning "discord.gateway: Shard ID None heartbeat blocked for more than 10 seconds." while using AsyncIOScheduler

I used discord.py to create a Discord bot and run it all day. The bot crawls community sites once every 10 seconds. After about 3 hours of execution, the warning message was displayed every 10 seconds.
My bot code:
import os
import discord
import requests
from bs4 import BeautifulSoup
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler

intents = discord.Intents.default()
intents.message_content = True
intents.members = True
client = discord.Client(intents=intents)

token = "token"
url = "site url"
params = {
    'id': 'id',
}
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
headers = {
    'User-Agent': user_agent
}

async def read_site():
    resp = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    article_nums = soup.find("tbody").find_all("tr", class_="ub-content us-post")
    with open("recent.txt", "r") as f:
        recent = int(f.read())
    new_flag = False
    for val in article_nums:
        article_num = val.select_one("td", class_='num').get_text()
        if not article_num.isdecimal():
            continue
        article_num = int(article_num)
        if article_num > recent:
            new_flag = True
            recent = article_num
    if new_flag:
        channel = await client.fetch_channel(channel_id)
        await channel.send(recent)
        with open("recent.txt", "w") as f:
            f.write(str(recent))
        f.close()

...

@client.event
async def on_ready():
    await client.wait_until_ready()
    scheduler = AsyncIOScheduler()
    scheduler.add_job(read_site, 'interval', seconds=3, misfire_grace_time=5)
    scheduler.start()
    print(f'We have logged in as {client.user}')

client.run(token)
If anyone knows about this error, I would appreciate your help.
I saw a similar question on Stack Overflow, so I set misfire_grace_time in the add_job function, but that didn't solve it.
"Heartbeat blocked for more than 10 seconds" can be caused by the blocking in your code. Then if the execution of this part of blocking code takes longer time then discord gateway will show this warning.
From the code provided, the line of requests.get(...) is definitely a blocking line of code since requests library is all synchronous. Consider changing this part of the code to aiohttp.
Also, there might be other libraries that could be blocking. Replace them with similar asynchronous libraries.
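As a rough sketch of that change (assuming the url, params and headers defined earlier in the question, and that aiohttp is installed), read_site could fetch the page without blocking the event loop:

import aiohttp

async def read_site():
    # Fetch asynchronously so the Discord heartbeat task is never blocked.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params, headers=headers) as resp:
            body = await resp.text()
    soup = BeautifulSoup(body, 'html.parser')
    # ... the rest of the parsing and file logic stays the same ...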

For loop for scraping emails for multiple URLs - BS

Below is the code for scraping emails from a single base URL. I have been cracking my head trying to write a simple "for loop" to do it for an array of URLs, or to read a list of URLs (CSV) into Python. Can anyone modify the code so that it will do the job?
import requests
import re
from bs4 import BeautifulSoup

allLinks = []; mails = []
url = 'https://www.smu.edu.sg/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]
allLinks = set(links)

def findMails(soup):
    for name in soup.find_all('a'):
        if name is not None:
            emailText = name.text
            match = bool(re.match('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', emailText))
            if '@' in emailText and match == True:
                emailText = emailText.replace(" ", '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if (len(mails) == 0) or (emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if link.startswith("http") or link.startswith("www"):
        r = requests.get(link)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)
    else:
        newurl = url + link
        r = requests.get(newurl)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")

Downloaded images from the Metropolitan Museum collection are empty

I'm trying to download random public-domain images from the Metropolitan Museum collection using their API (more info here: https://metmuseum.github.io/) and Python, but unfortunately the images I get are empty. Here is a minimal example:
import urllib
from urllib2 import urlopen
import json
from random import randint
url = "https://collectionapi.metmuseum.org/public/collection/v1/objects"
objectID_list = json.loads(urlopen(url).read())['objectIDs']
objectID = objectID_list[randint(0,len(objectID_list)-1)]
url_request = url+"/"+str(objectID)
fetched_data = json.loads(urlopen(url_request).read())
if fetched_data['isPublicDomain']:
    name = str(fetched_data['title'])
    ID = str(fetched_data['objectID'])
    url_image = str(fetched_data['primaryImage'])
    urllib.urlretrieve(url_image, 'path/'+name+'_'+ID+'.jpg')
If I print url_image and copy/paste it into a browser, I get to the desired image, but the code retrieves an image that weighs about 1 KB and can't be opened.
Any idea what I'm doing wrong?
Your way of downloading is correct; however, it seems the domain validates request headers to prevent scraping (probably unintentionally, as they provide an API for pulling images).
One way of solving this problem is by changing your headers to something realistic, or by using fake_useragent together with requests.
import requests
from fake_useragent import UserAgent
def save_image(link, file_path):
    ua = UserAgent(verify_ssl=False)
    headers = {"User-Agent": ua.random}
    r = requests.get(link, stream=True, headers=headers)
    if r.status_code == 200:
        with open(file_path, 'wb') as f:
            f.write(r.content)
    else:
        raise Exception("Error code {}.".format(r.status_code))
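As a usage sketch, reusing the url_image, name and ID variables from the question's code ('path/' is the question's placeholder directory):

save_image(url_image, 'path/' + name + '_' + ID + '.jpg')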

Not able to scrape more than 10 records using scrapy

I'm new to Scrapy and Python, and I'm using Scrapy to scrape the data.
The site uses AJAX for pagination, so I'm not able to get more than 10 records. I'm posting my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re

class JustdialSpider(Spider):
    name = "JustdialSpider"
    allowed_domains = ["justdial.com"]
    start_urls = [
        "http://www.justdial.com/Mumbai/Dentists/ct-385543",
    ]

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
        for question in questions:
            item = JustdialItem()
            item['name'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
            item['contact'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/b/text()').extract()
            with open('some.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(izip(item['name'], item['contact']))
            f.close()
        return item

        # if running code above this I'm able to get 10 records of the page
        # This code not working for getting data more than 10 records, Pagination using AJAX
        url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
        next_page = int(re.findall('page=(\d+)', url)[0]) + 1
        next_url = re.sub('page=\d+', 'page={}'.format(next_page), url)
        print next_url

    def parse_ajaxurl(self, response):
        # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
        my_headers = {'Referer': response.url}
        yield Request("ajax_request_url",
                      headers=my_headers,
                      callback=self.parse_ajax)
Please help me
Thanks.
Actually, if you disable JavaScript when viewing the page, you'll notice that the site offers traditional pagination instead of the "never ending" AJAX one.
Using this, you can simply find the URL of the next page and continue:
def parse(self, response):
    questions = response.xpath('//div[contains(@class,"store-details")]')
    for question in questions:
        item = dict()
        item['name'] = question.xpath("h4/span/a/text()").extract_first()
        item['contact'] = question.xpath("p[@class='contact-info']//b/text()").extract_first()
        yield item

    # next page
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield Request(next_page)
I also fixed up your XPaths, but overall the only bits that changed are the three lines under the # next page comment.
As a side note, I noticed you are saving to CSV inside the spider, whereas you could use the built-in Scrapy exporter command:
scrapy crawl myspider --output results.csv

Forming Xpath Query for total Google results

I've used this formula in Google Spreadsheets in the past to input the number of search results into a cell. The formula's not working anymore and I know I need to fix the xpath query.
Any ideas?
Current formula:
=importXML("http://www.google.com/search?num=100&q="&A2,"//p[@id='resultStats']/b[3]")
Spreadsheet for public testing:
https://spreadsheets8.google.com/ccc?key=tJzmFllp7Sk1lt23cXSVXFw&authkey=CM625OUO#gid=0
=ImportXML(A2,"//*[@id='resultStats']")
It doesn't produce only the number, but it's a start.
Set C2 to =ImportXML(A2,"//*[@id='resultStats']")
Then use =REGEXREPLACE(C2;"^About(.*) results";"$1")
import requests
from bs4 import BeautifulSoup
import re

def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))

print(google_results('test'))
