Forming Xpath Query for total Google results - xpath

I've used this formula in Google Spreadsheets in the past to put the number of search results into a cell. The formula no longer works, and I know I need to fix the XPath query.
Any ideas?
Current formula:
=importXML("http://www.google.com/search?num=100&q="&A2,"//p[@id='resultStats']/b[3]")
Spreadsheet for public testing:
https://spreadsheets8.google.com/ccc?key=tJzmFllp7Sk1lt23cXSVXFw&authkey=CM625OUO#gid=0

=ImportXML(A2,"//*[@id='resultStats']")
It doesn't produce the number only, but it's a start

Set C2 =ImportXML(A2,"//*[@id='resultStats']")
Then use =REGEXREPLACE(C2 ;"^About(.*) results";"$1")
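For anyone doing the same extraction outside a spreadsheet, here is a minimal Python sketch of what the REGEXREPLACE step does. It assumes the stats text has the shape "About N results", as the pattern above implies; the function name parse_result_count and the sample string are made up for illustration.

```python
import re

def parse_result_count(stats_text: str) -> int:
    """Extract the count from text shaped like 'About 1,230,000 results'."""
    match = re.search(r"About\s+([\d.,\s]+?)\s+results", stats_text)
    if match is None:
        raise ValueError(f"unexpected stats text: {stats_text!r}")
    # Drop thousands separators (and any other non-digits) before converting.
    return int(re.sub(r"\D", "", match.group(1)))

if __name__ == "__main__":
    # Hypothetical sample text, not captured from a real results page.
    print(parse_result_count("About 1,230,000 results (0.45 seconds)"))  # 1230000
```

Unlike the spreadsheet formula, this returns an integer rather than a string, so the separators have to be stripped first.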

import requests
from bs4 import BeautifulSoup
import re
def google_results(query):
    url = 'https://www.google.com/search?q=' + query
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='resultStats')
    return int(''.join(re.findall(r'\d+', div.text.split()[1])))
print(google_results('test'))


Discord.py Warning "discord.gateway: Shard ID None heartbeat blocked for more than 10 seconds." while using AsyncIOScheduler

I used Discord.py to create a Discord bot and run it all day. The bot crawls a community site once every 10 seconds. After about 3 hours of running, this warning started appearing every 10 seconds.
My bot code:
import os
import discord
import requests
from bs4 import BeautifulSoup
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler
intents = discord.Intents.default()
intents.message_content = True
intents.members = True
client = discord.Client(intents=intents)
token = "token"
url = "site url"
params = {
    'id': 'id',}
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
headers = {
    'User-Agent': user_agent}
async def read_site():
    resp = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    article_nums = soup.find("tbody").find_all("tr", class_="ub-content us-post")
    with open("recent.txt", "r") as f:
        recent = int(f.read())
    new_flag = False
    for val in article_nums:
        article_num = val.select_one("td", class_='num').get_text()
        if not article_num.isdecimal():
            continue
        article_num = int(article_num)
        if article_num > recent:
            new_flag = True
            recent = article_num
    if new_flag:
        channel = await client.fetch_channel(channel_id)
        await channel.send(recent)
        with open("recent.txt", "w") as f:
            f.write(str(recent))
            f.close()
...
@client.event
async def on_ready():
    await client.wait_until_ready()
    scheduler = AsyncIOScheduler()
    scheduler.add_job(read_site, 'interval', seconds=3, misfire_grace_time=5)
    scheduler.start()
    print(f'We have logged in as {client.user}')
client.run(token)
If anyone knows about this error, I would appreciate your help.
I saw a similar question on Stack Overflow, so I set misfire_grace_time in the add_job call, but that didn't solve it.
The "heartbeat blocked for more than 10 seconds" warning is caused by blocking code. While a blocking call runs, the event loop cannot run anything else, including the gateway heartbeat; if the call takes long enough, discord.py logs this warning.
In the code provided, requests.get(...) is definitely a blocking call, since the requests library is entirely synchronous. Consider replacing it with aiohttp.
Other libraries you use might also block. Replace them with asynchronous equivalents.
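To illustrate why a synchronous call starves the heartbeat, here is a small standard-library sketch. It is not the aiohttp approach suggested above: it keeps a blocking function and moves it off the event loop with asyncio.to_thread (Python 3.9+), which is an alternative technique. The names blocking_fetch, heartbeat and run_demo are hypothetical stand-ins; time.sleep plays the role of requests.get, and a plain coroutine plays the role of the gateway heartbeat.

```python
import asyncio
import time

def blocking_fetch(delay: float) -> str:
    """Hypothetical stand-in for requests.get(...): blocks the calling thread."""
    time.sleep(delay)
    return "page body"

async def heartbeat(ticks: list, interval: float, count: int) -> None:
    """Stand-in for a periodic heartbeat: records a timestamp every interval."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())

async def run_demo(offload: bool) -> float:
    """Return the largest gap between heartbeats while a fetch is running."""
    ticks = [time.monotonic()]
    beat = asyncio.create_task(heartbeat(ticks, 0.05, 6))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    if offload:
        # Runs in a worker thread, so the event loop keeps servicing heartbeat().
        await asyncio.to_thread(blocking_fetch, 0.3)
    else:
        # Runs on the event loop thread, so nothing else can be scheduled.
        blocking_fetch(0.3)
    await beat
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))

if __name__ == "__main__":
    print("largest gap, blocking call: ", asyncio.run(run_demo(offload=False)))
    print("largest gap, offloaded call:", asyncio.run(run_demo(offload=True)))
```

With the direct call, the first heartbeat is delayed until the fetch finishes; with the offloaded call, the gaps stay close to the heartbeat interval. In the bot, the same idea would apply to the requests.get call inside read_site, although switching to an asynchronous HTTP client as the answer suggests avoids the extra thread.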

Downloading image from url using python-requests receiving error 403:Forbidden

I'm trying to download an image, but the server responds with 403. I have tried other user agents, in case that makes a difference, but it doesn't help. The same code works fine with other links. I don't know how to get around the server's restriction.
import requests
import shutil
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
r = requests.get('https://s8.mkklcdnv6temp.com/mangakakalot/b1/bd926355/chapter_3/1.jpg',
                 headers=headers,
                 stream=True)
if r.status_code == 200:
    with open("img.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
else:
    print('failure')

Trying to http get "www.shopyourway.com" Fail

I tried to do an HTTP GET for http://www.shopyourway.com using Ruby's Net::HTTP.get, but I got an error with code 512. I also tried a GET over SSL for "https://www.shopyourway.com"; it just followed a redirect to the URL without SSL.
The code is below:
uri = URI('https://www.shopyourway.com')
body = Net::HTTP.get(uri)
I can open the URL in a browser, so why can't I do an HTTP GET for it?
Thanks
I finally got this to work. I needed to add a couple of headers to the GET request:
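require 'net/http'
uri = URI('http://www.shopyourway.com/today')
req = Net::HTTP::Get.new(uri)
req['Upgrade-Insecure-Requests'] = '1'
req['Connection'] = 'keep-alive'
req['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Send the request; the snippet as posted stopped after building it.
res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
body = res.body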

Effects of passing headers in a requests?

I want to know what difference it makes when you pass headers to requests.get, i.e. the difference between requests.get(url, headers=headers) and requests.get(url).
I have these two pieces of code:
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
page = requests.get(url)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]
The output of this is a URL, as you'd expect. But this:
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]
has an output that starts with 
I'm guessing this is the actual image without the rendering, just plain data. Any idea how I could keep it in url form? In what other ways does the presence of a header affect the response we get?
Thank You
Save the first snippet's response to an HTML file and open it in your browser. You will see that Amazon blocks the request when no headers are sent.
Use this XPath:
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@data-old-hires'
Output:
type: <class 'lxml.etree._ElementStringResult'>
http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg
This is the raw HTML:
<img alt=".." src="
..."
data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg"
The picture URL is in the data-old-hires attribute.
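As a rough illustration of reading that attribute, here is a sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra packages. ElementTree supports only a subset of XPath and needs well-formed markup, so this is not a drop-in replacement for the lxml code. The SAMPLE markup is hypothetical and simplified: in particular, the inline data value in src is an assumption based on the question's guess, not the real page content, and image_url is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the raw HTML quoted above.
# The src value is an assumed placeholder for inline image data.
SAMPLE = """
<div id="main-image-container">
  <img alt=".." src="data:image/jpeg;base64,AAAA"
       data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg" />
</div>
"""

def image_url(markup: str):
    """Return the image URL, preferring data-old-hires over an inline src."""
    root = ET.fromstring(f"<root>{markup}</root>")
    img = root.find(".//*[@id='main-image-container']//img")
    if img is None:
        return None
    hires = img.get("data-old-hires")
    if hires:
        return hires
    src = img.get("src", "")
    # An inline "data:" value is the image payload itself, not a URL.
    return None if src.startswith("data:") else (src or None)

if __name__ == "__main__":
    print(image_url(SAMPLE))
```

Checking data-old-hires first and falling back to src means the same function copes with both kinds of response described in the question.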

Not able to scrape more than 10 records using scrapy

I'm new to Scrapy and Python. I'm using Scrapy to scrape data.
The site uses AJAX for pagination, so I can't get more than 10 records. Here is my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re
class JustdialSpider(Spider):
    name = "JustdialSpider"
    allowed_domains = ["justdial.com"]
    start_urls = [
        "http://www.justdial.com/Mumbai/Dentists/ct-385543",
    ]
    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)
    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
        for question in questions:
            item = JustdialItem()
            item['name'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
            item['contact'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/b/text()').extract()
            with open('some.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(izip(item['name'], item['contact']))
                f.close()
            return item
    # if running code above this I'm able to get 10 records of the page
    # This code not working for getting data more than 10 records, Pagination using AJAX
    url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
    next_page = int(re.findall('page=(\d+)', url)[0]) + 1
    next_url = re.sub('page=\d+', 'page={}'.format(next_page), url)
    print next_url
    def parse_ajaxurl(self, response):
        # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
        my_headers = {'Referer': response.url}
        yield Request("ajax_request_url",
                      headers=my_headers,
                      callback=self.parse_ajax)
Please help me
Thanks.
Actually, if you disable JavaScript when viewing the page, you'll notice that the site offers traditional pagination instead of the "never-ending" AJAX one.
Using this, you can simply find the URL of the next page and continue:
def parse(self, response):
    questions = response.xpath('//div[contains(@class,"store-details")]')
    for question in questions:
        item = dict()
        item['name'] = question.xpath("h4/span/a/text()").extract_first()
        item['contact'] = question.xpath("p[@class='contact-info']//b/text()").extract_first()
        yield item
    # next page
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield Request(next_page)
I also fixed up your XPaths, but overall the only part that changed is the three lines under the # next page comment.
As a side note, I noticed you are saving to CSV inside the spider; you can use Scrapy's built-in exporter instead, with a command like:
scrapy crawl myspider --output results.csv
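The next-page lookup in the parse method above can be tried outside Scrapy with a standard-library sketch like the one below. It assumes the page exposes an anchor with rel="next", as the answer's XPath does; NextLinkParser, next_page_url and the sample href are hypothetical names and values chosen for illustration. It also resolves a relative href against the page URL, which inside Scrapy is what response.urljoin is for.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a rel="next"> element it sees."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("rel") == "next" and self.next_href is None:
            self.next_href = attributes.get("href")

def next_page_url(page_url: str, html: str):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Resolve relative links against the URL of the page they came from.
    return urljoin(page_url, parser.next_href)

if __name__ == "__main__":
    # Hypothetical markup; the real site's link format may differ.
    sample = '<a href="/Mumbai/Dentists/ct-385543/page-2" rel="next">Next</a>'
    print(next_page_url("http://www.justdial.com/Mumbai/Dentists/ct-385543", sample))
```

A crawl loop would keep calling next_page_url on each fetched page until it returns None, which mirrors the if next_page check in the answer.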
