How can I scrape Google Search results on Heroku?

I want to scrape Google Search results on Heroku, but when I use a simple requests + bs4 script my request gets blocked because of a cookie consent page.
I have also attached an image of the Google response when searching from Heroku:

It might be because there's no user-agent specified. The default requests user-agent is python-requests, so Google knows it's a bot and not a "real" user visit.
Another thing that might happen is that you receive different HTML containing some sort of error. Passing a user-agent makes the request look like a real browser visit by adding that information to the HTTP request headers. Check what your user-agent is.
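You can check what requests sends by default like this:

import requests

# The default User-Agent is something like "python-requests/2.x.x",
# which is easy for Google to flag as a bot.
print(requests.utils.default_headers()['User-Agent'])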
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Pass a user-agent in the request headers:

import requests

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('YOUR_URL', headers=headers)
Also, it's just a cookie consent; everything is still accessible for scraping. If you want to remove it, pass a specific cookie (dev tools -> Network -> Fetch/XHR -> Headers - look for cookie) in the request headers, or remove the element with bs4's decompose() method, which removes an element from the HTML tree, as sketched below.
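A minimal sketch of the decompose() approach (the .consent-banner selector is hypothetical; inspect the real response to find the actual class or id of the consent element):

from bs4 import BeautifulSoup

html = '<div><div class="consent-banner">Before you continue...</div><p>results</p></div>'
soup = BeautifulSoup(html, 'lxml')

# Hypothetical selector for the consent block
banner = soup.select_one('.consent-banner')
if banner:
    banner.decompose()  # removes the element from the HTML tree

print(soup)  # the consent block is gone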
Example code to extract the URLs (a full example is available in the online IDE):
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
# these URL params are taken from the actual Google search URL
# and transformed to a more readable format
params = {
    "q": "samurai cop what does katana mean",  # query
    "gl": "us",                                # country to search from
    "hl": "en",                                # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
... other links
'''
Alternatively, you can achieve exactly the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference for your example is that you don't have to figure out how to bypass blocks from Google or other search engines, or create a parser from scratch and maintain it over time.
The only thing that really needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
    print()
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
... other links
'''
Disclaimer: I work for SerpApi.

Related

Scrape peekyou.com (using POST method)

Please see the output I am getting. I am trying to scrape peekyou.com, which is a kind of people search engine. It uses a PHP POST method, and I am using the requests.post method of the requests library to scrape the results.
Suppose a person's name is "john coasta"; then the target URL would be:
peekyou.com/john_coasta
import requests
import json

payload = {'formdata': {'md5': '4a9050a569e0f7d862b771926f7abc57',
                        'asynchronous': 'true'}}

req = requests.post('https://www.peekyou.com/shantanu_sharma',
                    data=payload,
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
                             'referer': 'https://www.peekyou.com/shantanu_sharma',
                             'server': 'Apache/2.4.33 (FreeBSD) OpenSSL/1.0.2k-freebsd mod_fastcgi/mod_fastcgi-SNAP-0910052141'})
print(req.content)
Although I am getting the full result in HTML form, the result I am looking for is encoded (I need decoded output) in characters like \n\t (inside every HTML tag; surprisingly this is the actual result). I don't use POST requests frequently. Please provide me with some solution.
Thanks in advance.
the result which I am seeking for is encoded in the characters like \n\t
Maybe the response is blank because you are doing something wrong?
When I opened that site, I found that it uses a lot of cookies, but you are not sending any. If you are sure that you are doing everything correctly, use a tool like Chrome dev tools to see what happens after making this POST request from the browser, and see whether the browser is decoding/encoding/sending cookies, etc.
Edit: You are getting a blank response. As I see it, the result is not encoded; the server is sending you this because something is wrong in your POST request (according to something I faced before!).
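If cookies turn out to be the issue, here is a sketch of forwarding them with requests; the cookie name and value below are hypothetical, so copy the real ones from dev tools -> Network -> the POST request -> Request Headers:

import requests

# Hypothetical cookie; replace with the real ones from dev tools.
cookies = {'PHPSESSID': 'your-session-id'}

with requests.Session() as session:
    # A priming GET lets the server set its own cookies on the session.
    session.get('https://www.peekyou.com/shantanu_sharma',
                headers={'User-Agent': 'Mozilla/5.0'})
    resp = session.post('https://www.peekyou.com/shantanu_sharma',
                        data={'md5': '4a9050a569e0f7d862b771926f7abc57',
                              'asynchronous': 'true'},
                        cookies=cookies,
                        headers={'X-Requested-With': 'XMLHttpRequest'})
    print(resp.status_code, len(resp.text))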

Passing request_body with GET request?

In this Elasticsearch GET query example, as I understand it, query_string is passed in the request body of a GET request, isn't it? But I believe we can't pass a request body with a GET request, so how can this example be valid?
GET /_search
{
  "query": {
    "query_string" : {
      "default_field" : "content",
      "query" : "this AND that OR thus"
    }
  }
}
In fact, when I used the "Copy as cURL" option from the linked page, I saw the following copied text:
curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string" : {
      "default_field" : "content",
      "query" : "this AND that OR thus"
    }
  }
}
'
Am I missing something here, or is something wrong in the example? I also do not see a way to send a request body with a GET request in the Postman tool.
The fact is that you can send a GET request with a body. The current HTTP standard rfc7231 (obsoletes rfc2616 and updates rfc2817) does not strictly define what must happen to a GET request with a body. The previous versions were different in this regard. For that reason, some HTTP servers allow it, but some others don't, I'm afraid. This case is mentioned in the latest standard as follows:
A payload within a GET request message has no defined semantics;
sending a payload body on a GET request might cause some existing
implementations to reject the request.
In terms of Elasticsearch, using GET for a search request is a design decision. They feel it makes more sense semantically, because it represents a data-retrieval action better than the POST verb.
On the other hand, as mentioned above, a GET request with a body is not supported universally. That's why Postman does not allow you to do so, although Kibana > Dev Tool does it by using cURL. Therefore, the Elasticsearch search API also supports POST requests to search and retrieve information. So, when you cannot make a GET request with a body, you can obtain exactly the same result by making a POST request.
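To see both behaviours from the client side, here is a sketch with Python's requests, which does allow a body on a GET (whether the server honours it is up to the server; Elasticsearch does, and it accepts the same body via POST as the portable fallback):

import requests

query = {
    "query": {
        "query_string": {
            "default_field": "content",
            "query": "this AND that OR thus"
        }
    }
}

# requests will happily attach a JSON body to a GET...
r = requests.get("http://localhost:9200/_search", json=query)

# ...and the portable fallback is the same body via POST.
r = requests.post("http://localhost:9200/_search", json=query)
print(r.json())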
This is actually a very interesting question. In fact, a lot of HTTP clients don't support GET requests with a body (I just recently found that an iOS client in Cocoa isn't able to do so).
I also had a lot of discussions about this with my colleagues. To me, after using Elasticsearch for a long time, GET with a body sounds like a perfectly fine HTTP request, though some may argue that GET shouldn't carry a body at all according to the HTTP standard. I will leave that discussion out of this answer.
In general this leads to a situation where, if you're using a client which does not support GET with a body, you can either change it to POST or switch to something else. I used to use cURL all the time, or Kibana Dev Tools if I needed to construct a complex query on the fly.

What is the "accept" part for?

When connecting to a website using Net::HTTP you can parse the URL and output each of the request headers using #each_header. I understand what the encoding, the user agent and such mean, but not what the "accept"=>["*/*"] part is. Is this the accepted payload? Or is it something else?
require 'net/http'

uri = URI('http://www.bible-history.com/subcat.php?id=2')
# => #<URI::HTTP http://www.bible-history.com/subcat.php?id=2>
http_request = Net::HTTP::Get.new(uri)
http_request.each_header { |header| puts header }
# => {"accept-encoding"=>["gzip;q=1.0,deflate;q=0.6,identity;q=0.3"], "accept"=>["*/*"], "user-agent"=>["Ruby"], "host"=>["www.bible-history.com"]}
From https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#z3
This field contains a semicolon-separated list of representation schemes ( Content-Type metainformation values) which will be accepted in the response to this request.
Basically, it specifies what kinds of content you can read back. If you write an API client, you may only be interested in application/json, for example (and you couldn't care less about text/html).
In this case, your header would look like this:
Accept: application/json
And the app will know not to send any HTML your way.
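For example (sketched in Python with requests, since httpbin.org simply echoes the request headers back), an API client that only wants JSON:

import requests

# Ask the server for JSON only.
r = requests.get('https://httpbin.org/headers',
                 headers={'Accept': 'application/json'})

print(r.headers['Content-Type'])      # what the server chose to send back
print(r.json()['headers']['Accept'])  # application/json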
Using the Accept header, the client can specify the MIME types it is willing to accept for the requested URL. If the requested resource is available in multiple representations (e.g. an image as PNG, JPG or SVG), the user agent can specify that it wants the PNG version only. It is up to the server to honor this request.
In your example, the request header specifies that you are willing to accept any content type.
The header is defined in RFC 2616.

WebApi: */* media type handling

In my application I'm making some JavaScript requests to my API controllers to get some HTML-formatted strings. Those requests are made with the Accept: */* HTTP header (jQuery's $.get method), so by default JsonMediaTypeFormatter is used and the data is returned with Content-Type: application/json in JSON format.
What I would like is to handle */* requests as text/html. So I tried to create a custom MediaTypeFormatter that supports the */* media type, but it gives me the following error:
The 'MediaTypeHeaderValue' of */* cannot be used as a supported
media type because it is a media range.
Alternatively I could always provide the correct expected data types in my requests, but I'm curious whether there's a way to handle */* media types.
The above behavior is due to the following:
The default con-neg algorithm in Web API has the following precedence order for choosing the formatter for the response:
1. Formatter match based on media type mapping.
2. Formatter match based on the request's Accept header media type.
3. Formatter match based on the request's Content-Type header media type.
4. Formatter match based on whether it can serialize the response data's Type.
Now, JsonMediaTypeFormatter comes with a built-in media type mapping called XmlHttpRequestHeaderMapping, which inspects an incoming request and checks whether it has the header x-requested-with: XMLHttpRequest and also whether there is no Accept header or the Accept header only contains */*.
Since your request most probably looks like the one below, according to the precedence order JsonMediaTypeFormatter is chosen as the formatter writing the response:
GET /api/something
Accept: */*
x-requested-with: XMLHttpRequest
A solution for your issue would be to explicitly ask for "text/html", as this is what you are expecting:
GET /api/something
Accept: text/html
x-requested-with: XMLHttpRequest
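Sketched from the client side in Python (the endpoint URL is hypothetical, and the text/html case assumes you have registered a formatter that can write HTML):

import requests

url = 'http://localhost:5000/api/something'  # hypothetical endpoint
common = {'X-Requested-With': 'XMLHttpRequest'}

# With Accept: */*, XmlHttpRequestHeaderMapping kicks in and the
# JSON formatter writes the response.
r1 = requests.get(url, headers={**common, 'Accept': '*/*'})
print(r1.headers.get('Content-Type'))  # application/json

# Explicitly asking for text/html bypasses that mapping.
r2 = requests.get(url, headers={**common, 'Accept': 'text/html'})
print(r2.headers.get('Content-Type'))  # text/html, given a matching formatter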
A couple of very old blog posts about content negotiation that I wrote:
http://blogs.msdn.com/b/kiranchalla/archive/2012/02/25/content-negotiation-in-asp-net-mvc4-web-api-beta-part-1.aspx
http://blogs.msdn.com/b/kiranchalla/archive/2012/02/27/content-negotiation-in-asp-net-mvc4-web-api-beta-part-2.aspx
Great question.
You can't set */* to be a supported media type, but what you can do is set your formatter to be the first one. Web API will pick the first formatter in the formatter collection that can write out the type if there is no Accept header or if the Accept header is */*.
So you'd want to configure your Web API like this:
config.Formatters.Insert(0, new MyHtmlFormatter());

Fetching only X/HTML links (not images) based on mime type

I'm crawling a site using Ruby + OpenURI + Nokogiri. I fetch a page, find all the a[href] links and (if they're in the same domain and use the right protocol) follow them to crawl again.
Sometimes there are links to large binaries (e.g. jpeg, exe), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong mime types like so:
require 'open-uri'
page = open(url, 'Accept'=>'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another mime type.
Other than looking at file extensions in the url for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?
You could send a HEAD request first, then check the Content-type header of the response and only make the real request if it’s acceptable:
ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}

uri = URI(url)
type = Net::HTTP.start(uri.host, uri.port) do |http|
  http.head(uri.path).content_type
end

if ACCEPTABLE_TYPES.include? type
  # fetch the url
else
  # do whatever
end
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.
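For comparison, the same HEAD-before-GET pattern sketched in Python with requests (error handling kept minimal):

import requests

ACCEPTABLE_TYPES = {'text/html', 'application/xhtml+xml', 'application/xml'}

def fetch_if_html(url):
    # HEAD retrieves the headers only; no body is downloaded.
    head = requests.head(url, allow_redirects=True)
    # Content-Type may carry a charset suffix, e.g. "text/html; charset=utf-8".
    content_type = head.headers.get('Content-Type', '').split(';')[0].strip()
    if content_type in ACCEPTABLE_TYPES:
        return requests.get(url).text
    return None  # skip binaries and other unwanted types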
