Scrape peekyou.com (POST method) - AJAX

I am trying to scrape peekyou.com, which is a kind of people search engine. The site uses a PHP POST endpoint, so I am using the requests.post method of the requests library to fetch the results. Please see the output I am getting below.
Suppose a person's name is "john coasta"; then the target URL would be:
peekyou.com/john_coasta
import requests
import json

payload = {'formdata': {'md5': '4a9050a569e0f7d862b771926f7abc57',
                        'asynchronous': 'true'}}

req = requests.post('https://www.peekyou.com/shantanu_sharma',
                    data=payload,
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
                             'referer': 'https://www.peekyou.com/shantanu_sharma',
                             'server': 'Apache/2.4.33 (FreeBSD) OpenSSL/1.0.2k-freebsd mod_fastcgi/mod_fastcgi-SNAP-0910052141'})
print(req.content)
Although I am getting the full result as HTML, the data I am seeking is wrapped in characters like \n and \t inside every HTML tag (surprisingly, this whitespace-filled markup is the actual result), and I need decoded output. I don't use POST requests very often. Please suggest a solution.
Thanks in advance.

"the result which I am seeking for is encoded in the characters like \n\t"
Maybe the response is blank because you are doing something wrong?
When I opened that site, I found that it uses a lot of cookies, but you are not using any cookies. If you are sure you are doing everything correctly, use a tool like Chrome DevTools to see what happens after this POST request is made (using the browser):
see whether the browser is decoding/encoding anything, sending cookies, etc.
Edit: you are getting a blank response because, as I think, it is not encoded; the server is sending you this because you are doing something wrong in your POST request (judging by something I faced before).
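To make the cookie suggestion concrete, here is a minimal sketch (untested against the site, and assuming the "formdata" label seen in DevTools is just the panel name, so the fields are sent flat rather than nested): a requests.Session carries the cookies from an initial GET into the POST, and the \n and \t characters are simply the page's whitespace, which stops being visible once the HTML is parsed.
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'referer': 'https://www.peekyou.com/shantanu_sharma',
}

session = requests.Session()
# the initial GET collects whatever cookies the site sets
session.get('https://www.peekyou.com/shantanu_sharma', headers=headers)

# assumption: the form fields are sent flat, not nested under a 'formdata' key
payload = {'md5': '4a9050a569e0f7d862b771926f7abc57', 'asynchronous': 'true'}
resp = session.post('https://www.peekyou.com/shantanu_sharma', data=payload, headers=headers)

# resp.text is the decoded HTML; the \n and \t in resp.content are just page whitespace
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.get_text(' ', strip=True)[:500])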

Related

POST a JSON and a CSV in a multipart/form-data request through a NiFi 1.15 InvokeHTTP processor

I'm working on a NiFi 1.15 flow where I have to send a request to a service that requires two pieces of form data sent in a POST request as multipart/form-data. The first part is a simple JSON object with a few parameters, while the second part is a CSV file. Here is an example of what I am trying to achieve.
Content-Type: multipart/form-data; boundary=1cf28ed799fe4325b8cd0637a67dc612
--1cf28ed799fe4325b8cd0637a67dc612
Content-Disposition: form-data; name="json"; filename="json"
{"Param1": "Value1","Param2": "Value2","Param3": true}
--1cf28ed799fe4325b8cd0637a67dc612
Content-Disposition: form-data; name="file"; filename="body.csv"
Field1,Field2,Field3,Field4,Field5
VALUE_FIELD_1,VALUE_FIELD_2,VALUE_FIELD_3,"Some value for field 4",VALUE_FIELD_5
--1cf28ed799fe4325b8cd0637a67dc612--
Another acceptable output would have the Content-Disposition lines empty.
Due to a few restrictions in the environment I am working on, I am not allowed to use scripting processors, such as ExecuteGroovyScript as suggested in another SO question.
Instead, I started creating a GenerateFlowFile -> InvokeHTTP flow. The GenerateFlowFile processor writes text similar to the one shown above into a flow file. Here is the screenshot of the GenerateFlowFile.
The connected InvokeHTTP processor is configured to use the POST HTTP Method, to send headers (the Authorization header in my case), and Send Message Body is set to true. It also takes the Content-Type from the previously generated flow file attribute through the ${mime.type} expression. You can see the details in the following screenshots.
Sadly, this does not work. The server responds with an "Unexpected end of MIME multipart stream. MIME multipart message is not complete." error.
After searching for a while on SO, I found another question describing what looks like a similar problem, but there they are getting a different error and are also posting parameters through a different method.
I am also aware of the blog post from Otto Fowler where he shows how InvokeHTTP supports POSTs with multipart/form-data. I did try this approach too, but did not manage to get it working. The service throws an error stating that NiFi does not send one of my post:form parts.
Right now I am stuck and am not able to send that data. I did manage to write a simple Python script to test if the server is working properly and it is. For reference, here is the script:
import requests

server = 'https://targetserver.com'

# Authentication
result = requests.post(server + '/authentication',
                       {'grant_type': 'password',
                        'username': 'username',
                        'password': 'password'})
token = result.json()['access_token']

# Build the request
headers = {'Authorization': 'bearer ' + token}
json_data = '{"Param1": "Value1","Param2": "Value2","Param3": true}'

# First the JSON, then the CSV file.
files = {'json': json_data,
         'file': open('body.csv', 'rb')}

result = requests.post(server + '/endpoint', headers=headers, files=files)
print(result.text)
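One way to debug the NiFi side is to compare what a known-good client sends with what the flow file contains. The following is only a sketch based on the script above, with the same placeholder URL and values: it prepares (without sending) the multipart request and prints the exact Content-Type, including the boundary, and the body that requests would POST, which is what the GenerateFlowFile text and the InvokeHTTP Content-Type attribute would have to match.
import requests

# prepare (but do not send) the same multipart request, to inspect what a
# working client would put on the wire
req = requests.Request(
    'POST',
    'https://targetserver.com/endpoint',  # placeholder URL from the script above
    files={
        'json': ('json', '{"Param1": "Value1","Param2": "Value2","Param3": true}'),
        'file': ('body.csv', open('body.csv', 'rb')),
    },
)
prepared = req.prepare()

# the boundary shown here must also appear in the Content-Type that InvokeHTTP sends
print(prepared.headers['Content-Type'])
# each MIME part has its headers, then a blank line, then the payload, and the
# final boundary is terminated with "--"
print(prepared.body.decode('utf-8', errors='replace')[:800])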
Does anyone have a suggestion on how to get around this situation?

How can I scrape Google Search results in Heroku?

I want to scrape Google Search results on Heroku, but when I use simple requests and bs4 to scrape, my request gets blocked because of its cookies.
I have also attached an image of the Google response when searching from Heroku:
It might be because there's no user-agent specified. For example, the default requests user-agent is python-requests, so Google knows that it's a bot and not a "real" user visit.
Another thing that might happen is that you receive different HTML with some sort of error. A user-agent fakes a real user visit by adding this information to the HTTP request headers. Check what your user-agent is.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines that covers multiple solutions.
Pass a user-agent in the request headers:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('YOUR_URL', headers=headers)
Also, it's just a cookie consent banner; everything is accessible for scraping. If you want to remove it, pass a specific cookie (dev tools -> network -> fetch/xhr -> headers -> look for cookie) in the request headers, or remove the element with the bs4 decompose() method, which removes an element from the HTML tree (a small sketch follows below).
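As a small illustration of the decompose() idea (the selector is hypothetical; the real consent element's id or class has to be taken from dev tools):
from bs4 import BeautifulSoup

# stand-in HTML with a fake consent banner; the real selector must come from dev tools
html = '<div id="consent">cookie notice</div><div id="search">results</div>'
soup = BeautifulSoup(html, 'html.parser')

banner = soup.select_one('#consent')  # hypothetical selector
if banner:
    banner.decompose()                # removes the element from the HTML tree

print(soup)  # -> <div id="search">results</div>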
Example code to extract URLs (full example in the online IDE):
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
# these URL params are taken from the actual Google search URL
# and transformed into a more readable format
params = {
    "q": "samurai cop what does katana mean",  # query
    "gl": "us",                                # country to search from
    "hl": "en",                                # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
... other links
'''
Alternatively, you can achieve exactly the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference in your example is that you don't have to figure out how to bypass blocks from Google or other search engines, create a parser from scratch, or maintain it over time.
The only thing that really needs to be done is to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
    print()
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
... other links
'''
Disclaimer: I work for SerpApi.

What is the "accept" part for?

When connecting to a website using Net::HTTP you can parse the URL and output each of the request headers by using #each_header. I understand what the encoding, the user agent and such mean, but not the "accept"=>["*/*"] part. Is this the accepted payload? Or is it something else?
require 'net/http'

uri = URI('http://www.bible-history.com/subcat.php?id=2')
# => #<URI::HTTP http://www.bible-history.com/subcat.php?id=2>

http_request = Net::HTTP::Get.new(uri)
http_request.each_header { |header| puts header }
# => {"accept-encoding"=>["gzip;q=1.0,deflate;q=0.6,identity;q=0.3"], "accept"=>["*/*"], "user-agent"=>["Ruby"], "host"=>["www.bible-history.com"]}
From https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#z3
This field contains a semicolon-separated list of representation schemes ( Content-Type metainformation values) which will be accepted in the response to this request.
Basically, it specifies what kinds of content you can read back. If you write an API client, you may only be interested in application/json, for example (and you couldn't care less about text/html).
In this case, your header would look like this:
Accept: application/json
And the app will know not to send any HTML your way.
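For example (shown with Python's requests just to illustrate the header; the idea is the same with Net::HTTP, and httpbin.org is only a convenient test endpoint), the client announces that it wants JSON:
import requests

# ask the server for JSON only; whether it honors this is up to the server
resp = requests.get('https://httpbin.org/get', headers={'Accept': 'application/json'})
print(resp.request.headers['Accept'])    # application/json
print(resp.headers.get('Content-Type'))  # the representation the server chose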
Using the Accept header, the client can specify the MIME types they are willing to accept for the requested URL. If the requested resource is available in multiple representations (e.g. an image as PNG, JPG, or SVG), the user agent can specify that they want the PNG version only. It is up to the server to honor this request.
In your example, the request header specifies that you are willing to accept any content type.
The header is defined in RFC 2616.

Retrieve POST response data with cURL

In jQuery/AJAX you can make POST requests and get their response with something like
$.post(url, data, function(res){
    //do something
});
Where res contains the server's response. I am trying to replicate this with cURL in bash.
curl -d "data=data" --cookie cookies.txt --header "Content-Type:text/html" https://example.com/path > result.html
This returns gibberish (some sort of JS object, maybe?), but I am expecting HTML. Is there a way to retrieve the data that would be in res using cURL?
Thanks in advance.
Sometimes servers send back compressed content, which looks like random garbage. Use curl --compressed to get the decompressed result (see the example below).
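For instance, the command from the question with only that flag added:
curl --compressed -d "data=data" --cookie cookies.txt --header "Content-Type:text/html" https://example.com/path > result.html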
It seems that you are trying to get data from a POST request. The server returns a JavaScript object, and you are getting it the right way: it is JavaScript that converts the returned data to HTML, not cURL. So basically you can't do that.

Why do I get this 501 Not Implemented error?

I am performing the following AJAX call:
$(document).ready(function() {
    $.getJSON('https://sendgrid.com/api/user.stats.json',
        {
            'api_user': 'me#mydomain.com',
            'api_key': 'MYAPIKEY',
            'user': 'me#mydomain.com',
            'category': 'MY_CATEGORY'
        },
        function(response){
            alert('received response');
        }
    );
});
and I do not get the alert message I expect. Instead, Firebug says I get "501 Not Implemented."
Why? What do I need to do to fix this?
If I go to the URL corresponding to the AJAX call in Firebug, I get a JSON file as a download, and it contains the expected data.
One thing I've noticed is that Firebug says OPTIONS instead of GET:
(screenshot: http://grab.by/grabs/b1a13d92a4fc69aa310880a5d7a06b95.png)
I don't know if this is related, but generally when requesting JSON from the client to a server on a different domain you'll need to use JSONP instead of JSON due to the Same Origin Policy. Unfortunately, it doesn't appear that their API supports JSONP, so they must expect you to interact with their site from your server. In that case you'll need proxy methods on your server that translate the client's calls into calls to their API, so that the client only talks to a server in the same domain as the page (a sketch of that idea follows).
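A minimal sketch of that proxy idea, using Python/Flask purely for illustration (the route name is made up, and the credentials are the placeholders from the question); the page would then call /sendgrid-stats on its own domain with $.getJSON:
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/sendgrid-stats')
def sendgrid_stats():
    # the server-to-server request is not restricted by the browser's Same Origin Policy
    resp = requests.get('https://sendgrid.com/api/user.stats.json', params={
        'api_user': 'me#mydomain.com',   # placeholders copied from the question
        'api_key': 'MYAPIKEY',
        'user': 'me#mydomain.com',
        'category': 'MY_CATEGORY',
    })
    return jsonify(resp.json())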
As this is the top Google match for "jQuery 501 (Method not implement)" I thought I'd share what worked for me when experiencing this on the same domain (which is not your problem).
My problem was that I was not returning valid JSON, I was just returning "1". So to fix this, either:
Ensure you return valid JSON, or if you don't require a JSON response,
Swap your call to use $.ajax instead of $.getJSON, or
If you're already using $.ajax, remove dataType: "json"
Hope that helps some people.
I had the same problem, and realized it was an encoding problem. It was solved by encoding the values of the data sent to the server. Try something like:
$(document).ready(function() {
    $.getJSON('https://sendgrid.com/api/user.stats.json',
        {
            'api_user': encodeURIComponent('me#mydomain.com'),
            'api_key': encodeURIComponent('MYAPIKEY'),
            'user': encodeURIComponent('me#mydomain.com'),
            'category': encodeURIComponent('MY_CATEGORY')
        },
        function(response){
            alert('received response');
        }
    );
});
and then decode the data on the backend. Hope it helps someone.
