ruby http request freeze with SSL - ruby

I'm trying to download images with Ruby and ran into an interesting issue.
This is the part of my code that downloads an image (HTTP request only):
HTTParty.get(url)
or with
Net::HTTP.new(URI.parse(url))
and when I'm trying to download an image from Nike
url = 'https://c.static-nike.com/a/images/t_PDP_1728_v1/f_auto,b_rgb:f5f5f5/bfau7aauvleh5puvuiqa/zoom-pegasus-turbo-mens-running-shoe-Z163c3.jpg'
it works well
but for some reason it freezes when I request the Adidas URL:
url = 'https://www.adidas.com.sg/dis/dw/image/v2/bcbs_prd/on/demandware.static/-/Sites-adidas-products/default/dw0eb054ad/zoom/G27805_01_standard.jpg'
These are the logs I get:
SSL established
<- "GET /dis/dw/image/v2/bcbs_prd/on/demandware.static/-/Sites-adidas-products/default/dw0eb054ad/zoom/G27805_01_standard.jpg HTTP/1.1\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\nHost: www.adidas.com.sg\r\n\r\n"
I tried to switch off SSL verification with
verify: false,
but it doesn't solve my pain ¯\_(ツ)_/¯
However, it works well with curl -O for both URLs

There is filtering being done on the server side for the Adidas URL, likely to prevent automated scraping. At a minimum you must specify additional headers to successfully make a connection.
The following example successfully returns a response from the Adidas URL:
url = 'https://www.adidas.com.sg/dis/dw/image/v2/bcbs_prd/on/demandware.static/-/Sites-adidas-products/default/dw0eb054ad/zoom/G27805_01_standard.jpg'
headers = {
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' => 'br, gzip, deflate',
'Accept-Language' => 'en-us'
}
response = HTTParty.get(url, headers: headers)
=> #<HTTParty::Response:0x7fcb02856298 parsed_response="\xFF\xD8\xFF\xE0\x00\x10JFIF ...
These three headers are the only ones needed to get a response, but all three must be present.
You can see from the returned response that it is returning a JPEG, so this example should work as requested.
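For the Net::HTTP variant mentioned in the question, the same three headers can be set on a request object. The following is a minimal sketch, not a verified fix: the names IMAGE_HEADERS, build_image_request and fetch_image, and the 10-second timeout, are assumptions made for illustration. The explicit timeouts are there so that a filtered request raises Net::OpenTimeout or Net::ReadTimeout instead of appearing to freeze.

```ruby
require 'net/http'
require 'uri'

# The three headers from the answer above (the constant name is hypothetical).
IMAGE_HEADERS = {
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Encoding' => 'br, gzip, deflate',
  'Accept-Language' => 'en-us'
}.freeze

# Build the URI and a GET request that carries the headers.
def build_image_request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri, IMAGE_HEADERS)
  [uri, request]
end

# Send the request with explicit timeouts (10 seconds is an arbitrary choice).
def fetch_image(url, timeout: 10)
  uri, request = build_image_request(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = timeout # seconds to wait for the connection
  http.read_timeout = timeout # seconds to wait for each read
  http.request(request)
end
```

Calling fetch_image(url) returns a Net::HTTPResponse whose body holds the raw bytes. Note that setting Accept-Encoding by hand turns off Net::HTTP's automatic decompression, so a compressed body would have to be decoded manually.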

It's possible that they block requests when some specific headers are missing, so you might want to set some of them:
HTTParty.get(url, { headers: {
"User-Agent" => "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) FxiOS/7.0.4 Mobile/16B91 Safari/605.1.15",
"Accept-Language" => "en-US,en;q=0.9,bg;q=0.8",
"Accept-Encoding" => "gzip, deflate, br"
}
})

Related

Web scraping beginner. AJAX POST request not working

I have just started out with web scraping.
The data I need seems to be returned by an AJAX POST request. POST requests are very rarely covered by scraping tutorials and seem to come with lots of "gotchas" for new users like myself.
I copied the request from Chrome dev tools into Postman using cURL and then generated the Python request code. The request uses a peculiar set of query parameters. I have repeated this process several times, and the only parameter that changes is the session ID.
The problem is that the request stops working after some time has elapsed (Internal server error 500). I would then have to copy the request from the site again with the new session ID.
Any pointers in the right direction would be appreciated.
import requests
url = "https://online.natis.gov.za/gateway/PreBooking?_flowId=PreBooking&_flowExecutionKey=e1s2&flowName=[object%20Object]&_eventId_next=Next?dtoName=perSummaryDetailDto&viewId=perSummaryDetail&flowExecutionKey=e1s2&flowExecutionUrl=%2Fgateway%2FPreBooking%3F_flowId%3DPreBooking%26_flowExecutionKey%3De1s2&sessionId=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS&surname={SURNAME}&initials=R&firstName1={FIRSTNAME}&emailAddress={EMAIL}&cellN={CELL}&isWithinPriorityDate=false&viewPrioritySlots=false&showPrioritySlotsModal=false&provcdt=4&supportUser=false"
payload = {}
headers = {
'Connection': 'keep-alive',
'Content-Length': '0',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'Accept': 'application/json, text/plain, */*',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'Origin': 'https://online.natis.gov.za',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://online.natis.gov.za/',
'Accept-Language': 'en-US,en;q=0.9',
'Cookie': 'JSESSIONID=IWhelPTLyYDa7JohJV6x8So_qEKdC8wOknArAXkS.master:gateway_3; Gateway=R35619282; ROUTEID.33f40c02f95309866c572c0def16f016=.node1; JSESSIONID=BadmtwJ7c8YWEz73xe6Wu165Q7gapmm4WTY6at-p.master:gateway_3; Gateway=R35619282',
'dnt': '1',
'sec-gpc': '1'
}
response = requests.request(
"POST", url, headers=headers, data=payload, verify=False)
print(response.text)

Trying to replicate Mobile App POST Request in Ruby, Getting 502 Gateway Error

I'm trying to automate actions I can take manually in an iPhone app using Ruby, but when I do, I get a 502 bad gateway error.
Using Charles Proxy I got the request the iPhone app is making:
POST /1.1/user/-/friends/invitations HTTP/1.1
Host: redacted.com
Accept-Locale: en_US
Accept: */*
Authorization: Bearer REDACTED
Content-Encoding: gzip
Accept-Encoding: br, gzip, deflate
Accept-Language: en_US
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Content-Length: 66
Connection: keep-alive
X-App-Version: 814
invitedUserId=REDACTED&source=PROFILE_INVITATION
I wrote the following code in Ruby to send this same request:
@header_post = {
"Host" => "redacted.com",
"Accept-Locale" => "en_US",
"Accept" => "*/*",
"Authorization" => "Bearer REDACTED",
"Content-Encoding" => "gzip",
"Accept-Encoding" => "br, gzip, deflate",
"Accept-Language" => "en_US",
"Content-Type" => "application/x-www-form-urlencoded; charset=UTF-8",
"Connection" => "keep-alive",
"X-App-Version" => "814"
}
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
path = '/1.1/user/-/friends/invitations'
data = "invitedUserId=REDACTED&source=PROFILE_INVITATION"
resp, data = http.post(path, data, @header_post)
Unfortunately I get a 502 Bad Gateway Error when running this code.
One thing I noticed which I think is key to the solution here is that, in the POST request the mobile app is making, the content length is 66. But the length of the string "invitedUserId=REDACTED&source=PROFILE_INVITATION" with un-redacted userId is only 46.
Am I missing another form variable with format "&param=value" which has length 20? Or am I missing something else?
Thank you in advance!
This is probably not directly tied to the length of the body you're sending.
I see two possible problems here:
The 502 error: are your uri.host and port correct? A 502 error means something is wrong on the server side. Also try removing the Host header.
The body content is not gzipped: you're sending a Content-Encoding: gzip header, but you didn't compress the data (Net::HTTP doesn't do that automatically).
Try something like this:
require "net/http"
require "zlib"
require "stringio"
@header_post = {
# ...
}
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
path = '/1.1/user/-/friends/invitations'
data = "invitedUserId=REDACTED&source=PROFILE_INVITATION"
# instantiate a new gzip writer backed by an in-memory buffer
gzip = Zlib::GzipWriter.new(StringIO.new)
# append your data
gzip << data
# close the writer to flush it, then read the gzipped body from the buffer
body = gzip.close.string
resp = http.post(path, body, @header_post)
Alternatively, the server may accept non-gzipped content. You could test that simply by deleting the Content-Encoding header from your original code.
However, if that were the only mistake, the server should respond with a 4xx error rather than a 502. So I'm guessing there is another issue with the URI configuration, as I suggested above.
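The compression step can be checked locally before sending anything. This is a minimal sketch under stated assumptions: the helper names gzip_body and gunzip_body are hypothetical, and the round trip only verifies that the body is valid gzip, not that the server will accept it. The gzip format adds a 10-byte header and an 8-byte trailer, so the compressed form of a short string is expected to be longer than the plain string; whether that accounts for the Content-Length of 66 seen in the captured request is a guess, not something confirmed here.

```ruby
require 'zlib'
require 'stringio'

# Compress a request body with gzip, as a "Content-Encoding: gzip"
# header promises to the server.
def gzip_body(data)
  buffer = StringIO.new(String.new) # empty binary in-memory buffer
  writer = Zlib::GzipWriter.new(buffer)
  writer << data
  writer.close # writes the gzip trailer and closes the buffer
  buffer.string
end

# Decompress again, only to verify the round trip locally.
def gunzip_body(body)
  Zlib::GzipReader.new(StringIO.new(body)).read
end

data = 'invitedUserId=REDACTED&source=PROFILE_INVITATION'
body = gzip_body(data)
puts "plain: #{data.bytesize} bytes, gzipped: #{body.bytesize} bytes"
puts "round trip ok: #{gunzip_body(body) == data}"
```

When the body is sent with http.post, Net::HTTP computes Content-Length from the compressed string, so the header does not need to be set by hand.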

Redirect After AJAX Request even though Headers and Data Attributes are passed correctly with Python3

On a government site I managed to log in with my credentials (specified as a Python dictionary in login_data) as follows:
with requests.Session() as s:
    url = 'https:......../login'
    r = s.get(url, data=login_data, headers=headers, verify=False)
    r = s.post(url, data=login_data, headers = headers, verify=False)
    print(r.content)
which displays HTML:
b'<!DOCTYPE html..... and if I search for my username I find <span class="rich-messages-label msg-def-inf-label">Welcome, USER..XYZ!<, from which I conclude that the login was successful.
Next I want to proceed to the search subsite (url = 'https:......./search) of the site I'm now logged in to. This subsite allows me to search the government records for an incident (incident-ID) on a given date (start_date, end_date).
Because the login succeeded, I tried the following:
with requests.Session() as s:
    url = 'https:......../search'
    r = s.get(url, data=search_data, headers=headers, verify=False)
    r = s.post(url, data=search_data, headers = headers, verify=False)
    print(r.content)
Beforehand, I defined search_data based on the Network and Headers tabs of the Google Chrome Inspector:
search_data = {
'AJAXREQUEST': '_viewRoot',
'theSearchForm': 'theSearchForm',
'incident-ID' : '12345',
'start_date' : '05/03/2019 00:00:00 +01:00',
'end_date' : '05/03/2019 23:59:59 +01:00',
}
and I specified headers to include more than just the user agent:
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
'Connection': 'keep-alive',
'Cookie': 'JSESSIONID=8351xxxxxxxxxxxxFD5; _ga=GA1.2.xxxxxxx.xxxxxxxx',
'Host': 'somehost...xyz.eu',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
}
So far the setup should be fine, no? But I ran into a problem: print(r.content) doesn't give me HTML like it did after the login, but something disappointingly short: b'<?xml version="1.0" encoding="UTF-8"?>\n<html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Ajax-Response" content="redirect" /><meta name="Location" content="home.seam?cid=3774801" /></head></html>
It's a pity, because I can see in the inspector that the response to the POST request in the browser contains exactly the data I am looking for. Similarly, the first POST request in the browser yields exactly the same data as my Python command r = s.post(url, data=login_data, headers = headers, verify=False). But the print(r.content) output, as already said, seems to be a redirect that only brings me back to the login site, stating that I'm already logged in.
To sum up:
The first requests.Session get and post worked (I get the same response HTML as in the Google Chrome Inspector).
The second requests.Session post doesn't work: it just yields a weird redirect (but I get the correct response in the Google Chrome Inspector).
What am I missing??? Please Help! :S

Ruby Mechanize, fill dynamic Form / Send JSON (Airbnb calendar)

My goal
I'm trying to update my Airbnb calendar using Ruby. For example, here is the URL of a calendar: https://www.airbnb.com/manage-listing/ROOM_ID/calendar
The issue
If you already use Airbnb, you know that to update your calendar you have to click on the start date, then on the end date, and after that a form pops up.
So when I use Mechanize to get the page content, this form is not loaded and doesn't appear (even the calendar is loaded dynamically, and I can't simulate the clicks either), which makes it impossible to use basic Mechanize form filling.
What I did so far
I tried to use the developer tools from Chrome to check the Network. When I update my calendar using Chrome, there is one JSON PUT at https://www.airbnb.com/api/v2/calendars/ROOM_ID/START_DATE/END_DATE?_format=host_calendar&t=1427377357561&key=d306zoyjsyarp7ifhu67rjxn52tv0t20 with some JSON data such as days, availability, price...
My first attempt was to reproduce this JSON call with the following code:
data = { "event_name" => "calendar",
"event_data" => { "page_uri" => "/manage-listing/ROOM_ID/calendar",
"controller" => "rooms",
"action" => "manage_listing",
"hosting_id" => ROOM_ID,
"start_date" => "2015-03-26",
"end_date" => "2015-03-29",
"available" => true,
"native_price" => 111,
"native_currency" => "EUR"
}
}
page = agent.post 'https://www.airbnb.com/api/v2/calendars/ROOM_ID/2015-03-26/2015-03-29?_format=host_calendar&t=1427374574309&key=d306zoyjsyarp7ifhu67rjxn52tv0t20', data.to_json, {'Content-Type' => 'application/json'}
But I get a 404 response :
Mechanize::ResponseCodeError (404 => Net::HTTPNotFound for https://www.airbnb.com/api/v2/calendars/ROOM_ID/2015-03-26/2015-03-29?_format=host_calendar&t=1427374574309&key=d306zoyjsyarp7ifhu67rjxn52tv0t20 -- unhandled response)
Do you have any suggestions to either send the form even if it is not on the page content, or POST the request with JSON ?
Thanks for your help
Here is the complete JSON call from Chrome :
General
Remote Address:xx.xx.xx.xx:xx
Request URL:https://www.airbnb.com/api/v2/calendars/ROOM_ID/2015-03-26/2015-03-29?_format=host_calendar&t=1427379998507&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr-CA
Request Method:PUT
Status Code:200 OK
Response Headers
cache-control:max-age=0, private, must-revalidate
connection:keep-alive
content-encoding:gzip
content-length:236
content-type:application/json; charset=utf-8
date:Thu, 26 Mar 2015 14:26:46 GMT
etag:W/"10845765865e36a6ccb1541bbda1c2a7"
server:nginx/1.7.7
status:200 OK
status:200 OK
strict-transport-security:max-age=10886400; includeSubdomains
vary:Accept-Encoding
version:HTTP/1.1
x-frame-options:SAMEORIGIN
x-hi-human:The Production Infrastructure team added this header. Come work with us! Email kevin.rice+hiring@airbnb.com
x-ua-compatible:IE=Edge,chrome=1
x-xss-protection:1; mode=block
Request Headers
:host:www.airbnb.com
:method:PUT
:path:/api/v2/calendars/ROOM_ID/2015-03-26/2015-03-29?_format=host_calendar&t=1427379998507&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr-CA
:scheme:https
:version:HTTP/1.1
accept:application/json, text/javascript, */*; q=0.01
accept-encoding:gzip, deflate, sdch
accept-language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
content-length:59
content-type:application/json
cookie:__ssid=4166c81a-49bd-4826-ac44-08307c5700ca; _csrf_token=V4%24.airbnb.ca%24CL1nNdfYkF0%24ulPyJJJWr1h6CvuBMf32YcXtnZssDud3_CqBQoqXOU0%3D; li=1; roles=0; _airbed_session_id=dfa72c17e6d014f9fd0b9705d097e5d8; flags=4027711488; EPISODES=s=1427377914349&r=https%3A%2F%2Ffr.airbnb.ca%2Fmanage-listing%2F5780104%2Fcalendar; _ga=GA1.2.1981489078.1427272843; fbs=not_authorized; _pt=1--WyJjZmYxZmE4N2RhOTU4NGNhYzhhN2M5YTIyNzkyMDliMDI0YTk1YWEzIl0%3D--2890e7d8df5181677516659fbdc4761e6de82a61; bev=1427272835_bw8KI59ELTQAsMt3; _user_attributes=%7B%22curr%22%3A%22EUR%22%2C%22guest_exchange%22%3A0.9134%2C%22id%22%3A29905162%2C%22hash_user_id%22%3A%22cff1fa87da9584cac8a7c9a2279209b024a95aa3%22%2C%22eid%22%3A%22FBPqvskr4MN1Rnpqf-oY-lG7-VNdCJVSYwUMUtm6YyOXzEpbRvmU9FWTxKNdf0UA%22%2C%22num_msg%22%3A0%2C%22num_h%22%3A1%2C%22name%22%3A%22St%C3%A9phane%22%2C%22is_admin%22%3Afalse%2C%22can_access_photography%22%3Afalse%7D
origin:https://www.airbnb.com
referer:https://www.airbnb.com/manage-listing/ROOM_ID/calendar
user-agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36
x-csrf-token:V4$.airbnb.ca$CL1nNdfYkF0$ulPyJJJWr1h6CvuBMf32YcXtnZssDud3_CqBQoqXOU0=
x-requested-with:XMLHttpRequest
Query String Parameters
_format:host_calendar
t:1427379998507
key:d306zoyjsyarp7ifhu67rjxn52tv0t20
currency:EUR
locale:fr-CA
Request Payload
{availability: "available", daily_price: "999", notes: ""}
availability: "available"
daily_price: "999"
notes: ""
I succeeded in updating the calendar of my room by using a JSON PUT request. Here is what I did.
The data looks like this:
data = { "availability" => availability,
"daily_price" => price,
"notes" => note
}.to_json
Retrieve the cookie :
cookie_csrf_token = ''
cookie_airbed_session_id = ''
agent.cookie_jar.each do |value|
if value.to_s.include? "_csrf_token"
cookie_csrf_token = value.to_s
elsif value.to_s.include? "_airbed_session_id"
cookie_airbed_session_id = value.to_s
end
end
The headers :
headers = { 'X-CSRF-Token' => URI.unescape(cookie_csrf_token.scan(/=(.*)/).join(",")),
'Content-Type' => 'application/json',
'Cookie' => "#{cookie_csrf_token}; #{cookie_airbed_session_id}"
}
The only cookies you need are _csrf_token and _airbed_session_id, which are related to each other. My mistake was to use the csrf_token from the login page... You can find these cookies in the cookie_jar of your Mechanize agent.
After that you will need to construct your URL. The URL has a particular parameter called "key". You can retrieve it from a meta tag (id='_bootstrap-layout-init') on your calendar page. To do that I used Nokogiri combined with a regex:
param_t = Time.now.to_i
param_key = nil # declared outside the block so it is still visible afterwards
noko.xpath("//meta[@id='_bootstrap-layout-init']/@content").each do |attr|
param_key = attr.value[/key":"(.*?)"/, 1]
end
Now you are ready to update your calendar:
url = "https://www.airbnb.com"
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
# Send the PUT request to update the calendar
res = http.start { |req|
req.send_request('PUT', "/api/v2/calendars/#{room_id}/#{start_date}/#{end_date}?_format=host_calendar&t=#{param_t}&key=#{param_key}", data, headers)
}
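The key extraction and URL construction described above can be tried in isolation, without Nokogiri or a live session. This is only a sketch: the helper names extract_api_key and calendar_path are hypothetical, and the sample meta content, room ID and key are made-up values, not real Airbnb data.

```ruby
require 'uri'

# Pull the "key" value out of the meta tag's content attribute,
# using the same regex as the answer above.
def extract_api_key(meta_content)
  meta_content[/key":"(.*?)"/, 1]
end

# Build the request path for the calendar PUT call.
def calendar_path(room_id, start_date, end_date, key, timestamp = Time.now.to_i)
  query = URI.encode_www_form({ '_format' => 'host_calendar', 't' => timestamp, 'key' => key })
  "/api/v2/calendars/#{room_id}/#{start_date}/#{end_date}?#{query}"
end

# Hypothetical content attribute, for illustration only.
sample_content = '{"locale":"en","key":"abc123"}'
sample_key = extract_api_key(sample_content)
puts calendar_path(42, '2015-03-26', '2015-03-29', sample_key, 1427377357)
```

Building the query with URI.encode_www_form rather than string interpolation means a key containing reserved characters would be escaped properly.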

Get HTML source of a https page by forcing a user agent in Ruby

>>require 'net/https'
>>uri = URI('https://www.facebook.com/careers/department?dept=product-management&req=a2KA0000000E147MAC')
>>conn = Net::HTTP.new(uri.host, uri.port)
>>req = Net::HTTP::Get.new(uri.request_uri, {'User Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1'})
>>resp = conn.request req
=> #<Net::HTTPFound 302 Found readbody=true>
The 302 redirect returned by the website points to an 'unsupported browser' page. What am I doing wrong in setting the user agent for this request? I'm using the same user agent string that my browser sends.
Additional info: I cannot use libraries such as watir in this use case. Any solution by using either 'net/http[s]' or 'open-uri' would be awesome.
Change 'User Agent' to 'User-Agent' with a hyphen.
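The difference can be inspected on the request object without sending anything. This is a small sketch; the constant name BROWSER_UA and the variable names are assumptions for illustration. With the misspelled key, the real User-Agent header is never set to the browser string, and Net::HTTP is expected to fall back to its own default value instead.

```ruby
require 'net/http'
require 'uri'

BROWSER_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.1 ' \
             '(KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1'

uri = URI('https://www.facebook.com/careers/department?dept=product-management&req=a2KA0000000E147MAC')

# Header name with a space: a different header, not User-Agent.
wrong = Net::HTTP::Get.new(uri.request_uri, { 'User Agent' => BROWSER_UA })
# Header name with a hyphen: the header servers actually read.
right = Net::HTTP::Get.new(uri.request_uri, { 'User-Agent' => BROWSER_UA })

puts "with a space:  #{wrong['User-Agent'].inspect}"
puts "with a hyphen: #{right['User-Agent'].inspect}"
```

Neither request is sent here; passing right to conn.request (with conn.use_ssl = true for an https URL) would perform the actual call.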
