Ruby Net::HTTP GET request gives a different response than the browser - ruby

I am trying to fetch data from an API server using Net::HTTP.
require 'net/http'
# String#green and String#blue come from a terminal-coloring gem
puts "#{uri}".green
response = Net::HTTP.new('glassdoor.com').start do |http|
  # the block's value (the HTTP response) becomes the value of start
  http.get(uri, { 'Accept' => 'application/json' })
end
puts "Res val: #{response.body}".blue
I copied the URI from the console and pasted it into the browser, and I received the JSON response.
But with the Ruby Net::HTTP GET I receive a security message instead.
Why the difference? The browser and the Ruby script are behind the same public IP.

You were detected as a crawler (correctly, by the way). Note that the two requests (from the browser and from the script) are not the same. The browser sends a number of headers, such as Accept-Language and User-Agent. You can inspect them with the web inspector tool in the browser. Your script, on the other hand, sets only the Accept header (and to JSON, which is suspicious on its own, as a browser navigating to a URL would not do that). You also do not set a user agent, so Net::HTTP sends its default, "Ruby". It's easy to see that this is an automated request, not natural traffic from a browser.
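As a rough illustration of the difference, the sketch below builds a GET request that carries the kind of headers a browser sends. The header values, the example URL, and the helper names (build_browser_like_get, fetch_like_browser) are assumptions made up for this example; copy the real values from your own browser's web inspector. A server may also look at other signals, so this may not be enough on its own.

```ruby
require 'net/http'
require 'uri'

# Hypothetical header values modeled on what a desktop browser might send.
BROWSER_LIKE_HEADERS = {
  'User-Agent'      => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
  'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-US,en;q=0.5'
}.freeze

# Build (but do not send) a GET request carrying the headers above.
# extra_headers can override or extend the defaults.
def build_browser_like_get(url, extra_headers = {})
  uri = URI.parse(url)
  Net::HTTP::Get.new(uri, BROWSER_LIKE_HEADERS.merge(extra_headers))
end

# Send the request and return the Net::HTTPResponse.
def fetch_like_browser(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_browser_like_get(url))
  end
end

# Inspect what would go over the wire without touching the network.
request = build_browser_like_get('https://example.com/api/items')
request.each_header { |name, value| puts "#{name}: #{value}" }
```

Printing the headers first makes it easy to compare them line by line with what the web inspector shows for the browser's request.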

Related

Get Instagram Access Token using Restful Ruby

I am trying to use the Instagram API to create a Rails background worker that queries hashtags. I don't need to log in any user other than myself, and I don't want to use a browser, just RESTful calls.
I'm trying to automate getting my access token in a Ruby script using the gem "rest-client" (https://github.com/rest-client/rest-client).
I can successfully navigate to the following URL in a browser and get the access token from the response URL:
https://www.instagram.com/oauth/authorize/?client_id=xxx&redirect_uri=xxx&response_type=token
But when I call response = RestClient.get(url) with the rest-client gem, response.headers['location'] is nil.
I have also tried the Instagram API URL, with no luck: https://api.instagram.com/oauth/authorize/?client_id=xxx&redirect_uri=xxx&response_type=code
Does anyone know how to get the access token completely programmatically in Ruby?
I think I'm missing the step that logs in the user (which will be me), and I'm not sure how to do this programmatically.
Can I use the Instagram API with a code or access token that never changes?
rest-client automatically follows redirects for GET requests, so the response you get back comes from the redirect target rather than from the 302 itself.
https://github.com/rest-client/rest-client/blob/master/lib/restclient/abstract_response.rb
I have never used the Instagram API, but the code below shows how to read "Location" from a 302 response.
# client (run it after the test server has started)
require 'net/http'
require 'uri'

url = URI.parse('http://localhost:2000')
res = Net::HTTP.start(url.host, url.port) do |http|
  http.get('/')
end
puts "#{res['Location']} is https://google.com"

# test server (run it in a separate process)
require 'socket'

server = TCPServer.new(2000)
loop do
  socket = server.accept
  # read the request headers up to the blank line
  while (header = socket.gets)
    break if header.chomp.empty?
    puts header.chomp
  end
  socket.print "HTTP/1.0 302 Found\r\n"
  socket.print "Location: https://google.com\r\n\r\n"
  socket.close
end
I hope that this information is helpful to you.
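For convenience, the same idea can be sketched as a single script that runs the throwaway server in a thread. This is only an illustration under a few assumptions: the server binds to 127.0.0.1 on a port picked by the OS, it handles exactly one connection, and the variable names are made up for this example.

```ruby
require 'net/http'
require 'socket'

# Throwaway local server: answers one connection with a 302 and a Location header.
server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
port = server.addr[1]

server_thread = Thread.new do
  socket = server.accept
  # Consume the request headers up to the blank line.
  while (line = socket.gets)
    break if line.chomp.empty?
  end
  socket.print "HTTP/1.0 302 Found\r\n" \
               "Location: https://google.com\r\n" \
               "Content-Length: 0\r\n\r\n"
  socket.close
end

# Net::HTTP does not follow redirects on its own, so the 302 comes back as-is.
# The third argument (nil) bypasses any proxy configured in the environment.
res = Net::HTTP.start('127.0.0.1', port, nil) { |http| http.get('/') }
server_thread.join
server.close

redirect_location = res['Location']
puts "#{res.code} -> #{redirect_location}"
```

Because the response is not followed, res['Location'] holds the redirect target, which is the value the question was trying to read.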

Why is the ruby mechanize gem giving a 403 response after logging in?

So, I'm trying to automate the downloading of images from a website that requires a login. The login form is on every page (in the browser you click "login" and a JavaScript slide-down reveals the form). I log in using the code below, and when I get to agent.get( "http://cdn.com/some_image.jpg" ), a 403 error is thrown. This doesn't happen when I log in through the browser and visit "http://cdn.com/some_image.jpg", so what is going on and how can I get around it?
require 'mechanize'

path = "http://www.example.com/some_path"
agent = Mechanize.new
page = agent.get(path) do |page|
  # fill in and submit the login form found on the page
  form = page.form_with(action: "http://www.example.com/authorize")
  username_field = form.field_with(name: "username")
  username_field.value = "some_user"
  password_field = form.field_with(name: "password")
  password_field.value = "password"
  form.submit
end
agent.get( "http://cdn.com/some_image.jpg" ).save "some_image.jpg" unless File.exist?("some_image.jpg")
Think about this: you submitted a login request, and then a request for the image. How does the server know that you are the person who logged in with the first request? Tracking by IP (could be shared or a proxy), port (wouldn't typically survive multiple requests), user agent (not unique), etc. obviously wouldn't work. Typically, login sessions are implemented using cookies: a web client is given a session token in the form of a cookie, which, when presented back to the server in a subsequent request, tells the server which session the request belongs to. This allows the server to track logins across what are otherwise stateless web requests.
There are other methods, but they mostly revolve around passing this token in another way (custom header, GET URL parameters, etc.), with the notable exception of signed web requests such as AWS uses (cool, but not very common for web logins). All in all, session cookies are by far the most common implementation.
Thus, I suggest you take a look at this post, as there seems to be a way of managing cookies within the mechanize gem for use with subsequent requests.
Maintaining cookies between Mechanize requests
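To make the mechanism concrete, here is a minimal sketch using plain Net::HTTP rather than Mechanize. It shows how the Set-Cookie values from a login response can be turned into a Cookie header for the next request. The helper names (cookie_header_from, fetch_with_session), the paths, and the cookie values are hypothetical, and cookie attributes such as Path, Domain, and Expires are deliberately ignored.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie values of a response into one Cookie request-header value.
# Only the leading name=value pair of each cookie is kept; attributes are dropped.
def cookie_header_from(set_cookie_values)
  set_cookie_values.map { |value| value.split(';', 2).first.strip }.join('; ')
end

# Hypothetical flow on an already opened Net::HTTP connection:
# post the login form, then present the session cookie with the next request.
def fetch_with_session(http, login_path, form_data, resource_path)
  login_response = http.post(login_path, URI.encode_www_form(form_data))
  set_cookies = login_response.get_fields('Set-Cookie') || []
  http.get(resource_path, 'Cookie' => cookie_header_from(set_cookies))
end

# Offline demonstration of the header conversion with made-up cookie values.
example = cookie_header_from([
  'session_id=abc123; Path=/; HttpOnly',
  'theme=dark; Max-Age=3600'
])
puts example # prints "session_id=abc123; theme=dark"
```

Without the Cookie header on the second request, the server has no way to associate it with the login, which matches the reasoning above.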
From a CDN, I would guess they're checking the User-Agent or Referer header.
Mechanize should be setting the referer properly, so that leaves the user agent.

Does caching interfere when the server responds with cookies to a GET request?

I have a resource (an html web page, but it could be anything else like json/xml describing a book) and retrieve it with a GET request:
http://127.0.0.1/welcome
This resource is in Japanese (because kawai desu). Now, I do a GET request on this resource, asking the server for another language:
http://127.0.0.1/welcome?lang=en
So the server responds with the English version of the resource. But from now on, since I called ?lang=en, I want to store the user's default language in a cookie. So the server adds a cookie to its response:
Set-Cookie: language=en
The browser now has the language=en cookie. Then, I ask for the resource without GET parameters and the server delivers the English version because the browser sent the Cookie: language=en request header:
http://127.0.0.1/welcome
Returns the English version.
These queries look to me like retrieval queries (a resource, with a cookie) that are idempotent (the result doesn't change when they are sent several times) and safe (they don't modify anything on the server): am I right to use GET requests even if they involve cookies?
Two GET requests have the same URI http://127.0.0.1/welcome but different results: how does caching (browser and proxy) handle this?
The GET response for http://127.0.0.1/welcome?lang=en could be cached too: will responses cached by a proxy/CDN or the browser include the language=en cookie (so that the user's language for the website switches to en)?

Is Ruby's net/http request visible to user?

I'm using Ruby's library for getting http pages (net/http), for example:
Net::HTTP.get URI.parse(uri)
Is this visible to the user somehow? I mean, can the user use Firebug (for example) to obtain the URI, or is it only handled by and visible to the server?
No, Net::HTTP requests are made on the server running Ruby. The user cannot monitor those requests unless they have access to the server or the server's network.

How do I send HTTP requests over a TCPSocket?

I've written a class in Ruby that acts as an HTTP client. The code is minimal, but the reason I'm not using 'net/http' is that this approach allows me to have more control over the requests being made, and the documentation for net/http is not helpful at all.
Anyway, the problem is the socket will only work for one request and response. Sending a second or subsequent request gives me an empty response.
For example:
Open connection to google
GET "/"
Response is the google.ca html
GET "/"
Response is empty
I tried closing and opening the connection between the requests but that only slowed it down and didn't fix the problem. I still got empty responses.
So what is the problem here?
Is there a method that lets me check to see if the TCPSocket object has an open connection so I don't accidentally open a new one?
Try:
require "socket"
host = "google.com"
port = 80
socket = TCPSocket.new(host, port)
request = "GET / HTTP/1.0\r\nHost: #{host}\r\n\r\n"
socket.print request
response = socket.read # reads until the server closes the connection
socket.close
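This will return Google's main page. With HTTP/1.0 the server closes the connection after the response, so a second request on the same socket gets nothing back. If you want to send request after request, change it to "HTTP/1.1", read one complete response, and then send the next request.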
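With HTTP/1.1 the connection stays open, so socket.read (which waits for the connection to close) is no longer a good way to read a response. A minimal sketch of reading exactly one response is shown below. It assumes the server sends a Content-Length header and does not handle chunked transfer encoding. The helper name read_http_response is made up for this example, and a StringIO stands in for the socket so the demonstration runs without a network.

```ruby
require 'stringio'

# Read one HTTP response from io and return [status_line, headers, body].
# Assumes a Content-Length header; chunked transfer encoding is not handled.
# Returns nil when the peer has closed the connection.
def read_http_response(io)
  status_line = io.gets("\r\n")&.chomp
  return nil if status_line.nil?

  headers = {}
  while (line = io.gets("\r\n"))
    line = line.chomp
    break if line.empty? # blank line ends the header section
    name, value = line.split(':', 2)
    headers[name.downcase] = value.to_s.strip
  end

  # Read exactly as many body bytes as the server announced.
  body = io.read(headers.fetch('content-length', '0').to_i)
  [status_line, headers, body]
end

# Two responses back to back, as they would arrive on a keep-alive connection.
stream = StringIO.new(
  "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nfirst" \
  "HTTP/1.1 200 OK\r\nContent-Length: 6\r\n\r\nsecond"
)
first = read_http_response(stream)
second = read_http_response(stream)
puts first[2]  # prints "first"
puts second[2] # prints "second"
```

With a real TCPSocket the pattern would be: print a request that uses "HTTP/1.1", call read_http_response(socket), then print the next request on the same socket.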
