Using Ruby, how do I convert the short URLs (tinyURL, bitly etc) to the corresponding long URLs?
I don't use Ruby, but the general idea is to send an HTTP HEAD request to the server, which in turn will return a 301 response (Moved Permanently) whose Location header contains the long URI.
HEAD /5b2su2 HTTP/1.1
Host: tinyurl.com
Accept: */*
RESPONSE:
HTTP/1.1 301 Moved Permanently
Location: http://stackoverflow.com
Content-type: text/html
Date: Sat, 23 May 2009 18:58:24 GMT
Server: TinyURL/1.6
This is much faster than opening the actual URL, since you never need to fetch the redirect target's body. It also plays nice with the TinyURL service.
Look into any HTTP or curl API within Ruby; it should be fairly easy.
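For example, with the net/http standard library (a minimal sketch; the helper name is my own):

```ruby
require 'net/http'
require 'uri'

# Issue a HEAD request and return the Location header from the
# 301 redirect that short-URL services send back.
def expand_short_url(short_url)
  uri = URI.parse(short_url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.head(uri.path)['Location']
  end
end

# expand_short_url('http://tinyurl.com/5b2su2') would return the long URL
```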
You can use the httpclient rubygem to get the headers:
#!/usr/bin/env ruby
require 'rubygems'
require 'httpclient'
client = HTTPClient.new
result = client.head(ARGV[0])
puts result.header['Location']
There is a great wrapper for the bitly API in Python available here:
http://code.google.com/p/python-bitly/
So there must be something similar for Ruby.
Related
I was experimenting with the Ruby rest-client gem and ran into an "issue" so to speak. I noticed when I would hit a certain URL that should just return HTML, I would get a 404 error unless I specifically specified:
RestClient.get('http://www.example.com/path/path', accept: 'text/html')
However, pretty much any other page that I would hit without specifying the Accept header explicitly would return HTML just fine.
I looked at the source for the Request object located here and in the default_headers method around line 486 it appears that the default Accept header is */*. I also found the relevant pull request here.
I'm not quite sure why on a particular site (not all) I have to explicitly specify Accept: text/html when any other site that returns HTML by default does it without any extra work. I should note that other pages on this same site work fine when requesting the page without explicitly specifying text/html.
It's not a huge issue and I can easily work around it using text/html but I just thought it was a bit odd.
I should also note that when I use another REST client, such as IntelliJ's built-in one and specify Accept: */* it returns HTML no problem...
EDIT: Ok, this is a bit strange...when I do this:
RestClient.get('http://www.example.com/path/path', accept: '*/*')
Then it returns HTML as I expect it to but leaving off that accept: */* parameter doesn't work even though by default that header should be */* according to the source code...
I wonder if because my URL has /path/path in it, RestClient thinks it's an endpoint to some API so it tries to retrieve XML instead...
EDIT 2: Doing a bit more experimenting...I was able to pass a block to the GET request as follows:
RestClient.get('http://example.com/path/path') do |response, request, result|
  puts response.code
  puts request.processed_headers
end
And I get a 404 error and the processed_headers returns:
{"Accept"=>"*/*; q=0.5, application/xml", "Accept-Encoding"=>"gzip, deflate"}
The response body is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<hash>
<errors>Not Found</errors>
</hash>
So it is sending a */* header, but for some reason the application/xml gets priority. Maybe this is just something on the server side and out of my control? I guess I'm just not sure how that application/xml is even being added to the Accept header; I can't find anything skimming through the source code.
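That priority is ordinary HTTP content negotiation: each media type in Accept carries a quality value q (defaulting to 1.0), so application/xml at its implicit q=1.0 outranks */* at q=0.5. A toy Ruby sketch of that ranking:

```ruby
# Rank the media types in an Accept header by their q-values (default 1.0).
accept = "*/*; q=0.5, application/xml"

ranked = accept.split(',').map do |part|
  type, *params = part.split(';').map(&:strip)
  q = params.find { |p| p.start_with?('q=') }
  [type, q ? q.split('=').last.to_f : 1.0]
end.sort_by { |_, q| -q }

# ranked => [["application/xml", 1.0], ["*/*", 0.5]]
```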
Found the "problem". It looks like the PR I mentioned in my original post wasn't actually released until rest-client 2.0.0.rc1, which is still a release candidate, so it isn't obtainable via a plain gem update rest-client.
I used the following command to install 2.0.0.rc2:
gem install rest-client -v 2.0.0.rc2 --pre
Then referenced it in my code and it works now:
@request = RestClient::Request.new(:method => :get, :url => 'http://some/resource')
puts @request.default_headers[:accept]
Prints...
*/*
As expected now.
I have a program that uses an XMLHTTPRequest to gather contents from another web page.
Problem is, that web page has cloaking custom errors set up (i.e. /thisurl doesn't literally exist as a file on their web server; it is generated by the custom 404 error file), so it's not returning the page it shows in the browser. Instead, my HTTPRequest response contains the default 404 error response from that custom error page.
By using this website http://web-sniffer.net/ I have narrowed down what the problem may be, but I don't know how to fix it.
Web-sniffer has 3 different HTTP versions to submit the request with:
HTTP/1.1
HTTP/1.0 (with Host header)
HTTP/1.0 (without Host header)
When I use HTTP/1.1 or HTTP/1.0 (with Host header) I get the correct response (html) from the page. But when I use HTTP/1.0 (without Host header) it does not return the content, instead it returns a 404 error script (showing the custom error page).
So I have concluded that the problem may be due to the Host header not being present in the request.
But I am using MSXML2.XMLHTTP.3.0 and haven't been able to read the page using HTTP/1.1 or HTTP/1.0 (with Host header). The code looks like this:
Set objXML = Server.CreateObject("MSXML2.XMLHTTP.3.0")
objXML.Open "GET", URL, False
objXML.setRequestHeader "Host", MyDomain ' Doesn't work with or without this line
objXML.Send
Even after adding a Host header to the request, I still get the 404 error template from that custom error script in my response, the same as with the HTTP/1.0 (without Host header) option on the web-sniffer site. It should return 200 OK, as it does with the first two options on web-sniffer and in a web browser.
So I guess my question is: how is that website (web-sniffer.net) able to get the proper response with its first two HTTP version options, so I can emulate this in my app? I want to get the right page, but it only returns the 404 error from their 404 error template.
In response to an answerer, I have provided screenshots from 2 separate cURL requests below, one from each of my servers.
I executed the same cURL command against the same URL (which points to a site on the main host): curl -v -I www.site.com/cloakedfile . But it doesn't seem to work on the main server, where it needs to. It can't be a self-residing issue, because from secondary to secondary it works fine; these are identical applications/sites, just with different IPs/host names. It appears to be an internal issue that may not be about the application side of things.
I don't have any idea about MSXML2.XMLHTTP.3.0, but from your problem statement I understand that the issue is almost certainly due to some HTTP header field that is wrongly set or missing from your request.
By default, HTTP/1.1 clients set the Host header. For example, if you are connecting to google.com, the request will look like this:
GET / HTTP/1.1
Host: google.com
The Host header should contain the domain name of the server on which the requested resource resides. Servers that use virtual hosting will get confused if the Host: header is not present. This is what happens with groups.yahoo.com if you haven't specified a Host header:
$ nc groups.yahoo.com 80
GET / HTTP/1.1
HTTP/1.1 400 Host Header Required
Date: Fri, 06 Dec 2013 05:40:26 GMT
Connection: close
Via: http/1.1 r08.ycpi.inc.yahoo.net (ApacheTrafficServer/4.0.2 [c s f ])
Server: ATS/4.0.2
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Content-Language: en
Content-Length: 447
This should be the same issue you are facing. Also make sure that you are sending the domain name of the server from which you are trying to fetch the resource, and that the Host header uses a colon ":" to delimit the value, as in "Host: www.example.com".
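Although the question uses VBScript, the effect is easy to reproduce from Ruby with a raw socket (a sketch; the helper name is mine, and many virtually hosted servers will answer with a 400 or the wrong vhost when the header is missing):

```ruby
require 'socket'

# Hand-write an HTTP/1.1 request, optionally omitting the Host header,
# and return the status line the server answers with.
def status_line(host, with_host_header: true)
  sock = TCPSocket.new(host, 80)
  sock.write("GET / HTTP/1.1\r\n")
  sock.write("Host: #{host}\r\n") if with_host_header
  sock.write("Connection: close\r\n\r\n")
  status = sock.readline.chomp
  sock.close
  status
end

# status_line('groups.yahoo.com', with_host_header: false)
# would show something like "HTTP/1.1 400 Host Header Required"
```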
I already did some research in this field but didn't find any solution. I have a site where asynchronous AJAX calls are made to Facebook (using JSONP). I'm recording all my HTTP requests on the Ruby side with VCR, so I thought it would be cool to use this feature for AJAX calls as well.
So I played a little bit around, and came up with a proxy attempt. I'm using PhantomJS as a headless browser and poltergeist for the integration inside Capybara. Poltergeist is now configured to use a proxy like this:
Capybara.register_driver :poltergeist_vcr do |app|
  options = {
    :phantomjs_options => [
      "--proxy=127.0.0.1:9100",
      "--proxy-type=http",
      "--ignore-ssl-errors=yes",
      "--web-security=no"
    ],
    :inspector => true
  }
  Capybara::Poltergeist::Driver.new(app, options)
end
Capybara.javascript_driver = :poltergeist_vcr
For testing purposes, I wrote a proxy server based on WEBrick that integrates VCR:
require 'io/wait'
require 'webrick'
require 'webrick/httpproxy'
require 'rubygems'
require 'vcr'

module WEBrick
  class VCRProxyServer < HTTPProxyServer
    def service(*args)
      VCR.use_cassette('proxied') { super(*args) }
    end
  end
end

VCR.configure do |c|
  c.stub_with :webmock
  c.cassette_library_dir = '.'
  c.default_cassette_options = { :record => :new_episodes }
  c.ignore_localhost = true
end

IP = '127.0.0.1'
PORT = 9100

reader, writer = IO.pipe

@pid = fork do
  reader.close
  $stderr = writer
  server = WEBrick::VCRProxyServer.new(:BindAddress => IP, :Port => PORT)
  trap('INT') { server.shutdown }
  server.start
end
raise 'VCR Proxy did not start in 10 seconds' unless reader.wait(10)
This works well with every localhost call, and they get recorded properly: the HTML, JS and CSS files all end up in the cassette. Then I enabled the c.ignore_localhost = true option, because it's useless (in my opinion) to record localhost calls.
Then I tried again, but had to discover that the AJAX calls made on the page aren't recorded. Even worse, they don't work inside the tests anymore.
So to come to the point, my question is: why are all calls to JS files on localhost recorded, while JSONP calls to external resources are not? It can't be the JSONP itself, because it's a "normal" AJAX request. Or is there a bug inside PhantomJS, so that AJAX calls aren't proxied? If so, how could we fix that?
If it's running, I want to integrate the start and stop procedure inside
------- UPDATE -------
I did some research and came to the following point: the proxy has some problems with HTTPS calls and binary data through HTTPS calls.
I started the server, and made some curl calls:
curl --proxy 127.0.0.1:9100 http://d3jgo56a5b0my0.cloudfront.net/images/v7/application/stories_view/icons/bug.png
This call gets recorded as it should. The request and response output from the proxy is
GET http://d3jgo56a5b0my0.cloudfront.net/images/v7/application/stories_view/icons/bug.png HTTP/1.1
User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
Host: d3jgo56a5b0my0.cloudfront.net
Accept: */*
Proxy-Connection: Keep-Alive
HTTP/1.1 200 OK
Server: WEBrick/1.3.1 (Ruby/1.9.3/2012-10-12)
Date: Tue, 20 Nov 2012 10:13:10 GMT
Content-Length: 0
Connection: Keep-Alive
But this call doesn't get recorded; there must be some problem with HTTPS:
curl --proxy 127.0.0.1:9100 https://d3jgo56a5b0my0.cloudfront.net/images/v7/application/stories_view/icons/bug.png
The header output is:
CONNECT d3jgo56a5b0my0.cloudfront.net:443 HTTP/1.1
Host: d3jgo56a5b0my0.cloudfront.net:443
User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
Proxy-Connection: Keep-Alive
HTTP/1.1 200 OK
Server: WEBrick/1.3.1 (Ruby/1.9.3/2012-10-12)
Date: Tue, 20 Nov 2012 10:15:48 GMT
Content-Length: 0
Connection: close
So I thought maybe the proxy can't handle HTTPS, but it can (since I do get output on the console after the cURL call). Then I thought maybe VCR can't mock HTTPS requests. But with this script, VCR mocks HTTPS requests fine when I don't use it inside the proxy:
require 'net/http'
require 'uri'
require 'vcr'

VCR.configure do |c|
  c.hook_into :webmock
  c.cassette_library_dir = 'cassettes'
end

uri = URI("https://d3jgo56a5b0my0.cloudfront.net/images/v7/application/stories_view/icons/bug.png")

VCR.use_cassette('https', :record => :new_episodes) do
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  response = http.request_get(uri.path)
  puts response.body
end
So what is the problem? VCR handles HTTPS, and the proxy handles HTTPS. Why don't they play together?
So I did some research, and now I have a very basic example of a working VCR proxy server that handles HTTPS calls as a MITM proxy server (if you deactivate the security check in your client). I would be very happy if someone could contribute and help me bring this thing to life.
Here is the github repo: https://github.com/23tux/vcr_proxy
Puffing Billy is a very nice tool. You need to specify which domains to bypass and which URLs to stub. Stubbing HTTPS URLs is also a bit tricky: you need to include the port explicitly, as in https://www.example.com:443/path/
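A rough sketch of that setup with Puffing Billy (written from memory, so treat the names as assumptions and check the gem's README; the URL and response are placeholders):

```ruby
require 'billy/capybara/rspec'

# Hosts that should bypass Billy's proxy entirely (placeholder list).
Billy.configure do |c|
  c.whitelist = ['localhost', '127.0.0.1']
end

# Inside a spec: note the explicit :443 when stubbing an HTTPS URL.
proxy.stub('https://www.example.com:443/path/')
     .and_return(code: 200, body: '{"ok":true}', content_type: 'application/json')
```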
I have a list of ~150 URLs. I need to find out whether each domain resolves to www.domain.com or just domain.com.
There are multiple ways that a domain name could 'resolve' or 'redirect' to another:
Making an HTTP request for foo.com could respond with an HTTP redirect response code like 301, sending the browser to www.foo.com.
phrogz$ curl -I http://adobe.com
HTTP/1.1 301 Moved Permanently
Date: Mon, 30 Apr 2012 22:19:33 GMT
Server: Apache
Location: http://www.adobe.com/
Content-Type: text/html; charset=iso-8859-1
The web page sent back by the server might include a <meta> redirect:
<meta http-equiv="refresh" content="0; url=http://www.adobe.com/">
The web page sent back by the server might include JavaScript redirection:
location.href = 'http://www.adobe.com';
Which of these do you need to test for?
Reading HTTP Response Header
To detect #1 use the net/http library built into Ruby:
require "net/http"
req = Net::HTTP.new('adobe.com', 80)
response = req.request_head('/')
p response.code, response['Location']
#=> "301"
#=> "http://www.adobe.com/"
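To run the same check over a whole list of domains, a small sketch (the helper name and domains are placeholders):

```ruby
require 'net/http'

# Return where a domain's root path redirects, or nil if the server
# answers directly without a redirect.
def redirect_target(domain)
  response = Net::HTTP.new(domain, 80).request_head('/')
  response.is_a?(Net::HTTPRedirection) ? response['Location'] : nil
end

# %w[adobe.com example.com].each do |d|
#   puts "#{d} -> #{redirect_target(d) || 'no redirect'}"
# end
```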
Reading HTML Meta Headers
To detect #2, you'll need to actually fetch the page, parse it, and look at the contents. I'd use Nokogiri for this:
require 'open-uri' # …if you don't need #1 also, this is easier
html = open('http://adobe.com').read

require 'nokogiri'
doc = Nokogiri.HTML(html)
if meta = doc.at_xpath('//meta[@http-equiv="refresh"]')
  # Might give you "adobe.com" or "www.adobe.com"
  domain = meta['content'][%r{url=([^/"]+(\.[^/"])+)},1]
end
Reading JavaScript
…you're on your own, here. :) You could attempt to parse the JavaScript code yourself, but you'd need to actually run the JS to find out if it ever actually redirects to another page or not.
I've seen this done very successfully with the resolv std library.
require 'resolv'
["google.com", "ruby-lang.org"].map do |domain|
  [domain, Resolv.getaddress(domain)]
end
The mechanize way:
require 'mechanize'
Mechanize.new.head('http://google.com').uri.host
#=> "www.google.com.ph"
I have a Ruby script that goes and saves web pages from various sites. How do I make sure that it checks whether the server can send gzipped files, and saves them if available?
any help would be great!
One can send custom headers as a hash:
custom_request = Net::HTTP::Get.new(url.path, {"Accept-Encoding" => "gzip"})
You can then check the response by making the request:
response = Net::HTTP.new(url.host, url.port).start do |http|
  http.request(custom_request)
end
p response['Content-Encoding']
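Putting it together, a hedged sketch that asks for gzip, decompresses the body when the server complied, and falls back to the plain body otherwise (the helper name is my own, the URL a placeholder):

```ruby
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

# Fetch a page asking for gzip; transparently decompress if the server
# complied, otherwise return the body as-is.
def fetch_maybe_gzipped(url)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri.path, 'Accept-Encoding' => 'gzip')
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  if response['Content-Encoding'] == 'gzip'
    Zlib::GzipReader.new(StringIO.new(response.body)).read
  else
    response.body
  end
end

# File.write('page.html', fetch_maybe_gzipped('http://www.example.com/'))
```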
Thanks to those who responded...
You need to send the following header with your request:
Accept-Encoding: gzip,deflate
However, I am still learning Ruby and don't know the header syntax in the net/http library (which I assume you are using to make the request).
Edit:
Actually, according to the Ruby docs, it appears that this header is part of the default headers sent if you don't specify other Accept-Encoding headers.
Then again, as I said in my original answer, I am still just reading up on the subject, so I could be wrong.
For grabbing web pages and doing stuff with them, scRUBYt! is terrific.