How to request gzipped pages from web servers in a Ruby script? - ruby

I have a Ruby script that saves web pages from various sites. How do I make sure it checks whether the server can send gzipped files, and saves those when available?
Any help would be great!

Custom headers can be passed as a hash:
custom_request = Net::HTTP::Get.new(url.path, {"Accept-Encoding" => "gzip"})
You can then check the response by capturing it in a response object:
response = Net::HTTP.new(url.host, url.port).start do |http|
  http.request(custom_request)
end
p response['Content-Encoding']
Thanks to those who responded...
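One caveat: when Accept-Encoding is set by hand, Net::HTTP (Ruby 2.0 and later) leaves the body compressed, so the script has to decompress it before saving. A minimal sketch of that step, handling only gzip. The helper names decode_body and fetch_page are made up for this example and are not part of the original answer:

```ruby
require "net/http"
require "uri"
require "zlib"
require "stringio"

# Return the body as plain text, gunzipping it when the server
# reported Content-Encoding: gzip. Other encodings are passed through.
def decode_body(body, content_encoding)
  case content_encoding.to_s.downcase
  when "gzip", "x-gzip"
    Zlib::GzipReader.new(StringIO.new(body)).read
  else
    body
  end
end

# Fetch a page while explicitly asking for gzip (hypothetical helper).
def fetch_page(url_string)
  url = URI(url_string)
  request = Net::HTTP::Get.new(url.request_uri, { "Accept-Encoding" => "gzip" })
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == "https") do |http|
    http.request(request)
  end
  decode_body(response.body, response["Content-Encoding"])
end

# Usage (needs network access):
# File.write("page.html", fetch_page("http://www.example.com/"))
```

If the server ignores the header and sends an uncompressed page, Content-Encoding is absent and the body is saved unchanged.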

You need to send the following header with your request:
Accept-Encoding: gzip,deflate
However, I am still learning Ruby and don't know the header syntax in the net/http library (which I assume you are using to make the request).
Edit:
Actually, according to the Ruby docs, it appears that this header is part of the default headers sent if you don't specify another 'Accept-Encoding' header.
Then again, as I said in my original answer, I am still just reading up on the subject, so I could be wrong.
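The default can be checked without any network access. A small sketch, assuming Ruby 2.0 or later with zlib available; the variable names are illustrative only:

```ruby
require "net/http"

# A GET request built without an explicit Accept-Encoding header.
default_request = Net::HTTP::Get.new("/")
default_accept_encoding = default_request["Accept-Encoding"]

# A GET request where the caller sets Accept-Encoding by hand.
custom_request = Net::HTTP::Get.new("/", { "Accept-Encoding" => "gzip" })

# Header value that Net::HTTP filled in on its own.
puts default_accept_encoding

# Whether Net::HTTP will decompress the response body itself.
puts default_request.decode_content.inspect
puts custom_request.decode_content.inspect
```

In other words, leaving the header alone lets Net::HTTP negotiate and decompress transparently, while setting it manually hands decompression over to the script.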

For grabbing web pages and doing stuff with them, ScrubyIt is terrific.

Related

Copied and pasted Ruby code from Hubspot API but I get an HTTPUnsupportedMediaType415

I am simply trying to do an HTTP PUT request using a Ruby script, and I am literally copying and pasting 100% of the same thing from Hubspot's example. It's working in Hubspot's example, but not mine.
For example, here's the 99% full code from HubSpot API (with my API key redacted):
# https://rubygems.org/gems/hubspot-api-client
require 'uri'
require 'net/http'
require 'openssl'
url = URI("https://api.hubapi.com/crm/v3/objects/deals/4104381XXXX/associations/company/530997XXXX/deal_to_company?hapikey=XXXX")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
request = Net::HTTP::Put.new(url)
request["accept"] = 'application/json'
response = http.request(request)
puts response.read_body
When run from HubSpot's example page, the response is an HTTP 201, but in my Ruby script it gives me the following error:
=> #<Net::HTTPUnsupportedMediaType 415 Unsupported Media Type readbody=true>
I have tried directly copying and pasting the exact same thing, but no luck. I would copy what I'm using, but it's 100% the same code as above except for the redacted API, deal, and company IDs. I have copied and pasted HubSpot's example directly into my rails console, but I get an unsupported media type error.
I have also tried adding a body to the request, such as request.body = "hello", and nothing changed.
Any suggestion would be greatly appreciated.
After analyzing a working cURL request and the ruby script via BurpSuite, I determined that the following HTTP header in the request was the culprit:
Content-Type: application/x-www-form-urlencoded
The Ruby code in the original post sends this content type even though the user doesn't specify it: Net::HTTP adds it by default to requests that carry a body (such as PUT) when no Content-Type header is set.
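One way to avoid the default is to set Content-Type explicitly before sending. A minimal sketch, assuming the endpoint accepts application/json (an assumption, not something confirmed in this thread); build_put_request and the example.com URL are placeholders:

```ruby
require "net/http"
require "uri"

# Build a PUT request with an explicit Content-Type so that Net::HTTP
# does not fall back to application/x-www-form-urlencoded.
def build_put_request(url)
  request = Net::HTTP::Put.new(url)
  request["Accept"] = "application/json"
  request["Content-Type"] = "application/json" # assumed media type
  request
end

url = URI("https://api.example.com/some/resource") # placeholder URL
request = build_put_request(url)
puts request["Content-Type"]

# To send it (needs network access):
# http = Net::HTTP.new(url.host, url.port)
# http.use_ssl = true
# response = http.request(request)
```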

Does jsoup.connect().get() return cached Document?

I use jsoup and the following code to get the HTML content of a website: Document doc = Jsoup.connect(this.getUrl()).get();.
Do I get a cached version of the website? Is it possible to request a non-cached version? I know I could set a request header. Something like:
header("Cache-control", "no-cache");
header("Cache-store", "no-store");
But I’m not sure if that works. I only know that these headers are used by client browsers.
It would be awesome if someone could clarify. Greetings.
Any headers that you specify correctly (per the HTTP spec) will be sent to the target host via java.net.URLConnection.addRequestProperty(String, String). You should get a non-cached version of the page if the server supports this header, end-to-end. jsoup just supplies the headers with the request it makes, and when I looked through the source, it does not make any explicit effort to cache the response content.

Duplicated "Set-Cookie: ci_session" fields in header by CodeIgniter

For each time $this->session->set_userdata() or $this->session->set_flashdata() is used in my controller, another identical "Set-Cookie: ci_session=..." is added to the http header the server sends.
Multiple Set-Cookie fields with the same cookie name in the HTTP header are not okay according to RFC 6265.
So is there a way to use codeigniter sessions without it creating multiple identical "set-cookie:"s?
(I've used curl to verify the http header)
Check https://github.com/EllisLab/CodeIgniter/pull/1780:
By default when using the cookie session handler (encrypted or unencrypted), CI sends the entire "Set-Cookie" header each time a new value is written to the session. This results in multiple headers being sent to the client.
This is a problem because if too many values are written to the session, the HTTP headers can grow quite large, and some web servers will reject the response. (see http://wiki.nginx.org/HttpProxyModule#proxy_buffer_size)
The solution is to only run 'sess_save()' one time right after all other headers are sent before outputting the page contents.
I believe you can pass an array to $this->session->set_userdata(). I haven't tested this code, so it is merely a suggestion to try something along these lines:
$data = array(
'whatever' => 'somevalue',
'youget' => 'theidea'
);
$this->session->set_userdata($data);
NB: When I say I haven't tested the code, I mean that I have used this code and know it works; I just haven't tested whether it reduces the number of headers sent.
In my case, the error was in the browser (Chrome). It stored two cookies and sent both to the server, which made the server create a new session every time.
I fixed it by clearing the cookies in the browser.
Hope it helps someone. :)

VBScript: Disable caching of response from server to HTTP GET URL request

I want to turn off the cache used when a URL call to a server is made from VBScript running within an application on a Windows machine. What function/method/object do I use to do this?
When the call is made for the first time, my Linux based Apache server returns a response back from the CGI Perl script that it is running. However, subsequent runs of the script seem to be using the same response as for the first time, so the data is being cached somewhere. My server logs confirm that the server is not being called in those subsequent times, only in the first time.
This is what I am doing. I am using the following code from within a commercial application (don't wish to mention this application, probably not relevant to my problem):
With CreateObject("MSXML2.XMLHTTP")
.open "GET", "http://myserver/cgi-bin/nsr/nsr.cgi?aparam=1", False
.send
nsrresponse = .responseText
End With
Is there a function/method on the above object to turn off caching, or should I be calling a method/function to turn off caching on a response object before making the URL call?
I looked here for a solution: http://msdn.microsoft.com/en-us/library/ms535874(VS.85).aspx - not quite helpful enough. And here: http://www.w3.org/TR/XMLHttpRequest/ - very unfriendly and hard to read.
I am also trying to force not using the cache using http header settings and html document header meta data:
Snippet of server-side Perl CGI script that returns the response back to the calling client, set expiry to 0.
print $httpGetCGIRequest->header(
-type => 'text/html',
-expires => '+0s',
);
HTML document head settings in the response sent back to the client:
<html><head><meta http-equiv="CACHE-CONTROL" content="NO-CACHE"></head>
<body>
response message generated from server
</body>
</html>
The above http header and html document head settings haven't worked, hence my question.
I don't think that the XMLHTTP object itself even implements caching.
You send a fresh request as soon as you call .send() on it. The whole point of caching is to avoid sending requests, but that does not happen here (as far as your code sample goes).
But if the object is used in a browser of some sort, then the browser may implement caching. In this case the common approach is to include a cache-breaker in the request: a random URL parameter that you change every time you make a new request (for example, by appending the current time to the URL).
Alternatively, you can make your server send a Cache-Control: no-cache, no-store HTTP-header and see if that helps.
The <meta http-equiv="CACHE-CONTROL" content="NO-CACHE"> tag is probably useless and you can drop it entirely.
You could use WinHTTP, which does not cache HTTP responses. You should still add the cache control directive (Cache-control: no-cache) using the SetRequestHeader method, because it instructs intermediate proxies and servers not to return a previously cached response.
If you have control over the application targeted by the XMLHTTP Request (which is true in your case), you could let it send no-cache headers in the Response. This solved the issue in my case.
Response.AppendHeader("pragma", "no-cache");
Response.AppendHeader("Cache-Control", "no-cache, no-store");
As an alternative, you could also append a query string containing a random number to each requested URL.
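The cache-breaker idea from the answers above is language-independent. As an illustration only, here is a sketch in Ruby rather than VBScript; with_cache_buster and the _ts parameter name are made up for this example:

```ruby
require "uri"

# Append a changing query parameter so that caches between the client
# and the server treat every request as a new URL.
def with_cache_buster(url_string, param = "_ts", value = (Time.now.to_f * 1000).to_i)
  uri = URI(url_string)
  params = URI.decode_www_form(uri.query.to_s) # keep existing parameters
  params << [param, value.to_s]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts with_cache_buster("http://myserver/cgi-bin/nsr/nsr.cgi?aparam=1")
```

The server-side script can simply ignore the extra parameter; its only job is to make each URL unique.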

How can I print information about a Net::HTTPRequest for debug purposes?

I'm new to Ruby coming from Java. I'm trying to make a http get request and I'm getting an http response code of 400. The service I'm calling over http is very particular and I'm pretty sure that my request isn't exactly correct. It'd be helpful to "look inside" the req object after I do the head request (below) to double check that the request_headers that are being sent are what I think I'm sending. Is there a way to print out the req object?
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.head(pathWithScope, request_headers)
code = res.code.to_i
puts "Response code: #{code}"
I tried this: puts "Request Debug: #{req.inspect}" but it only prints this: #<Net::HTTP www.blah.com:443 open=false>
Use set_debug_output.
http = Net::HTTP.new(url.host, url.port)
http.set_debug_output($stdout) # Logger.new("foo.log") works too
That and more in http://github.com/augustl/net-http-cheat-sheet :)
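A slightly fuller sketch of the same idea. It enables set_debug_output before the connection is opened and also prints the headers of a request object before it is sent. dump_request, the example.com URL, and the X-Example header are placeholders for this example; note that the Ruby documentation warns against using set_debug_output in production code:

```ruby
require "net/http"
require "uri"

# List the method, path, and headers of a request that has not been
# sent yet (hypothetical helper).
def dump_request(request)
  lines = ["#{request.method} #{request.path}"]
  request.each_header { |name, value| lines << "#{name}: #{value}" }
  lines.join("\n")
end

url = URI("https://www.example.com/some/path") # placeholder URL
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = (url.scheme == "https")
http.set_debug_output($stderr) # call this before the connection starts

request = Net::HTTP::Head.new(url.request_uri, { "X-Example" => "1" })
puts dump_request(request)

# Sending the request (needs network access) writes the wire-level
# conversation to $stderr:
# response = http.request(request)
```

Header names come back lowercased from each_header, and Net::HTTP may add a few more headers at send time, which is where the debug output helps.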
If you want to see & debug exactly what your app is sending, not just see its log output, I've just released an open-source tool for exactly this: http://httptoolkit.tech/view/ruby/
It supports almost all Ruby HTTP libraries so it'll work perfectly for this case, but also many other tools & languages too (Python, Node, Chrome, Firefox, etc).
As noted in the other answer you can configure Net::HTTP to print its logs to work out what it's doing, but that only shows you what it's trying to do, it won't help you if you use any other HTTP libraries or tools (or use modules that do), and it requires you to change your actual application code (and remember to change it back).
With HTTP Toolkit you can just click a button to open a terminal, run your Ruby code from there as normal, and every HTTP request sent gets collected automatically.
