Can I get the date when an HTTP file was modified? - ruby

I'm trying to check whether a file (on the web) was modified since the last time I checked. Is it possible to do this by reading the HTTP headers to find the last time the file was modified (or uploaded)?

You can use the built-in Net::HTTP library to do most of this for you:
require 'net/http'

Net::HTTP.start('stackoverflow.com') do |http|
  response = http.request_head('/robots.txt')
  response['Last-Modified']
  # => "Sat, 04 Jun 2011 08:51:44 GMT"
end
If you want, you can convert that to a proper date using Time.parse.
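For example (Time.parse lives in the standard library's time extension, so it needs its own require; last_checked_at below is just a placeholder for whatever you stored on your previous check):
require 'time'

last_modified = Time.parse(response['Last-Modified'])
last_modified > last_checked_at
# => true if the file has changed since your last check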

As @tadman says in his answer, an HTTP HEAD request is the proper way to check the last modification date.
You can also do it with a conditional GET request, using the "If-*" modifier headers.
Which to use depends on whether you intend to download the page immediately. If you just want the date, use HEAD. If you want the content only when there has been a change, use GET with the "If-*" headers, as sketched below.
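A minimal sketch of the conditional GET with Net::HTTP (the URL and the stored date are placeholders): if the file hasn't changed, the server answers 304 Not Modified with no body.
require 'net/http'

uri = URI('http://stackoverflow.com/robots.txt')
request = Net::HTTP::Get.new(uri)
request['If-Modified-Since'] = 'Sat, 04 Jun 2011 08:51:44 GMT' # date from your last check

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }

case response
when Net::HTTPNotModified then puts 'unchanged since last check'
when Net::HTTPSuccess     then puts response.body # changed, and we already have the new content
end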


Ruby - How can I follow a .php link through a request and get the redirect link?

Firstly, I want to make clear that I am not familiar with Ruby at all.
I'm building a Discord bot in Go as an exercise; the bot fetches UrbanDictionary definitions and sends them to whoever asked in Discord.
However, UD doesn't have an official API, so I'm using this. It's a Heroku app written in Ruby. From what I understood, it scrapes the UD page for the given search.
I want to add a random command to my bot, but the API doesn't support it.
As I see it, that shouldn't be hard, since http://www.urbandictionary.com/random.php just redirects you to a normal definition link on the site. If I can follow the redirect and get the final URL, I can pass it on to the existing scraper and it can handle it just like any other link.
I have no idea how to follow it, though, and I was hoping I could get some pointers or samples.
Here's the "ruby" way using net/http and uri
require 'net/http'
require 'uri'
uri = URI('http://www.urbandictionary.com/random.php')
response = Net::HTTP.get_response(uri)
response['Location']
# => "http://www.urbandictionary.com/define.php?term=water+bong"
Urban Dictionary is using an HTTP redirect (a 302 status code, in this case), so the "new" URL is passed back in an HTTP header (Location). To get a better idea of what the above is doing, here's a way using just curl and a system call:
`curl -I 'http://www.urbandictionary.com/random.php'`. # Get the headers using curl -I
  split("\r\n").                           # Split on line breaks
  find { |header| header =~ /^Location/ }. # Grab the 'Location' header
  split(' ').                              # Split on spaces
  last                                     # The last element is the URL
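If you'd rather stay in Ruby and handle redirects generically, a small helper along these lines should work (a sketch, not production code: the hop limit of 10 is arbitrary, and it assumes the Location header holds an absolute URL, which it does here):
require 'net/http'
require 'uri'

# Follow redirects until we reach a non-redirect response, then return the final URL.
def resolve_redirects(url, limit = 10)
  raise 'too many redirects' if limit.zero?
  response = Net::HTTP.get_response(URI(url))
  return url unless response.is_a?(Net::HTTPRedirection)
  resolve_redirects(response['Location'], limit - 1)
end

resolve_redirects('http://www.urbandictionary.com/random.php')
# => "http://www.urbandictionary.com/define.php?term=..."
The resolved URL can then be handed to the scraper like any other definition link.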

What is the "accept" part for?

When connecting to a website using Net::HTTP you can parse the URL and output each of the request headers by using #each_header. I understand what the encoding, the user agent, and such mean, but not what the "accept"=>["*/*"] part is. Is this the accepted payload? Or is it something else?
require 'net/http'

uri = URI('http://www.bible-history.com/subcat.php?id=2')
# => #<URI::HTTP http://www.bible-history.com/subcat.php?id=2>
http_request = Net::HTTP::Get.new(uri)
http_request.each_header { |header| puts header }
# => {"accept-encoding"=>["gzip;q=1.0,deflate;q=0.6,identity;q=0.3"], "accept"=>["*/*"], "user-agent"=>["Ruby"], "host"=>["www.bible-history.com"]}
From https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#z3
This field contains a semicolon-separated list of representation schemes ( Content-Type metainformation values) which will be accepted in the response to this request.
Basically, it specifies what kinds of content you can read back. If you write an API client, you may only be interested in application/json, for example (and you couldn't care less about text/html).
In this case, your header would look like this:
Accept: application/json
And the app will know not to send any html your way.
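With Net::HTTP you would set it like any other request header; a quick sketch (the URL is made up):
require 'net/http'

uri = URI('http://example.com/api/items')
request = Net::HTTP::Get.new(uri)
request['Accept'] = 'application/json' # we only want JSON back

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }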
Using the Accept header, the client can specify the MIME types it is willing to accept for the requested URL. If the requested resource is available in multiple representations (e.g. an image as PNG, JPG, or SVG), the user agent can specify that it wants the PNG version only. It is up to the server to honor this request.
In your example, the request header specifies that you are willing to accept any content type.
The header is defined in RFC 2616.

How can I access the headers of an incoming request in Tritium?

I would like to be able to add some logic to my Tritium project based on the incoming request header. Is it possible to access the header information and then perform match() with() logic?
My plan is to take an existing URL (that can be accessed via a normal GET request) and give it a second mode of functionality so that it can be turned into an AJAX API. When the JavaScript makes the API request, I could set a custom header flag so that the platform knows to interpret the request differently.
You should be able to access headers in the incoming HTTP request using the global variable syntax. For example, to access the site's hostname:
$host
# => yourwebsite.com
I believe that most of the standard headers are accessible as global variables in Tritium. However, I'm not sure if all headers are accessible as global vars.
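Assuming your custom header is exposed the same way (the $http_x_requested_with name below is a guess based on that naming scheme, not something I've verified), the match()/with() logic you describe would look roughly like this:
match($http_x_requested_with) {
  with(/XMLHttpRequest/) {
    # request came from your JavaScript: respond in API mode
  }
  else() {
    # normal GET request: render the page as usual
  }
}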
Inside your project folder, on your development machine, there should be a tmp folder that contains the HTTP request/response bundles. Each bundle should be timestamped with the request's date and time. I think if you peek inside one of these folders, you should see a bunch of files:
incoming_request
incoming_response
outgoing_request
outgoing_response
And possibly a fifth file. I can't remember if this is still the case in the current version of the platform, but there's a chance you'll find a fifth file containing the global variables that the Tritium server creates to store HTTP request header values. So you can peek inside that file (if it exists) and find out what variable name your HTTP headers are using.
Hope that helps!
I'm late on this one, but I figured I would lend a hand to anyone else who needs help.
You need to create two files in your scripts directory, one called request_main.ts and the other called response_main.ts.
You can then use things such as the parse_headers function, which iterates through the request or response headers, depending on which file you put the code in.
parse_headers() { # iterate over all the incoming/outgoing headers
  log(name())  # log the name of the current header in the iteration
  log(value()) # log the value of the current header in the iteration
}

parse_headers(/Set-Cookie/) { # iterate over the Set-Cookie headers only
  log(this())
}
This will log all of your header names and values. To make modifications, you can then use "setter" functions, which you can read about here:
http://developer.moovweb.com/docs/local/configuration/headers
Good luck.

Can't get updated text file from another server. What is the cause of this?

I am trying to get a frequently updated text file from another server like http://site2.com/state.txt with cURL or PHP's file_get_contents() function.
With both approaches, after a few requests I start getting the previous file instead of the updated one.
If I change the file path, e.g. to http://www.site2.com/state.txt, it fetches the updated file for a while and then starts returning the old content again.
What can I do to get the updated file consistently?
Thanks for your help.
I don't know PHP, nor do I know cURL, but I believe I know a caching issue when I see one.
When the same GET request is seen multiple times (by a browser or an intermediate cache), a cached version may be served instead of the request actually being performed.
Two ways to fix:
Clear your browser cache every time you want to get the updated file. (That's a joke)
When you make your GET request, append some sort of timestamp to the end of your URL.
I would do this in JavaScript.
var url = "http://www.site2.com/state.txt?_=" + Date.now();
So basically I am appending a parameter named '_' to the end of the request whose value is the current timestamp. This makes each URL unique, so a new GET request is performed instead of a cached version being returned.
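The same cache-busting idea works from any language that builds the URL; here it is in Ruby, for comparison (a sketch):
require 'net/http'
require 'uri'

# Append the current time so caches see a "new" URL on every request.
uri = URI("http://www.site2.com/state.txt?_=#{Time.now.to_i}")
content = Net::HTTP.get(uri)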

Caching problem

I wrote a script that runs on a Linux machine. It fetches data from a URL and displays the content on a page.
The problem I am facing is that sometimes, if I refresh the page 4-5 times, it displays the old content and not the latest.
The problem could be a caching proxy that is still serving the old content.
Please tell me what to add to the script so that the caching proxy is bypassed automatically.
You should try using the Cache-Control HTTP header in your request, to tell the proxy (if there is one) not to cache the result.
See RFC 2616 for an explanation.
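For example, if the script fetches the URL with Ruby's Net::HTTP, the headers could be set like this (a sketch; the URL is a placeholder, and you'd adapt it to whatever HTTP client the script actually uses):
require 'net/http'

uri = URI('http://example.com/data')
request = Net::HTTP::Get.new(uri)
request['Cache-Control'] = 'no-cache' # ask intermediaries not to serve a cached copy
request['Pragma'] = 'no-cache'        # the same hint for older HTTP/1.0 proxies

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }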
Take a look here: http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/topic/com.ibm.websphere.express.doc/info/exp/ae/twbs_cookie.html
and set the following HTTP headers:
Expires: a hard-coded GMT date in the past
Last-Modified: the current date in GMT, formatted "EEE, d MMM yyyy HH:mm:ss"
Cache-Control: 'no-store, no-cache, must-revalidate'
Cache-Control: 'post-check=0, pre-check=0'
Pragma: 'no-cache'
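As a sketch, here is how those response headers could be sent from a minimal Rack app in Ruby (the values follow the list above; Time#httpdate produces the "EEE, d MMM yyyy HH:mm:ss GMT" format):
# config.ru -- a minimal Rack app sending the anti-caching headers listed above
require 'time'

run lambda { |env|
  headers = {
    'Content-Type'  => 'text/plain',
    'Expires'       => 'Thu, 01 Jan 1970 00:00:00 GMT', # hard-coded GMT date in the past
    'Last-Modified' => Time.now.httpdate,               # current date in GMT
    'Cache-Control' => 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
    'Pragma'        => 'no-cache'
  }
  [200, headers, ['fresh content']]
}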
